Paul Durrant [Tue, 28 Jan 2020 12:01:56 +0000 (12:01 +0000)]
docs/designs: Add a design document for migration of xenstore data
This patch proposes extra migration data and xenstore protocol
extensions to support non-cooperative live migration of guests.
NOTE: doc/misc/xenstore.txt is also amended to replace the <mfn> term
for the INTRODUCE operation with <gfn>, since this is what
it actually is.
Signed-off-by: Paul Durrant <paul@xen.org>
--- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wl@xen.org>
v8:
- Addressed further comments from Julien
v7:
- Addressed further comments from Julien
- Switched migration records to defined structures instead of tuples
v6:
- Addressed comments from Julien
v5:
- Add QUIESCE
- Make semantics of <index> in GET_DOMAIN_WATCHES more clear
Paul Durrant [Mon, 27 Jan 2020 13:19:30 +0000 (13:19 +0000)]
docs/designs: Add a design document for non-cooperative live migration
It has become apparent to some large cloud providers that the current
model of cooperative migration of guests under Xen is not usable as it
relies on software running inside the guest, which is likely beyond the
provider's control.
This patch introduces a proposal for non-cooperative live migration,
designed not to rely on any guest-side software.
Signed-off-by: Paul Durrant <paul@xen.org>
--- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wl@xen.org>
v8:
- Addressed comments from Julien on v6 that I missed
v6:
- Addressed comments from Julien
v5:
- Note that PV domains are not just expected to co-operate, they are
required to
v4:
- Fix issues raised by Wei
v2:
- Use the term 'non-cooperative' instead of 'transparent'
- Replace 'trust in' with 'reliance on' when referring to guest-side
software
Igor Druzhinin [Thu, 26 Mar 2020 11:49:42 +0000 (12:49 +0100)]
cpu: sync any remaining RCU callbacks before CPU up/down
During a CPU down operation, RCU callbacks are scheduled to finish
off some actions later, as soon as the CPU is fully dead (the same applies
to a CPU up operation in case the error path is taken). If another CPU up
operation is performed on the same CPU within the same grace period, the
RCU callback will later be called on a CPU in a potentially wrong state
(already up again instead of still being down), leading to eventual state
inconsistency and/or a crash.
In order to avoid this, flush RCU callbacks explicitly before starting the
next CPU up/down operation.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Thu, 26 Mar 2020 11:46:48 +0000 (12:46 +0100)]
rcu: add assertions to debug build
Xen's RCU implementation relies on no softirq handling taking place
while being in a RCU critical section. Add ASSERT()s in debug builds
in order to catch any violations.
For that purpose modify rcu_read_[un]lock() to use a dedicated percpu
counter in addition to preempt_[en|dis]able(), as this enables testing
that condition in __do_softirq() (ASSERT_NOT_IN_ATOMIC() is not
usable there due to __cpu_up() calling process_pending_softirqs()
while holding the cpu hotplug lock).
While at it, switch the rcu_read_[un]lock() implementation to static
inline functions instead of macros.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
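A minimal sketch of the scheme described above (illustrative only, not the
actual patch; the percpu variable name is an assumption and the lock
argument of the real rcu_read_[un]lock() interface is omitted):

    DECLARE_PER_CPU(unsigned int, rcu_lock_cnt);

    static inline void rcu_read_lock(void)
    {
        preempt_disable();
        this_cpu(rcu_lock_cnt)++;
    }

    static inline void rcu_read_unlock(void)
    {
        ASSERT(this_cpu(rcu_lock_cnt));
        this_cpu(rcu_lock_cnt)--;
        preempt_enable();
    }

    /* __do_softirq() can then ASSERT(!this_cpu(rcu_lock_cnt)); */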
Juergen Gross [Thu, 26 Mar 2020 11:46:11 +0000 (12:46 +0100)]
rcu: don't process callbacks when holding a rcu_read_lock()
Some keyhandlers are calling process_pending_softirqs() while holding
a rcu_read_lock(). This is wrong, as process_pending_softirqs() might
activate rcu calls which should not happen inside a rcu_read_lock().
Address this by modifying process_pending_softirqs() to not allow rcu
callback processing while a rcu_read_lock() is being held.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
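Sketched shape of the resulting check (illustrative; the helper name
rcu_quiesce_allowed() is an assumption for a predicate that is false
inside an rcu_read_lock() section):

    void process_pending_softirqs(void)
    {
        /* Never run the scheduler from this context. */
        unsigned long ignore_mask = (1UL << SCHEDULE_SOFTIRQ) |
                                    (1UL << SCHED_SLAVE_SOFTIRQ);

        /* Don't process RCU callbacks inside a read-side critical section. */
        if ( !rcu_quiesce_allowed() )
            ignore_mask |= 1UL << RCU_SOFTIRQ;

        __do_softirq(ignore_mask);
    }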
Juergen Gross [Thu, 26 Mar 2020 11:43:23 +0000 (12:43 +0100)]
rcu: don't use stop_machine_run() for rcu_barrier()
Today rcu_barrier() is calling stop_machine_run() to synchronize all
physical cpus in order to ensure all pending rcu calls have finished
when returning.
As stop_machine_run() is using tasklets this requires scheduling of
idle vcpus on all cpus imposing the need to call rcu_barrier() on idle
cpus only in case of core scheduling being active, as otherwise a
scheduling deadlock would occur.
There is no need at all to do the syncing of the cpus in tasklets, as
rcu activity is started in __do_softirq() called whenever softirq
activity is allowed. So rcu_barrier() can easily be modified to use
softirq for synchronization of the cpus no longer requiring any
scheduling activity.
As there already is an rcu softirq, reuse it for the synchronization.
Remove the barrier element from struct rcu_data as it isn't used.
Finally switch rcu_barrier() to return void as it now can never fail.
Partially-based-on-patch-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
When using atomic variables for synchronization, barriers are needed
to ensure proper data serialization. Introduce smp_mb__before_atomic()
and smp_mb__after_atomic(), as in the Linux kernel, for that purpose.
Use the same definitions as in the Linux kernel.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
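For reference, the x86 definitions in Linux boil down to compiler barriers,
because x86 atomic read-modify-write instructions already imply full
ordering (sketch, not necessarily the exact hunk applied here):

    /* Atomic RMW ops are full barriers on x86; only the compiler needs
     * to be stopped from reordering accesses around them. */
    #define smp_mb__before_atomic()    barrier()
    #define smp_mb__after_atomic()     barrier()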
Jan Beulich [Thu, 26 Mar 2020 11:36:30 +0000 (12:36 +0100)]
x86emul: vendor specific SYSENTER/SYSEXIT behavior in long mode
Intel CPUs permit both insns there while AMD ones don't.
While at it also
- drop the ring 0 check from SYSENTER handling - neither Intel's nor
AMD's insn pages have any indication of #GP(0) getting raised when
executed from ring 0, and trying it out in practice also confirms
the check shouldn't be there,
- move SYSENTER segment register writing until after the (in principle
able to fail) MSR reads.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 26 Mar 2020 11:27:36 +0000 (12:27 +0100)]
x86emul: add wrappers to check for AMD-like behavior
These are to aid readability at their use sites, in particular because
we're going to gain more of them.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
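The sort of wrapper meant, as a sketch (the field names reflect my reading
of the emulator context and should be treated as assumptions):

    static inline bool amd_like(const struct x86_emulate_ctxt *ctxt)
    {
        /* Hygon cores are derived from AMD ones, so behave alike. */
        return ctxt->cpuid->x86_vendor &
               (X86_VENDOR_AMD | X86_VENDOR_HYGON);
    }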
Roger Pau Monné [Thu, 26 Mar 2020 11:25:40 +0000 (12:25 +0100)]
x86/nvmx: only update SVI when using Ack on exit
Check whether there's a valid interrupt in VM_EXIT_INTR_INFO in order
to decide whether to update SVI in nvmx_update_apicv. If Ack on exit
is not being used VM_EXIT_INTR_INFO won't have a valid interrupt and
hence SVI shouldn't be updated to signal the interrupt is currently in
service because it won't be Acked.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
The commit is wrong, as the whole point of nvmx_update_apicv is to
update the guest interrupt status field when the Ack on exit VMEXIT
control feature is enabled.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Juergen Gross [Thu, 26 Mar 2020 11:23:59 +0000 (12:23 +0100)]
sched: fix cpu offlining with core scheduling
Offlining a cpu with core scheduling active can result in a hanging
system. The reason is that the scheduling resource and unit of the cpu to
be removed need to be split in order to remove the cpu from its cpupool
and move it to the idle scheduler. In case one of the involved cpus
happens to have received a sched slave event, due to a vcpu formerly
running on that cpu being woken up again, that cpu can enter
sched_wait_rendezvous_in() while its scheduling resource is just about to
be split. It might then wait forever for the other sibling to join, which
will never happen due to the resources already being modified.
This can easily be avoided by:
- resetting the rendezvous counters of the idle unit which is kept
- checking for a new scheduling resource in sched_wait_rendezvous_in()
after reacquiring the scheduling lock and resetting the counters in
that case without scheduling another vcpu
- moving scheduling resource modifications (in schedule_cpu_rm()) and
retrieval (schedule(); sched_slave() is fine already, others are not
critical) into locked regions
Paul Durrant [Tue, 24 Mar 2020 16:40:50 +0000 (17:40 +0100)]
mm: add 'is_special_page' inline function...
... to cover xenheap and PGC_extra pages.
PGC_extra pages are intended to hold data structures that are associated
with a domain and may be mapped by that domain. They should not be treated
as 'normal' guest pages (i.e. RAM or page tables). Hence, in many cases
where code currently tests is_xen_heap_page() it should also check for
the PGC_extra bit in 'count_info'.
This patch therefore defines is_special_page() to cover both cases and
converts tests of is_xen_heap_page() (or open coded tests of PGC_xen_heap)
to is_special_page() where the page is assigned to a domain.
Signed-off-by: Paul Durrant <paul@xen.org> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org>
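The helper is essentially of this shape (sketch derived from the
description above):

    static inline bool is_special_page(const struct page_info *page)
    {
        /* Covers xenheap pages as well as PGC_extra pages. */
        return is_xen_heap_page(page) || (page->count_info & PGC_extra);
    }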
Paul Durrant [Tue, 24 Mar 2020 16:40:09 +0000 (17:40 +0100)]
x86 / ioreq: use a MEMF_no_refcount allocation for server pages...
... now that it is safe to assign them.
This avoids relying on libxl (or whatever toolstack is in use) setting
max_pages up with sufficient 'slop' to allow all necessary ioreq server
pages to be allocated.
Signed-off-by: Paul Durrant <paul@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 24 Mar 2020 16:37:27 +0000 (17:37 +0100)]
mm: keep PGC_extra pages on a separate list
This patch adds a new page_list_head into struct domain to hold PGC_extra
pages. This avoids them getting confused with 'normal' domheap pages where
the domain's page_list is walked.
A new dump loop is also added to dump_pageframe_info() to unconditionally
dump the 'extra page list'.
Signed-off-by: Paul Durrant <paul@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org>
Juergen Gross [Tue, 24 Mar 2020 16:36:44 +0000 (17:36 +0100)]
sched: fix onlining cpu with core scheduling active
When onlining a cpu, cpupool_cpu_add() checks whether all siblings of
the new cpu are free in order to decide whether to add it to cpupool0.
In case the added cpu is not the last sibling to be onlined, this test
is wrong as it only checks that all currently online siblings are free.
The test should also check that the number of siblings has reached the
scheduling granularity of cpupool0.
Pu Wen [Tue, 24 Mar 2020 09:56:22 +0000 (10:56 +0100)]
x86/mce: correct the machine check vendor for Hygon
Currently the xl dmesg output on Hygon platforms will be
"(XEN) CPU0: AMD Fam18h machine check reporting enabled",
which is misleading as AMD does not have family 18h (Hygon
negotiated with AMD to confirm that only Hygon has family 18h).
To correct this, add Hygon machine check type and vendor string.
Signed-off-by: Pu Wen <puwen@hygon.cn> Reviewed-by: Jan Beulich <jbeulich@suse.com>
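Illustrative shape of the reporting fix (variable names are assumptions,
not the exact mce_amd.c hunk):

    printk("CPU%u: %s Fam%xh machine check reporting enabled\n",
           smp_processor_id(),
           c->x86_vendor == X86_VENDOR_HYGON ? "Hygon" : "AMD",
           c->x86);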
David Woodhouse [Tue, 24 Mar 2020 09:43:51 +0000 (10:43 +0100)]
build: add -MP to CFLAGS along with -MMD
This causes gcc (yes, and clang) to emit phony targets for each dependency.
This means that when a header file is deleted, the C files which *used*
to include it will no longer fail to build with bogus out-of-date
dependency errors like this:
make[5]: *** No rule to make target
'/home/dwmw2/git/xen/xen/include/asm/hvm/svm/amd-iommu-proto.h',
needed by 'p2m.o'. Stop.
Based on -MP post-dating -MMD by many years, it is assumed that the
behaviour of -MP isn't the default just out of extreme caution. We're
sufficiently convinced that there are no undue side effects of this.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Thu, 19 Mar 2020 13:54:19 +0000 (13:54 +0000)]
x86/ucode: Rationalise startup and family/model checks
Drop microcode_init_{intel,amd}(), export {intel,amd}_ucode_ops, and use a
switch statement in early_microcode_init() rather than probing each vendor in
turn. This allows the microcode_ops pointer to become local to core.c.
As there are no external users of microcode_ops, there is no need for
collect_cpu_info() to implement sanity checks. Move applicable checks to
early_microcode_init() so they are performed once, rather than repeatedly.
The Intel logic guarding the read of MSR_PLATFORM_ID is contrary to the SDM,
which states that the MSR has been architectural since the Pentium Pro
(06-01-xx), and lists no family/model restrictions in the pseudo-code for
microcode loading. Either way, Xen's 64bit-only nature already makes this
check redundant.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 18 Mar 2020 20:18:21 +0000 (20:18 +0000)]
x86/ucode: Move interface from processor.h to microcode.h
This reduces the complexity of processor.h, particularly the need to include
public/xen.h. Replace processor.h includes with microcode.h in some
sources, and add microcode.h includes in others.
Only 4 of the function declarations are actually called externally. Move the
vendor init declarations to private.h
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Thu, 19 Mar 2020 11:47:48 +0000 (11:47 +0000)]
libxl: make creation of xenstore 'suspend event channel' node optional...
... and, if it is not created, make the top level 'device' node in
xenstore writable by the guest instead.
The purpose and semantics of the suspend event channel node are explained
in xenstore-paths.pandoc [1]. It was originally introduced in xend by
commit 17636f47a474 "Teach xc_save to use event-channel-based domain
suspend if available.". Note that, because, the top-level frontend
'device' node was created writable by the guest in xend, there was no
need to explicitly create the 'suspend-event-channel' node as a writable
node.
However, libxl creates the 'device' node as read-only by the guest and so
explicit creation of the 'suspend-event-channel' node is necessary to make
it usable. This unfortunately has the side-effect of making some old
Windows PV drivers [2] cease to function. This is because they scan the top
level 'device' node, find the 'suspend' node and expect it to contain the
usual sub-nodes describing a PV frontend. When this is found not to be the
case, enumeration ceases and (because the 'suspend' node is observed before
the 'vbd' node) no system disk is enumerated. Windows will then crash with
bugcheck code 0x7B (missing system disk).
This patch adds a boolean 'xend_suspend_evtchn_compat' field into
libxl_create_info and a similarly named option in xl.cfg to set it.
If the value is true then the xenstore node is not created. Instead the
old xend behaviour of making the top level device node writable by the guest
is reinstated. If the value is false (the default) then the current libxl
behaviour persists.
xenstore-paths.pandoc is also modified to say that the suspend event
channel node may not exist and, if it does not exist, then the guest may
create it. A note is also added concerning the writability of the top
level device node.
NOTE: While adding the new LIBXL_HAVE_CREATEINFO_... definition into
libxl.h, this patch corrects the previous stanza which erroneously
implies libxl_domain_create_info is a function.
Signed-off-by: Paul Durrant <paul@xen.org> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Thu, 19 Mar 2020 11:47:47 +0000 (11:47 +0000)]
libxl: create domain 'error' node in xenstore
Several PV drivers (both historically and currently [1]) report errors
by writing text into /local/domain/$DOMID/error. This patch creates the
node in libxl and makes it writable by the domain, and also adds some
text into xenstore-paths.pandoc to state what the node is for.
Due to recent reshuffling of header include paths, mem_sharing no longer
compiles. Fix it by moving the mem_sharing_domain declaration to the
location it is used in.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Igor Druzhinin [Wed, 18 Mar 2020 11:55:54 +0000 (12:55 +0100)]
x86/shim: fix ballooning up the guest
args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 17 Mar 2020 15:20:08 +0000 (16:20 +0100)]
libfdt: fix undefined behaviour in _fdt_splice()
Along the lines of commit d0b3ab0a0f46 ("libfdt: Fix undefined behaviour
in fdt_offset_ptr()"), _fdt_splice() similarly may not use pointer
arithmetic to do overflow checks.
David Gibson [Tue, 17 Mar 2020 15:18:57 +0000 (16:18 +0100)]
libfdt: Fix undefined behaviour in fdt_offset_ptr()
Using pointer arithmetic to generate a pointer outside a known object is,
technically, undefined behaviour in C. Unfortunately, we were using that
in fdt_offset_ptr() to detect overflows.
To fix this we need to do our bounds / overflow checking on the offsets
before constructing pointers from them.
Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[upstream commit d0b3ab0a0f46ac929b4713da46f7fdcd893dd3bd] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
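Simplified sketch of the upstream approach (version-specific checks
omitted): the offsets themselves are range-checked before any pointer is
constructed.

    const void *fdt_offset_ptr(const void *fdt, int offset, unsigned int len)
    {
        unsigned int absoffset = offset + fdt_off_dt_struct(fdt);

        if ( (absoffset < (unsigned int)offset) ||  /* wrapped around? */
             ((absoffset + len) < absoffset) ||     /* wrapped around? */
             (absoffset + len) > fdt_totalsize(fdt) )
            return NULL;

        return (const char *)fdt + fdt_off_dt_struct(fdt) + offset;
    }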
Jan Beulich [Tue, 17 Mar 2020 15:17:20 +0000 (16:17 +0100)]
x86/HVM: reduce hvm.h include dependencies
Drop #include-s not needed by the header itself, and add smaller scope
ones instead. Put the ones needed into whichever other files actually
need them.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 17 Mar 2020 15:16:34 +0000 (16:16 +0100)]
x86/HVM: reduce io.h include dependencies
Drop #include-s not needed by the header itself as well as one include
of the header which isn't needed. Put the one needed into the file
actually requiring it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 17 Mar 2020 15:14:57 +0000 (16:14 +0100)]
x86/HVM: reduce vioapic.h include dependencies
Drop an #include not needed by the header itself. While verifying the
header (now) builds standalone, I noticed an omission in a public header
which gets taken care of here as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 11 Mar 2020 18:22:37 +0000 (18:22 +0000)]
x86/vvmx: Fix deadlock with MSR bitmap merging
c/s c47984aabead "nvmx: implement support for MSR bitmaps" introduced a use of
map_domain_page() which may get used in the middle of context switch.
This is not safe, and causes Xen to deadlock on the mapcache lock:
(XEN) Xen call trace:
(XEN) [<ffff82d08022d6ae>] R _spin_lock+0x34/0x5e
(XEN) [<ffff82d0803219d7>] F map_domain_page+0x250/0x527
(XEN) [<ffff82d080356332>] F do_page_fault+0x420/0x780
(XEN) [<ffff82d08038da3d>] F x86_64/entry.S#handle_exception_saved+0x68/0x94
(XEN) [<ffff82d08031729f>] F __find_next_zero_bit+0x28/0x69
(XEN) [<ffff82d080321a4d>] F map_domain_page+0x2c6/0x527
(XEN) [<ffff82d08029eeb2>] F nvmx_update_exec_control+0x1d7/0x323
(XEN) [<ffff82d080299f5a>] F vmx_update_cpu_exec_control+0x23/0x40
(XEN) [<ffff82d08029a3f7>] F arch/x86/hvm/vmx/vmx.c#vmx_ctxt_switch_from+0xb7/0x121
(XEN) [<ffff82d08031d796>] F arch/x86/domain.c#__context_switch+0x124/0x4a9
(XEN) [<ffff82d080320925>] F context_switch+0x154/0x62c
(XEN) [<ffff82d080252f3e>] F common/sched/core.c#sched_context_switch+0x16a/0x175
(XEN) [<ffff82d080253877>] F common/sched/core.c#schedule+0x2ad/0x2bc
(XEN) [<ffff82d08022cc97>] F common/softirq.c#__do_softirq+0xb7/0xc8
(XEN) [<ffff82d08022cd38>] F do_softirq+0x18/0x1a
(XEN) [<ffff82d0802a2fbb>] F vmx_asm_do_vmentry+0x2b/0x30
Convert the domheap page into being a xenheap page.
Fixes: c47984aabead - nvmx: implement support for MSR bitmaps Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Mon, 16 Mar 2020 16:32:41 +0000 (17:32 +0100)]
x86/APIC: reduce rounding errors in calculations
Dividing by HZ/10 just to subsequently multiply by HZ again in all uses
of the respective variable is pretty pointlessly introducing rounding
(really: truncation) errors. While transforming the respective
expressions it became apparent that "result" would be left unused except
for its use as function return value. As the sole caller of the function
doesn't look at the returned value, simply convert the function to have
"void" return type.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Mar 2020 16:31:35 +0000 (17:31 +0100)]
x86/time: reduce rounding errors in calculations
Plain (unsigned) integer division simply truncates the results. The
overall errors are smaller though if we use proper rounding. (Extend
this to the purely cosmetic aspect of time.c's freq_string(), which
before this change I've frequently observed to report e.g. NN.999MHz
HPET clock speeds.)
While adding the rounding logic, also switch to using an unsigned
constant for the other, original half of bus_cycle's calculation.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
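The difference in a nutshell (illustrative only, not the patch itself):

    static unsigned int hz_to_mhz_trunc(uint64_t hz)
    {
        return hz / 1000000;            /* 24,999,999 Hz reports as 24MHz */
    }

    static unsigned int hz_to_mhz_round(uint64_t hz)
    {
        /* Add half the divisor before dividing to round to nearest. */
        return (hz + 500000) / 1000000; /* 24,999,999 Hz reports as 25MHz */
    }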
Juergen Gross [Mon, 16 Mar 2020 10:27:29 +0000 (11:27 +0100)]
spinlocks: fix placement of preempt_[dis|en]able()
In case Xen ever gains preemption support the spinlock coding's
placement of preempt_disable() and preempt_enable() should be outside
of the locked section.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 16 Mar 2020 10:26:10 +0000 (11:26 +0100)]
x86/APIC: adjust types and comments in calibrate_APIC_clock()
First and foremost the comment talking about potential underflow being
taken care of by using signed long type variables was true only on
32-bit, which we've not been supporting for quite some time. Drop the
comment and change all involved types to unsigned. Take the opportunity
and also replace bus_cycle's fixed width type.
Additionally there's no point using an "arbitrary (but long enough)
timeout" here. Just use the maximum possible value; Linux does so too,
just as an additional data point.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Mar 2020 10:25:45 +0000 (11:25 +0100)]
kconfig: expose all{yes,no}config targets
Without having them at least at the xen/Makefile level they're (close
to?) inaccessible. As I'm uncertain about their utility at the top
level, I'm leaving it at that for now.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Mar 2020 10:24:29 +0000 (11:24 +0100)]
AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers
amd_iommu_get_paging_mode() expects a count, not a "maximum possible"
value. Prior to b4f042236ae0 dropping the reference, the use of our mis-
named "max_page" in amd_iommu_domain_init() may have lead to such a
misunderstanding. In an attempt to avoid such confusion in the future,
rename the function's parameter and - while at it - convert it to an
inline function.
Also replace a literal 4 by an expression tying it to a wider use
constant, just like amd_iommu_quarantine_init() does.
Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain") Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paweł Marczewski [Fri, 13 Mar 2020 11:25:10 +0000 (11:25 +0000)]
libxl: fix cleanup bug in initiate_domain_create()
In case of errors, we immediately call domcreate_complete()
which cleans up the console_xswait object. Make sure it is initialized
before we start cleanup.
Signed-off-by: Paweł Marczewski <pawel@invisiblethingslab.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Roger Pau Monne [Fri, 13 Mar 2020 08:45:58 +0000 (09:45 +0100)]
libfsimage: fix parentheses in macro parameters
VERIFY_DN_TYPE and VERIFY_OS_TYPE should use parentheses when
accessing the type parameter. Note that none of the current usages
require this, it's just done for correctness.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wl@xen.org>
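The general hazard being guarded against (generic illustration, not the
libfsimage macros themselves):

    #define DOUBLE_BAD(x)   (x * 2)    /* DOUBLE_BAD(a + 1)  expands to (a + 1 * 2)   */
    #define DOUBLE_GOOD(x)  ((x) * 2)  /* DOUBLE_GOOD(a + 1) expands to ((a + 1) * 2) */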
Juergen Gross [Wed, 11 Mar 2020 12:17:41 +0000 (13:17 +0100)]
rcu: use rcu softirq for forcing quiescent state
As rcu callbacks are processed in __do_softirq() there is no need to
use the scheduling softirq for forcing quiescent state. Any other
softirq would do the job and the scheduling one is the most expensive.
So use the already existing rcu softirq for that purpose. For telling
apart why the rcu softirq was raised add a flag for the current usage.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 10 Mar 2020 16:06:57 +0000 (17:06 +0100)]
memaccess: reduce include dependencies
The common header doesn't itself need to include public/vm_event.h nor
public/memory.h. Drop their inclusion. This requires using the non-
typedef names in two prototypes and an inline function; by not changing
the callers and function definitions at the same time it'll remain
certain that the build would fail if the typedef itself was changed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Paul Durrant [Tue, 10 Mar 2020 16:06:09 +0000 (17:06 +0100)]
x86 / p2m: replace page_list check in p2m_alloc_table...
... with a check of domain_tot_pages().
The check of page_list prevents the prior allocation of PGC_extra pages,
whereas what the code is trying to verify is that the toolstack has not
already allocated RAM for the domain.
Signed-off-by: Paul Durrant <paul@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 10 Mar 2020 14:38:25 +0000 (15:38 +0100)]
vmevent: reduce include dependencies
There's no need for virtually everything to include public/vm_event.h.
Move its inclusion out of sched.h. This requires using the non-typedef
name in p2m_mem_paging_resume()'s prototype; by not changing the
function definition at the same time it'll remain certain that the build
would fail if the typedef itself was changed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
The current implementation of the hypervisor assisted flush for HAP is
extremely inefficient.
First of all there's no need to call paging_update_cr3, as the only
relevant part of that function when doing a flush is the ASID vCPU
flush, so just call that function directly.
Since hvm_asid_flush_vcpu is protected against concurrent callers by
using atomic operations there's no need anymore to pause the affected
vCPUs.
Finally, the global TLB flush performed by flush_tlb_mask is also not
necessary: since we only want to flush the guest TLB state, it's enough
to trigger a vmexit on the pCPUs currently holding any vCPU state, as
such a vmexit will already perform an ASID/VPID update and thus clear
the guest TLB.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 10 Mar 2020 14:29:24 +0000 (15:29 +0100)]
x86/paging: add TLB flush hook
Add shadow and hap implementation specific helpers to perform guest
TLB flushes. Note that the code for both is exactly the same at the
moment, and is copied from hvm_flush_vcpu_tlb. This will be changed by
further patches that will add implementation specific optimizations to
them.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Paul Durrant <pdurrant@amzn.com> [viridian] Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 10 Mar 2020 14:25:58 +0000 (15:25 +0100)]
AMD/IOMMU: without XT, x2APIC needs to be forced into physical mode
The wider cluster mode APIC IDs aren't generally representable. Convert
the iommu_intremap variable into a tristate, allowing the AMD IOMMU
driver to signal this special restriction to apic_x2apic_probe().
(Note: assignments to the variable get adjusted, while existing
consumers - all assuming a boolean property - are left alone.)
While we are not aware of any hardware/firmware with this as a
restriction, it is a situation which could be created on fully x2apic-
capable systems via firmware settings.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
George Dunlap [Thu, 5 Mar 2020 11:34:07 +0000 (11:34 +0000)]
golang/xenlight: Fix handling of marshalling of empty elements for keyed unions
Keyed types in libxl_types.idl can have elements of type 'None'. The
golang type generator (correctly) don't implement any union types for
these empty elements. However, the toC and fromC helper generators
incorrectly treat these elements as invalid.
Consider, for example, libxl_channelinfo. The idl contains the
following keyed element:
Jan Beulich [Mon, 9 Mar 2020 09:00:26 +0000 (10:00 +0100)]
VT-d: fix and extend RMRR reservation check
First of all in commit d6573bc6e6b7 ("VT-d: check all of an RMRR for
being E820-reserved") along with changing the function used, the enum-
like value passed should have been changed too (to E820_*). Do so now.
(Luckily the actual values of RAM_TYPE_RESERVED and E820_RESERVED
match, so the breakage introduced was "only" latent.)
Furthermore one of my systems surfaces RMRR in an ACPI NVS E820 range.
The purpose of the check is just to make sure there won't be "ordinary"
mappings of these ranges, and domains (including Dom0) won't want to
use the region to e.g. put PCI device BARs there. The two ACPI related
E820 types are good enough for this purpose, so allow them as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Fri, 6 Mar 2020 09:18:13 +0000 (10:18 +0100)]
x86/hvm: allow ASID flush when v != current
Current implementation of hvm_asid_flush_vcpu is not safe to use
unless the target vCPU is either paused or the currently running one,
as it modifies the generation without any locking.
Fix this by using atomic operations when accessing the generation
field, both in hvm_asid_flush_vcpu_asid and other ASID functions. This
allows safely flushing the current ASID generation. Note that, for the
flush to take effect while the vCPU is currently running, a vmexit is
required.
Compilers will normally do such writes and reads as a single
instruction, so the usage of atomic operations is mostly a
safety measure.
Note the same could be achieved by introducing an extra field to
hvm_vcpu_asid that signals hvm_asid_handle_vmenter the need to call
hvm_asid_flush_vcpu on the given vCPU before vmentry, this however
seems unnecessary as hvm_asid_flush_vcpu itself only sets two vCPU
fields to 0, so there's no need to delay this to the vmentry ASID
helper.
This is not a bugfix as no callers that would violate the assumptions
listed in the first paragraph have been found, but a preparatory
change in order to allow remote flushing of HVM vCPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
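Sketch of the core of the change (type and field names reflect my reading
of hvm/asid.c and should be treated as assumptions):

    void hvm_asid_flush_vcpu_asid(struct hvm_vcpu_asid *asid)
    {
        /* Mark the ASID stale; an atomic write keeps this safe even when
         * the target vCPU is currently running on another pCPU. */
        write_atomic(&asid->generation, 0);
    }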
Anthony PERARD [Fri, 6 Mar 2020 09:15:49 +0000 (10:15 +0100)]
build: run targets cscope,tags,.. without Rules.mk
Those targets make use of $(all_sources) which depends on TARGET_ARCH,
so we just need to set TARGET_ARCH earlier and once.
XEN_TARGET_ARCH isn't expected to change during the build, so
TARGET_SUBARCH and TARGET_ARCH aren't going to change either. Set them
once and for all in the Xen root Makefile. This allows running more
targets without Rules.mk.
XEN_TARGET_ARCH is actually changed in arch/x86/boot/build32.mk, but
it doesn't use the TARGET_{,SUB}ARCH variables either, and doesn't use
Rules.mk (it replaces it).
TARGET_{,SUB}ARCH are no longer overridden because that would have
no effect on the values that Rules.mk will use.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Fri, 6 Mar 2020 09:14:33 +0000 (10:14 +0100)]
build: extract clean target from Rules.mk
Most of the code executed by Rules.mk isn't necessary for the clean
target, especially not the CFLAGS. This patch makes running make clean
much faster.
The patch extracts the clean target into a different Makefile,
Makefile.clean.
Since Makefile.clean doesn't want to include Config.mk, we need to
define the variables DEPS_INCLUDE and DEPS in a place common to
Rules.mk and Makefile.clean; this is Kbuild.include. DEPS_RM is only
needed in Makefile.clean so it can be defined there.
Even though Rules.mk includes Config.mk, it includes Kbuild.include
afterwards, so the effective definition of DEPS_INCLUDE is the "xen/"
one, the same one as used by Makefile.clean.
This is inspired by Kbuild, with Makefile.clean partially copied from
Linux v5.4.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Fri, 6 Mar 2020 09:11:23 +0000 (10:11 +0100)]
build: use obj-y += subdir/ instead of subdir-y
This is part of upgrading our build system and import more of Linux's
one.
In Linux, subdir-y in Makefiles is only used to descend into a
subdirectory when there are no objects to build there; Xen doesn't have
that case, as all subdirectories have objects to be included in the
final binary.
To allow the new syntax, the "obj-y" and "subdir-*" calculation in
Rules.mk is changed and partially imported from Linux's Kbuild.
The command used to modify the Makefile was:
sed -i -r 's#^subdir-(.*)#obj-\1/#;' **/Makefile
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Andrew Cooper [Thu, 5 Mar 2020 17:57:37 +0000 (17:57 +0000)]
x86/dom0: Fix build with clang
find_memory() isn't marked as __init, so if it isn't fully inlined, it ends up
tripping:
Error: size of dom0_build.o:.text is 0x0c1
Fixes: 73b47eea21 "x86/dom0: improve PVH initrd and metadata placement" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Julien Grall [Tue, 25 Feb 2020 12:32:49 +0000 (12:32 +0000)]
xen/grant-table: Remove outdated warning in gnttab_grow_table()
One of the warning messages in gnttab_grow_table() refers to a function
that was removed in commit 6425f91c72 "xen/gnttab: Fold grant_table_{create,
set_limits}() into grant_table_init()".
Since that commit, gt->active will be allocated while initializing the
grant table at domain creation. Therefore gt->active will always be
valid.
Rather than replacing the warning by another one, drop the check
completely as we will likely not come back to a semi-initialized world.
Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Mon, 3 Feb 2020 23:57:05 +0000 (23:57 +0000)]
xen/x86: hap: Clean-up and harden hap_enable()
Unlike shadow_enable(), hap_enable() can only be called once during
domain creation and with the mode equal to
PG_external | PG_translate | PG_refcounts.
If it were called twice, then we might have some interesting problems
as the p2m tables would be re-allocated (and therefore all the mappings
would be lost).
Add code to sanity check the mode and that the function is only called
once. Take the opportunity to remove an if checking that PG_translate is set.
Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Jan Beulich <jbeulich@suse.com>
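The added sanity checks amount to something like (sketch, not the exact
hunk):

    /* hap_enable() is only ever called once, at domain creation. */
    ASSERT(!paging_mode_enabled(d));
    ASSERT(mode == (PG_external | PG_translate | PG_refcounts));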
Roger Pau Monné [Thu, 5 Mar 2020 09:43:46 +0000 (10:43 +0100)]
iommu: fix check for autotranslated hardware domain
The current position of the check_hwdom_reqs call is wrong, as there's
an is_iommu_enabled check at the top of the function that will prevent
getting to it on systems without an IOMMU, because the hardware domain
won't have the XEN_DOMCTL_CDF_iommu flag set.
Move the position of the check so it's done before the
is_iommu_enabled one, and thus attempts to create a translated
hardware domain without an IOMMU can be detected.
Fixes: f89f555827a ('remove late (on-demand) construction of IOMMU page tables') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Thu, 5 Mar 2020 09:43:15 +0000 (10:43 +0100)]
x86/dom0: improve PVH initrd and metadata placement
Don't assume there's going to be enough space at the tail of the
loaded kernel and instead try to find a suitable memory area where the
initrd and metadata can be loaded.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Thu, 5 Mar 2020 09:42:18 +0000 (10:42 +0100)]
x86/mm: switch to new APIs in arch_init_memory
The function will map and unmap pages on demand.
Since we now map and unmap Xen PTE pages, we would like to track the
lifetime of mappings so that 1) we do not dereference memory through a
variable after it is unmapped, 2) we do not unmap more than once.
Therefore, we introduce the UNMAP_DOMAIN_PAGE macro to nullify the
variable after unmapping, and ignore NULL.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyxia@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
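One possible shape of the macro described above (sketch):

    /* Unmap (if mapped) and nullify the variable so that a stale pointer
     * can neither be dereferenced nor unmapped twice. */
    #define UNMAP_DOMAIN_PAGE(p) do {   \
        if ( p )                        \
            unmap_domain_page(p);       \
        (p) = NULL;                     \
    } while ( false )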
libxl: wait for console path before firing console_available
If the path doesn't become available after LIBXL_INIT_TIMEOUT
seconds, fail the domain creation.
If we skip the bootloader, the TTY path will be set by xenconsoled.
However, there is no guarantee that this will happen by the time we
want to call the console_available callback, so we have to wait.
Signed-off-by: Paweł Marczewski <pawel@invisiblethingslab.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Julien Grall [Mon, 17 Feb 2020 22:20:34 +0000 (22:20 +0000)]
xen/arm: Workaround clang/armclang support for register allocation
Clang 8.0 (see [1]) and by extension some versions of armclang do
not support register allocation using the rN syntax.
Thankfully, both GCC [2] and clang are able to support the xN syntax for
Arm64. Introduce a new macro ASM_REG() and use it in common code for
register allocation.
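Roughly, the macro's shape (sketch; arm32 keeps the rN syntax while arm64
switches to xN):

    #ifdef CONFIG_ARM_64
    # define ASM_REG(index) asm("x" # index)
    #else
    # define ASM_REG(index) asm("r" # index)
    #endif

    /* Usage: tie a local variable to a fixed register, e.g. for SMC/HVC calls. */
    register unsigned long r0 ASM_REG(0);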
Jan Beulich [Tue, 3 Mar 2020 15:03:13 +0000 (16:03 +0100)]
MAINTAINERS: Paul to co-maintain vendor-independent IOMMU code
Having just a single maintainer is not helpful anywhere, and can be
avoided here quite easily, seeing that Paul has been doing quite a bit
of IOMMU work lately.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org> Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Juergen Gross [Tue, 3 Mar 2020 15:02:32 +0000 (16:02 +0100)]
sched: fix error path in cpupool_unassign_cpu_start()
In case moving all domains away from the cpu to be removed fails
in cpupool_unassign_cpu_start(), the error path misses releasing
sched_res_rculock.
The normal exit path releases domlist_read_lock instead (this is
currently no problem as the reference to the specific rcu lock is not
used by rcu_read_unlock()).
While at it indent the present error label by one space.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Tue, 3 Mar 2020 15:01:30 +0000 (16:01 +0100)]
credit2: avoid NULL deref in csched2_res_pick() when tracing
The issue here results from one of the downsides of using goto: The
early "goto out" and "goto out_up" in the function very clearly bypass
any possible initialization of min_rqd, yet the tracing code at the end
of the function consumes the value. There's even a comment regarding the
trace record not being accurate in this case.
Juergen Gross [Tue, 11 Feb 2020 09:31:22 +0000 (10:31 +0100)]
xen: do live patching only from main idle loop
One of the main design goals of core scheduling is to avoid actions
which are not directly related to the domain currently running on a
given cpu or core. Live patching is one of those actions which are
allowed to take place on a cpu only when the idle scheduling unit is
active on that cpu.
Unfortunately live patching tries to force the cpus into the idle loop
just by raising the schedule softirq, which will no longer be
guaranteed to work with core scheduling active. Additionally there are
still some places in the hypervisor calling check_for_livepatch_work()
without being in the idle loop.
It is easy to force a cpu into the main idle loop by scheduling a
tasklet on it. So switch live patching to use tasklets for switching to
idle and raising scheduling events. Additionally the calls of
check_for_livepatch_work() outside the main idle loop can be dropped.
As tasklets are only running on idle vcpus and stop_machine_run()
is activating tasklets on all cpus but the one it has been called on
to rendezvous, it is mandatory for stop_machine_run() to be called on
an idle vcpu, too, as otherwise there is no way for scheduling to
activate the idle vcpu for the tasklet on the sibling of the cpu
stop_machine_run() has been called on.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>