Jan Beulich [Fri, 1 Sep 2017 16:24:10 +0000 (10:24 -0600)]
domctl/x86: move vMSI related #define-s to public interface
Xen and qemu having identical #define-s (with different names) is a
strong hint that these should be part of the public interface, at the
same time making obvious that any change to their values is an interface
modification (and hence needs suitable care).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Chao Gao [Thu, 31 Aug 2017 05:01:49 +0000 (01:01 -0400)]
xl/libacpi: extend lapic_id() to uint32_t
This patch is to extend lapic_id() to support more vcpus.
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 10 Aug 2017 11:24:28 +0000 (13:24 +0200)]
libxc: increase maximum migration stream record length
Today the maximum record length in a migration stream is 8MB. This
limits the size of a PV domain to a little bit less than 1TB in the
migration case, as the P2M frame list will exceed 8MB in this case.
Raising the record size limit by a factor of 16 allows for domain
sizes of nearly 16TB to be migrated. This ought to be enough.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 1 Sep 2017 09:07:31 +0000 (11:07 +0200)]
x86: mark the entire directmap NX
There's no reason for the first Mb to be excluded here. Enforce the
restriction right in the top level page table entries.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/pvh: remove stale PVHv1 comment from public headers
From the vcpu_guest_context structure. PVHv2 uses it in the same exact
way as HVM guests, and from the hypervisor point of view PVHv2 is not
even a different guest type, so only mention HVM in the public
headers.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Boris Ostrovsky [Fri, 1 Sep 2017 09:06:21 +0000 (11:06 +0200)]
mm: don't request scrubbing until dom0 is running
There is no need to scrub pages freed during dom0 construction since
once dom0 is ready the heap will be scrubbed by scrub_heap_pages() anyway,
setting scrub_debug at the end.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Boris Ostrovsky [Fri, 1 Sep 2017 09:05:45 +0000 (11:05 +0200)]
mm: change boot_scrub_done definition
Rename it to the more appropriate scrub_debug and define as a macro
for !CONFIG_SCRUB_DEBUG. This will allow us to get rid of some
ifdefs (here and in the subsequent patch).
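A minimal sketch of the macro approach described above. Only scrub_debug
and CONFIG_SCRUB_DEBUG come from this message; needs_scrub_check() is a
hypothetical caller added purely for illustration.

```c
#include <stdbool.h>

#ifdef CONFIG_SCRUB_DEBUG
/* Debug builds: a real variable that can be flipped at runtime. */
static bool scrub_debug;
#else
/* Non-debug builds: a compile-time constant, so checks fold away. */
#define scrub_debug false
#endif

/* Hypothetical caller: no #ifdef is needed at the use site. */
bool needs_scrub_check(bool page_was_scrubbed)
{
    return scrub_debug && page_was_scrubbed;
}
```

With the macro in place the compiler can drop the whole check in
non-debug builds, which is what makes the ifdefs at the call sites
unnecessary.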
Suggested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Igor Druzhinin [Fri, 1 Sep 2017 09:03:20 +0000 (11:03 +0200)]
hvmloader, libxl: use the correct ACPI settings depending on device model
We need to choose ACPI tables properly depending on the device
model version we are running. Previously, this decision was
made by BIOS type specific code in hvmloader, e.g. always load
QEMU traditional specific tables if it's ROMBIOS and always
load QEMU Xen specific tables if it's SeaBIOS.
This change preserves this behavior (for compatibility) but adds
an additional way (a xenstore key) to specify the correct
device model if we happen to run a non-default one. The toolstack
side makes use of it.
The enforcement of BIOS type depending on QEMU version will
be lifted later when the rest of ROMBIOS compatibility fixes
are in place.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
When SR-IOV is enabled, 'Virtual Functions' of a 'Physical Function'
are under the scope of the same VT-d unit as the 'Physical Function'.
A 'Physical Function' can be a 'Traditional Function' or an ARI
'Extended Function'. And furthermore, 'Extended Functions' on an
endpoint are under the scope of the same VT-d unit as the 'Traditional
Functions' on the endpoint. To search VT-d unit for a VF, if its PF
isn't an extended function, the BDF of PF should be used. Otherwise
the BDF of a traditional function in the same device with the PF
should be used.
Current code uses PCI_SLOT() to recognize an ARI 'Extended Function'.
But it is conceptually wrong w/o checking whether PF is an extended
function and would lead to match VFs of a RC integrated PF to a wrong
VT-d unit.
This patch overrides VF 'is_extfn' field and uses this field to
indicate whether the PF of this VF is an extended function. The field
helps to use correct BDF to search VT-d unit.
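The selection rule above could be sketched roughly as follows. This is
not the actual VT-d code: struct pci_dev_info, struct bdf and
vtd_lookup_bdf() are hypothetical names, and picking function 0 of
device 0 as "a traditional function in the same device" is an
assumption made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* PCI devfn layout: device in bits 7:3, function in bits 2:0. */
#define PCI_DEVFN(slot, func) \
    ((uint8_t)((((slot) & 0x1f) << 3) | ((func) & 0x07)))

/* Hypothetical, simplified device description. */
struct pci_dev_info {
    bool is_virtfn;           /* this device is an SR-IOV VF */
    bool is_extfn;            /* VF: its PF is an extended function;
                                 otherwise: the device itself is one */
    uint8_t bus, devfn;       /* the device's own address */
    uint8_t pf_bus, pf_devfn; /* valid only when is_virtfn is set */
};

struct bdf { uint8_t bus, devfn; };

/* Pick the BDF to use when searching for the owning VT-d unit. */
struct bdf vtd_lookup_bdf(const struct pci_dev_info *d)
{
    struct bdf r;

    if (d->is_virtfn) {
        r.bus = d->pf_bus;
        /* Extended-function PF: fall back to a traditional function. */
        r.devfn = d->is_extfn ? PCI_DEVFN(0, 0) : d->pf_devfn;
    } else {
        r.bus = d->bus;
        r.devfn = d->is_extfn ? PCI_DEVFN(0, 0) : d->devfn;
    }
    return r;
}
```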
Reported-by: Crawford, Eric R <Eric.R.Crawford@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Crawford, Eric R <Eric.R.Crawford@intel.com>
Yi Sun [Thu, 31 Aug 2017 08:07:26 +0000 (16:07 +0800)]
x86: remove redundant checks in sysctl.c
In sysctl.c, the return value of 'psr_get_info' is checked immediately
after the call, so it is redundant to check it again when copying the
field to the guest.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Sergej Proskurin [Wed, 30 Aug 2017 11:19:14 +0000 (13:19 +0200)]
xen-access: Correct default value of write-to-CR4 switch
The current implementation configures the test environment to always
trap on writes to the CR4 control register, even on ARM. This leads to
issues as calling xc_monitor_write_ctrlreg on ARM with VM_EVENT_X86_CR4
will always fail.
Andrew Cooper [Wed, 30 Aug 2017 13:18:01 +0000 (14:18 +0100)]
x86/mm: Rearrange guest_get_eff_{,kern_}l1e() to not be void
Coverity complains that gl1e.l1 may be used while uninitialised in
map_ldt_shadow_page(). This isn't actually accurate as guest_get_eff_l1e()
will always write to its parameter.
However, having a void function which returns a 64bit value via pointer is
rather silly. Rearrange the functions to return l1_pgentry_t.
No functional change, but hopefully should help Coverity not to come to the
wrong conclusion.
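The shape of the rearrangement, as a self-contained sketch. The
l1_pgentry_t here is a simplified stand-in, and lookup_pte() and the
get_eff_l1e_*() names are hypothetical; the real functions walk the
guest page tables.

```c
#include <stdint.h>

/* Simplified stand-in for l1_pgentry_t: a wrapped 64-bit PTE. */
typedef struct { uint64_t l1; } l1_pgentry_t;

/* Stub lookup so the sketch is self-contained (fake: present bit set). */
uint64_t lookup_pte(unsigned long linear)
{
    return ((uint64_t)linear & ~0xfffULL) | 0x1;
}

/* Before: the result is written through a pointer. */
void get_eff_l1e_old(unsigned long linear, l1_pgentry_t *out)
{
    out->l1 = lookup_pte(linear);
}

/* After: the 64-bit entry is simply returned by value. */
l1_pgentry_t get_eff_l1e_new(unsigned long linear)
{
    return (l1_pgentry_t){ lookup_pte(linear) };
}
```

With the by-value form, a caller's local is initialised at its
declaration, which leaves a static analyser nothing to misjudge.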
Bloat-o-meter also reports a modest improvement:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-71 (-71)
function old new delta
guest_get_eff_l1e 82 75 -7
mmio_ro_do_page_fault 530 514 -16
map_ldt_shadow_page 501 485 -16
ptwr_do_page_fault 615 583 -32
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Dario Faggioli [Wed, 30 Aug 2017 11:06:22 +0000 (12:06 +0100)]
xen: RCU: avoid busy waiting until the end of grace period.
On the CPU where a callback is queued, cpu_is_haltable()
returns false (due to rcu_needs_cpu() being itself true).
That means the CPU would spin inside idle_loop(), continuously
calling do_softirq(), and, in there, continuously checking
rcu_pending(), in a tight loop.
Let's instead allow the CPU to really go idle, but make sure,
by arming a timer, that we periodically check whether the
grace period has come to an end. As the period of the
timer, we pick a value that makes things look like what
happens in Linux, with the periodic tick (as this code
comes from there).
Note that the timer will *only* be armed on CPUs that are
going idle while having queued RCU callbacks. On CPUs that
don't, there won't be any timer, and their sleep won't be
interrupted (and even for CPUs with callbacks, we only
expect a handful of wakeups at most, but that depends on
the system load, as much as on other things).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Wed, 30 Aug 2017 11:06:21 +0000 (12:06 +0100)]
xen: RCU: don't let a CPU with a callback go idle.
If a CPU has a callback queued, it must be ready to invoke
it, as soon as all the other CPUs involved in the grace period
have gone through a quiescent state.
But if we let such CPU go idle, we can't really tell when (if!)
it will realize that it is actually time to invoke the callback.
To solve this problem, a CPU that has a callback queued (and has
already gone through a quiescent state itself) will stay online,
until the grace period ends, and the callback can be invoked.
This is similar to what Linux does, and is the second and last
step for fixing the overly long (or infinite!) grace periods.
The problem, though, is that, within Linux, we have the tick,
so, all that is necessary is to not stop the tick for the CPU
(even if it has gone idle). In Xen, there's no tick, so we must
prevent the CPU from going idle entirely, and let it spin on
rcu_pending(), consuming power and causing overhead.
In this commit, we implement the above, using rcu_needs_cpu(),
in a way similar to how it is used in Linux. This is correct,
useful and not wasteful for CPUs that participate in the grace
period but do not have a callback queued. For the ones that
have callbacks, an optimization that avoids having to spin is
introduced in a subsequent change.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Wed, 30 Aug 2017 11:06:21 +0000 (12:06 +0100)]
xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
Xen is a tickless (micro-)kernel, i.e., when a CPU becomes
idle there is no timer tick that will periodically wake the
CPU up.
OTOH, when we imported RCU from Linux, Linux was (on x86) a
ticking kernel, i.e., there was a periodic timer tick always
running, even on idle CPUs. This was bad for power consumption,
but, for instance, made it easy to monitor the quiescent states
of all the CPUs, and hence tell when RCU grace periods ended.
In Xen, that is impossible, and that's particularly problematic
when the system is very lightly loaded, as some CPUs may never
have the chance to tell the RCU core logic about their quiescence,
and grace periods could extend indefinitely!
This has led, on x86, to long (and unpredictable) delays between
RCU callbacks queueing and their actual invocation. On ARM, we've
even seen infinite grace periods (e.g., complete_domain_destroy()
never being actually invoked!). See here:
The first step for fixing this situation is for RCU to record,
at the beginning of a grace period, which CPUs are already idle.
In fact, being idle, they can't be in the middle of any read-side
critical section, and we don't have to wait for their quiescence.
This is tracked in a cpumask, in a similar way to how it was also
done in Linux (on s390, which was tickless already). It is also
basically the same approach used for making Linux x86 tickless,
in 2.6.21 on (see commit 79bf2bb3 "tick-management: dyntick /
highres functionality").
For correctness, we also add barriers. One is also present in
Linux, (see commit c3f59023, "Fix RCU race in access of nohz_cpu_mask",
although, we change the code comment to something that makes better
sense for us). The other (which is its pair), is put in the newly
introduced function rcu_idle_enter(), right after updating the
cpumask. They prevent races between CPUs going idle during the
beginning of a grace period.
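A single-threaded sketch of the bookkeeping described above. Only
rcu_idle_enter() is named in this message; the other names and the
64-bit cpumask stand-in are assumptions, and the memory barriers are
only marked by comments since nothing here runs concurrently.

```c
#include <stdint.h>

/* One bit per CPU; a stand-in for cpumask_t (at most 64 CPUs here). */
typedef uint64_t cpumask_t;

struct rcu_ctrl {
    cpumask_t idle;    /* CPUs that are currently idle */
    cpumask_t waiting; /* CPUs whose quiescent state is still needed */
};

void rcu_idle_enter(struct rcu_ctrl *rc, unsigned int cpu)
{
    rc->idle |= (cpumask_t)1 << cpu;
    /* A barrier would go here, pairing with the one below. */
}

void rcu_idle_exit(struct rcu_ctrl *rc, unsigned int cpu)
{
    rc->idle &= ~((cpumask_t)1 << cpu);
}

/* Hypothetical helper: snapshot which CPUs the grace period waits for. */
void rcu_start_grace_period(struct rcu_ctrl *rc, cpumask_t online)
{
    /* A barrier would go here, before sampling the idle mask. */
    rc->waiting = online & ~rc->idle;
}
```

CPUs that were idle at the snapshot never appear in the waiting mask,
so the grace period cannot stall on them.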
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Wed, 30 Aug 2017 11:06:20 +0000 (12:06 +0100)]
xen: ARM: suspend the tick (if in use) when going idle.
Since commit 964fae8ac ("cpuidle: suspend/resume scheduler
tick timer during cpu idle state entry/exit"), if a scheduler
has a periodic tick timer, we stop it when going idle.
This, however, is only true for x86. Make it true for ARM as
well.
Dario Faggioli [Wed, 30 Aug 2017 11:06:20 +0000 (12:06 +0100)]
xen: in do_softirq() sample smp_processor_id() only once.
In fact, right now, we read it at every iteration of the loop.
The reason it's done like this is how context switch was handled
on IA64 (see commit ae9bfcdc, "[XEN] Various softirq cleanups" [1]).
However:
1) we don't have IA64 any longer, and all the architectures that
we do support, are ok with sampling once and for all;
2) sampling at every iteration (slightly) affects performance;
3) sampling at every iteration is misleading, as it makes people
believe that it is currently possible that SCHEDULE_SOFTIRQ
moves the execution flow on another CPU (and the comment,
by reinforcing this belief, makes things even worse!).
Therefore, let's:
- do the sampling only once, and remove the comment;
- leave an ASSERT() around, so that, if context switching
logic changes (in current or new arches), we will notice.
[1] Some more (historical) information here:
http://old-list-archives.xenproject.org/archives/html/xen-devel/2006-06/msg01262.html
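A toy version of the resulting loop, assuming a stubbed
smp_processor_id() and a per-CPU pending bitmask. The fake_cpu, pending
and handled variables are hypothetical scaffolding, and assert() stands
in for ASSERT().

```c
#include <assert.h>

unsigned int fake_cpu = 3;
unsigned int smp_processor_id(void) { return fake_cpu; } /* stub */

unsigned int pending[8]; /* per-CPU bitmask of pending softirqs */
unsigned int handled;    /* records which softirqs were "run" */

void do_softirq(void)
{
    unsigned int cpu = smp_processor_id(); /* sampled only once */

    while (pending[cpu]) {
        unsigned int nr = 0;

        /* Find the lowest pending softirq. */
        while (!(pending[cpu] & (1u << nr)))
            nr++;

        pending[cpu] &= ~(1u << nr);
        handled |= 1u << nr; /* stands in for invoking the handler */

        /* Notice if a handler ever moves us to another CPU. */
        assert(cpu == smp_processor_id());
    }
}
```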
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Wed, 23 Aug 2017 17:49:31 +0000 (17:49 +0000)]
x86/pv: map_ldt_shadow_page() cleanup
Switch the return value from int to bool, to match its semantics. Switch its
parameter from a frame offset to a byte offset, which simplifies the sole
caller and allows for an extra sanity check that the fault is within the LDT limit.
Drop the unnecessary gmfn and okay local variables, and correct the gva
parameter to be named linear. Rename l1e to gl1e, and simplify the
construction of the new pte by simply taking (the now validated) gl1e and
ensuring that _PAGE_RW is set.
Calculate the pte to be updated outside of the spinlock, which halves the size
of the critical region.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 23 Aug 2017 17:51:59 +0000 (17:51 +0000)]
x86/pv: Switch {fill,zap}_ro_mpt() to using mfn_t
And update all affected callers. Fix the fill_ro_mpt() prototype to be bool
like its implementation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Boris Ostrovsky [Wed, 30 Aug 2017 09:05:02 +0000 (11:05 +0200)]
mm: don't hold heap lock in alloc_heap_pages() longer than necessary
Once pages are removed from the heap we don't need to hold the heap
lock. It is especially useful to drop it for an unscrubbed buddy since
we will be scrubbing it.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Alexandru Isaila [Wed, 30 Aug 2017 09:04:13 +0000 (11:04 +0200)]
x86/hvm: allow guest_request vm_events coming from userspace
In some introspection usecases, an in-guest agent needs to communicate
with the external introspection agent. An existing mechanism is
HVMOP_guest_request_vm_event, but this is restricted to kernel usecases
like all other hypercalls.
Introduce a mechanism whereby the introspection agent can whitelist the
use of HVMOP_guest_request_vm_event directly from userspace.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Roger Pau Monné [Wed, 30 Aug 2017 09:02:24 +0000 (11:02 +0200)]
x86/pt: add a MSI unmask flag to XEN_DOMCTL_bind_pt_irq
The flag is part of the gflags, and should be used to request the
unmask of a MSI interrupt once it's bound.
This is required for the device model in order to be capable of
binding MSIX interrupts that have the entry mask bit already unset at
bind time. Without this fix the interrupts would be left masked.
Note that this commit introduces a change to the domctl, which
requires a bump of the interface version. This is not done here
because the interface version has already been bumped in this release
cycle.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Andreas Kinzler <hfp@posteo.de> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 28 Aug 2017 15:46:05 +0000 (16:46 +0100)]
x86/pv: Fill all Xen slots in init_guest_l4_table()
There is a bug when using highmem-start= where some L4 directmap slots are not
audited in alloc_l4_table(), and not overwritten by init_guest_l4_table().
As highmem_start is only available in debug builds of the hypervisor, this
does not constitute a security issue.
Ensure that init_guest_l4_table() writes to all of the Xen slots.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Meng Xu [Thu, 3 Aug 2017 02:13:52 +0000 (22:13 -0400)]
xen: rtds: only tickle non-already tickled CPUs
When more than one idle VCPU with the same previously used PCPU
invokes runq_tickle(), they will all tickle that same
PCPU. The tickled PCPU will pick at most one VCPU, i.e., the
highest-priority one, to execute. The other VCPUs will not be
scheduled for a period, even when there is an idle core, making these
VCPUs starve unnecessarily for one period.
Therefore, always make sure that we only tickle PCPUs that have not
been tickled already.
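A rough sketch of the rule, assuming a 64-bit mask as a stand-in for
cpumask_t. tickle_idle_cpu() is a hypothetical helper; the real
runq_tickle() also takes priorities and the previous PCPU into account.

```c
#include <stdint.h>

typedef uint64_t cpumask_t; /* one bit per PCPU, at most 64 here */

/*
 * Tickle the first idle PCPU that has not been tickled yet.
 * Returns its number, or -1 if there is no such PCPU.
 */
int tickle_idle_cpu(cpumask_t idle, cpumask_t *tickled)
{
    cpumask_t candidates = idle & ~*tickled;

    for (int cpu = 0; cpu < 64; cpu++) {
        if (candidates & ((cpumask_t)1 << cpu)) {
            *tickled |= (cpumask_t)1 << cpu; /* remember the tickle */
            return cpu;
        }
    }
    return -1;
}
```

Two VCPUs waking back to back therefore end up on two different idle
PCPUs instead of both poking the same one.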
Daniel Sabogal [Fri, 25 Aug 2017 21:35:47 +0000 (17:35 -0400)]
libxl/arm: Fix build on arm64 + acpi
With musl, the build fails with the following errors:
actypes.h:202:2: error: #error unknown ACPI_MACHINE_WIDTH
#error unknown ACPI_MACHINE_WIDTH
^~~~~
actypes.h:207:9: error: unknown type name ‘acpi_native_uint’
typedef acpi_native_uint acpi_size;
^~~~~~~~~~~~~~~~
actypes.h:617:3: error: unknown type name ‘acpi_io_address’
acpi_io_address pblk_address;
^~~~~~~~~~~~~~~
This likely went undetected with glibc builds since glibc
indirectly pulls __BITS_PER_LONG from the linux headers
through a standard header. For musl, this is not the case.
Instead, use BITS_PER_LONG to fix the build.
Signed-off-by: Daniel Sabogal <dsabogalcc@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 29 Aug 2017 08:50:24 +0000 (09:50 +0100)]
acpi: set correct address of the control/event blocks in the FADT
Commit 149c6b unmasked an issue long present in Xen: the control/event
block addresses provided in the ACPI FADT table were hardcoded to the
V1 version. This was papered over because hvmloader would also always
set HVM_PARAM_ACPI_IOPORTS_LOCATION to 1 regardless of the BIOS
version.
The most notable issue caused by the above bug was that the QEMU
traditional GPE0 block was out of sync: the address provided in the
FADT didn't match the address QEMU was using.
Note that PM1a and TMR worked fine because the V1 address was
hardcoded in the FADT and HVM_PARAM_ACPI_IOPORTS_LOCATION was
unconditionally set to 1 by hvmloader.
Fix this by passing the address of the control/event blocks to
acpi_build_tables, so the values can be properly set in the FADT table
provided to the guest.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Basically, what happens is that runq_tickle() realizes
d0v13 should preempt d2v7, running on cpu 12, as it
has higher credits (10135529 vs. 2619231). It therefore
tickles cpu 12 [1], which, in turn, schedules [2].
But --surprise surprise-- d2v7 has run for less than the
ratelimit interval [3], and hence it is _not_ preempted,
and continues to run. This indeed looks fine. Actually,
this is what ratelimiting is there for. Note, however,
that:
1) we interrupted cpu 12 for nothing;
2) what if, say on cpu 8, there is a vcpu that has:
+ less credit than d0v13 (so d0v13 can well
preempt it),
+ more credit than d2v7 (that's why it was not
selected to be preempted),
+ run for more than the ratelimiting interval
(so it can really be scheduled out)?
With this patch, if we are in case 2), we'd realize
that tickling 12 would be pointless, and we'll continue
looking, eventually finding and tickling 8.
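The selection could be sketched like this. struct running and
pick_cpu_to_tickle() are hypothetical, and the policy of choosing the
lowest-credit preemptible vcpu is a simplification of what the
scheduler really does.

```c
#include <stdint.h>

/* What is currently running on one pcpu (hypothetical layout). */
struct running {
    int cpu;
    int64_t credit;
    int64_t runtime_us; /* how long it has run since being scheduled */
};

/*
 * Pick the pcpu to tickle for a waking vcpu with new_credit.
 * Candidates that ran for less than the ratelimit are skipped, since
 * tickling them would not result in a preemption anyway.
 * Returns the pcpu number, or -1 if nobody can be preempted.
 */
int pick_cpu_to_tickle(const struct running *r, int n,
                       int64_t new_credit, int64_t ratelimit_us)
{
    int best = -1;
    int64_t lowest = new_credit;

    for (int i = 0; i < n; i++) {
        if (r[i].runtime_us < ratelimit_us)
            continue; /* protected by ratelimiting */
        if (r[i].credit < lowest) {
            lowest = r[i].credit;
            best = r[i].cpu;
        }
    }
    return best;
}
```

Fed with the credits from the example above (and made-up runtimes and
a made-up credit value for the vcpu on cpu 8), the function skips cpu
12 and returns cpu 8.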
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Dario Faggioli [Tue, 29 Aug 2017 09:18:52 +0000 (10:18 +0100)]
xen: credit2: optimize runq_candidate() a little bit
By factoring into one (at the top) all the checks
to see whether current is the idle vcpu, and marking
it as unlikely().
In fact, if current is idle, all the logic for
dealing with yielding, context switching rate
limiting and soft-affinity, is just pure overhead,
and we better rush checking the runq and pick some
vcpu up.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 29 Aug 2017 09:18:51 +0000 (10:18 +0100)]
xen: credit2: kick away vcpus not running within their soft-affinity
If, during scheduling, we realize that the current vcpu
is running outside of its own soft-affinity, it would be
preferable to send it somewhere else.
Of course, that may not be possible, and if we're too
strict, we risk having vcpus sit in runqueues, even if
there are idle pcpus (violating work-conservingness).
In fact, what if there are no pcpus, from the soft
affinity mask of the vcpu in question, where it can
run?
To make sure we don't fall in the above described trap,
only actually de-schedule the vcpu if there are idle and
not already tickled cpus from its soft affinity where it
can run immediately.
If there is (at least) one such cpu, we let current
be preempted, so that csched2_context_saved() will put
it back in the runq, and runq_tickle() will wake (one
of) the cpus.
If there is not even one, we let current run where it is,
as running outside its soft-affinity is still better than
not running at all.
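The decision boils down to a mask test, sketched here under the
assumption of a 64-bit stand-in for cpumask_t. should_kick_current() is
a hypothetical name, not a function of the scheduler.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t; /* one bit per pcpu, at most 64 here */

/*
 * Should the vcpu running on cur_cpu be de-scheduled so that it can
 * move inside its soft-affinity?
 */
bool should_kick_current(unsigned int cur_cpu, cpumask_t hard,
                         cpumask_t soft, cpumask_t idle,
                         cpumask_t tickled)
{
    /* Already running within its soft-affinity: nothing to do. */
    if (soft & ((cpumask_t)1 << cur_cpu))
        return false;

    /* Only kick if it could start running somewhere right away. */
    return (soft & hard & idle & ~tickled) != 0;
}
```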
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 29 Aug 2017 09:18:51 +0000 (10:18 +0100)]
xen: credit2: soft-affinity awareness in csched2_cpu_pick()
We want to find the runqueue with the least average load,
and to do that, we scan through all the runqueues.
It is, therefore, enough that, during such scan:
- we identify the runqueue with the least load, among
the ones that have pcpus that are part of the soft
affinity of the vcpu we're calling pick on;
- we identify the same, but for hard affinity.
At this point, we can decide whether to go for the
runqueue with the least load among the ones with some
soft-affinity, or overall.
Therefore, at the price of some code reshuffling, we
can avoid the loop.
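A sketch of the single scan, tracking both minima at once. struct
runqueue and pick_runqueue() are hypothetical, and always preferring
the soft-affinity candidate when one exists is an assumption made for
brevity; the final decision in the scheduler may weigh more than that.

```c
#include <stdint.h>

typedef uint64_t cpumask_t; /* one bit per pcpu, at most 64 here */

struct runqueue {
    cpumask_t cpus;  /* pcpus belonging to this runqueue */
    int64_t avgload; /* average load */
};

/* Returns the index of the chosen runqueue, or -1 if none is usable. */
int pick_runqueue(const struct runqueue *rq, int n,
                  cpumask_t hard, cpumask_t soft)
{
    int best_hard = -1, best_soft = -1;

    for (int i = 0; i < n; i++) {
        if (!(rq[i].cpus & hard))
            continue; /* the vcpu cannot run here at all */

        if (best_hard < 0 || rq[i].avgload < rq[best_hard].avgload)
            best_hard = i;

        if ((rq[i].cpus & hard & soft) &&
            (best_soft < 0 || rq[i].avgload < rq[best_soft].avgload))
            best_soft = i;
    }

    /* Assumption: take the soft-affinity minimum whenever there is one. */
    return best_soft >= 0 ? best_soft : best_hard;
}
```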
(Also, kill a spurious ';' in the definition of MAX_LOAD.)
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 29 Aug 2017 09:18:50 +0000 (10:18 +0100)]
xen: credit2: soft-affinity awareness in get_fallback_cpu()
By, basically, moving all the logic of the function
inside the usual two steps (soft-affinity step and
hard-affinity step) loop.
While there, add two performance counters (in cpu_pick
and in get_fallback_cpu() itself), in order to be able
to tell how frequently it happens that we need to look
for a fallback cpu.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Tue, 29 Aug 2017 09:18:49 +0000 (10:18 +0100)]
xen/credit2: soft-affinity awareness in runq_tickle()
Soft-affinity support is usually implemented by means
of a two step "balancing loop", where:
- during the first step, we consider soft-affinity
(if the vcpu has one);
- during the second (if we get to it), we consider
hard-affinity.
In runq_tickle(), we need to do that for checking
whether we can execute the waking vCPU on a pCPU
that is idle. In fact, we want to be sure that, if
there is an idle pCPU in the vCPU's soft affinity,
we'll use it.
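The two step loop for the idle case, as a minimal sketch. The 64-bit
mask stands in for cpumask_t, and find_idle_cpu() and the STEP_*
constants are hypothetical names.

```c
#include <stdint.h>

typedef uint64_t cpumask_t; /* one bit per pCPU, at most 64 here */

enum { STEP_SOFT, STEP_HARD };

/* Returns an idle pCPU for the waking vCPU, or -1 if there is none. */
int find_idle_cpu(cpumask_t hard, cpumask_t soft, cpumask_t idle)
{
    for (int step = STEP_SOFT; step <= STEP_HARD; step++) {
        /* Soft-affinity only counts within the hard-affinity. */
        cpumask_t mask = (step == STEP_SOFT) ? (soft & hard) : hard;
        cpumask_t candidates = mask & idle;

        for (int cpu = 0; cpu < 64; cpu++)
            if (candidates & ((cpumask_t)1 << cpu))
                return cpu;
    }
    return -1;
}
```

An idle pCPU inside the soft-affinity always wins over an idle one
that is only in the hard-affinity, because the soft step runs first.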
If there are no such idle pCPUs, though, and we
have to check non-idle ones, we can avoid the loop
and do both hard and soft-affinity in one pass.
In fact, we can scan the runqueue and compute a
"score" for each vCPU which is running on each pCPU.
The idea is, since we may have to preempt someone:
- try to make sure that the waking vCPU will run
inside its soft-affinity,
- try to preempt someone that is running outside
of its own soft-affinity.
The value of the score is added to a trace record,
so xenalyze's code and tools/xentrace/formats are
updated accordingly.
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Juergen Gross [Mon, 28 Aug 2017 14:49:30 +0000 (16:49 +0200)]
xen: fix boolean parameter handling
Commit 63e8a1e5ffa7a7fdbde887805f673fea7e8d2e94 ("xen: check parameter
validity when parsing command line") introduced a bug for the case
when a boolean parameter was specified by its keyword only (no value).
It would set just the wrong boolean value for that parameter.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Mon, 28 Aug 2017 07:35:00 +0000 (09:35 +0200)]
xen: add hypercall for setting parameters at runtime
Add a sysctl hypercall to support setting parameters similar to
command line parameters, but at runtime. The parameters to set are
specified as a string, just like the boot parameters.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Mon, 28 Aug 2017 07:35:00 +0000 (09:35 +0200)]
xen: add basic support for runtime parameter changing
Add the needed infrastructure for runtime parameter changing similar
to that used at boot time via cmdline. We are using the same parsing
functions as for cmdline parsing, but with a different array of
parameter definitions.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Mon, 28 Aug 2017 07:35:00 +0000 (09:35 +0200)]
xen: carve out a generic parsing function from _cmdline_parse()
In order to support generic parameter parsing carve out the parser from
_cmdline_parse(). As this generic function might be called after boot
remove the __init annotations from all called sub-functions.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
With _cmdline_parse() now issuing error messages in case of illegal
parameters signalled by parsing functions specified in custom_param()
the message issued by parse_credit2_runqueue() can be removed.
With _cmdline_parse() now issuing error messages in case of illegal
parameters signalled by parsing functions specified in custom_param()
the message issued by setup_ioapic_ack() can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com>
With _cmdline_parse() now issuing error messages in case of illegal
parameters signalled by parsing functions specified in custom_param()
the message issued by parse_viridian_version() can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
With _cmdline_parse() now issuing error messages in case of illegal
parameters signalled by parsing functions specified in custom_param()
the message issued by mce_set_verbosity() can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com>
With _cmdline_parse() now issuing error messages in case of illegal
parameters signalled by parsing functions specified in custom_param()
the message issued by apic_set_verbosity() can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Mon, 28 Aug 2017 07:34:00 +0000 (09:34 +0200)]
xen: check parameter validity when parsing command line
Where possible check validity of parameters in _cmdline_parse() and
issue a warning message in case of an error detected.
In order to make sure a custom parameter parsing function really
returns a value (error or success), don't use a void pointer for
storing the function address, but a proper typed function pointer.
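Why the typed pointer matters, in a small sketch. This struct
kernel_param is a simplified stand-in for the real one, and
parse_example() and parse_one() are hypothetical. Plain -1 is used
where real code would return an errno value.

```c
#include <stddef.h>
#include <string.h>

struct kernel_param {
    const char *name;
    /* Typed: a handler must return 0 on success or an error code. */
    int (*func)(const char *value);
};

/* Example handler: accepts only "fast" or "slow". */
int parse_example(const char *s)
{
    return (strcmp(s, "fast") && strcmp(s, "slow")) ? -1 : 0;
}

/* Look up name and run its handler; unknown names are errors too. */
int parse_one(const struct kernel_param *params, size_t n,
              const char *name, const char *value)
{
    for (size_t i = 0; i < n; i++)
        if (!strcmp(params[i].name, name))
            return params[i].func(value);
    return -1;
}
```

Had func been a void pointer, a handler returning void could be stored
without any diagnostic; with the typed pointer the compiler complains
about the incompatible function type.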
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Xiong Zhang [Mon, 28 Aug 2017 08:51:24 +0000 (10:51 +0200)]
hvmloader: use base instead of pci_mem_start for find_next_rmrr()
find_next_rmrr(base) is used to find the lowest RMRR ending above base
but below 4G. The current method doesn't cover the following situation:
a. two rmrr exist, small gap between them
b. pci_mem_start and mem_resource.base is below the first rmrr.base
c. find_next_rmrr(pci_mem_start) will find the first rmrr
d. After aligning mem_resource.base to bar size,
first_rmrr.end < new_base < second_rmrr.base and
new_base + bar_sz > second_rmrr.base.
So the new bar will overlap with the second rmrr and doesn't overlap
with the first rmrr.
But next_rmrr points to the first rmrr, so check_overlap() can't
find the overlap, and a wrong address ends up being assigned to the bar.
This patch uses the aligned new base to find the next rmrr, which fixes
the above case and finds all the rmrrs overlapping with the new base.
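The situation can be reproduced with a small model. struct range with
half-open [start, end) bounds and overlaps() are assumptions of this
sketch, and the addresses used with it below are made up; only the
find_next_rmrr() name and its contract come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, end; }; /* half-open: [start, end) */

/* Index of the lowest range ending above base and below 4G, or -1. */
int find_next_rmrr(const struct range *r, size_t n, uint64_t base)
{
    int best = -1;

    for (size_t i = 0; i < n; i++)
        if (r[i].end > base && r[i].end <= (1ULL << 32) &&
            (best < 0 || r[i].end < r[best].end))
            best = (int)i;
    return best;
}

/* Does [base, base + size) intersect r? */
int overlaps(const struct range *r, uint64_t base, uint64_t size)
{
    return base < r->end && base + size > r->start;
}
```

Take two rmrrs at [0xE0010000, 0xE0020000) and [0xE0180000,
0xE0190000), pci_mem_start at 0xE0000000, and a 1MB bar whose aligned
base is 0xE0100000. Searching from pci_mem_start yields the first rmrr,
which the bar does not overlap; searching from the aligned base yields
the second one, which it does.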
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 28 Aug 2017 08:50:29 +0000 (10:50 +0200)]
x86/EFI: warn about r/o sections requiring relocations
EFI implementations may write-protect r/o sections, but we need to
apply relocations. Eliminate the one present case of a r/o section
with relocations (.init.text, which is now being combined with
.init.data into just .init).
Also correct a few other format strings (to account for the possibly
missing NUL in section names) in mkreloc.c.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 28 Aug 2017 08:48:55 +0000 (10:48 +0200)]
passthrough: give XEN_DOMCTL_test_assign_device more sane semantics
So far callers of the libxc interface passed in a domain ID which was
then ignored in the hypervisor. Instead, make the hypervisor honor it
(accepting DOMID_INVALID to obtain original behavior), allowing to
query whether a device can be assigned to a particular domain. Do this
by folding the assign and test-assign paths.
Drop XSM's test_assign_{,dt}device hooks as no longer being
individually useful.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Fri, 25 Aug 2017 16:11:25 +0000 (18:11 +0200)]
xen: fix parse_bool() with empty string
parse_bool() should return -1 in case it is called with an empty
string. In order to allow boolean parameters in the cmdline without
specifying a value this case must be handled in _cmdline_parse() by
always passing a value string.
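A sketch of the intended split of responsibilities. This parse_bool()
is a simplified re-implementation and the exact set of accepted words
is an assumption; parse_bool_param() is a hypothetical stand-in for the
relevant part of _cmdline_parse().

```c
#include <string.h>

/* 1 = true, 0 = false, -1 = not a boolean (including ""). */
int parse_bool(const char *s)
{
    static const char *const yes[] = { "yes", "true", "on", "enable", "1" };
    static const char *const no[] = { "no", "false", "off", "disable", "0" };

    for (size_t i = 0; i < sizeof(yes) / sizeof(yes[0]); i++)
        if (!strcmp(s, yes[i]))
            return 1;
    for (size_t i = 0; i < sizeof(no) / sizeof(no[0]); i++)
        if (!strcmp(s, no[i]))
            return 0;
    return -1;
}

/*
 * Caller side: value is NULL when the option was given as a bare
 * keyword with no "=...". Substitute an explicit value so that
 * parse_bool() never has to guess what an empty string means.
 */
int parse_bool_param(const char *value)
{
    return parse_bool(value ? value : "1");
}
```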
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 23 Aug 2017 18:01:02 +0000 (19:01 +0100)]
x86/mm: Introduce and use l?e_{get,from}_mfn()
This avoids the explicit boxing/unboxing of mfn_t in relevant codepaths.
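What the new helpers save, in a simplified sketch. The types below are
stand-ins rather than the real definitions, and the PTE layout is
reduced to "frame number shifted by PAGE_SHIFT, flags in the low bits".

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Type-safe wrapper around a machine frame number. */
typedef struct { unsigned long m; } mfn_t;
static inline mfn_t _mfn(unsigned long m) { return (mfn_t){ m }; }
static inline unsigned long mfn_x(mfn_t m) { return m.m; }

typedef struct { uint64_t l1; } l1_pgentry_t;

l1_pgentry_t l1e_from_pfn(unsigned long pfn, unsigned int flags)
{
    return (l1_pgentry_t){ ((uint64_t)pfn << PAGE_SHIFT) | flags };
}

/* New style: callers pass and receive mfn_t without unboxing it. */
l1_pgentry_t l1e_from_mfn(mfn_t mfn, unsigned int flags)
{
    return l1e_from_pfn(mfn_x(mfn), flags);
}

mfn_t l1e_get_mfn(l1_pgentry_t e)
{
    /* Keep address bits 51:12 only. */
    return _mfn((unsigned long)((e.l1 & 0x000ffffffffff000ULL) >> PAGE_SHIFT));
}
```

A caller holding an mfn_t writes l1e_from_mfn(mfn, flags) instead of
l1e_from_pfn(mfn_x(mfn), flags).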
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Wed, 23 Aug 2017 18:01:02 +0000 (19:01 +0100)]
x86/mm: Replace opencoded forms of map_l?t_from_l?e()
No functional change (confirmed by diffing the disassembly).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Wed, 23 Aug 2017 18:01:02 +0000 (19:01 +0100)]
x86/mm: Replace opencoded forms of l?e_{get,from}_page()
No functional change (confirmed by diffing the disassembly).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Wed, 23 Aug 2017 16:47:42 +0000 (16:47 +0000)]
x86/pv: Minor improvements to guest_get_eff_{,kern}_l1e()
* These functions work in terms of linear addresses, not virtual addresses.
Update the comments and parameter names.
* Drop unnecessary inlines.
* Drop vcpu parameter from guest_get_eff_kern_l1e(). Its sole caller passes
current, and its callee strictly operates on current.
* Switch guest_get_eff_kern_l1e()'s parameter from void * to l1_pgentry_t *.
Both its caller and callee already use the correct type.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Anthony PERARD [Fri, 25 Aug 2017 14:42:01 +0000 (16:42 +0200)]
x86/vlapic: apply change to TDCR right away to the timer
The description in the Intel SDM of how the divide configuration
register is used: "The APIC timer frequency will be the processor's bus
clock or core crystal clock frequency divided by the value specified in
the divide configuration register."
Observation of baremetal has shown that when the TDCR is changed, the TMCCT
does not change or make a big jump in value, but the rate at which it
counts down changes.
The patch updates the emulation of the APIC timer so that a change to the
divide configuration is reflected in the value of the counter and in
when the next interrupt is triggered.
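A model of the behaviour, not the vlapic code itself. The divisor
decoding follows the SDM encoding of TDCR bits 0, 1 and 3; struct
apic_timer and the timer_*() helpers are hypothetical, time is counted
in bus clock ticks, and only a one-shot countdown is modelled.

```c
#include <stdint.h>

/* Decode TDCR bits 0, 1 and 3 into the divisor (1, 2, 4, ... 128). */
unsigned int tdcr_to_divisor(uint32_t tdcr)
{
    unsigned int v = (tdcr & 3) | ((tdcr & 8) >> 1);

    return 1u << ((v + 1) & 7);
}

struct apic_timer {
    uint32_t count;       /* TMCCT value as of last_update */
    uint64_t last_update; /* in bus clock ticks */
    unsigned int divisor;
};

/* Current TMCCT: counts down once every divisor bus ticks. */
uint32_t timer_current_count(const struct apic_timer *t, uint64_t now)
{
    uint64_t elapsed = (now - t->last_update) / t->divisor;

    return elapsed >= t->count ? 0 : t->count - (uint32_t)elapsed;
}

/* TDCR write: keep the counter value, change only the rate. */
void timer_write_tdcr(struct apic_timer *t, uint32_t tdcr, uint64_t now)
{
    t->count = timer_current_count(t, now);
    t->last_update = now;
    t->divisor = tdcr_to_divisor(tdcr);
}
```

Snapshotting the counter before switching the divisor is what keeps
TMCCT from jumping at the moment of the write.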
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Fri, 25 Aug 2017 14:41:37 +0000 (16:41 +0200)]
x86/vlapic: keep timer running when switching between one-shot and periodic mode
If we take TSC-deadline mode timer out of the picture, the Intel SDM
does not say that the timer is disabled when the timer mode is changed,
either from one-shot to periodic or vice versa.
After this patch, the timer is no longer disarmed on change of mode, so
the counter (TMCCT) keeps counting down.
So what does a write to LVTT change? On baremetal, the change of mode
is probably taken into account only when the counter reaches 0. When this
happens, LVTT is used to figure out whether the counter should restart counting
down from TMICT (periodic mode) or stop counting (one-shot mode).
This also means that if the counter reaches 0 and the mode is one-shot, a
change to periodic would not restart the timer. This is achieved by
setting vlapic->timer_last_update=0.
This patch is based on observation of the behavior of the APIC timer on
baremetal, as well as on checking that it does not go against the description
written in the Intel SDM.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>