Dario Faggioli [Thu, 17 Aug 2017 11:05:57 +0000 (13:05 +0200)]
libxl/xl: allow to get and set cap on Credit2.
Note that a cap is considered valid only if
it is within the [1, 100 * nr_vcpus]% interval.
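A minimal sketch of such a validity check, reading the upper bound as 100% per vCPU (which is how the "check that cap is below 100*nr_vcpus" note in the hypervisor-side patch reads). cap_is_valid() is a hypothetical helper used only for illustration, not a libxl or xl function.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of libxl. The cap is a percentage of one
 * physical CPU, so a domain with nr_vcpus vCPUs can use at most
 * 100 * nr_vcpus percent.
 */
bool cap_is_valid(unsigned int cap, unsigned int nr_vcpus)
{
    assert(nr_vcpus > 0);   /* a domain always has at least one vCPU */

    return cap >= 1 && cap <= 100 * nr_vcpus;
}
```

With two vCPUs, for example, 150 passes the check and 201 does not.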
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
--- Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Dario Faggioli [Thu, 17 Aug 2017 11:03:52 +0000 (13:03 +0200)]
xen: credit2: improve distribution of budget (for domains with caps)
Instead of letting the vCPU that first tries to get
some budget take it all (although temporarily), allow each
vCPU to only get a specific quota of the total budget.
This improves fairness, allows for more parallelism, and
prevents vCPUs from not being able to get any budget (e.g.,
because some other vCPU always comes before and gets it all)
for one or more periods, and hence starve (and cause troubles
in guest kernels, such as livelocks, triggering of watchdogs,
etc.).
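The idea can be sketched as follows. This is only an illustration: the structure, the field names and vcpu_grab_quota() are hypothetical, and the quota is assumed here to be an equal share of the total budget, which the message above does not state.

```c
#include <assert.h>
#include <stdint.h>

/* All names are hypothetical; budget is in arbitrary time units. */
struct dom_budget {
    int64_t remaining;      /* budget left in the current period */
    int64_t total;          /* full budget granted at each replenishment */
    unsigned int nr_vcpus;  /* vCPUs sharing the budget */
};

/* Hand out at most one quota (assumed: total / nr_vcpus) to the caller. */
int64_t vcpu_grab_quota(struct dom_budget *d)
{
    int64_t quota, got;

    assert(d->nr_vcpus > 0);

    if (d->remaining <= 0)
        return 0;           /* nothing left: the vCPU has to wait */

    quota = d->total / d->nr_vcpus;
    got = d->remaining < quota ? d->remaining : quota;
    d->remaining -= got;

    return got;
}
```

With a budget of 100 and 4 vCPUs, each of the first four callers gets 25, so no single vCPU can drain the whole budget ahead of its siblings.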
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
--- Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changes from v1:
- typos;
- spurious hunk moved to previous patch.
Dario Faggioli [Thu, 17 Aug 2017 11:03:16 +0000 (13:03 +0200)]
xen: credit2: allow to set and get utilization cap
As cap is already present in Credit1, as a parameter, all
the wiring is there already for it to percolate down
to csched2_dom_cntl() too.
In this commit, we actually deal with it, and implement
setting, changing or disabling the cap of a domain.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
--- Cc: George Dunlap <george.dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changes from v1:
- check that cap is below 100*nr_vcpus;
- do multiplication first when computing the domain's budget, given the cap;
- when disabling cap, take the budget lock for manipulating the list of
parked vCPUs. Things would have been safe without it, but it's just
more linear, more robust and more future-proof, to "do things properly".
Dario Faggioli [Thu, 17 Aug 2017 10:21:27 +0000 (12:21 +0200)]
xen: credit2: implement utilization cap
This commit implements the Xen part of the cap mechanism for
Credit2.
A cap is how much, in terms of % of physical CPU time, a domain
can execute at most.
For instance, a domain that must not use more than 1/4 of
one physical CPU, must have a cap of 25%; one that must not
use more than 1+1/2 of physical CPU time, must be given a cap
of 150%.
Caps are per domain, so it is all a domain's vCPUs, cumulatively,
that will be forced to execute no more than the decided amount.
This is implemented by giving each domain a 'budget', and
using a (per-domain again) periodic timer. Values of budget
and 'period' are chosen so that budget/period is equal to the
cap itself.
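As a worked illustration of the budget/period relation: budget_for_cap() is a hypothetical helper, and the 10 ms period used in the usage note is just an example value, not necessarily what the scheduler uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: cap is in percent of one physical CPU, the period
 * is in nanoseconds. The result satisfies budget / period == cap / 100,
 * up to integer rounding.
 */
uint64_t budget_for_cap(uint64_t period_ns, unsigned int cap)
{
    /* Guard against overflow of the multiplication below. */
    assert(cap == 0 || period_ns <= UINT64_MAX / cap);

    /* Multiply first, then divide, to avoid losing precision. */
    return period_ns * cap / 100;
}
```

With a 10 ms period, a cap of 25% gives a budget of 2.5 ms per period, and a cap of 150% gives 15 ms, which only several vCPUs running in parallel can consume.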
Budget is burned by the domain's vCPUs, in a similar way to
how credits are.
When a domain runs out of budget, its vCPUs can't run any
longer. They can run again once the budget is replenished by
the timer, which happens once every period.
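A simplified model of the replenishment step, under the behaviour described in this commit's changelog notes (budget is not accumulated across periods, and an overrun is paid back one replenishment at a time). The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-domain state; budget may go negative on overrun. */
struct capped_dom {
    int64_t budget;      /* what is left in the current period */
    int64_t tot_budget;  /* full budget for one period */
};

/* Called once per period, from the per-domain timer. */
void replenish_once(struct capped_dom *d)
{
    assert(d->tot_budget > 0);

    /* A negative budget (overrun) is paid back before anything else. */
    d->budget += d->tot_budget;

    /* Never go above the full budget: no accumulation across periods. */
    if (d->budget > d->tot_budget)
        d->budget = d->tot_budget;
}

/* The domain's vCPUs may run only while there is budget left. */
bool dom_can_run(const struct capped_dom *d)
{
    return d->budget > 0;
}
```

After a large overrun (say a budget of -25 with a full budget of 10), a single replenishment leaves the domain at -15, still unable to run, and later periods bring it back above zero.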
Blocking the vCPUs because of lack of budget happens by
means of a new (_VPF_parked) pause flag, so that, e.g.,
vcpu_runnable() still works. This is similar to what is
done in sched_rtds.c, as opposed to what happens in
sched_credit.c, where vcpu_pause() and vcpu_unpause() are
used (which means, among other things, more overhead).
Note that, while adding new fields to csched2_vcpu and
csched2_dom, currently existing members are being moved
around, to achieve best placement inside cache lines.
Note also that xenalyze and tools/xentrace/formats are being
updated too.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
--- Cc: George Dunlap <george.dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changed from v1:
* used has_cap() instead of open coding it in burn_credits();
* removed some of the unlikely() around has_cap(), as, although cap is not on
by default, it's up to the user to decide how many domains will have caps,
and we can't assume much about what users will actually do;
* tried to clarify the comment about (the non deterministic nature of the)
CPU capacity distribution between the vCPUs of a multi-vCPUs guest;
* clarified the comment about budget being replenished to nothing more than
top capacity, i.e., about the fact that budget is *not* being accumulated
across different periods;
* fixed many style and typo issues in comments;
* added a comment about the budget distribution logic (to the vCPUs) being
subject to be refined in subsequent commits;
* renaming:
vcpu_try_to_get_budget() --> vcpu_grab_budget()
vcpu_give_back_budget() --> vcpu_return_budget()
repl_sdom_budget() --> replenish_domain_budget()
* change how replenishment logic deals with cases of overrun. In v1, we were
always doing multiple replenishment at once, until the domain's budget was
back into the black. Now, in cases of substantial overrun, we just do one
replenishment, and rely on future ones to bring back the budget into
being a positive number. This was agreed upon with George during v1's
review.
Basically, what happens is that runq_tickle() realizes
d0v13 should preempt d2v7, running on cpu 12, as it
has higher credits (10135529 vs. 2619231). It therefore
tickles cpu 12 [1], which, in turn, schedules [2].
But --surprise surprise-- d2v7 has run for less than the
ratelimit interval [3], and hence it is _not_ preempted,
and continues to run. This indeed looks fine. Actually,
this is what ratelimiting is there for. Note, however,
that:
1) we interrupted cpu 12 for nothing;
2) what if, say on cpu 8, there is a vcpu that has:
+ less credit than d0v13 (so d0v13 can well
preempt it),
+ more credit than d2v7 (that's why it was not
selected to be preempted),
+ run for more than the ratelimiting interval
(so it can really be scheduled out)?
With this patch, if we are in case 2), we'd realize
that tickling 12 would be pointless, and we'll continue
looking, eventually finding and tickling 8.
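The selection described above can be modelled roughly as below. This is a simplified sketch, not the actual runq_tickle() logic: the structure, cpu_to_tickle() and the microsecond units are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of what is running on one pCPU. */
struct running {
    int cpu;             /* pCPU number */
    int64_t credit;      /* credit of the vCPU currently running there */
    int64_t ran_for_us;  /* how long it has been running so far */
};

/*
 * Return the pCPU to tickle for a waking vCPU with new_credit, or -1.
 * Among the pCPUs whose current vCPU has less credit than the waking one,
 * pick the one with the lowest credit, but skip those that have not yet
 * run for the ratelimit interval, as they would not be preempted anyway.
 */
int cpu_to_tickle(const struct running *r, size_t n,
                  int64_t new_credit, int64_t ratelimit_us)
{
    int best = -1;
    int64_t best_credit = new_credit;

    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (r[i].ran_for_us < ratelimit_us)
            continue;    /* tickling this pCPU would be pointless */
        if (r[i].credit < best_credit) {
            best_credit = r[i].credit;
            best = r[i].cpu;
        }
    }

    return best;
}
```

Using the credits from the trace above, and an invented credit and run time for the vCPU on cpu 8, the function skips cpu 12 (d2v7 has run for less than the ratelimit) and returns 8.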
Dario Faggioli [Wed, 16 Aug 2017 16:55:38 +0000 (18:55 +0200)]
xen: credit2: optimize runq_candidate() a little bit
By factoring into one (at the top) all the checks
to see whether current is the idle vcpu, and marking
it as unlikely().
In fact, if current is idle, all the logic for
dealing with yielding, context switching rate
limiting and soft-affinity is just pure overhead,
and we had better go straight to checking the runq
and picking some vcpu up.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
--- Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changes from v1:
- for George: about what you said in
<d3bf41b5-a152-8290-378f-3ff279b7e3ab@citrix.com>, I went for the "leave
unset at declaration and set explicitly on both paths" approach, i.e., the
one you said you preferred (as I also like it better in this case). After
doing that, I've applied your Reviewed-by, as you said I could.
Dario Faggioli [Wed, 16 Aug 2017 16:55:29 +0000 (18:55 +0200)]
xen: credit2: kick away vcpus not running within their soft-affinity
If, during scheduling, we realize that the current vcpu
is running outside of its own soft-affinity, it would be
preferable to send it somewhere else.
Of course, that may not be possible, and if we're too
strict, we risk having vcpus sit in runqueues, even if
there are idle pcpus (violating work-conservingness).
In fact, what if there are no pcpus, from the soft
affinity mask of the vcpu in question, where it can
run?
To make sure we don't fall in the above described trap,
only actually de-schedule the vcpu if there are idle and
not already tickled cpus from its soft affinity where it
can run immediately.
If there is at least one such cpu, we let current
be preempted, so that csched2_context_saved() will put
it back in the runq, and runq_tickle() will wake (one
of) those cpus.
If there is not even one, we let current run where it is,
as running outside its soft-affinity is still better than
not running at all.
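The decision boils down to a mask intersection. A minimal sketch, assuming at most 64 pCPUs represented as plain bitmasks; should_kick_current() and its parameters are hypothetical and stand in for the cpumask operations a real scheduler would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper. Each mask has bit N set for pCPU N:
 *   hard/soft: affinities of the vCPU currently running on this_cpu
 *   idle:      pCPUs that are idle right now
 *   tickled:   pCPUs that have already been asked to reschedule
 */
bool should_kick_current(unsigned int this_cpu, uint64_t hard, uint64_t soft,
                         uint64_t idle, uint64_t tickled)
{
    assert(this_cpu < 64);

    /* Already running within its soft-affinity: nothing to do. */
    if (soft & (UINT64_C(1) << this_cpu))
        return false;

    /*
     * Only de-schedule if some pCPU in the soft-affinity (and allowed by
     * the hard-affinity) is idle and has not been tickled already.
     */
    return (soft & hard & idle & ~tickled) != 0;
}
```

If the only idle pCPU in the soft-affinity has already been tickled, the function returns false and the vCPU keeps running where it is.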
Dario Faggioli [Wed, 16 Aug 2017 16:55:20 +0000 (18:55 +0200)]
xen: credit2: soft-affinity awareness in csched2_cpu_pick()
We want to find the runqueue with the least average load,
and to do that, we scan through all the runqueues.
It is, therefore, enough that, during such scan:
- we identify the runqueue with the least load, among
the ones that have pcpus that are part of the soft
affinity of the vcpu we're calling pick on;
- we identify the same, but for hard affinity.
At this point, we can decide whether to go for the
runqueue with the least load among the ones with some
soft-affinity, or overall.
Therefore, at the price of some code reshuffling, we
can avoid the loop.
(Also, kill a spurious ';' in the definition of MAX_LOAD.)
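A sketch of the single-scan idea, tracking both minima at once. The structure and pick_runq() are hypothetical, affinities are modelled as 64-bit masks, and the final choice (always prefer the soft-affinity candidate when there is one) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runqueue summary. */
struct runq {
    int64_t avgload;  /* average load of the runqueue */
    uint64_t cpus;    /* bit N set if pCPU N belongs to this runqueue */
};

/* Return the index of the chosen runqueue, or -1 if none is usable. */
int pick_runq(const struct runq *rq, size_t n, uint64_t hard, uint64_t soft)
{
    int min_rq = -1, min_s_rq = -1;
    int64_t min_load = INT64_MAX, min_s_load = INT64_MAX;

    assert(rq != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        uint64_t usable = rq[i].cpus & hard;

        if (!usable)
            continue;   /* the vCPU cannot run here at all */

        /* Least loaded among runqueues with some soft-affinity pCPUs. */
        if ((usable & soft) && rq[i].avgload < min_s_load) {
            min_s_load = rq[i].avgload;
            min_s_rq = (int)i;
        }

        /* Least loaded overall (hard-affinity only). */
        if (rq[i].avgload < min_load) {
            min_load = rq[i].avgload;
            min_rq = (int)i;
        }
    }

    /* Assumed policy: prefer the soft-affinity candidate, if any. */
    return min_s_rq != -1 ? min_s_rq : min_rq;
}
```

Both candidates come out of the same pass over the runqueues, so no second (hard-affinity) iteration is needed.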
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
--- Cc: Anshul Makkar <anshulmakkar@gmail.com>
Dario Faggioli [Wed, 16 Aug 2017 16:55:10 +0000 (18:55 +0200)]
xen: credit2: soft-affinity awareness in get_fallback_cpu()
By, basically, moving all the logic of the function
inside the usual two steps (soft-affinity step and
hard-affinity step) loop.
While there, add two performance counters (in cpu_pick
and in get_fallback_cpu() itself), in order to be able
to tell how frequently it happens that we need to look
for a fallback cpu.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
--- Cc: Anshul Makkar <anshulmakkar@gmail.com> Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v1:
- as discussed during review, only consider hard-affinity for the last stand.
The idea is not moving the vcpu to a different runqueue because of
soft-affinity, as a part of finding a fallback cpu;
- as discussed during review, added the performance counters;
- BUG_ON(1) turned into ASSERT_UNREACHABLE(), as suggested during review;
- return something sane and random enough, at the end of the function (in
case we somehow manage to get there).
Dario Faggioli [Thu, 15 Jun 2017 10:25:33 +0000 (12:25 +0200)]
xen/tools: credit2: soft-affinity awareness in runq_tickle()
Soft-affinity support is usually implemented by means
of a two step "balancing loop", where:
- during the first step, we consider soft-affinity
(if the vcpu has one);
- during the second (if we get to it), we consider
hard-affinity.
In runq_tickle(), we need to do that for checking
whether we can execute the waking vCPU on a pCPU
that is idle. In fact, we want to be sure that, if
there is an idle pCPU in the vCPU's soft affinity,
we'll use it.
If there are no such idle pCPUs, though, and we
have to check non-idle ones, we can avoid the loop
and do both hard and soft-affinity in one pass.
In fact, we can scan the runqueue and compute a
"score" for each vCPU which is running on each pCPU.
The idea is, since we may have to preempt someone:
- try to make sure that the waking vCPU will run
inside its soft-affinity,
- try to preempt someone that is running outside
of its own soft-affinity.
The value of the score is added to a trace record,
so xenalyze's code and tools/xentrace/formats are
updated accordingly.
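One possible shape for such a score is sketched below. The formula is an assumption made for illustration (credit difference plus a fixed bonus for each of the two soft-affinity goals); the message above does not spell out the actual computation, and SOFT_AFF_BONUS and tickle_score() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical weight given to each soft-affinity consideration. */
#define SOFT_AFF_BONUS INT64_C(1000000)

/*
 * Score for preempting the vCPU running on a given pCPU in favour of the
 * waking one. Higher means a better preemption candidate.
 */
int64_t tickle_score(int64_t new_credit, int64_t cur_credit,
                     bool cpu_in_new_soft, bool cur_in_own_soft)
{
    int64_t score = new_credit - cur_credit;

    assert(SOFT_AFF_BONUS > 0);

    /* The waking vCPU would run inside its soft-affinity here. */
    if (cpu_in_new_soft)
        score += SOFT_AFF_BONUS;

    /* The running vCPU is outside its own soft-affinity anyway. */
    if (!cur_in_own_soft)
        score += SOFT_AFF_BONUS;

    return score;
}
```

Under this assumed formula, a pCPU that satisfies both goals outranks one with the same credit gap that satisfies neither.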
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
--- Cc: Anshul Makkar <anshulmakkar@gmail.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com>
Julien Grall [Wed, 16 Aug 2017 10:27:22 +0000 (12:27 +0200)]
x86/mm: don't check alloc_boot_pages return
The only way alloc_boot_pages will return 0 is during the error case.
However, Xen will panic in the error path, so the check in the caller
is pointless.
Looking at the loop, my understanding is that it tries to allocate in
smaller chunks if a bigger chunk fails. Given that alloc_boot_pages never
returns to the caller on failure, the loop seems unnecessary.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 16 Aug 2017 08:56:23 +0000 (10:56 +0200)]
gnttab: drop useless locking
Holding any lock while accessing the maptrack entry fields is
pointless, as these entries are protected by their associated active
entry lock (which is being acquired later, before re-validating the
fields read without holding the lock).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Mon, 14 Aug 2017 15:46:28 +0000 (16:46 +0100)]
xen: lift hypercall_cancel_continuation to sched.h
The function is the same on both x86 and arm. Lift it to sched.h to
save a function call, make it take a pointer to vcpu to avoid
resolving current every time it gets called.
Take the chance to change one of its callers to only use one current
in code.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 20 Jun 2017 09:40:56 +0000 (10:40 +0100)]
common/gnttab: Correct error handling for gnttab_setup_table()
Simplify the error labels to just "unlock" and "out". This fixes an erroneous
path where a failure of rcu_lock_domain_by_any_id() still results in
rcu_unlock_domain() being called.
This is only not an XSA by luck. rcu_unlock_domain() is a nop other than
decrementing the preempt count, and nothing reads the preempt count outside of
a debug build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 31 May 2017 13:56:26 +0000 (14:56 +0100)]
x86/hpet: Improve handling of timer_deadline
timer_deadline is only ever updated via this_cpu() in timer_softirq_action(),
so is not going to change behind the back of the currently running cpu.
Update hpet_broadcast_{enter,exit}() to cache the value in a local variable to
avoid the repeated RELOC_HIDE() penalty.
handle_hpet_broadcast() reads the timer_deadlines of remote cpus, but there is
no need to force the read for cpus which are not present in the mask. One
requirement is that we only sample the value once (which happens as a side
effect of RELOC_HIDE()), but is made more explicit with ACCESS_ONCE().
Bloat-o-meter shows a modest improvement:
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-144 (-144)
function                     old     new   delta
hpet_broadcast_exit          335     313     -22
hpet_broadcast_enter         327     278     -49
handle_hpet_broadcast        572     499     -73
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 15 Aug 2017 13:08:03 +0000 (15:08 +0200)]
gnttab: correct pin status fixup for copy
Regardless of copy operations only setting GNTPIN_hst*, GNTPIN_dev*
also need to be taken into account when deciding whether to clear
_GTF_{read,writ}ing. At least for consistency with code elsewhere the
read part better doesn't use any mask at all.
This is XSA-230.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 15 Aug 2017 13:07:25 +0000 (15:07 +0200)]
gnttab: split maptrack lock to make it fulfill its purpose again
The way the lock is currently being used in get_maptrack_handle(), it
protects only the maptrack limit: The function acts on current's list
only, so races on list accesses are impossible even without the lock.
Otoh list access races are possible between __get_maptrack_handle() and
put_maptrack_handle(), due to the invocation of the former for other
than current from steal_maptrack_handle(). Introduce a per-vCPU lock
for list accesses to become race free again. This lock will be
uncontended except when it becomes necessary to take the steal path,
i.e. in the common case there should be no meaningful performance
impact.
When get_maptrack_handle() adds a stolen entry to a fresh, empty
freelist, we think that there is probably no concurrency. However,
this is not a fast path and adding the locking there makes the code
clearly correct.
Also, while we are here: the stolen maptrack_entry's tail pointer was
not properly set. Set it.
This is CVE-2017-12136 / XSA-228.
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Andrew Cooper [Tue, 15 Aug 2017 13:06:45 +0000 (15:06 +0200)]
x86/grant: disallow misaligned PTEs
Pagetable entries must be aligned to function correctly. Disallow attempts
from the guest to have a grant PTE created at a misaligned address, which
would result in corruption of the L1 table with largely-guest-controlled
values.
This is CVE-2017-12137 / XSA-227.
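The kind of check involved can be sketched as follows, assuming 8-byte pagetable entries. pte_t and check_grant_pte_addr() are hypothetical names for the example, not the identifiers used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: pagetable entries are 8 bytes wide. */
typedef uint64_t pte_t;

/* Return 0 if pte_addr may hold a PTE, -1 if it is misaligned. */
int check_grant_pte_addr(uint64_t pte_addr)
{
    assert(sizeof(pte_t) == 8);

    /* An entry must start on a multiple of its own size. */
    if (pte_addr & (sizeof(pte_t) - 1))
        return -1;

    return 0;
}
```

An address such as 0x1004 is rejected, since writing an 8-byte entry there would straddle two slots of the L1 table.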
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 14 Aug 2017 15:17:44 +0000 (17:17 +0200)]
grant_table: include mm.h in xen/grant_table.h
While re-ordering the includes alphabetically in arch/arm/domain.c, I got
a compilation error because grant_table.h uses gfn_t before it has been
defined:
In file included from domain.c:14:0:
xen/xen/include/xen/grant_table.h:153:29: error: unknown type name ‘gfn_t’
gfn_t *gfn, uint16_t *status);
^
Fix it by including xen/mm.h in it.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 11 Aug 2017 13:35:48 +0000 (14:35 +0100)]
x86/config: Fix stale documentation concerning virtual layout
The hypercall argument translation area lives in the per-domain mappings in
PML4 slot 260. Nothing currently resides in the lower canonical half above
the 4GB boundary in a 32bit PV guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 26 Jul 2017 09:18:02 +0000 (10:18 +0100)]
common/domain_page: Drop domain_mmap_cache infrastructure
This infrastructure is used exclusively by the x86 do_mmu_update() hypercall.
Mapping and unmapping domain pages is probably not the slow part of that
function, but even with an opencoded caching implementation, Bloat-o-meter
reports:
function                     old     new   delta
do_mmu_update               6815    6573    -242
The !CONFIG_DOMAIN_PAGE stub code has a mismatch between mapping and
unmapping, which is a latent bug.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 10 Aug 2017 10:37:24 +0000 (12:37 +0200)]
x86/HVM: fix boundary check in hvmemul_insn_fetch() (again)
Commit 5a992b670b ("x86/hvm: Fix boundary check in
hvmemul_insn_fetch()") went a little too far in its correction to
commit 0943a03037 ("x86/hvm: Fixes to hvmemul_insn_fetch()"): Keep the
start offset check, but restore the original end offset one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
cpufreq: only stop ondemand governor if already started
On CPUFREQ_GOV_STOP in cpufreq_governor_dbs, shortcut to
return success if the governor is already stopped.
Avoid executing dbs_timer_exit, to prevent tripping an assertion
within a call to kill_timer on a timer that has not been prepared
with init_timer, if the CPUFREQ_GOV_START case has not
run beforehand.
kill_timer validates timer state:
* itself, via BUG_ON(this_cpu(timers).running == timer);
* within active_timer, ASSERTing timer->status is within bounds;
* within list_del, which ASSERTs timer inactive list membership.
Patch is synonymous to an OpenXT patch produced at Citrix prior to
June 2014.
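The shortcut can be modelled as below. This is a stand-alone sketch with hypothetical types and names: the 'enabled' flag and the counter merely stand in for the governor's real state and for the call to dbs_timer_exit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the governor events and per-CPU state. */
enum gov_event { GOV_START, GOV_STOP };

struct dbs_info {
    bool enabled;     /* set once the START case has run */
    int timer_kills;  /* counts how often the timer was torn down */
};

int governor_event(struct dbs_info *info, enum gov_event ev)
{
    assert(info != NULL);

    switch (ev) {
    case GOV_START:
        info->enabled = true;   /* stands in for preparing the timer */
        return 0;

    case GOV_STOP:
        if (!info->enabled)
            return 0;           /* already stopped: leave the timer alone */
        info->enabled = false;
        info->timer_kills++;    /* stands in for dbs_timer_exit() */
        return 0;
    }

    return -1;
}
```

A STOP that arrives before any START (or a second STOP in a row) returns success without touching the timer.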
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/tboot: disable interrupts after map_pages_to_xen() in tboot_shutdown()
Move the point where interrupts are disabled in tboot_shutdown
to slightly later, to after the call to map_pages_to_xen.
This patch originated in OpenXT with the following report:
"Disabling interrupts early causes debug assertions.
This is only seen with debug builds but since it causes assertions it is
probably a bigger problem. It clearly says in map_pages_to_xen that it
should not be called with interrupts disabled. Moved disabling to just
after that call."
The Xen code comment ahead of map_pages_to_xen notes that the CPU cache
flushing in map_pages_to_xen differs depending on whether interrupts are
enabled or not. The flush logic with interrupts enabled is more
conservative, flushing all CPUs' TLBs/caches, rather than just local.
This is just before the tboot memory integrity MAC calculation is performed
in the case of entering S3.
Original patch author credit: Ross Philipson.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 10 Aug 2017 10:34:21 +0000 (12:34 +0200)]
AMD IOMMU: drop amd_iommu_setup_hwdom_device()
By moving its bridge special casing to amd_iommu_add_device(), we can
pass the latter to setup_hwdom_pci_devices() and at once consistently
handle bridges discovered at boot time as well as such reported by Dom0
later on.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
User-Mode Instruction Prevention (UMIP) is a security feature present in
new Intel Processors. With this feature, when the UMIP bit in CR4 is set,
the following instructions cannot be executed if CPL > 0: SGDT, SIDT,
SLDT, SMSW, and STR. An attempt at such execution causes a general-
protection exception (#GP).
This patch simply adds necessary definitions to expose this feature to
hvm guests.
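The rule can be written down as a small predicate. This only models the architectural behaviour for the five instructions listed above; umip_raises_gp() is a hypothetical name, while bit 11 is the architectural position of UMIP in CR4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR4.UMIP is bit 11 of CR4. */
#define CR4_UMIP (UINT64_C(1) << 11)

/*
 * Would SGDT/SIDT/SLDT/SMSW/STR raise #GP when executed at privilege
 * level cpl with the given CR4 value?
 */
bool umip_raises_gp(uint64_t cr4, unsigned int cpl)
{
    assert(cpl <= 3);   /* x86 privilege levels are 0..3 */

    return (cr4 & CR4_UMIP) != 0 && cpl > 0;
}
```

Ring 0 code is never affected, and with the bit clear the instructions behave as before at any privilege level.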
Signed-off-by: Boqun Feng (Intel) <boqun.feng@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Thu, 10 Aug 2017 10:32:16 +0000 (12:32 +0200)]
VT-d PI: disable VT-d PI when CPU-side PI isn't enabled
From the context calling pi_desc_init(), we can conclude the current
implementation of VT-d PI depends on CPU-side PI. If we enable VT-d PI
and disable CPU-side PI by disabling APICv explicitly in xen boot
command line, we would get an assertion failure.
This patch clears iommu_intpost once it finds that CPU-side PI won't be
enabled. This is safe because it is done before the flag starts taking
effect. Also take this chance to remove the useless check of "acknowledge
interrupt on exit", which is a minimal requirement that has been checked
earlier.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Olaf Hering [Fri, 23 Jun 2017 17:35:04 +0000 (19:35 +0200)]
vtpmmgr: make inline functions static
gcc7 is more strict with functions marked as inline. They are not
automatically inlined. Instead a function call is generated, but the
actual code is not visible by the linker.
Do a mechanical change and mark every 'inline' as 'static inline'. For
simpler review the static goes into an extra line.
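For illustration, this is what the resulting pattern looks like, with 'static' on its own line as described. The functions are made up for the example and are not taken from vtpmmgr. With a plain 'inline' and no 'static', C99 semantics provide no out-of-line definition in this translation unit, so a call the compiler chooses not to inline ends up as an undefined reference at link time.

```c
#include <assert.h>

/* 'static' gives the function internal linkage, so an out-of-line copy
 * is always available in this translation unit if it is needed. */
static
inline int clamp_nonneg(int v)
{
    return v < 0 ? 0 : v;
}

/* An ordinary caller; the compiler may or may not inline the calls. */
int smaller_nonneg(int a, int b)
{
    int x = clamp_nonneg(a);
    int y = clamp_nonneg(b);

    assert(x >= 0 && y >= 0);   /* guaranteed by clamp_nonneg() */

    return x < y ? x : y;
}
```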
Signed-off-by: Olaf Hering <olaf@aepfle.de> Tested-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Mon, 7 Aug 2017 01:50:49 +0000 (09:50 +0800)]
x86: adjust place of an ASSERT to avoid crash when destroy a domain.
In 'psr_free_cos', we should not use 'ASSERT(socket_info)' at the beginning
because the 'socket_info' is allocated only if 'psr' boot parameter is set.
So adjust its place to avoid crash.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxl: do not start dom0 qemu for stubdomain when not needed
Do not setup vfb+vkb when no access method was configured. Then check if
qemu is really needed.
After this change, the only non-configurable thing that forces qemu to
run in dom0 is the consoles used for save/restore. But even in that case,
a much smaller part of qemu is exposed.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Rusty Bird [Thu, 3 Aug 2017 10:40:25 +0000 (12:40 +0200)]
VT-d: don't panic/warn on iommu=no-igfx
When operating on an Intel graphics device, iommu_enable_translation()
panicked (force_iommu==1) or warned (force_iommu==0) about the BIOS if
is_igd_vt_enabled_quirk() returned 0. That's good if the actual BIOS
problem has been detected. But since commit 1463411, returning 0 could
also happen if the user simply passed "iommu=no-igfx", in which case
bailing out with an info message (instead of a panic/warning) would be
more appropriate.
The panic broke the combination "iommu=force,no-igfx", and also the case
where "iommu=no-igfx" is passed but force_iommu=1 is set automatically
by x2apic_bsp_setup().
Move the iommu_igfx check from is_igd_vt_enabled_quirk() into its only
caller iommu_enable_translation(), and tweak the logic.
Signed-off-by: Rusty Bird <rustybird@openmailbox.org> Acked-by: Kevin Tian <kevin.tian@intel.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support set cbm for L2 CAT.
This patch implements the xl/xc changes to support setting the CBM
for L2 CAT.
A new level option is introduced to the original CAT setting
command in order to set the CBM for the specified CAT level.
- 'xl psr-cat-set' is updated to set the cache capacity bitmask (CBM)
for a domain according to the input cache level.
root@:~$ xl psr-cat-set -l2 1 0x7f
root@:~$ xl psr-cat-show -l2 1
Socket ID : 0
Default CBM : 0xff
ID NAME CBM
1 ubuntu14 0x7f
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support show cbm for L2 CAT.
This patch implements xl/xc changes to support
showing the CBM of L2 CAT.
A new level option is introduced to the original CAT showing
command in order to show the CBM for the specified CAT level.
- 'xl psr-cat-show' is updated to show the CBM of a domain
according to the input cache level.
Examples:
root@:~$ xl psr-cat-show -l2 1
Socket ID : 0
Default CBM : 0xff
ID NAME CBM
1 ubuntu14 0x7f
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support get HW info for L2 CAT.
This patch implements xl/xc changes to support getting HW info
for L2 CAT.
'xl psr-hwinfo' is updated to show both L3 CAT and L2 CAT
info.
Example (on a machine which only supports L2 CAT):
Cache Monitoring Technology (CMT):
Enabled : 0
Cache Allocation Technology (CAT): L2
Socket ID : 0
Maximum COS : 3
CBM length : 8
Default CBM : 0xff
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: implement cos id picking flow.
Continue from previous patch:
'x86: refactor psr: L3 CAT: set value: implement cos finding flow.'
If we fail to find a COS ID, we need to pick a new COS ID for the domain. Only
a COS ID whose ref[COS_ID] is 1 or 0 can be picked to hold a new set of feature
values.
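A sketch of the picking rule follows. Two points are assumptions rather than statements from the message above: that a COS ID with ref == 1 is only eligible when it is the domain's own current COS ID (so nobody else is affected by rewriting it), and that COS 0 holds the defaults and is never handed out. pick_cos_sketch() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * ref[i] is the number of domains currently using COS ID i.
 * Return the COS ID picked for the domain, or -1 if none is available.
 */
int pick_cos_sketch(const unsigned int *ref, unsigned int cos_max,
                    unsigned int old_cos)
{
    assert(ref != NULL && old_cos <= cos_max);

    /* Assumed: the domain's own COS can be rewritten in place if the
     * domain is its only user. */
    if (old_cos != 0 && ref[old_cos] == 1)
        return (int)old_cos;

    /* Otherwise take the first unused one; COS 0 is skipped (assumed to
     * hold the default values). */
    for (unsigned int cos = 1; cos <= cos_max; cos++)
        if (ref[cos] == 0)
            return (int)cos;

    return -1;
}
```

With ref = {5, 2, 0, 1}, a domain currently on COS 3 keeps COS 3, while a domain on COS 1 (shared with another domain) is moved to the unused COS 2.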
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: assemble features value array.
Only one COS ID can be used by a domain at a time. That means all enabled
features' COS registers at this COS ID are valid for this domain at that time.
When the user updates a feature's value, we need to make sure that all other
features' values are not affected. So, we first need to gather an array which
contains the current values of all features, and replace the value of the
feature being set with the new value.
Then, we can check whether there is a COS ID on which all features' COS
register values are the same as the array. If we find one, we just use this
COS ID. If we fail to find one, we need to pick a new COS ID.
This patch implements value array assembling flow.
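The assemble-then-search flow can be sketched as below, for two features and four COS IDs. All types, sizes and function names here are hypothetical and chosen only to keep the example small.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature indexes and sizes. */
enum { FEAT_L3_CAT, FEAT_L2_CAT, NR_FEAT };
#define NR_COS 4

/* val[f][c] models the COS register of feature f for COS ID c. */
struct cos_regs {
    uint32_t val[NR_FEAT][NR_COS];
};

/* Copy the domain's current values, then override the feature being set. */
void gather_val_array(uint32_t val[NR_FEAT], const struct cos_regs *regs,
                      unsigned int old_cos, unsigned int feat,
                      uint32_t new_val)
{
    assert(old_cos < NR_COS && feat < NR_FEAT);

    for (unsigned int f = 0; f < NR_FEAT; f++)
        val[f] = regs->val[f][old_cos];
    val[feat] = new_val;
}

/* Return a COS ID whose registers all match val[], or -1 if none does. */
int find_matching_cos(const uint32_t val[NR_FEAT],
                      const struct cos_regs *regs)
{
    for (unsigned int cos = 0; cos < NR_COS; cos++) {
        unsigned int f;

        for (f = 0; f < NR_FEAT; f++)
            if (regs->val[f][cos] != val[f])
                break;
        if (f == NR_FEAT)
            return (int)cos;
    }

    return -1;
}
```

Starting from all-default registers (0x7ff for L3 CAT, 0xff for L2 CAT), setting L3 CAT to 0x1ff yields the array {0x1ff, 0xff}, for which no COS ID matches until one is programmed with those values.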
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: implement framework.
As the set value flow is the most complicated one in psr, it is
divided into several patches to make things clearer. This patch
implements the set value framework to show the whole picture first.
It also changes the domctl interface to make it more general.
To make the set value flow general and able to support multiple features
at the same time, it includes the steps below:
1. Test and set the dom_ids bit corresponding to the domain. If the old bit
is 0, which means the domain's COS ID is invalid, restore the COS ID to 0.
If the COS ID is valid, get the COS ID that the current domain is using.
2. Gather a value array holding the current values of all features, and
replace the current value of the feature being set with the new input
value.
3. Check whether there is already a COS ID on which all features'
values are the same as the array. If so, we can reuse this COS
ID.
4. If we fail to find one, we need to pick an available COS ID. Only a COS
ID whose ref is 0 or 1 can be picked.
5. Write the feature's MSRs according to the COS ID.
6. Update ref according to the COS ID.
7. Save the COS ID into the current domain's psr_cos_ids[socket] so that we
can know which COS the domain is using on the socket.
So, some functions are abstracted, and the callback functions will be
implemented in the next patches.
Here is an example to understand the process. The CPU supports
two features, e.g. L3 CAT and L2 CAT. The user wants to set L3 CAT
of Dom1 to 0x1ff.
1. At the initial time, the old_cos of Dom1 is 0. The COS register values
are as below at this time.
       -------------------------------
       | COS 0 | COS 1 | COS 2 | ... |
       -------------------------------
L3 CAT | 0x7ff | 0x7ff | 0x7ff | ... |
       -------------------------------
L2 CAT | 0xff  | 0xff  | 0xff  | ... |
       -------------------------------
2. Gather the value array and insert new value into it:
val[0]: 0x1ff
val[1]: 0xff
3. It cannot find a matching COS.
4. Pick COS 1 to store the value set.
5. Write the L3 CAT COS 1 registers. The COS registers values are
changed to below now.
       -------------------------------
       | COS 0 | COS 1 | COS 2 | ... |
       -------------------------------
L3 CAT | 0x7ff | 0x1ff | ...   | ... |
       -------------------------------
L2 CAT | 0xff  | 0xff  | ...   | ... |
       -------------------------------
6. The ref[1] is increased to 1 because Dom1 is using it now.
7. Save 1 to Dom1's psr_cos_ids[socket].
Then, user wants to set L3 CAT of Dom2 to 0x1ff too. The old_cos
of Dom2 is 0 too. Repeat above flow.
The val array assembled is:
val[0]: 0x1ff
val[1]: 0xff
So, it can find a matching COS, COS 1. Then, it can reuse COS 1
for Dom2.
The ref[1] is increased to 2 now because both Dom1 and Dom2 are
using this COS ID. Set 1 to Dom2's psr_cos_ids[socket].
One thing needs emphasizing: a domain's COS ID must be restored to 0 when
a socket goes offline. Otherwise, a stale COS ID will be used when the
socket comes online again, and the user may be shown a wrong CBM. But
iterating over all domains to restore the COS ID to 0 takes a long time.
So, we define 'dom_ids[]' to represent all domains, one bit per domain.
If the bit is 0 when entering 'psr_ctxt_switch_to', either this is the
first time the domain is switched to this socket or the domain's COS ID
has not been set since the socket came online. The COS ID written to the
ASSOC register on this socket should then be the default value, 0. If the
bit is 1, the domain's COS ID has been set while the socket was online,
so it is valid and can be used directly. We restore the domain's COS ID
to 0 when 'psr_get_val' or 'psr_set_val' is called and the bit
corresponding to the domain is 0 but the domain's COS ID is not 0. This
avoids the CPU serialization that would result if the restore were
executed in 'psr_ctxt_switch_to'.
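A rough C sketch of this bitmap scheme follows. It only illustrates the
idea: the structure and function names are hypothetical, a single 64-bit
word stands in for 'dom_ids[]', and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DOMS 64   /* illustrative: one bit per domain in a single word */

struct socket_state {
    uint64_t dom_ids;     /* bit N set: domain N's COS ID set since online */
};

struct dom_state {
    unsigned int id;      /* must be < MAX_DOMS in this sketch */
    unsigned int cos_id;  /* COS ID recorded for this socket */
};

static bool dom_bit(const struct socket_state *s, const struct dom_state *d)
{
    assert(d->id < MAX_DOMS);
    return (s->dom_ids >> d->id) & 1;
}

/* Socket goes offline: one store invalidates every domain's COS ID. */
static void socket_offline(struct socket_state *s)
{
    s->dom_ids = 0;
}

/* Context switch path: read only, falls back to the default COS 0. */
static unsigned int cos_for_ctxt_switch(const struct socket_state *s,
                                        const struct dom_state *d)
{
    return dom_bit(s, d) ? d->cos_id : 0;
}

/* get/set path: lazily restore a stale COS ID to 0. */
static void restore_stale_cos(const struct socket_state *s,
                              struct dom_state *d)
{
    if (!dom_bit(s, d) && d->cos_id != 0)
        d->cos_id = 0;
}

/* After a successful set: record the COS ID and mark it valid. */
static void record_cos(struct socket_state *s, struct dom_state *d,
                       unsigned int cos)
{
    assert(d->id < MAX_DOMS);
    d->cos_id = cos;
    s->dom_ids |= (uint64_t)1 << d->id;
}
```

The point of the split is that the context switch path never writes to the
domain; the write happens only on the much rarer get/set path.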
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This patch implements the Domain init/free and schedule flows.
- When a domain is initialized, its psr resources should be allocated.
- When a domain is freed, its psr resources should be freed too.
- When a domain is scheduled, its COS ID on the socket should be
written to the ASSOC register so that the corresponding COS MSR value
takes effect.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: implement main data structures, CPU init and free flows.
To construct an extensible framework, we need to analyze the PSR features
and separate what is common from what is feature specific, then
encapsulate both into different data structures.
By analyzing the PSR features, we get the map below.
+------+------+------+
--------->| Dom0 | Dom1 | ... |
| +------+------+------+
| |
|Dom ID | cos_id of domain
| V
| +-----------------------------------------------------------------------------+
User --------->| PSR |
Socket ID | +--------------+---------------+---------------+ |
| | Socket0 Info | Socket 1 Info | ... | |
| +--------------+---------------+---------------+ |
| | cos_id=0 cos_id=1 ... |
| | +-----------------------+-----------------------+-----------+ |
| |->Ref : | ref 0 | ref 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------------------+-----------------------+-----------+ |
| |->L3 CAT: | cos 0 | cos 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------------------+-----------------------+-----------+ |
| |->L2 CAT: | cos 0 | cos 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------+-----------+-----------+-----------+-----------+ |
| |->CDP : | cos0 code | cos0 data | cos1 code | cos1 data | ... | |
| +-----------+-----------+-----------+-----------+-----------+ |
+-----------------------------------------------------------------------------+
So, we need to define a socket info data structure, 'struct
psr_socket_info', to manage per-socket information. It contains a
reference count array indexed by COS ID and a feature array to
manage all enabled features. Each entry of the reference count
array records how many domains are using the COS registers
with that COS ID. For example, with L3 CAT and L2 CAT enabled,
Dom1 uses the COS_ID=1 registers of both features to save CBM values, as
below.
       +-------+-------+-------+-----+
       | COS 0 | COS 1 | COS 2 | ... |
       +-------+-------+-------+-----+
L3 CAT | 0x7ff | 0x1ff | ...   | ... |
       +-------+-------+-------+-----+
L2 CAT | 0xff  | 0xff  | ...   | ... |
       +-------+-------+-------+-----+
If Dom2 has the same CBM values, it can reuse the registers with COS_ID=1.
That means both Dom1 and Dom2 use the same COS registers (ID=1) to keep the
same L3/L2 values. So, the value of ref[1] is 2, which means 2 domains are
using COS_ID 1.
To manage a feature, we need to define a feature node data structure,
'struct feat_node', to hold the feature's specific HW info and an array of
all COS register values of this feature.
To manage feature properties, we need to define a feature property data
structure, 'struct feat_props', to hold the feature independent values:
the common properties (callback functions, into which all feature specific
behaviors are encapsulated) and generic values (e.g. the cos_max).
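As a rough picture of how the per-socket pieces could fit together, here
is a C sketch. Only the structure names 'psr_socket_info' and 'feat_node'
come from the description above; every member name, the array bound and
the helper 'socket_get_cbm' are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COS_REG 16  /* illustrative upper bound */

enum feat_type { FEAT_L3_CAT, FEAT_L2_CAT, FEAT_NUM };

/* Per-feature state: HW info plus the value of every COS register. */
struct feat_node {
    unsigned int cbm_len;              /* example of feature specific HW info */
    uint32_t cos_reg_val[MAX_COS_REG]; /* indexed by COS ID */
};

/* Per-socket state. */
struct psr_socket_info {
    struct feat_node *features[FEAT_NUM]; /* NULL: feature not enabled */
    unsigned int cos_ref[MAX_COS_REG];    /* domains using each COS ID */
};

/* Read the CBM a feature holds for 'cos'; -1 if the feature is disabled. */
static int socket_get_cbm(const struct psr_socket_info *info,
                          enum feat_type type, unsigned int cos,
                          uint32_t *cbm)
{
    assert(type < FEAT_NUM && cos < MAX_COS_REG);
    if (info->features[type] == NULL)
        return -1;
    *cbm = info->features[type]->cos_reg_val[cos];
    return 0;
}
```

Keeping the reference counts on the socket rather than on each feature
matches the description: one COS ID selects the same entry in every
enabled feature, so a single count per COS ID is enough.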
CDP is a special feature which uses two entries of the array
for one COS ID. So, the number of CDP COS registers is half that of L3
CAT. E.g. if L3 CAT has 16 COS registers, then CDP has 8 COS registers
when it is enabled. CDP uses the COS registers array in code/data pairs,
as the CDP row of the map above shows.
For more details, please refer to the SDM and to the patches implementing
'get value' and 'set value'.
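The pairing arithmetic can be sketched as below. The function names are
hypothetical, and the sketch deliberately does not say which entry of a
pair holds code and which holds data; that should be checked against the
SDM.

```c
#include <assert.h>

/* COS IDs available with CDP on, given the L3 CAT register count. */
static unsigned int cdp_cos_count(unsigned int l3_cat_regs)
{
    return l3_cat_regs / 2;
}

/* Store in idx[] the two array entries that belong to 'cos'. */
static void cdp_entries(unsigned int cos, unsigned int l3_cat_regs,
                        unsigned int idx[2])
{
    assert(cos < cdp_cos_count(l3_cat_regs));
    idx[0] = cos * 2;
    idx[1] = cos * 2 + 1;
}
```

With 16 L3 CAT registers this gives 8 usable COS IDs, and COS ID 1 maps to
array entries 2 and 3.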
This patch also implements the CPU init and free flows, including L3 CAT
initialization and the freeing of some resources. It includes the flows
below:
1. presmp init:
- parse command line parameter.
- allocate socket info for every socket.
- allocate feature resource.
- initialize socket info, get feature info and add feature into feature
array per cpuid result.
- free resources allocated if an error happens.
- register cpu notifier to handle cpu events.
2. cpu notifier:
- handle cpu online events; if initialization work has been done before,
do nothing.
- handle cpu offline events; if it is the last cpu to go offline, free
some socket resources.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: remove L3 CAT/CDP codes.
The current cache allocation code in psr.c does not account for
future feature additions and is hard to extend.
To make psr.c flexible enough to take new features and to follow
the open/closed principle (open for extension but closed for
modification), we have to refactor psr.c:
1. Analyze cache allocation features and abstract general data
structures.
2. Analyze the init flow and all other function flows, and abstract all
steps for which different features may have different implementations.
Make these steps callback functions and register feature
specific functions. Then, the main flows will not change
when a new feature is introduced.
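The callback idea in step 2 can be illustrated with a small C sketch. All
names here ('register_feat', 'psr_write_cos', the members of 'feat_props',
the fake register arrays) are made up for the example and are not taken
from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum feat_type { FEAT_L3_CAT, FEAT_L2_CAT, FEAT_NUM };

/* What the generic code knows about a feature. */
struct feat_props {
    unsigned int cos_max;
    void (*write_cos)(unsigned int cos, uint32_t val);
};

/* Arrays standing in for the per-feature MSRs. */
static uint32_t fake_l3_regs[16], fake_l2_regs[16];

static void l3_write_cos(unsigned int cos, uint32_t val)
{
    fake_l3_regs[cos] = val;
}

static void l2_write_cos(unsigned int cos, uint32_t val)
{
    fake_l2_regs[cos] = val;
}

static const struct feat_props l3_props = { 15, l3_write_cos };
static const struct feat_props l2_props = { 15, l2_write_cos };

static const struct feat_props *registered[FEAT_NUM];

/* Feature specific code registers its callbacks once at init. */
static void register_feat(enum feat_type type, const struct feat_props *p)
{
    assert(type < FEAT_NUM && p != NULL && p->write_cos != NULL);
    registered[type] = p;
}

/* Generic flow: unchanged no matter how many features exist. */
static int psr_write_cos(enum feat_type type, unsigned int cos, uint32_t val)
{
    const struct feat_props *p = type < FEAT_NUM ? registered[type] : NULL;

    if (p == NULL || cos > p->cos_max)
        return -1;
    p->write_cos(cos, val);
    return 0;
}
```

Adding a feature then means adding one more 'feat_props' instance and one
register_feat() call; psr_write_cos() itself is not touched.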
Because the amount of refactored code is large and the logic
changes a lot, reviewers would be confused if the old code were
simply modified: they would have to understand both the old code
and the new implementation. After review iterations from V1 to V3,
Jan proposed removing all old cache allocation code first, then
implementing the new code step by step. This makes the code
easier to review.
There is no construction without destruction. So, this patch
removes all current L3 CAT/CDP code in psr.c. The following
patches will introduce the new mechanism.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
docs: create Cache Allocation Technology (CAT) and Code and Data Prioritization (CDP) feature document
This patch creates the CAT and CDP feature document in doc/features/. It
describes the key points of implementing L3 CAT/CDP and L2 CAT, which are
described in detail in the Intel SDM, "INTEL® RESOURCE DIRECTOR TECHNOLOGY
(INTEL® RDT) ALLOCATION FEATURES".
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Move the pre-existing PAGE_(SHIFT|SIZE|MASK|ALIGN)_(4K|64K) defines to a
common place, xen/page-defs.h, and introduce the corresponding defines for
16K page granularity there, so that later commits can use the
consolidated defines.
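For illustration, a set of such defines for the three granularities could
look like the sketch below. The macro names follow the pattern named
above, but the bodies are an assumption and not a copy of
xen/page-defs.h; 'page_defs_selfcheck' is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K      12
#define PAGE_SIZE_4K       (UINT64_C(1) << PAGE_SHIFT_4K)
#define PAGE_MASK_4K       (~(PAGE_SIZE_4K - 1))
#define PAGE_ALIGN_4K(a)   (((a) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)

#define PAGE_SHIFT_16K     14
#define PAGE_SIZE_16K      (UINT64_C(1) << PAGE_SHIFT_16K)
#define PAGE_MASK_16K      (~(PAGE_SIZE_16K - 1))
#define PAGE_ALIGN_16K(a)  (((a) + PAGE_SIZE_16K - 1) & PAGE_MASK_16K)

#define PAGE_SHIFT_64K     16
#define PAGE_SIZE_64K      (UINT64_C(1) << PAGE_SHIFT_64K)
#define PAGE_MASK_64K      (~(PAGE_SIZE_64K - 1))
#define PAGE_ALIGN_64K(a)  (((a) + PAGE_SIZE_64K - 1) & PAGE_MASK_64K)

/* Sanity checks relating the three granularities. */
static void page_defs_selfcheck(void)
{
    assert(PAGE_SIZE_16K == 4 * PAGE_SIZE_4K);
    assert(PAGE_SIZE_64K == 4 * PAGE_SIZE_16K);
}
```

PAGE_ALIGN_*() rounds an address up to the next boundary of the given
granularity, while masking with PAGE_MASK_*() rounds it down.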
Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Praveen Kumar [Thu, 3 Aug 2017 10:24:25 +0000 (12:24 +0200)]
rbtree: changes to align the code with Linux tree
The patch aligns the code of the rbtree related files with the Linux tree.
This will minimize conflicts during any future porting from the Linux tree.
Linux commits up to f4b477c47332367d35686bd2b808c2156b96d7c7 for rbtree.h
This includes the addition of commented inline functions in rbtree.h, to
have a complete replica of the Linux tree.
Olaf Hering [Wed, 26 Jul 2017 14:39:50 +0000 (16:39 +0200)]
docs: add pod variant of xl-numa-placement
Convert source for xl-numa-placement.7 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Olaf Hering [Wed, 26 Jul 2017 14:39:49 +0000 (16:39 +0200)]
docs: add pod variant of xl-network-configuration.5
Convert source for xl-network-configuration.5 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Olaf Hering [Wed, 26 Jul 2017 14:39:48 +0000 (16:39 +0200)]
docs: add pod variant of xen-pv-channel.7
Convert source for xen-pv-channel.7 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Running "make uninstall" does not remove all installed files, a
situation which might cause link related issues if xen is re-installed
in a different location.
To make uninstall remove the files correctly, the process is best
done recursively, mirroring each "install" target with an
"uninstall" target that removes the installed files.
An exception to this rule is uninstalling the files produced by
"qemu-xen-dir-remote" and "qemu-xen-traditional-dir", which are external
to the project. These projects do not implement an "uninstall" target so
the files have to be removed manually.
Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
If xc_gntshr_open fails, the only cleanup needed is to free the allocated
memory. So, instead of calling libxenvchan_close (which assumes that valid
calculated buffers have already been mmapped), free the memory and return.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 30 Jun 2017 12:24:19 +0000 (12:24 +0000)]
x86/svm: Alias the VMCB segment registers as an array
This allows svm_{get,set}_segment_register() to access the user segments by
array index, as the x86_seg_* constants match the hardware encoding.
While making this alteration, add some newlines for clarity, switch an int for
a bool, and make the functions fail safe in a release build, rather than
crashing Xen.
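The aliasing technique can be sketched in C as below. This is not the
VMCB definition from the tree: 'vmcb_sketch', 'seg_reg', the 'seg_*'
enumerators and both accessors are hypothetical, and only the six user
segments are modelled. The enumerator values follow the x86 segment
register encoding (ES=0, CS=1, SS=2, DS=3, FS=4, GS=5). Returning false is
just one way to fail safe for an out of range index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86 segment register encoding: ES=0, CS=1, SS=2, DS=3, FS=4, GS=5. */
enum seg { seg_es, seg_cs, seg_ss, seg_ds, seg_fs, seg_gs, seg_user_count };

struct seg_reg {
    uint16_t sel;
    uint16_t attr;
    uint32_t limit;
    uint64_t base;
};

struct vmcb_sketch {
    union {
        struct seg_reg sreg[seg_user_count];  /* access by index */
        struct {
            struct seg_reg es, cs, ss, ds, fs, gs; /* access by name */
        };
    };
};

/* Check that the named members really overlay the array slots. */
static void check_alias(void)
{
    assert(offsetof(struct vmcb_sketch, gs) ==
           seg_gs * sizeof(struct seg_reg));
}

/* Read a user segment; a bad index yields zeroes instead of a crash. */
static bool get_segment(const struct vmcb_sketch *v, unsigned int seg,
                        struct seg_reg *out)
{
    if (seg >= seg_user_count) {
        memset(out, 0, sizeof(*out));
        return false;
    }
    *out = v->sreg[seg];
    return true;
}

/* Write a user segment; a bad index is rejected. */
static bool set_segment(struct vmcb_sketch *v, unsigned int seg,
                        const struct seg_reg *in)
{
    if (seg >= seg_user_count)
        return false;
    v->sreg[seg] = *in;
    return true;
}
```

Indexing replaces a per-register switch statement, which is consistent
with the size reduction reported for both functions.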
Bloat-o-meter reports some modest improvements:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-130 (-130)
function                     old     new   delta
svm_set_segment_register     662     653      -9
svm_get_segment_register     409     288    -121
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>