Jan Beulich [Mon, 18 Mar 2013 16:13:32 +0000 (17:13 +0100)]
x86/HPET: mask interrupt while changing affinity
While being unable to reproduce the "No irq handler for vector ..."
messages observed on other systems, the change done by 5dc3fd2 ('x86:
extend diagnostics for "No irq handler for vector" messages') appears
to point at the lack of masking - at least I can't see what else might
be wrong with the HPET MSI code that could trigger these warnings.
While at it, also adjust the message printed by aforementioned commit
to not pointlessly insert spaces - we don't need aligned tabular output
here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Roger Pau Monne [Wed, 13 Mar 2013 17:42:17 +0000 (17:42 +0000)]
xl: add vif.default.script
Replace vifscript with vif.default.script. The old config option is
kept for backwards compatibility.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: George Dunlap <george.dunlap@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Roger Pau Monne [Wed, 13 Mar 2013 17:42:17 +0000 (17:42 +0000)]
xl: add vif.default.bridge
This is a replacement for defaultbridge xl.conf option. The now
deprecated defaultbridge is still supported.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Roger Pau Monne [Wed, 13 Mar 2013 17:42:17 +0000 (17:42 +0000)]
xl: allow specifying a default gatewaydev in xl.conf
This adds a new global option in the xl configuration file called
"vif.default.gatewaydev", that is used to specify the default
gatewaydev to use when none is passed in the vif specification.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Tested-by: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de> Cc: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Roger Pau Monne [Wed, 13 Mar 2013 17:42:17 +0000 (17:42 +0000)]
xl/libxl: add gatewaydev/netdev to vif specification
This option is used by the vif-route hotplug script. A new more
descriptive name is used, "gatewaydev", but "netdev" is also supported
as a deprecated backwards compatible option.
This option was supported in the past, according to
http://wiki.xen.org/wiki/Vif-route, so we should also support it in
libxl.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Tested-by: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de> Cc: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
libxl_create.c: In function ‘libxl__domain_build_info_setdefault’:
libxl_create.c:109: error: ‘info’ undeclared (first use in this function)
libxl_create.c:109: error: (Each undeclared identifier is reported only once
libxl_create.c:109: error: for each function it appears in.)
cc1: warnings being treated as errors
libxl_create.c:108: error: suggest explicit braces to avoid ambiguous ‘else’
libxl_create.c: At top level:
libxl_create.c:141: error: expected identifier or ‘(’ before ‘if’
...
Fix is to insert the missing opening brace and s/info/b_info/ in one spot.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Thu, 14 Mar 2013 11:10:53 +0000 (12:10 +0100)]
x86: extend diagnostics for "No irq handler for vector" messages
By storing the inverted IRQ number in vector_irq[], we may be able to
spot which IRQ a vector was used for most recently, thus hopefully
permitting to understand why these messages trigger on certain systems.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Tim Deegan [Thu, 7 Mar 2013 13:22:32 +0000 (13:22 +0000)]
x86/ept: check for errors in a few callers of ept_set_entry.
AFAICT in all these cases we have the p2m lock and have just checked
that the p2m trie is populated so the call should succeed. Make it
explicit with ASSERT() rather than just ignoring the result.
Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
libxl: use qemu-xen (upstream QEMU) as device model by default
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 1 Mar 2013 17:17:04 +0000 (17:17 +0000)]
libxl: move check for existence of qemuu device model
The stat in libxl__domain_build_info_setdefault's default device model
logic works to fall back to qemu-xen-traditional whenever the
executable for qemu-xen is not found.
We are going to use qemu-xen-traditional in more cases, so break this
check out into its own if statement.
Also add a pair of braces to make the if() statement symmetrical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Roger Pau Monne [Tue, 5 Mar 2013 17:06:29 +0000 (17:06 +0000)]
libxl: don't launch more than one tapdisk process for each disk
When adding a disk don't launch multiple tapdisk instances for the
same disk, if transaction fails in device_disk_add reuse the same
tapdisk for further tries instead of creating a new instance each
time a transaction fails.
Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Tested-by: Darren Shepherd <darren.s.shepherd@gmail.com>
I initially added hypervisor-new and confirmed via /proc/device-model
that the content is the same before changing it to drop and replace
an existing node.
NB: There is an ambiguity in the compatibility property.
linux/arch/arm/boot/dts/xenvm-4.2.dts says "xen,xen-4.2" while
Documentation/devicetree/bindings/arm/xen.txt says "xen,xen-4.3". I have
used the actual hypervisor version as discussed in
http://marc.info/?l=xen-devel&m=135963416631423
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Ian Campbell [Mon, 18 Feb 2013 15:20:34 +0000 (15:20 +0000)]
xen: arm: parse modules from DT during early boot.
The bootloader should populate /chosen/modules/module@<N>/ for each
module it wishes to pass to the hypervisor. The content of these nodes
is described in docs/misc/arm/device-tree/booting.txt
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Jan Beulich [Tue, 12 Mar 2013 14:53:30 +0000 (15:53 +0100)]
x86/MCA: suppress bank clearing for certain injected events
As the bits indicating validity of the ADDR and MISC bank MSRs may be
injected in a way that isn't consistent with what the underlying
hardware implements (while the bank must be valid for injection to
work, the auxiliary MSRs may not be implemented - and hence cause #GP
upon access - if the hardware never sets the corresponding valid bits.
Consequently we need to do the clearing writes only if no value was
interposed for the respective MSR (which also makes sense the other way
around: there's no point in clearing a hardware register when all data
read came from software). Of course this all requires the injection
tool to do things in a consistent way (but that had been a requirement
before already).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ren Yongjie <yongjie.ren@intel.com> Acked-by: Liu Jinsong <jinsong.liu@intel.com>
Dietmar Hahn [Tue, 12 Mar 2013 14:37:45 +0000 (15:37 +0100)]
vpmu intel: pass through cpuid bits when BTS is enabled
This patch passes the orginal cpuid bits for X86_FEATURE_DTES64 (64-bit
DS Area) and X86_FEATURE_DSCPL (CPL Qualified Debug Store) to the guest
when the BTS feature is switched on. I forgot this when I did this BTS
emulation.
Try to fix the the issue that "some AMD systems may round the
frequencies in ACPI tables to 100MHz boundaries. We can obtain the real
frequencies from MSRs, so add a quirk to fix these frequencies up
on AMD systems." (from f594065..)
In discussion (around 9855d8..) "it turned out that indeed real
HW/BIOSes may choose to not set the valid bit and thus mark the
P-state as invalid. So this could be considered a fix for broken
BIOSes." (from 9855d8..)
which is great for Linux. Unfortunatly the Linux kernel, when
it tries to do the RDMSR under Xen it fails to get the right
value (it gets zero) as Xen traps it and returns zero. Hence
when dom0 uploads the P-states they will be unmodified and
we should take care of updating the frequencies with the right
values.
I've tested it under Dell Inc. PowerEdge T105 /0RR825, BIOS 1.3.2
08/20/2008 where this quirk can be observed (x86 == 0x10, model == 2).
Also on other AMD (x86 == 0x12, A8-3850; x86 = 0x14, AMD E-350) to
make sure the quirk is not applied there.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: stefan.bader@canonical.com
Do the MSR access here (and while at it, also the one reading
MSR_PSTATE_CUR_LIMIT) on the target CPU, and bound the loop over
amd_fixup_frequency() by max_hw_pstate (matching the one in
powernow_cpufreq_cpu_init()).
Dan Magenheimer [Mon, 11 Mar 2013 16:13:42 +0000 (16:13 +0000)]
mmu: Introduce XENMEM_claim_pages (subop of memory ops)
When guests memory consumption is volatile (multiple guests
ballooning up/down) we are presented with the problem of
being able to determine exactly how much memory there is
for allocation of new guests without negatively impacting
existing guests. Note that the existing models (xapi, xend)
drive the memory consumption from the tool-stack and assume
that the guest will eventually hit the memory target. Other
models, such as the dynamic memory utilized by tmem, do this
differently - the guest drivers the memory consumption (up
to the d->max_pages ceiling). With dynamic memory model, the
guest frequently can balloon up and down as it sees fit.
This presents the problem to the toolstack that it does not
know atomically how much free memory there is (as the information
gets stale the moment the d->tot_pages information is provided
to the tool-stack), and hence when starting a guest can fail
during the memory creation process. Especially if the process
is done in parallel. In a nutshell what we need is a atomic
value of all domains tot_pages during the allocation of guests.
Naturally holding a lock for such a long time is unacceptable.
Hence the goal of this hypercall is to attempt to atomically and very
quickly determine if there are sufficient pages available in the
system and, if so, "set aside" that quantity of pages for future
allocations by that domain. Unlike an existing hypercall such as
increase_reservation or populate_physmap, specific physical
pageframes are not assigned to the domain because this
cannot be done sufficiently quickly (especially for very large
allocations in an arbitrarily fragmented system) and so the
existing mechanisms result in classic time-of-check-time-of-use
(TOCTOU) races. One can think of claiming as similar to a
"lazy" allocation, but subsequent hypercalls are required
to do the actual physical pageframe allocation.
Note that one of effects of this hypercall is that from the
perspective of other running guests - suddenly there is
a new guest occupying X amount of pages. This means that when
we try to balloon up they will hit the system-wide ceiling of
available free memory (if the total sum of the existing d->max_pages
>= host memory). This is OK - as that is part of the overcommit.
What we DO NOT want to do is dictate their ceiling should be
(d->max_pages) as that is risky and can lead to guests OOM-ing.
It is something the guest needs to figure out.
In order for a toolstack to "get" information about whether
a domain has a claim and, if so, how large, and also for
the toolstack to measure the total system-wide claim, a
second subop has been added and exposed through domctl
and libxl (see "xen: XENMEM_claim_pages: xc").
== Alternative solutions ==
There has been a variety of discussion whether the problem
hypercall is solving can be done in user-space, such as:
- For all the existing guest, set their d->max_pages temporarily
to d->tot_pages and create the domain. This forces those
domains to stay at their current consumption level (fyi, this is what
the tmem freeze call is doing). The disadvantage of this is
that needlessly forces the guests to stay at the memory usage
instead of allowing it to decide the optimal target.
- Account only using d->max_pages of how much free memory there is.
This ignores ballooning changes and any over-commit scenario. This
is similar to the scenario where the sum of all d->max_pages (and
the one to be allocated now) on the host is smaller than the available
free memory. As such it ignores the over-commit problem.
- Provide a ring/FIFO along with event channel to notify an userspace
daemon of guests memory consumption. This daemon can then provide
up-to-date information to the toolstack of how much free memory
there is. This duplicates what the hypervisor is already doing and
introduced latency issues and catching breath for the toolstack as there
might be millions of these updates on heavily used machine. There might
not be any quiescent state ever and the toolstack will heavily consume
CPU cycles and not ever provide up-to-date information.
It has been noted that this claim mechanism solves the
underlying problem (slow failure of domain creation) for
a large class of domains but not all, specifically not
handling (but also not making the problem worse for) PV
domains that specify the "superpages" flag, and 32-bit PV
domains on large RAM systems. These will be addressed at a
later time.
Code overview:
Though the hypercall simply does arithmetic within locks,
some of the semantics in the code may be a bit subtle.
The key variables (d->unclaimed_pages and total_unclaimed_pages)
starts at zero if no claim has yet been staked for any domain.
(Perhaps a better name is "claimed_but_not_yet_possessed" but that's
a bit unwieldy.) If no claim hypercalls are executed, there
should be no impact on existing usage.
When a claim is successfully staked by a domain, it is like a
watermark but there is no record kept of the size of the claim.
Instead, d->unclaimed_pages is set to the difference between
d->tot_pages and the claim. When d->tot_pages increases or decreases,
d->unclaimed_pages atomically decreases or increases. Once
d->unclaimed_pages reaches zero, the claim is satisfied and
d->unclaimed pages stays at zero -- unless a new claim is
subsequently staked.
The systemwide variable total_unclaimed_pages is always the sum
of d->unclaimed_pages, across all domains. A non-domain-
specific heap allocation will fail if total_unclaimed_pages
exceeds free (plus, on tmem enabled systems, freeable) pages.
Claim semantics could be modified by flags. The initial
implementation had three flag, which discerns whether the
caller would like tmem freeable pages to be considered
in determining whether or not the claim can be successfully
staked. This in later patches was removed and there are no
flags.
A claim can be cancelled by requesting a claim with the
number of pages being zero.
A second subop returns the total outstanding claimed pages
systemwide.
Note: Save/restore/migrate may need to be modified,
else it can be documented that all claims are cancelled.
This patch of the proposed XENMEM_claim_pages hypercall/subop, takes
into account review feedback from Jan and Keir and IanC and Matthew Daley,
plus some fixes found via runtime debugging.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Keir Fraser <keir@xen.org>
George Dunlap [Mon, 11 Mar 2013 08:57:11 +0000 (09:57 +0100)]
credit2: Reset until the front of the runqueue is positive
Under normal circumstances, snext->credit should never be less than
-CSCHED_MIN_TIMER. However, under some circumstances, a vcpu with low
credits may be allowed to run long enough that its credits are
actually less than -CSCHED_CREDIT_INIT.
(Instances have been observed, for example, where a vcpu with 200us of
credit was allowed to run for 11ms, giving it -10.8ms of credit. Thus
it was still negative even after the reset.)
If this is the case for snext, we simply want to keep moving everyone
up until it is in the black again. This fair because none of the
other vcpus want to run at the moment.
Rather than loop, just detect how many times we want to add
CSCHED_CREDIT_INIT. Try to avoid integer divides and multiplies in
the common case.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
x86/vPMU: add missing Merom, Westmere, and Nehalem models
Mainly 22 (Merom-L); 30 (Nehelem); and 37, 44 (Westmere).
A comprehensive list is available at:
http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jun Nakajima <jun.nakajima@intel.com>
Jan Beulich [Fri, 8 Mar 2013 13:05:34 +0000 (14:05 +0100)]
x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses
This adds two new physdev operations for Dom0 to invoke when resource
allocation for devices is known to be complete, so that the hypervisor
can arrange for the respective MMIO ranges to be marked read-only
before an eventual guest getting such a device assigned even gets
started, such that it won't be able to set up writable mappings for
these MMIO ranges before Xen has a chance to protect them.
This also addresses another issue with the code being modified here,
in that so far write protection for the address ranges in question got
set up only once during the lifetime of a device (i.e. until either
system shutdown or device hot removal), while teardown happened when
the last interrupt was disposed of by the guest (which at least allowed
the tables to be writable when the device got assigned to a second
guest [instance] after the first terminated).
George Dunlap [Fri, 8 Mar 2013 08:43:40 +0000 (09:43 +0100)]
sched: always ask the scheduler to re-place the vcpu when the affinity changes
It's probably a good idea to re-evaluate placement whenever the
affinity changes.
This additionally has the benefit of removing scheduler-specific
exceptions introduced in git c/s e6a6fd63.
The conditionals surrounding vcpu_migrate() are left pending a re-work
of the logic to avoid the common case calling vcpu_migrate() twice (once
here, and once in context_saved().
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Matthew Daley [Wed, 6 Mar 2013 16:10:26 +0000 (17:10 +0100)]
fix domain unlocking in some xsm error paths
A couple of xsm error/access-denied code paths in hypercalls neglect to
unlock a previously locked domain. Fix by ensuring the domains are
unlocked correctly.
Signed-off-by: Matthew Daley <mattjd@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
change arguments of do_kexec_op and compat_set_timer_op prototypes
... to match the actual functions.
Signed-off-by: Robbie VanVossen <robert.vanvossen@dornerworks.com>
Also make sure the source files defining these symbols include the
header declaring them (had we done so, the problem would have been
noticed long ago).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Tue, 5 Mar 2013 07:51:10 +0000 (08:51 +0100)]
x86/shadow: don't use PV LDT area for cross-pages access emulation
As of 703ac3a ("x86: introduce create_perdomain_mapping()"), the page
tables for this range don't get set up anymore for non-PV guests. And
the way this was done was marked as a hack rather than a proper
mechanism anyway. Use vmap() instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Olaf Hering [Mon, 4 Mar 2013 12:42:17 +0000 (13:42 +0100)]
xentrace: fix off-by-one in calculate_tbuf_size
Commit "xentrace: reduce trace buffer size to something mfn_offset can
reach" contains an off-by-one bug. max_mfn_offset needs to be reduced by
exactly the value of t_info_first_offset.
If the system has two cpus and the number of requested trace pages is
very large, the final number of trace pages + the offset will not fit
into a short. As a result the variable offset in alloc_trace_bufs() will
wrap while allocating buffers for the second cpu. Later
share_xen_page_with_privileged_guests() will be called with a wrong page
and the ASSERT in this function triggers. If the ASSERT is ignored by
running a non-dbg hypervisor the asserts in xentrace itself trigger
because "cons" is not aligned because the very last trace page for the
second cpu is a random mfn.
Thanks to Jan for the quick analysis.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:39:19 +0000 (13:39 +0100)]
credit2: track residual from divisions done during accounting
This should help with under-accounting of vCPU-s running for extremly
short periods of time, but becoming runnable again at a high frequency.
Don't bother subtracting the residual from the runtime, as it can only ever
add up to one nanosecond, and will end up being debited during the next
reset interval anyway.
Original-patch-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:38:45 +0000 (13:38 +0100)]
credit2: Avoid extra c2t calcuation in csched_runtime
csched_runtime() needs to call the ct2() function to change credits
into time. The c2t() function, however, is expensive, as it requires
an integer division.
c2t() was being called twice, once for the main vcpu's credit and once
for the difference between its credit and the next in the queue. But
this is unnecessary; by calculating in "credit" first, we can make it
so that we just do one conversion later in the algorithm.
This also adds more documentation describing the intended algorithm,
along with a relevant assertion..
The effect of the new code should be the same as the old code.
Spotted-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:37:39 +0000 (13:37 +0100)]
credit1: Use atomic bit operations for the flags structure
The flags structure is not protected by locks (or more precisely,
it is protected using an inconsistent set of locks); we therefore need
to make sure that all accesses are atomic-safe. This is particulary
important in the case of the PARKED flag, which if clobbered while
changing the YIELD bit will leave a vcpu wedged in an offline state.
Using the atomic bitops also requires us to change the size of the "flags"
element.
Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Jan Beulich [Mon, 4 Mar 2013 09:25:24 +0000 (10:25 +0100)]
x86: make x86_mcinfo_reserve() clear its result buffer
... instead of all but one of its callers.
Also adjust the corresponding sizeof() expressions to specify the
pointed-to type of the result variable rather than the literal type
(so that a type change of the variable will imply the size to get
adjusted too).
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 4 Mar 2013 09:20:57 +0000 (10:20 +0100)]
x86: don't rely on __softirq_pending to be the first field in irq_cpustat_t
This is even more so as the field doesn't have a comment to that effect
in the structure definition.
Once modifying the respective assembly code, also convert the
IRQSTAT_shift users to do a 32-bit shift only (as we won't support 48M
CPUs any time soon) and use "cmpl" instead of "testl" when checking the
field (both reducing code size).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:19:34 +0000 (10:19 +0100)]
x86: defer processing events on the NMI exit path
Otherwise, we may end up in the scheduler, keeping NMIs masked for a
possibly unbounded period of time (until whenever the next IRET gets
executed). Enforce timely event processing by sending a self IPI.
Of course it's open for discussion whether to always use the straight
exit path from handle_ist_exception.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:17:52 +0000 (10:17 +0100)]
SEDF: avoid gathering vCPU-s on pCPU0
The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
incompatible with the SEDF scheduler: Any vCPU using
VCPUOP_stop_periodic_timer (e.g. any vCPU of half way modern PV Linux
guests) ends up on pCPU0 after that call. Obviously, running all PV
guests' (and namely Dom0's) vCPU-s on pCPU0 causes problems for those
guests rather sooner than later.
So the main thing that was clearly wrong (and bogus from the beginning)
was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced
by a construct that prefers to put back the vCPU on the pCPU that it
got launched on.
However, there's one more glitch: When reducing the affinity of a vCPU
temporarily, and then widening it again to a set that includes the pCPU
that the vCPU was last running on, the generic scheduler code would not
force a migration of that vCPU, and hence it would forever stay on the
pCPU it last ran on. Since that can again create a load imbalance, the
SEDF scheduler wants a migration to happen regardless of it being
apparently unnecessary.
Of course, an alternative to checking for SEDF explicitly in
vcpu_set_affinity() would be to introduce a flags field in struct
scheduler, and have SEDF set a "always-migrate-on-affinity-change"
flag.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:16:04 +0000 (10:16 +0100)]
x86: make certain memory sub-ops return valid values
When a domain's shared info field "max_pfn" is zero,
domain_get_maximum_gpfn() so far returned ULONG_MAX, which
do_memory_op() in turn converted to -1 (i.e. -EPERM). Make the former
always return a sensible number (i.e. zero if the field was zero) and
have the latter no longer truncate return values.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Juergen Gross [Thu, 28 Feb 2013 14:56:45 +0000 (14:56 +0000)]
Avoid stale pointer when moving domain to another cpupool
When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.
While at it, simplify sched_init_vcpu() by using DOM2OP instead VCPU2OP.
Tim Deegan [Thu, 28 Feb 2013 12:42:15 +0000 (12:42 +0000)]
vmx: fix handling of NMI VMEXIT.
Call do_nmi() directly and explicitly re-enable NMIs rather than
raising an NMI through the APIC. Since NMIs are disabled after the
VMEXIT, the raised NMI would be blocked until the next IRET
instruction (i.e. the next real interrupt, or after scheduling a PV
guest) and in the meantime the guest will spin taking NMI VMEXITS.
Also, handle NMIs before re-enabling interrupts, since if we handle an
interrupt (and therefore IRET) before calling do_nmi(), we may end up
running the NMI handler with NMIs enabled.
Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Matthew Daley [Thu, 28 Feb 2013 05:16:04 +0000 (18:16 +1300)]
x86/mm: fix invalid unlinking of nested p2m tables
Commit 90805dc (c/s 26387:4056e5a3d815) ("EPT: Make ept data stucture or
operations neutral") makes nested p2m tables be unlinked from the host
p2m table before their destruction (in p2m_teardown_nestedp2m).
However, by this time the host p2m table has already been torn down,
leading to a possible race condition where another allocation between
the two kinds of table being torn down can lead to a linked list
assertion with debug=y builds or memory corruption on debug=n ones.
Fix by swapping the order the two kinds of table are torn down in. While
at it, remove the condition in p2m_final_teardown, as it is already
checked identically in p2m_teardown_hostp2m itself.
Signed-off-by: Matthew Daley <mattjd@gmail.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Thu, 28 Feb 2013 10:08:13 +0000 (11:08 +0100)]
x86: introduce create_perdomain_mapping()
... as well as free_perdomain_mappings(), and use them to carry out the
existing per-domain mapping setup/teardown. This at once makes the
setup of the first sub-range PV domain specific (with idle domains also
excluded), as the GDT/LDT mapping area is needed only for those.
Also fix an improperly scaled BUILD_BUG_ON() expression in
mapcache_domain_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
libxl: Made it possible to use 'timer='delay_for_missed_ticks'
The assertion only allows values of 1 (no_delay_for_missed_ticks)
up to 3 (one_missed_tick_pending). It should be possible to
use the value of 0 (delay_for_missed_ticks) for the timer configuration
option.
Acked-by: Ian Campbell <ian.cambell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Campbell [Tue, 26 Feb 2013 10:12:46 +0000 (10:12 +0000)]
tools: foreign: ensure 64 bit values are properly aligned for arm
When building the foreign headers on x86_32 we use '#pragma pack(4)' and
therefore need to explicitly align types which should be aligned to 8-byte
boundaries.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Tue, 26 Feb 2013 09:15:56 +0000 (10:15 +0100)]
x86: fix CMCI injection
This fixes the wrong use of literal vector 0xF7 with an "int"
instruction (invalidated by 25113:14609be41f36) and the fact that doing
the injection via a software interrupt was never valid anyway (because
cmci_interrupt() acks the LAPIC, which does the wrong thing if the
interrupt didn't get delivered though it).
In order to do latter, the patch introduces send_IPI_self(), at once
removing two opend coded uses of "genapic" in the IRQ handling code.
The IOMMU may stop processing page translations due to a perceived lack
of credits for writing upstream peripheral page service request (PPR)
or event logs. If the L2B miscellaneous clock gating feature is enabled
the IOMMU does not properly register credits after the log request has
completed, leading to a potential system hang.
BIOSes are supposed to disable L2B micellaneous clock gating by setting
L2_L2B_CK_GATE_CONTROL[CKGateL2BMiscDisable](D0F2xF4_x90[2]) = 1b. This
patch corrects that for those which do not enable this workaround.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Feb 2013 09:12:57 +0000 (10:12 +0100)]
AMD IOMMU: cover all functions of a device even if ACPI only tells us of func 0
This ought to work as all functions of a device have the same place in
the bus topology, i.e. use the same IOMMU.
Also fix the type of ivrs_bdf_entries (when it's 'unsigned short' and
the last device found on a segment is ff:1f.x, it would otherwise end
up being zero).
And drop the bogus 'last_bdf' static variable, which conflicted anyway
with various functions' parameters.
Daniel De Graaf [Wed, 13 Feb 2013 16:07:05 +0000 (16:07 +0000)]
flask/policy: rework policy build system
This adds the ability to define security classes and access vectors in
FLASK policy not defined by the hypervisor, for the use of stub domains
or applications without their own security policies.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Wed, 13 Feb 2013 16:06:57 +0000 (16:06 +0000)]
flask/policy: sort dom0 accesses
For the example policy shipped with Xen, it makes sense to allow dom0
access to all system calls so that policy does not need to be updated
for each new hypervisor or toolstack feature used.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Andrei Lifchits [Wed, 20 Feb 2013 16:54:03 +0000 (16:54 +0000)]
build: Fix distclean when repo location changes
If the path to xen-unstable.hg changes (i.e. you move the repo), the symlinks
inside xen-unstable.hg/stubdom/libxc-x86_[32|64]/ all become broken, which
breaks distclean because make attempts to clean inside those first and fails to
find Makefile (which is also a symlink).
Signed-off-by: Andrei Lifchits <andrei.lifchits@citrix.com>
Olaf Hering [Thu, 14 Feb 2013 17:18:56 +0000 (17:18 +0000)]
xend: Only add cpuid and cpuid_check to sexpr once
tools/xend: Only add cpuid and cpuid_check to sexpr once
When converting a XendConfig object to sexpr, cpuid and cpuid_check
were being emitted twice in the resulting sexpr. The first conversion
writes incorrect sexpr, causing parsing of the sexpr to fail when xend
is restarted and domain sexpr files in /var/lib/xend/domains/<dom-uuid>
are read and parsed.
This patch skips the first conversion, and uses only the custom
cpuid{_check} conversion methods called later. It is not pretty, but
is the least invasive fix in this complex code.
Ian Campbell [Fri, 22 Feb 2013 08:58:25 +0000 (08:58 +0000)]
xen: arm: implement cpuinfo
Use to:
- Only context switch ThumbEE state if the processor implements it. In
particular the ARMv8 FastModels do not.
- Detect the generic timer, and therefore call identify_cpu before
init_xen_time.
Also improve the boot time messages a bit.
I haven't added decoding for all of the CPUID words, it seems like overkill
for the moment.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Cc: stefano.stabellini@citrix.com
Ian Campbell [Fri, 22 Feb 2013 08:58:22 +0000 (08:58 +0000)]
xen: arm: Explicitly setup VPIDR & VMPIDR at start of day
These are supposed to reset to the value of the underlying hardware
but appears not to be on at least some v8 models. There's no harm in
setting them explicitly.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:58:09 +0000 (08:58 +0000)]
xen: arm: guest context switching.
One side effect of this is that we now save the full 64-bit
TTBR[0,1] even on a 32-bit hypervisor. This is needed anyway to
support LPAE guests (although this patch doesn't implement anything
other than the context switch).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org>