x86/vPMU: add missing Merom, Westmere, and Nehalem models
Mainly 22 (Merom-L); 30 (Nehalem); and 37, 44 (Westmere).
A comprehensive list is available at:
http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>
Jan Beulich [Fri, 8 Mar 2013 13:05:34 +0000 (14:05 +0100)]
x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses
This adds two new physdev operations for Dom0 to invoke when resource
allocation for devices is known to be complete, so that the hypervisor
can arrange for the respective MMIO ranges to be marked read-only
before an eventual guest getting such a device assigned even gets
started, such that it won't be able to set up writable mappings for
these MMIO ranges before Xen has a chance to protect them.
This also addresses another issue with the code being modified here,
in that so far write protection for the address ranges in question got
set up only once during the lifetime of a device (i.e. until either
system shutdown or device hot removal), while teardown happened when
the last interrupt was disposed of by the guest (which at least allowed
the tables to be writable when the device got assigned to a second
guest [instance] after the first terminated).
George Dunlap [Fri, 8 Mar 2013 08:43:40 +0000 (09:43 +0100)]
sched: always ask the scheduler to re-place the vcpu when the affinity changes
It's probably a good idea to re-evaluate placement whenever the
affinity changes.
This additionally has the benefit of removing scheduler-specific
exceptions introduced in git c/s e6a6fd63.
The conditionals surrounding vcpu_migrate() are left in place pending a
re-work of the logic to avoid the common case calling vcpu_migrate() twice
(once here, and once in context_saved()).
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Matthew Daley [Wed, 6 Mar 2013 16:10:26 +0000 (17:10 +0100)]
fix domain unlocking in some xsm error paths
A couple of xsm error/access-denied code paths in hypercalls neglect to
unlock a previously locked domain. Fix by ensuring the domains are
unlocked correctly.
Signed-off-by: Matthew Daley <mattjd@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
change arguments of do_kexec_op and compat_set_timer_op prototypes
... to match the actual functions.
Signed-off-by: Robbie VanVossen <robert.vanvossen@dornerworks.com>
Also make sure the source files defining these symbols include the
header declaring them (had we done so, the problem would have been
noticed long ago).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Tue, 5 Mar 2013 07:51:10 +0000 (08:51 +0100)]
x86/shadow: don't use PV LDT area for cross-pages access emulation
As of 703ac3a ("x86: introduce create_perdomain_mapping()"), the page
tables for this range don't get set up anymore for non-PV guests. And
the way this was done was marked as a hack rather than a proper
mechanism anyway. Use vmap() instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Olaf Hering [Mon, 4 Mar 2013 12:42:17 +0000 (13:42 +0100)]
xentrace: fix off-by-one in calculate_tbuf_size
Commit "xentrace: reduce trace buffer size to something mfn_offset can
reach" contains an off-by-one bug. max_mfn_offset needs to be reduced by
exactly the value of t_info_first_offset.
If the system has two cpus and the number of requested trace pages is
very large, the final number of trace pages + the offset will not fit
into a short. As a result the variable offset in alloc_trace_bufs() will
wrap while allocating buffers for the second cpu. Later
share_xen_page_with_privileged_guests() will be called with a wrong page
and the ASSERT in this function triggers. If the ASSERT is ignored by
running a non-debug hypervisor, the asserts in xentrace itself trigger
because "cons" is not aligned: the very last trace page for the
second cpu is a random mfn.
Thanks to Jan for the quick analysis.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:39:19 +0000 (13:39 +0100)]
credit2: track residual from divisions done during accounting
This should help with under-accounting of vCPU-s running for extremely
short periods of time, but becoming runnable again at a high frequency.
Don't bother subtracting the residual from the runtime, as it can only ever
add up to one nanosecond, and will end up being debited during the next
reset interval anyway.
Original-patch-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:38:45 +0000 (13:38 +0100)]
credit2: Avoid extra c2t calculation in csched_runtime
csched_runtime() needs to call the c2t() function to change credits
into time. The c2t() function, however, is expensive, as it requires
an integer division.
c2t() was being called twice, once for the main vcpu's credit and once
for the difference between its credit and the next in the queue. But
this is unnecessary; by calculating in "credit" first, we can make it
so that we just do one conversion later in the algorithm.
This also adds more documentation describing the intended algorithm,
along with a relevant assertion.
The effect of the new code should be the same as the old code.
Spotted-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
George Dunlap [Mon, 4 Mar 2013 12:37:39 +0000 (13:37 +0100)]
credit1: Use atomic bit operations for the flags structure
The flags structure is not protected by locks (or more precisely,
it is protected using an inconsistent set of locks); we therefore need
to make sure that all accesses are atomic-safe. This is particularly
important in the case of the PARKED flag, which if clobbered while
changing the YIELD bit will leave a vcpu wedged in an offline state.
Using the atomic bitops also requires us to change the size of the "flags"
element.
Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Jan Beulich [Mon, 4 Mar 2013 09:25:24 +0000 (10:25 +0100)]
x86: make x86_mcinfo_reserve() clear its result buffer
... instead of all but one of its callers.
Also adjust the corresponding sizeof() expressions to specify the
pointed-to type of the result variable rather than the literal type
(so that a type change of the variable will imply the size to get
adjusted too).
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 4 Mar 2013 09:20:57 +0000 (10:20 +0100)]
x86: don't rely on __softirq_pending to be the first field in irq_cpustat_t
This is even more so as the field doesn't have a comment to that effect
in the structure definition.
Once modifying the respective assembly code, also convert the
IRQSTAT_shift users to do a 32-bit shift only (as we won't support 48M
CPUs any time soon) and use "cmpl" instead of "testl" when checking the
field (both reducing code size).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:19:34 +0000 (10:19 +0100)]
x86: defer processing events on the NMI exit path
Otherwise, we may end up in the scheduler, keeping NMIs masked for a
possibly unbounded period of time (until whenever the next IRET gets
executed). Enforce timely event processing by sending a self IPI.
Of course it's open for discussion whether to always use the straight
exit path from handle_ist_exception.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:17:52 +0000 (10:17 +0100)]
SEDF: avoid gathering vCPU-s on pCPU0
The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
incompatible with the SEDF scheduler: Any vCPU using
VCPUOP_stop_periodic_timer (e.g. any vCPU of halfway modern PV Linux
guests) ends up on pCPU0 after that call. Obviously, running all PV
guests' (and namely Dom0's) vCPU-s on pCPU0 causes problems for those
guests rather sooner than later.
So the main thing that was clearly wrong (and bogus from the beginning)
was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced
by a construct that prefers to put back the vCPU on the pCPU that it
got launched on.
However, there's one more glitch: When reducing the affinity of a vCPU
temporarily, and then widening it again to a set that includes the pCPU
that the vCPU was last running on, the generic scheduler code would not
force a migration of that vCPU, and hence it would forever stay on the
pCPU it last ran on. Since that can again create a load imbalance, the
SEDF scheduler wants a migration to happen regardless of it being
apparently unnecessary.
Of course, an alternative to checking for SEDF explicitly in
vcpu_set_affinity() would be to introduce a flags field in struct
scheduler, and have SEDF set an "always-migrate-on-affinity-change"
flag.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 4 Mar 2013 09:16:04 +0000 (10:16 +0100)]
x86: make certain memory sub-ops return valid values
When a domain's shared info field "max_pfn" is zero,
domain_get_maximum_gpfn() so far returned ULONG_MAX, which
do_memory_op() in turn converted to -1 (i.e. -EPERM). Make the former
always return a sensible number (i.e. zero if the field was zero) and
have the latter no longer truncate return values.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Juergen Gross [Thu, 28 Feb 2013 14:56:45 +0000 (14:56 +0000)]
Avoid stale pointer when moving domain to another cpupool
When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.
While at it, simplify sched_init_vcpu() by using DOM2OP instead of VCPU2OP.
Tim Deegan [Thu, 28 Feb 2013 12:42:15 +0000 (12:42 +0000)]
vmx: fix handling of NMI VMEXIT.
Call do_nmi() directly and explicitly re-enable NMIs rather than
raising an NMI through the APIC. Since NMIs are disabled after the
VMEXIT, the raised NMI would be blocked until the next IRET
instruction (i.e. the next real interrupt, or after scheduling a PV
guest) and in the meantime the guest will spin taking NMI VMEXITS.
Also, handle NMIs before re-enabling interrupts, since if we handle an
interrupt (and therefore IRET) before calling do_nmi(), we may end up
running the NMI handler with NMIs enabled.
Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Matthew Daley [Thu, 28 Feb 2013 05:16:04 +0000 (18:16 +1300)]
x86/mm: fix invalid unlinking of nested p2m tables
Commit 90805dc (c/s 26387:4056e5a3d815) ("EPT: Make ept data structure or
operations neutral") makes nested p2m tables be unlinked from the host
p2m table before their destruction (in p2m_teardown_nestedp2m).
However, by this time the host p2m table has already been torn down,
leading to a possible race condition where another allocation between
the two kinds of table being torn down can lead to a linked list
assertion with debug=y builds or memory corruption on debug=n ones.
Fix by swapping the order the two kinds of table are torn down in. While
at it, remove the condition in p2m_final_teardown, as it is already
checked identically in p2m_teardown_hostp2m itself.
Signed-off-by: Matthew Daley <mattjd@gmail.com>
Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Thu, 28 Feb 2013 10:08:13 +0000 (11:08 +0100)]
x86: introduce create_perdomain_mapping()
... as well as free_perdomain_mappings(), and use them to carry out the
existing per-domain mapping setup/teardown. This at once makes the
setup of the first sub-range PV domain specific (with idle domains also
excluded), as the GDT/LDT mapping area is needed only for those.
Also fix an improperly scaled BUILD_BUG_ON() expression in
mapcache_domain_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
libxl: Made it possible to use timer='delay_for_missed_ticks'
The assertion only allows values of 1 (no_delay_for_missed_ticks)
up to 3 (one_missed_tick_pending). It should be possible to
use the value of 0 (delay_for_missed_ticks) for the timer configuration
option.
Acked-by: Ian Campbell <ian.cambell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Campbell [Tue, 26 Feb 2013 10:12:46 +0000 (10:12 +0000)]
tools: foreign: ensure 64 bit values are properly aligned for arm
When building the foreign headers on x86_32 we use '#pragma pack(4)' and
therefore need to explicitly align types which should be aligned to 8-byte
boundaries.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Tue, 26 Feb 2013 09:15:56 +0000 (10:15 +0100)]
x86: fix CMCI injection
This fixes the wrong use of literal vector 0xF7 with an "int"
instruction (invalidated by 25113:14609be41f36) and the fact that doing
the injection via a software interrupt was never valid anyway (because
cmci_interrupt() acks the LAPIC, which does the wrong thing if the
interrupt didn't get delivered through it).
In order to do the latter, the patch introduces send_IPI_self(), at once
removing two open-coded uses of "genapic" in the IRQ handling code.
The IOMMU may stop processing page translations due to a perceived lack
of credits for writing upstream peripheral page service request (PPR)
or event logs. If the L2B miscellaneous clock gating feature is enabled
the IOMMU does not properly register credits after the log request has
completed, leading to a potential system hang.
BIOSes are supposed to disable L2B miscellaneous clock gating by setting
L2_L2B_CK_GATE_CONTROL[CKGateL2BMiscDisable](D0F2xF4_x90[2]) = 1b. This
patch corrects that for those which do not enable this workaround.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Feb 2013 09:12:57 +0000 (10:12 +0100)]
AMD IOMMU: cover all functions of a device even if ACPI only tells us of func 0
This ought to work as all functions of a device have the same place in
the bus topology, i.e. use the same IOMMU.
Also fix the type of ivrs_bdf_entries (when it's 'unsigned short' and
the last device found on a segment is ff:1f.x, it would otherwise end
up being zero).
And drop the bogus 'last_bdf' static variable, which conflicted anyway
with various functions' parameters.
Daniel De Graaf [Wed, 13 Feb 2013 16:07:05 +0000 (16:07 +0000)]
flask/policy: rework policy build system
This adds the ability to define security classes and access vectors in
FLASK policy not defined by the hypervisor, for the use of stub domains
or applications without their own security policies.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Wed, 13 Feb 2013 16:06:57 +0000 (16:06 +0000)]
flask/policy: sort dom0 accesses
For the example policy shipped with Xen, it makes sense to allow dom0
access to all system calls so that policy does not need to be updated
for each new hypervisor or toolstack feature used.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Andrei Lifchits [Wed, 20 Feb 2013 16:54:03 +0000 (16:54 +0000)]
build: Fix distclean when repo location changes
If the path to xen-unstable.hg changes (i.e. you move the repo), the symlinks
inside xen-unstable.hg/stubdom/libxc-x86_[32|64]/ all become broken, which
breaks distclean because make attempts to clean inside those first and fails to
find Makefile (which is also a symlink).
Signed-off-by: Andrei Lifchits <andrei.lifchits@citrix.com>
Olaf Hering [Thu, 14 Feb 2013 17:18:56 +0000 (17:18 +0000)]
xend: Only add cpuid and cpuid_check to sexpr once
When converting a XendConfig object to sexpr, cpuid and cpuid_check
were being emitted twice in the resulting sexpr. The first conversion
writes incorrect sexpr, causing parsing of the sexpr to fail when xend
is restarted and domain sexpr files in /var/lib/xend/domains/<dom-uuid>
are read and parsed.
This patch skips the first conversion, and uses only the custom
cpuid{_check} conversion methods called later. It is not pretty, but
is the least invasive fix in this complex code.
Ian Campbell [Fri, 22 Feb 2013 08:58:25 +0000 (08:58 +0000)]
xen: arm: implement cpuinfo
Use it to:
- Only context switch ThumbEE state if the processor implements it. In
particular the ARMv8 FastModels do not.
- Detect the generic timer, and therefore call identify_cpu before
init_xen_time.
Also improve the boot time messages a bit.
I haven't added decoding for all of the CPUID words; it seems like overkill
for the moment.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Cc: stefano.stabellini@citrix.com
Ian Campbell [Fri, 22 Feb 2013 08:58:22 +0000 (08:58 +0000)]
xen: arm: Explicitly setup VPIDR & VMPIDR at start of day
These are supposed to reset to the value of the underlying hardware
but appear not to on at least some v8 models. There's no harm in
setting them explicitly.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:58:09 +0000 (08:58 +0000)]
xen: arm: guest context switching.
One side effect of this is that we now save the full 64-bit
TTBR[0,1] even on a 32-bit hypervisor. This is needed anyway to
support LPAE guests (although this patch doesn't implement anything
other than the context switch).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:58:03 +0000 (08:58 +0000)]
xen: arm: separate guest user regs from internal guest state.
struct cpu_user_regs is currently used as both internal state
(specifically at the base of the stack) and a guest/toolstack
visible API (via struct vcpu_guest_context used by
XEN_DOMCTL_{g,s}etvcpucontext and VCPUOP_initialise).
This causes problems when we want to make the API 64-bit clean since
we don't really want to change the size of the on-stack struct.
So split into vcpu_guest_core_regs which is the API facing struct
and keep cpu_user_regs purely internal, translate between the two.
In the user API arrange for both 64- and 32-bit registers to be
included in a layout which does not differ depending on toolstack
architecture. Also switch to using the more formal banked register
names (e.g. with the _usr suffix) for clarity.
This is an ABI change. Note that the kernel doesn't currently use
this data structure so it affects the tools interface only.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:58:00 +0000 (08:58 +0000)]
xen: arm: extend HSR struct definitions to 64-bit
The main change is that the 4-bit register specifiers are extended
to 5 bits by taking in an adjacent SBZP bit.
Also the 64-bit encoding has two other properties indicating whether the
target register was 64-bit (x<n>) or 32-bit (w<n>) and whether the
instruction has acquire/release semantics.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:57:48 +0000 (08:57 +0000)]
xen: arm: refactor co-pro and sysreg reg handling.
AArch64 has removed the concept of co-processors replacing them with a
combination of specific instructions (cache and tlb flushes etc) and
system registers (which are understood by name in the assembler).
However most system registers are equivalent to a particular AArch32
co-pro register and can be used by generic code in the same way. Note
that the names of the registers differ (often only slightly).
For consistency it would be better to use only one set of names in the
common code. Therefore move the {READ,WRITE}_CP{32,64} accessors into
arm32/processor.h and provide {READ,WRITE}_SYSREG. Where the names
differ #defines will be provided on 32-bit.
HSR_CPREG and friends are required even on 64-bit in order to decode
traps from 32 bit guests.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Fri, 22 Feb 2013 08:57:45 +0000 (08:57 +0000)]
xen: arm64: basic config and types headers
The 64-bit bitops are taken from the Linux asm-generic implementations. They
should be replaced with optimised versions from the Linux arm64 port when they
become available.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Fri, 22 Feb 2013 10:56:54 +0000 (11:56 +0100)]
honor ACPI v4 FADT flags
- force use of physical APIC mode if indicated so (as we don't support
xAPIC cluster mode, the respective flag is taken to force physical
mode too)
- don't use MSI if indicated so (implies no IOMMU)
Both can be overridden on the command line; for the MSI case this at
once adds a new command line option allowing PCI MSI to be turned off (IOMMU
and HPET are unaffected by this).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Fri, 22 Feb 2013 10:48:57 +0000 (11:48 +0100)]
ACPI: support v5 (reduced HW) sleep interface
Note that this also fixes a broken input check in acpi_enter_sleep()
(previously validating the sleep->pm1[ab]_cnt_val relationship based
on acpi_sinfo.pm1b_cnt_val, which however gets set only subsequently).
Also adjust a few minor issues with the pre-v5 handling in
acpi_fadt_parse_sleep_info().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>