Starting with Family 0x10, model 10 processors, some AMD processors
will have support for the APERF/MPERF MSRs. This patch adds the
checks necessary to support those MSRs.
It also makes the get_measured_perf function defined inside cpufreq.c
driver independent. max_freq is taken from the policy definition
instead of being a private argument in struct acpi_cpufreq_data.
The struct member is entirely removed from the function since it
is no longer used.
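
As a rough illustration of the calculation behind get_measured_perf
(a sketch only, not the Xen code; the function and parameter names
below are hypothetical, and reading the MSRs is not shown): the ratio
of the APERF and MPERF increments over a sampling interval scales the
policy's maximum frequency to an average effective frequency.

```python
def measured_perf_khz(max_freq_khz, aperf_delta, mperf_delta):
    """Estimate the average effective frequency over one sampling interval.

    aperf_delta and mperf_delta are the increments of the two counters
    since the previous sample (hypothetical inputs for this sketch).
    """
    if mperf_delta == 0:
        # No reference cycles elapsed, so nothing can be inferred.
        return 0
    # Integer arithmetic, as a driver would typically avoid floating point.
    return max_freq_khz * aperf_delta // mperf_delta
```

Taking max_freq from the policy means the same helper can serve any
driver that fills in the policy, which is the point of the change.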
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Add support for disabling AMD's Boost feature. Boost is similar to
Intel's Turbo and uses the same high level interface. The low
level implementation is different and encapsulated in the powernow
driver for cpufreq.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Refactor the existing code that supports the Intel Turbo feature to
move all the driver specific bits into the cpufreq driver. Create
a tri-state interface for the Turbo feature that can distinguish
amongst enabled Turbo, disabled Turbo, and processors that don't
support Turbo at all.
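
A minimal sketch of what such a tri-state interface could look like.
The names and numeric values here are assumptions for illustration,
not the identifiers used in the patch.

```python
from enum import Enum


class TurboState(Enum):
    # Hypothetical values; the patch may encode the states differently.
    UNSUPPORTED = -1  # processor has no Turbo/Boost capability
    DISABLED = 0      # capability present but switched off
    ENABLED = 1       # capability present and active


def describe(state):
    """Return a human-readable description of a TurboState."""
    if state is TurboState.UNSUPPORTED:
        return "not supported"
    return "enabled" if state is TurboState.ENABLED else "disabled"
```

A plain boolean cannot tell "disabled" apart from "not supported",
which is why a third state is needed.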
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
tmem: fix ia64 build
/xen/common/built_in.o: In function `tmh_get_first_byte':
/xen/include/xen/tmem_xen.h:350: undefined reference to
`__map_domain_page'
xen: allow guests to set caching attributes for MMIOs
This patch allows guests that have directly mapped MMIO regions to set
the caching attributes for them, and only for them.
Currently we have just an on/off check for a directly assigned device
instead of looking for directly mapped MMIO regions.
'xm info' command now also gives the cpu topology & host numa
information. This will be later used to build guest numa support. The
patch basically changes physinfo sysctl, and adds topology_info &
numa_info sysctls, and also changes the python & libxc code
accordingly.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
This fixes xenbus initialization of blkfront, netfront and pcifront
by aligning it with fbfront: after writing parameters, set the state
to initialised, wait for the backend to switch to the connected
state, and only then read its parameters and switch to the connected
state.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
When no fb is available, init_fbfront will return, so the local
semaphore for synchronization with the kbd thread would get dropped.
Using a global static semaphore instead fixes this.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
tmem: add page deduplication with optional compression or trailing-zero-elimination
Add "page deduplication" capability (with optional compression
and trailing-zero elimination) to Xen's tmem.
(Transparent to tmem-enabled guests.) Ephemeral pages
that have the exact same content are "combined" so that only
one page frame is needed. Since ephemeral pages are essentially
read-only, no C-O-W (and thus no equivalent of swapping) is
necessary. Deduplication can be combined with compression
or "trailing zero elimination" for even more space savings.
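
The idea can be sketched as follows. This is an illustration only,
not tmem's implementation: the class and method names are
hypothetical, and keying pages by a SHA-256 digest is an assumption
made for brevity (a real implementation would also compare page
contents rather than trust the hash alone).

```python
import hashlib

PAGE_SIZE = 4096


class DedupStore:
    """Toy store that keeps one copy of each distinct page."""

    def __init__(self):
        self._pages = {}  # digest -> [stored bytes, reference count]

    def put(self, page):
        if len(page) != PAGE_SIZE:
            raise ValueError("expected a full page")
        digest = hashlib.sha256(page).digest()
        entry = self._pages.get(digest)
        if entry is None:
            # Trailing-zero elimination: store only the non-zero prefix.
            self._pages[digest] = [page.rstrip(b"\x00"), 1]
        else:
            # Identical content already stored: just take a reference.
            entry[1] += 1
        return digest

    def get(self, digest):
        stored, _refs = self._pages[digest]
        # Re-pad with the zeroes that were dropped on put().
        return stored.ljust(PAGE_SIZE, b"\x00")

    def frames_used(self):
        return len(self._pages)
```

Because the stored pages are never modified in place, sharing one
copy between several owners needs only a reference count.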
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
This patch adds a new field in hvm to indicate whether 1GB pages are
supported by the CPU. In addition, users can turn the 1GB feature
on/off using a Xen option ("hap_1gb", default is off). Per Tim's
suggestion, I also added an assertion check in shadow/common.c to
prevent this from affecting the shadow code.
This patch changes Xen tools to allocate 1GB first. If such requests
fail, it will fall back to 2MB and then 4KB. We skip 1GB allocation
for the MMIO space between 3GB and 4GB.
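
The fallback order described above can be sketched as follows. This
is not the tools code: populate() and the try_alloc callback are
hypothetical names, and addresses are assumed to be 4KB aligned.

```python
GB = 1 << 30
MB = 1 << 20
KB = 1 << 10

PAGE_SIZES = (1 * GB, 2 * MB, 4 * KB)  # tried largest first
MMIO_HOLE = (3 * GB, 4 * GB)           # no 1GB pages in this range


def populate(start, end, try_alloc):
    """Cover [start, end) using the largest page size that works.

    try_alloc(addr, size) is a hypothetical callback returning True
    when an allocation of that size at that address succeeded.
    """
    addr = start
    chunks = []
    while addr < end:
        for size in PAGE_SIZES:
            if addr % size != 0 or addr + size > end:
                continue  # not aligned, or does not fit
            in_hole = addr < MMIO_HOLE[1] and addr + size > MMIO_HOLE[0]
            if size == 1 * GB and in_hole:
                continue  # skip 1GB pages between 3GB and 4GB
            if try_alloc(addr, size):
                chunks.append((addr, size))
                addr += size
                break
        else:
            raise MemoryError("no page size could be allocated at 0x%x" % addr)
    return chunks
```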
Limit the number of idle cpus tickled for vcpu migration purposes
to ONLY ONE, to get rid of a lot of IPI events which may impact the
average cpu idle residency time.
The optimization is on by default; the option 'tickle_one_idle_cpu=0'
can be used to disable it if needed.
cpuidle: mwait on softirq_pending & remove wakeup ipis
For cpus which enter a deep C state via monitor/mwait, wakeup can be
done by writing to the monitored memory. So once softirq_pending is
the monitored location, we can remove the redundant ipis.
Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Allow all unused GSI to be configured via IO-APIC by new pv_ops dom0
Currently Xen disallows setting up any GSI < 16. This makes it
impossible for the kernel to use any PCI device that has no ACPI
override but is mapped to one of these interrupts via IO-APIC.
The patch allows all unused interrupts to be set up via IO-APIC.
Keir Fraser [Wed, 31 Mar 2010 09:21:19 +0000 (10:21 +0100)]
x86/hvm: accelerate I/O intercept handling
Currently we go through the emulator every time an HVM guest does an
I/O port access (in/out). This is unnecessary most of the time, as
both VMX and SVM provide all the necessary information already in the
VMCS/VMCB. String instructions are not covered by this shortcut, but
they are quite rare and we would need to access the guest memory
anyway. This patch decodes the information from VMCB/VMCS and calls a
simple handle_mmio wrapper. In handle_mmio() itself the emulation part
will simply be skipped, this approach avoids code duplication. Since
the vendor specific part is quite trivial, I implemented both the VMX
and SVM part, please check the VMX part for sanity.
I boot-tested both versions and ran some simple benchmarks. A micro
benchmark (hammering an I/O port in a tight loop) shows a significant
performance improvement (down to 66% of the time needed to handle the
intercept on an AMD K8, measured in the guest with TSC). Even with
reading a 1GB file from an emulated IDE harddisk (Dom0 cached) I could
get a 4-5% improvement. Some guest code (e.g. the TCP stack in some
Windows version) exercises the PM-Timer I/O port (0x1F48) very often
(multiple 10,000 times per second), these workloads also benefit with
up to 5% improvement from this patch.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Keir Fraser [Wed, 31 Mar 2010 09:12:29 +0000 (10:12 +0100)]
x86: start PCI IRQs Xen uses from Dom0-invoked io_apic_set_pci_routing()
When using a serial port from an add-in PCI card, and that IRQ is (as
usual) outside of the legacy range (0...15), Xen would never really
enable the IRQ, as at the time setup_irq() runs the handler for the
IRQ still is &no_irq_type. Consequently, once the trigger mode and
polarity of the interrupt become known to Xen, it should start such
IRQ(s) it uses for itself.
The question is whether the same should also be done in
ioapic_guest_write(): Legacy kernels don't use PHYSDEVOP_setup_gsi
(and hence don't trigger the code path modified).
Note however that even when a kernel is using PHYSDEVOP_setup_gsi in
the way the pv-ops kernel currently does, there's still no guarantee
that the call would ever be issued for IRQs Xen may be using, since
this happens only when devices get enabled. For Xen's purposes, this
function should be called for *all* device IRQs, regardless of
whether those would actually be (attempted to be) used by the kernel,
i.e. in a subsys_initcall() from drivers/acpi/pci_irq.c iterating
over all PCI devices and doing mostly what acpi_pci_irq_enable() does
except for calling this function in place of acpi_register_gsi(). The
downside of this approach is that without extra filtering in Xen
(based on a hint from Dom0), vectors will then get set up even for IRQs
that are unused by both hypervisor and kernel.
Keir Fraser [Wed, 31 Mar 2010 09:11:41 +0000 (10:11 +0100)]
ns16550: enable PCI serial card usage
On some machines, there is no built-in serial port and no LPC
connection. To use a serial port, we have to plug in a serial
card. Sometimes the BIOS doesn't enable the BARs for the PCI devices,
so Xen can't use the add-in serial port for early log output. This
patch tries to initialize the serial card and the related PCI
bridge to make it usable for Xen.
Usage:
Step 1. boot into bare metal Linux and get the information for the
PCI serial ports and the related PCI bridge. In my case:
06:02.0 Serial controller: Lava Computer mfg Inc Lava DSerial-PCI Port
A (prog-if 02 [16550])
Region 0: I/O ports at 5000 [size=8]
Step 2. revise grub.conf to include
'com1=115200,8n1,0xPPPP,0,<port-bdf>,<bridge-bdf> console=com1' in the
xen cmdline. 0xPPPP is the base I/O port address obtained for the
serial port in bare metal Linux. In my case, it is 0x5000. The 0
after 0xPPPP enables polling mode for the serial port. The
<port-bdf> is the serial port BDF, 06:02.0 in my case; the
<bridge-bdf> is the BDF of the bridge behind which the serial card
sits, 00:1e.0 in my case.
Keir Fraser [Tue, 30 Mar 2010 17:31:39 +0000 (18:31 +0100)]
xend: Fix bug of cpu affinity/ vcpu pin under ia32pae
c/s 21040 and 21044 were meant to lift the cpu number limit (<=64).
However, they introduced bugs under the ia32pae model:
1. affinity goes wrong, pinning all vcpus to the same cpu at the
same time;
2. when 'xm vcpu-pin' pins a vcpu to a cpu, xend exits abnormally.
Keir Fraser [Tue, 30 Mar 2010 17:30:30 +0000 (18:30 +0100)]
svm: vmcb intercept enumeration
Attached patch enumerates vmcb intercepts with hexadecimal numbers
additionally. This makes looking up the intercept number easier.
No functional changes.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Keir Fraser [Tue, 30 Mar 2010 07:32:34 +0000 (08:32 +0100)]
mcheck: Small fix for CMCI Threshold set problem.
When generating a new threshold value, we must first clear the old
value before OR-ing in the new one, since the new value might differ
from the old one (the BIOS might have pre-set some threshold).
Signed-off-by: Liping Ke <liping.ke@intel.com>
Signed-off-by: Ying Huang <ying.huang@intel.com>
Keir Fraser [Tue, 30 Mar 2010 07:31:16 +0000 (08:31 +0100)]
When flushing the TLB by mask, we need to consider the cpu_online_map.
The same is true for EPT flushes.
We noticed occasional system hangs in a cpu online/offline stress
test. The reason is that flush_tlb_mask, called from __get_page_type,
loops forever.
This appears to be caused by a small window in cpu offline. The
cpu_online_map is changed and interrupts are disabled in
take_cpu_down() for the to-be-offline CPU.
However, __sync_lazy_execstate() is called from idle_task_exit() in
the idle_loop() of the to-be-offline CPU. At that time,
stop_machine_run has finished already, and __get_page_type may be
called on another CPU before the __sync_lazy_execstate().
Thanks to Jan for pointing out an issue in my original patch.
Keir Fraser [Fri, 26 Mar 2010 08:49:13 +0000 (08:49 +0000)]
cpufreq: fix statistic lock problem
cpufreq_statistic_lock should not only protect the statistic memory
pointed to by cpufreq_statistic_data[cpu], but also has to protect
the pointer in cpufreq_statistic_data[cpu] itself. So move the read
of cpufreq_statistic_data[cpu] to after
spin_lock(cpufreq_statistic_lock).
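
The ordering matters because another path may free or replace the
per-cpu entry while the lock is not held. A minimal sketch of the
corrected pattern, using a Python lock and dictionary as stand-ins
for the spinlock and the per-cpu pointer array (record_transition and
the "total_trans" field are hypothetical names):

```python
import threading

cpufreq_statistic_lock = threading.Lock()
cpufreq_statistic_data = {}  # cpu number -> statistics dict (stand-in)


def record_transition(cpu):
    """Count one frequency transition for the given cpu, if it has stats."""
    with cpufreq_statistic_lock:
        # Read the per-cpu entry only after the lock is held, so it
        # cannot be removed between the lookup and the update.
        stats = cpufreq_statistic_data.get(cpu)
        if stats is None:
            return False
        stats["total_trans"] += 1
        return True
```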
Keir Fraser [Thu, 25 Mar 2010 10:01:05 +0000 (10:01 +0000)]
VT-d: should not disable VT-d when find unknown DMAR structure type
Currently 4 DMAR structure types are supported (type values 0 ~ 3).
Type values > 3 are reserved for future use. The current
implementation disables VT-d when it finds an unknown DMAR structure
type, which may lead to VT-d being disabled on future platforms until
Xen supports the new types. For forward compatibility, just skip
unknown structures by skipping the number of bytes indicated by the
Length field, so that VT-d can still be used.
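
A sketch of the skipping logic, assuming each remapping structure
starts with a 16-bit little-endian Type followed by a 16-bit Length
that covers the whole structure. The function name and the returned
tuples are hypothetical; this is not the Xen parser.

```python
import struct

KNOWN_TYPES = {0, 1, 2, 3}  # the four types supported today
HEADER_SIZE = 4             # 16-bit Type + 16-bit Length


def walk_dmar_structures(buf):
    """Return (known, skipped) lists of (type, offset) pairs."""
    known, skipped = [], []
    offset = 0
    while offset + HEADER_SIZE <= len(buf):
        stype, length = struct.unpack_from("<HH", buf, offset)
        if length < HEADER_SIZE or offset + length > len(buf):
            raise ValueError("malformed structure at offset %d" % offset)
        if stype in KNOWN_TYPES:
            known.append((stype, offset))
        else:
            # Unknown type: do not fail, just step over it.
            skipped.append((stype, offset))
        offset += length
    return known, skipped
```

Only a Length that is too small or runs past the table is treated as
fatal here, since the walk could not continue safely in that case.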
Signed-off-by: Weidong Han <weidong.han@intel.com>
Keir Fraser [Thu, 25 Mar 2010 09:19:33 +0000 (09:19 +0000)]
x86: s3: write_msi_msg: entry->msg should be in the compatibility format
When Interrupt Remapping is used, after Dom0 S3, Dom0's filesystem
might become inaccessible as the SATA disk's MSI interrupt becomes
buggy. The cause is: After set_msi_affinity() or setup_msi_irq()
invokes write_msi_msg(), entry->msg records the remappable format
message; during S3 resume, Dom0 invokes the PHYSDEVOP_restore_msi
hypercall to restore the MSI registers of devices, and in
pci_restore_msi_state() -> write_msi_msg(), the 'entry->msg' of
remappable format is passed, but in write_msi_msg() -> ... ->
msi_msg_to_remap_entry(), the 'msg' is assumed to be in compatibility
format. As a result, after s3, the IRTE is corrupted.
Actually the only users of 'entry->msg' are pci_restore_msi_state()
and dump_msi(). That's why we see no issue except with Dom0 S3.
Keir Fraser [Thu, 25 Mar 2010 07:41:55 +0000 (07:41 +0000)]
Fix gdbserver-xen support on older kernels.
The xc_ptrace API relies on errno for passing success/failure
indication back to callers. However, mapping operations that fall
back on legacy APIs may leave errno set to a non-zero result even
though the operation is successful. This patch resets errno after
successful map operations so that xc_ptrace doesn't inadvertently
return a failure.
Keir Fraser [Thu, 25 Mar 2010 07:40:09 +0000 (07:40 +0000)]
x86: fix improper return value from relinquish_memory()
While apparently only a theoretical possibility (domain_kill() has a
BUG_ON() that wasn't reported to trigger so far), I still think it is
better to have the code cleaned up.
Keir Fraser [Wed, 24 Mar 2010 11:06:48 +0000 (11:06 +0000)]
Fix 21051:bcc09eb7379f "x86_32: Relocate multiboot modules to below 1GB."
Copy the modules in ascending order in memory, rather than descending
order. This reduces the likelihood of the second relocation (in
setup.c) corrupting modules through accidental overwriting.
Keir Fraser [Tue, 23 Mar 2010 09:35:31 +0000 (09:35 +0000)]
x86: s3: ensure CR4.MCE is enabled after mcheck_init()
Changeset 21045: 7751288b1386 introduces a potential issue: CR4.MCE is
enabled before mcheck_init(). Although I have not hit an actual issue
with this, we had better fix it.
Keir Fraser [Mon, 22 Mar 2010 10:29:42 +0000 (10:29 +0000)]
No cpu_add_remove_lock in do_boot_cpu.
do_boot_cpu() is called at system boot or when a CPU is brought
online. At system boot, we don't need to hold this lock. When a CPU
is brought online, the lock is already held by cpu_up.
Keir Fraser [Mon, 22 Mar 2010 10:29:13 +0000 (10:29 +0000)]
Do not spin on locks that may be held by stop_machine_run() callers.
Currently stop_machine_run() will try to bring all CPUs to softirq
context, with some locks held, like xenpf_lock or cpu_add_remove_lock
etc. However, if another CPU is trying to get these locks, it may
cause deadlock.
This patch replaces all such spin_lock calls with spin_trylock. For
xenpf_lock and sysctl_lock, we try to use hypercall_continuation, so
that we will not cause trouble to user space tools. For
cpu_hot_remove_lock, we simply return EBUSY on failure, since it will
only impact a small number of user space tools.
In the end, we should try to make stop_machine_run spinlock
free.
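
The trylock-and-return-EBUSY pattern can be sketched as follows,
using a Python lock as a stand-in for the spinlock. The lock and
function names are hypothetical, and the hypercall continuation path
is not shown.

```python
import errno
import threading

cpu_hotplug_lock = threading.Lock()  # hypothetical stand-in for the spinlock


def try_cpu_offline(do_offline):
    """Run do_offline() under the lock, or fail fast if the lock is busy."""
    if not cpu_hotplug_lock.acquire(blocking=False):
        # Do not spin: the holder may be waiting for this CPU in
        # stop_machine_run(), which would deadlock.
        return -errno.EBUSY
    try:
        do_offline()
        return 0
    finally:
        cpu_hotplug_lock.release()
```

The caller is expected to retry later when it sees -EBUSY.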
Keir Fraser [Sat, 20 Mar 2010 07:35:04 +0000 (07:35 +0000)]
Fix vcpu hotplug bug: transfer vcpu_avail hex string to qemu
Currently qemu has a bug: when maxvcpus > 64, qemu gets a wrong
vcpu bitmap (s->cpus_sts[i]) since it only reads the bitmap from a
long variable.
This patch, together with another qemu patch, fixes this bug.
This patch transfers vcpu_avail in a hex string format, so that on
the qemu side it is easier to get the vcpu bitmap from the hex
string, especially with many vcpus, e.g. more than 64.
(Also update QEMU_TAG for matching qemu-side update)
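
A sketch of the round trip, to show why a hex string is not limited
to the width of a long. The function names are hypothetical and the
exact string format used by the patch (for example, whether it
carries a "0x" prefix) is not assumed here.

```python
def vcpu_avail_to_hex(avail):
    """Encode a set of available vcpu ids as a hex string."""
    mask = 0
    for vcpu in avail:
        mask |= 1 << vcpu
    return "%x" % mask


def hex_to_vcpu_bitmap(hexstr, maxvcpus):
    """Decode the hex string into a list of booleans, one per vcpu."""
    mask = int(hexstr, 16)
    return [bool((mask >> i) & 1) for i in range(maxvcpus)]
```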
Keir Fraser [Thu, 18 Mar 2010 11:02:25 +0000 (11:02 +0000)]
x86: fix dom0 S3 when x2apic is used.
1) Some variables and functions in xen/arch/x86/genapic/x2apic.c
should not be marked with __init* as they will be used during s3
resume;
2) In do_suspend_lowlevel -> restore_rest_processor_state ->
mcheck_init, lapic is accessed, but x2apic hasn't been re-enabled yet
(x2apic is re-enabled in device_power_up -> lapic_resume). The patch
moves mcheck_init to a later place.
Keir Fraser [Wed, 17 Mar 2010 14:10:43 +0000 (14:10 +0000)]
Improve graphical console performance
As it is pretty pointless to clear unused parts of a line over and
over again, keep track of how much of a line was actually written
to and avoid clearing parts of the screen that are known to already
be clear. With this, scrolling speed becomes comparable to that of
Linux' VESA console.
Keir Fraser [Wed, 17 Mar 2010 14:09:55 +0000 (14:09 +0000)]
x86: suppress pointless Xen messages from ioapic_guest_write()
Previously, these messages were only issued when old and new RTE
differed. Make it so again (requiring adjustment of the guest provided
RTE as that no longer holds a real vector).
While at it, also make the "allocated vector for irq" message more
useful and occur when what it says really happened.
Keir Fraser [Wed, 17 Mar 2010 09:18:34 +0000 (09:18 +0000)]
Fix a race condition for cpufreq dbs timer while S3 resuming
Before this patch, cpufreq_dbs_timer_suspend/resume may race with
dbs_timer_init during S3 resume.
This patch, along with cset 21030, fixes bug 1586:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1586.
Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Keir Fraser [Wed, 17 Mar 2010 09:17:27 +0000 (09:17 +0000)]
libxc: Support set affinity for more than 64 CPUs.
There are more than 64 cpus on new Intel platforms, especially on
NUMA systems, so we need to lift the pcpu limit (currently 64) when
setting the affinity of a VCPU.
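
One way to remove a 64-bit limit is to carry the affinity as a byte
array sized from the number of physical cpus instead of a single
64-bit word. The sketch below illustrates that idea only; the
function name is hypothetical and this is not the libxc interface.

```python
def cpus_to_bitmap(cpus, nr_cpus):
    """Build a byte-array bitmap with one bit per physical cpu."""
    bitmap = bytearray((nr_cpus + 7) // 8)
    for cpu in cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError("cpu %d out of range" % cpu)
        bitmap[cpu // 8] |= 1 << (cpu % 8)
    return bytes(bitmap)
```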
Signed-off-by: James (song wei) <jsong@novell.com>
Keir Fraser [Wed, 17 Mar 2010 08:35:13 +0000 (08:35 +0000)]
VT-d: reduce default verbosity
Introduce a new sub-option "verbose" to "iommu=", and hide most
(debugging) messages when that option is not specified. Particularly
messages printed after time management was initialized can, on
sufficiently large systems and with a graphical console, lead to
time management issues (therefore a call to process_pending_softirqs()
also gets added in case the new sub-option is being used).
While touching that code, also convert all improper uses of gdprintk()
to dprintk(), and convert all boolean iommu config variables to bool_t
residing in the .data.read_mostly section.
Keir Fraser [Wed, 17 Mar 2010 08:34:16 +0000 (08:34 +0000)]
Increase default console ring allocation size and reduce default verbosity
In order to have a better chance that relevant messages fit into the
ring buffer, allocate a dynamic (larger) one in more cases, and make
the default allocation size depend on both the number of CPUs and the
log level. Also free the static buffer if a dynamic one was obtained.
In order for "xm dmesg" to retrieve larger buffers, eliminate
pyxc_readconsolering()'s 32k limitation resulting from the use of a
statically allocated buffer.
Finally, suppress on x86 most per-CPU boot time messages (by default,
most of them can be re-enabled with a new command line option
"cpuinfo", some others are now only printed more than once when there
are inconsistencies between CPUs). This reduces both boot time (namely
when a graphical console is in use) and pressure on the console ring
and serial transmit buffers.
Keir Fraser [Mon, 15 Mar 2010 17:08:29 +0000 (17:08 +0000)]
blktap/fs-back: Build fixes for Fedora 13
1. Some files use stat, mkfifo, mkdir etc. without including
sys/stat.h
2. Some programs link against libpthread without a -lpthread compile
option. The compile used to work if this library happened to be used
by one of the other libraries that was being linked against, but
Fedora 13 has stopped allowing this.
From: M A Young <m.a.young@durham.ac.uk>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
Simply trying order-9 allocations until they won't succeed anymore
may consume unnecessarily much memory from the DMA zone (since the
page allocator will try to fulfill the request by using memory from
that zone when only lower order memory blocks are left in all other
zones). To avoid using DMA zone memory, make alloc_chunk() try to
allocate a second smaller chunk and use that one in favor of the
first one if it came from a higher addressed memory. This way, all
memory outside the DMA zone will be consumed before eating into that
zone.
Keir Fraser [Mon, 15 Mar 2010 13:24:33 +0000 (13:24 +0000)]
Reduce boot-time memory fragmentation
On certain NUMA configurations having init_node_heap() consume the
first few pages of a new node's memory for internal data structures
leads to unnecessary memory fragmentation, which can - with
sufficiently many nodes - result in there not remaining enough memory
below 4G for Dom0 to set up its swiotlb and PCI-consistent buffers.
Since alloc_boot_pages() generally consumes from the end of available
regions, make init_node_heap() prefer the end of such regions too (so
that fragmentation occurs at only one end of a region).
(Adjustment from first version: Use the tail of the region when the
end address's alignment is less than or equal to the beginning one's,
not just when it's less.)
Further, in order to prefer allocations from higher memory locations,
insert memory regions in reverse order in end_boot_allocator(), with
the exception of inserting one region residing on the boot CPU's node
first (for the statically allocated structures - used for the first
node seen - to be used for this node).
Finally, reduce MAX_ORDER on x86 to the maximum useful value (1Gb), so
that the reservation of a page on node boundaries (again leading to
fragmentation) can be avoided as much as possible (having node
boundaries on less than 1Gb aligned addresses is expected to be rare,
if found in practice at all).
Keir Fraser [Mon, 15 Mar 2010 13:23:07 +0000 (13:23 +0000)]
pygrub: further improve grub2 support
* Improve syntax error messages to say what actually went wrong
instead of giving an arbitrary and basically useless
integer.
* Improve handling of quoted values used with the "set" command,
previously only the default variable was special cased to
handle quoting.
* Allow for extra options to the menuentry command, syntax now
appears to be
menuentry "TITLE" --option1 --option2 {...}
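
A sketch of how a menuentry line of that shape could be parsed. The
regular expression and function name are illustrative assumptions,
not the code in pygrub, and options that take a separate value are
not handled.

```python
import re

# Title in single or double quotes, then any number of --options, then "{".
MENUENTRY_RE = re.compile(
    r'^menuentry\s+(["\'])(?P<title>.*?)\1(?P<opts>(?:\s+--\S+)*)\s*\{'
)


def parse_menuentry(line):
    """Return (title, [options]) for a menuentry line."""
    m = MENUENTRY_RE.match(line.strip())
    if m is None:
        # Say what was wrong instead of returning a bare error code.
        raise ValueError("not a menuentry line: %r" % line)
    return m.group("title"), m.group("opts").split()
```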
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Keir Fraser [Mon, 15 Mar 2010 13:22:06 +0000 (13:22 +0000)]
libxenlight: fix segfault when domid_to_name returns NULL
The function libxl_domid_to_name() can return NULL if the path
/local/domain/%d/name does not exist. This causes a segfault if the
NULL name is later passed as a value to libxl_xs_writev(). I'm
hitting this making a call to libxl_device_vfb_add() from my graphical
switcher application.
This patch modifies xs_writev() and libxl_xs_writev() to skip NULL
values.
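
The skipping behaviour can be sketched as follows, with a dictionary
standing in for xenstore and a flat key/value list as input. The
function name and the path used in the usage note are hypothetical.

```python
def xs_writev_sketch(store, base_path, kvs):
    """Write key/value pairs under base_path, skipping missing values.

    kvs is a flat list: [key0, value0, key1, value1, ...].
    """
    for key, value in zip(kvs[0::2], kvs[1::2]):
        if value is None:
            # Nothing to write for this key; skip it instead of crashing.
            continue
        store["%s/%s" % (base_path, key)] = value
```

For example, writing ["domain", None, "state", "1"] under a
hypothetical path creates only the "state" node.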
Keir Fraser [Mon, 15 Mar 2010 13:16:35 +0000 (13:16 +0000)]
hotplug: Avoid race condition when creating or destroying network bridges
I saw the following message when I created or destroyed two bridges
at the same time using the network-bridge script. The names of the
bridges are of course different, but the script uses a temporary
name, "tmpbridge", to create or destroy the bridges. I think the
message was caused by "tmpbridge".
SIOCSIFNAME: File exists
This patch avoids the race condition when creating or destroying the
bridges.
Keir Fraser [Thu, 11 Mar 2010 17:40:35 +0000 (17:40 +0000)]
x86: adjust available memory calculation for Dom0 construction
With a large number of CPUs, the amount of memory needed to construct
the vCPU structures for Dom0 becomes significant and hence should be
accounted for when calculating the amount of memory to pass to Dom0.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Add code comments and clean up compute_dom0_nr_pages() invocation.