Keir Fraser [Fri, 26 Mar 2010 08:49:13 +0000 (08:49 +0000)]
cpufreq: fix statistic lock problem
cpufreq_statistic_lock should protect not only the statistic memory
pointed to by cpufreq_statistic_data[cpu], but also the pointer in
cpufreq_statistic_data[cpu] itself. So move the read of
cpufreq_statistic_data[cpu] after
spin_lock(cpufreq_statistic_lock).
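The locking rule described above can be sketched as follows. This is a minimal illustration, not the Xen code: the toy spinlock built on atomic_flag, the struct layout, and the names stat_table, stat_set() and stat_count_transition() are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU statistic data. */
struct stat_data { unsigned long transitions; };

/* Hypothetical toy spinlock; Xen uses its own spinlock_t. */
static atomic_flag stat_lock = ATOMIC_FLAG_INIT;
static struct stat_data *stat_table[NR_CPUS];

static void stat_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&stat_lock))
        ; /* spin until the flag is released */
}

static void stat_lock_release(void)
{
    atomic_flag_clear(&stat_lock);
}

/* Publish or withdraw the per-CPU data under the lock. */
void stat_set(unsigned int cpu, struct stat_data *d)
{
    assert(cpu < NR_CPUS);
    stat_lock_acquire();
    stat_table[cpu] = d;
    stat_lock_release();
}

/* Count one transition; returns 0 on success, -1 if no data is present. */
int stat_count_transition(unsigned int cpu)
{
    struct stat_data *d;
    int rc = -1;

    assert(cpu < NR_CPUS);
    stat_lock_acquire();
    d = stat_table[cpu];        /* read the pointer only while holding the lock */
    if (d != NULL) {
        d->transitions++;
        rc = 0;
    }
    stat_lock_release();
    return rc;
}
```

Reading stat_table[cpu] before taking the lock would allow another CPU to withdraw and free the data between the read and the update.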
Keir Fraser [Thu, 25 Mar 2010 10:01:05 +0000 (10:01 +0000)]
VT-d: should not disable VT-d when find unknown DMAR structure type
Currently 4 DMAR structure types are supported (type values 0 ~ 3). Type
values > 3 are reserved for future use. The current implementation
disables VT-d when it finds an unknown DMAR structure type, which may
lead to VT-d being disabled on future platforms before Xen supports the
new types. For forward compatibility, just skip unknown structures by
skipping the number of bytes indicated by the Length field, so that
VT-d can still be used.
Signed-off-by: Weidong Han <weidong.han@intel.com>
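The skip-by-Length approach can be sketched as below. This is an illustrative parser only, assuming each structure begins with a 16-bit type followed by a 16-bit length covering the whole structure; dmar_walk() and struct dmar_hdr are hypothetical names, not the Xen functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed common header: 16-bit type, then 16-bit total length. */
struct dmar_hdr { uint16_t type; uint16_t length; };

#define DMAR_TYPE_MAX_KNOWN 3   /* types 0..3 are the ones handled */

/*
 * Walk the structures in buf[0..size). Known types are counted in *handled,
 * unknown ones in *skipped. Returns 0 on success, -1 on a malformed length.
 */
int dmar_walk(const uint8_t *buf, size_t size,
              unsigned *handled, unsigned *skipped)
{
    size_t off = 0;

    assert(buf != NULL && handled != NULL && skipped != NULL);
    *handled = *skipped = 0;

    while (off + sizeof(struct dmar_hdr) <= size) {
        struct dmar_hdr h;

        memcpy(&h, buf + off, sizeof(h));   /* avoid unaligned access */

        if (h.length < sizeof(h) || h.length > size - off)
            return -1;                      /* would loop forever or overrun */

        if (h.type <= DMAR_TYPE_MAX_KNOWN)
            (*handled)++;                   /* a real parser would dispatch here */
        else
            (*skipped)++;                   /* unknown type: just step over it */

        off += h.length;
    }
    return 0;
}
```

The length check matters as much as the skip itself: a zero or oversized Length would otherwise turn forward compatibility into an endless loop or an out-of-bounds read.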
Keir Fraser [Thu, 25 Mar 2010 09:19:33 +0000 (09:19 +0000)]
x86: s3: write_msi_msg: entry->msg should be in the compatibility format
When Interrupt Remapping is used, after Dom0 S3, Dom0's filesystem
might become inaccessible as the SATA disk's MSI interrupt becomes
buggy. The cause is: After set_msi_affinity() or setup_msi_irq()
invokes write_msi_msg(), entry->msg records the remappable format
message; during S3 resume, Dom0 invokes the PHYSDEVOP_restore_msi
hypercall to restore the MSI registers of devices, and in
pci_restore_msi_state() -> write_msi_msg(), the 'entry->msg' of
remappable format is passed, but in write_msi_msg() -> ... ->
msi_msg_to_remap_entry(), the 'msg' is assumed to be in compatibility
format. As a result, after S3, the IRTE is corrupted.
Actually the only users of 'entry->msg' are pci_restore_msi_state()
and dump_msi(). That's why we don't see the issue except on Dom0 S3.
Keir Fraser [Thu, 25 Mar 2010 07:41:55 +0000 (07:41 +0000)]
Fix gdbserver-xen support on older kernels.
The xc_ptrace API relies on errno for passing success/failure
indication back to callers. However, mapping operations that fall
back on legacy APIs may leave errno set to a non-zero result even
though the operation is successful. This patch resets errno after
successful map operations so that xc_ptrace doesn't inadvertently
return a failure.
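The errno problem can be reproduced in miniature as follows. The two mapping functions are hypothetical stand-ins (they are not libxc calls); the point is only that a successful fallback must clear the stale errno left by the first attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static char page[4096];

/* Hypothetical preferred interface: unavailable on an older kernel. */
static void *map_new_api(void)
{
    errno = ENOSYS;
    return NULL;
}

/* Hypothetical legacy fallback: always succeeds in this sketch. */
static void *map_legacy_api(void)
{
    return page;
}

/*
 * Returns the mapping, or NULL with errno set on failure.
 * *used_fallback is set to 1 if the legacy path was taken.
 */
void *map_page(int *used_fallback)
{
    void *p;

    assert(used_fallback != NULL);
    *used_fallback = 0;

    p = map_new_api();
    if (p == NULL) {
        p = map_legacy_api();   /* errno still holds ENOSYS from the first try */
        *used_fallback = 1;
    }

    if (p != NULL)
        errno = 0;              /* success: clear the stale error */
    return p;
}
```

A caller that judges success by errno alone would otherwise see ENOSYS after a mapping that actually worked.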
Keir Fraser [Thu, 25 Mar 2010 07:40:09 +0000 (07:40 +0000)]
x86: fix improper return value from relinquish_memory()
While apparently only a theoretical possibility (domain_kill() has a
BUG_ON() that wasn't reported to trigger so far), I still think it is
better to have the code cleaned up.
Keir Fraser [Wed, 24 Mar 2010 11:06:48 +0000 (11:06 +0000)]
Fix 21051:bcc09eb7379f "x86_32: Relocate multiboot modules to below 1GB."
Copy the modules in ascending order in memory, rather than descending
order. This reduces the likelihood of the second relocation (in
setup.c) corrupting modules through accidental overwriting.
Keir Fraser [Tue, 23 Mar 2010 09:35:31 +0000 (09:35 +0000)]
x86: s3: ensure CR4.MCE is enabled after mcheck_init()
Changeset 21045: 7751288b1386 introduces a potential issue: CR4.MCE is
enabled before mcheck_init(). Although I haven't hit an actual issue
with this, we'd better fix it.
Keir Fraser [Mon, 22 Mar 2010 10:29:42 +0000 (10:29 +0000)]
No cpu_add_remove_lock in do_boot_cpu.
do_boot_cpu() is called during system boot or CPU online. During
system boot, we don't need to hold this lock. During CPU online, the
lock is already held by cpu_up.
Keir Fraser [Mon, 22 Mar 2010 10:29:13 +0000 (10:29 +0000)]
Do not spin on locks that may be held by stop_machine_run() callers.
Currently stop_machine_run() will try to bring all CPUs to softirq
context, with some locks held, like xenpf_lock or cpu_add_remove_lock
etc. However, if another CPU is trying to get these locks, it may
cause deadlock.
This patch replaces all such spin_lock calls with spin_trylock. For
xenpf_lock and sysctl_lock, we try to use hypercall_continuation, so
that we will not cause trouble for user space tools. For
cpu_hot_remove_lock, we simply return EBUSY on failure, since it will
only impact a small number of user space tools.
In the long run, we should try to make stop_machine_run() spinlock
free.
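The "try once, then report busy" pattern can be sketched as below. It is a simplified illustration using a C11 atomic_flag as a toy lock; run_locked(), ok_op() and nested_op() are hypothetical names, and the hypercall continuation path is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical toy lock standing in for a Xen spinlock. */
static atomic_flag hotplug_lock = ATOMIC_FLAG_INIT;

/* Returns 1 if the lock was taken, 0 if it is already held. */
static int toy_trylock(void)
{
    return !atomic_flag_test_and_set(&hotplug_lock);
}

static void toy_unlock(void)
{
    atomic_flag_clear(&hotplug_lock);
}

/* Run op() under the lock, or fail fast with -EBUSY instead of spinning. */
int run_locked(int (*op)(void))
{
    int rc;

    assert(op != NULL);
    if (!toy_trylock())
        return -EBUSY;          /* the caller may retry later */

    rc = op();
    toy_unlock();
    return rc;
}

/* Trivial operation that succeeds. */
int ok_op(void)
{
    return 0;
}

/* Re-enters run_locked() while the lock is held, to show the busy path. */
int nested_op(void)
{
    return run_locked(ok_op);
}
```

With a plain spin_lock the nested call would spin forever; with the trylock variant it returns -EBUSY and the outer call still releases the lock.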
Keir Fraser [Sat, 20 Mar 2010 07:35:04 +0000 (07:35 +0000)]
Fix vcpu hotplug bug: transfer vcpu_avail hex string to qemu
Currently qemu has a bug: when maxvcpus > 64, qemu gets the wrong
vcpu bitmap (s->cpus_sts[i]) since it only reads the bitmap from a long
variable.
This patch, together with another qemu patch, fixes this bug.
It passes vcpu_avail as a hex string, so that on the qemu side it is
easier to get the vcpu bitmap from the string, especially with many
vcpus (more than 64).
(Also update QEMU_TAG for matching qemu-side update)
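Parsing such a hex string into a bitmap of arbitrary width might look like the sketch below. It is an assumption about how a consumer could do it, not the qemu implementation; hex_to_bitmap() and bitmap_test() are hypothetical helpers, and the bitmap is stored as bytes with bit n in map[n / 8].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Parse a hex string such as "0x1ffff" into a byte bitmap where bit n of
 * the value lands in map[n / 8], bit position n % 8.
 * Returns 0 on success, -1 on a bad digit or if the value does not fit.
 */
int hex_to_bitmap(const char *s, uint8_t *map, size_t map_bytes)
{
    size_t len, i;

    assert(s != NULL && map != NULL);
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X'))
        s += 2;
    len = strlen(s);
    if (len == 0)
        return -1;
    memset(map, 0, map_bytes);

    /* Walk from the least significant (rightmost) digit. */
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[len - 1 - i];
        unsigned v;

        if (!isxdigit(c))
            return -1;
        v = isdigit(c) ? (unsigned)(c - '0')
                       : (unsigned)(tolower(c) - 'a' + 10);

        if (i / 2 >= map_bytes) {
            if (v != 0)
                return -1;      /* a set bit beyond the end of the bitmap */
            continue;
        }
        map[i / 2] |= (uint8_t)(v << ((i % 2) * 4));
    }
    return 0;
}

/* Returns 1 if the given bit is set in the bitmap, 0 otherwise. */
int bitmap_test(const uint8_t *map, unsigned bit)
{
    return (map[bit / 8] >> (bit % 8)) & 1;
}
```

Because the string is consumed one digit at a time, nothing in the parser depends on the width of long, which is the limitation the commit message describes.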
Keir Fraser [Thu, 18 Mar 2010 11:02:25 +0000 (11:02 +0000)]
x86: fix dom0 S3 when x2apic is used.
1) Some variables and functions in xen/arch/x86/genapic/x2apic.c
should not be marked with __init* as they will be used during s3
resume;
2) In do_suspend_lowlevel -> restore_rest_processor_state ->
mcheck_init, lapic is accessed, but x2apic hasn't been re-enabled yet
(x2apic is re-enabled in device_power_up -> lapic_resume). The patch
moves mcheck_init to a later place.
Keir Fraser [Wed, 17 Mar 2010 14:10:43 +0000 (14:10 +0000)]
Improve graphical console performance
As it is pretty pointless to clear unused parts of a line over and
over again, keep track of how much of a line was actually written
to and avoid clearing parts of the screen that are known to already
be clear. With this, scrolling speed becomes comparable to that of
Linux' VESA console.
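The bookkeeping idea can be sketched as follows, using a hypothetical text-cell buffer rather than the real framebuffer code. The names screen, used, cleared, put_line() and clear_line() are illustrative only, and a zero byte stands for a blank cell.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COLS 80
#define ROWS 4

char screen[ROWS][COLS];    /* hypothetical cell buffer, 0 means blank */
unsigned used[ROWS];        /* number of columns written so far on each row */
unsigned long cleared;      /* cells actually cleared, for illustration */

/* Write a string at the start of a row and remember how far it reached. */
void put_line(unsigned row, const char *s)
{
    size_t n;

    assert(row < ROWS && s != NULL);
    n = strlen(s);
    if (n > COLS)
        n = COLS;
    memcpy(screen[row], s, n);
    if (n > used[row])
        used[row] = (unsigned)n;
}

/* Clear only the part of the row that is known to have been written. */
void clear_line(unsigned row)
{
    assert(row < ROWS);
    memset(screen[row], 0, used[row]);
    cleared += used[row];
    used[row] = 0;
}
```

Clearing a row twice in a row costs nothing the second time, which is where the scrolling speed-up comes from when most lines are short.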
Keir Fraser [Wed, 17 Mar 2010 14:09:55 +0000 (14:09 +0000)]
x86: suppress pointless Xen messages from ioapic_guest_write()
Previously, these messages were only issued when old and new RTE
differed. Make it so again (requiring adjustment of the guest provided
RTE as that no longer holds a real vector).
While at it, also make the "allocated vector for irq" message more
useful and occur when what it says really happened.
Keir Fraser [Wed, 17 Mar 2010 09:18:34 +0000 (09:18 +0000)]
Fix a race condition for cpufreq dbs timer while S3 resuming
Before this patch, cpufreq_dbs_timer_suspend/resume may race with
dbs_timer_init during S3 resume.
This patch, along with cset 21030, fixes bug 1586
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1586.
Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Keir Fraser [Wed, 17 Mar 2010 09:17:27 +0000 (09:17 +0000)]
libxc: Support set affinity for more than 64 CPUs.
New Intel platforms, especially NUMA systems, have more than 64 CPUs,
so we need to break the pcpu limit (currently 64) when setting the
affinity of a VCPU.
Signed-off-by: James (song wei) <jsong@novell.com>
Keir Fraser [Wed, 17 Mar 2010 08:35:13 +0000 (08:35 +0000)]
VT-d: reduce default verbosity
Introduce a new sub-option "verbose" to "iommu=", and hide most
(debugging) messages when that option is not specified. Particularly
messages printed after time management was initialized can, on
sufficiently large systems and with a graphical console, lead to
time management issues (therefore a call to process_pending_softirqs()
also gets added in case the new sub-option is being used).
While touching that code, also convert all improper uses of gdprintk()
to dprintk(), and convert all boolean iommu config variables to bool_t
residing in the .data.read_mostly section.
Keir Fraser [Wed, 17 Mar 2010 08:34:16 +0000 (08:34 +0000)]
Increase default console ring allocation size and reduce default verbosity
In order to have better chance that relevant messages fit into the
ring buffer, allocate a dynamic (larger) one in more cases, and make
the default allocation size depend on both the number of CPUs and the
log level. Also free the static buffer if a dynamic one was obtained.
In order for "xm dmesg" to retrieve larger buffers, eliminate
pyxc_readconsolering()'s 32k limitation resulting from the use of a
statically allocated buffer.
Finally, suppress on x86 most per-CPU boot time messages (by default,
most of them can be re-enabled with a new command line option
"cpuinfo", some others are now only printed more than once when there
are inconsistencies between CPUs). This reduces both boot time (namely
when a graphical console is in use) and pressure on the console ring
and serial transmit buffers.
Keir Fraser [Mon, 15 Mar 2010 17:08:29 +0000 (17:08 +0000)]
blktap/fs-back: Build fixes for Fedora 13
1. Some files use stat, mkfifo, mkdir etc. without including
sys/stat.h
2. Some programs link against libpthread without a -lpthread compile
option. The compile used to work if this library happened to be used
by one of the other libraries that was being linked against, but
Fedora 13 has stopped allowing this.
From: M A Young <m.a.young@durham.ac.uk>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
Simply trying order-9 allocations until they no longer succeed
may consume an unnecessarily large amount of memory from the DMA zone (since the
page allocator will try to fulfill the request by using memory from
that zone when only lower order memory blocks are left in all other
zones). To avoid using DMA zone memory, make alloc_chunk() try to
allocate a second smaller chunk and use that one in favor of the
first one if it came from a higher addressed memory. This way, all
memory outside the DMA zone will be consumed before eating into that
zone.
Keir Fraser [Mon, 15 Mar 2010 13:24:33 +0000 (13:24 +0000)]
Reduce boot-time memory fragmentation
On certain NUMA configurations having init_node_heap() consume the
first few pages of a new node's memory for internal data structures
leads to unnecessary memory fragmentation, which can - with
sufficiently many nodes - result in there not remaining enough memory
below 4G for Dom0 to set up its swiotlb and PCI-consistent buffers.
Since alloc_boot_pages() generally consumes from the end of available
regions, make init_node_heap() prefer the end of such regions too (so
that fragmentation occurs at only one end of a region).
(Adjustment from first version: Use the tail of the region when the
end address's alignment is less than or equal to the beginning one's,
not just when it's less.)
Further, in order to prefer allocations from higher memory locations,
insert memory regions in reverse order in end_boot_allocator(), with
the exception of inserting one region residing on the boot CPU's node
first (for the statically allocated structures - used for the first
node seen - to be used for this node).
Finally, reduce MAX_ORDER on x86 to the maximum useful value (1Gb), so
that the reservation of a page on node boundaries (again leading to
fragmentation) can be avoided as much as possible (having node
boundaries on less than 1Gb aligned addresses is expected to be rare,
if found in practice at all).
Keir Fraser [Mon, 15 Mar 2010 13:23:07 +0000 (13:23 +0000)]
pygrub: further improve grub2 support
* Improve syntax error messages to say what actually went wrong
instead of giving an arbitrary and basically useless
integer.
* Improve handling of quoted values used with the "set" command,
previously only the default variable was special cased to
handle quoting.
* Allow for extra options to the menuentry command, syntax now
appears to be
menuentry "TITLE" --option1 --option2 {...}
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Keir Fraser [Mon, 15 Mar 2010 13:22:06 +0000 (13:22 +0000)]
libxenlight: fix segfault when domid_to_name returns NULL
The function libxl_domid_to_name() can return NULL if the path
/local/domain/%d/name does not exist. This causes a segfault if the
NULL name is later passed as a value to libxl_xs_writev(). I'm
hitting this making a call to libxl_device_vfb_add() from my graphical
switcher application.
This patch modifies xs_writev() and libxl_xs_writev() to skip NULL
values.
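The skip-NULL behaviour can be sketched as below. This is not the libxl code: write_pairs() and print_pair() are hypothetical, and the flat key/value array terminated by a NULL key is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * kvs is a flat array: key0, val0, key1, val1, ..., terminated by a NULL key.
 * emit() stands in for the actual store write and may be NULL.
 * Returns the number of pairs written.
 */
int write_pairs(const char *const *kvs,
                void (*emit)(const char *key, const char *val))
{
    int written = 0;
    size_t i;

    assert(kvs != NULL);
    for (i = 0; kvs[i] != NULL; i += 2) {
        if (kvs[i + 1] == NULL)
            continue;           /* no value: skip rather than dereference NULL */
        if (emit != NULL)
            emit(kvs[i], kvs[i + 1]);
        written++;
    }
    return written;
}

/* Example sink that prints each pair. */
void print_pair(const char *key, const char *val)
{
    printf("%s = %s\n", key, val);
}
```

Skipping the pair keeps the remaining entries intact, so one missing name does not prevent the rest of the device entries from being written.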
Keir Fraser [Mon, 15 Mar 2010 13:16:35 +0000 (13:16 +0000)]
hotplug: Avoid race condition when creating or destroying network bridges
I saw the following message when I created or destroyed two bridges
using the network-bridge script at the same time. The names of the
bridges are of course different, but the script uses a temporary name,
"tmpbridge", to create or destroy the bridges. I think that the message
was caused by "tmpbridge".
SIOCSIFNAME: File exists
This patch avoids the race condition when creating or destroying the
bridges.
Keir Fraser [Thu, 11 Mar 2010 17:40:35 +0000 (17:40 +0000)]
x86: adjust available memory calculation for Dom0 construction
With a large number of CPUs, the amount of memory needed to construct
the vCPU structures for Dom0 becomes significant and hence should be
accounted for when calculating the amount of memory to pass to Dom0.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Add code comments and clean up compute_dom0_nr_pages() invocation.
Keir Fraser [Thu, 11 Mar 2010 17:15:54 +0000 (17:15 +0000)]
VT-d: various initialization fixes
Detect invalid/unsupported configurations in iommu_alloc() - offsets
read from hardware must not lead to exceeding a single page (since
only that much gets mapped). This covers the apparently not uncommon
case of the address pointed to by a DMAR reading as all ones (Linux
for example also checks for this).
Further correct error handling of that function: Without storing the
allocated "struct iommu" instance in the drhd, iommu_free() won't do
anything, and hence all successfully set up pieces would be leaked.
Also keep iommu_free() from calling destroy_irq() when no irq was
ever set up.
Additionally, clear_fault_bits() has no need to read the capabilities
field from I/O memory - it's already cached in "struct iommu".
Finally, simplify print_iommu_regs() and its output, and actually use
this function.
gcc 4.4 incorrectly reports an "array subscript above array bounds"
warning in the flask policydb code, causing the build to fail with
FLASK_ENABLE=y. Rework the code slightly to make it go away.
Signed-off-by: Stephen D. Smalley <sds@tycho.nsa.gov>
This has a pretty serious bug. ioapic_to_iommu() gets returned
drhd->iommu. However, drhd->iommu isn't allocated until part of
iommu_setup(), which is called after enable_x2apic(). Has this ever
worked?
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Keir Fraser [Tue, 9 Mar 2010 17:58:11 +0000 (17:58 +0000)]
Intel VT-D: Don't turn x2APIC if there is a missing DRHD entry for the IOAPIC.
Follow the Linux kernel's lead, in which x2APIC is only turned on
if there is a DRHD entry for all IOAPICs in the system. If we
don't do this, we might enable x2APIC and find that various devices not
covered by the IOAPICs mentioned in the DRHD do not receive any interrupts.
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
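The coverage check amounts to a simple set-inclusion test, sketched below. The function name and the plain arrays of IOAPIC ids are hypothetical; the real code walks the parsed DRHD device scopes instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if every IOAPIC id in ioapic_ids also appears in drhd_ids
 * (the ids named by DRHD device scopes), 0 otherwise.
 */
int all_ioapics_in_drhd(const unsigned *ioapic_ids, size_t n_ioapics,
                        const unsigned *drhd_ids, size_t n_drhd)
{
    size_t i, j;

    assert(ioapic_ids != NULL || n_ioapics == 0);
    assert(drhd_ids != NULL || n_drhd == 0);

    for (i = 0; i < n_ioapics; i++) {
        for (j = 0; j < n_drhd; j++)
            if (ioapic_ids[i] == drhd_ids[j])
                break;
        if (j == n_drhd)
            return 0;   /* this IOAPIC is not covered: leave x2APIC off */
    }
    return 1;
}
```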
Keir Fraser [Tue, 9 Mar 2010 17:53:01 +0000 (17:53 +0000)]
tmem: typo causes incorrect return on out-of-memory
This classic typo in tmem would result in a false positive
report on a tmem "put" operation if an (unfragmented) page
of memory is completely unavailable.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Keir Fraser [Tue, 9 Mar 2010 09:59:59 +0000 (09:59 +0000)]
Add cpufreq sanity check
This fixes bug 1585 http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1585
Root cause: with incorrect BIOS info, the cpufreq driver may not start.
In this case, if the user uses xenpm to manipulate the cpufreq driver,
a NULL pointer dereference will cause a Xen panic. This patch adds a
sanity check and a warning message to fix this issue.
Keir Fraser [Tue, 9 Mar 2010 09:57:25 +0000 (09:57 +0000)]
hvm: correct time offset update in RTC write emulation
mktime takes a month in 1..12 form while tm->tm_mon contains 0..11 so
we need to add 1. Without this fix setting the month back or forward a
month inside the guest would lead to the wrong number of days being
added/subtracted.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
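The off-by-one can be demonstrated with a small day-count helper. The helper below is a hypothetical, self-contained function using the common Gregorian day-count arithmetic with the month given as 1..12; it is not Xen's mktime(), only an illustration of why tm_mon needs the +1.

```c
#include <assert.h>
#include <time.h>

/*
 * Days since 1970-01-01 for a Gregorian date, with mon in 1..12.
 * January and February are treated as months 11 and 12 of the
 * previous year so that the leap day falls at the end of the cycle.
 */
long days_since_epoch(unsigned year, unsigned mon, unsigned day)
{
    assert(mon >= 1 && mon <= 12);
    if (mon <= 2) {
        mon += 10;
        year -= 1;
    } else {
        mon -= 2;
    }
    return (long)(year / 4 - year / 100 + year / 400 + 367 * mon / 12 + day)
           + (long)year * 365 - 719499;
}

/* struct tm stores tm_mon as 0..11 and tm_year as years since 1900. */
long tm_to_days(const struct tm *tm)
{
    assert(tm != NULL);
    return days_since_epoch((unsigned)tm->tm_year + 1900,
                            (unsigned)tm->tm_mon + 1,
                            (unsigned)tm->tm_mday);
}
```

For 9 March 2010 (tm_mon == 2), passing the month without the +1 computes 9 February instead, an error of 28 days; the size of the error depends on the month, which matches the "wrong number of days" symptom described above.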
Keir Fraser [Tue, 9 Mar 2010 09:54:39 +0000 (09:54 +0000)]
x86: Increase the default NR_CPUS to 128.
We have newer systems which have more than 64 CPUs, and users
often complain that some CPUs can't be brought up when running Xen.
The MAX_PHYS_CPUS option can also support more CPUs, but
it is still inconvenient for them, so change the default value to
128.
Keir Fraser [Fri, 5 Mar 2010 14:40:19 +0000 (14:40 +0000)]
Fix Makefile targets that generate several files at once
In a few places in the tree the Makefiles have constructs like this:
one_file another_file:
	$(COMMAND_WHICH_GENERATES_BOTH_AT_ONCE)
This is wrong, because make will run _two copies_ of the same command
at once. This generally causes races and hard-to-reproduce build
failures.
Notably, `make -j4' at the top level will build stubdom libxc twice
simultaneously!
In this patch we replace the occurrences of this construct with the
correct idiom:
one_file: another_file
another_file:
	$(COMMAND_WHICH_GENERATES_BOTH_AT_ONCE)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Keir Fraser [Wed, 3 Mar 2010 17:40:22 +0000 (17:40 +0000)]
Fix domain exit actions that contain hyphen
Domain exit actions that contain a hyphen (e.g. rename-restart) were
not being detected properly when xm is configured to use xenapi.
Domain config containing on_crash="rename-restart" results in
xen53:~ # xm new /tmp/domU.config
Using config file "/tmp/domU.config".
Unexpected error: <type 'exceptions.TypeError'>
This patch fixes the raised exception and at the same time
handles the replacement of hyphen with underscore properly.
Keir Fraser [Wed, 3 Mar 2010 17:39:22 +0000 (17:39 +0000)]
Replace config file parser for "xl"
This provides a replacement config file parser for "xl" based on bison
and flex.
Benefits:
* proper error reporting with line numbers
* parser can understand nearly all "xm" configuration files directly
(doesn't understand Python code but should do everything else)
* parser also understands the ;-infested "xl" style files
* removes the dependency on libconfig
* better checking for certain kinds of mistakes
* eliminates the strange "massage file and try again" code
This is intended to support all config files currently supported by
"xl" and almost all files supported by "xm". (NB that whether a
feature works depends on the implementation of that feature in
xl/libxl of course.)
This patch also introduces a new library "libxlutil" which is mainly
for the benefit of "xl". Users of libxl do not need to use libxlutil,
but they can do so if they want to parse "xl" files without being
"xl".
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Vincent Hanquez <vincent.hanquez@eu.citrix.com>
Keir Fraser [Fri, 26 Feb 2010 17:09:50 +0000 (17:09 +0000)]
Revert 20954:b4041e7bbe1b "paging_domctl: Add missing breaks in switch stmt"
This fixed a fairly innocuous bug (OP_ENABLE/OP_OFF both don't work
properly) but unmasked a much nastier one (turning off shadow mode on
a PV guest crashes the hypervisor).
So, for now, we pick the lesser of two evils. We don't really rely much
on OP_ENABLE/OP_OFF anyway, as it happens.
Keir Fraser [Thu, 25 Feb 2010 20:56:43 +0000 (20:56 +0000)]
ACPI: workaround for S3 fail in two facs tables case
Some legacy BIOSes which support ACPI 2.0+ may expose two FACS tables via
both FADT->FIRMWARE_CTRL and FADT->X_FIRMWARE_CTRL, but only look up the S3
waking_vector in the first one.
Signed-off-by: Wei Gang <gang.wei@intel.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
This patch fixes the following error on ia64:
iommu.c: In function 'init_vtd_hw':
iommu.c:1831: error: 'nr_ioapics' undeclared (first use in this
function)
Keir Fraser [Wed, 24 Feb 2010 10:59:37 +0000 (10:59 +0000)]
vtd: interrupt remapping: be more defensive
1) A buggy BIOS may not report IOAPIC in DRHD. Currently we still try
to enable IR while the IOAPIC RTEs are still in non-remappable format
and the host would hang. The patch detects this case and will not try
to enable IR.
2) Currently HPET's MSI mode doesn't work if IR is enabled because we
have no code to allocate an IRTE for it. Luckily this HW configuration is
rather rare at present, so we can just work around it by only using
HPET's IOAPIC mode for now.