Keir Fraser [Wed, 4 Nov 2009 18:13:32 +0000 (18:13 +0000)]
xenbus: do not hold transaction_mutex when returning to userspace
================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
xenstore-list/3522 is leaving the kernel with locks still held!
1 lock held by xenstore-list/3522:
#0: (&xs_state.transaction_mutex){......}, at: [<c026dc6f>]
xenbus_dev_request_and_reply+0x8f/0xa0
The canonical fix for this type of issue appears to be to maintain a
count manually rather than using an rwsem so do that here.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
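The counting approach can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the struct and function names are hypothetical, and the sleeping and wake-up that a kernel implementation would need are reduced to boolean return values.

```c
#include <stdbool.h>

/* Hypothetical model of count-based transaction tracking. */
struct xs_model {
    int  transaction_count; /* transactions currently open */
    bool suspending;        /* set once a suspend has been requested */
};

/* Begin a transaction. Refused while a suspend is in progress
 * (a kernel implementation would sleep here instead). */
static bool transaction_start(struct xs_model *xs)
{
    if (xs->suspending)
        return false;
    xs->transaction_count++;
    return true;
}

/* End a transaction. Returns true when the last open transaction
 * has finished, the point at which a waiting suspend would be woken. */
static bool transaction_end(struct xs_model *xs)
{
    xs->transaction_count--;
    return xs->transaction_count == 0;
}

/* Request a suspend. It may proceed only when nothing is outstanding. */
static bool suspend_may_proceed(struct xs_model *xs)
{
    xs->suspending = true;
    return xs->transaction_count == 0;
}
```

Because nothing is held across the return to user space, only a counter is left non-zero between the start and end of a transaction, which is what avoids the lockdep complaint quoted above.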
Keir Fraser [Fri, 23 Oct 2009 09:07:22 +0000 (10:07 +0100)]
xen/x86: fix GFP mask handling in dma_alloc_coherent()
Ever since no longer pushing all memory into the DMA zone (c/s 355),
explicitly setting GFP_DMA as well as not masking off GFP_DMA32 was
unnecessarily restricting the pool from which suitable memory could be
taken.
Keir Fraser [Wed, 7 Oct 2009 07:42:00 +0000 (08:42 +0100)]
PVUSB: Fixes and updates
- xenbus state flow changed.
The whole flow is changed to be like netback/netfront.
Reconfiguring/Reconfigured are removed.
- New RING for hotplug notification added.
- USBIF_MAX_SEGMENTS_PER_REQUEST value is changed from 10 to 16.
As a result of this change, RING_SIZE is decreased from 32 to 16.
This affects performance: my flash drive's read throughput
dropped from 29MB/s to 18MB/s in the Linux environment.
However, Windows guests send urbs with a 64kB buffer (64kB = 4kB * 16),
so this change is required.
- New port-setting interface
xenbus_watch_path2 is added to usbback, port-setting interface
is moved from sysfs to xenstore.
Now, the port-rule is directly written to xenstore entry.
Example.
# xenstore-write /local/domain/0/backend/vusb/1/0/port/1 "2-1"
(adding physical bus 2-1 to vusb-1-0 port 1)
- urb dequeue function completed.
usbfront sends an unlink request to usbback, and can cancel an urb
that has been submitted to the backend.
- New USB Spec version (USB1.1/USB2.0) selection support.
usbfront can act as both USB1.1 and USB2.0 virtual host controller
according to the xenstore entry key "usb-ver".
- experimental bus_suspend/bus_resume added to usbfront.
- various cleanups, bugfix, refactoring and codestyle-fix.
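The trade-off between segments per request and ring size can be sketched with some arithmetic. The byte sizes below (64-byte ring header, 20 fixed bytes per request, 8 bytes per segment) are assumptions chosen for illustration, not values taken from the usbif header; only the 4kB page size and the segment counts come from the text above.

```c
#define PAGE_BYTES       4096u /* 4kB page, as in the text above */
#define SRING_HEADER     64u   /* assumed shared-ring header size */
#define REQ_FIXED_BYTES  20u   /* assumed non-segment part of a request */
#define SEG_BYTES        8u    /* assumed size of one segment descriptor */

/* Largest power of two that does not exceed n (0 for n == 0). */
static unsigned round_down_pow2(unsigned n)
{
    unsigned p = 1;

    if (n == 0)
        return 0;
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Number of request slots that fit in a one-page ring. */
static unsigned ring_entries(unsigned segs_per_request)
{
    unsigned entry = REQ_FIXED_BYTES + segs_per_request * SEG_BYTES;

    return round_down_pow2((PAGE_BYTES - SRING_HEADER) / entry);
}

/* Largest buffer a single request can describe, one page per segment. */
static unsigned max_transfer_bytes(unsigned segs_per_request)
{
    return segs_per_request * PAGE_BYTES;
}
```

Under these assumed sizes, 10 segments give 32 ring entries and 16 segments give 16, while 16 segments allow a 64kB buffer per request.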
Keir Fraser [Wed, 7 Oct 2009 06:33:40 +0000 (07:33 +0100)]
xen: re-synchronize ring.h public header
Patch 20267:e9366bed077e modified the definition of sring in the xen
repo's version of ring.h, but not the version in the linux kernel
repo. That change broke pause/resume/shutdown messages from the
blktap2 kernel module, which (for the time being) relies on pad[0]
being at a consistent location in the sring struct. This patch fixes
this regression by resynchronizing the two files.
Keir Fraser [Tue, 25 Aug 2009 13:55:22 +0000 (14:55 +0100)]
xen/x86: make do_settimeofday() return -EPERM when clock can't be changed
Rather than returning success here (without actually having done
anything), it seems more appropriate/conforming to let the caller know
that what he intended to do didn't succeed.
Keir Fraser [Wed, 5 Aug 2009 11:05:34 +0000 (12:05 +0100)]
xen/x86-64: fix Dom0 boot on AMD K8 CPUs
The workaround in question here should be (and is being) applied by
the hypervisor (which doesn't allow any guest - including Dom0 - to
write other than all zeroes or all ones into MCi_CTL).
Do not go beyond ARRAY_SIZE of info->shadow
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
blktap2: make blktap2 work for auto translated mode with hvm domain.
This patch makes blktap2 work for hvm domain with auto translated
mode. (I.e. IA64 HVM domain case as Kuwamura reported its bug.)
blktap2 has introduced a new feature that allows pages from the self
domain to be handled. However, it doesn't work for auto translated mode
because blktap2 relies on p2m table manipulation. But the p2m
doesn't make sense for auto translated mode.
So self grant mapping is used instead.
Just passing the same page to the blktap2 daemon doesn't work because
the page is locked while doing IO, so the page given by the blktap2
block device is already locked. When the blktap2 daemon issues IO on
the page, it tries to lock it again, resulting in deadlock.
So resorted to self grant.
Keir Fraser [Mon, 29 Jun 2009 09:57:46 +0000 (10:57 +0100)]
blktap2: remove warnings.
This patch removes the following warnings on ia64.
> linux-2.6.18-xen.hg/drivers/xen/blktap2/device.c: In function
'blktap_device_finish_request':
> linux-2.6.18-xen.hg/drivers/xen/blktap2/device.c:403: warning:
format '%lld' expects type 'long long int', but argument 7 has type 'uint64_t'
> linux-2.6.18-xen.hg/drivers/xen/blktap2/sysfs.c: In function
'blktap_sysfs_debug_device':
> linux-2.6.18-xen.hg/drivers/xen/blktap2/sysfs.c:276: warning: format
'%llu' expects type 'long long unsigned int', but argument 4 has type
'uint64_t'
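One common way to silence this class of warning is to cast the argument to the type the format expects. Whether this patch uses casts or some other approach is not stated here, so the following is only a sketch with a hypothetical helper name.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit value portably. On platforms where uint64_t is
 * 'unsigned long' (such as ia64), passing it directly to %llu
 * triggers the warning, so cast to the type %llu expects. */
static int format_u64(char *buf, size_t len, uint64_t value)
{
    return snprintf(buf, len, "%llu", (unsigned long long)value);
}
```

The cast is a no-op where uint64_t is already 'unsigned long long', so the same source compiles cleanly on both kinds of platform.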
Keir Fraser [Tue, 23 Jun 2009 10:12:38 +0000 (11:12 +0100)]
xenbus: fix timeout with PV guest and physical CDROM
Specifying a physical CDROM in the configuration of a PV guest, like
disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]
will cause the 300 seconds timeout to occur if there is no physical
CDROM in the tray. The bug is due to the device being Closed (as shown by
the timeout message) but not ready. The configuration is quite bogus, but
this is a regression from when the timeout was 10 seconds only, and
the fix is easy and safe: only check is_ready for connected devices.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
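The rule "only check is_ready for connected devices" can be modelled as a predicate. The struct and function names are hypothetical; the numeric state values are assumed to follow the usual ordering in which Connected comes before Closing and Closed.

```c
#include <stdbool.h>

/* Assumed ordering of the states relevant here. */
enum dev_state {
    ST_INITIALISING = 1,
    ST_CONNECTED    = 4,
    ST_CLOSING      = 5,
    ST_CLOSED       = 6
};

/* Hypothetical stand-in for a device and its driver's readiness. */
struct dev_model {
    enum dev_state state;
    bool ready;
};

/* Returns true if boot should keep waiting for this device.
 * Readiness is consulted only when the device is Connected, so a
 * Closed device (e.g. an empty CDROM tray) no longer causes a wait. */
static bool is_device_connecting(const struct dev_model *d)
{
    return d->state < ST_CONNECTED ||
           (d->state == ST_CONNECTED && !d->ready);
}
```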
Keir Fraser [Thu, 18 Jun 2009 09:32:16 +0000 (10:32 +0100)]
x86-64: do not pass unmanageable amounts of memory to Dom0
Due to address space restrictions it is not possible to successfully
pass more than about 500Gb to a Linux Dom0 unless its kernel specifies
a non-default phys-to-machine map location via XEN_ELFNOTE_INIT_P2M.
For non-Linux Dom0 kernels I can't say whether the limit could be set
to close to 1Tb, but since passing such huge amounts of memory isn't
very useful anyway (and can be enforced via dom0_mem=), the patch
doesn't attempt to guess the kernel type and restricts the memory
amount in all cases.
Keir Fraser [Thu, 18 Jun 2009 09:24:18 +0000 (10:24 +0100)]
Transcendent memory ("tmem") for Linux
Tmem, when called from a tmem-capable (paravirtualized) guest, makes
use of otherwise unutilized ("fallow") memory to create and manage
pools of pages that can be accessed from the guest either as
"ephemeral" pages or as "persistent" pages. In either case, the pages
are not directly addressable by the guest, only copied to and fro via
the tmem interface. Ephemeral pages are a nice place for a guest to
put recently evicted clean pages that it might need again; these pages
can be reclaimed synchronously by Xen for other guests or other uses.
Persistent pages are a nice place for a guest to put "swap" pages to
avoid sending them to disk. These pages retain data as long as the
guest lives, but count against the guest memory allocation.
This patch contains the Linux paravirtualization changes to
complement the tmem Xen patch (xen-unstable c/s 19646). It
implements "precache" (ext3 only as of now), "preswap",
and limited "shared precache" (ocfs2 only as of now) support.
CONFIG options are required to turn on
the support (but in this patch they default to "y"). If
the underlying Xen does not have tmem support or has it
turned off, this is sensed early to avoid nearly all
hypercalls.
Lots of useful prose about tmem can be found at
http://oss.oracle.com/projects/tmem
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Keir Fraser [Tue, 16 Jun 2009 10:58:55 +0000 (11:58 +0100)]
x86: add MCA logging support in DOM0
When an MCE/CMCI error happens (or is found by polling), the related
error information is sent to DOM0 by Xen. This patch fetches the
Xen-logged information by hypercall and converts the Xen-format log
into the Linux MCELOG format. This makes it possible to use the
currently available native Linux mcelog tools.
With this patch, after MCE/CMCI error log information is sent to DOM0,
running the mcelog tools in DOM0 gives the same detailed decoded MCE
information as on native Linux.
Signed-Off-By: Liping Ke <liping.ke@intel.com>
Signed-Off-By: Yunhong Jiang <yunhong.jiang@intel.com>
Acked-By: Jan Beulich <jbeulich@novell.com>
Keir Fraser [Tue, 16 Jun 2009 10:09:39 +0000 (11:09 +0100)]
blktap: Indirection in vm_area_struct->vm_private_data
The recent patch in linux-2.6.18.hg (878: eba6fe6d8d53) changed the
way that the foreign map is stored in vm_area_struct. Currently the
blktap (not blktap2) implementation is internally inconsistent, which
triggers a kernel bug when a tap:aio disk is used (dump attached at the
end of the email).
Keir Fraser [Mon, 8 Jun 2009 11:23:24 +0000 (12:23 +0100)]
pci: fix pcie-aer recovery mechanism defects.
When an AER error happens, if the device is not hidden or assigned,
exit. If the device is assigned but not yet connected by a PV guest, or
is owned by an HVM guest, kill the guest. [sh_info is NULL]
Signed-Off-By: Liping Ke <liping.ke@intel.com>
Signed-Off-By: Yunhong Jiang <yunhong.jiang@intel.com>
Keir Fraser [Fri, 5 Jun 2009 13:01:20 +0000 (14:01 +0100)]
balloon: try harder to balloon up under memory pressure.
Currently if the balloon driver is unable to increase the guest's
reservation it assumes the failure was due to reaching its full
allocation, gives up on the ballooning operation and records the limit
it reached as the "hard limit". The driver will not try again until
the target is set again (even to the same value).
However it is possible that ballooning has in fact failed due to
memory pressure in the host and therefore it is desirable to keep
attempting to reach the target in case memory becomes available. The
most likely scenario is that some guests are ballooning down while
others are ballooning up and therefore there is temporary memory
pressure while things stabilise. You would not expect a well behaved
toolstack to ask a domain to balloon to more than its allocation nor
would you expect it to deliberately over-commit memory by setting
balloon targets which exceed the total host memory.
This patch drops the concept of a hard limit and causes the balloon
driver to retry increasing the reservation on a timer in the same
manner as when decreasing the reservation.
Also if we partially succeed in increasing the reservation
(i.e. receive fewer pages than we asked for) then we may as well keep
those pages rather than returning them to Xen.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
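The retry behaviour described above can be sketched as one step of a timer callback. All names are hypothetical and the hypervisor is reduced to a number of pages it can currently give; this is a model of the policy, not the driver code.

```c
#include <stdbool.h>

/* Hypothetical model of the balloon driver's bookkeeping. */
struct balloon_model {
    unsigned long current_pages; /* pages the guest currently has */
    unsigned long target_pages;  /* pages the toolstack asked for */
};

/* One timer tick of ballooning up. Any pages granted are kept, even
 * if fewer than requested. Returns true if the target has not been
 * reached and the timer should fire again; there is no hard limit
 * that stops further attempts. */
static bool balloon_increase_tick(struct balloon_model *b,
                                  unsigned long host_can_give)
{
    unsigned long want, got;

    if (b->current_pages >= b->target_pages)
        return false;

    want = b->target_pages - b->current_pages;
    got = want < host_can_give ? want : host_can_give;
    b->current_pages += got; /* keep a partial allocation */

    return b->current_pages < b->target_pages;
}
```

A tick on which the host can give nothing simply leaves the state unchanged and asks for another retry, which covers the case of temporary memory pressure.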
Keir Fraser [Thu, 4 Jun 2009 09:33:52 +0000 (10:33 +0100)]
linux: fix blkback/blktap2 interaction
blkback's page map code needs to be accessible to both blkback and
blktap2, irrespective of whether either or both are modules. The
most immediate solution is to break it out into a separate, library-
like component that doesn't need building if either of the two
consumers is configured off, and that gets built as a module if both
consumers are modules.
Also fix the dummy implementation of blkback_pagemap_read(), since
using BUG() there doesn't compile.
Keir Fraser [Wed, 3 Jun 2009 10:22:24 +0000 (11:22 +0100)]
Dom0 PCI: fix SR-IOV function dependency link problem
PCIe Root Complex Integrated Endpoint does not implement ARI, so this
kind of endpoint uses 3-bit function number. The function dependency
link of the integrated endpoint should be calculated using the device
number field in conjunction with the value from function dependency
link register.
Normal SR-IOV endpoint always implements ARI and the function
dependency link register contains 8-bit function number (i.e. `devfn'
from software perspective).
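The two cases can be sketched as follows. The helper and macro names are hypothetical; the devfn layout (5-bit device number, 3-bit function number) is the conventional one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional devfn packing: device number in bits 7..3,
 * function number in bits 2..0. */
#define DEVFN(slot, func) \
    ((uint8_t)((((slot) & 0x1f) << 3) | ((func) & 0x07)))
#define SLOT(devfn) (((devfn) >> 3) & 0x1f)

/* Compute the devfn that the function dependency link refers to.
 * For a Root Complex Integrated Endpoint (no ARI) the register holds
 * a 3-bit function number, so the device number is taken from the
 * endpoint itself. Otherwise the register is already a full 8-bit
 * function number, i.e. a devfn. */
static uint8_t dependency_link_devfn(uint8_t devfn, uint8_t link_reg,
                                     bool rc_integrated_endpoint)
{
    if (rc_integrated_endpoint)
        return DEVFN(SLOT(devfn), link_reg);
    return link_reg;
}
```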
Keir Fraser [Thu, 28 May 2009 09:04:26 +0000 (10:04 +0100)]
blktap2: add tlb flush properly.
xen_invlpg() flushes the tlb on its own cpu, but a tlb flush is needed
on all cpus. So replace xen_invlpg() with more appropriate calls.
It may be possible to reduce the number of tlb flushes.
This patch also makes blktap2 compile on ia64, because xen_invlpg()
is x86-specific.
Keir Fraser [Thu, 28 May 2009 08:57:49 +0000 (09:57 +0100)]
PCI pass through: PCIe IO space multiplexing
This is required for more than 16 HVM domains to boot from
PCIe pass-through devices.
Linux as dom0 exclusively assigns IO space to downstream PCI bridges
and the assignment unit of PCI bridge IO space is 4K. So only up
to 16 PCIe devices can be accessed via IO space within 64K IO ports.
PCI expansion ROM BIOS often uses IO port access to boot from the
device, so in a virtualized environment, it means only up to 16 guest
domains can boot from pass-through devices.
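The limit of 16 follows directly from the two sizes given above. A minimal sketch of the arithmetic, with hypothetical names:

```c
#include <stdbool.h>

#define IO_PORT_SPACE  0x10000u /* 64K IO ports */
#define BRIDGE_IO_UNIT 0x1000u  /* 4K bridge IO window granularity */

/* Number of exclusive bridge IO windows that fit in the port space. */
static unsigned max_io_windows(void)
{
    return IO_PORT_SPACE / BRIDGE_IO_UNIT;
}

/* Without sharing, each device behind its own bridge needs a window. */
static bool devices_fit_without_sharing(unsigned devices)
{
    return devices <= max_io_windows();
}
```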
This patch allows PCIe IO space sharing of pass-through device.
- reassign IO space of PCIe devices specified by
"guestiomuldev=[<segment>:]<bus>:<dev>[,[<segment:><bus>:dev]][,...]"
to be shared.
This is implemented as Linux PCI quirk fixup.
The sharing unit is a PCIe switch, i.e. the IO space of the end point
devices under the same switch will be shared. If there is more than
one switch, two areas of IO space will be used.
- Add the driver which arbitrates accesses to the multiplexed PCIe
IO space. qemu-dm will use this later.
Limitation:
IO port of IO shared devices can't be accessed from dom0 Linux device
driver. But this wouldn't be a big issue because PCIe specification
discourages the use of IO space and recommends that IO space should be
used only for bootable device with ROM code. OS device driver should
work without IO space access.
Add a mode to netback in which it tries to return TX responses in a
slightly less predictable order. This can make some otherwise
hardware-specific frontend bugs reproduce much more easily, which is
obviously rather useful when trying to fix them. It'd also be quite
useful for making sure they didn't happen in the first place.
Randomisation is only done if a module parameter is set, and defaults
to off. It certainly isn't something you'd ever want to run with in
production, but it might be useful for other people developing
frontend drivers. I don't know if that's considered an adequate
reason to apply it, but, if anyone wants it, here it is.
Signed-off-by: Steven Smith <steven.smith@eu.citrix.com>
Keir Fraser [Tue, 26 May 2009 10:23:16 +0000 (11:23 +0100)]
blktap2: a completely rewritten blktap implementation
Benefits to blktap2 over the old version of blktap:
* Isolation from xenstore - Blktap devices are now created directly on
the linux dom0 command line, rather than being spawned in response
to XenStore events. This is handy for debugging, makes blktap
generally easier to work with, and is a step toward a generic
user-level block device implementation that is not Xen-specific.
* Improved tapdisk infrastructure: simpler request forwarding, new
request scheduler, request merging, more efficient use of AIO.
* Improved tapdisk error handling and memory management. No
allocations on the block data path, IO retry logic to protect
guests from transient block device failures. This has been tested
and is known to work on weird environments such as NFS soft mounts.
* Pause and snapshot of live virtual disks (see xmsnap script).
* VHD support. The VHD code in this release has been rigorously
tested, and represents a very mature implementation of the VHD
image format.
* No more duplication of mechanism with blkback. The blktap kernel
module has changed dramatically from the original blktap. Blkback
is now always used to talk to Xen guests, blktap just presents a
Linux gendisk that blkback can export. This is done while
preserving the zero-copy data path from domU to physical device.
These patches deprecate the old blktap code, which can hopefully be
removed from the tree completely at some point in the future.
Keir Fraser [Tue, 19 May 2009 13:45:50 +0000 (14:45 +0100)]
xenbus: Allow lazy init in case xenstored runs in a separate minios domain.
Here's an explanation of the states:
It starts out in XENBUS_XSD_UNCOMMITTED.
As the master xenbus (the one local to xenstored), it will receive an
mmap from xenstore, putting it in XENBUS_XSD_LOCAL_INIT. This enables
the wake_waiting IRQ, which will put it in XENBUS_XSD_LOCAL_READY.
Alternatively, as a slave xenbus, it will receive an ioctl from the
xenstore domain builder, putting it in XENBUS_XSD_FOREIGN_INIT. This
enables the wake_waiting IRQ, which will put it in
XENBUS_XSD_FOREIGN_READY.
DomU's are immediately initialized to XENBUS_XSD_FOREIGN_READY.
Signed-off-by: Diego Ongaro <diego.ongaro@citrix.com>
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
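The state flow explained above can be written down as a small transition function. The state names are the ones from the explanation; the enum ordering, the event names and the function itself are hypothetical and only model the described flow.

```c
/* States named in the explanation above (ordering is arbitrary here). */
enum xenbus_xsd_state {
    XENBUS_XSD_UNCOMMITTED,
    XENBUS_XSD_FOREIGN_INIT,
    XENBUS_XSD_FOREIGN_READY,
    XENBUS_XSD_LOCAL_INIT,
    XENBUS_XSD_LOCAL_READY
};

/* Hypothetical event names for the three triggers described. */
enum xsd_event {
    XSD_EV_MMAP,     /* mmap from a local xenstored */
    XSD_EV_IOCTL,    /* ioctl from the xenstore domain builder */
    XSD_EV_WAKE_IRQ  /* wake_waiting IRQ fired */
};

/* Return the next state; events that do not apply leave it unchanged. */
static enum xenbus_xsd_state xsd_next(enum xenbus_xsd_state s,
                                      enum xsd_event ev)
{
    switch (s) {
    case XENBUS_XSD_UNCOMMITTED:
        if (ev == XSD_EV_MMAP)
            return XENBUS_XSD_LOCAL_INIT;
        if (ev == XSD_EV_IOCTL)
            return XENBUS_XSD_FOREIGN_INIT;
        break;
    case XENBUS_XSD_LOCAL_INIT:
        if (ev == XSD_EV_WAKE_IRQ)
            return XENBUS_XSD_LOCAL_READY;
        break;
    case XENBUS_XSD_FOREIGN_INIT:
        if (ev == XSD_EV_WAKE_IRQ)
            return XENBUS_XSD_FOREIGN_READY;
        break;
    default:
        break;
    }
    return s;
}
```

A DomU would skip this machine entirely and start in XENBUS_XSD_FOREIGN_READY.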
Keir Fraser [Tue, 19 May 2009 13:42:04 +0000 (14:42 +0100)]
xenbus: Remove an assumption that 'initial domain' is dom0.
Signed-off-by: Diego Ongaro <diego.ongaro@citrix.com>
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Keir Fraser [Thu, 14 May 2009 09:09:15 +0000 (10:09 +0100)]
xen/x86: don't initialize cpu_data[]'s apicid field on generic code
Afaict, this is not only redundant with the initialization done in
drivers/xen/core/smpboot.c, but actually results - at least for
secondary CPUs - in the Xen-specific value written to be later
overwritten with whatever the generic code determines (with no
guarantee that the two values are identical).
Keir Fraser [Thu, 14 May 2009 09:08:40 +0000 (10:08 +0100)]
xen/i386: hypervisor_callback adjustments
The missing check of the interrupted code's code selector in
hypervisor_callback() allowed a user mode application to oops (and
perhaps crash) the kernel.
Further adjustments:
- the 'main' critical region does not include the jmp following the
disabling of interrupts
- the sysexit_[se]crit range checks got broken at some point - the
sysexit critical region is always at higher addresses than the
'main' one, rendering the check pointless (but consuming execution
time); since the supervisor mode kernel isn't actively used afaict,
I moved that code into an #ifdef using a hypothetical config option
- the use of a numeric label across more than 300 lines of code always
seemed pretty fragile to me, so the patch replaces this with a local
named label
- streamlined the critical_region_fixup code to eliminate a branch
Keir Fraser [Thu, 14 May 2009 09:08:10 +0000 (10:08 +0100)]
xen: miscellaneous cleanup
- add two missing unwind annotations
- mark remaining struct file_operations instances const
- use get_capacity() instead of raw access to the capacity field
- use assert_spin_locked() instead of BUG_ON(!spin_is_locked())
- use clear_tsk_thread_flag() instead of clear_ti_thread_flag()
- remove dead variable cpu_state
vm_area_struct::vm_private_data is used
by get_user_pages(), so we can't override
it. So in order to make blktap work, set it
to an array of struct page*.
Without mm->mmap_sem, the virtual mapping can be changed,
so remembering the vma which was passed to the mmap callback
is bogus because the vma can later be freed or changed.
So don't remember the vma; put the necessary information into
tap_blkif_t and use find_vma() to get the necessary vmas.
Dereferencing filp->private_data->vma in the file_operations.release
actor isn't permitted, as the vma generally has been destroyed by that
time. The kfree()ing of vma->vm_private_data must be done in the
vm_operations.close actor, and the call to zap_page_range() seems
redundant with the caller of that actor altogether.
Without this patch, fakephp with reassigndev fails
to allocate memory resource.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
commit bf4162bcf82ebc3258d6bc0ddd6453132abde72d
Author: Darrick J. Wong <djwong@us.ibm.com>
Date: Tue Nov 25 13:51:44 2008 -0800
PCI hotplug: fakephp: Allocate PCI resources before adding the
device
For PCI devices, pci_bus_assign_resources() must be called to set
up the pci_device->resource array before pci_bus_add_devices() can
be called, else attempts to load drivers results in BAR collision
errors where there are none.
This is not done in fakephp, so devices can be "unplugged" but
scanning the parent bus won't bring the devices back due to
resource unallocation. Move the pci_bus_add_device-calling logic
into pci_rescan_bus and preface it with a call to
pci_bus_assign_resources so that we only have to (re)allocate
resources once per bus where a new device is found.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
dom0 linux: support SBDF with "guestdev=" and remove "reassigndev="
When we don't need to reassign resources and use device path,
pciback.hide= boot parameter can be used. The parameter is also needed
for backward compatibility.
pciback.hide=(00:01.0)(00:02.0)
When we need to reassign resources or use device path, guestdev= boot
parameter can be used. reassign_resources boot parameter is needed to
reassign resources, too.
PCI: sync up the SR-IOV changes between Dom0 and upstream kernel
The SR-IOV patches for the upstream kernel are finally in-tree. This
patch backports some minor changes that appeared in the upstream
kernel after the Dom0 patches were checked-in.
pci: Do not disable I/O decoding on reassigning resource.
When I reserve UHCI for guest domain with "guestdev=" and
"reassign_resources" parameters, spurious interrupts occurred.
The reason is that UHCI is not reset by uhci_check_and_reset_hc
because I/O decoding is disabled. UHCI keeps asserting the interrupt
line. As a result spurious interrupts occur.
The patch does not disable I/O decoding. It disables only memory
decoding. So UHCI is reset and spurious interrupts do not occur.
Keir Fraser [Tue, 31 Mar 2009 11:01:50 +0000 (12:01 +0100)]
usbfront: do not assume sequentially mapped pages
xenhcd_gnttab_map in usbfront-q.c looks up the mfn of the start of the
usb transfer buffer. But the buffer may span several pages, and the
current code simply increments the obtained mfn. Needless to say this
is an unwarranted assumption. It causes large transfers to be
corrupted and/or to overwrite other parts of memory.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
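The difference between incrementing the first mfn and looking up each page can be shown with a small model. Everything here is hypothetical: the translation callback and the demo table stand in for the real pfn-to-mfn lookup, and the function is not the usbfront code.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical translation hook: pseudo-physical frame -> machine frame. */
typedef uint64_t (*pfn_to_mfn_fn)(uint64_t pfn);

/* Demo mapping that is deliberately not contiguous. */
static uint64_t demo_pfn_to_mfn(uint64_t pfn)
{
    static const uint64_t table[4] = { 0x900, 0x123, 0x7f0, 0x044 };

    return table[pfn % 4];
}

/* Collect the machine frame of every page touched by [addr, addr+len).
 * Each page is translated on its own instead of assuming that the
 * frames following the first one are sequential.
 * Returns the number of entries written to mfns. */
static size_t collect_mfns(uint64_t addr, size_t len, pfn_to_mfn_fn xlate,
                           uint64_t *mfns, size_t max)
{
    uint64_t pfn, last;
    size_t n = 0;

    if (len == 0)
        return 0;

    last = (addr + len - 1) >> PAGE_SHIFT;
    for (pfn = addr >> PAGE_SHIFT; pfn <= last && n < max; pfn++)
        mfns[n++] = xlate(pfn);
    return n;
}
```

With the demo table, a buffer spanning three pages yields 0x900, 0x123 and 0x7f0, whereas incrementing the first mfn would have produced 0x900, 0x901 and 0x902 and pointed the transfer at unrelated memory.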
Keir Fraser [Tue, 31 Mar 2009 11:00:53 +0000 (12:00 +0100)]
sfc_netfront: Only clear tx_skb when ready for netif_wake_queue
(doing otherwise could result in a lost packet) and document use of
locks to protect tx_skb
Keir Fraser [Tue, 31 Mar 2009 10:59:10 +0000 (11:59 +0100)]
net sfc: Update sfc and sfc_resource driver to latest release
...and update sfc_netfront, sfc_netback, sfc_netutil for any API changes
sfc_netback: Fix asymmetric use of SFC buffer table alloc and free
sfc_netback: Clean up if no SFC accel device found
sfc_netback: Gracefully handle case where page grant fails
sfc_netback: Disable net acceleration if the physical link goes down
sfc_netfront: Less verbose error messages, more verbose counters for
rx discard errors
sfc_netfront: Gracefully handle case where SFC netfront fails during
initialisation
Keir Fraser [Fri, 20 Mar 2009 09:00:58 +0000 (09:00 +0000)]
xen: swiotlb allocations do not need to come from low memory
Other than on native, where using the _low variants of alloc_bootmem()
is indeed a requirement for swiotlb, on Xen this is not needed. Using
the _low variants has the potential of preventing systems from booting
when they have lots of memory, due to the way the bootmem allocator
works: It allocates memory from bottom to top. Thus, if other large
(but not _low) allocations (memmap, large system hash tables)
mostly consume the memory below 4Gb, the swiotlb allocations can
fail. (This is equally so for native, but cannot be that easily fixed
there.)
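The failure mode can be illustrated with a heavily simplified bottom-to-top allocator. This is a model under stated assumptions: the real bootmem allocator is not a simple bump pointer, and all names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOW_LIMIT  (1ULL << 32) /* the 4Gb boundary */
#define ALLOC_FAIL UINT64_MAX

/* Simplified model: memory below next_free is in use. */
struct bootmem_model {
    uint64_t next_free; /* lowest address still available */
    uint64_t top;       /* end of memory */
};

/* Allocate size bytes from the bottom up. A 'low' request must fit
 * entirely below the 4Gb boundary. Returns the address, or
 * ALLOC_FAIL if the request cannot be satisfied. */
static uint64_t bootmem_alloc(struct bootmem_model *bm, uint64_t size,
                              bool low)
{
    uint64_t limit = (low && bm->top > LOW_LIMIT) ? LOW_LIMIT : bm->top;
    uint64_t addr = bm->next_free;

    if (addr > limit || size > limit - addr)
        return ALLOC_FAIL;
    bm->next_free = addr + size;
    return addr;
}
```

In this model, once earlier large allocations have consumed everything below 4Gb, a later 'low' request fails while the same request without the restriction still succeeds.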