Add a mode to netback in which it tries to return TX responses in a
slightly less predictable order. This can make some otherwise
hardware-specific frontend bugs reproduce much more easily, which is
obviously rather useful when trying to fix them. It'd also be quite
useful for making sure they didn't happen in the first place.
Randomisation is only done if a module parameter is set, and defaults
to off. It certainly isn't something you'd ever want to run with in
production, but it might be useful for other people developing
frontend drivers. I don't know if that's considered an adequate
reason to apply it, but, if anyone wants it, here it is.
Signed-off-by: Steven Smith <steven.smith@eu.citrix.com>
Keir Fraser [Tue, 26 May 2009 10:23:16 +0000 (11:23 +0100)]
blktap2: a completely rewritten blktap implementation
Benefits to blktap2 over the old version of blktap:
* Isolation from xenstore - Blktap devices are now created directly on
the linux dom0 command line, rather than being spawned in response
to XenStore events. This is handy for debugging, makes blktap
generally easier to work with, and is a step toward a generic
user-level block device implementation that is not Xen-specific.
* Improved tapdisk infrastructure: simpler request forwarding, new
request scheduler, request merging, more efficient use of AIO.
* Improved tapdisk error handling and memory management. No
allocations on the block data path, IO retry logic to protect guests
from transient block device failures. This has been tested and is known
to work in weird environments such as NFS soft mounts.
* Pause and snapshot of live virtual disks (see xmsnap script).
* VHD support. The VHD code in this release has been rigorously
tested, and represents a very mature implementation of the VHD image
format.
* No more duplication of mechanism with blkback. The blktap kernel
module has changed dramatically from the original blktap. Blkback
is now always used to talk to Xen guests, blktap just presents a
Linux gendisk that blkback can export. This is done while
preserving the zero-copy data path from domU to physical device.
These patches deprecate the old blktap code, which can hopefully be
removed from the tree completely at some point in the future.
Keir Fraser [Tue, 19 May 2009 13:45:50 +0000 (14:45 +0100)]
xenbus: Allow lazy init in case xenstored runs in a separate minios domain.
Here's an explanation of the states:
It starts out in XENBUS_XSD_UNCOMMITTED.
As the master xenbus (the one local to xenstored), it will receive an
mmap from xenstore, putting it in XENBUS_XSD_LOCAL_INIT. This enables
the wake_waiting IRQ, which will put it in XENBUS_XSD_LOCAL_READY.
Alternatively, as a slave xenbus, it will receive an ioctl from the
xenstore domain builder, putting it in XENBUS_XSD_FOREIGN_INIT. This
enables the wake_waiting IRQ, which will put it in
XENBUS_XSD_FOREIGN_READY.
DomU's are immediately initialized to XENBUS_XSD_FOREIGN_READY.
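The transitions described above can be sketched as a small state machine. This is only an illustration: the state names come from the description, while the event names, the enum values and the function xsd_next() are hypothetical and are not taken from the patch.

```c
#include <assert.h>

/* State names as given in the description; numeric values are arbitrary. */
enum xenbus_xsd_state {
    XENBUS_XSD_UNCOMMITTED,
    XENBUS_XSD_FOREIGN_INIT,
    XENBUS_XSD_FOREIGN_READY,
    XENBUS_XSD_LOCAL_INIT,
    XENBUS_XSD_LOCAL_READY
};

/* Hypothetical event names, introduced only for this sketch. */
enum xsd_event {
    XSD_EV_MMAP,  /* xenstored mmaps the interface (master xenbus) */
    XSD_EV_IOCTL, /* xenstore domain builder issues the ioctl (slave xenbus) */
    XSD_EV_IRQ    /* the wake_waiting IRQ fires */
};

/* Return the next state; events that do not apply leave the state unchanged. */
enum xenbus_xsd_state xsd_next(enum xenbus_xsd_state s, enum xsd_event ev)
{
    switch (s) {
    case XENBUS_XSD_UNCOMMITTED:
        if (ev == XSD_EV_MMAP)
            return XENBUS_XSD_LOCAL_INIT;
        if (ev == XSD_EV_IOCTL)
            return XENBUS_XSD_FOREIGN_INIT;
        break;
    case XENBUS_XSD_LOCAL_INIT:
        if (ev == XSD_EV_IRQ)
            return XENBUS_XSD_LOCAL_READY;
        break;
    case XENBUS_XSD_FOREIGN_INIT:
        if (ev == XSD_EV_IRQ)
            return XENBUS_XSD_FOREIGN_READY;
        break;
    default:
        break;
    }
    return s;
}

/* Example: the master xenbus sees the mmap first, then the IRQ. */
void xsd_master_example(void)
{
    enum xenbus_xsd_state s = XENBUS_XSD_UNCOMMITTED;

    s = xsd_next(s, XSD_EV_MMAP);
    s = xsd_next(s, XSD_EV_IRQ);
    assert(s == XENBUS_XSD_LOCAL_READY);
}
```

A DomU would simply start in XENBUS_XSD_FOREIGN_READY and never pass through the switch.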
Signed-off-by: Diego Ongaro <diego.ongaro@citrix.com>
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Keir Fraser [Tue, 19 May 2009 13:42:04 +0000 (14:42 +0100)]
xenbus: Remove an assumption that 'initial domain' is dom0.
Signed-off-by: Diego Ongaro <diego.ongaro@citrix.com>
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Keir Fraser [Thu, 14 May 2009 09:09:15 +0000 (10:09 +0100)]
xen/x86: don't initialize cpu_data[]'s apicid field on generic code
Afaict, this is not only redundant with the initialization done in
drivers/xen/core/smpboot.c, but actually results - at least for
secondary CPUs - in the Xen-specific value written to be later
overwritten with whatever the generic code determines (with no
guarantee that the two values are identical).
Keir Fraser [Thu, 14 May 2009 09:08:40 +0000 (10:08 +0100)]
xen/i386: hypervisor_callback adjustments
The missing check of the interrupted code's code selector in
hypervisor_callback() allowed a user mode application to oops (and
perhaps crash) the kernel.
Further adjustments:
- the 'main' critical region does not include the jmp following the
disabling of interrupts
- the sysexit_[se]crit range checks got broken at some point - the
sysexit critical region is always at higher addresses than the 'main'
one, rendering the check pointless (but consuming execution time);
since the supervisor mode kernel isn't actively used afaict, I moved
that code into an #ifdef using a hypothetical config option
- the use of a numeric label across more than 300 lines of code always
seemed pretty fragile to me, so the patch replaces this with a local
named label
- streamlined the critical_region_fixup code to eliminate a branch
Keir Fraser [Thu, 14 May 2009 09:08:10 +0000 (10:08 +0100)]
xen: miscellaneous cleanup
- add two missing unwind annotations
- mark remaining struct file_operations instances const
- use get_capacity() instead of raw access to the capacity field
- use assert_spin_locked() instead of BUG_ON(!spin_is_locked())
- use clear_tsk_thread_flag() instead of clear_ti_thread_flag()
- remove dead variable cpu_state
vm_area_struct::vm_private_data is used
by get_user_pages(), so we can't override
it. So in order to make blktap work, set it
to an array of struct page*.
Without mm->mmap_sem held, the virtual mapping can be changed,
so remembering the vma which was passed to the mmap callback
is bogus because the vma can later be freed or changed.
So don't remember the vma; put the necessary information into
tap_blkif_t and use find_vma() to get the necessary vmas.
Dereferencing filp->private_data->vma in the file_operations.release
actor isn't permitted, as the vma generally has been destroyed by that
time. The kfree()ing of vma->vm_private_data must be done in the
vm_operations.close actor, and the call to zap_page_range() seems
redundant with the caller of that actor altogether.
Without this patch, fakephp with reassigndev fails
to allocate memory resource.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
commit bf4162bcf82ebc3258d6bc0ddd6453132abde72d
Author: Darrick J. Wong <djwong@us.ibm.com>
Date: Tue Nov 25 13:51:44 2008 -0800
PCI hotplug: fakephp: Allocate PCI resources before adding the device
For PCI devices, pci_bus_assign_resources() must be called to set
up the pci_device->resource array before pci_bus_add_devices() can
be called, else attempts to load drivers result in BAR collision
errors where there are none.
This is not done in fakephp, so devices can be "unplugged" but
scanning the parent bus won't bring the devices back due to resource
unallocation. Move the pci_bus_add_device-calling logic into
pci_rescan_bus and preface it with a call to pci_bus_assign_resources
so that we only have to (re)allocate resources once per bus where a
new device is found.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
dom0 linux: support SBDF with "guestdev=" and remove "reassigndev="
When we don't need to reassign resources and use device path,
pciback.hide= boot parameter can be used. The parameter is also needed
for backward compatibility.
pciback.hide=(00:01.0)(00:02.0)
When we need to reassign resources or use device path, guestdev= boot
parameter can be used. reassign_resources boot parameter is needed to
reassign resources, too.
PCI: sync up the SR-IOV changes between Dom0 and upstream kernel
The SR-IOV patches for the upstream kernel are finally in-tree. This
patch backports some minor changes that appeared in the upstream
kernel after the Dom0 patches were checked-in.
pci: Do not disable I/O decoding on reassigning resource.
When I reserve UHCI for guest domain with "guestdev=" and
"reassign_resources" parameters, spurious interrupts occurred.
The reason is that UHCI is not reset by uhci_check_and_reset_hc
because I/O decoding is disabled. UHCI keeps asserting the interrupt
line. As a result spurious interrupts occur.
The patch does not disable I/O decoding. It disables only memory
decoding. So UHCI is reset and spurious interrupts do not occur.
Keir Fraser [Tue, 31 Mar 2009 11:01:50 +0000 (12:01 +0100)]
usbfront: do not assume sequentially mapped pages
xenhcd_gnttab_map in usbfront-q.c looks up the mfn of the start of the
usb transfer buffer. But the buffer may span several pages, and the
current code simply increments the obtained mfn. Needless to say this
is an unwarranted assumption. It causes large transfers to be
corrupted and/or to overwrite other parts of memory.
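The safe approach is to translate every page of the buffer separately instead of incrementing the first frame number. The following is a minimal userspace sketch of that idea, not the usbfront code: the names buffer_to_mfns() and sketch_pfn_to_mfn(), the 4 KiB page size and the lookup table are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Hypothetical stand-in for a per-page pseudo-physical to machine lookup. */
typedef uint64_t (*pfn_to_mfn_fn)(uint64_t pfn);

/* Toy translation: consecutive pfns deliberately map to scattered mfns. */
uint64_t sketch_pfn_to_mfn(uint64_t pfn)
{
    static const uint64_t table[4] = { 0x900, 0x123, 0x7ff, 0x042 };

    return table[pfn % 4];
}

/*
 * Fill mfns[] with one frame per page touched by [addr, addr + len).
 * Each page is looked up on its own; nothing is derived by incrementing
 * the first frame. Returns the page count, or 0 if len is 0 or the
 * buffer spans more than max pages.
 */
size_t buffer_to_mfns(uint64_t addr, size_t len, pfn_to_mfn_fn xlate,
                      uint64_t *mfns, size_t max)
{
    uint64_t first, last;
    size_t i, n;

    if (len == 0)
        return 0;
    first = addr >> SKETCH_PAGE_SHIFT;
    last = (addr + len - 1) >> SKETCH_PAGE_SHIFT;
    n = (size_t)(last - first + 1);
    if (n > max)
        return 0;
    for (i = 0; i < n; i++)
        mfns[i] = xlate(first + i);
    return n;
}

/* Example: an unaligned 8 KiB buffer touches three pages. */
void buffer_to_mfns_example(void)
{
    uint64_t mfns[4];
    size_t n = buffer_to_mfns(0x0ff0, 0x2000, sketch_pfn_to_mfn, mfns, 4);

    assert(n == 3);
    assert(mfns[1] != mfns[0] + 1); /* incrementing would have been wrong */
}
```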
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Keir Fraser [Tue, 31 Mar 2009 11:00:53 +0000 (12:00 +0100)]
sfc_netfront: Only clear tx_skb when ready for netif_wake_queue
(doing otherwise could result in a lost packet) and document use of
locks to protect tx_skb
Keir Fraser [Tue, 31 Mar 2009 10:59:10 +0000 (11:59 +0100)]
net sfc: Update sfc and sfc_resource driver to latest release
...and update sfc_netfront, sfc_netback, sfc_netutil for any API changes
sfc_netback: Fix asymmetric use of SFC buffer table alloc and free
sfc_netback: Clean up if no SFC accel device found
sfc_netback: Gracefully handle case where page grant fails
sfc_netback: Disable net acceleration if the physical link goes down
sfc_netfront: Less verbose error messages, more verbose counters for
rx discard errors
sfc_netfront: Gracefully handle case where SFC netfront fails during
initialisation
Keir Fraser [Fri, 20 Mar 2009 09:00:58 +0000 (09:00 +0000)]
xen: swiotlb allocations do not need to come from low memory
Other than on native, where using the _low variants of alloc_bootmem()
is indeed a requirement for swiotlb, on Xen this is not needed. Using
the _low variants has the potential of preventing systems from booting
when they have lots of memory, due to the way the bootmem allocator
works: It allocates memory from bottom to top. Thus, if other large
(but not _low) allocations (memmap, large system hash tables)
mostly consume the memory below 4Gb, the swiotlb allocations can
fail. (This is equally so for native, but cannot be that easily fixed
there.)
Keir Fraser [Thu, 19 Mar 2009 10:21:46 +0000 (10:21 +0000)]
PCI: pass ARI and SR-IOV device information to the hypervisor
PCIe Alternative Routing-ID Interpretation (ARI) ECN defines the Extended
Function -- a function whose function number is greater than 7 within an
ARI Device. Intel VT-d spec 1.2 section 8.3.2 specifies that the Extended
Function is under the scope of the same remapping unit as the traditional
function. The hypervisor needs to know if a function is Extended
Function so it can find proper DMAR for it.
And section 8.3.3 specifies that the SR-IOV Virtual Function is under the
scope of the same remapping unit as the Physical Function. The hypervisor
also needs to know if a function is the Virtual Function and which
Physical Function it's associated with for same reason.
Keir Fraser [Thu, 19 Mar 2009 10:21:21 +0000 (10:21 +0000)]
PCI: save and restore PCIe 2.0 registers
PCIe 2.0 defines several new registers (Device Control 2, Link Control
2, and Slot Control 2). Save and restore them in pci_save_pcie_state()
and pci_restore_pcie_state().
Keir Fraser [Thu, 19 Mar 2009 10:06:52 +0000 (10:06 +0000)]
linux pciback/pcifront: work queue management fixes
flush_scheduled_work() only flushes work queued to the global
keventd_wq, but pciback is using its own local work queue, so that is
what needs to be flushed.
Calling cancel_delayed_work() on something never inserted through
queue_delayed_work() or schedule_delayed_work() is pointless.
Keir Fraser [Wed, 18 Mar 2009 11:51:05 +0000 (11:51 +0000)]
x86: Fix interaction of NTP and dom0->xen time updates
Don't discard NTP sync when updating Xen wallclock time from dom0,
as that's almost the first thing we do when we become synced.
Move the call to ntp_clear() into do_settimeofday(), which is the
only caller of __update_wallclock() that looks like it should break
NTP sync.
This fixes the timer chain that sets Xen's wallclock every minute when
dom0 is NTP synced, which in turn greatly improves wallclock accuracy
in PV domU.
Keir Fraser [Wed, 18 Mar 2009 11:45:30 +0000 (11:45 +0000)]
backport: allocate cap save buffers for PCIe/PCI-X.
Changeset 819:e8a9f8910a3f doesn't backport all the necessary code.
This patch adds the missing part. It is also backported from upstream
Linux kernel and the git commit is:
PCI: handle PCI state saving with interrupts disabled
Since interrupts will soon be disabled at PCI resume time, we need to
pre-allocate memory to save/restore PCI config space (or use
GFP_ATOMIC, but this is safer).
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
Keir Fraser [Wed, 18 Mar 2009 11:40:10 +0000 (11:40 +0000)]
PCI: add SR-IOV API for Physical Function driver
Add or remove the Virtual Function when the SR-IOV is enabled or
disabled by the device driver. This can happen anytime rather than
only at the device probe stage.
Keir Fraser [Wed, 18 Mar 2009 11:39:04 +0000 (11:39 +0000)]
PCI: initialize and release SR-IOV capability
If a device has the SR-IOV capability, initialize it (set the ARI
Capable Hierarchy in the lowest numbered PF if necessary; calculate
the System Page Size for the VF MMIO, probe the VF Offset, Stride
and BARs). A lock for the VF bus allocation is also initialized if
a PF is the lowest numbered PF.
PCI: Restore PCI Express capability registers after PM event
Restore PCI Express capability registers after PM event.
This includes maximum MTU for PCI Express and other vital data.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit cc692a5f1e9816671b77da77c6d6c463156ba1c7
Author: Stephen Hemminger <shemminger@osdl.org>
Date: Wed Nov 8 16:17:15 2006 -0800
PCI: save/restore PCI-X state
Shouldn't PCI-X state be saved/restored? No device really needs this
right now. qla24xx (fc HBA) and mthca (infiniband) don't do suspend,
and sky2 resets its tweaks when links are brought up.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
This patch moves all definitions of the PCI resource names to an 'enum',
and also replaces some hard-coded resource variables with symbol
names. This change eases introduction of device specific resources.
PCI: remove unnecessary arg of pci_update_resource()
This cleanup removes the unnecessary argument 'struct resource *res' in
pci_update_resource(), so it takes the same arguments as its companion
functions (pci_assign_resource(), etc.).
PCI: allow pci_alloc_child_bus() to handle a NULL bridge
Allow pci_alloc_child_bus() to allocate buses without bridge devices.
Some SR-IOV devices can occupy more than one bus number, but there are no
explicit bridges because they have an internal routing mechanism.
Change parameter of pci_ari_enabled() from 'pci_dev' to 'pci_bus'.
ARI forwarding on the bridge mostly concerns the subordinate devices
rather than the bridge itself. So this change will make the function
easier to use.
PCI: fix ARI code to be compatible with mixed ARI/non-ARI systems
The original ARI support code has a compatibility problem with non-ARI
devices. If a device doesn't support ARI, turning on ARI forwarding on
its upper level bridge will cause undefined behavior.
This fix turns on ARI forwarding only when the subordinate devices
support it.
This patch adds support for PCI Express Alternative Routing-ID
Interpretation (ARI) capability.
The ARI capability extends the Function Number field of the PCI Express
Endpoint by reusing the Device Number which is otherwise hardwired to 0.
With ARI, an Endpoint can have up to 256 functions.
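To make the field reuse concrete, here is a small sketch of the two routing ID layouts: the conventional 8-bit bus, 5-bit device, 3-bit function split, and the ARI split of 8-bit bus and 8-bit function. The helper names are made up for this illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional routing ID: bus[15:8], device[7:3], function[2:0]. */
uint16_t rid_conventional(uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* ARI routing ID: bus[15:8], function[7:0]; the device number is implicitly 0. */
uint16_t rid_ari(uint8_t bus, uint8_t fn)
{
    return (uint16_t)((bus << 8) | fn);
}

/* Extract the 8-bit ARI function number from a routing ID. */
uint8_t rid_ari_function(uint16_t rid)
{
    return (uint8_t)(rid & 0xff);
}
```

The same 16 bits are used either way: what a conventional device would read as device 2, function 3 is function 0x13 (19) of an ARI device, which is how the function count grows from 8 to 256.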
Since patch 6ac665c63dcac8fcec534a1d224ecbb8b867ad59 my infiniband
controller hasn't worked. This is because it has 64-bit prefetchable
memory, which was mistakenly being taken to be 32-bit memory. The
resource flags in this case are PCI_BASE_ADDRESS_MEM_TYPE_64 |
PCI_BASE_ADDRESS_MEM_PREFETCH.
This patch checks only for the PCI_BASE_ADDRESS_MEM_TYPE_64 bit; thus
whether the region is prefetchable or not is ignored. This fixes my
Infiniband.
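The difference between comparing the whole flags word and testing a single bit can be sketched as follows. The two constants use the values from the Linux pci_regs.h header; the function names are hypothetical, and the equality variant is only a guess at the kind of check that would produce the reported symptom, not a quote of the code being fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the Linux pci_regs.h header. */
#define PCI_BASE_ADDRESS_MEM_TYPE_64  0x04
#define PCI_BASE_ADDRESS_MEM_PREFETCH 0x08

/* Assumed shape of the faulty check: any extra bit makes it fail. */
bool is_mem64_equality(uint32_t flags)
{
    return flags == PCI_BASE_ADDRESS_MEM_TYPE_64;
}

/* Check as described by the patch: only the 64-bit type bit matters. */
bool is_mem64_bit_test(uint32_t flags)
{
    return (flags & PCI_BASE_ADDRESS_MEM_TYPE_64) != 0;
}

/* Example: a 64-bit prefetchable BAR, as in the report. */
void mem64_example(void)
{
    uint32_t flags = PCI_BASE_ADDRESS_MEM_TYPE_64 |
                     PCI_BASE_ADDRESS_MEM_PREFETCH;

    assert(!is_mem64_equality(flags)); /* misread as 32-bit */
    assert(is_mem64_bit_test(flags));  /* recognised as 64-bit */
}
```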
Reviewed-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
PCI: handle 64-bit resources better on 32-bit machines
If the kernel is configured to support 64-bit resources on a 32-bit
machine, we can support 64-bit BARs properly. Just change the condition
to check sizeof(resource_size_t) instead of BITS_PER_LONG.
Factor out the code to read one BAR from the loop in pci_read_bases into
a new function, __pci_read_base. The new code is slightly more
readable, better commented and removes the ifdef.
Keir Fraser [Mon, 2 Mar 2009 11:06:52 +0000 (11:06 +0000)]
netfront: Unregister inetdev notifiers on failure
If you attempt to modprobe the pv-on-hvm netfront driver on a machine
not running under Xen (say, bare-metal, or under another hypervisor), the
netfront code correctly returns an ENODEV and fails to load. However, if you
then shut down that machine, you will oops while tearing down the network.
This is because we forget to unregister the inetaddr_notifier on failure,
and so the kernel takes a fatal page fault. The attached patch just unregisters
the notifier on failure, and solves the problem for me.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Keir Fraser [Mon, 2 Mar 2009 10:57:56 +0000 (10:57 +0000)]
pciback: Fix invalid use of pci_match_id()
We cannot use pci_match_id() because the first argument (tmp_quirk->devid)
is not an array of pci device ids. Instead this patch adds a utility
function to compare a pci_device_id and a pci_dev.
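A comparison of a single ID entry against a single device could look roughly like the sketch below. The structs are simplified stand-ins rather than the kernel's struct pci_device_id and struct pci_dev, and the function name is hypothetical; only the wildcard convention (an all-ones ID matches anything) mirrors the kernel's PCI_ANY_ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ANY_ID 0xffffffffu /* wildcard, mirrors the kernel's PCI_ANY_ID */

/* Simplified stand-in for one ID table entry. */
struct sketch_device_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;
};

/* Simplified stand-in for a device. */
struct sketch_dev {
    uint16_t vendor, device;
    uint16_t subsystem_vendor, subsystem_device;
    uint32_t class_code;
};

/* Compare exactly one ID entry with one device; no array walk involved. */
bool sketch_match_one(const struct sketch_device_id *id,
                      const struct sketch_dev *dev)
{
    return (id->vendor == SKETCH_ANY_ID || id->vendor == dev->vendor) &&
           (id->device == SKETCH_ANY_ID || id->device == dev->device) &&
           (id->subvendor == SKETCH_ANY_ID ||
            id->subvendor == dev->subsystem_vendor) &&
           (id->subdevice == SKETCH_ANY_ID ||
            id->subdevice == dev->subsystem_device) &&
           ((id->class_code ^ dev->class_code) & id->class_mask) == 0;
}

/* Example: match any device of one vendor, regardless of the other fields. */
void sketch_match_example(void)
{
    struct sketch_device_id id = { 0x8086, SKETCH_ANY_ID, SKETCH_ANY_ID,
                                   SKETCH_ANY_ID, 0, 0 };
    struct sketch_dev dev = { 0x8086, 0x10c9, 0x8086, 0xa03c, 0x020000 };

    assert(sketch_match_one(&id, &dev));
}
```

Because the helper takes a single entry, it never needs the zero terminator that an ID table walk depends on.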
The ACPI_PDC_SMP_T_SWCOORD bit is set by an OS that is capable of
native ACPI throttling software coordination for multi-processors
using the _TSD information.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
ACPI: Get throttling info from BIOS only after evaluating _PDC
Previously _PDC was evaluated later, and thus we'd not get
the chance to tell the BIOS that we can support FixedHW registers (MSRs)
and the BIOS would always ask us to use System I/O access
for throttling.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Add throttling control via MSR when T-states use
the FixHW Control Status registers.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Keir Fraser [Tue, 17 Feb 2009 11:17:11 +0000 (11:17 +0000)]
pvSCSI: add new device assignment mode
Add a new device assignment mode, which assigns a whole HBA
(SCSI host) to a guest domain. The current implementation requires SCSI
command emulation in the backend driver, which limits the use of
some SCSI commands. (Please see
"http://www.xen.org/files/xensummit_tokyo/24_Hitoshi%20Matsumoto_en.pdf"
for details about why we need the new assignment mode.)
SCSI command emulation in the backend driver is bypassed when "host" mode
is specified.
Signed-off-by: Tomonari Horikoshi <t.horikoshi@jp.fujitsu.com>
Signed-off-by: Jun Kamada <kama@jp.fujitsu.com>
Keir Fraser [Wed, 4 Feb 2009 12:26:00 +0000 (12:26 +0000)]
linux: fix IRQ handling for PV passthrough
For DomU-s registering PIRQ-s must be done separately, as they don't
use the IO-APIC code.
Additionally make sure the IRQ chip doesn't get set twice (and the
event channel information overwritten) for an IRQ possibly in use by
more than one device.
Keir Fraser [Wed, 4 Feb 2009 12:25:09 +0000 (12:25 +0000)]
linux: remove xen specific member from pci_dev
Move the MSI-related variable irq_old out of struct pci_dev. This is
logically more consistent and has the additional benefit that the Xen
kernel and the vanilla kernel now have the same pci_dev layout.
Keir Fraser [Tue, 3 Feb 2009 13:59:17 +0000 (13:59 +0000)]
fbfront: Improve diagnostics when kthread_run() fails
Failure is reported with xenbus_dev_fatal(..."register_framebuffer"),
which was already suboptimal before it got moved away from
register_framebuffer(), and is outright misleading now.
Signed-off-by: Markus Armbruster <armbru@redhat.com>