Keir Fraser [Thu, 14 May 2009 09:08:10 +0000 (10:08 +0100)]
xen: miscellaneous cleanup
- add two missing unwind annotations
- mark remaining struct file_operations instances const
- use get_capacity() instead of raw access to the capacity field
- use assert_spin_locked() instead of BUG_ON(!spin_is_locked())
- use clear_tsk_thread_flag() instead of clear_ti_thread_flag()
- remove dead variable cpu_state
vm_area_struct::vm_private_data is used by get_user_pages(),
so we can't override it. To make blktap work, set it
to an array of struct page*.
Without mm->mmap_sem held, the virtual mapping can change,
so remembering the vma that was passed to the mmap callback
is bogus: the vma can later be freed or changed.
So don't remember the vma; put the necessary information into
tap_blkif_t and use find_vma() to look up the vma when needed.
Dereferencing filp->private_data->vma in the file_operations.release
actor isn't permitted, as the vma generally has been destroyed by that
time. The kfree()ing of vma->vm_private_data must be done in the
vm_operations.close actor, and the call to zap_page_range() seems
redundant with the caller of that actor altogether.
Without this patch, fakephp with reassigndev fails
to allocate memory resource.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
commit bf4162bcf82ebc3258d6bc0ddd6453132abde72d
Author: Darrick J. Wong <djwong@us.ibm.com>
Date: Tue Nov 25 13:51:44 2008 -0800
PCI hotplug: fakephp: Allocate PCI resources before adding the device
For PCI devices, pci_bus_assign_resources() must be called to set
up the pci_device->resource array before pci_bus_add_devices() can
be called, else attempts to load drivers result in BAR collision
errors where there are none.
This is not done in fakephp, so devices can be "unplugged" but scanning the
parent bus won't bring the devices back due to resource unallocation. Move the
pci_bus_add_device-calling logic into pci_rescan_bus and preface it with a call
to pci_bus_assign_resources so that we only have to (re)allocate resources once
per bus where a new device is found.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
dom0 linux: support SBDF with "guestdev=" and remove "reassigndev="
When we don't need to reassign resources and use device path,
pciback.hide= boot parameter can be used. The parameter is also needed
for backward compatibility.
pciback.hide=(00:01.0)(00:02.0)
When we need to reassign resources or use device path, guestdev= boot
parameter can be used. reassign_resources boot parameter is needed to
reassign resources, too.
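As an illustration of the pciback.hide= value format shown above, here is a minimal Python sketch of how such a string could be split into device addresses. It is not the kernel's parser: parse_hide() is a hypothetical helper, and accepting an optional leading PCI domain (for example "(0000:00:01.0)") is an assumption made for the sketch.

```python
import re

# Hypothetical helper, not kernel code. One "(bus:slot.func)" entry, with an
# optional "domain:" prefix (the optional domain is an assumption here).
_ENTRY = re.compile(
    r"\((?:([0-9a-fA-F]{1,4}):)?"      # optional domain
    r"([0-9a-fA-F]{1,2}):"             # bus
    r"([0-9a-fA-F]{1,2})\.([0-7])\)"   # slot.function
)

def parse_hide(value):
    """Return a list of (domain, bus, slot, func) tuples from a hide= value."""
    devices = []
    pos = 0
    while pos < len(value):
        m = _ENTRY.match(value, pos)
        if m is None:
            raise ValueError(f"malformed entry at offset {pos}: {value[pos:]!r}")
        domain = int(m.group(1), 16) if m.group(1) else 0
        devices.append(
            (domain, int(m.group(2), 16), int(m.group(3), 16), int(m.group(4), 16))
        )
        pos = m.end()
    return devices
```

For the example value above, parse_hide("(00:01.0)(00:02.0)") yields two entries, both in domain 0 on bus 0.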
PCI: sync up the SR-IOV changes between Dom0 and upstream kernel
The SR-IOV patches for the upstream kernel are finally in-tree. This
patch backports some minor changes that appeared in the upstream
kernel after the Dom0 patches were checked-in.
pci: Do not disable I/O decoding on reassigning resource.
When I reserve UHCI for guest domain with "guestdev=" and
"reassign_resources" parameters, spurious interrupts occurred.
The reason is that UHCI is not reset by uhci_check_and_reset_hc
because I/O decoding is disabled. UHCI keeps asserting the interrupt
line. As a result spurious interrupts occur.
The patch does not disable I/O decoding. It disables only memory
decoding. So UHCI is reset and spurious interrupts do not occur.
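The difference between the two behaviours can be sketched on the PCI command register, where bit 0 enables I/O space decoding and bit 1 enables memory space decoding. The Python below is only a model of the bit manipulation: the function names are hypothetical, and the "before" form (clearing both bits) is inferred from the description above rather than taken from the patch.

```python
PCI_COMMAND_IO = 0x1      # I/O space decoding enable
PCI_COMMAND_MEMORY = 0x2  # memory space decoding enable

def disable_decoding_before(cmd):
    """Assumed earlier behaviour: clear both I/O and memory decoding."""
    return cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY) & 0xFFFF

def disable_decoding_after(cmd):
    """Behaviour described by the patch: clear memory decoding only."""
    return cmd & ~PCI_COMMAND_MEMORY & 0xFFFF
```

With I/O decoding left enabled, a later reset through I/O ports (as uhci_check_and_reset_hc needs) can still reach the controller.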
Keir Fraser [Tue, 31 Mar 2009 11:01:50 +0000 (12:01 +0100)]
usbfront: do not assume sequentially mapped pages
xenhcd_gnttab_map in usbfront-q.c looks up the mfn of the start of the
usb transfer buffer. But the buffer may span several pages, and the
current code simply increments the obtained mfn. Needless to say this
is an unwarranted assumption. It causes large transfers to be
corrupted and/or to overwrite other parts of memory.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
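The problem can be modelled with a toy pfn-to-mfn table. The Python sketch below is not the usbfront code: the function names and the dictionary standing in for the physical-to-machine mapping are assumptions, and a 4 KiB page size is assumed.

```python
PAGE_SHIFT = 12                # assumes 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT

def pages_spanned(addr, length):
    """Number of pages touched by a buffer of `length` bytes at `addr`."""
    offset = addr & (PAGE_SIZE - 1)
    return (offset + length + PAGE_SIZE - 1) >> PAGE_SHIFT

def mfns_assuming_contiguous(pfn_to_mfn, addr, length):
    """Old approach: look up the first mfn and simply increment it."""
    base = pfn_to_mfn[addr >> PAGE_SHIFT]
    return [base + i for i in range(pages_spanned(addr, length))]

def mfns_per_page(pfn_to_mfn, addr, length):
    """Fixed approach: look up the mfn of every page the buffer covers."""
    first_pfn = addr >> PAGE_SHIFT
    return [pfn_to_mfn[first_pfn + i] for i in range(pages_spanned(addr, length))]
```

Whenever consecutive pfns do not map to consecutive mfns, the two functions disagree from the second page onward, which is where corruption of large transfers would come from.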
Keir Fraser [Tue, 31 Mar 2009 11:00:53 +0000 (12:00 +0100)]
sfc_netfront: Only clear tx_skb when ready for netif_wake_queue
(doing otherwise could result in a lost packet) and document use of
locks to protect tx_skb
Keir Fraser [Tue, 31 Mar 2009 10:59:10 +0000 (11:59 +0100)]
net sfc: Update sfc and sfc_resource driver to latest release
...and update sfc_netfront, sfc_netback, sfc_netutil for any API changes
sfc_netback: Fix asymmetric use of SFC buffer table alloc and free
sfc_netback: Clean up if no SFC accel device found
sfc_netback: Gracefully handle case where page grant fails
sfc_netback: Disable net acceleration if the physical link goes down
sfc_netfront: Less verbose error messages, more verbose counters for
rx discard errors
sfc_netfront: Gracefully handle case where SFC netfront fails during
initialisation
Keir Fraser [Fri, 20 Mar 2009 09:00:58 +0000 (09:00 +0000)]
xen: swiotlb allocations do not need to come from low memory
Unlike on native, where using the _low variants of alloc_bootmem()
is indeed a requirement for swiotlb, on Xen this is not needed. Using
the _low variants has the potential of preventing systems from booting
when they have lots of memory, due to the way the bootmem allocator
works: It allocates memory from bottom to top. Thus, if other large
(but not _low) allocations (memmap, large system hash tables)
mostly consume the memory below 4GB, the swiotlb allocations can
fail. (This is equally so for native, but cannot be that easily fixed
there.)
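The failure mode can be illustrated with a toy bottom-up allocator. This Python sketch is a deliberate simplification (a bump pointer rather than the real bootmem bitmap allocator); the class name and the `limit` parameter, which stands in for the _low variants, are assumptions.

```python
GIB = 1 << 30
MIB = 1 << 20

class BottomUpAllocator:
    """Toy model: hands out memory from address 0 upward, never reuses it."""

    def __init__(self, total):
        self.total = total
        self.next_free = 0

    def alloc(self, size, limit=None):
        """Return the start address, or None if the request cannot be met.

        `limit` models a _low allocation: the block must end below it.
        """
        ceiling = self.total if limit is None else min(limit, self.total)
        end = self.next_free + size
        if end > ceiling:
            return None
        start = self.next_free
        self.next_free = end
        return start
```

Once earlier unrestricted allocations have consumed everything below 4GB, a request with limit=4 * GIB fails even though plenty of memory remains above, while the same request without the limit succeeds.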
Keir Fraser [Thu, 19 Mar 2009 10:21:46 +0000 (10:21 +0000)]
PCI: pass ARI and SR-IOV device information to the hypervisor
PCIe Alternative Routing-ID Interpretation (ARI) ECN defines the Extended
Function -- a function whose function number is greater than 7 within an
ARI Device. Intel VT-d spec 1.2 section 8.3.2 specifies that the Extended
Function is under the scope of the same remapping unit as the traditional
function. The hypervisor needs to know whether a function is an Extended
Function so it can find the proper DMAR for it.
And section 8.3.3 specifies that the SR-IOV Virtual Function is under the
scope of the same remapping unit as the Physical Function. The hypervisor
also needs to know whether a function is a Virtual Function and which
Physical Function it's associated with, for the same reason.
Keir Fraser [Thu, 19 Mar 2009 10:21:21 +0000 (10:21 +0000)]
PCI: save and restore PCIe 2.0 registers
PCIe 2.0 defines several new registers (Device Control 2, Link Control
2, and Slot Control 2). Save and restore them in pci_save_pcie_state()
and pci_restore_pcie_state().
Keir Fraser [Thu, 19 Mar 2009 10:06:52 +0000 (10:06 +0000)]
linux pciback/pcifront: work queue management fixes
flush_scheduled_work() only flushes work queued to the global
keventd_wq, but pciback is using its own local work queue, so that is
what needs to be flushed.
Calling cancel_delayed_work() on something never inserted through
queue_delayed_work() or schedule_delayed_work() is pointless.
Keir Fraser [Wed, 18 Mar 2009 11:51:05 +0000 (11:51 +0000)]
x86: Fix interaction of NTP and dom0->xen time updates
Don't discard NTP sync when updating Xen wallclock time from dom0,
as that's almost the first thing we do when we become synced.
Move the call to ntp_clear() into do_settimeofday(), which is the
only caller of __update_wallclock() that looks like it should break
NTP sync.
This fixes the timer chain that sets Xen's wallclock every minute when
dom0 is NTP synced, which in turn greatly improves wallclock accuracy
in PV domU.
Keir Fraser [Wed, 18 Mar 2009 11:45:30 +0000 (11:45 +0000)]
backport: allocate cap save buffers for PCIe/PCI-X.
Changeset 819:e8a9f8910a3f doesn't backport all the necessary code.
This patch adds the missing part. It is also backported from upstream
Linux kernel and the git commit is:
PCI: handle PCI state saving with interrupts disabled
Since interrupts will soon be disabled at PCI resume time, we need to
pre-allocate memory to save/restore PCI config space (or use
GFP_ATOMIC, but this is safer).
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
Keir Fraser [Wed, 18 Mar 2009 11:40:10 +0000 (11:40 +0000)]
PCI: add SR-IOV API for Physical Function driver
Add or remove the Virtual Function when the SR-IOV is enabled or
disabled by the device driver. This can happen anytime rather than
only at the device probe stage.
Keir Fraser [Wed, 18 Mar 2009 11:39:04 +0000 (11:39 +0000)]
PCI: initialize and release SR-IOV capability
If a device has the SR-IOV capability, initialize it (set the ARI
Capable Hierarchy in the lowest numbered PF if necessary; calculate
the System Page Size for the VF MMIO, probe the VF Offset, Stride
and BARs). A lock for the VF bus allocation is also initialized if
a PF is the lowest numbered PF.
PCI: Restore PCI Express capability registers after PM event
Restore PCI Express capability registers after PM event.
This includes maximum MTU for PCI express and other vital data.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit cc692a5f1e9816671b77da77c6d6c463156ba1c7
Author: Stephen Hemminger <shemminger@osdl.org>
Date: Wed Nov 8 16:17:15 2006 -0800
PCI: save/restore PCI-X state
Shouldn't PCI-X state be saved/restored? No device really needs this
right now. qla24xx (fc HBA) and mthca (infiniband) don't do suspend,
and sky2 resets its tweaks when links are brought up.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
This patch moves all definitions of the PCI resource names to an 'enum',
and also replaces some hard-coded resource variables with symbol
names. This change eases introduction of device specific resources.
PCI: remove unnecessary arg of pci_update_resource()
This cleanup removes the unnecessary argument 'struct resource *res' in
pci_update_resource(), so it takes the same arguments as other companion
functions (pci_assign_resource(), etc.).
PCI: allow pci_alloc_child_bus() to handle a NULL bridge
Allow pci_alloc_child_bus() to allocate buses without bridge devices.
Some SR-IOV devices can occupy more than one bus number, but there are no
explicit bridges because they have an internal routing mechanism.
Change parameter of pci_ari_enabled() from 'pci_dev' to 'pci_bus'.
ARI forwarding on the bridge mostly concerns the subordinate devices
rather than the bridge itself. So this change will make the function
easier to use.
PCI: fix ARI code to be compatible with mixed ARI/non-ARI systems
The original ARI support code has a compatibility problem with non-ARI
devices. If a device doesn't support ARI, turning on ARI forwarding on
its upper level bridge will cause undefined behavior.
This fix turns on ARI forwarding only when the subordinate devices
support it.
This patch adds support for PCI Express Alternative Routing-ID
Interpretation (ARI) capability.
The ARI capability extends the Function Number field of the PCI Express
Endpoint by reusing the Device Number which is otherwise hardwired to 0.
With ARI, an Endpoint can have up to 256 functions.
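How the 8-bit device/function byte is interpreted with and without ARI can be sketched as follows. This Python is only an illustration of the arithmetic; decode_devfn() and is_extended_function() are hypothetical names, not kernel functions.

```python
def decode_devfn(devfn, ari_enabled):
    """Split an 8-bit devfn into (device, function).

    Without ARI: 5-bit device number, 3-bit function number.
    With ARI: the device number is implicitly 0 and all 8 bits form the
    function number.
    """
    if not 0 <= devfn <= 0xFF:
        raise ValueError("devfn must fit in 8 bits")
    if ari_enabled:
        return (0, devfn)
    return (devfn >> 3, devfn & 0x7)

def is_extended_function(devfn, ari_enabled):
    """An Extended Function is a function numbered above 7 in an ARI device."""
    return ari_enabled and decode_devfn(devfn, True)[1] > 7
```

The same byte 0x0A therefore means device 1, function 2 on a conventional device, but function 10 of device 0 under ARI.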
Since patch 6ac665c63dcac8fcec534a1d224ecbb8b867ad59 my infiniband
controller hasn't worked. This is because it has 64-bit prefetchable
memory, which was mistakenly being taken to be 32-bit memory. The
resource flags in this case are PCI_BASE_ADDRESS_MEM_TYPE_64 |
PCI_BASE_ADDRESS_MEM_PREFETCH.
This patch checks only for the PCI_BASE_ADDRESS_MEM_TYPE_64 bit; thus
whether the region is prefetchable or not is ignored. This fixes my
Infiniband.
Reviewed-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
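The effect of testing only the 64-bit type bit can be shown with the BAR flag values from the PCI register definitions. In the Python sketch below, the exact-equality comparison is an assumed shape of the earlier mistake, used only for contrast, and both function names are hypothetical.

```python
PCI_BASE_ADDRESS_MEM_TYPE_64 = 0x04   # 64-bit memory BAR
PCI_BASE_ADDRESS_MEM_PREFETCH = 0x08  # prefetchable memory

def is_64bit_exact(flags):
    """Assumed buggy form: any extra flag bit makes the comparison fail."""
    return flags == PCI_BASE_ADDRESS_MEM_TYPE_64

def is_64bit_bit_test(flags):
    """Form described above: look only at the 64-bit type bit."""
    return bool(flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
```

For flags of PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH, the first function reports a 32-bit BAR and the second reports a 64-bit one.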
PCI: handle 64-bit resources better on 32-bit machines
If the kernel is configured to support 64-bit resources on a 32-bit
machine, we can support 64-bit BARs properly. Just change the condition
to check sizeof(resource_size_t) instead of BITS_PER_LONG.
Factor out the code to read one BAR from the loop in pci_read_bases into
a new function, __pci_read_base. The new code is slightly more
readable, better commented and removes the ifdef.
Keir Fraser [Mon, 2 Mar 2009 11:06:52 +0000 (11:06 +0000)]
netfront: Unregister inetdev notifiers on failure
If you attempt to modprobe the pv-on-hvm netfront driver on a machine
not running under Xen (say, bare-metal, or under another hypervisor), the
netfront code correctly returns an ENODEV and fails to load. However, if you
then shut down that machine, you will oops while tearing down the network.
This is because we forget to unregister the inetaddr_notifier on failure,
and so the kernel takes a fatal page fault. The attached patch just unregisters
the notifier on failure, and solves the problem for me.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
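The pattern behind this fix (undo a registration when a later initialisation step fails) can be modelled in a few lines. The Python below is a toy, not the netfront code: NotifierChain, driver_init() and the `unregister_on_failure` switch are all hypothetical and exist only to contrast the two behaviours.

```python
import errno

class NotifierChain:
    """Toy stand-in for a kernel notifier chain."""

    def __init__(self):
        self.callbacks = []

    def register(self, cb):
        self.callbacks.append(cb)

    def unregister(self, cb):
        self.callbacks.remove(cb)

def on_inetaddr_event(event):
    """Placeholder callback; in a real module it lives in module memory."""

def driver_init(chain, later_step_ok, unregister_on_failure=True):
    """Register the notifier, then run a later step that may fail."""
    chain.register(on_inetaddr_event)
    if not later_step_ok:
        if unregister_on_failure:
            # The fix: leave no callback behind when the load fails.
            chain.unregister(on_inetaddr_event)
        return -errno.ENODEV
    return 0
```

Without the unregister call, the chain keeps a reference to a callback belonging to a module that never finished loading, and invoking it later (for example at shutdown) is what faults.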
Keir Fraser [Mon, 2 Mar 2009 10:57:56 +0000 (10:57 +0000)]
pciback: Fix invalid use of pci_match_id()
We cannot use pci_match_id() because the first argument (tmp_quirk->devid)
is not an array of pci device ids. Instead this patch adds a utility
function to compare a pci_device_id and a pci_dev.
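A comparison of a single device ID entry against a device might look like the sketch below. This is Python modelling the idea, not the helper the patch adds: the dataclasses and match_one_device() are hypothetical, and the wildcard and class-mask rules follow the usual PCI ID matching convention (PCI_ANY_ID matches anything; class is compared under class_mask).

```python
from dataclasses import dataclass

PCI_ANY_ID = 0xFFFFFFFF  # wildcard value in an ID entry

@dataclass
class DeviceId:
    vendor: int
    device: int
    subvendor: int = PCI_ANY_ID
    subdevice: int = PCI_ANY_ID
    class_: int = 0
    class_mask: int = 0

@dataclass
class Device:
    vendor: int
    device: int
    subsystem_vendor: int
    subsystem_device: int
    class_: int

def match_one_device(dev_id, dev):
    """True if the single ID entry matches the device."""
    return (
        dev_id.vendor in (PCI_ANY_ID, dev.vendor)
        and dev_id.device in (PCI_ANY_ID, dev.device)
        and dev_id.subvendor in (PCI_ANY_ID, dev.subsystem_vendor)
        and dev_id.subdevice in (PCI_ANY_ID, dev.subsystem_device)
        and ((dev_id.class_ ^ dev.class_) & dev_id.class_mask) == 0
    )
```

The point of a single-entry comparison is that it needs no terminating sentinel, whereas pci_match_id() walks an array until it reaches an all-zero entry.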
The ACPI_PDC_SMP_T_SWCOORD bit is set by an OS that is capable of
native ACPI throttling software coordination for multi-processors
using the _TSD information.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
ACPI: Get throttling info from BIOS only after evaluating _PDC
Previously _PDC was evaluated later, and thus we'd not get
the chance to tell the BIOS that we can support FixedHW registers (MSRs)
and the BIOS would always ask us to use System I/O access
for throttling.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Add throttling control via MSR when T-states use
the FixedHW Control Status registers.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Keir Fraser [Tue, 17 Feb 2009 11:17:11 +0000 (11:17 +0000)]
pvSCSI: add new device assignment mode
Add a new device assignment mode, which assigns a whole HBA
(SCSI host) to a guest domain. The current implementation requires SCSI
command emulation in the backend driver, which imposes limitations on
some SCSI commands. (Please see
"http://www.xen.org/files/xensummit_tokyo/24_Hitoshi%20Matsumoto_en.pdf"
for details about why we need the new assignment mode.)
SCSI command emulation on backend driver is bypassed when "host" mode
is specified.
Signed-off-by: Tomonari Horikoshi <t.horikoshi@jp.fujitsu.com>
Signed-off-by: Jun Kamada <kama@jp.fujitsu.com>
Keir Fraser [Wed, 4 Feb 2009 12:26:00 +0000 (12:26 +0000)]
linux: fix IRQ handling for PV passthrough
For DomU-s, registering PIRQ-s must be done separately, as they don't
use the IO-APIC code.
Additionally make sure the IRQ chip doesn't get set twice (and the
event channel information overwritten) for an IRQ possibly in use by
more than one device.
Keir Fraser [Wed, 4 Feb 2009 12:25:09 +0000 (12:25 +0000)]
linux: remove xen specific member from pci_dev
Move the MSI-related variable irq_old out of struct pci_dev. This is
logically more consistent and has the additional benefit that the Xen
kernel and the vanilla kernel now have the same pci_dev layout.
Keir Fraser [Tue, 3 Feb 2009 13:59:17 +0000 (13:59 +0000)]
fbfront: Improve diagnostics when kthread_run() fails
Failure is reported with xenbus_dev_fatal(..."register_framebuffer"),
which was already suboptimal before it got moved away from
register_framebuffer(), and is outright misleading now.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Keir Fraser [Wed, 14 Jan 2009 14:03:42 +0000 (14:03 +0000)]
revert: "netfront/back: do not mark packets of length < MSS as GSO"
changeset: 774:107e10e0e07c
user: Keir Fraser <keir.fraser@citrix.com>
date: Tue Jan 13 15:17:54 2009 +0000
summary: netfront/back: do not mark packets of length < MSS as GSO
Herbert Xu suggested a better fix in the network
stack which will follow.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Keir Fraser [Tue, 13 Jan 2009 15:17:54 +0000 (15:17 +0000)]
netfront/back: do not mark packets of length < MSS as GSO
Linux assumes that skbs marked for GSO are longer than MSS. In
particular tcp_tso_segment assumes that skb_segment will return a
chain of at least 2 skbs.
Both netfront and netback should therefore not pass such a packet up the
stack.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
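The condition described here can be sketched as a simple length check. The Python below is an illustration only: both function names are hypothetical, and `length` is assumed to mean the payload length that segmentation would split.

```python
def segment_count(length, mss):
    """Number of segments produced when `length` bytes are split at `mss`."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return max(1, -(-length // mss))  # ceiling division, at least one segment

def should_mark_gso(length, mss):
    """Only mark a packet as GSO if segmentation yields at least two skbs."""
    return length > mss
```

A packet that fits in one MSS would segment into a single skb, which is exactly the case the consumer of the segment chain does not expect.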