Keir Fraser [Fri, 8 Jan 2010 13:07:17 +0000 (13:07 +0000)]
Update sfc_netback driver to match sfc_resource 3.0.2.2074
Add support for direct guest access and acceleration of SFC9000 series
NICs.
Improve handling of NIC reset in sfc_netback
Remove nic_index state and replace with if_index from struct net_device
Remove duplication of header files with sfc_resource driver
Keir Fraser [Fri, 8 Jan 2010 13:05:49 +0000 (13:05 +0000)]
Update Solarflare Communications net driver to version 3.0.2.2074
Bring net driver in Xen tree in line with kernel.org tree
Add support for new SFC9000 series NICs
Keir Fraser [Wed, 6 Jan 2010 08:38:09 +0000 (08:38 +0000)]
xen/privcmd: fix for proper operation in compat mode
- sizeof(struct privcmd_mmapbatch_32) was wrong
- MFN array must be translated for IOCTL_PRIVCMD_MMAPBATCH
Also, the error indicator of IOCTL_PRIVCMD_MMAPBATCH should be in the
top nibble (it is documented that way in include/xen/public/privcmd.h
and include/xen/compat_ioctl.h), but since that is an incompatible
change it is not being done here (instead, a new ioctl with proper
behavior will need to be added).
Keir Fraser [Wed, 6 Jan 2010 08:15:35 +0000 (08:15 +0000)]
xenoprof: dynamic buffer array allocation
The recent change to locally define MAX_VIRT_CPUS wasn't really
appropriate - with there not being a hard limit on the number of
vCPU-s anymore, these arrays should be allocated dynamically.
Keir Fraser [Wed, 6 Jan 2010 08:14:10 +0000 (08:14 +0000)]
xen/privcmd: convert single shot check to be per-page
For the sake of not breaking the ia64 build, old behavior is being
retained when HAVE_ARCH_PRIVCMD_MMAP. Hopefully someone able to
test ia64 can fix this up in the not too distant future.
Keir Fraser [Wed, 16 Dec 2009 16:44:12 +0000 (16:44 +0000)]
xen/backends: simplify address translations
There are quite a number of places where e.g. page->va->page
translations happen.
Besides yielding smaller code (source and binary), a second goal is to
make it easier to determine where virtual addresses of pages allocated
through alloc_empty_pages_and_pagevec() are really used (in turn in
order to determine whether using highmem pages would be possible
there).
Keir Fraser [Mon, 7 Dec 2009 14:14:28 +0000 (14:14 +0000)]
netback: Fixes for delayed copy of tx network packets.
- Should call net_tx_action_dealloc() even when dealloc ring is
empty, as there may in any case be work to do on the
pending_inuse list.
- Should not exit directly from the middle of the tx_action tasklet,
as the tx_pending_timer should always be checked and updated at the
end of the tasklet.
Keir Fraser [Thu, 3 Dec 2009 13:53:06 +0000 (13:53 +0000)]
xenfb: Only start one xenfb kthread
When doing save/restore testing with the linux-2.6.18-xen.hg tree it
was discovered that every time a restore happened we would get a new
xenfb thread. While the framebuffer continues to work, this is an
obvious resource leak. The attached patch only starts up a new xenfb
thread the first time the backend connects, and continues to re-use
that in the future. Jeremy's upstream LKML tree doesn't suffer from
this since it uses a completely different mechanism to do screen
updates. Original patch from John Haxby @ Oracle; slightly modified
by me to apply to the linux-2.6.18-xen.hg tree.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
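The "start the thread only on the first connect" rule from the xenfb change can be modelled in user space as follows. This is a rough Python sketch, not the kernel patch: the class and method names are hypothetical, and a Python thread stands in for the xenfb kthread.

```python
import threading


class FrameBufferFrontend:
    """Toy stand-in for the xenfb frontend state (all names hypothetical)."""

    def __init__(self):
        self.worker = None              # refresh thread, created lazily
        self.starts = 0                 # how many threads were ever created
        self._stop = threading.Event()

    def _refresh_loop(self):
        # Placeholder for the screen-update loop: it only waits to be stopped.
        self._stop.wait()

    def on_backend_connected(self):
        # Create the thread on the first connect only; later connects
        # (for example after a restore) re-use the existing thread.
        if self.worker is None:
            self.worker = threading.Thread(target=self._refresh_loop, daemon=True)
            self.worker.start()
            self.starts += 1

    def shutdown(self):
        self._stop.set()
        if self.worker is not None:
            self.worker.join()
```

Calling on_backend_connected() repeatedly, as happens across save/restore cycles, leaves a single worker thread in place.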
When mem= is being used to specify a value below the amount a domain
got passed from Xen, init_memory_mapping() got called with the higher
original value (end_pfn_map), triggering the BUG()s in maddr.h
checking PFNs against end_pfn.
Keir Fraser [Tue, 24 Nov 2009 14:45:19 +0000 (14:45 +0000)]
xen: Dont call msi_unmap_pirq() if did not enable msi
When a device driver is unloaded, it may call pci_disable_msi(). If msi
was never enabled but msi_unmap_pirq() is called anyway, then a later
driver reload without msi will fail in request_irq() because the
irq_desc[irq]->chip value is no_irq_chip. So if msi was not enabled
during driver initialization, unloading the driver will not try to
disable it.
How to reproduce it:
On a server with QLogic 25xx, reloading qla2xxx will hit it.
Keir Fraser [Wed, 4 Nov 2009 18:13:32 +0000 (18:13 +0000)]
xenbus: do not hold transaction_mutex when returning to userspace
================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
xenstore-list/3522 is leaving the kernel with locks still held!
1 lock held by xenstore-list/3522:
#0: (&xs_state.transaction_mutex){......}, at: [<c026dc6f>]
xenbus_dev_request_and_reply+0x8f/0xa0
The canonical fix for this type of issue appears to be to maintain a
count manually rather than using an rwsem so do that here.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
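The "maintain a count manually" approach from the xenbus change can be illustrated with a small user-space model. This is a Python sketch under assumptions, not the kernel code: the class name, the method names and the suspend/resume pairing are hypothetical. The point is that no lock stays held between the start and the end of a transaction; the lock only guards the counter.

```python
import threading


class TransactionGate:
    """Counts open transactions instead of holding a lock across them."""

    def __init__(self):
        self._cond = threading.Condition()
        self.active = 0            # transactions currently in flight
        self._suspending = False

    def transaction_start(self):
        with self._cond:
            # New transactions wait while a suspend is in progress.
            while self._suspending:
                self._cond.wait()
            self.active += 1
        # The lock is released here, so the caller may "return to
        # userspace" without holding anything.

    def transaction_end(self):
        with self._cond:
            self.active -= 1
            if self.active == 0:
                self._cond.notify_all()

    def suspend(self):
        with self._cond:
            self._suspending = True
            # Wait for all in-flight transactions to finish.
            while self.active > 0:
                self._cond.wait()

    def resume(self):
        with self._cond:
            self._suspending = False
            self._cond.notify_all()
```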
Keir Fraser [Fri, 23 Oct 2009 09:07:22 +0000 (10:07 +0100)]
xen/x86: fix GFP mask handling in dma_alloc_coherent()
Ever since no longer pushing all memory into the DMA zone (c/s 355),
explicitly setting GFP_DMA as well as not masking off GFP_DMA32 was
unnecessarily restricting the pool from which suitable memory could be
taken.
Keir Fraser [Wed, 7 Oct 2009 07:42:00 +0000 (08:42 +0100)]
PVUSB: Fixes and updates
- xenbus state flow changed.
The whole flow is changed to be like netback/netfront.
Reconfiguring/Reconfigured are removed.
- New RING for hotplug notification added.
- USBIF_MAX_SEGMENTS_PER_REQUEST value is changed from 10 to 16.
As a result of this change, RING_SIZE is decreased from 32 to 16.
This affects performance: my flash drive's read throughput
dropped from 29MB/s to 18MB/s in the Linux environment.
However, Windows guests send urbs with 64kB buffers (64kB = 4kB * 16),
so this change is required.
- New port-setting interface
xenbus_watch_path2 is added to usbback, port-setting interface
is moved from sysfs to xenstore.
Now, the port-rule is directly written to xenstore entry.
Example.
# xenstore-write /local/domain/0/backend/vusb/1/0/port/1 "2-1"
(adding physical bus 2-1 to vusb-1-0 port 1)
- urb dequeue function completed.
usbfront sends an unlink request to usbback, and can cancel an urb
that has been submitted in the backend.
- New USB Spec version (USB1.1/USB2.0) selection support.
usbfront can act as either a USB1.1 or a USB2.0 virtual host controller
according to the xenstore entry key "usb-ver".
- experimental bus_suspend/bus_resume added to usbfront.
- various cleanups, bugfix, refactoring and codestyle-fix.
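The segment and ring-size arithmetic in the PVUSB notes can be checked with a short calculation. The Python sketch below is illustrative only: the 64-byte ring header, the 20-byte fixed request part and the 8-byte segment descriptor are assumed sizes chosen for the example, not values taken from the patch, and the request is assumed to be the larger of the request/response pair.

```python
PAGE_SIZE = 4096       # 4kB pages, as in the notes above
SRING_HEADER = 64      # assumed size of the shared-ring header
REQ_FIXED = 20         # assumed fixed part of one request (hypothetical)
SEG_SIZE = 8           # assumed size of one segment descriptor (hypothetical)


def max_transfer(segments):
    """Largest buffer one request can describe, one page per segment."""
    return segments * PAGE_SIZE


def round_down_pow2(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def ring_size(segments):
    """Entries that fit in a one-page ring, rounded down to a power of two."""
    request = REQ_FIXED + segments * SEG_SIZE
    return round_down_pow2((PAGE_SIZE - SRING_HEADER) // request)
```

With 10 segments a request covers at most 40kB, which is too small for a 64kB urb; with 16 segments it covers exactly 64kB, at the cost of fewer ring entries per page.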
Keir Fraser [Wed, 7 Oct 2009 06:33:40 +0000 (07:33 +0100)]
xen: re-synchronize ring.h public header
Patch 20267:e9366bed077e modified the definition of sring in the xen
repo's version of ring.h, but not the version in the linux kernel
repo. That change broke pause/resume/shutdown messages from the
blktap2 kernel module, which (for the time being) relies on pad[0]
being at a consistent location in the sring struct. This patch fixes
this regression by resynchronizing the two files.
Keir Fraser [Tue, 25 Aug 2009 13:55:22 +0000 (14:55 +0100)]
xen/x86: make do_settimeofday() return -EPERM when clock can't be changed
Rather than returning success here (without actually having done
anything), it seems more appropriate/conforming to let the caller know
that what he intended to do didn't succeed.
Keir Fraser [Wed, 5 Aug 2009 11:05:34 +0000 (12:05 +0100)]
xen/x86-64: fix Dom0 boot on AMD K8 CPUs
The workaround in question here should be (and is being) applied by
the hypervisor (which doesn't allow any guest - including Dom0 - to
write other than all zeroes or all ones into MCi_CTL).
Do not go beyond ARRAY_SIZE of info->shadow
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
blktap2: make blktap2 work for auto translated mode with hvm domain.
This patch makes blktap2 work for HVM domains in auto translated
mode (i.e. the IA64 HVM domain case, for which Kuwamura reported a bug).
blktap2 introduced a new feature whereby pages from the self domain
can be handled. However, it doesn't work in auto translated mode
because blktap2 relies on p2m table manipulation, and the p2m
doesn't make sense in auto translated mode.
So self grant mapping is used instead.
Just passing the same page to the blktap2 daemon doesn't work because
the page is locked while IO is in progress, so the page handed over by
the blktap2 block device is already locked. When the blktap2 daemon
issues IO on the page, it tries to lock it again, resulting in deadlock.
Hence the resort to self grant.
Keir Fraser [Mon, 29 Jun 2009 09:57:46 +0000 (10:57 +0100)]
blktap2: remove warnings.
This patch removes the following warnings on ia64.
> linux-2.6.18-xen.hg/drivers/xen/blktap2/device.c: In function
'blktap_device_finish_request':
> linux-2.6.18-xen.hg/drivers/xen/blktap2/device.c:403: warning:
format '%lld' expects type 'long long int', but argument 7 has type 'uint64_t'
> linux-2.6.18-xen.hg/drivers/xen/blktap2/sysfs.c: In function
'blktap_sysfs_debug_device':
> linux-2.6.18-xen.hg/drivers/xen/blktap2/sysfs.c:276: warning: format
'%llu' expects type 'long long unsigned int', but argument 4 has type
'uint64_t'
Keir Fraser [Tue, 23 Jun 2009 10:12:38 +0000 (11:12 +0100)]
xenbus: fix timeout with PV guest and physical CDROM
Specifying a physical CDROM in the configuration of a PV guest, like
disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]
will cause the 300 seconds timeout to occur if there is no physical
CDROM in the tray. The bug is due to the device being Closed (as shown by
the timeout message) but not ready. The configuration is quite bogus, but
this is a regression from when the timeout was 10 seconds only, and
the fix is easy and safe: only check is_ready for connected devices.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
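The rule "only check is_ready for connected devices" can be written as a small predicate. This is a Python model of the rule as described in the commit message, not the patch itself: the function name is hypothetical, and is_ready stands for whatever the driver's readiness callback would report.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    Unknown = 0
    Initialising = 1
    InitWait = 2
    Initialised = 3
    Connected = 4
    Closing = 5
    Closed = 6
    Reconfiguring = 7
    Reconfigured = 8


def still_waiting(state, is_ready):
    """Return True if boot should keep waiting for this device."""
    if state < XenbusState.Connected:
        # Device has not finished connecting yet.
        return True
    # Readiness only matters once the device is actually connected;
    # a Closed device (e.g. empty CDROM tray) must not hold up the wait.
    return state == XenbusState.Connected and not is_ready
```

Under this model a device that is Closed and not ready no longer keeps the wait loop running until the timeout.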
Keir Fraser [Thu, 18 Jun 2009 09:32:16 +0000 (10:32 +0100)]
x86-64: do not pass unmanageable amounts of memory to Dom0
Due to address space restrictions it is not possible to successfully
pass more than about 500GB to a Linux Dom0 unless its kernel specifies
a non-default phys-to-machine map location via XEN_ELFNOTE_INIT_P2M.
For non-Linux Dom0 kernels I can't say whether the limit could be set
to close to 1TB, but since passing such huge amounts of memory isn't
very useful anyway (and can be enforced via dom0_mem=), the patch
doesn't attempt to guess the kernel type and restricts the memory
amount in all cases.
Keir Fraser [Thu, 18 Jun 2009 09:24:18 +0000 (10:24 +0100)]
Transcendent memory ("tmem") for Linux
Tmem, when called from a tmem-capable (paravirtualized) guest, makes
use of otherwise unutilized ("fallow") memory to create and manage
pools of pages that can be accessed from the guest either as
"ephemeral" pages or as "persistent" pages. In either case, the pages
are not directly addressable by the guest, only copied to and fro via
the tmem interface. Ephemeral pages are a nice place for a guest to
put recently evicted clean pages that it might need again; these pages
can be reclaimed synchronously by Xen for other guests or other uses.
Persistent pages are a nice place for a guest to put "swap" pages to
avoid sending them to disk. These pages retain data as long as the
guest lives, but count against the guest memory allocation.
This patch contains the Linux paravirtualization changes to
complement the tmem Xen patch (xen-unstable c/s 19646). It
implements "precache" (ext3 only as of now), "preswap",
and limited "shared precache" (ocfs2 only as of now) support.
CONFIG options are required to turn on
the support (but in this patch they default to "y"). If
the underlying Xen does not have tmem support or has it
turned off, this is sensed early to avoid nearly all
hypercalls.
Lots of useful prose about tmem can be found at
http://oss.oracle.com/projects/tmem
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
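The difference between ephemeral and persistent pools described above can be shown with a toy model. This is a Python sketch under assumptions: the class, its methods and the reclaim() hook are hypothetical and do not correspond to the tmem hypercall interface. It only captures two properties from the text: pages are copied in and out rather than addressed directly, and only ephemeral contents may disappear.

```python
class TmemPool:
    """Toy pool: pages are copied in and out, never shared by reference."""

    def __init__(self, persistent):
        self.persistent = persistent
        self._pages = {}

    def put(self, key, data):
        # Copy the guest's data into the pool.
        self._pages[key] = bytes(data)

    def get(self, key):
        # Copy the data back out; None means the page is no longer there.
        data = self._pages.get(key)
        return None if data is None else bytearray(data)

    def reclaim(self):
        # Stand-in for the hypervisor taking memory back: only ephemeral
        # pools may lose their contents.
        if not self.persistent:
            self._pages.clear()
```

A guest using an ephemeral pool therefore has to treat a failed get() as a cache miss, while a persistent pool behaves like swap that never reaches disk.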
Keir Fraser [Tue, 16 Jun 2009 10:58:55 +0000 (11:58 +0100)]
x86: add MCA logging support in DOM0
When an MCE/CMCI error happens (or is found by polling), the related
error information will be sent to DOM0 by Xen. This patch fetches
the Xen-logged information by hypercall and then converts the Xen-format
log into the Linux MCELOG format. This makes it possible to use the
mcelog tools currently available for native Linux.
With this patch, after MCE/CMCI error log information is sent to DOM0,
running the mcelog tools in DOM0 gives the same detailed decoded MCE
information as on native Linux.
Signed-Off-By: Liping Ke <liping.ke@intel.com>
Signed-Off-By: Yunhong Jiang <yunhong.jiang@intel.com>
Acked-By: Jan Beulich <jbeulich@novell.com>
Keir Fraser [Tue, 16 Jun 2009 10:09:39 +0000 (11:09 +0100)]
blktap: Indirection in vm_area_struct->vm_private_data
The recent patch in linux-2.6.18.hg (878: eba6fe6d8d53) changed the
way that the foreign map is stored in vm_area_struct. Currently the
blktap (not blktap2) implementation is internally inconsistent, which
triggers a kernel bug when a tap:aio disk is used (dump attached at the
end of the email).
Keir Fraser [Mon, 8 Jun 2009 11:23:24 +0000 (12:23 +0100)]
pci: fix pcie-aer recovery mechanism defects.
When an AER error happens, exit if the device is neither hidden nor
assigned. If the device is assigned but not connected by a PV guest, or
is owned by an HVM guest, kill the guest. [sh_info is NULL]
Signed-Off-By: Liping Ke <liping.ke@intel.com>
Signed-Off-By: Yunhong Jiang <yunhong.jiang@intel.com>
Keir Fraser [Fri, 5 Jun 2009 13:01:20 +0000 (14:01 +0100)]
balloon: try harder to balloon up under memory pressure.
Currently if the balloon driver is unable to increase the guest's
reservation it assumes the failure was due to reaching its full
allocation, gives up on the ballooning operation and records the limit
it reached as the "hard limit". The driver will not try again until
the target is set again (even to the same value).
However it is possible that ballooning has in fact failed due to
memory pressure in the host and therefore it is desirable to keep
attempting to reach the target in case memory becomes available. The
most likely scenario is that some guests are ballooning down while
others are ballooning up and therefore there is temporary memory
pressure while things stabilise. You would not expect a well behaved
toolstack to ask a domain to balloon to more than its allocation nor
would you expect it to deliberately over-commit memory by setting
balloon targets which exceed the total host memory.
This patch drops the concept of a hard limit and causes the balloon
driver to retry increasing the reservation on a timer in the same
manner as when decreasing the reservation.
Also if we partially succeed in increasing the reservation
(i.e. receive fewer pages than we asked for) then we may as well keep
those pages rather than returning them to Xen.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
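The retry behaviour described for the balloon driver can be modelled in a few lines. This is a user-space Python sketch, not the driver: balloon_step(), run() and the populate callback are hypothetical names, and each loop iteration stands in for one firing of the retry timer.

```python
def balloon_step(current, target, populate):
    """One timer tick. Returns (new_current, retry_needed)."""
    wanted = target - current
    if wanted <= 0:
        return current, False
    got = populate(wanted)      # host may hand over fewer pages, even 0
    current += got              # keep whatever was granted
    return current, current < target   # re-arm the timer if still short


def run(current, target, populate, max_ticks):
    """Repeat balloon_step until the target is reached or ticks run out."""
    ticks = 0
    retry = True
    while retry and ticks < max_ticks:
        current, retry = balloon_step(current, target, populate)
        ticks += 1
    return current, ticks
```

There is no "hard limit" state in this model: a tick that yields zero pages simply leads to another attempt, and a partial grant is kept rather than handed back.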
Keir Fraser [Thu, 4 Jun 2009 09:33:52 +0000 (10:33 +0100)]
linux: fix blkback/blktap2 interaction
blkback's page map code needs to be accessible to both blkback and
blktap2, irrespective of whether either or both are modules. The
most immediate solution is to break it out into a separate, library-
like component that doesn't need building if either of the two
consumers is configured off, and that gets built as a module if both
consumers are modules.
Also fix the dummy implementation of blkback_pagemap_read(), since
using BUG() there doesn't compile.
Keir Fraser [Wed, 3 Jun 2009 10:22:24 +0000 (11:22 +0100)]
Dom0 PCI: fix SR-IOV function dependency link problem
A PCIe Root Complex Integrated Endpoint does not implement ARI, so this
kind of endpoint uses a 3-bit function number. The function dependency
link of the integrated endpoint should be calculated using the device
number field in conjunction with the value from the function dependency
link register.
A normal SR-IOV endpoint always implements ARI, and its function
dependency link register contains an 8-bit function number (i.e. `devfn'
from the software perspective).
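The two cases above can be expressed as a short calculation. This is a Python sketch for illustration, not the kernel code: pci_devfn() and pci_slot() follow the usual 5-bit device / 3-bit function encoding of devfn, and dependency_link_devfn() with its rc_integrated flag is a hypothetical helper.

```python
def pci_devfn(slot, func):
    """Pack a 5-bit device number and a 3-bit function number."""
    return ((slot & 0x1F) << 3) | (func & 0x07)


def pci_slot(devfn):
    """Extract the 5-bit device number from a devfn."""
    return (devfn >> 3) & 0x1F


def dependency_link_devfn(own_devfn, link_reg, rc_integrated):
    """devfn that the function dependency link points at."""
    if rc_integrated:
        # No ARI: the register holds a 3-bit function number, so combine
        # it with the endpoint's own device number.
        return pci_devfn(pci_slot(own_devfn), link_reg)
    # ARI: the register already holds the full 8-bit function number.
    return link_reg & 0xFF
```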
Keir Fraser [Thu, 28 May 2009 09:04:26 +0000 (10:04 +0100)]
blktap2: add tlb flush properly.
xen_invlpg() flushes the tlb only on the local cpu, but a tlb flush is
needed on all cpus. So replace xen_invlpg() with more appropriate calls.
It may be possible to reduce the number of tlb flushes.
This patch also makes blktap2 compile on ia64, because xen_invlpg()
is x86 specific.
Keir Fraser [Thu, 28 May 2009 08:57:49 +0000 (09:57 +0100)]
PCI pass through: PCIe IO space multiplexing
This is required for more than 16 HVM domains to boot from
PCIe pass-through devices.
Linux as dom0 exclusively assigns IO space to downstream PCI bridges,
and the assignment unit of PCI bridge IO space is 4K. So only up
to 16 PCIe devices can be accessed via IO space within the 64K IO ports.
PCI expansion ROM BIOS often uses IO port access to boot from the
device, so in a virtualized environment this means only up to 16 guest
domains can boot from pass-through devices.
This patch allows PCIe IO space sharing of pass-through device.
- Reassign the IO space of the PCIe devices specified by
"guestiomuldev=[<segment>:]<bus>:<dev>[,[<segment>:]<bus>:<dev>][,...]"
to be shared.
This is implemented as a Linux PCI quirk fixup.
The sharing unit is the PCIe switch, i.e. the IO space of the end point
devices under the same switch will be shared. If there is more than
one switch, two areas of IO space will be used.
- Add the driver which arbitrates accesses to the multiplexed PCIe
IO space. Later qemu-dm will use this.
Limitation:
The IO ports of IO-shared devices can't be accessed from dom0 Linux
device drivers. But this shouldn't be a big issue because the PCIe
specification discourages the use of IO space and recommends that IO
space should be used only for bootable devices with ROM code. OS device
drivers should work without IO space access.
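The 16-device limit and the effect of sharing follow from simple arithmetic. The Python sketch below is illustrative only: it assumes that every pass-through device sits behind its own downstream bridge and that sharing needs one IO window per switch; the function names and the switch/device labels are hypothetical.

```python
IO_PORT_SPACE = 64 * 1024    # 64K IO ports in total
BRIDGE_IO_UNIT = 4 * 1024    # bridge IO windows are assigned in 4K units


def max_bridge_windows():
    """How many 4K bridge windows fit in the 64K IO port space."""
    return IO_PORT_SPACE // BRIDGE_IO_UNIT


def windows_needed(devices_by_switch, shared):
    """IO windows required for the given pass-through devices."""
    if shared:
        # Devices under the same switch share one window (assumption).
        return len(devices_by_switch)
    # Without sharing, each device needs a window of its own.
    return sum(len(devs) for devs in devices_by_switch.values())
```

For example, 20 devices spread over two switches need 20 windows without sharing, which exceeds the 16 available, but only 2 windows when the IO space is multiplexed per switch.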