Keir Fraser [Fri, 10 Dec 2010 10:48:31 +0000 (10:48 +0000)]
xen/netloop: make netloop permanent
With reference to RH BZ#567540 [0], this patch makes the netloop
module permanent (like netback is currently). It reverts parts of
xen-unstable c/s 9019:271cb04a4f2b [1] [2] (though that has a typo:
"__init clean_loopback", so it was probably changed later too).
The patch fixes the problem of "rmmod netloop" hanging, resulting in
blocked tasks and inability to shut down cleanly:
... kernel: unregister_netdevice: waiting for veth4 to become
free. Usage count = 1
The problem was also reported for Debian [3] and on the Fedora-xen
mailing list [4].
Keir Fraser [Tue, 23 Nov 2010 13:58:38 +0000 (13:58 +0000)]
blkback/blktap/netback: Fix CVE-2010-3699
A guest can cause the backend driver to leak a kernel
thread. Such leaked threads hold references to the device, which makes
the device impossible to tear down. If shut down, the guest remains a
zombie domain, the xenwatch process hangs, and most xm commands will
stop working.
This patch tries to do the following, for all of netback, blkback,
blktap:
- identify/extract idempotent teardown operations,
- add/move the invocation of said teardown operation
right before we're about to allocate new resources in the
Connected states.
Keir Fraser [Fri, 19 Nov 2010 13:20:06 +0000 (13:20 +0000)]
blktap2: fix synchronization in blktap_device_run_queue()
c/s 896 (use of blk_rq_map_sg()) made the problem worse, but from what
I can tell there had been races (ring and stats updates) before. If
that's not a correct observation, a perhaps better solution might be
to move the struct scatterlist array out of struct blktap (and make it
e.g. an on-stack variable, the problem being that
blktap_device_process_request() has a pretty large stack frame
already - shrinking this might be possible by moving e.g. the
struct blktap_grant_table and struct blkif_request blkif_req instances
the other way if the locking change here is the right thing to do).
Keir Fraser [Tue, 16 Nov 2010 11:31:19 +0000 (11:31 +0000)]
xen/evtchn: clear secondary CPUs' cpu_evtchn_mask[] after restore
To bind all event channels to CPU#0, it is not sufficient to set all
of its cpu_evtchn_mask[] bits; all other CPUs also need to get their
bits cleared. Otherwise, evtchn_do_upcall() will start handling
interrupts on CPUs they're not intended to run on, which can be
particularly bad for per-CPU ones.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir@xen.org>
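The mask handling described in this entry can be illustrated with a small user-space sketch. It is not the kernel code: the array name, the sizes and both helper functions are hypothetical, and only the idea (set CPU 0's bits and clear everyone else's) comes from the commit message.

```c
#include <string.h>

/* Hypothetical sizes for illustration only; the kernel derives the real
 * ones from its CPU and event-channel limits. */
#define SKETCH_NR_CPUS  4
#define SKETCH_NR_WORDS 2   /* words of event-channel bits per CPU */

static unsigned long sketch_cpu_evtchn_mask[SKETCH_NR_CPUS][SKETCH_NR_WORDS];

/* Bind every event channel to CPU 0: set all of CPU 0's bits and
 * clear the bits of every other CPU. */
static void sketch_bind_all_to_cpu0(void)
{
    int cpu;

    memset(sketch_cpu_evtchn_mask[0], 0xff, sizeof(sketch_cpu_evtchn_mask[0]));
    for (cpu = 1; cpu < SKETCH_NR_CPUS; cpu++)
        memset(sketch_cpu_evtchn_mask[cpu], 0, sizeof(sketch_cpu_evtchn_mask[cpu]));
}

/* Return 1 if the given CPU would handle the given port, else 0. */
static int sketch_cpu_handles_port(int cpu, unsigned int port)
{
    const unsigned int bits = 8 * sizeof(unsigned long);

    if (cpu < 0 || cpu >= SKETCH_NR_CPUS || port >= SKETCH_NR_WORDS * bits)
        return 0;
    return (int)((sketch_cpu_evtchn_mask[cpu][port / bits] >> (port % bits)) & 1UL);
}
```

If only the first memset() were done, a stale bit left in another CPU's mask would still make that CPU pick up the port, which is the situation the patch avoids.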
Keir Fraser [Mon, 15 Nov 2010 09:32:57 +0000 (09:32 +0000)]
pcifront: fix PCI reference leak
Stanse found that when pdev is found and has no driver a reference is
leaked in pcifront_common_process. So add pci_dev_put there. For the
pdev == NULL case, pci_dev_put(NULL) is fine.
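A minimal user-space sketch of the reference-counting pattern behind this fix follows. All names (sketch_pdev, sketch_get, sketch_put, sketch_process) are hypothetical stand-ins; the only properties taken from the entry are that the lookup takes a reference, that the "no driver" path must drop it, and that a NULL-safe put lets one call cover the "not found" case as well.

```c
#include <stddef.h>

struct sketch_pdev {
    int refcount;
    int has_driver;
};

/* Take a reference; tolerates NULL like a failed lookup would. */
static struct sketch_pdev *sketch_get(struct sketch_pdev *d)
{
    if (d != NULL)
        d->refcount++;
    return d;
}

/* Drop a reference; a NULL argument is a no-op. */
static void sketch_put(struct sketch_pdev *d)
{
    if (d != NULL)
        d->refcount--;
}

/* Returns 0 if the device was processed, -1 otherwise.
 * Every path leaves the reference count balanced. */
static int sketch_process(struct sketch_pdev *found)
{
    struct sketch_pdev *d = sketch_get(found);  /* lookup takes a reference */

    if (d == NULL || !d->has_driver) {
        sketch_put(d);  /* NULL-safe, so one call covers both cases */
        return -1;
    }
    /* ... the device would be used here ... */
    sketch_put(d);
    return 0;
}
```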
Keir Fraser [Wed, 3 Nov 2010 08:20:42 +0000 (08:20 +0000)]
netback: take net_schedule_list_lock when removing entry from net_schedule_list
There is a race in net_tx_build_mops between checking if
net_schedule_list is empty and actually dequeuing the first entry on
the list. If another thread dequeues the only entry on the list during
this window we crash because list_first_entry expects a non-empty
list. Therefore, after the initial lock-free check for an empty list,
check again with the lock held before dequeueing the entry.
Based on a patch by Tomasz Wroblewski.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
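The "check without the lock, then check again with it" pattern from this entry can be sketched in user space as below. This is an illustration under assumptions, not the netback code: a pthread mutex stands in for the spinlock, a singly linked list stands in for net_schedule_list, and all sketch_* names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct sketch_entry {
    struct sketch_entry *next;
    int id;
};

static struct sketch_entry *sketch_head;
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

static void sketch_push(struct sketch_entry *e)
{
    pthread_mutex_lock(&sketch_lock);
    e->next = sketch_head;
    sketch_head = e;
    pthread_mutex_unlock(&sketch_lock);
}

static struct sketch_entry *sketch_pop(void)
{
    struct sketch_entry *e;

    /* Cheap unlocked check. The answer may already be stale when we act
     * on it; a real implementation would use an atomic read here. */
    if (sketch_head == NULL)
        return NULL;

    pthread_mutex_lock(&sketch_lock);
    e = sketch_head;        /* check again, now with the lock held */
    if (e != NULL) {
        sketch_head = e->next;
        e->next = NULL;
    }
    pthread_mutex_unlock(&sketch_lock);
    return e;               /* NULL if another thread emptied the list */
}
```

The unlocked check only serves to skip the lock in the common empty case; correctness rests entirely on the second check.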
Keir Fraser [Mon, 4 Oct 2010 12:30:33 +0000 (13:30 +0100)]
linux/pcifront: claim PCI resources also on rescan
Condensed from the following two patches:
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=621d869f36b215d63bb99e7ecd7a11f029821b85
xen-pcifront: Claim PCI resources before going live.
author Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fri, 18 Jun 2010 19:31:47 +0000 (15:31 -0400)
committer Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fri, 18 Jun 2010 19:40:37 +0000 (15:40 -0400)
We were missing the important step of claiming (and setting the
parent of IO and MEM regions to 'PCI IO' and 'PCI mem' respectively)
the BARs. This meant that during hot inserts we would get:
igb 0000:01:00.1: device not available (can't reserve [mem 0xfb840000-0xfb85ffff])
even though the memory region had been reserved before.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=4a65de894fc0af05397eedca180d0ea7d8c6caba
xen-pcifront: Don't race with udev when discovering new devices.
We would inadvertently call 'pci_bus_add_device' right after discovering
the device, but before claiming the BARs. This ended up firing off
a uevent and udev loading the module, and the module failing in
request_region as the BARs had not been claimed. We fix this by holding
off going live until the end, where we call 'pci_bus_add_devices'.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Reported-by: Rafal Wojtczuk <rafal@invisiblethingslab.com>
xen/evtchn: Handle VIRQ_TIMER before any other hardirq in event loop.
This avoids any other hardirq handler seeing a very stale jiffies
value immediately after wakeup from a long idle period. The one
observable symptom of this was a USB keyboard, with software keyboard
repeat, which would always repeat a key immediately after it was
pressed. The key press wakes the guest, the key handler runs
immediately and sees an old jiffies value, and that jiffies value is
then significantly updated before the key is released.
xen/x86: synchronize vmalloc_sync_all() with mm_{,un}pin()
As recently diagnosed by Citrix engineers, mm_{,un}pin() and
vmalloc_sync_all() aren't properly synchronized. So we add a backlink
to the referencing struct mm_struct to the pgd's struct page, and use
this to lock the page table updates in vmalloc_sync_all().
Due to the way pgd-s get allocated and put on the global list on i386,
we have to account for the backlink not being set yet (in which case
they cannot be subject to (un)pinning).
Along the way, I found it necessary/desirable to also fix
- a potential NULL dereference in i386's pgd_alloc(),
- x86-64 adding not yet cleaned out pgd-s to the global list, and
- x86-64 removing pgd-s from the global list rather late.
xen/blkfront: forward unknown IOCTLs to scsi_cmd_ioctl() for /dev/sdX
Certain utilities (here: parted) expect certain SCSI IOCTLs (here:
SCSI_IOCTL_GET_IDLUN) to not fail on /dev/sdX devices. Rather than
handling them one-by-one, just forward control to scsi_cmd_ioctl().
Handle GNTST_eagain status from GNTTABOP_map_grant_ref and
GNTTABOP_copy operations properly to allow usage of xenpaging without
causing crashes or data corruption.
Remove the existing retry code from net_rx_action(),
dispatch_rw_block_io(), net_accel_map_grants_contig() and
net_accel_map_iomem_page() and replace all relevant
HYPERVISOR_grant_table_op() calls with a retry loop. This loop is
implemented as a macro to allow different GNTTABOP_* args. It will
sleep up to 33 seconds and wait for the page to be paged in again.
All ->status checks were updated to use the GNTST_* namespace. All
return values are converted from GNTST_* namespace to 0/-EINVAL, since
all callers did not use the actual return value.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Patrick Colp <pjcolp@cs.ubc.ca>
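A retry loop wrapped in a macro, as described in this entry, might look roughly like the user-space sketch below. Everything here is an assumption made for illustration: the status values, the fake hypercall, the sleep stub and the macro name are hypothetical, and the linearly growing delay capped at 255 ms is just one schedule whose total (1 + 2 + ... + 255 = 32640 ms) comes close to the 33 seconds mentioned above.

```c
#include <errno.h>

#define SKETCH_ST_OKAY    0
#define SKETCH_ST_EAGAIN  (-12)   /* hypothetical value */

struct sketch_op {
    int status;
    int attempts_until_ok;   /* test knob: how often the fake call says "again" */
};

static unsigned int sketch_slept_ms;

/* Stand-in for a real sleep; only accounts for the time requested. */
static void sketch_sleep_ms(unsigned int ms)
{
    sketch_slept_ms += ms;
}

/* Stand-in for the hypercall: reports EAGAIN until the page is "back". */
static void sketch_hypercall(struct sketch_op *op)
{
    if (op->attempts_until_ok > 0) {
        op->attempts_until_ok--;
        op->status = SKETCH_ST_EAGAIN;
    } else {
        op->status = SKETCH_ST_OKAY;
    }
}

/* A macro rather than a function so that any call taking an argument
 * with a ->status field can be wrapped. */
#define SKETCH_RETRY_WHILE_EAGAIN(call, op_p)                         \
    do {                                                              \
        unsigned int delay_ms_ = 1;                                   \
        call(op_p);                                                   \
        while ((op_p)->status == SKETCH_ST_EAGAIN &&                  \
               delay_ms_ <= 255) {                                    \
            sketch_sleep_ms(delay_ms_++);                             \
            call(op_p);                                               \
        }                                                             \
    } while (0)

/* Callers only see 0 or -EINVAL, as in the entry above. */
static int sketch_map(struct sketch_op *op)
{
    SKETCH_RETRY_WHILE_EAGAIN(sketch_hypercall, op);
    return op->status == SKETCH_ST_OKAY ? 0 : -EINVAL;
}
```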
Because xen-compat.h defines __XEN_INTERFACE_VERSION__ only if __XEN__
or __XEN_TOOLS__ is defined, I added #include xen-compat.h before
#define __XEN_TOOLS__.
I confirmed that dom0 kernel could be built without warnings and
guests could be created.
xen: netback: save interrupt state in add_to_net_schedule_list_tail
add_to_net_schedule_list_tail is called from both hard interrupt
context (add_to_net_schedule_list_tail) and soft interrupt/process
context (netif_schedule_work) so use the interrupt state saving
spinlock variants.
xen/x86: make __direct_remap_pfn_range()'s return value meaningful
From: Olaf Hering <ohering@novell.com>
This change fixes the xc_map_foreign_bulk interface, which would
otherwise cause SIGBUS when pages are gone because -ENOENT is not
returned as expected by the IOCTL_PRIVCMD_MMAPBATCH_V2 ioctl.
Because the cpumap member of struct xen_sysctl_cpupool_op is used only
when the operation is XEN_SYSCTL_CPUPOOL_OP_INFO or
XEN_SYSCTL_CPUPOOL_OP_FREEINFO, in case of others, xencomm_map to
cpumap fails, thus XEN_SYSCTL_cpupool_op fails.
Keir Fraser [Tue, 10 Aug 2010 14:47:41 +0000 (15:47 +0100)]
xen/x86: eliminate nesting of run-queue locks inside xtime_lock
From: Zdenek Salvet <salvet@ics.muni.cz>
According to Debian bug 591362 this has been causing problems. While
no proof was given that the inverse lock order does actually occur
anywhere (with interrupts enabled), it is plain unnecessary to take
the risk.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Keir Fraser [Tue, 10 Aug 2010 14:46:56 +0000 (15:46 +0100)]
blktap2: eliminate bogus clearing of PG_reserved
While making sure PG_reserved is set for pages allocated from the
balloon driver (and to be used for I/O) is a necessary thing to do
(as 2.6.18's as well as pv-ops' balloon drivers don't guarantee this
for the pages returned from alloc_empty_pages_and_page_vec()),
clearing this flag again when a page is no longer in use for I/O is
bogus at best (after all, the page at that point is not associated
with any MFN anymore), and causes problems when the balloon driver
properly marks all such pages as reserved and checks, upon their
return, that they are still marked this way.
Keir Fraser [Mon, 2 Aug 2010 10:02:18 +0000 (11:02 +0100)]
xenoprofile: Add IBS support
Add IBS support for AMD family 10h processors. The major
implementation is derived from latest Linux. Two hypercalls are added,
which are necessary for IBS feature detection and user mode parameter
read.
Keir Fraser [Fri, 18 Jun 2010 13:11:57 +0000 (14:11 +0100)]
xen/x86: fix for special behavior of first sys_settimeofday(NULL, &tz) invocation
The data Xen's time implementation maintains to make do_gettimeofday()
return values monotonic needs to be reset not only during normal
do_gettimeofday() invocations, but also when the clock gets warped
due to the hardware (CMOS) clock running on local (rather than UTC)
time.
Additionally there was a time window in do_gettimeofday() (between
the end of the xtime read loop and the acquiring of the monotonicity
data lock) where, if on another processor do_settimeofday() would
execute to completion, the zeroes written by the latter could get
overwritten by the former with values obtained before the time was
updated. This now gets prevented by maintaining a version for the
monotonicity data.
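The version idea from the last paragraph can be sketched as follows. This is a single-threaded illustration, not the Xen time code: the structure, the function names and the choice to simply skip the write-back on a version mismatch are assumptions, and the lock that guards the data in the real code is left out.

```c
#include <stdint.h>

struct sketch_mono {
    unsigned int version;   /* bumped on every reset */
    uint64_t last_usec;     /* last value handed out, 0 after a reset */
};

static struct sketch_mono sketch_mono;

/* "settimeofday" side: reset the monotonicity data and bump the version. */
static void sketch_reset(void)
{
    sketch_mono.last_usec = 0;
    sketch_mono.version++;
}

/* "gettimeofday" side, step 1: sample the version before reading time. */
static unsigned int sketch_begin(void)
{
    return sketch_mono.version;
}

/* Step 2: clamp so results never go backwards, but record the value only
 * if no reset happened since sketch_begin(). */
static uint64_t sketch_finish(unsigned int seen_version, uint64_t now_usec)
{
    if (seen_version != sketch_mono.version)
        return now_usec;    /* stale sample: leave the reset data alone */
    if (now_usec < sketch_mono.last_usec)
        now_usec = sketch_mono.last_usec;
    sketch_mono.last_usec = now_usec;
    return now_usec;
}
```

Without the version comparison, a reader that sampled the time before a reset could write its old value back afterwards, undoing the zeroes written by the reset.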
This fixes the following errors:
/arch/ia64/xen/xcom_privcmd.c: In function `xencomm_privcmd_sysctl':
/arch/ia64/xen/xcom_privcmd.c:295: error: case label not within a
switch statement
/arch/ia64/xen/xcom_privcmd.c:305: error: break statement not within
loop or switch
c/s 1018:b7eb9756e522 inserted lines outside of a switch statement.
This patch corrects that.
blktap: fix cleanup after unclean application exit #2
When an application using blktap devices doesn't close the mmap-s of
/dev/xen/blktapN and the frontend driver never connects, we cannot
defer the mmput() on the stored mm until blktap_release() or the exit
path of the worker thread, as the former will never be called without
the mm's reference count dropping to zero, and the worker thread
would never get started.
- array indices got checked after having indexed the array already
- several were off by one
- BLKTAP_IOCTL_FREEINTF should not be used on other than the control
device (or the logic should be changed so that, when used this way,
only the respective device can be freed)
- BLKTAP_IOCTL_MINOR can reasonably also be used on non-control
devices (returning that device's minor and ignoring the passed in
argument)
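The first two items in the list above (checking an index only after it was used, and off-by-one limits) come down to the pattern sketched here. The table, its size and the lookup function are hypothetical and only illustrate the ordering and the comparison operator.

```c
#include <stddef.h>

#define SKETCH_MAX_DEVS 4

struct sketch_tapdev {
    int minor;
};

static struct sketch_tapdev *sketch_devs[SKETCH_MAX_DEVS];

/* Validate the index before touching the array. Valid indices are
 * 0 .. SKETCH_MAX_DEVS - 1, so the upper check must be ">=", not ">". */
static struct sketch_tapdev *sketch_lookup(long idx)
{
    if (idx < 0 || idx >= SKETCH_MAX_DEVS)
        return NULL;
    return sketch_devs[idx];
}
```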
xen/blktap: fix cleanup after unclean application exit
When an application using blktap devices doesn't close the file handle
(or mmap-s) of /dev/xen/blktapN, we cannot defer the mmput() on the
stored mm until blktap_release(), as that will never be called without
the mm's reference count dropping to zero.
Keir Fraser [Tue, 30 Mar 2010 17:28:34 +0000 (18:28 +0100)]
xen/balloon: Fix return value interpretation for XENMEM_get_pod_target
Unfortunately c/s 989 didn't consider what I would call a quirk in
pre-3.4 Xen, which results in XENMEM_get_pod_target calls not
returning -ENOSYS as one would normally expect.
Keir Fraser [Mon, 1 Mar 2010 09:56:15 +0000 (09:56 +0000)]
blktap2: Fix queue restart, racing block device removal.
Makes tapdisk context test dev->gd before attempting a queue restart,
with the device lock held. Fixes a race lost against device
destruction, which may be issued anywhere on the control path.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Keir Fraser [Mon, 1 Mar 2010 09:55:09 +0000 (09:55 +0000)]
Guest SR-IOV: Replace previous changeset with a more complete implementation from Intel
"""Guest SR-IOV support for PV guest
These changes are for PV guest to use Virtual Function. Because the
VF's vendor, device registers in cfg space are 0xffff, which are
invalid and ignored by PCI device scan. Values in 'struct pci_dev' are
fixed up by SR-IOV code, and using these values will present correct
VID and DID to PV guest kernel.
And command registers in the cfg space are read only 0, which means we
have to emulate MMIO enable bit (VF only uses MMIO resource) so PV
kernel can work properly."""
Keir Fraser [Mon, 22 Feb 2010 10:03:18 +0000 (10:03 +0000)]
linux/x86: fix long timeout handling in stop_hz_timer()
Unlike for HYPERVISOR_set_timer_op, zero doesn't mean "no timeout"
for VCPUOP_set_singleshot_timer (but should be retained rather than
adjusted by NS_PER_TICK/2 for the former).
Also properly cancel the singleshot timer in start_hz_timer().
Keir Fraser [Mon, 1 Feb 2010 14:12:36 +0000 (14:12 +0000)]
xen/balloon: fix balloon driver accounting for HVM-with-PoD case
With PoD, ballooning down a guest to the target set through xenstore
based on its totalram_pages value isn't sufficient, since that value
doesn't include all the pages assigned to the guest. Since the delta
is static, determine it once at load time.
Keir Fraser [Mon, 18 Jan 2010 10:46:43 +0000 (10:46 +0000)]
xen/blkfront: fixes for 'xm block-detach ... --force'
Prevent prematurely freeing 'struct blkfront_info' instances (when the
xenbus data structures are gone, but the Linux ones are still needed).
Prevent adding a disk with the same (major, minor) [and hence the same
name and sysfs entries, which leads to oopses] when the previous
instance wasn't fully de-allocated yet.
This still doesn't address all issues resulting from forced detach:
I/O submitted after the detach still blocks forever, likely preventing
subsequent un-mounting from completing. It's not clear to me (not
knowing much about the block layer) how this can be avoided.
This also doesn't address issues with duplicate device creation caused
by re-using the hdXX and sdXX name spaces - this would require
synchronization with the respective native code.
Keir Fraser [Wed, 13 Jan 2010 08:11:51 +0000 (08:11 +0000)]
privcmd: add new (replacement) mmap-batch ioctl
While the error indicator of IOCTL_PRIVCMD_MMAPBATCH should be in the
top nibble (it is documented that way in include/xen/public/privcmd.h
and include/xen/compat_ioctl.h), it really wasn't for 64-bit
implementations. With MFNs now possibly being 32 or more bits wide on
x86-64, using bits 28-31 as failure indicator (and bit 31 as paged-out
indicator) is no longer acceptable. Instead, a new ioctl with a
separate error indication array is being introduced.
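Why in-band error bits stop working, and what a separate error array looks like, can be sketched as below. This is not the privcmd interface: the structure layout, the function names and the pretend mapper (which reports MFN 0 as missing) are hypothetical and serve only to show the difference between the two schemes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Old scheme: failure is flagged in bits 28-31 of the entry itself. */
static int sketch_old_entry_failed(uint64_t entry)
{
    return (entry & 0xf0000000ULL) != 0;
}

/* New scheme: MFNs are input only, errors go to a parallel array. */
struct sketch_batch {
    size_t num;
    const uint64_t *mfns;   /* in: left untouched */
    int *err;               /* out: 0 or a negative errno per entry */
};

/* Pretend mapper for the sketch: MFN 0 counts as "not present". */
static int sketch_map_one(uint64_t mfn)
{
    return mfn == 0 ? -ENOENT : 0;
}

/* Fill the error array and return the number of failed entries. */
static int sketch_map_batch(struct sketch_batch *b)
{
    size_t i;
    int failed = 0;

    for (i = 0; i < b->num; i++) {
        b->err[i] = sketch_map_one(b->mfns[i]);
        if (b->err[i] != 0)
            failed++;
    }
    return failed;
}
```

With the old scheme a perfectly valid MFN such as 0x10000000 is indistinguishable from a failed entry; with the separate array the full width of the MFN stays available.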
Keir Fraser [Fri, 8 Jan 2010 13:07:17 +0000 (13:07 +0000)]
Update sfc_netback driver to match sfc_resource 3.0.2.2074
Add support for direct guest access and acceleration of SFC9000 series
NICs.
Improve handling of NIC reset in sfc_netback
Remove nic_index state and replace with if_index from struct net_device
Remove duplication of header files with sfc_resource driver
Keir Fraser [Fri, 8 Jan 2010 13:05:49 +0000 (13:05 +0000)]
Update Solarflare Communications net driver to version 3.0.2.2074
Bring net driver in Xen tree in line with kernel.org tree
Add support for new SFC9000 series NICs
Keir Fraser [Wed, 6 Jan 2010 08:38:09 +0000 (08:38 +0000)]
xen/privcmd: fix for proper operation in compat mode
- sizeof(struct privcmd_mmapbatch_32) was wrong
- MFN array must be translated for IOCTL_PRIVCMD_MMAPBATCH
Also, the error indicator of IOCTL_PRIVCMD_MMAPBATCH should be in the
top nibble (it is documented that way in include/xen/public/privcmd.h
and include/xen/compat_ioctl.h), but since that is an incompatible
change it is not being done here (instead, a new ioctl with proper
behavior will need to be added).
Keir Fraser [Wed, 6 Jan 2010 08:15:35 +0000 (08:15 +0000)]
xenoprof: dynamic buffer array allocation
The recent change to locally define MAX_VIRT_CPUS wasn't really
appropriate - with there not being a hard limit on the number of
vCPU-s anymore, these arrays should be allocated dynamically.
Keir Fraser [Wed, 6 Jan 2010 08:14:10 +0000 (08:14 +0000)]
xen/privcmd: convert single shot check to be per-page
For the sake of not breaking the ia64 build, old behavior is being
retained when HAVE_ARCH_PRIVCMD_MMAP. Hopefully someone able to
test ia64 can fix this up in the not too distant future.
Keir Fraser [Wed, 16 Dec 2009 16:44:12 +0000 (16:44 +0000)]
xen/backends: simplify address translations
There are quite a number of places where e.g. page->va->page
translations happen.
Besides yielding smaller code (source and binary), a second goal is to
make it easier to determine where virtual addresses of pages allocated
through alloc_empty_pages_and_pagevec() are really used (in turn in
order to determine whether using highmem pages would be possible
there).
Keir Fraser [Mon, 7 Dec 2009 14:14:28 +0000 (14:14 +0000)]
netback: Fixes for delayed copy of tx network packets.
- Should call net_tx_action_dealloc() even when dealloc ring is
empty, as there may in any case be work to do on the
pending_inuse list.
- Should not exit directly from the middle of the tx_action tasklet,
as the tx_pending_timer should always be checked and updated at the
end of the tasklet.