Daniel Stodden [Thu, 30 Sep 2010 21:01:47 +0000 (14:01 -0700)]
CA-44322: Restrict I/O request merging on filesystems.
Ensure that every single iocb can be issued with only the memory
reserves held in kernel space. The resource most prone to congestion
is the bio struct.
For I/O contiguous in physical storage, such as bare LUN mappings, a
single bio will hold up to 256 pages. To accommodate block mappings on
file systems, we reserve more than one bio, but cannot submit iocbs of
arbitrary length without risking a stall once the reserve is
exhausted.
Limits the iocb size on ext. Assumes 4k blocks for now.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
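A minimal sketch of the size limit described in CA-44322, assuming the worst case where every 4k file system block ends up in a bio of its own. The helper names and the idea of passing the reserve size as a parameter are illustrative assumptions, not taken from blktap.

```c
#include <assert.h>
#include <stddef.h>

#define FS_BLOCK_SIZE 4096u   /* the commit assumes 4k blocks */

/*
 * Hypothetical helper: largest request, in bytes, that can be issued
 * when each file system block may need a bio of its own and only
 * `reserved_bios` bios are guaranteed to be available.
 */
static size_t max_iocb_bytes(unsigned int reserved_bios)
{
    return (size_t)reserved_bios * FS_BLOCK_SIZE;
}

/* Clamp a merged request so it never outgrows the reserve. */
static size_t clamp_iocb_bytes(size_t len, unsigned int reserved_bios)
{
    size_t max = max_iocb_bytes(reserved_bios);

    assert(reserved_bios > 0);
    return len > max ? max : len;
}
```

With a reserve of 16 bios, a merged 1 MiB request would be cut down to 64 KiB, while a single 4k request passes through unchanged.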
Daniel Stodden [Thu, 30 Sep 2010 21:01:45 +0000 (14:01 -0700)]
CA-44322: Add an I/O-submit thread.
Slightly annoying to add threads to core blktap code, but necessary to
avoid potential starvation when dom0 comes under memory
pressure. Blktap can guarantee io_submit makes progress by keeping
memory reserves, but not enough to guarantee that it's
non-blocking. To refill the reserves, we want in-flight I/O to keep
completing in the main event loop.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Thu, 30 Sep 2010 21:01:40 +0000 (14:01 -0700)]
CA-46079: Recover the image storage type.
Used to be a message parameter passed in at open time, then down
through the VBD and images up to the driver. Replaced by stat() and
statfs().
The vbd->storage field stopped being really applicable once cross-SR
VHD chains became more popular, so it is removed. We keep the
driver->storage, but only for verbosity. Drivers with type-dependent
code that call tapdisk_storage_type() during td_open() are encouraged
to store the result here.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Thu, 30 Sep 2010 21:01:35 +0000 (14:01 -0700)]
CA-46079: Remove the image reopen hack.
Used to reopen the image chain on the first request, thereby detecting
guest activation after migration. Obsolete since tapdisk is
spawned/resumed after VM stop/copy now.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Thu, 9 Sep 2010 09:05:17 +0000 (02:05 -0700)]
CA-44974: Make tap_ctl_close idempotent.
Avoid potential freelist/conn vector corruption due to
double-frees. Upgrade the WARN_ON() to a panic(), since the present
drain loop doesn't want to be asked after disconnect.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
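What "idempotent close" means in CA-44974 can be illustrated with a small sketch. The struct and field names are hypothetical; the point is only that a second close finds nothing left to release and so cannot double-free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical connection slot; field names are illustrative. */
struct conn {
    int   in_use;   /* nonzero while the slot is allocated */
    char *buf;      /* heap buffer owned by the slot */
};

/*
 * Idempotent close: the first call releases the slot, later calls
 * see in_use == 0 and return without touching freed state.
 */
static void conn_close(struct conn *c)
{
    assert(c != NULL);
    if (!c->in_use)
        return;
    free(c->buf);
    c->buf = NULL;
    c->in_use = 0;
}
```

Calling conn_close() twice on the same slot is then harmless, which is the property the commit is after.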
Daniel Stodden [Tue, 7 Sep 2010 02:42:26 +0000 (19:42 -0700)]
CA-44675: Fix parent cache corruption due to I/O crosstalk.
Previous patch requeued completing ring I/O buffers.
An interesting question is why this succeeds without a proper tapdisk
crash. The only sane explanation I can come up with is that the common
path manages to queue AIOs before our response hits the kernel so
unmap goes after GUP page translation. That does not sound too
improbable: the target leaf vhd bitmap was likely still hot at this
point.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Tue, 17 Aug 2010 08:56:28 +0000 (01:56 -0700)]
blktap: Write tapdisk-control response data asynchronously and fully buffered.
Tapdisk should never block on control connections. We preallocate
connection state for a number of clients, including buffer space for
the response (presently 4k/conn). Then stream back response data
asynchronously, driven by the event loop.
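The buffering scheme above can be sketched as follows, assuming a fixed 4k buffer per connection and a non-blocking file descriptor. All names are hypothetical; this is not the tapdisk-control code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define CONN_BUF_SIZE 4096   /* 4k per connection, as in the commit */

/* Hypothetical per-connection output state, preallocated up front. */
struct conn_out {
    char   buf[CONN_BUF_SIZE];
    size_t len;    /* bytes queued */
    size_t done;   /* bytes already written */
};

/* Queue response data; fails with -ENOSPC instead of blocking or growing. */
static int conn_out_queue(struct conn_out *o, const void *data, size_t n)
{
    assert(o != NULL && data != NULL);
    if (n > CONN_BUF_SIZE - o->len)
        return -ENOSPC;
    memcpy(o->buf + o->len, data, n);
    o->len += n;
    return 0;
}

/*
 * Called from the event loop when fd is writable; fd is assumed to be
 * O_NONBLOCK. Returns 1 once everything was sent, 0 if more remains,
 * or -errno on a hard error.
 */
static int conn_out_flush(struct conn_out *o, int fd)
{
    assert(o != NULL);
    while (o->done < o->len) {
        ssize_t n = write(fd, o->buf + o->done, o->len - o->done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;   /* try again on the next writable event */
            return -errno;
        }
        if (n == 0)
            return 0;
        o->done += (size_t)n;
    }
    o->len = o->done = 0;   /* buffer drained, ready for reuse */
    return 1;
}
```

Because the buffer is preallocated and the write never blocks, a slow or stuck control client can only ever stall its own connection.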
Daniel Stodden [Fri, 6 Aug 2010 10:53:33 +0000 (03:53 -0700)]
CA-43084: Remove blocking opportunities from tapdisk-control.
Certainly should not syslog(), but use tlog instead. While we are at
it, preallocate the connection structs, too. Fixes a memleak
introduced by 509:abadd2f7ca77 (control op cancellation). Now uses
tlog_syslog. The "received"/"sending" noise should be avoided, but the
logging aids debugging in the meantime.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Andrei Lifchits [Tue, 20 Jul 2010 16:20:24 +0000 (17:20 +0100)]
CP-1732, CP-1733, CP-1734: add local caching support (alpha quality). This includes:
- read caching into the leaf
- mirror write mode
- failover to the secondary image on ENOSPC
- possibility of snapshotting of empty images (for cache setup)
Daniel Stodden [Wed, 14 Jul 2010 00:47:39 +0000 (17:47 -0700)]
blktap2: Redo tap-ctl-list.
Consolidate all outputs to tap_list_t on list_heads, removing the
overcomplicated vectors. Includes a change to the tap_ctl_list
signature, accordingly. Simplifies the old join3 code, now
inlined. Remove the obsolete tap_list id. Introduces list_move and
list_splice_tail macros to list.h.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Tue, 29 Jun 2010 20:03:09 +0000 (13:03 -0700)]
All: Fix RPM install path definitions.
Help make libraries install to the correct location without Makefile
divergence from OSS org. Removes private definitions of LIBDIR from the
Makefiles. LIBDIR is globally defined by the toplevel make at xen.org,
and now by the RPM build accordingly.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Tue, 29 Jun 2010 20:03:09 +0000 (13:03 -0700)]
blktap2: Update tap-ctl timeouts and blocking behavior.
Adds a -t <timeout> switch to the destroy, pause and close operations,
which are known to block. Changes the library call interfaces to use
timeval structs, not <int> secs, which are then passed around
internally. On Linux, the effect should be that we track the total
timeout across calls, not a maximum for individual operations.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
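One portable way to get a total timeout across several calls, rather than a fresh maximum per operation, is to turn the timeval into an absolute deadline and recompute the remainder before each blocking step. The sketch below illustrates that idea only; the helper names are hypothetical and this is not necessarily how tap-ctl does it.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper: absolute deadline = now + timeout. */
static struct timeval deadline_after(struct timeval now, struct timeval timeout)
{
    struct timeval d;

    d.tv_sec  = now.tv_sec + timeout.tv_sec;
    d.tv_usec = now.tv_usec + timeout.tv_usec;
    if (d.tv_usec >= 1000000) {
        d.tv_sec  += 1;
        d.tv_usec -= 1000000;
    }
    return d;
}

/* Time left until the deadline, clamped to zero once it has passed. */
static struct timeval time_left(struct timeval now, struct timeval deadline)
{
    struct timeval r = { 0, 0 };

    assert(now.tv_usec < 1000000 && deadline.tv_usec < 1000000);
    if (now.tv_sec > deadline.tv_sec ||
        (now.tv_sec == deadline.tv_sec && now.tv_usec >= deadline.tv_usec))
        return r;
    r.tv_sec  = deadline.tv_sec - now.tv_sec;
    r.tv_usec = deadline.tv_usec - now.tv_usec;
    if (r.tv_usec < 0) {
        r.tv_sec  -= 1;
        r.tv_usec += 1000000;
    }
    return r;
}
```

Each select() or read step would then be given time_left() as its timeout, so the sum of all waits never exceeds what the caller passed with -t.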
Daniel Stodden [Tue, 29 Jun 2010 20:03:09 +0000 (13:03 -0700)]
blktap2: Synchronize device removal when removing the vbd.
Employ a new ring ioctl, BLKTAP2_IOCTL_REMOVE_DEVICE, which succeeds
only once the bdev is closed. Iterating events then safely drains the
queue. If not implemented, fall back to the previous behavior, which
may fail requests.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Tue, 29 Jun 2010 20:03:09 +0000 (13:03 -0700)]
blktap2: Secure control event loop recursions.
Mask connection from the loop fdset while processing pending
requests.
Annotate message types, separating reentrant from non-reentrant control
operations. Non-reentrant operations fail with EBUSY. There is no
use case for deferrals; arbitration is up to the upper levels in the
control stack.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
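The annotation scheme can be sketched with a flag table and a nesting counter. The message types, the flag name and the counter are assumptions for illustration; the sketch models "recursion" as any request arriving while another one is still being processed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical message types; not the tapdisk-control ones. */
enum msg_type { MSG_PID, MSG_LIST, MSG_OPEN, MSG_CLOSE, MSG_MAX };

#define MSG_REENTRANT 0x1

/* Per-type annotation: which operations may run during recursion. */
static const int msg_flags[MSG_MAX] = {
    [MSG_PID]   = MSG_REENTRANT,
    [MSG_LIST]  = MSG_REENTRANT,
    [MSG_OPEN]  = 0,
    [MSG_CLOSE] = 0,
};

struct control {
    int depth;   /* number of requests currently being processed */
};

/* Returns 0 if the request may proceed, -EBUSY if it must be refused. */
static int control_begin(struct control *ctl, enum msg_type type)
{
    assert(ctl != NULL && type < MSG_MAX);
    if (ctl->depth > 0 && !(msg_flags[type] & MSG_REENTRANT))
        return -EBUSY;   /* no deferral: the caller has to arbitrate */
    ctl->depth++;
    return 0;
}

/* Must pair with every successful control_begin(). */
static void control_end(struct control *ctl)
{
    assert(ctl != NULL && ctl->depth > 0);
    ctl->depth--;
}
```

A nested list request is let through while an open is in progress, whereas a nested close gets EBUSY and is left for the upper layers to retry.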
Daniel Stodden [Sat, 5 Jun 2010 02:05:01 +0000 (19:05 -0700)]
blktap: Fix tapdisk disktype issues.
Stop coercing drivers/disktype code into the tool stack. Make both
blktapctrl and tap-ctl transfer type/path pairs as "<type>:<path>"
strings. Remove the message.disktype integer altogether.
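Splitting a "<type>:<path>" string is straightforward. A minimal sketch, with a hypothetical function name and error convention:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "<type>:<path>" at the first colon. On success the type name
 * is copied into `type` and *path points at the path inside `spec`.
 * Returns -EINVAL for a malformed string and -ENAMETOOLONG if the
 * type name does not fit into the caller's buffer.
 */
static int parse_type_path(const char *spec, char *type, size_t type_len,
                           const char **path)
{
    const char *sep;
    size_t n;

    assert(spec != NULL && type != NULL && path != NULL);
    sep = strchr(spec, ':');
    if (sep == NULL || sep == spec || sep[1] == '\0')
        return -EINVAL;
    n = (size_t)(sep - spec);
    if (n >= type_len)
        return -ENAMETOOLONG;
    memcpy(type, spec, n);
    type[n] = '\0';
    *path = sep + 1;
    return 0;
}
```

Only the first colon is treated as the separator, so paths that themselves contain colons survive intact.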
Daniel Stodden [Sun, 30 May 2010 01:34:14 +0000 (18:34 -0700)]
blktap2: The tap-ctl control utility and library.
Tapdisk control in userspace, a replacement for the original blktap2
control stack, which had to pass a kernel space interface based on
sysfs nodes.
All tapdisk processes listen for commands on a unix stream socket. The
control library supports scanning the socket namespace for running
tapdisks, VBD minors allocated, associated images and state inquiry.
Control operations include allocating/releasing devices, spawning
tapdisks, opening/closing images, attaching disk images to
devices, disk pause/resume operations, and runtime switching of disk
images.
Signed-off-by: Jake Wires <jake.wires@citrix.com>
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Daniel Stodden [Fri, 16 Apr 2010 02:06:04 +0000 (19:06 -0700)]
CA-40171: Validate vhd parent chain during LV reactivation.
Implementation of tapdisk_vbd_reactivate_volumes follows the vhd
parent chain, but therein lacks a critical check matching child and
parent uuids.
This creates a race window wherein reactivation hits an lv resize on
the master. The new last sector, while unrewritten, may carry garbage
footers. Results may vary, from plain reactivation failures to chain
traversal running off into the weeds.
Fixed with a proper uuid check. Adds some eprintfs to aid debugging
the corner cases.
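The check itself amounts to comparing the parent uuid recorded in the child against the uuid found in the candidate parent. A sketch with a hypothetical, trimmed-down struct (the real VHD headers carry far more):

```c
#include <assert.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical, trimmed-down view of a VHD image's identity. */
struct image_id {
    unsigned char uuid[UUID_LEN];         /* this image's own uuid */
    unsigned char parent_uuid[UUID_LEN];  /* uuid the image expects of its parent */
};

/* Returns 1 if `parent` really is the parent that `child` refers to. */
static int parent_matches(const struct image_id *child,
                          const struct image_id *parent)
{
    assert(child != NULL && parent != NULL);
    return memcmp(child->parent_uuid, parent->uuid, UUID_LEN) == 0;
}
```

A garbage footer read during the resize window is then very unlikely to carry the expected uuid, so traversal stops with an error instead of following it.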
Daniel Stodden [Mon, 5 Apr 2010 19:35:18 +0000 (12:35 -0700)]
CA-29373: Unwedge queue after new request failure.
A tapdisk_vbd_issue_request failure is an indication to back off from
further queue processing. The call, however, will also return the error
status of a synchronous request failure.
When failing a new request immediately, we stop making progress. We
break out of the loop, subsequent incoming requests stay on the
new_requests lists, failed_requests is empty, so we block
indefinitely.
This patch decouples queue status from request status: we only back
off if the queue test fails, not the request. On the other hand, this
means we won't back off after the first EBUSY encountered. One may
argue the decision about what's busy and what isn't is better made by
the driver, not the VBD.
Daniel Stodden [Mon, 5 Apr 2010 19:35:16 +0000 (12:35 -0700)]
CA-29373: Stop retrying the pathetic case.
Separating the retryable from the recoverable errors should make the
control path more responsive to broken SRs or images. Fail vreqs with
irrecoverable errors immediately. Presently known ones comprise ESTALE
and ENOSPC. Neither has a great prospect of improving within the
next two minutes.
Note that this somewhat obsoletes ENOSPC as our original reason for
adding a tapdisk-level forced-shutdown. However, there may still be
reasons to discard retryable requests.
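The split between the two error classes can be sketched as a simple predicate. Only ESTALE and ENOSPC come from the commit message; the function name is hypothetical, and treating every other error as retryable is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns 1 if retrying cannot reasonably help, so the vreq should be
 * failed immediately. Takes a positive errno value.
 */
static int error_is_irrecoverable(int err)
{
    assert(err >= 0);
    switch (err) {
    case ESTALE:
    case ENOSPC:
        return 1;
    default:
        return 0;   /* assumption: everything else is worth a retry */
    }
}
```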
Daniel Stodden [Mon, 5 Apr 2010 19:35:16 +0000 (12:35 -0700)]
CP-1613: Fix compiler warnings
The vhd_cache_init/enabled calls were redeclared inline after their
original declaration, yielding an ugly warning. These are module
members, not header macros. Safe to leave the inlining to the
compiler.
Daniel Stodden [Mon, 5 Apr 2010 19:35:15 +0000 (12:35 -0700)]
CA-39535: Break call chain recursion during force-shutdown.
Pending requests during shutdown flag TD_VBD_SHUTDOWN_REQUESTED,
resulting in an endless vbd_close -> vbd_kick -> vbd_check_state ->
vbd_close loop.
The vbd_check_state call was added to unblock canonical (synchronous)
I/O mode in cset 45c15fdaed55 (Separate tapdisk raw I/O into different
backends).
We only need the queue dispatch during vbd_kick. Fixed by breaking a
vbd_check_queue_state out of vbd_check_state.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Keir Fraser [Fri, 29 Jan 2010 08:55:27 +0000 (08:55 +0000)]
CA-36385: Prefer AIO eventfd support on kernels >= 2.6.22
Mainline kernel support for eventfd(2) in linux aio was added between
2.6.21 and 2.6.22. Libaio after 0.3.107 has the header file, but
presently few systems support it. Neither do we rely on an up-to-date
libc6.
Instead, this patch adds a header which defines a custom iocb_common
struct, and works around a potentially missing sys/eventfd.h.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Keir Fraser [Fri, 29 Jan 2010 08:54:51 +0000 (08:54 +0000)]
CA-36385: Separate tapdisk raw I/O into different backends.
Hide tapdisk support for different raw I/O interfaces behind a new
struct tio. Libaio continues to dominate the interface, requiring
everyone to dispatch iocb/ioevent structs.
Misc:
- Fixes a bug in tapdisk-vbd which locks up the sync io mode.
- Wants a PERROR macro in blktaplib.h
- Removes dead code in qcow2raw to make it link again.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Signed-off-by: Jake Wires <jake.wires@citrix.com>
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Integrate tapdisk-syslog with the tlog_error path.
Currently removes the caller line filter, based on the fact that we
are only reporting terminal vreq failures anyway.
[The alternative would have been to keep filtering and flush the log
only once per loop iteration.]
With an overall request timeout of 2 minutes per request, there should
presently be no need to filter.
This may well change in the future, at which point bursts of errors
could lead to message loss. We anticipate this by logging to both
syslog and the logfile, which remains reliable.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Add non-blocking syslog client.
Limited talk to syslog on the datapath. Integrates with the event loop
and directly talks to /dev/log. EAGAIN redirects messages into a
fixed-size ring buffer.
The main result is that datapath errors get reported immediately,
instead of being deferred until the next control path intervention.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Support recursive event loop iterations.
Split scheduler_run_events() into two phases:
1. scheduler_check_events() processes the results from select().
2. scheduler_run_events() dispatches results collected during the prior check.
Given that fd notifications are typically level-triggered, this does
slightly more than absolutely necessary (we might just as well
re-select the event sets instead).
But the given approach should generate a little less overhead and the
split above would also be the way to go when integrating tapdisks with
foreign event loops.
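The two-phase split can be sketched as below. The function names echo the commit message, but the structs, the fixed-size event table and the read-only fd_set handling are simplifying assumptions, not the tapdisk scheduler.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

#define MAX_EVENTS 8

typedef void (*event_cb)(int fd, void *arg);

/* Hypothetical event record. */
struct sched_event {
    int      fd;
    int      pending;   /* set by the check phase, cleared by the run phase */
    event_cb cb;
    void    *arg;
};

struct sched {
    struct sched_event ev[MAX_EVENTS];
    int                n;
};

/* Phase 1: record which events select() reported as readable. */
static void sched_check_events(struct sched *s, fd_set *readfds)
{
    int i;

    assert(s != NULL && readfds != NULL);
    for (i = 0; i < s->n; i++)
        if (FD_ISSET(s->ev[i].fd, readfds))
            s->ev[i].pending = 1;
}

/* Phase 2: dispatch what phase 1 collected. Returns the callbacks run. */
static int sched_run_events(struct sched *s)
{
    int i, ran = 0;

    assert(s != NULL);
    for (i = 0; i < s->n; i++) {
        struct sched_event *e = &s->ev[i];

        if (!e->pending)
            continue;
        e->pending = 0;   /* clear first, so a nested run skips this event */
        e->cb(e->fd, e->arg);
        ran++;
    }
    return ran;
}

/* Example callback: counts invocations through `arg`. */
static void count_cb(int fd, void *arg)
{
    (void)fd;
    ++*(int *)arg;
}
```

Since the pending flag is cleared before the callback runs, a callback that re-enters sched_run_events() will not dispatch the same event a second time.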
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Fix a scheduler glitch.
A max_fd of 0 would indicate that we're polling stdin in the unlikely
case where no events whatsoever are to be polled. We never poll empty
fd sets, so this should have no practical effect.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Eliminate the restart flag.
Working towards a reentrant scheduler_run_events(). The restart flag
is meant to recover from event struct removals. Instead, do not delete
events during scheduler_event_callback(). Mark them as dead instead,
and collect them on the return path.
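The mark-then-collect approach can be sketched with a singly linked list. All names are hypothetical; the point is that unregistering never unlinks anything while a callback may still be walking the list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical event node. */
struct ev_node {
    struct ev_node *next;
    int             id;
    int             dead;   /* marked for removal, not yet freed */
};

/* Push a new event onto the list. Returns 0, or -1 if out of memory. */
static int ev_register(struct ev_node **head, int id)
{
    struct ev_node *e;

    assert(head != NULL);
    e = calloc(1, sizeof(*e));
    if (e == NULL)
        return -1;
    e->id = id;
    e->next = *head;
    *head = e;
    return 0;
}

/* Safe to call from inside a callback: nothing is unlinked here. */
static void ev_unregister(struct ev_node *head, int id)
{
    struct ev_node *e;

    for (e = head; e != NULL; e = e->next)
        if (e->id == id)
            e->dead = 1;
}

/* Run on the return path, once no callback is walking the list. */
static int ev_collect(struct ev_node **head)
{
    struct ev_node **pp = head;
    int freed = 0;

    assert(head != NULL);
    while (*pp != NULL) {
        struct ev_node *e = *pp;

        if (e->dead) {
            *pp = e->next;   /* unlink, then free */
            free(e);
            freed++;
        } else {
            pp = &e->next;
        }
    }
    return freed;
}
```

The dispatch loop would skip nodes with dead set and call ev_collect() once after the outermost iteration returns, so no restart flag is needed.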