Daniel Stodden [Fri, 16 Apr 2010 02:06:04 +0000 (19:06 -0700)]
CA-40171: Validate vhd parent chain during LV reactivation.
Implementation of tapdisk_vbd_reactivate_volumes follows the vhd
parent chain, but therein lacks a critical check matching child and
parent uuids.
This creates a race window wherein reactivation hits an lv resize on
the master. The new last sector, while unrewritten, may carry garbage
footers. Results may vary, from plain reactivation failures to chain
traversal running off into the weeds.
Fixed with a proper uuid check. Adds some eprintfs to aid debugging
the corner cases.
Daniel Stodden [Mon, 5 Apr 2010 19:35:18 +0000 (12:35 -0700)]
CA-29373: Unwedge queue after new request failure.
A tapdisk_vbd_issue_request failure is indication to back of from
further queue processing. The call however will also return error
status of a synchronous request failure.
When failing a new request immediately, we stop making progress. We
break out of the loop, subsequent incoming requests stay on the
new_requests lists, failed_requests is empty, so we block
indefinitely.
Patch decouples queue status from request status, we only back off if
the queue test fails, not the request. Otoh, this means we won't back
off after the first EBUSY encountered. One may argue the decision
about what's busy and not is better made by the driver, not the VBD.
Daniel Stodden [Mon, 5 Apr 2010 19:35:16 +0000 (12:35 -0700)]
CA-29373: Stop retrying the pathetic case.
Separating the retryable from the recoverable errors should make the
control path more responsive to broken SRs or images. Fail vreqs with
irrecoverable errors immediately. Presently known ones comprise ESTALE
and ENOSPC. Both don't have a great prospect to improve within the
next two minutes.
Note that this somewhat obsoletes ENOSPC, as our original reason for
adding a tapdisk-level forced-shutdown. However, there may still be
reasons to discard retryable requests.
Daniel Stodden [Mon, 5 Apr 2010 19:35:16 +0000 (12:35 -0700)]
CP-1613: Fix compiler warnings
The vhd_cache_init/enabled calls were redeclared inline after their
original declaration, yielding an ugly warning. These are module
members, not header macros. Safe to leave the inlining to the
compiler.
Daniel Stodden [Mon, 5 Apr 2010 19:35:15 +0000 (12:35 -0700)]
CA-39535: Break call chain recursion during force-shutdown.
Pending requests during shutdown flag TD_VBD_SHUTDOWN_REQUESTED,
resulting in an endless vbd_close -> vbd_kick -> vbd_check_state ->
vbd_close loop.
The vbd_check_state call was added to unblock canonical (synchronous)
I/O mode in cset 45c15fdaed55 (Separate tapdisk raw I/O into different
backends).
We only need the queue dispatch during vbd_kick. Fixed by breaking a
vbd_check_queue_state out of vbd_check_state.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Keir Fraser [Fri, 29 Jan 2010 08:55:27 +0000 (08:55 +0000)]
CA-36385: Prefer AIO eventfd support on kernels >= 2.6.22
Mainline kernel support for eventfd(2) in linux aio was added between
2.6.21 and 2.6.22. Libaio after 0.3.107 has the header file, but
presently few systems support it. Neither do we rely on an up-to-date
libc6.
Instead, this patch adds a header which defines custom iocb_common
struct, and works around a potentially missing sys/eventfd.h.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Keir Fraser [Fri, 29 Jan 2010 08:54:51 +0000 (08:54 +0000)]
CA-36385: Separate tapdisk raw I/O into different backends.
Hide tapdisk support for different raw I/O interfaces behind a new
struct tio. Libaio remains to dominate the interface, requiring
everyone to dispatch iocb/ioevent structs.
Misc:
- Fixes a bug in tapdisk-vbd which locks up the sync io mode.
- Wants a PERROR macro in blktaplib.h
- Removes dead code in qcow2raw to make it link again.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> Signed-off-by: Jake Wires <jake.wires@citrix.com>
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Integrate tapdisk-syslog with the tlog_error path.
Currently removes the caller line filter, based on the fact that we
are only reporting terminal vreq failures anyway.
[The alternative would have been to keep filtering and flush the log
only once per loop iteration.]
With an overall request timeout of 2 minutes per request, there should
presently be no need to filter.
This may likely change in future, so bursts of errors would lead to
message loss. We anticipate this by logging to both syslog and the
logfile, which is then reliable.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Add non-blocking syslog client.
Limited talk to syslog on the datapath. Integrates with the event loop
and directly talks to /dev/log. EAGAIN redirects messages into a
fixed-size ring buffer.
The main result is that datapath errors get reported immediately,
instead of being deferred until the next control path intervention.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Support recursive event loop iterations.
Split scheduler_run_events() into two phases:
1. scheduler_check_events() processes the results from select().
2. scheduler_run_events() dispatches results collected during the prior check.
Given that fd notification are typically level-triggered, this does
slightly more than absolutely necessary (likely might just as well
re-select the event sets instead).
But the given approach should generate a little less overhead and the
split above would also be the way to go when integrating tapdisks with
foreign event loops.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Fix a scheduler glitch.
Because max_fd = 0 would indicate we're polling stdin in the unlikely
case where no events whatsoever are to be polled. We never poll empty
fd sets, so this should yield no practical effect.
Daniel Stodden [Sat, 21 Nov 2009 02:54:31 +0000 (18:54 -0800)]
CA-34846: Eliminiate the restart flag.
Working towards a reentrant scheduler_run_events(). The restart flag
is meant to recover from event struct removals. Instead, do not delete
events during scheduler_event_callback(). Mark them as dead instead,
and collect them on the return path.
Andrei Lifchits [Wed, 11 Nov 2009 19:06:42 +0000 (11:06 -0800)]
CA-27617: improve VHD sequential write performance by not skipping bitmap regions when overwriting fully allocated blocks
(i.e., introduce redundant writes to make the pattern on the underlying block device sequential)
Andrei Lifchits [Wed, 11 Nov 2009 19:03:35 +0000 (11:03 -0800)]
CA-27732: improve (buffered) read performance on long chains by not splitting vreqs into multiple treqs.
This reduces the overhead by having fewer treqs percolate through the VHD chain.
Daniel Stodden [Fri, 6 Nov 2009 19:16:41 +0000 (11:16 -0800)]
CA-34208: Quiesce watches for dead/broken channels.
Notably pause watches otherwise tend to generate spurious errors, if
the channel failed early.
Also fixes a potential bug driving RECYCLED -> PAUSED/UNPAUSED if we
get a spurious pause watch at the wrong moment. Didn't catch this one
when recycled was added.
Daniel Stodden [Tue, 13 Oct 2009 22:07:50 +0000 (15:07 -0700)]
CA-33664: Close unplug/plug race window since CA-32254.
Race window in tapdisk_daemon_probe_vb if VBD recreation is observered
before we managed to close the previous channel.
We don't permit duplicate channels on the same XS keys. Instead add
TAPDISK_VBD_RECYCLED, which indicates a channel whose vbd state went
one step beyond DEAD, pending a synchronous re-probe.
Trying to avoid spurious probe failures, we cannot just rerun probe()
after destruction. Instead, perform a test/write transaction cycle.
We abort in the unlikely case where we destroy the channel after the
backend path was already gone again.
The related case of observing a remove event in RECYCLED state would
drive us back to DEAD, which should work okay.
Daniel Stodden [Thu, 8 Oct 2009 05:47:56 +0000 (22:47 -0700)]
CA-32254: Rewrite channel/vbd state machine.
* Refine channel states:
- Replaces the former IDLE state with a more detailed representation:
- DEAD (just after spawning the channel)
- LAUNCHED (just after spawning the channel)
- PID (following a pid response)
- RUNNING (following an open resume response)
- PAUSED (following a pause response)
- Former channel->open replaced with TAPDISK_CHANNEL_IPC_OPEN(),
based on the above channel states.
- Former channel->connected removed. It was used as an indicator
for pause request legitimacy, but process state is not a criteria
anymore.
* Add 'vbd state', reflecting agent control state:
- UNPAUSED (may be RUNNING)
- PAUSING (following a pause request)
- PAUSED (finished PAUSING)
- BROKEN (following a fatal error)
- DEAD (following channel->path removal)
* Add 'shutdown state', reflecting kernel control state:
- UP (tapdisk shall live)
- DOWN (tapdisk shall die)
* Reimplement pause/unpause on top of that:
- mapping vbd and shutdown states to a channel state to be driven.
- tapdisk-daemon drives vbd DEAD state on path removal
- allow start-tapdisk to resurrect channels: CLOSED/DEAD -> LAUNCHED
- the point is that this maintains vbd paused state across tapdisk exits.
Daniel Stodden [Sat, 11 Jul 2009 01:22:19 +0000 (18:22 -0700)]
CA-30816: Fix shutdown phase on device reset.
Cset 397:272912c3d0ec (CA-26523) broke the tapdisk shutdown phase run
on a reset of the backend's XenStore root node. I missed the fact that
this is the point where the shutdown message in channel_close() is
essential.
This patch keeps the channel_reap() operation to close the startup
failure loophole, but doesn't defer channel_close(). We now free the
channel only - but unconditionally - after tapdisk exit.
Removed the comment on safe errors since it doesn't apply any more. At
the expense of splitting the original close code.
Daniel Stodden [Wed, 8 Jul 2009 02:39:30 +0000 (19:39 -0700)]
CA-27472: Log past events to a ring buffer
Adds debug code to the main event loop to trace unexpectedly high
event frequencies we sometimes seem to come across.
Comprises a 1K ring buffer and makes the select loop append fd sets
plus relative time into it. We cover up to 128 events. We only cover
the lower 32 fds to save space, but that's usually more than needed.
Andrei Lifchits [Thu, 7 May 2009 00:54:17 +0000 (17:54 -0700)]
CA-29010: do not enforce OPEN_STRICT (for exclusivity of RW mode) because we don't have a story for crash recoveries (this problem was uncovered after fixing CA-28285). [The previous cset (417:417711d49f8f) also refers to CA-29010, not CA-19010]
Daniel Stodden [Wed, 6 May 2009 18:08:53 +0000 (11:08 -0700)]
CA-19010: Make vhd-util repair work on block devices.
The repair facility opens a VHD, recovers metadata from the backup
footer, then writes the data back to the primary footer at
end-of-file.
We only had it implemented for file-based VDIs, the difference being
that our definition of 'end-of-file' is a slightly different one on
block devices.
This patch removes vhd-util's own assumptions about what the right
footer offset is and lets instead vhd_write_footer() do the math. The
price is a redundant backup footer re-write in the repair case and a
redundant ftruncate() issued by tapdisk's vhd_close() in the
non-faulty case.
None of those are on a critical path. So let's avoid to declare more
funny functions non-static.
Daniel Stodden [Fri, 10 Apr 2009 03:49:40 +0000 (20:49 -0700)]
CA-27472: Improve the timestamps kept on tlog errors.
Apart from the buffering, the tapdisk-log code collects log entries in
buckets. Every error condition is saved only once, later ocurrences
just increment a counter. Presently this means one only gets to see
the timestamp of the very first occurrence.
While this is often sufficient, a bit more information may sometimes
help to tell transient from recurrent issues. Dump first and last
occurrence per dispersion instead.
Saving tv structs also simplifies the string formatting code a little.
Daniel Stodden [Fri, 10 Apr 2009 03:49:40 +0000 (20:49 -0700)]
CA-27472: Limit dispersion tracking to timevals.
Get rid of the FPU stuff. We're only tracking time this way. The
floats lack the necessary precision to track absolute time, and fixed
integer arithmetic is nicer anyways.
Daniel Stodden [Fri, 10 Apr 2009 03:49:40 +0000 (20:49 -0700)]
CA-27472: Count/log incoming/outgoing VBD kicks in tapdisk.
The notification count helps debugging some interesting VBD
problems. One is that that dom0 plugs have a problem with batching
requests issued through the frontend device (xvd). Looks like there is
no batching at all.
We cannot just dump request timestamps, since the log would explode if
failures occur frequently. Instead, produce some minimum amount of
statistics.
For single, apparently spurious failures (e.g. CA-26804), this may
help making more informed guess on whether we are looking at image
driver issues or the current event and retry logic is to blame.
Daniel Stodden [Tue, 7 Apr 2009 23:46:19 +0000 (16:46 -0700)]
CA-27472: Add slightly more debug info to the request failure path.
Replace the hardcoded EIO we dump to the logs. We do have an actual
error to display. The interesting piece is actually vreq->error, response
status is rather boring.
Doing it like this should also prevent people from attributing
failures to *real* EIO, as would be the case with NFS or physical
device issues.
Not that vreq->error may be zero on some failure paths. Consider this
useful information as well; these case can be attribute path to
tapdisk-vbd itself, meaning the respective treqs may have never made
it down to the block driver.
Daniel Stodden [Mon, 6 Apr 2009 20:22:21 +0000 (13:22 -0700)]
CA-26523: Allow for alternate tapdisk programs spawned by blktapctrl.
Some experiments, such as profiling blktap, get a whole lot easier if
there's a simple way to wrap tapdisk processes in custom code. A
typical example would be running tapdisk processes through Valgrind.
This patch makes blktapctrl sensitive to an environment variable
TAPDISK, which can be set to point to a replacement program which
blktapctrl is then supposed to spawn instead. PATH search per execvp()
remains unaffected.
Daniel Stodden [Mon, 6 Apr 2009 20:22:21 +0000 (13:22 -0700)]
CA-26523: Adjust blktapctrl exit status to main loop result.
In summary, this remains as unused as the prior code: the catch is
that we presently never exit voluntarily.
A watchdog parent + termination on signals, fatal conditions, etc,
would change the landscape. Especially since exiting with running
tapdisks is presently irrecoverable.
So for now just clean stuff up a little to prevent people from having
to investigate where all those return values go.
Daniel Stodden [Mon, 6 Apr 2009 20:22:21 +0000 (13:22 -0700)]
CA-26523: Maintain child affiliation in blktapctrl.
Early fatal tapdisk failures are rare but presently go unnoticed. The
less improbable pathologic example case covered here is an AIO init
failure due to resource exhaustion.
At the very least, we would (a) start tapdisk1 synchronously and test
for errors or (b) defer post-daemonization to somewhere after PID_MSG.
Synchronous starts recently becoming more popular -- we used to
daemonize from the first instruction -- sheds light on a more general
issue: tapdisk1 shouldn't detach from blttapctrl at all.
Ergo:
- Eliminate daemon()ization from tapdisk1.
- Leaves detaching as a (hopefully unpopular) option, in case
something is ever found to disagree.
- Makes blktapctrl reap children.
- Assocites child exists with respective tapdisk-channels.
At this point, even doing the right thing (tm) and suspending guests
in face of crashed tapdisks becomes an option, but this takes an
additional patch or two on the agent.