Ian Jackson [Fri, 1 Jul 2016 17:30:21 +0000 (18:30 +0100)]
Executive: Allow out-of-order manipulations of flights intended play
Flights being operated on by a developer hacking about with the code,
which were created with intended blessing `play', are usually blessed
`running' or `broken' or something. So the safety catch bypass needs
to look at the intended blessing too.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
The order of results reported by sg-report-flight determines which
testid the bisector will try to work on first. It also determines the
order in which failures are shown in the email reports. We currently
sort them by the duration estimate (for each failure's containing job).
We should prefer earlier steps. So change the first sort key to be
the duration estimate only for the steps leading up to the step of
interest for each failure. (By passing the testid to the duration
estimator.)
Since the granularity is in seconds, this may still not distinguish
when there are fast steps. So as a secondary sort criterion, use the
stepno.
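
A minimal sketch of the two-level sort described above, in Python rather
than the project's own Perl. The Failure record, its field names and the
sample testids are hypothetical; only the ordering rule (duration
estimate up to the step of interest first, stepno as tie-breaker) comes
from the text.

```python
from typing import NamedTuple


class Failure(NamedTuple):
    testid: str
    stepno: int
    duration_to_step: int  # estimate in whole seconds, up to the failing step


def report_order(failures):
    """Earlier failures first; stepno breaks ties between equal estimates."""
    return sorted(failures, key=lambda f: (f.duration_to_step, f.stepno))


failures = [
    Failure("xen-boot", 7, 120),
    Failure("host-install", 3, 120),
    Failure("guest-start", 12, 900),
]
ordered = [f.testid for f in report_order(failures)]
print(ordered)
```

The two 120-second entries are indistinguishable at one-second
granularity, so the lower stepno decides which is reported first.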
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
squash! sg-report-flight: Report earlier, earlier step failures
Ian Jackson [Wed, 10 Aug 2016 16:10:39 +0000 (17:10 +0100)]
duration_estimator: Be able to estimate job duration up to a particular step
If this is passed, we are interested only in the duration up to and
including the specified test step. (If the specified test step is not
present or didn't have a recorded finish, we look at the whole job.)
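
The fallback rule can be sketched as follows, in Python for illustration
only. The function name, argument names and the (testid, finished) step
representation are assumptions, not the duration estimator's real
interface.

```python
def duration_up_to(steps, job_started, job_finished, testid=None):
    """Return seconds from job start to the end of `testid`.

    steps is a list of (testid, finished) pairs in execution order, with
    times in epoch seconds; finished is None if no finish was recorded.
    Falls back to the whole job if testid is not given, is not present,
    or has no recorded finish.
    """
    if testid is not None:
        for step_testid, finished in steps:
            if step_testid == testid and finished is not None:
                return finished - job_started
    return job_finished - job_started


steps = [("build", 100), ("install", 400), ("boot", None)]
print(duration_up_to(steps, 0, 1000, "install"))
print(duration_up_to(steps, 0, 1000, "boot"))
```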
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 3 Aug 2016 14:49:34 +0000 (15:49 +0100)]
ts-xen-build-prep: Use .gitconfig so _everything_ uses git cache
In particular, when xen.git clones a subtree, whose url we didn't
specify in the runvars, we end up using the url from xen.git's
Config.mk.
Arrange to use the git cache for all git urls, via the insteadOf
feature.
Note that the git config url insteadOf feature is backwards: one
sets the config variable "url.NEW-URL.insteadOf" to OLD-URL. So
the key is the value, and the value is the key.
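
To illustrate the direction of the mapping, here is a small Python
sketch that renders a .gitconfig stanza. The helper name and the cache
prefix (gitcache.example:9419) are hypothetical and are not taken from
ts-xen-build-prep; only the [url "..."] / insteadOf layout is git's own.

```python
def insteadof_stanza(new_url_prefix: str, old_url_prefix: str) -> str:
    """Render a stanza making git rewrite old_url_prefix to new_url_prefix.

    The NEW prefix goes in the section name and the OLD prefix is the
    value. No escaping is attempted, so this assumes neither prefix
    contains a double quote or a backslash.
    """
    return f'[url "{new_url_prefix}"]\n\tinsteadOf = {old_url_prefix}\n'


# Hypothetical cache prefix, for illustration only.
stanza = insteadof_stanza("git://gitcache.example:9419/git://", "git://")
print(stanza, end="")
```

With such a stanza in place, any url starting with the old prefix is
rewritten, including ones that were never listed in the runvars.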
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Mon, 25 Jul 2016 10:19:18 +0000 (11:19 +0100)]
cr-ensure-disk-space: Correct stdout output
d221996eea64 "cr-ensure-disk-space: Run check_space before taking
lock" introduced an additional call to check_space but check_space
prints the start of a message (with no newline) expecting
iteration_proceed to print the rest.
Move $|=1 up appropriately and add a couple of messages in the right
place. This involves calling quit_ok rather than exit 0.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 25 Feb 2016 12:31:09 +0000 (12:31 +0000)]
mg-list-all-branches: Do not match ${BRANCHES+= ... }
This is not valid shell syntax and should not appear. The confusion
seems to have arisen because of the need to match BRANCHES+=...
(without the surrounding { }).
This results in no change to the output. (I seem to have collected
this patch some time ago as part of some fixes to mg-list-all-branches
which have by now been applied, but not managed to write up and post
this specific change.)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Tue, 19 Jul 2016 16:25:48 +0000 (17:25 +0100)]
cr-ensure-disk-space: Run check_space before taking lock
This allows cr-ensure-disk-space to be a noop if there is enough
space, even if run on a host which doesn't have access to the relevant
lock directory.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Tue, 19 Jul 2016 14:06:58 +0000 (15:06 +0100)]
mfi-common: Do not set di_version runvar to empty string in build jobs
2601498df77c "mfi-common: Do not set di_version runvar to empty string"
fixed the test jobs but not the build jobs, because the setting of
hostos_runvars was (it seems) cloned-and-hacked, and it fixed only one
instance.
Now that we have set_hostos_runvars, use it in create_build_jobs too.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 14:59:30 +0000 (15:59 +0100)]
Bisection truncation: Stop a bisection job after the step of interest
Set the `truncate_testid' runvar when we create a bisection flight.
Thus, the bisection will stop when it has collected the data point we
wanted. This is especially useful if the failing step is early in a
long job: passes do not have to wait for the whole rest of the job to
run.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 14:30:56 +0000 (15:30 +0100)]
Job truncation: sg-run-job support truncation by setting global `truncate'
Arrange for a global variable `truncate' to be honoured. It is
initialised to 0. If it becomes 1 then:
* spawn-ts does not spawn jobs any more (reap-ts reaps these non-jobs
immediately), unless they are marked with ! in their iffail
* per-host-ts does not try to spawn anything any more, likewise
(strictly, we could leave checking truncate to spawn-ts, but this
way is clearer).
* These not-spawned jobs count as successful when reaped, unlike
jobs not spawned due to the presence of `abort'.
* At the end of the job, if things otherwise went OK, we set the
status to `truncated' rather than `pass'.
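
The rules in the list above can be sketched like this. This is Python
for illustration, not the Tcl in sg-run-job; the function names are
hypothetical, and `fail' as the non-OK status is a simplification.

```python
def should_spawn(truncate: bool, iffail: str) -> bool:
    """Once truncating, only steps whose iffail starts with '!' still run."""
    return (not truncate) or iffail.startswith("!")


def final_status(all_ok: bool, truncate: bool) -> str:
    """A job that was otherwise OK ends as 'truncated' instead of 'pass'."""
    if not all_ok:
        return "fail"  # simplification: other non-OK statuses are not modelled
    return "truncated" if truncate else "pass"


print(should_spawn(True, "!broken"), should_spawn(True, "fail"))
print(final_status(True, True))
```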
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 15:27:50 +0000 (16:27 +0100)]
sg-run-job: Change spawn-ts internal representation of reap handles
Previously, spawn-ts would pass reap-ts (via its caller) either a
filehandle, or an empty string meaning `when this is reaped, count it
as failed'.
We are going to want to represent `when this is reaped, count it as
successful', too. So change the representation to a variadic list,
with an enum type field at the front.
NB: oddly, reap-ts returns 1 for success.
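
A rough Python sketch of a tagged reap handle of this kind follows. The
tag names ("fh", "ok", "fail") and the wait_for callback are
hypothetical; the actual enum values used by spawn-ts are not given
here.

```python
def reap(handle, wait_for=lambda fh: True):
    """Return 1 for success and 0 for failure, as reap-ts does.

    handle is a sequence whose first element is a tag:
      ("fh", filehandle)  a real step; wait_for(filehandle) says if it passed
      ("ok",)             nothing was spawned; count it as successful
      ("fail",)           nothing was spawned; count it as failed
    """
    kind, *rest = handle
    if kind == "fh":
        return 1 if wait_for(rest[0]) else 0
    if kind == "ok":
        return 1
    if kind == "fail":
        return 0
    raise ValueError(f"unknown reap handle type: {kind!r}")


print(reap(("ok",)), reap(("fail",)), reap(("fh", object())))
```

Putting the type at the front leaves room for further handle types
without overloading the empty string.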
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 14:20:00 +0000 (15:20 +0100)]
sg-run-job: Break out iffail-check
Both spawn-ts and per-host-ts do some processing of `iffail' values:
* Strip any leading !, which means "run this even if the job
is being stopped due to error";
* Turn `.' into `fail'.
The first of these is currently only done by per-host-ts, which checks
ok. We are going to want to do something more sophisticated when
truncating flights. So we introduce a new helper.
For now spawn-ts passes 1 for okexpr so its iffail-check always
returns 1 so it doesn't check the return value.
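
A sketch of such a helper in Python, for illustration. The exact return
convention of iffail-check is not spelled out here, so the (proceed,
iffail) pair and the rule "proceed if ok, or if marked with !" are
assumptions.

```python
def iffail_check(iffail: str, ok: bool):
    """Normalise an iffail value and decide whether the step should run.

    Returns (proceed, iffail) where a leading '!' has been stripped and
    '.' has been turned into 'fail'.
    """
    run_anyway = iffail.startswith("!")
    if run_anyway:
        iffail = iffail[1:]
    if iffail == ".":
        iffail = "fail"
    return (ok or run_anyway), iffail


print(iffail_check("!.", ok=False))
print(iffail_check("broken", ok=False))
```

With ok always true, as in the spawn-ts case described above, the first
element is always true and can be ignored.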
No functional change yet.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 14:08:20 +0000 (15:08 +0100)]
Job truncation: Tolerate `truncated' job status
We are going to introduce a new `truncated' job status, which means
that the job went OK until sg-run-job decided not to continue with it
because it had done all that was requested.
(This will be used for bisection, to stop a bisection job after the
step of interest.)
Its properties are:
* In summary HTML `truncated' shows up as a green job status, like `pass'.
* The duration estimator _does_ look at truncated jobs. (Note that
it only looks on the specific branch, so only when organising a
bisection will it look at bisections.)
* Consequently the host allocator for bisections will expect the
duration to be that of the last flight where this job passed,
failed or was truncated, which is correct.
* When the host allocator is choosing a host for non-bisections it
won't consider these truncated jobs because they ought not to
appear in main branch flights. If they do they count more as
fails than passes (that is, they do make the job sticky).
* sg-execute-flight expects that sg-run-job might set the job
status to `truncated' and then exit with status 0.
* sg-report-flight does not look for an interesting failing step
when the job is truncated (ie for this purpose it's like pass).
* sg-report-flight doesn't consider truncated jobs to indicate
trouble, and handles truncated properly in Subject line generation.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Once upon a time we tried to run one bisector for all the branches.
But that doesn't work because they would overwrite each others' mros,
making the bisector flap as main flights finish.
If sharing a bisector working tree is desirable, something more
complex will be needed.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 14 Jul 2016 15:40:00 +0000 (16:40 +0100)]
Bisection: Do not try to inhibit queue
In an effort to try to increase the chance that the next bisection
step will get the same host quickly, the cri-bisect uses
mg-queue-inhibit to inhibit all resource allocation for 5 minutes.
With the increasing size of the test facility and the increasing
number of bisector instances running, this is starting to become a
very crude hammer indeed.
And this is largely ineffective anyway as we try bisections every 15
minutes but only inhibit for 5 minutes.
Disable it, until we have a better answer.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 14 Jul 2016 15:26:30 +0000 (16:26 +0100)]
cr-try-bisect: Use WAITSTART of when we started bisecting this testid
Otherwise bisection jobs get queued up very late.
The intent is that once we have a regression, we /start/ bisecting it
roughly FCFS along with other flights, but then it gets priority until
the bisection is done.
The next bisection in the same branch will then have to wait again
to start.
We implement this by keeping a stamp file, whose timestamp shows when
we started bisecting this testid and this step.
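
The stamp file idea can be sketched as below, in Python for illustration
only. The function name and file name are hypothetical; cr-try-bisect's
actual stamp path and shell logic are not shown here.

```python
import os
import tempfile


def bisection_waitstart(stamp_path: str) -> int:
    """Create the stamp file if it is missing and return its mtime.

    The first call for a given testid and step fixes the start time;
    later calls reuse it, so the bisection keeps its queue position.
    """
    try:
        fd = os.open(stamp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        pass  # already bisecting this testid: keep the old timestamp
    else:
        os.close(fd)
    return int(os.stat(stamp_path).st_mtime)


with tempfile.TemporaryDirectory() as d:
    stamp = os.path.join(d, "bisect.stamp")
    first = bisection_waitstart(stamp)
    os.utime(stamp, (1000, 1000))  # pretend the bisection started long ago
    again = bisection_waitstart(stamp)
```

Removing the stamp file when the bisection finishes would make the next
bisection start from a fresh timestamp.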
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 15 Jul 2016 18:55:11 +0000 (18:55 +0000)]
ms-flights-summary: Invent a `prep.alloc.' pseudo job state
This allows us to separate out `preparing' jobs into ones which are in
our data plan and ones which are not. The ones which are not may not
have quite started to run ts-hosts-allocate, or may still be in the
planning queue and not made it into the projection.
In either case we don't have an estimated finish time for them.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 1 Jul 2016 11:28:18 +0000 (12:28 +0100)]
tcl daemons: transaction: Support db autoreconnect
Provide an `autoreconnect' argument which will automatically reconnect
to the db if the connection has been lost. It will make only one
reconnection attempt.
No functional change yet because no call sites have been changed.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 1 Jul 2016 11:23:12 +0000 (12:23 +0100)]
tcl daemons: transaction: Only try ROLLBACK when necessary
In the deadlock case, we need to ROLLBACK. In other error cases we
are going to close the connection. And in those other cases the
ROLLBACK might fail, causing our error recovery to go wrong.
So do ROLLBACK only on the single path where we might continue to use
the connection.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 7 Jul 2016 12:14:26 +0000 (13:14 +0100)]
tcl daemons: Break out db-ensure-open and db-ensure-closed
To be able to deliberately reconnect to the database, in case of
error, we need functions which actually work with dbh, rather than
simply the refcount.
No functional change as yet.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 1 Jul 2016 11:17:37 +0000 (12:17 +0100)]
tcl daemons: jobdb::transaction: Improve two message generation sites
* Use logputs rather than puts to report transaction deadlock retry
* Use $ei and $ec rather than $errorInfo and $errorCode when calling
error due to too many deadlock retries. This has no functional change
but is less fragile in case of future addition of new calls to catch
between the main catch and this throw.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 7 Jan 2016 19:30:42 +0000 (19:30 +0000)]
ms-ownerdaemon: Cope with db restart. Retry recording dead tasks.
In chan-destroy-stuff, instead of accessing the db directly, add the
dead task(s) to a queue, and arrange to look at that queue.
Errors are handled by setting an `after' handler which we cancel if we
are successful.
The after handler requeues a queue run attempt as the first thing
(which will arrange that a further retry will occur if things are
still broken) and then attempts to reconnect to the database.
I have tested this with a test instance by renaming the `tasks' table
under its feet, and it functions as expected.
DEPLOYMENT NOTE: The owner daemon cannot be restarted without shutting
everything down. So this update should first be deployed in
Cambridge, probably, to see how it goes. Also, it is less critical in
the main Xen production test lab because there the db and the owner
daemon are co-hosted on the same VM.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Put back the `unset tasks' which was mistakenly removed. The
effect of its lack is to fail to clear out the task list for
previous uses of the channel (which is named after the fd); this
is mostly harmless apart from log spam but causes the usual
case to be something like
OK created-task 456354 ownd [10.80.227.94]:44852-876
rather than
OK created-task 456354 ownd [10.80.227.94]:44852-876
which some of the clients (rightly) don't expect.
Ian Jackson [Thu, 7 Jan 2016 18:47:03 +0000 (18:47 +0000)]
Database locking: Tcl: Retry only on DEADLOCK DETECTED
Use the new errorCode coming out of db-execute* to tell when the error
is that we got a database deadlock, which is the situation in which we
should retry.
This involves combining the two catch blocks, so that there is only
one error handling strategy. Previously errors on COMMIT would be
retried and others would not. Now errors anywhere might be retried
but only if the DB indicated deadlock.
We now unconditionally execute ROLLBACK. This is more correct, since
we always previously executed BEGIN.
And, we pass the errorInfo and errorCode from the $body to the caller.
I have tested this with a test db instance, using contrived means to
generate a database deadlock, and it does actually retry.
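
The single error handling strategy can be sketched as follows. This is
Python rather than the daemons' Tcl, and DeadlockDetected, the execute
callback and max_retries are hypothetical stand-ins for the errorCode
check and the real retry limit.

```python
class DeadlockDetected(Exception):
    """Stands in for an error whose errorCode indicates a database deadlock."""


def transaction(execute, body, max_retries=3):
    """Run body between BEGIN and COMMIT, retrying only on deadlock."""
    retries = 0
    while True:
        execute("BEGIN")
        try:
            result = body()
            execute("COMMIT")
            return result
        except Exception as e:
            execute("ROLLBACK")  # unconditional: BEGIN was always executed
            if not isinstance(e, DeadlockDetected) or retries >= max_retries:
                raise  # pass the original error on to the caller
            retries += 1


log = []
attempts = []


def body():
    attempts.append(1)
    if len(attempts) == 1:
        raise DeadlockDetected()  # contrived deadlock on the first attempt
    return "done"


result = transaction(log.append, body)
print(result, log)
```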
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 7 Jan 2016 18:22:53 +0000 (18:22 +0000)]
Database locking: Tcl: for errorCode, use pg_exec, not pg_execute
We would like to be able to retry db transactions. To do this we need
to know why they failed (if they did).
But pg_execute does not set errorCode. (This is clearly a bug.) And
since it immediately discards a failed statement, any error
information has been lost by the time pg_execute returns.
So, instead, use pg_exec, and manually mess about with fishing
suitable information out of a failed statement handle, and generating
an appropriate errorCode.
There are no current consumers of this errorCode: that will come in a
moment.
A wrinkle is that as a result it is no longer possible to use
db-execute on a SELECT statement nor db-execute-array on a non-SELECT
statement. This is because 1. the `ok' status that we have to
check for is different for statements which are commands and ones
which return tuples and 2. we need to fish a different return value out
of the statement handle (-cmdTuples vs -numTuples). But all uses in
the codebase are now fine for this distinction.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Put emsg at the start of errorInfo; things that print errors that
print errorInfo typically print _only_ errorInfo.
Ian Jackson [Thu, 14 Jul 2016 12:11:44 +0000 (13:11 +0100)]
Database locking: Tcl: Use db-execute-array for SELECT in sg-execute-flight
We are going to make it wrong to use db-execute for SELECT statements.
Convert the existing violation site, which uses db-execute, into
db-execute-array (providing a dummy arrayvar).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Dropped change to become-task which is actually wrong and should
be db-update-1, anyway. This will now be fixed in a separate patch.
Ian Jackson [Fri, 1 Jul 2016 18:40:41 +0000 (19:40 +0100)]
Tcl: Use tclsh8.5
I have checked that tclsh8.5 and TclX work on osstest.test-lab (and
also osstest.xs.citrite.net). TclX seems to be provided by tcl8.4 but
works with tcl8.5 (at least on wheezy and jessie).
Deployment note: hosts running Debian wheezy (including
osstest.xs.citrite.net, the Citrix Cambridge instance), will need
OSSTEST_DAEMON_TCLSH=tclsh8.4 in ~/.xen-osstest/settings.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 8 Jul 2016 10:48:39 +0000 (11:48 +0100)]
invoke-daemon: Honour OSSTEST_DAEMON_TCLSH
It appears that tcl8.5 in wheezy has a serious bug which makes `after
idle' not always work. tcl8.4 has been working well in wheezy but is
not in jessie, where tcl8.5 works (and tcl8.6 has a serious event loop
bug - Debian #826741).
So we need to use different versions of Tcl on different hosts.
Allow this to be specified in ~/.xen-osstest/settings.
This affects only:
- invoke-daemon (which is normally run from inittab)
- mg-schema-test-database
sg-run-job and sg-execute-flight are not affected. They do not
currently use `after idle' so that is OK for now.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 8 Jul 2016 10:57:32 +0000 (11:57 +0100)]
mg-schema-test-database: Change default minflight to -100
It is tiresome to try to create a test db for playing with and have to
wait for a big copy. Better to create a small one by default; if the
user has forgotten to specify a minflight, they can always drop it and
run it again.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 7 Jul 2016 15:58:04 +0000 (16:58 +0100)]
mg-allocate: Do not treat already-allocated resources as satisfactory
This was always rather odd for ./mg-allocate HOSTNAME but makes the
more sophisticated uses like ./mg-allocate '{FLAG,FLAG,...}' very much
less useful.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 7 Jul 2016 11:18:33 +0000 (12:18 +0100)]
mg-allocate: Fix "issteallable" call
81cac5a1656e "mg-allocate: Support --steal" introduced an erroneous
call to the subref $issteallable, using { } instead of ( ), producing
this error:
Not a HASH reference at ./mg-allocate line 225.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Mon, 27 Jun 2016 15:49:52 +0000 (16:49 +0100)]
cr-daily-branch: libvirt: use frozen version on stable branches
libvirt master might increase its LIBXL_API_VERSION. When this feeds
through osstest it can cause the push gates of Xen stable branches to
break.
So for stable Xen branches do not track libvirt upstream. Instead,
use a frozen revision. (Only for main push gate tests of stable Xen
branches.)
The frozen branch is never going to be updated so it is not suitable
for other kinds of uses. In particular it won't get security fixes.
So we call the refs osstest/frozen/xen-K.L-testing to discourage
users from using them.
Deployment note: The Xen release checklist needs a new item "add this
frozen libvirt branch".
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Mon, 27 Jun 2016 11:25:14 +0000 (12:25 +0100)]
ts-xen-build-prep: Do not install Ocaml on squeeze or wheezy
squeeze doesn't (didn't) have it at all. wheezy doesn't have ocamlopt
on armhf, and the Xen build system (in the old branches where this is
relevant) seems not to be able to test this.
In any case we use these old Debian suites when testing old Xen
branches, which were (when they were current) built without ocaml.
This partially reverts "ts-xen-build-prep: Install Ocaml" bbe1a9b2a6c0.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: David Scott <dave@recoil.org>
CC: Jan Beulich <JBeulich@suse.com>
Ian Jackson [Tue, 22 Mar 2016 19:40:53 +0000 (19:40 +0000)]
mg-hosts serial attach: Provide serial-attach command
This is like running sympathy -r or xenuse by hand, except that it
checks that you have the host allocated, and looks up in the database
what the right rune is.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Tue, 22 Mar 2016 16:57:49 +0000 (16:57 +0000)]
Executive: Provide findtask_spec
This will allow code elsewhere to look up tasks other than the one
specified in OSSTEST_TASK. No callers of findtask_spec yet, so no
functional change.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 22 Apr 2016 15:25:04 +0000 (16:25 +0100)]
ts-xen-build-prep: Install Ocaml
This will result in the Xen build system building, and then
preferring, oxenstored.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Scott <dave@recoil.org>
Ian Jackson [Fri, 22 Apr 2016 14:46:30 +0000 (15:46 +0100)]
crontab: Drop linux-mingo-tip-master linux-next linux-linus
It appears that no-one is looking at the output. These have not had a
push to the tested output branch for at least 250 days (742 days in
the case of linux-linus!) and the reports don't seem to be generating
any bugfixing activity.
There is a plan to do some Xen testing in Zero-day but even if that
doesn't lead to anything we would still be just where we are now.
So drop these to save our test bandwidth for more useful work.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Roger Pau Monne <roger.pau@citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: David Vrabel <david.vrabel@citrix.com>
CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Anshul Makkar <anshul.makkar@citrix.com>