Ian Campbell [Wed, 17 Feb 2016 16:13:11 +0000 (16:13 +0000)]
Increase priority of xen-unstable-coverity
Since we are limited in the number of these we can do per week (to 2),
we would like them to happen fairly promptly after the time given in
the crontab; otherwise we can end up with the Wednesday run not
actually happening until late Saturday, right before the Sunday run,
which might then happen right away.
Therefore specify OSSTEST_RESOURCE_PRIORITY=-15, which is right behind
xen-unstable-smoke in priority order.
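For illustration, an override of this kind is simply passed via the
environment when the daily run is invoked; the exact crontab wording is
not reproduced here, so treat this as a sketch:
  OSSTEST_RESOURCE_PRIORITY=-15 ./cr-daily-branch xen-unstable-coverity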
We don't have much data yet, but based on what we have so far
ts-coverity-build takes up to 1000s (around a quarter of an hour) and
ts-coverity-upload a little over half an hour. So, including host
install (if needed; it can use a share of an existing build host if
one is around), the whole thing comes in at well under an hour, so
having this slip to the head of the queue is unlikely to cause
problems.
Also put mg-allocate and mg-blockage in the correct order in the doc.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 17 Feb 2016 10:50:01 +0000 (10:50 +0000)]
mg-show-flight-runvars: avoid "SELECT .. AND TRUE" for sqlite
c5e29f93fb6e "mg-show-flight-runvars: recurse on buildjobs upon
request" broke standalone mode with:
Error: no such column: TRUE
from sqlite. Do as is done for $syntcond and use (1=1) instead.
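For illustration (the standalone database filename and runvars query
here are assumed, not taken from the change itself), the difference is
only in how the always-true clause is spelled:
  # SQLite of that era has no TRUE literal, so this fails:
  sqlite3 standalone.db 'SELECT * FROM runvars WHERE flight = 12345 AND TRUE;'
  #   Error: no such column: TRUE
  # the portable spelling, as already used for $syntcond:
  sqlite3 standalone.db 'SELECT * FROM runvars WHERE flight = 12345 AND (1=1);'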
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:57 +0000 (14:28 +0000)]
make-flight: Use older Debian for host and guest OS with older Xen
Sometimes when updating osstest to use a newer version of Debian as a
baseline we find that the new compiler or other tools pick up latent
errors in older code bases for which the fixes are invasive or
otherwise inappropriate for a stable branch.
This is the case with Debian Jessie and Xen 4.3 and earlier, so
restrict those branches to keep using Wheezy.
This only applies to xen-X.Y-testing and qemu-upstream-X.Y-testing
branches, since all other branches use xen-unstable as their Xen.
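A minimal sketch of the kind of restriction described, assuming
make-flight-style shell (the variable names and exact branch globs are
illustrative):
  case "$branch" in
  xen-4.[0-3]-testing|qemu-upstream-4.[0-3]-testing)
      # pre-Jessie toolchain compatibility: stay on wheezy
      hostsuite=wheezy
      guestsuite=wheezy
      ;;
  esac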
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:56 +0000 (14:28 +0000)]
mfi-common: usual_debianhvm_image: derive version from $guestsuite
This more likely matches the caller's intention.
Move the setting into production-config* alongside the Suite and
TftpDiVersion settings. Continue to support $DEBIAN_IMAGE_VERSION as an
override. The value for Wheezy is from what was replaced
in 610ea1628363 "Switch to Debian 8.0 (jessie) as OS for test hosts".
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:55 +0000 (14:28 +0000)]
Qualify TftpDiVersion with the suite.
This allows the version to differ e.g. between Wheezy and Jessie.
Update production-config* to set TftpDiVersion_jessie instead of just
TftpDiVersion; also add TftpDiVersion_wheezy using the version
replaced in commit f610ea162836 "Switch to Debian 8.0 (jessie) as OS
for test hosts".
In mfi-common we need to check for TftpDiVersion_$suite (_$guestsuite)
and TftpDiVersion manually, since getconfig in that context will not
see any DebianSuite override in the environment.
This ensures that when a non-default suite is configured, a
corresponding useful version of d-i is selected.
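A sketch of that manual lookup (getconfig is the helper named above;
the variable names are illustrative):
  # prefer the suite-qualified d-i version, else fall back to the plain one
  di_version=`getconfig TftpDiVersion_${suite}`
  if [ -z "$di_version" ]; then
      di_version=`getconfig TftpDiVersion`
  fi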
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:52 +0000 (14:28 +0000)]
ts-debian-di-install: Allow Di Version to come from runvars
and, following the lead of the suite, arrange for a version selected
from the defaults to be written back to the runvars.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
[ ijc -- missing s/diversion/di_version/ in ts-debian-di-install,
drop unnecessary \ wrapping from $di_path assignment ]
Ian Campbell [Mon, 18 Jan 2016 14:28:51 +0000 (14:28 +0000)]
ts-host-install: Support DiVersion coming from runvars
To do so, initialise $ho->{DiVersion} in selecthost and use it in
ts-host-install.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
[ ijc missing s/diversion/di_version in selecthost ]
Ian Campbell [Mon, 18 Jan 2016 14:28:50 +0000 (14:28 +0000)]
mfi-common: Always add debian_suite to debian_runvars
This adds an explicit debian_suite to some jobs which didn't already
have one, meaning that those jobs will remain the same when cloned for
a bisect and run in a tree where $c{DebianGuestSuite} has changed
since the original construction.
No expected semantic change.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:49 +0000 (14:28 +0000)]
mfi-common: always add host suite to hostos_runvars
This avoids situations where production-config* has changed
DebianSuite but the bisector is still picking up baselines etc from
before the change and reusing their runvars (without suite) with an
inconsistent config.
Switch selecthost() to use target_var when querying the suite. This
means it will check the "{ident}_suite" runvar first, as before, but
fall back to just looking at the "all_host_suite" runvar. We also
change the existing host_suite to all_host_suite in mfi-common so
that test_matrix_iterate() needn't worry about ident=host vs
=src_host/dst_host etc (of course this can still be overridden if
desired by using src_host_suite etc, but nowhere does).
Other uses of $c{DebianSuite} have been abolished already.
Note that "$suite != $defsuite" is not true for any current production
invocation of osstest. If this was ever true then we would have set
the host_suite runvar, whereas now we always set all_host_suite.
However any old flights with host_suite would still be interpretted
the same. Note also that the "$suite != $defsuite" case was previously
broken for the -pair tests since the host idents there are 'src_host'
and 'dst_host', so the previous code would have fallen back to
$c{DebianSuite} without looking at the host_suite runvar.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 14:28:46 +0000 (14:28 +0000)]
Debian: Abolish $suite and $xopts{Suite} from preseed_* interfaces.
Generating a preseed for a suite which does not match the ->{Suite} of
the underlying guest or host object does not seem useful, so remove
this option and use ->{Suite} instead.
For guests ->{Suite} is set by debian_guest_suite() (which is called
from preseed_guest_create(), although it is often also called prior to
that) and by selectguest().
For hosts $ho->{Suite} is initialised by selecthost if we are in the
context of a $job (and if we aren't we had best not be trying to
reinstall a host).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 5 Feb 2016 09:30:39 +0000 (09:30 +0000)]
Add a weekly coverity flight
This primarily consists of ts-coverity-{build,upload} and
make-coverity-flight which constructs the sole job.
The branch is named "xen-unstable-coverity", which matches the various
xen* patterns in the cr-* scripts. Places which needed special
treatment are handled by matching xen-*-coverity, which leaves the
possibility of xen-4.7-testing-coverity etc in the future. Note,
though, that care would be needed there, since Coverity's tracking of
new vs existing issues would likely be confused by uploading different
branches without differentiating them somehow (I don't know how this
is supposed to work).
The most recently scanned revision is pushed to a new
coverity-scanned/master branch in the usual xen.git; tests are run on
the master branch.
I initially thought that $c{CoverityEmail} would need to be an actual
account registered with scan; however a manual experiment using
email=security@xen.org was accepted by the service. An "analysis
complete" message was sent to security@ while individual results mails
were sent to each member of the coverity project who was configured to
receive them. I think this is what we want. The "analysis complete"
mail contained no sensitive data, but also no real information other
than "success" (or presumably "failure" if that were to be the case).
I think going to security@ is probably OK.
The upload URL defaults to a dummy local URL, which will fail (it
would be possible in principle to put a stunt CGI there though). When
run with "cr-daily-branch --real" (i.e. in full-on production mode)
this is instead set to the value of CoverityUploadUrl from the
config (production-config etc). This means that adhoc and play runs
still exercise all the code (but the curl will fail) while --real runs
upload to a site-configurable location. (Note that the URL includes
the coverity project name, which would likely differ for different
instances.)
I have run this via cr-daily-branch --real on the production infra
and it did upload as expected (flight 80516). Since
master==coverity-tested/master at this point it came out as a baseline
test which didn't attempt ap-push, which I would have expected to fail
anyway since it was running as my user in the colo which cannot push
to osstest@xenbits.
In my experiments the curl command took ~35 minutes to complete (rate
in the 100-200k range). Not sure if this is a problem, but use curl
--max-time, passing it an hour, to bound things. Note that curl (and
hence the timeout etc.) is run on the controller (via system_checked).
Note that the token must be supplied with </path/to/token and not
@/path/to/token. The latter appears to the server as a file upload
rather than a text field in the form, which doesn't work. In early
attempts I thought that the trailing \n in /path/to/token might be an
issue and hence wrote a big comment. However, having discovered < vs @,
I am no longer 100% sure that is the case, but I left the comment
anyway since I can observe on the wire that the \n is included in the
upload (but each test takes ~35 mins and there is a ratelimit on the
server side too).
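For reference, the upload rune ends up along these lines (the URL,
project name, tarball name and shell variables are placeholders; the
real URL comes from CoverityUploadUrl):
  curl --max-time 3600 \
       --form token='</path/to/token' \
       --form email='security@xen.org' \
       --form file='@cov-int.tar.xz' \
       --form version="$revision" \
       --form description="$flight: $treeurl $revision" \
       'https://scan.example.org/builds?project=SomeProject'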
A final niggle is that the description field in the web UI ends up as:
80516:\ git://xenbits.xen.org/xen.git\ 9937763265d9597e5f2439249b16d995842cdf0
(i.e. spaces are \-escaped). I've confirmed with curl --trace-ascii
that the uploaded data is not escaped (that was from an earlier attempt
which did not include the flight number).
Due to the limit on the number of uploads I've not experimented
with possible fixes yet (e.g. URL-escaping the upload). Worst case, we
either live with it or adjust the syntax to avoid the problematic
characters.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Campbell [Tue, 19 Jan 2016 12:48:08 +0000 (12:48 +0000)]
make-flight: Support specifying a mini-os tree+revision
This is useful for standalone or adhoc use as well as (presumably)
bisection.
There is no ap-* or cr-daily-* integration here because I didn't need
it (i.e. I'm not intending to create a new mini-os branch here).
In order to cope with Xen <= 4.5, where extras/mini-os exists but is
part of xen.git rather than something cloned from elsewhere, add a
$optional argument (itself optional) to dir_identify_vcs which, if
true, causes dir_identify_vcs to return 'none' instead of failing.
Previously dir_identify_vcs failed with:
bash: line 5: fail: command not found
because the fail command is undefined. Instead, echo fail and use that
to trigger the $optional handling.
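Roughly, as an illustrative sketch (not the actual dir_identify_vcs
implementation):
  vcs=$(
      if test -d "$dir/.git"; then echo git
      elif test -d "$dir/.hg"; then echo hg
      else echo fail        # a value now, not a nonexistent command
      fi
  )
  if [ "x$vcs" = xfail ]; then
      if [ -n "$optional" ]; then vcs=none; else exit 1; fi
  fi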
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 15:54:15 +0000 (15:54 +0000)]
stop allowing libvirt failures
In Feb/Mar 2015 (not long after adding the libvirt tests) we appear to
have added test-@@-libvirt@@ to the set of allowed failures in
response to some issues with libvirtd crashing.
However looking at the history of test-@@-libvirt@@ on all branches
both in the COLO and in Cambridge (which was the production instance
back then) I don't see any evidence that this issue is still ongoing
(which matches my recollection of it having been fixed).
Therefore remove the entries allowing libvirt failures.
This effectively reverts:
00023a5af6ff allow files: Allow all libvirt test failures on other branches
83b8c8eafb18 allow.all: Do not regard libvirt guest start failures as regressions
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 6 Jan 2016 11:08:43 +0000 (11:08 +0000)]
sg-report-job-history: alternate color of osstest column only when it changes
Currently the bgcolor of the osstest column alternates on each line,
rather than only when it changes as the other revision columns do.
A given flight might touch multiple osstest revisions (although in
practice it rarely does), but it seems reasonable to simply consider
any change as a change.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 20 Jan 2016 15:06:20 +0000 (15:06 +0000)]
Debian: erase-other-disks: erase partitions first
It seems that when sdX is zeroed there is some chance that sdX[0-9]
will disappear before we get to them.
When partman comes along and recreates the partitions it is likely
that they will occupy the same disk space as before (since d-i's
autopartition is deterministic), meaning that LVM will find the old
PV headers again.
This is particularly problematic on multi-disk systems where we end
up with an LV spanning sda5 and sdb. sdb is successfully erased here
but sda5 is not; however, LVM will still find the LV with a missing PV,
which is sufficient to trigger partman-lvm's checks for erasing
devices which weren't explicitly listed, resulting in:
!! ERROR: Unable to automatically remove LVM data
Because the volume group(s) on the selected device also consist of physical
volumes on other devices, it is not considered safe to remove its LVM data
automatically. If you wish to use this device for partitioning, please remove
its LVM data first.
which cannot be preseeded around.
If the autopartitioning is not deterministic (as might be the case
when installing a different version of Debian to last time) then
going from layout A -> B -> A' risks B (by chance) not destroying the
headers created by A, meaning that A' will find them and suffer again
from the problem above. This is handled via the use of
ts-host-install-twice, which will cause A' to run twice, i.e. A -> B
-> (A' -> A''). In this case A' will fail as above, but A'' will
start up seeing the partition layout put in place by A' (which matches
A) and erase those partitions, leading to success later on.
Also erase partitions for all sd?/hd? devices, not just sda+hda.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 20 Jan 2016 15:06:19 +0000 (15:06 +0000)]
Debian: erase-other-disks: add a log() helper
Writing it out each time is too verbose.
At the same time log the set of devices present before and after each
batch of erasing, with a udev settle before the second to ensure any
changes to /dev have happened.
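Something along these lines (a minimal sketch; the real script's exact
wording may differ):
  log () { echo "erase-other-disks: $*"; }
  log "devices before: `ls /dev/sd? /dev/hd? 2>/dev/null`"
  # ... erase partitions, then whole disks ...
  udevadm settle
  log "devices after: `ls /dev/sd? /dev/hd? 2>/dev/null`"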
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 15 Jan 2016 13:35:30 +0000 (13:35 +0000)]
ts-debian-install: increase time allowed for xen-create-image
This step is consistently timing out when run on cubietruck-*. Judging
from the logs it appears to be completing during the 30s slack added
by tcmdex (i.e. after the timeout message the rest of the output
appears in the test step log).
Looking at the results on arndale-* (which looks to pass reasonably
reliably) I see that the regular test-armhf-armhf-xl job takes around
550s to do the xen-create-image while test-armhf-armhf-xl-rtds
typically takes around 1100s (twice as long).
On cubietruck-braque test-armhf-armhf-xl uses 900s. One could
therefore extrapolate that test-armhf-armhf-xl-rtds might need more
than 1800s and not be too surprised that it appears to need something
a bit more than 2000s in practice. 2500s seems like sufficient
headroom.
For comparison with arm: on x86, godello takes around 210s in the
normal case and 680s with RTDS (>3x slower) while nocera takes 265s
and 640s (2.4x). (Those are from nearby but not identical flights, in
order to match up the host.)
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 15 Jan 2016 12:23:58 +0000 (12:23 +0000)]
Allow longer timeout when creating backing file for a raw disk.
I noticed this dd timing out when recommissioning the 3 cubietrucks
(picasso, metzinger, gleizes), but looking at the log shows this has
been happening on braque too.
The current code assumes 65MB/s arriving at a timeout of 153s for the
10G file. On arndale-* the logs indicate that it is achieving 95MB/s
and taking 105-107s which results in a warning but not a failure:
execution took 105 seconds [**>153.846153846154/2**]
In experiments on a local cubietruck I observed it achieving a much
lower throughput of 40MB/s, which seems to be consistent with what
others are seeing:
https://groups.google.com/forum/#!category-topic/cubieboard/troubleshooting/7R4HlCDNCTU
Therefore calculate the timeout assuming a throughput of 20MB/s; in
practice, for a 10GB file this will result in a 500s timeout.
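That is, roughly (illustrative shell arithmetic, using decimal
megabytes as in the figures above):
  disk_mb=10000                        # 10G backing file
  rate_mb_s=20                         # pessimistic throughput assumption
  timeout=$(( disk_mb / rate_mb_s ))   # = 500 seconds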
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 18 Dec 2015 12:02:27 +0000 (12:02 +0000)]
standalone-generate-dump-flight-runvars: include cri-getconfig
Commit fb373a2096dc "cri-getconfig: Break out exec_resetting_sigint."
refactored this functionality, and asserted that cri-getconfig is the
one library which everything includes.
standalone-generate-dump-flight-runvars appears to have been the
exception to that rule.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:29:40 +0000 (13:29 +0000)]
mg-allocate: In planner mode, pre-check the arguments
Now, attempting to allocate a nonexistent host fails immediately with
a sensible message, rather than queueing up and then reporting the
message only later:
mariner:testing.git> OSSTEST_CONFIG=/u/iwj/.xen-osstest/config:local-config.test-database_iwj ./mg-allocate -U 1h spong
2015-12-17 17:05:14 Z pre-checking resources (dry run)...
2015-12-17 17:05:14 Z (precheck) task 196916 static iwj@mariner: iwj@mariner manual
*** no candidates for spong! ***
mariner:testing.git>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:17:27 +0000 (13:17 +0000)]
mg-allocate: Better error handling when no candidates
Spot when our db search revealed no candidates for the resources to
allocate, and:
- when doing an immediate allocation, call it an error
- when doing a planned allocation, cause it to prevent allocation
on this iteration, and print a suitably unreassuring message
Previously it would simply say `nothing available'.
Implement this as follows:
- Report lack of candidates as $ok=-1 from alloc_1rescand
- In alloc_1res, return this -1 as with any non-zero $ok
- Handle the new $ok at all the call sites, in particular
- In plan(), rename `allok' to `worstok' and have it be
the worst relevant $ok value. If $ok gives -1, return
undef, rather than a booking list, to the allocator core.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:27:30 +0000 (13:27 +0000)]
Executive DB retry: Avoid an undefined warning
If something other than the DB statements inside need_retry throws an
exception, ->err will normally be undef (because
$dbh_tests->begin_work will clear it, if nothing else).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:16:37 +0000 (13:16 +0000)]
mg-schema-test-database: Borrow shares properly
Previously, the test database would be generated in a broken state:
resources share-host/foo/{1,2,...} exist but the resource host/foo/0
is allocated to magic/xdbref rather than to magic/shared. This causes
various resource allocation machinery to crash. (Even if the host is
entirely un-borrowed.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Expand commit message.
Ian Jackson [Thu, 17 Dec 2015 12:10:07 +0000 (12:10 +0000)]
mg-schema-test-database: Wipe previous local plan data
Whatever is in the user's cwd is unlikely to correspond to anything
real. In principle it might be possible to obtain an official copy
from the real daemons, and massage it, or something, but that's a lot
of work.
Instead, just remove it when we start the test db daemons.
In principle it would be more correct to remove it when we set up the
test db, because it is at that point that we create the new view of
the world. Removing the old plan data when we start daemons means
that if the user, who is testing, restarts the daemons, the
newly-created queue daemon does not have information about allocations
made with the previous daemon, and instead regards those allocations
as rogue.
However, removing the file only when the daemons are started means
that if the user has saved a data-plan.pl in their cwd for some other
reason we don't remove it unless the user is actually going to run the
daemons. So I think this is preferable.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 12:09:44 +0000 (12:09 +0000)]
mg-schema-test-database: Provide some timeouts which are better for testing
The default timeouts mean that after starting a test db queue daemon
and a test db allocation attempt, we have to wait two minutes.
Lower timeouts increase the risk that we might lose noncritical races
and allocate resources to the `wrong' tasks. And they reduce the
duration of an outage which will cause a planned allocation attempt to
fail.
I think we don't care about those problems for test instances.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 4 Jan 2016 16:17:15 +0000 (16:17 +0000)]
ms-* html generation: Provide right title for projection
When ms-queuedaemon generates a resource-projection.html, it sometimes
does so from data-plan.pl (see proc report-plan). This means that
ms-planner does not get a reliable indication of whether it is being
run for the plan or the projection, and the resource-projection.html
sometimes claims to be the plan.
Fix with a new ms-planner option -W which tells it what to put in the
title, defaulting to the value passed to -w.
DEPLOYMENT NOTE:
The new ms-planner works with the old queuedaemon, so when upgrading,
it is OK to simply update the daemons-testing.git and then restart the
ms-queuedaemon.
If it is necessary to downgrade, rewinding to the old commit with a
running (new) ms-queuedaemon will cause errors from the old ms-planner
being passed -W -- but these errors are trapped and ignored. So in this
case reports will be out of date until ms-queuedaemon is also
restarted.
In either case nothing will go badly wrong.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 18:26:15 +0000 (18:26 +0000)]
tcl daemons: Fix reentrancy hazard in chan-read
If the callback called by chan-read sets up a different read handler,
and the data for that other read handler arrives before chan-read
returns, chan-read would go round its loop again and eat and process
the new data. This is wrong.
Instead, return from chan-read after processing one result from
`gets'. If there is more to do, with this handler, the fileevent will
arrange for us to be reentered.
This is most easily done by changing the `while' into an `if', and all
of the `continue's into `return's. (There are no `break's.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Tue, 15 Dec 2015 16:08:44 +0000 (16:08 +0000)]
Database locking: Tcl: Cover LOCK TABLEs with catch
Previously we would retry only the body, but not LOCK TABLEs.
We got away with it before because of the heavyweight locking of even
long-running read-only transactions, but now the LOCK TABLEs can fail
(at least in a mixed-version system, and perhaps even in a system with
only new code).
Additionally, if one of the LOCK TABLEs fails, the code's use of the
db handle becomes stuck because of the failed transaction: the error
is caught by the daemon's main loop error handler, but the db handle
is not subjected to ROLLBACK and all future attempts to use it will
fail.
So: move the LOCK TABLEs (and the SET TRANSACTION) into the catch, so
that deadlocks in LOCK TABLEs are retried (after ROLLBACK).
The COMMIT remains outside the eval but this should be unaffected by
DB deadlocks if the LOCK TABLEs are right.
Note that this code does not attempt to distinguish DB deadlock errors
from other errors. Arguably this is quite wrong. Fixing it to
distinguish deadlocks is awkward because pg_execute does not leave the
error information anywhere it can be found. Contrary to what the
documentation seems to imply, it does not set errorCode (!).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 15:36:51 +0000 (15:36 +0000)]
Database locking: Perl: Increase retry count
It seems to me that this deadlock might actually become fairly common
in some setups. There is little harm in trying it for 100s rather
than 20s, and there may be some benefit.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 15:14:34 +0000 (15:14 +0000)]
Database locking: Perl: Retry all deadlocks in PostgreSQL
Previously we would retry all COMMITs but nothing else. This is
correct for SQLite3 but not for PostgreSQL.
We got away with it before because of the heavyweight locking of even
long-running read-only transactions, but now the LOCK TABLEs can fail
(at least in a mixed-version system, and perhaps even in a system with
only new code).
So: cover all of the database work in db_retry with the eval, and
explicitly ask the JobDB adaptation layer (via a new need_retry
method) whether to go around again. We tell the JobDB layer whether
the problem was during commit, so that we can avoid making any overall
semantic change to the interaction with SQLite3.
In the PostgreSQL case, the db handle can be asked whether there was
an error and what the error code was. Deadlock has its own error
code.
(One side effect here is that db_retry_retry, which sets
$db_retry_stop='retry', is now no longer affected by the retry count
in db_retry. But there are no callers and that may be more right
anyway. db_retry_abort always exits the loop, as before.)
I tested the deadlock handling by adding a sleep(2) to the loop in
Osstest::JobDB::Executive::begin_work, and running a second copy of
the rune with the tables to lock in the other order.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
v2: Mention db_retry_retry in commit message.
Ian Jackson [Thu, 10 Dec 2015 15:31:37 +0000 (15:31 +0000)]
Schema: When creating, check that no updates are applied
If you try to run mg-schema-create on an existing instance it bombs
out right at the beginning because it tries to create the `flights'
table, which already exists.
But in the future the `flights' table might be removed in an update,
which would remove this safety catch. Then running the create might
partially succeed, leaving debris a production instance.
Detect this situation by looking for applied schema updates, and
bombing out if there are any.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v4: Add comment.
Ian Jackson [Thu, 10 Dec 2015 13:26:00 +0000 (13:26 +0000)]
Schema: Support database schema updates
See schema/README.schema, introduced in this patch, for the design.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v4: Add comment about test db safety catch.
v3: Fix spurious message from ./mg-schema-updates apply.
Fix grammar error in README.updates.
v2: Slightly increase the schema update name length format.
Docs fixes:
Change erroneous `three' to `four'.
Change `state' to `status' throughout.
Explain scope of <status>.
Sort out (and renumber) `Update order for Populate-then-rely'.
Sort out "Statuses" explanations.
Encourage use of DML update, rather than ad-hoc scripts,
for populating new columns.
Ian Jackson [Thu, 10 Dec 2015 12:13:58 +0000 (12:13 +0000)]
Schema: Remove SET OWNER and GRANT/REVOKE from schema/initial.sql
Really, we don't want the initial schema setup to mess about with
permissions. Instead, we simply expect to run the creation as the
correct role user.
So:
- Remove the code in mg-schema-test-database to remove the
permission settings from initial.sql;
- Instead, run exactly that code on initial.sql and commit the
result.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 11 Dec 2015 16:13:00 +0000 (16:13 +0000)]
Executive DB: Reduce strength of DB locks
The purpose of these locks is partly to prevent transactions being
aborted (which I'm not sure the existing code would in practice cope
with, although this is a bug) and also to avoid bugs due to the fact
that
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
does not mean that the transactions are necessarily serialisable!
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
In SQL in general it is possible for read-only transactions to
conflict with writing transactions.
However, in PostgreSQL this is not a problem because Postgres uses
multi-version concurrency control: it retains the old version of the
data while the read transaction is open:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
So a read transaction cannot cause a write transaction to abort, nor
vice versa. So there is no need for explicit database table locks to
prevent concurrent read access.
Preventing concurrent read access means that simple and urgent updates
can be unnecessarily delayed by long-running reader transactions in
the history reporters and archaeologists.
So, reduce the lock mode from ACCESS EXCLUSIVE to EXCLUSIVE. This still
conflicts with all kinds of updates and prospective updates, but no
longer with SELECT:
http://www.postgresql.org/docs/8.3/static/explicit-locking.html
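Expressed as the SQL involved (the database and table names are
illustrative; the real code takes the lock inside its own transaction
handling):
  # before: conflicted with everything, even plain SELECTs
  psql osstestdb -c 'BEGIN; LOCK TABLE resources IN ACCESS EXCLUSIVE MODE; COMMIT;'
  # after: still conflicts with all writers, but not with SELECT
  psql osstestdb -c 'BEGIN; LOCK TABLE resources IN EXCLUSIVE MODE; COMMIT;'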
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix grammar and typo in commit message.
Ian Jackson [Fri, 11 Dec 2015 16:04:11 +0000 (16:04 +0000)]
Executive DB: Eliminate SQL locking for read-only transactions
Our transactions generally run with
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
(which, incidentally, does not mean that the transactions are
necessarily serialisable!)
In SQL in general it is possible for a read-only transaction to fail
and need to be retried because some writer has updated things.
However, in PostgreSQL this is not possible because Postgres uses
multi-version concurrency control: it retains the old version of the
data while the read transaction is open:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
(And, of course, SQLite uses MVCC too, and all transactions in SQLite
are fully serialisable.)
So it is not necessary for these read-only operations to take out
locks. When they do so they can unnecessarily block other important
work for long periods of time.
With this change, we go further from the ability to support databases
other than PostgreSQL and SQLite. However, such support was very
distant anyway because of differences in SQL syntax and semantics, our
reliance in Executive mode on Postgres's command line utilities, and
so on.
We retain the db_retry framing because (a) although the retry loop is
not necessary in these cases, the transaction framing is (b) it will
make it slightly easier to reverse this decision in the future if we
ever decide to do so (c) it is less code churn.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix minor error in commit message
If OSSTEST_TASK is not set, we construct a <refkey> from the username
and the nodename, and look for such a static task. If OSSTEST_TASK
/is/ set, we would previously require it to contain `<taskid> <type> <refkey>'.
In this patch, permit OSSTEST_TASK to be set simply to the <refkey>.
This is much more convenient and doesn't involve manually looking up
taskids. The risk of error seems negligible.
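For example (the task values are as in the transcript earlier in this
series; the command is illustrative):
  # previously the full triple was required:
  OSSTEST_TASK='196916 static iwj@mariner' ./mg-allocate -U 1h somehost
  # now the refkey alone suffices:
  OSSTEST_TASK='iwj@mariner' ./mg-allocate -U 1h somehost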
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 7 Dec 2015 16:33:57 +0000 (16:33 +0000)]
mg-schema-test-database: Safety catch in JobDB database open
When we open the `osstest' database, see whether this is a parent DB
(main DB) from which a test DB has been spawned by this user.
If it has, bomb out, unless the user has specified a suitable regexp
matching the DB name in the env var
OSSTEST_DB_USEREAL_IGNORETEST
This means that when a test database is in play, the user who created
it cannot accidentally operate on the real DB.
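So, with a test DB spawned, deliberately operating on the real DB
requires something like (the regexp and command are illustrative):
  OSSTEST_DB_USEREAL_IGNORETEST='^osstestdb$' ./mg-allocate -U 1h somehost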
The safety catch does not affect Tcl programs, which get the DB config
directly, but in general that just means sg-execute-flight and
sg-run-job which already have a fair amount of safety catch because
they demand flight numbers.
mg-schema-test-database hits this feature over the head. We assume
that the caller of mg-schema-test-database knows what they are doing;
particularly, that if they create nested test DBs (!), they do not
need the assitance of this feature to stop themselves operating
mg-schema-test-database incorrectly. Anyone who creates nested test
DBs will hopefully recognise the potential for confusion!
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Fix unclarity in a comment.
Ian Jackson [Mon, 7 Dec 2015 16:37:03 +0000 (16:37 +0000)]
mg-schema-test-database: Change username for back-to-main-db xref
The `username' of the xdbref task in the test db, referring to the
main db, is changed to `PARENT' (from `<username>@<nodename>').
Currently this is purely cosmetic, but it is going to be useful to
distinguish the two cases:
* This is a test DB and contains references to a parent
* This is a parent DB (probably the main DB) which contains
references to child test DB(s).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Fix `DBS' in commit message to `DB(s)'.
v2: New patch
Ian Jackson [Fri, 4 Dec 2015 18:24:44 +0000 (18:24 +0000)]
mg-schema-test-database: Sort out daemons; provide `daemons' subcommand
We arrange for the test configuration to look for the daemons on a
different host and port, and we provide a convenient way to run such a
pair of daemons.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Moved setting of *Daemon{Host,Port} to this patch (was
previously in `mg-schema-test-database: New script')
Ian Jackson [Fri, 4 Dec 2015 17:57:54 +0000 (17:57 +0000)]
mg-schema-test-database: New script
This allows a user in non-standalone mode to make a whole new test
database, which is largely a clone of the original database.
The new db refers to the same resources (hosts), and more-or-less
safely borrows some of those hosts.
Currently we don't do anything about the queue and owner daemons.
This means that queue-daemon-based resource allocation is broken when
clients are pointed at the test db. But non-queue-based allocation
(eg, ./mg-allocate without -U) works, and the test db can be used for
db-related experiments and even support individual ts-* scripts (other
than ts-hosts-allocate of course).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Do not set *Daemon{Host,Port} - move this chunk to a later patch
Ian Jackson [Wed, 25 Nov 2015 15:34:04 +0000 (15:34 +0000)]
cri-getconfig: Provide debugging for get_psql_cmd
This allows us to execute only the first <some number> SQL
invocations. The first non-executed one is dumped, instead, by having
get_psql_cmd print a rune involving ./mg-debug-fail (which the
caller will then execute).
The locking makes things work roughly correctly if get_psql_cmd is run
in multiple processes at once: it is not defined exactly which
invocations get which counter values, but they will all work properly
and get exactly one counter value each.
If set -x is in force, turn it off for get_psql_cmd: our perl rune is
uninteresting to see repeated ad infinitum in debugging output.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 4 Dec 2015 18:12:38 +0000 (18:12 +0000)]
mg-debug-fail: New utility script for debugging
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Use "egrep ''" rather than "egrep .". Both sanitise
missing-final-newline but "egrep ''" will print blank lines,
which is desirable here.
Ian Jackson [Fri, 4 Dec 2015 18:00:48 +0000 (18:00 +0000)]
Configuration: No longer set password=<~/.xen-osstest/db-password>
Instead, expect the user to provide ~/.pgpass.
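The ~/.pgpass entry is the standard libpq one, a line of
hostname:port:database:username:password (all values below are
placeholders):
  echo 'dbhost.example.org:5432:osstestdb:osstest:swordfish' >>~/.pgpass
  chmod 600 ~/.pgpass   # libpq ignores the file unless it is 0600 or stricter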
This is a good idea because we don't really want to be handling
passwords ourselves if we can help it. And, we are shortly going to
want to do some exciting mangling of the database access
configuration, which would be complicated by the presence of this
password expansion.
This may break for some users of existing Executive (non-standalone)
setups which are using production-config-cambridge or the default
built-in configuration.
DEPLOYMENT NOTE: After this passes the push gate in Cambridge,
/export/home/osstest/.{xen-,}osstest/db-password should be deleted to
avoid confusion in the future.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Thu, 3 Dec 2015 14:55:48 +0000 (14:55 +0000)]
cr-try-bisect-adhoc: Set OSSTEST_PRIORITY=-30
This makes adhoc bisects slightly more important than smoke tests, on
the basis that a smoke test can choose another host while an adhoc
bisect cannot.
Document this in README.planner and, while there, make a note of the
usage of OSSTEST_RESOURCE_WAITSTART by cr-try-bisect.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 4 Dec 2015 13:57:59 +0000 (13:57 +0000)]
production-config-cambridge: Use new squid proxy
Specify both HttpProxy and DebianMirrorProxy. In my tests this seems
to improve some of the apparently-intercepting-proxy-related failures,
and it will certainly improve logging.
I set DebianMirrorProxy too so that queries to security.d.o go through
the proxy. Ideally we would have a apt cache that could be used as an
http proxy rather than as an origin server; when that happens we can
set DebianMirrorProxy to point to it and do away with DebianMirrorHost
(as we do in Massachusetts).
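The corresponding config fragment looks something like this (the proxy
host and port are placeholders):
  HttpProxy http://squid.example.org:3128/
  DebianMirrorProxy http://squid.example.org:3128/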
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 2 Dec 2015 16:05:04 +0000 (16:05 +0000)]
cr-try-bisect-adhoc: Set laundered_testid so graph URL is correct
Otherwise the testid is missing from the filename, resulting in e.g.
http://osstest.test-lab.xenproject.org/~osstest/pub/results-adhoc/bisect/xen-unstable/test-amd64-amd64-qemuu-nested-intel..svg
instead of test-amd64-amd64-qemuu-nested-intel.debian-hvm-install-l1-l2.svg
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 27 Nov 2015 16:36:23 +0000 (16:36 +0000)]
sg-run-job: Coalesce a couple of repetitions
Fold `guest-localmigrate.2' into `guest-localmigrate/x10' and move
`guest-start.2' to after `guest-start.repeat' (reversing the contents
of the latter so that the start comes before the stop).
(guest-start.2 is still necessary because the start/stop test leaves
the guest stopped, whereas the subsequent destroy test ought to happen
with the guest running.)
This change will allow the heisenbug compensator to see more of these
failures as the same failures.
The overall effect includes a reduction of the number of localhost
migrations from 11 to 10, but this is better than leaving a misleading
testid containing the string `x10' (or changing the testid).
It is best to fold this way, keeping the testid of the step which
previously had most of the regressions, because: the alternative,
keeping the testid of the low-repetition step, would allow osstest to
use previous lucky passes of the low-repetition step to justify
current failures of the now-high-repetition step.
To check that the effect of the patch is as intended, I ran a before
and after run with OSSTEST_SIMULATE=1, and (a) collected and sedded
and diffed the sg-run-job transcripts and (b) looked in the db.
I also ran a real test (65261 in the Xen Project test lab) with a very
similar version, which passed, and will re-run that before pushing.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
v2: Do not increment the count of migration tests, so as not to make
the testid misleading.
Do the change to the start/stop test differently.
Ian Jackson [Mon, 30 Nov 2015 13:35:54 +0000 (13:35 +0000)]
cs-bisection-step: Limit size of revision log included in reports
There is a limit in cr-daily-branch, but none in cs-bisection-step.
adhoc-revtuple-generator could usefully have this built in but that's
not so simple, so do it again here. We already slurp the whole thing
into core so from a resource usage point of view we might as well do
the length check here too.
Reported-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix typo in message.
Ian Jackson [Fri, 27 Nov 2015 11:36:05 +0000 (11:36 +0000)]
README.email: Document `fail in 58948 REGR. vs. 63449'
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Change an example number to improve clarity.
Ian Jackson [Wed, 25 Nov 2015 11:38:39 +0000 (11:38 +0000)]
Executive HTML output: More varieties of grey fruit, fewer bananas
The use of yellow for `preparing' and `running' is particularly
inappropriate in the projection summary, but is also rather misleading
when showing cancelled flights.
Use shades of grey for the different levels of in-progress-ness.
Darker shades are `more running', which seems to align better with the
shades in `Scheduled' in the flight summary.
Yellow now definitely means something is `broken', or worse.
The two places where this needs to be changed are actually
meaningfully different: report_run_getinfo works on job statuses,
whereas sg-report-flight handles only steps, which have many fewer
statuses.
Here are some samples of the output, from the Citrix Cambridge
instance:
Ian Campbell [Mon, 30 Nov 2015 11:58:42 +0000 (11:58 +0000)]
ts-debian-di-install: Don't set runvars for netboot kernel+ramdisk as outputs
Currently these runvars are either URLs provided by the definition
(e.g. make-flight) or output controller-relative paths created by the
execution (in the case where they aren't from the definition).
This weird dual semantics is confusing and wrong, and in particular is
broken if the test step is rerun (e.g. in standalone mode).
In the case where they are outputs, these paths are informational
only. The information is already available in the full logs so
dropping the runvars here merely removes the information from the
summary table. It's not so useful that this is an issue.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:46 +0000 (17:13 +0000)]
make-flight: move in-lined branch vs arch filtering into callbacks
No change to the output of standalone-generate-dump-flight-runvars.
The inlined xenbranch vs arch filters remain where they are since they
are common (in that they reflect the addition and removal of arches)
and apply equally to all make-*-flight.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:45 +0000 (17:13 +0000)]
make-flight: consolidate branch_filter_callback for builds and tests
Currently we have a test_matrix_branch_filter_callback which filters
jobs based on the $xenarch and $branch and a separate more adhoc
filter inline for the build jobs.
This has led to things getting out of sync in the past (e.g. recently
we dropped armhf tests from the linux-3.10 and -3.14 branches but not
the build jobs).
Add a new build_matrix_branch_filter_callback and for make-flight
cause this and test_matrix_branch_filter_callback to use a common
helper.
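A sketch of the shape this takes (the shared helper's name and exact
conditions are illustrative; the two callback names are the real ones
mentioned above):
  branch_filter_common () {
      case "$branch/$xenarch" in
      linux-3.10/armhf|linux-3.14/armhf) return 1;;  # e.g. armhf dropped here
      esac
      return 0
  }
  build_matrix_branch_filter_callback () { branch_filter_common; }
  test_matrix_branch_filter_callback () { branch_filter_common; }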
The adhoc filtering in the build loop remains and will be tidied up
next.
For make-distros-flight just add a nop build filter alongside the test
filter.
No change to the output of standalone-generate-dump-flight-runvars.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:44 +0000 (17:13 +0000)]
make-*flight: Reorder.
I just got tripped up again by putting a build job filter definition
after the call to create_build_jobs. Reorder the make-*flight scripts
to reduce the probability of me doing so any more times.
The general order of these scripts is now:
- job filter callbacks
- test job creation
- top-level code which drives the process.
No change to the output of standalone-generate-dump-flight-runvars.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>