Ian Campbell [Tue, 19 Jan 2016 12:48:08 +0000 (12:48 +0000)]
make-flight: Support specifying a mini-os tree+revision
This is useful for standalone or adhoc use as well as (presumably)
bisection.
There is no ap-* or cr-daily-* integration here because I didn't need
it (i.e. I'm not intending to create a new mini-os branch here).
In order to cope with Xen <= 4.5 where extras/mini-os exists but is
part of xen.git and not something cloned from elsewhere add a
$optional argument (itself optional) to dir_identify_vcs which if true
causes dir_identify_vcs to return 'none' instead of failing.
Previously dir_identify_vcs failed with:
bash: line 5: fail: command not found
because the fail command is undefined. Instead echo fail and use that
to trigger the $optional handling.
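The $optional behaviour can be illustrated with a minimal Python sketch. This is not the shell code in the tree: the marker list, the Python function signature and the error text are assumptions made for illustration only.

```python
import os
import tempfile


def dir_identify_vcs(path, optional=False):
    """Return the VCS name for `path`, or 'none' if nothing is found and optional is true."""
    # Hypothetical marker list; the real script's probes may differ.
    markers = [(".git", "git"), (".hg", "hg")]
    for marker, vcs in markers:
        if os.path.exists(os.path.join(path, marker)):
            return vcs
    if optional:
        # e.g. extras/mini-os on Xen <= 4.5, which is part of xen.git itself
        return "none"
    raise RuntimeError(f"unable to determine vcs for {path}")


# Usage: an empty directory stands in for an in-tree extras/mini-os.
with tempfile.TemporaryDirectory() as plain_dir:
    in_tree_result = dir_identify_vcs(plain_dir, optional=True)
```

With optional left false the same directory raises instead, which corresponds to the previous hard failure.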
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 18 Jan 2016 15:54:15 +0000 (15:54 +0000)]
stop allowing libvirt failures
In Feb/Mar 2015 (not long after adding the libvirt tests) we appear to
have added test-@@-libvirt@@ to the set of allowed failures in
response to some issues with libvirtd crashing.
However looking at the history of test-@@-libvirt@@ on all branches
both in the COLO and in Cambridge (which was the production instance
back then) I don't see any evidence that this issue is still ongoing
(which matches my recollection of it having been fixed).
Therefore remove the entries allowing libvirt failures.
This effectively reverts:
00023a5af6ff allow files: Allow all libvirt test failures on other branches
83b8c8eafb18 allow.all: Do not regard libvirt guest start failures as regressions
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 6 Jan 2016 11:08:43 +0000 (11:08 +0000)]
sg-report-job-history: alternate color of osstest column only when it changes
Currently the bgcolor of the osstest column alternates on each line,
rather than only when it changes as the other revision columns do.
A given flight might touch multiple osstest revisions (although in
practice they rarely do) but it seems reasonable to simply consider
any change as a change.
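A minimal sketch of the intended alternation, assuming one revision string per row. The function name and the colour values are hypothetical, not taken from sg-report-job-history.

```python
def column_colours(values, colours=("#d0d0d0", "#ffffff")):
    """Return one colour per row, flipping only when the value changes."""
    result = []
    index = 0
    previous = None
    for row, value in enumerate(values):
        # Flip on any change from the previous row; never flip on the first row.
        if row > 0 and value != previous:
            index ^= 1
        result.append(colours[index])
        previous = value
    return result
```

Rows sharing an osstest revision then share a background colour, as the other revision columns already do.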
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 20 Jan 2016 15:06:20 +0000 (15:06 +0000)]
Debian: erase-other-disks: erase partitions first
It seems that when sdX is zeroed there is some chance that sdX[0-9]
will disappear before we get to them.
When partman comes along and recreates the partitions it is likely
that they will occupy the same disk space as before (since d-i's
autopartition is deterministic), meaning that LVM will find the old
PV headers again.
This is in particular problematic on multi disk systems where we end
up with an LV spanning sda5 and sdb. sdb is successfully erased here
but sda5 is not. However, LVM will still find the LV with a missing PV,
which is sufficient to trigger partman-lvm's checks for erasing
devices which weren't explicitly listed, resulting in:
!! ERROR: Unable to automatically remove LVM data
Because the volume group(s) on the selected device also consist of physical
volumes on other devices, it is not considered safe to remove its LVM data
automatically. If you wish to use this device for partitioning, please remove
its LVM data first.
which cannot be preseeded around.
If the autopartitioning is not deterministic (as might be the case
when installing a different version of Debian to last time) then
going from layout A -> B -> A' risks B (by chance) not destroying the
headers created by A, meaning that A' will find them and suffer again
from the problem above. This is handled via the use of
ts-host-install-twice which will cause A' to run twice, i.e. A -> B
-> (A' -> A''). In this case A' will fail as above, but A'' will
start up seeing the partition layout put in place by A' (which matches
A) and erase those partitions, leading to success later on.
Also erase partitions for all sd/hd? not just sda+hda.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 20 Jan 2016 15:06:19 +0000 (15:06 +0000)]
Debian: erase-other-disks: add a log() helper
Writing it out each time is too verbose.
At the same time log the set of devices present before and after each
batch of erasing, with a udev settle before the second to ensure any
changes to /dev have happened.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 15 Jan 2016 13:35:30 +0000 (13:35 +0000)]
ts-debian-install: increase time allowed for xen-create-image
This step is consistently timing out when run on cubietruck-*. Judging
from the logs it appears to be completing during the 30s slack added
by tcmdex (i.e. after the timeout message the rest of the output
appears in the test step log).
Looking at the results on arndale-* (which looks to pass reasonably
reliably) I see that the regular test-armhf-armhf-xl job takes around
550s to do the xen-create-image while test-armhf-armhf-xl-rtds
typically takes around 1100s (twice as long).
On cubietruck-braque test-armhf-armhf-xl uses 900s. One could
therefore extrapolate that test-armhf-armhf-xl-rtds might need more
than 1800s and not be too surprised that it appears to need something
a bit more than 2000s in practice. 2500s seems like sufficient
headroom.
For comparison with ARM: on x86, godello takes around 210s in the
normal case and 680s with RTDS (>3x slower) while nocera takes 265s
and 640s (2.4x). (Those are from nearby but not identical flights in
order to match up the host).
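The extrapolation for cubietruck can be written out as a small calculation. The function name is hypothetical; the figures are the ones quoted in this message.

```python
def extrapolate_rtds(host_plain_s, ref_plain_s, ref_rtds_s):
    """Scale a host's plain-job time by the RTDS slowdown seen on a reference host."""
    slowdown = ref_rtds_s / ref_plain_s
    return host_plain_s * slowdown


# arndale: 550s plain, 1100s with RTDS; cubietruck-braque: 900s plain.
cubietruck_estimate = extrapolate_rtds(900, 550, 1100)
```

This gives the 1800s estimate; the observed need of a bit over 2000s is why 2500s was chosen.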
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 15 Jan 2016 12:23:58 +0000 (12:23 +0000)]
Allow longer timeout when creating backing file for a raw disk.
I noticed this dd timing out when recommissioning the 3 cubietrucks
(picasso, metzinger, gleizes) but looking at the log shows this has
been happening on braque too.
The current code assumes 65MB/s arriving at a timeout of 153s for the
10G file. On arndale-* the logs indicate that it is achieving 95MB/s
and taking 105-107s which results in a warning but not a failure:
execution took 105 seconds [**>153.846153846154/2**]
In experiments on a local cubietruck I observed it achieving a much
lower throughput of 40MB/s, which seems to be consistent with what
others are seeing:
https://groups.google.com/forum/#!category-topic/cubieboard/troubleshooting/7R4HlCDNCTU
Therefore calculate the timeout assuming a throughput of 20MB/s, in
practice for a 10GB file this will result in a 500s timeout.
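The timeout arithmetic, as a sketch. The function name is hypothetical, and the 10G file is taken as 10000MB, which is what the 153.846s figure in the warning implies.

```python
def dd_timeout_seconds(size_mb, assumed_mb_per_s=20):
    """Timeout for writing size_mb megabytes at an assumed sustained throughput."""
    return size_mb / assumed_mb_per_s


old_timeout = dd_timeout_seconds(10000, assumed_mb_per_s=65)  # previous assumption
new_timeout = dd_timeout_seconds(10000)                       # 20MB/s assumption
```

At the observed 40MB/s the dd takes roughly 250s, half of the new 500s timeout.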
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Fri, 18 Dec 2015 12:02:27 +0000 (12:02 +0000)]
standalone-generate-dump-flight-runvars: include cri-getconfig
Commit fb373a2096dc "cri-getconfig: Break out exec_resetting_sigint."
refactored this functionality, and asserted that cri-getconfig is the
one library which everything includes.
standalone-generate-dump-flight-runvars appears to have been the
exception to that rule.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:29:40 +0000 (13:29 +0000)]
mg-allocate: In planner mode, pre-check the arguments
Now, attempting to allocate a nonexistent host fails immediately with
a sensible message, rather than queueing up and then reporting the
message only later:
mariner:testing.git> OSSTEST_CONFIG=/u/iwj/.xen-osstest/config:local-config.test-database_iwj ./mg-allocate -U 1h spong
2015-12-17 17:05:14 Z pre-checking resources (dry run)...
2015-12-17 17:05:14 Z (precheck) task 196916 static iwj@mariner: iwj@mariner manual
*** no candidates for spong! ***
mariner:testing.git>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:17:27 +0000 (13:17 +0000)]
mg-allocate: Better error handling when no candidates
Spot when our db search revealed no candidates for the resources to
allocate, and:
- when doing an immediate allocation, call it an error
- when doing a planned allocation, cause it to prevent allocation
on this iteration, and print a suitably unreassuring message
Previously it would simply say `nothing available'.
Implement this as follows:
- Report lack of candidates as $ok=-1 from alloc_1rescand
- In alloc_1res, return this -1 as with any non-zero $ok
- Handle the new $ok at all the call sites, in particular
- In plan(), rename `allok' to `worstok' and have it be
the worst relevant $ok value. If $ok gives -1, return
undef, rather than a booking list, to the allocator core.
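A rough sketch of the `worstok' idea, assuming the ordering -1 (no candidates) < 0 (nothing available now) < 1 (ok), so that the minimum is the worst. The Python function and its return shape are illustrative assumptions, not the Perl in mg-allocate.

```python
def plan(ok_values, bookings):
    """Combine per-resource $ok values; None stands in for Perl's undef."""
    # Assumption: smaller is worse, so the worst relevant value is the minimum.
    worstok = min(ok_values, default=1)
    if worstok == -1:
        # Some resource had no candidates at all: report no booking list.
        return None
    return worstok, bookings
```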
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:27:30 +0000 (13:27 +0000)]
Executive DB retry: Avoid an undefined warning
If something other than the DB statements inside need_retry throws an
exception, ->err will normally be undef (because
$dbh_tests->begin_work will clear it, if nothing else).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 13:16:37 +0000 (13:16 +0000)]
mg-schema-test-database: Borrow shares properly
Previously, the test database would be generated in a broken state:
resources share-host/foo/{1,2,...} exist but the resource host/foo/0
is allocated to magic/xdbref rather than to magic/shared. This causes
various resource allocation machinery to crash. (Even if the host is
entirely un-borrowed.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Expand commit message.
Ian Jackson [Thu, 17 Dec 2015 12:10:07 +0000 (12:10 +0000)]
mg-schema-test-database: Wipe previous local plan data
Whatever is in the user's cwd is unlikely to correspond to anything
real. In principle it might be possible to obtain an official copy
from the real daemons, and massage it, or something, but that's a lot
of work.
Instead, just remove it when we start the test db daemons.
In principle it would be more correct to remove it when we set up the
test db, because it is at that point that we create the new view of
the world. Removing the old plan data when we start daemons means
that if the user, who is testing, restarts the daemons, the
newly-created queue daemon does not have information about allocations
made with the previous daemon, and instead regards those allocations
as rogue.
However, removing the file only when the daemons are started means
that if the user has saved a data-plan.pl in their cwd for some other
reason we don't remove it unless the user is actually going to run the
daemons. So I think this is preferable.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 17 Dec 2015 12:09:44 +0000 (12:09 +0000)]
mg-schema-test-database: Provide some timeouts which are better for testing
The default timeouts mean that after starting a test db queue daemon
and a test db allocation attempt, we have to wait two minutes.
Lower timeouts increase the risk that we might lose noncritical races
and allocate resources to the `wrong' tasks. And they reduce the
duration of an outage which will cause a planned allocation attempt to
fail.
I think we don't care about those problems for test instances.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 4 Jan 2016 16:17:15 +0000 (16:17 +0000)]
ms-* html generation: Provide right title for projection
When ms-queuedaemon generates a resource-projection.html, it sometimes
does so from data-plan.pl (see proc report-plan). This means that
ms-planner does not get a reliable indication of whether it is being
run for the plan or the projection, and the resource-projection.html
sometimes claims to be the plan.
Fix with a new ms-planner option -W which tells it what to put in the
title, defaulting to the value passed to -w.
DEPLOYMENT NOTE:
The new ms-planner works with the old queuedaemon, so when upgrading,
it is OK to simply update the daemons-testing.git and then restart the
ms-queuedaemon.
If it is necessary to downgrade, rewinding to the old commit with a
running ms-queuedaemon will cause errors from the old ms-planner being
passed -W -- but these errors are trapped and ignored. So in this
case reports will be out of date until ms-queuedaemon is also
restarted.
In either case nothing will go badly wrong.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 18:26:15 +0000 (18:26 +0000)]
tcl daemons: Fix reentrancy hazard in chan-read
If the callback called by chan-read sets up a different read handler,
and the data for that other read handler arrives before chan-read
returns, chan-read would go round its loop again and eat and process
the new data. This is wrong.
Instead, return from chan-read after processing one result from
`gets'. If there is more to do, with this handler, the fileevent will
arrange for us to be reentered.
This is most easily done by changing the `while' into an `if', and all
of the `continue's into `return's. (There are no `break's.)
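The hazard and the fix can be modelled in a few lines of Python. This is a simulation with a hypothetical Channel class and event loop, not the Tcl code: the point is only that the reader handles one line and then returns, so a handler installed by the callback sees the next line.

```python
from collections import deque


class Channel:
    """Toy stand-in for a Tcl channel with one readable-event handler."""

    def __init__(self, lines):
        self.pending = deque(lines)
        self.handler = None


def chan_read_once(chan, callback):
    """Process at most one line, then return to the event loop."""
    if chan.pending:  # an `if' where the looping version had a `while'
        callback(chan, chan.pending.popleft())


def event_loop(chan):
    # Re-enter whichever handler is currently installed while data remains.
    while chan.pending and chan.handler is not None:
        chan.handler(chan)


seen_by_first, seen_by_second = [], []


def second_handler(chan):
    chan_read_once(chan, lambda c, line: seen_by_second.append(line))


def first_callback(chan, line):
    seen_by_first.append(line)
    chan.handler = second_handler  # the callback switches handlers mid-stream


def first_handler(chan):
    chan_read_once(chan, first_callback)


chan = Channel(["greeting", "payload"])
chan.handler = first_handler
event_loop(chan)
```

With a `while' in chan_read_once, first_callback would also have consumed "payload", which is the bug described here.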
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Tue, 15 Dec 2015 16:08:44 +0000 (16:08 +0000)]
Database locking: Tcl: Cover LOCK TABLEs with catch
Previously we would retry only the body, but not LOCK TABLEs.
We got away with it before because of the heavyweight locking of even
long-running read-only transactions, but now the LOCK TABLEs can fail
(at least in a mixed-version system, and perhaps even in a system with
only new code).
Additionally, if one of the LOCK TABLEs fails, the code's use of the
db handle becomes stuck because of the failed transaction: the error
is caught by the daemon's main loop error handler, but the db handle
is not subjected to ROLLBACK and all future attempts to use it will
fail.
So: move the LOCK TABLEs (and the SET TRANSACTION) into the catch, so
that deadlocks in LOCK TABLEs are retried (after ROLLBACK).
The COMMIT remains outside the eval but this should be unaffected by
DB deadlocks if the LOCK TABLEs are right.
Note that this code does not attempt to distinguish DB deadlock errors
from other errors. Arguably this is quite wrong. Fixing it to
distinguish deadlocks is awkward because pg_execute does not leave the
error information anywhere it can be found. Contrary to what the
documentation seems to imply, it does not set errorCode (!)
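The new control flow, sketched in Python against a fake database handle. FakeDB, the retry limit and the function name are all hypothetical; like the Tcl code, the sketch retries on any error rather than only on deadlocks.

```python
class FakeDB:
    """Records statements; fails the first `lock_failures` LOCK TABLE calls."""

    def __init__(self, lock_failures):
        self.lock_failures = lock_failures
        self.log = []

    def execute(self, sql):
        self.log.append(sql)
        if sql.startswith("LOCK TABLE") and self.lock_failures > 0:
            self.lock_failures -= 1
            raise RuntimeError("deadlock detected")


def transaction(db, tables, body, max_tries=5):
    for _ in range(max_tries):
        try:
            db.execute("BEGIN")
            db.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
            for table in tables:
                db.execute(f"LOCK TABLE {table}")  # now inside the catch
            result = body(db)
        except RuntimeError:
            db.execute("ROLLBACK")  # un-stick the handle before retrying
            continue
        db.execute("COMMIT")  # remains outside the catch
        return result
    raise RuntimeError("too many retries")
```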
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 15:36:51 +0000 (15:36 +0000)]
Database locking: Perl: Increase retry count
It seems to me that this deadlock might actually become fairly common
in some setups. There is little harm in trying it for 100s rather
than 20s, and there may be some benefit.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 15 Dec 2015 15:14:34 +0000 (15:14 +0000)]
Database locking: Perl: Retry all deadlocks in PostgreSQL
Previously we would retry all COMMITs but nothing else. This is
correct for SQLite3 but not for PostgreSQL.
We got away with it before because of the heavyweight locking of even
long-running read-only transactions, but now the LOCK TABLEs can fail
(at least in a mixed-version system, and perhaps even in a system with
only new code).
So: cover all of the database work in db_retry with the eval, and
explicitly ask the JobDB adaptation layer (via a new need_retry
method) whether to go around again. We tell the JobDB layer whether
the problem was during commit, so that we can avoid making any overall
semantic change to the interaction with SQLite3.
In the PostgreSQL case, the db handle can be asked whether there was
an error and what the error code was. Deadlock has its own error
code.
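A sketch of the division of labour between db_retry and need_retry. The class names, the exception type and the retry limit are hypothetical; the only fact relied on is that PostgreSQL reports deadlock_detected as SQLSTATE 40P01.

```python
DEADLOCK_SQLSTATE = "40P01"  # PostgreSQL's deadlock_detected


class DBError(Exception):
    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


class SQLiteLikeJobDB:
    def need_retry(self, sqlstate, during_commit):
        # Unchanged semantics: only failures during COMMIT are retried.
        return during_commit


class PostgresLikeJobDB:
    def need_retry(self, sqlstate, during_commit):
        # Assumption for this sketch: retry exactly when the DB reports a deadlock.
        return sqlstate == DEADLOCK_SQLSTATE


def db_retry(jobdb, work, commit, max_tries=20):
    for _ in range(max_tries):
        during_commit = False
        try:
            result = work()  # all of the database work is now covered
            during_commit = True
            commit()
            return result
        except DBError as err:
            if not jobdb.need_retry(err.sqlstate, during_commit):
                raise
    raise RuntimeError("database retries exhausted")
```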
(One side effect here is that db_retry_retry, which sets
$db_retry_stop='retry', is now no longer affected by the retry count
in db_retry. But there are no callers and that may be more right
anyway. db_retry_abort always exits the loop, as before.)
Tested by adding a sleep(2) to the loop in
Osstest::JobDB::Executive::begin_work, and running a second copy of
the rune with the tables to lock in the other order.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
v2: Mention db_retry_retry in commit message.
Ian Jackson [Thu, 10 Dec 2015 15:31:37 +0000 (15:31 +0000)]
Schema: When creating, check that no updates are applied
If you try to run mg-schema-create on an existing instance it bombs
out right at the beginning because it tries to create the `flights'
table, which already exists.
But in the future the `flights' table might be removed in an update,
which would remove this safety catch. Then running the create might
partially succeed, leaving debris in a production instance.
Detect this situation by looking for applied schema updates, and
bombing out if there are any.
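A sketch of such a check using Python's sqlite3 module. The table name `schema_updates' is a hypothetical stand-in; the real bookkeeping is described in schema/README.schema.

```python
import sqlite3


def check_no_updates_applied(conn):
    """Refuse to create the schema if any update is recorded as applied."""
    # `schema_updates` is a hypothetical table name for this sketch.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name='schema_updates'"
    ).fetchone()
    if exists and conn.execute("SELECT 1 FROM schema_updates LIMIT 1").fetchone():
        raise RuntimeError("schema updates already applied; refusing to create")
```

Unlike the `flights' test, this does not depend on any particular table from the initial schema surviving future updates.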
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v4: Add comment.
Ian Jackson [Thu, 10 Dec 2015 13:26:00 +0000 (13:26 +0000)]
Schema: Support database schema updates
See schema/README.schema, introduced in this patch, for the design.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v4: Add comment about test db safety catch.
v3: Fix spurious message from ./mg-schema-updates apply.
Fix grammar error in README.updates.
v2: Slight increase schema update name length format.
Docs fixes:
Change erroneous `three' to `four'.
Change `state' to `status' throughout.
Explain scope of <status>.
Sort out (and renumber) `Update order for Populate-then-rely'.
Sort out "Statuses" explanations.
Encourage use of DML update, rather than ad-hoc scripts,
for populating new columns.
Ian Jackson [Thu, 10 Dec 2015 12:13:58 +0000 (12:13 +0000)]
Schema: Remove SET OWNER and GRANT/REVOKE from schema/initial.sql
Really, we don't want the initial schema setup to mess about with
permissions. Instead, we simply expect to run the creation as the
correct role user.
So:
- Remove the code in mg-schema-test-database to remove the
permission settings from initial.sql;
- Instead, run exactly that code on initial.sql and commit the
result.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 11 Dec 2015 16:13:00 +0000 (16:13 +0000)]
Executive DB: Reduce strength of DB locks
The purpose of these locks is partly to prevent transactions being
aborted (which I'm not sure the existing code would in practice cope
with, although this is a bug) and also to avoid bugs due to the fact
that
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
does not mean that the transactions are necessarily serialisable!
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
In SQL in general it is possible for read-only transactions to
conflict with writing transactions.
However, in PostgreSQL this is not a problem because Postgres uses
multi-version concurrency control: it retains the old version of the
data while the read transaction is open:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
So a read transaction cannot cause a write transaction to abort, nor
vice versa. So there is no need to have the database explicit table
locks prevent concurrent read access.
Preventing concurrent read access means that simple and urgent updates
can be unnecessarily delayed by long-running reader transactions in
the history reporters and archaeologists.
So, reduce the lock mode from ACCESS EXCLUSIVE to EXCLUSIVE. This still
conflicts with all kinds of updates and prospective updates, but no
longer with SELECT:
http://www.postgresql.org/docs/8.3/static/explicit-locking.html
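The effect on readers can be checked against PostgreSQL's documented lock conflict table. The sketch assumes the reduced mode is PostgreSQL's EXCLUSIVE mode, whose documented conflicts match the description here; the helper name is hypothetical.

```python
ALL_MODES = [
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
]

# Modes that conflict with a held table lock, per the explicit-locking docs.
CONFLICTS = {
    "ACCESS EXCLUSIVE": set(ALL_MODES),
    "EXCLUSIVE": set(ALL_MODES) - {"ACCESS SHARE"},
}

# Lock mode each statement type acquires on the tables it touches.
STATEMENT_MODE = {
    "SELECT": "ACCESS SHARE",
    "SELECT FOR UPDATE": "ROW SHARE",
    "UPDATE": "ROW EXCLUSIVE",
}


def blocks(held_mode, statement):
    """True if `statement` must wait while `held_mode` is held on the table."""
    return STATEMENT_MODE[statement] in CONFLICTS[held_mode]
```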
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix grammar and typo in commit message.
Ian Jackson [Fri, 11 Dec 2015 16:04:11 +0000 (16:04 +0000)]
Executive DB: Eliminate SQL locking for read-only transactions
Our transactions generally run with
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
(which, incidentally, does not mean that the transactions are
necessarily serialisable!)
In SQL in general it is possible for a read-only transaction to fail
and need to be retried because some writer has updated things.
However, in PostgreSQL this is not possible because Postgres uses
multi-version concurrency control: it retains the old version of the
data while the read transaction is open:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html
(And, of course, SQLite uses MVCC too, and all transactions in SQLite
are fully serialisable.)
So it is not necessary for these read-only operations to take out
locks. When they do so they can unnecessarily block other important
work for long periods of time.
With this change, we go further from the ability to support databases
other than PostgreSQL and SQLite. However, such support was very
distant anyway because of differences in SQL syntax and semantics, our
reliance in Executive mode on Postgres's command line utilities, and
so on.
We retain the db_retry framing because (a) although the retry loop is
not necessary in these cases, the transaction framing is (b) it will
make it slightly easier to reverse this decision in the future if we
ever decide to do so (c) it is less code churn.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix minor error in commit message
If OSSTEST_TASK is not set, we construct a <refkey> from the username
and the nodename, and look for such a static task. If OSSTEST_TASK
/is/ set, we would require it to contain `<taskid> <type> <refkey>'.
In this patch, permit OSSTEST_TASK to be set simply to the <refkey>.
This is much more convenient and doesn't involve manually looking up
taskids. The risk of error seems negligible.
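The two accepted forms can be sketched as follows. The function name and the returned dictionary are hypothetical; in the bare-<refkey> case the taskid still has to be looked up in the database, which the sketch leaves out.

```python
def parse_osstest_task(value):
    """Accept either `<taskid> <type> <refkey>' or a bare `<refkey>'."""
    fields = value.split()
    if len(fields) == 3:
        taskid, task_type, refkey = fields
        return {"taskid": int(taskid), "type": task_type, "refkey": refkey}
    if len(fields) == 1:
        # New form: only the refkey; taskid and type are to be looked up.
        return {"taskid": None, "type": None, "refkey": fields[0]}
    raise ValueError(f"cannot parse OSSTEST_TASK value {value!r}")
```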
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 7 Dec 2015 16:33:57 +0000 (16:33 +0000)]
mg-schema-test-database: Safety catch in JobDB database open
When we open the `osstest' database, see whether this is a parent DB
(main DB) from which a test DB has been spawned by this user.
If it has, bomb out, unless the user has specified a suitable regexp
matching the DB name in the env var
OSSTEST_DB_USEREAL_IGNORETEST
This means that when a test database is in play, the user who created
it cannot accidentally operate on the real DB.
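The shape of the check, as a Python sketch. The function name, its arguments and the use of an unanchored search are assumptions; only the environment variable name comes from this patch.

```python
import re


def check_real_db_allowed(dbname, has_spawned_test_db, environ):
    """Raise unless it is safe, or explicitly permitted, to open the parent DB."""
    if not has_spawned_test_db:
        return
    pattern = environ.get("OSSTEST_DB_USEREAL_IGNORETEST")
    # Assumption: the regexp is matched unanchored against the DB name.
    if pattern is not None and re.search(pattern, dbname):
        return
    raise RuntimeError(
        f"database {dbname} has a test DB spawned by this user; "
        "set OSSTEST_DB_USEREAL_IGNORETEST to override"
    )
```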
The safety catch does not affect Tcl programs, which get the DB config
directly, but in general that just means sg-execute-flight and
sg-run-job which already have a fair amount of safety catch because
they demand flight numbers.
mg-schema-test-database hits this feature over the head. We assume
that the caller of mg-schema-test-database knows what they are doing;
particularly, that if they create nested test DBs (!), they do not
need the assistance of this feature to stop themselves operating
mg-schema-test-database incorrectly. Anyone who creates nested test
DBs will hopefully recognise the potential for confusion!
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Fix unclarity in a comment.
Ian Jackson [Mon, 7 Dec 2015 16:37:03 +0000 (16:37 +0000)]
mg-schema-test-database: Change username for back-to-main-db xref
The `username' of the xdbref task in the test db, referring to the
main db, is changed to `PARENT' (from `<username>@<nodename>').
Currently this is purely cosmetic, but it is going to be useful to
distinguish the two cases:
* This is a test DB and contains references to a parent
* This is a parent DB (probably the main DB) which contains
references to child test DB(s).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v3: Fix `DBS' in commit message to `DB(s)'.
v2: New patch
Ian Jackson [Fri, 4 Dec 2015 18:24:44 +0000 (18:24 +0000)]
mg-schema-test-database: Sort out daemons; provide `daemons' subcommand
We arrange for the test configuration to look for the daemons on a
different host and port, and we provide a convenient way to run such a
pair of daemons.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Moved setting of *Daemon{Host,Port} to this patch (was
previously in `mg-schema-test-database: New script')
Ian Jackson [Fri, 4 Dec 2015 17:57:54 +0000 (17:57 +0000)]
mg-schema-test-database: New script
This allows a user in non-standalone mode to make a whole new test
database, which is largely a clone of the original database.
The new db refers to the same resources (hosts), and more-or-less
safely borrows some of those hosts.
Currently we don't do anything about the queue and owner daemons.
This means that queue-daemon-based resource allocation is broken when
clients are pointed at the test db. But non-queue-based allocation
(eg, ./mg-allocate without -U) works, and the test db can be used for
db-related experiments and even support individual ts-* scripts (other
than ts-hosts-allocate of course).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Do not set *Daemon{Host,Port} - move this chunk to a later patch
Ian Jackson [Wed, 25 Nov 2015 15:34:04 +0000 (15:34 +0000)]
cri-getconfig: Provide debugging for get_psql_cmd
This allows us to execute only the first <some number> SQL
invocations. The first non-executed one is dumped, instead, by having
get_psql_command print a rune involving ./mg-debug-fail (which the
caller will then execute).
The locking makes things work roughly-correctly if get_psql_cmd is run
in multiple processes at once: it is not defined exactly which
invocations get which counter values, but they will all work properly
and get exactly one counter value each.
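The locked counter can be sketched in Python with fcntl.flock. The file name, function names and limit handling are hypothetical and do not reflect the perl rune actually used; the sketch only shows why each invocation gets exactly one counter value.

```python
import fcntl
import os
import tempfile


def next_counter(path):
    """Atomically increment and return a counter stored in `path`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # released when the file is closed
        text = f.read().strip()
        value = int(text) + 1 if text else 1
        f.seek(0)
        f.truncate()
        f.write(f"{value}\n")
        f.flush()
        return value


def should_execute(path, limit):
    """True for the first `limit` invocations, False afterwards."""
    return next_counter(path) <= limit


with tempfile.TemporaryDirectory() as tmp:
    counter_file = os.path.join(tmp, "psql-counter")
    decisions = [should_execute(counter_file, limit=2) for _ in range(3)]
```

Which process gets which value is not defined, but no two processes can get the same one.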
If set -x is in force, turn it off for get_psql_cmd: our perl rune is
uninteresting to see repeated ad infinitum in debugging output.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 4 Dec 2015 18:12:38 +0000 (18:12 +0000)]
mg-debug-fail: New utility script for debugging
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Use "egrep ''" rather than "egrep .". Both sanitise
missing-final-newline but "egrep ''" will print blank lines,
which is desirable here.
Ian Jackson [Fri, 4 Dec 2015 18:00:48 +0000 (18:00 +0000)]
Configuration: No longer set password=<~/.xen-osstest/db-password>
Instead, expect the user to provide ~/.pgpass.
This is a good idea because we don't really want to be handling
passwords ourselves if we can help it. And, we are shortly going to
want to do some exciting mangling of the database access
configuration, which would be complicated by the presence of this
password expansion.
This may break for some users of existing Executive (non-standalone)
setups which are using production-config-cambridge or the default
built-in configuration.
DEPLOYMENT NOTE: After this passes the push gate in Cambridge,
/export/home/osstest/.{xen-,}osstest/db-password should be deleted to
avoid confusion in the future.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Thu, 3 Dec 2015 14:55:48 +0000 (14:55 +0000)]
cr-try-bisect-adhoc: Set OSSTEST_PRIORITY=-30
This makes adhoc bisects slightly more important than smoke tests, on
the basis that a smoke test can choose another host while an adhoc
bisect cannot.
Document this in README.planner and, while there, make a note of the
usage of OSSTEST_RESOURCE_WAITSTART by cr-try-bisect.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 4 Dec 2015 13:57:59 +0000 (13:57 +0000)]
production-config-cambridge: Use new squid proxy
Specify both HttpProxy and DebianMirrorProxy. In my tests this seems
to improve some of the apparently-intercepting-proxy-related failures,
and it will certainly improve logging.
I set DebianMirrorProxy too so that queries to security.d.o go through
the proxy. Ideally we would have an apt cache that could be used as an
http proxy rather than as an origin server; when that happens we can
set DebianMirrorProxy to point to it and do away with DebianMirrorHost
(as we do in Massachusetts).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 2 Dec 2015 16:05:04 +0000 (16:05 +0000)]
cr-try-bisect-adhoc: Set laundered_testid so graph URL is correct
Otherwise the testid is missing from the filename, resulting in e.g.
http://osstest.test-lab.xenproject.org/~osstest/pub/results-adhoc/bisect/xen-unstable/test-amd64-amd64-qemuu-nested-intel..svg
Instead of test-amd64-amd64-qemuu-nested-intel.debian-hvm-install-l1-l2.svg
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 27 Nov 2015 16:36:23 +0000 (16:36 +0000)]
sg-run-job: Coalesce a couple of repetitions
Fold `guest-localmigrate.2' into `guest-localmigrate/x10' and move
`guest-start.2' to after `guest-start.repeat' (reversing the contents
of the latter so that the start comes before the stop).
(guest-start.2 is still necessary because the start/stop test leaves
the guest stopped, whereas the subsequent destroy test ought to happen
with the guest running.)
This change will allow the heisenbug compensator to see more of these
failures as the same failures.
The overall effect includes a reduction of the number of localhost
migrations from 11 to 10, but this is better than leaving a misleading
testid containing the string `x10' (or changing the testid).
It is best to fold this way, keeping the testid of the step which
previously had most of the regressions, because: the alternative,
keeping the testid of the low-repetition step, would allow osstest to
use previous lucky passes of the low-repetition step to justify
current failures of the now-high-repetition step.
To check that the effect of the patch is as intended, I ran a before
and after run with OSSTEST_SIMULATE=1, and (a) collected and sedded
and diffed the sg-run-job transcripts and (b) looked in the db.
I also ran a real test (65261 in the Xen Project test lab) with a very
similar version, which passed, and will re-run that before pushing.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
---
v2: Do not increment count of migration tests so as to make
testid misleading.
Do the change to the start/stop test differently.
Ian Jackson [Mon, 30 Nov 2015 13:35:54 +0000 (13:35 +0000)]
cs-bisection-step: Limit size of revision log included in reports
There is a limit in cr-daily-branch, but none in cs-bisection-step.
adhoc-revtuple-generator could usefully have this built in but that's
not so simple, so do it again here. We already slurp the whole thing
into core so from a resource usage point of view we might as well do
the length check here too.
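A sketch of a length check applied after the whole log has been read into memory. The limit value and the truncation marker are hypothetical, not the ones used by cr-daily-branch or cs-bisection-step.

```python
def limit_revision_log(log_text, max_chars=100_000):
    """Return the log unchanged if short enough, else a truncated copy with a marker."""
    if len(log_text) <= max_chars:
        return log_text
    # The text is already in core, so slicing costs nothing extra to obtain.
    return log_text[:max_chars] + "\n[revision log truncated]\n"
```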
Reported-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Fix typo in message.
Ian Jackson [Fri, 27 Nov 2015 11:36:05 +0000 (11:36 +0000)]
README.email: Document `fail in 58948 REGR. vs. 63449'
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Change an example number to improve clarity.
Ian Jackson [Wed, 25 Nov 2015 11:38:39 +0000 (11:38 +0000)]
Executive HTML output: More varieties of grey fruit, fewer bananas
The use of yellow for `preparing' and `running' is particularly
inappropriate in the projection summary, but is also rather misleading
when showing cancelled flights.
Use shades of grey for the different levels of in-progress-ness.
Darker shades are `more running', which seems to align better with the
shades in `Scheduled' in the flight summary.
Yellow now definitely means something is `broken', or worse.
The two places where this needs to be changed are actually
meaningfully different: report_run_getinfo works on job statuses,
whereas sg-report-flight handles only steps, which have many fewer
statuses.
Here are some samples of the output, from the Citrix Cambridge
instance:
Ian Campbell [Mon, 30 Nov 2015 11:58:42 +0000 (11:58 +0000)]
ts-debian-di-install: Don't set runvars for netboot kernel+ramdisk as outputs
Currently these runvars are either URLs provided by the definition
(e.g. make-flight) or output controller-relative paths created by the
execution (in the case where they aren't from the definition).
This weird dual-semantics is confusing and wrong, and in particular is
broken if the test step is rerun (e.g. in standalone mode).
In the case where they are outputs these paths are informational
only. The information is already available in the full logs, so
dropping the runvars here merely removes it from the summary table.
It is not useful enough there for this to be an issue.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:46 +0000 (17:13 +0000)]
make-flight: move in-lined branch vs arch filtering into callbacks
No change to the output of standalone-generate-dump-flight-runvars.
The inlined xenbranch vs arch filters remain where they are since they
are common (in that they reflect the addition and removal of arches)
and apply equally to all make-*-flight.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:45 +0000 (17:13 +0000)]
make-flight: consolidate branch_filter_callback for builds and tests
Currently we have a test_matrix_branch_filter_callback which filters
jobs based on the $xenarch and $branch and a separate more adhoc
filter inline for the build jobs.
This has led to things getting out of sync in the past (e.g. recently
we dropped armhf tests from the linux-3.10 and -3.14 branches but not
the build jobs).
Add a new build_matrix_branch_filter_callback and for make-flight
cause this and test_matrix_branch_filter_callback to use a common
helper.
The adhoc filtering in the build loop remains and will be tidied up
next.
For make-distros-flight just add a nop build filter alongside the test
filter.
No change to the output of standalone-generate-dump-flight-runvars.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
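The idea of having the build and test filters share one helper can be sketched as follows. This is a Python illustration only (make-flight itself is not written in Python), and every name in it is hypothetical. The exclusion table reflects only the armhf example given in the message.

```python
# Hypothetical table: architectures that get no jobs on a given branch.
EXCLUDED_ARCHES = {
    "linux-3.10": {"armhf"},
    "linux-3.14": {"armhf"},
}


def branch_wants_arch(branch: str, arch: str) -> bool:
    """Common helper: True if jobs for arch should exist on branch."""
    return arch not in EXCLUDED_ARCHES.get(branch, set())


def filter_build_job(branch: str, arch: str) -> bool:
    """Build-job filter callback; defers to the common helper."""
    return branch_wants_arch(branch, arch)


def filter_test_job(branch: str, arch: str) -> bool:
    """Test-job filter callback; defers to the same helper."""
    return branch_wants_arch(branch, arch)
```

Because both callbacks consult the same table, dropping an architecture from a branch cannot affect the tests without also affecting the builds.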
Ian Campbell [Tue, 17 Nov 2015 17:13:44 +0000 (17:13 +0000)]
make-*flight: Reorder.
I just got tripped up again by putting a build job filter definition
after the call to create_build_jobs. Reorder the make-*flight scripts
to reduce the probability of me doing so any more times.
The general order of these scripts is now:
- job filter callbacks
- test job creation
- top-level code which drives the process.
No change to the output of standalone-generate-dump-flight-runvars.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 17:13:43 +0000 (17:13 +0000)]
mfi-common: Drop armhf build jobs on older linux-* branches
test_matrix_branch_filter_callback was recently updated to exclude
armhf testing of linux-3.10 (54f237784d4b) and 3.14 (b0c5663a03e7).
However this excludes only test jobs and not build jobs, so we were
still trying to build the arm kernel there.
This changes the build job filtering to be in sync with the test job
filtering. The net result is to remove build-armhf* from linux-3.10
and linux-3.14.
A subsequent patch will try and combine those two filters into one to
prevent this skew happening again.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Tue, 24 Nov 2015 15:39:00 +0000 (15:39 +0000)]
Debian: Move sysvinit-core install from preseed_create to preseed_base
preseed_create is used only for physical host installs, but we want
this workaround to apply to at least ts-debian-hvm-install'd guests
which are going to be used for nested testing.
I can't see any harm in doing this globally for all Debian HVM guests,
as well as PV guests installed using Debian Installer.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Mon, 23 Nov 2015 11:54:11 +0000 (11:54 +0000)]
Set {ident}_suite runvar when installing a Debian guest.
Currently those places which want this open code a lookup of the
{ident}_suite runvar with a fallback to the configuration file.
However selecthost was missing such a lookup in the case where it is
constructing a nested L1 host (which begins from the selectguest
template), which led to ts-xen-install on Jessie missing the
installation of libnl-route-3-200.
Fix this by providing debian_guest_suite($gho) which as well as
initialising $gho->{Suite} stores an {ident}_suite runvar (taking care
to handle the case where one is already set by e.g. make-flight). For
convenience debian_guest_suite() also returns the suite name.
ts-debian-install, ts-debian-di-install and ts-debian-hvm-install now
use debian_guest_suite instead of open coding the lookup.
The final piece of the puzzle is to have selectguest() pick up the
{ident}_suite runvar (if it is set) and initialise $gho->{Suite} from
it.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
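The lookup-with-fallback behaviour described for debian_guest_suite can be sketched like this. It is a Python model, not the Perl implementation: the dictionaries standing in for the guest object, the runvars and the configuration, and the "Ident" and "DefaultSuite" keys, are assumptions made for the example.

```python
def debian_guest_suite(guest: dict, runvars: dict, config: dict) -> str:
    """Set guest["Suite"], record the {ident}_suite runvar, return the suite.

    Hypothetical model: guest, runvars and config are plain dicts.
    """
    key = f"{guest['Ident']}_suite"
    suite = runvars.get(key)
    if suite is None:
        # No runvar yet: fall back to the configuration file value
        # and store it so that later steps see the same suite.
        suite = config["DefaultSuite"]
        runvars[key] = suite
    # A runvar already set (e.g. by make-flight) is left untouched.
    guest["Suite"] = suite
    return suite
```

Storing the runvar is what lets a later lookup, such as the one selectguest performs, find the suite without repeating the fallback logic.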
Ian Jackson [Mon, 23 Nov 2015 15:53:53 +0000 (15:53 +0000)]
sg-run-job: Make nested-layer-descend insist on $ok
per-host-ts does nothing if !$ok, so per-host-prep's individual steps
become no-ops if the host prep fails, which prevents blundering on.
But per-host-prep does not throw.
The other call site explicitly avoids calling the recipe script if
!$ok. nested-layer-descend is for calling from within a recipe, so we
need to throw an exception to abort the script, if !$ok.
Reported-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Campbell [Tue, 17 Nov 2015 10:58:39 +0000 (10:58 +0000)]
make-flight: Stop testing ARM for linux-3.14 flights
Since d4dba6183d61 "ts-kernel-build: Include dtbs in dist file" we
now require a "make dtbs_install" target, which was only added after
3.14.
None of the ARM h/w in the XenProject test colo can run with a kernel
this old. We do have a system in the Citrix instance in Cambridge
which does but the complexity vs benefit of testing this doesn't
warrant continuing to test 3.14.
Our current baseline for other ARM testing is a 3.16 based
linux-arm-xen branch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Campbell [Wed, 14 Oct 2015 09:45:36 +0000 (10:45 +0100)]
standalone: only rotate logs if savelog is available
`savelog' comes from the `debianutils' package and so is unlikely to
be available elsewhere. Revert to the old behaviour of clobbering the
logs in this case.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Tested-by: Robert Ho <robert.hu@intel.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
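The "rotate only if savelog is available" decision can be sketched as below. This is a Python illustration, not the standalone script itself. The function name and the bare "savelog FILE" invocation are assumptions, and the which parameter exists only so the check can be exercised without the tool installed.

```python
import shutil


def log_rotation_command(logfile: str, which=shutil.which):
    """Return the command to rotate logfile, or None if savelog is absent.

    None means the caller should fall back to simply overwriting the log.
    """
    if which("savelog") is not None:
        # Assumed invocation; the real script may pass further options.
        return ["savelog", logfile]
    return None
```

A caller would run the returned command before opening the log for writing, and otherwise just truncate the file as before.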
Ian Jackson [Mon, 16 Nov 2015 15:21:12 +0000 (15:21 +0000)]
logm: Introduce $logm_prefix, and annotate nested guest messages
This allows code elsewhere to annotate log messages which might
otherwise be confusing. The variable should be localised, and the
value should always start with a space, if not empty.
Use this to annotate the calls to selecthost and selectguest from
within selecthost-for-an-L1. Otherwise some of the log messages can
be very confusing.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
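The two rules stated for $logm_prefix (localise it, and start any non-empty value with a space) can be modelled as follows. This is a Python sketch of the idea rather than the Perl code: the Logger class, the output format and the prefix text are hypothetical, and a context manager stands in for Perl's local.

```python
import contextlib


class Logger:
    """Toy logger with a prefix that can be set for a limited scope."""

    def __init__(self):
        self.prefix = ""
        self.lines = []  # captured messages, for demonstration only

    def logm(self, message: str) -> None:
        # The prefix is appended directly, hence the leading-space rule.
        self.lines.append(f"log{self.prefix}: {message}")

    @contextlib.contextmanager
    def localised_prefix(self, prefix: str):
        if prefix and not prefix.startswith(" "):
            raise ValueError("non-empty prefix must start with a space")
        saved = self.prefix
        self.prefix = prefix
        try:
            yield
        finally:
            # Restore the old value even if the body raises.
            self.prefix = saved


log = Logger()
log.logm("selecting host")
with log.localised_prefix(" (nested L1)"):
    log.logm("selectguest")
log.logm("done")
```

Restoring the previous value on exit is the point of localising: messages logged after the nested call are no longer annotated.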
Ian Jackson [Mon, 16 Nov 2015 12:33:51 +0000 (12:33 +0000)]
Log capture pathnames: Include host name (and nested L0 name)
Rather than using $gho->{Name} (and, sometimes, separately
$ho->{Name}), use hostnamepath().
This means that the hostname is always included in the standard way.
The filename structure for xenctx and console snapshots changes
slightly.
In two cases this means that the hostname is now included where it
wasn't previously. That is helpful if due to some insanity the same
guest is present on more than one host.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 16 Nov 2015 12:26:11 +0000 (12:26 +0000)]
Nested hosts: Print L0 as well as L1 parent name in selectguest
Now that we have hostnamepath_list, we can replace the ad-hoc
expansion "$gn on $...{Name}" with a recipe which ascends through the
applicable nesting levels.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 17 Nov 2015 11:48:50 +0000 (11:48 +0000)]
ms-flights-summary: Improve the overview table
- Align the queued/preparing/running/etc. totals into table columns.
- Use <strong> to highlight `queued' jobs.
- Use <strong> to highlight the end time if it is the flight end,
not just a phase end.
- No longer print the `total unqueued' separately.
- Show the `common info'.
Replace much of the HTML generation with plain literal or
almost-literal HTML, since the HTML is complex to generate but easy to
write.
An example of the output can be seen here:
http://xenbits.xen.org/people/iwj/2015/flights-summary.html
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Mon, 16 Nov 2015 12:19:54 +0000 (12:19 +0000)]
Nested hosts: Use hostnamepath() in create_webfile
create_webfile needs a pathname in the shared public-html directory.
These paths need to be (a) stable (b) unique across all running jobs.
We achieve this by basing the filenames on the hostname and (for a
guest) the guest name.
But for an L2 guest we need to include the physical host name too,
because the L1 `host' is not unique.
Fix this by using hostnamepath(), replacing the open-coded single
iteration.
Reported-by: Ian Campbell <ian.campbell@citrix.com> CC: Robert Ho <robert.hu@intel.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
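Why ascending through every nesting level makes the filename unique can be shown with a small model. This is a Python sketch, not the Perl functions of the same names: representing hosts as dicts, the "Host" key for the parent, and the "_" separator are all assumptions made for the example.

```python
def hostnamepath_list(host: dict) -> list:
    """Return the names from the outermost (physical) host down to host."""
    names = []
    while host is not None:
        names.append(host["Name"])
        # Assumed: a guest or nested host records its parent under "Host".
        host = host.get("Host")
    names.reverse()
    return names


def hostnamepath(host: dict, sep: str = "_") -> str:
    """Join the full ancestry into one string usable in a filename."""
    return sep.join(hostnamepath_list(host))


l0 = {"Name": "phys0"}
l1 = {"Name": "l1", "Host": l0}
l2 = {"Name": "guest", "Host": l1}
path = hostnamepath(l2)
```

With only one level of ascent, two L2 guests called "guest" on L1 hosts that are both called "l1" would collide. Including the physical host name separates them.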