Ian Jackson [Thu, 3 Sep 2020 10:58:30 +0000 (11:58 +0100)]
host reuse: New protocol between sg-run-job and ts-host-reuse
Abolish post-test-ok (which runs only if successful) and replace it
with final (which sets the runvar to indicate finality, and runs
regardless).
This allows a subsequent job which reuses the host to see that this
job had finished using the host. This is relevant for builds, where a
host can be reused even after a failed job.
"Lies", where we claim the use of the host was done, are
avoided (barring unlikely races) because selecthost de-finalises the
runvar.
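The difference between the old post-test-ok hook and the new final hook can be sketched roughly as below. This is only an illustration in Python, not the sg-run-job code; the runvar name `host_reuse_state`, its values, and the function `run_job` are hypothetical.

```python
def run_job(steps, runvars):
    """Run steps in order; always mark the host use as final afterwards."""
    # Stand-in for selecthost de-finalising the runvar when a job
    # starts using the host (hypothetical runvar name and values).
    runvars["host_reuse_state"] = "in-use"
    ok = True
    try:
        for step in steps:
            if not step():
                ok = False
                break
    finally:
        # "final" runs regardless of success; the old post-test-ok
        # would only have run when ok was still True.
        runvars["host_reuse_state"] = "final"
    return ok
```

A later job reusing the host would then see the "final" value even when the earlier job failed.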
Ian Jackson [Fri, 28 Aug 2020 13:38:17 +0000 (14:38 +0100)]
resource reporting: Report host reuse/sharing in job report
Compatibility: in principle this might generate erroneous reports
which omit sharing/reuse information for allocations made by jobs
using older versions of osstest.
However, we do not share or reuse hosts across different osstest
versions, so this cannot occur.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 7 Oct 2020 16:36:50 +0000 (17:36 +0100)]
host lifecycle: Fix detection of concurrent jobs
The previous algorithm was wrong here.
This commit was originally considerably later than the previous one.
I'm avoiding squashing this commit, to make future archaeology easier.
The effect of the bug is to report other tasks as live too often, so
hosts show up as shared rather than reused.
Ian Jackson [Tue, 25 Aug 2020 17:34:42 +0000 (18:34 +0100)]
host lifecycle: Machinery, db, for tracking relevant events
When we reuse test hosts, we want to be able to give a list of the
other jobs which might be responsible for any problem.
In principle it would be possible to do this by digging into the
db's history tables like sg-report-host-history does, but this is
quite slow and also I don't have enough confidence in that approach to
use it for this application.
So instead, track the host lifecycle explicitly.
The approach taken is a hybrid one. I first considered two and a half
approaches:
1. Permanently record all host/share allocations and share state
changes in a host history table. But it is nontrivial to update
all the allocation machinery to keep this table up to date. It is
also nontrivial to extract the necessary information from such a
table: the allocation information would have to be correlated,
using timestamps, with the steps table. That's slow and complex.
We had such a table but it was never used for these reasons;
I dropped that empty table recently.
1b. Like 1 but explicitly put a lifecycle sequence number in the
allocations table. This would make it easy to find relevant
events but would involve even more complicated logic during
allocation.
2. Record the host's lifecycle information in a file on the host.
This means it gets wiped whenever the host does and makes finding
the relevant jobs easy: read the file during logs capture, and
we'll find everything of relevance. It then has to be permanently
stored somewhere it can be used for logging and archaeology: a
per-job runvar giving the relevant host history, up to the point
where that job finished, does that job nicely. However, this
has a serious problem: if the host crashes hard, we may not be
able to recover the complete information about why! We really
want the information recorded outside the host in question.
So I've taken a hybrid approach: effectively replicate the per-host
file from (2), but put the information in the database. This
necessitates a call to clear the host lifecycle history, which we make
at the *end* of the host install. As a bonus this might let us more
easily identify if there are particular jobs that leave hosts in
states that are hard to recover from, and it will make total host
failure quite obvious because the host install log report will have a
list of the failed attempts (longer in each successive job).
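The effect of clearing the history only at the end of the install can be illustrated with a small in-memory model. This is a sketch under assumptions, not the real schema or recording function: the class `HostLifecycle`, the helper `install_host`, and the event name are all hypothetical.

```python
class HostLifecycle:
    """In-memory stand-in for the host lifecycle table (hypothetical)."""

    def __init__(self):
        self._events = {}          # host -> list of (job, event) tuples

    def record(self, host, job, event):
        self._events.setdefault(host, []).append((job, event))

    def clear(self, host):
        # Called at the *end* of a successful host install.
        self._events[host] = []

    def history(self, host):
        return list(self._events.get(host, []))


def install_host(lifecycle, host, job, do_install):
    """Record the attempt; clear the history only if the install works."""
    lifecycle.record(host, job, "install-attempt")
    if not do_install():
        return False               # history kept: the next job sees this failure
    lifecycle.clear(host)
    return True
```

Because a failed install never reaches the clear, each successive failing job sees a longer list of earlier attempts.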
For build jobs we only record the setup job, and concurrent jobs, in
the runvar. This does not seem to have been a problem so far, and
this avoids having to do work on other allocations (eg, mg-allocate).
It also avoids having very long lists of previous builds listed in
every build job.
Test jobs are only shared within a flight and with much more limited
scope so the same considerations don't arise. But by the same token,
we also do not need to adjust mg-allocate etc., since the user ought
not to allocate shares of test hosts unless they know what they are
doing.
In this commit we introduce:
* The database table
* The runvar syntax
* The function for recording the lifecycle events
We have what amounts to an ad-hoc compression scheme for the
information in the lifecycle runvars. Otherwise this data might get
quite voluminous, which can make various other db queries slow.
There isn't a very good way to represent out-of-job tasks in the
lifecycle runvar. We could maybe put in something from the tasks
table, but the entry in the tasks table might be gone by now and that
would involve quoting (and it might be quite large).
But this will only matter when a shared/reused host has been manually
messed with, and recording the task is sufficient to
(1) note the fact of such interference
(2) if the task is static, or still going when the job reports,
can actually be put in the report.
(3) failing that provide something which could be grepped for in logs
We do not call the recording function yet, so the db update is merely
preparatory.
There is a bug in this patch: the calculation of $olive is wrong.
This will be fixed in a moment.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 21 Aug 2020 14:22:19 +0000 (15:22 +0100)]
test host reuse: Switch to principled sharing scope runvar scheme
* When selecthost is passed an @host ident, indicating prep work,
engage restricted runvar access. If no call to sharing_for_build
was made, this means it can access only the runvars in
the default value of @accessible_runvar_pats.
* Make the sharetype for host reuse be based on the values of
precisely those same runvars, rather than using an adhoc scheme.
The set of covered runvars is bigger now as a result of testing...
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
The syslog server, and its port, is used for things that happen in
this job, but the syslog server is torn down and a new one started,
when the host is reused.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 21 Aug 2020 11:47:02 +0000 (12:47 +0100)]
runvar access: Provide runvar_glob
We will need this because when runvar access is restricted, accessing
via %r directly won't work. We want to see what patterns the code is
interested in (so that interest in a nonexistent runvar is properly
tracked).
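The idea of going through an accessor that remembers which patterns were asked for can be sketched like this. It is an illustration in Python only; the class `RunvarAccess`, its `glob` method and the `patterns_of_interest` attribute are hypothetical and do not reflect the real runvar_glob signature.

```python
import fnmatch


class RunvarAccess:
    """Toy runvar store that records every pattern the caller asks about."""

    def __init__(self, runvars):
        self._runvars = dict(runvars)
        self.patterns_of_interest = set()

    def glob(self, *patterns):
        # Record interest first, so a pattern that matches nothing
        # (a nonexistent runvar) is still tracked.
        self.patterns_of_interest.update(patterns)
        return {
            name: value
            for name, value in self._runvars.items()
            if any(fnmatch.fnmatchcase(name, pat) for pat in patterns)
        }
```

Reading the underlying dictionary directly would return the same values but leave no trace of what the code was interested in.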
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 21 Aug 2020 16:40:07 +0000 (17:40 +0100)]
shared/reuse: Rely on @ for ts-host-ping-check
Remove the check for SharedReady.
The existence of this check is perplexing. It was introduced in
ts-host-ping-check: Do not run if host is being reused
in 8f1dc3f7c401 (from 2015).
At that time we only shared build hosts, and build hosts never ran this
script. So I don't understand what that was hoping to achieve. Maybe
it made some difference in a now-lost pre-rebase situation.
Anyway, in our current tree I think we want to rerun the
ts-host-ping-check when we reuse a test host. My change to add @ to
parts of per-host-prep in sg-run-job deliberately omitted the step
with testid host-ping-check-xen/@.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Tue, 21 May 2019 16:37:42 +0000 (17:37 +0100)]
shared/reuse: Use @ for freebsd host prep
These are all the relevant call sites for ts-freebsd-host-install and
ts-freebsd-build-prep. (There's a ts-freebsd-host-install in
ts-memdisk-try-append but that's for host examination and does not
use or want sharing or reuse.)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 22 May 2019 16:42:16 +0000 (17:42 +0100)]
sg-run-job: Detect improper use of @ iffail with run-ts
Only per-host-ts understands this. This is a bit of a bear trap, so
arrange to bail rather than putting strange step status values with
`@' at the front in the database...
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Tue, 21 May 2019 16:06:50 +0000 (17:06 +0100)]
ts-host-reuse: New script, to do reuse state changes
This will be made part of the test job recipes.
We calculate the sharing scope (sharetype) by reference to a lot of
runvars, etc.
This version of the script is rather far from the finished working
one, but it seems better to preserve the actual history for how it got
the way it is.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 8 Nov 2017 16:29:07 +0000 (16:29 +0000)]
plan_search: Track last sharing state to determine $share_reuse
What matters for the purpose of $share_reuse is not whether the host
is actually being _shared_ (ie, there are other concurrent allocations
and therefore a concurrent Event with Share information). What we
really want to know is whether the *last* use of this host was a
suitable sharing setup - because we actually want to know if we will
be able to skip our setup.
So track that explicitly. (The slightly odd structure, where there
are two loops in one, means that we reset $last_eshare when we go onto
the next $req ie the next host to check.)
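A rough Python sketch of the "track the last sharing state per host" idea follows. The data shapes are assumptions made for illustration: each host maps to a time-ordered list of event dicts, with an optional "share" entry holding a "type". The function name is hypothetical too.

```python
def share_reuse_by_host(events_by_host, wanted_sharetype):
    """For each host, say whether its *last* event was a suitable share."""
    result = {}
    for host, events in events_by_host.items():
        last_eshare = None            # reset when moving on to the next host
        for ev in events:             # events assumed to be in time order
            last_eshare = ev.get("share")
        result[host] = (
            last_eshare is not None
            and last_eshare["type"] == wanted_sharetype
        )
    return result
```

Note that a host counts as reusable here even when nothing else is currently allocated on it; only the most recent use matters.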
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 27 Oct 2017 16:52:49 +0000 (17:52 +0100)]
host allocation: selecthost: allow sort-of-selection of prospective hosts
If one passes a trueish value for $prospective, selecthost does not
worry about whether any host has actually been selected. It does a
limited amount of prep work.
This will be useful if we want to know some of the non-host-specific
information selecthost computes - in particular, $ho->{Suite} etc.
No functional change with existing callers.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 27 Oct 2017 14:42:39 +0000 (15:42 +0100)]
host allocation: *_shared_mark_ready: Make $sharetype check optional
We are going to want to be able to set shares to other than ready,
without double-checking the sharetype.
The change to the UPDATE statement makes no difference because
resource_check_allocated_core has just got that sharetype out of the
db. (This does remove one safety check against bugs, sadly.)
No functional change for existing callers.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Fri, 3 Nov 2017 17:40:42 +0000 (17:40 +0000)]
ts-hosts-allocate-Executive: Fix handling of failed preps for same sharing
This code was previously unreachable. It ought to be executed when
all the shares are allocatable or prep: in that case, we can unshare
and re-share the host.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Mon, 24 Aug 2020 17:54:18 +0000 (18:54 +0100)]
Debian: osstest-erase-other-disks: Slightly guard against races
Apparently it can happen that something decides to rescan a partition
table, removing a partition block device, while it is being zeroed:
osstest-erase-other-disks-6081: hd devices present after: /dev/hd*
osstest-erase-other-disks-6081: Erasing /dev/sda
osstest-erase-other-disks-6081: Erasing /dev/sda1
osstest-erase-other-disks-6081: /dev/sda1 is no longer a block device!
To try to narrow the window during which this race occurs, do not care
if the thing we just zeroed no longer exists after we zeroed it.
We still bomb out if it exists but is not a block device - that would
probably mean we had written it out as a file.
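The post-zeroing check described above amounts to something like the following. This is a Python sketch rather than the actual script, and the function name and return values are made up for illustration.

```python
import os
import stat


def check_after_zeroing(path):
    """Classify a device path after it has been zeroed.

    Returns "gone" if the path has vanished (tolerated: a partition
    table rescan may have removed it) and "blockdev" if it is still a
    block device.  Raises if it exists but is something else.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "gone"
    if not stat.S_ISBLK(st.st_mode):
        # Probably means we wrote the zeroes out as an ordinary file.
        raise RuntimeError(f"{path} exists but is not a block device")
    return "blockdev"
```

This only narrows the window: the device can still disappear between the check before zeroing and the write itself.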
This is all quite unfortunate.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Tue, 25 Aug 2020 11:02:13 +0000 (12:02 +0100)]
abolish "kernkind"; desupport non-pvops kernels
This was for distinguishing the old-style Xenolinux kernels from pvops
kernels.
We have not actually tested any non-pvops kernels for a very very long
time. Delete this now because the runvar is slightly in the way of
test host reuse.
(Sorry for the wide CC but it seems better to make sure anyone who
might object can do so.)
All this machinery exists just to configure the guest console
device (Xenolinux used "xvc" rather than "hvc") and the guest root
block device (Xenolinux stole "hda"/"sda" rather than using "xvda").
Specifically, in this commit:
* In what is now target_setup_rootdev_console_inittab, do not
look at any kernkind runvar and simply do what we would if
it were "pvops" or unset, as it is in all current jobs.
* Remove the runvar from all jobs creation and example runes.
(This has no functional change even for jobs running with
the previous osstest code because we have defaulted to "pvops"
for a very long time.)
We retain the setting of the shell variable "kernbuild", because that
ends up in build jobs' names. All our kernel build jobs now end in
-pvops and I intend to retain that name component since abolishing it
is nontrivial.
We move this earlier. This is OK because it depends only on the
console runvar (inside the sub; this is set by target_kernkind_check),
$ho and $gho (which are set by this point); and $mountpoint (which is
set by access()).
No functional change.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Mon, 30 Oct 2017 11:36:16 +0000 (11:36 +0000)]
show_abs_time: Represent undef $timet as <undef>
This can happen, for example, if a badly broken flight has steps which
are STARTING and have NULL in the start time column, and is then
reported using sg-report-flight.
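In Python terms the behaviour is roughly as follows. The UTC timestamp format used for the defined case is an assumption for the sake of the example, not taken from the real show_abs_time.

```python
import time


def show_abs_time(timet):
    """Format a Unix timestamp, tolerating a missing (NULL) value."""
    if timet is None:
        return "<undef>"
    # Output format is assumed for illustration.
    return time.strftime("%Y-%m-%d %H:%M:%S Z", time.gmtime(timet))
```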
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Thu, 1 Oct 2020 14:17:44 +0000 (15:17 +0100)]
Tolerate lack of platform-specific hosts in old Xen branches
Right now we have a situation where these can't all be made to work
because some older Xen branches are hard to make work on
current Debian stable, and we have some hardware (which we have tagged
as specific "platforms") which doesn't work with oldstable.
This seems like a general problem, so fix it this way.
Note that we still treat these failed allocations as failures, so they
are subject to regression analysis and ought not to appear willy-nilly
on existing branches.
Runvar dump shows the addition of this runvar
hostalloc_missing_expected=1
to
qemu-upstream-4.6-testing
xen-4.6-testing
...
qemu-upstream-4.14-testing
xen-4.14-testing
inclusive.
Set MF_SIMULATE_PLATFORMS to a suitable value if it is
not *set*. (Distinguishing unset from set to empty.)
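The unset versus set-to-empty distinction looks like this when written in Python (in shell the usual idiom is `${VAR+set}`). The helper name and the default value in the usage below are placeholders, not the value actually used.

```python
def effective_setting(environ, name, default):
    """Return environ[name] if the variable is set at all, else default."""
    # "name in environ" is true even when the value is the empty
    # string, so an explicitly empty setting is respected.
    if name in environ:
        return environ[name]
    return default
```

So `effective_setting({"MF_SIMULATE_PLATFORMS": ""}, "MF_SIMULATE_PLATFORMS", "some-default")` yields the empty string, while an environment without the variable yields the default.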
I have verified that this, plus the preceding commits to
cri-getplatforms, produces no change in the output of
MF_SIMULATE_PLATFORMS='' OSSTEST_CONFIG=standalone-config-example eatmydata ./standalone-generate-dump-flight-runvars
Without the MF_SIMULATE_PLATFORMS setting it adds several new jobs to
each flight, named things like this:
test-amd64-$arch1-xl-simplat-$arch2-$suite
The purpose of this right now is to provide a way to dry-run test the
next change.
Ian Jackson [Thu, 1 Oct 2020 15:36:17 +0000 (16:36 +0100)]
cri-getplatforms: Honour new MF_SIMULATE_PLATFORMS env var
This is to be expanded by the shell, using eval, so that it can refer
to $xenarch, $suite and $blessing.
No functional change if this variable is unset, or empty. If it is
set to a single space, cri-getplatforms produces no output (as it does
anyway in standalone mode).
Ian Jackson [Thu, 1 Oct 2020 14:18:39 +0000 (15:18 +0100)]
ts-hosts-allocate-Executive: Allow to tolerate missing resources
Now, a job can specify that lack of a suitable host should be treated
as a plain test failure (ie, subject to the usual regression analysis)
rather than as an infrastructure or configuration problem.
This will be useful for some tests which don't work in some branches
because of lack of suitable hardware. We want to avoid encoding our
hardware availability situation in make-flight.
Ian Jackson [Thu, 1 Oct 2020 16:02:48 +0000 (17:02 +0100)]
sg-run-job: Preserve step state "fail" if set by test script
If the test script exits nonzero but after setting the step status to
'fail', we can leave it that way. This is particularly relevant if
the iffail in the job spec says 'broken' or something. After this
change, a step can decide to override that.
An alternative would be to have the step script exit zero, but of
course that would (generally) leave the job to continue running more
steps!
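The resulting rule for the final step status can be written down as a small decision function. This is a Python sketch, not the Tcl in sg-run-job; the function name is hypothetical, and treating a zero exit as "pass" is an assumption made to keep the example complete.

```python
def final_step_status(exit_code, status_set_by_script, iffail):
    """Decide the step status once the test script has exited."""
    if exit_code == 0:
        return "pass"                 # assumed behaviour on success
    if status_set_by_script == "fail":
        # The script already declared a plain failure: keep it,
        # even if the job spec's iffail says e.g. "broken".
        return "fail"
    return iffail
```

For example, `final_step_status(1, "fail", "broken")` gives "fail", whereas a script that exits nonzero without setting the status gets the iffail value.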
Ian Jackson [Thu, 24 Sep 2020 16:14:25 +0000 (16:14 +0000)]
TftiDiVersion: Update to latest installer for stretch
The stretch (Debian oldstable) kernel has been updated, causing our
Xen 4.10 tests (which are still using stretch) to break. This update
seems to fix it.
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <iwj@xenproject.org>
Ian Jackson [Wed, 19 Aug 2020 14:59:09 +0000 (15:59 +0100)]
schema: Provide index on flights by start time
We often use flight number as a proxy for ordering, but this is not
always appropriate and not always done (and sometimes it's a bit of a
bodge).
Provide an index to find flights by start time. This significantly
speeds up the host allocation $equivstatusq query, and the duration
estimator.
(I have tested this by creating a trial index in the production
database. That index can be dropped again, preferably after this
commit makes it to production.)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 19 Aug 2020 12:00:58 +0000 (13:00 +0100)]
host allocation: Memoise duration estimates
We look at our own branch to estimate durations. If somehow we are
one of multiple concurrent flights on this branch with the appropriate
blessing, we don't mind not noticing the doings of our peer flights, so
it does not matter if our estimates are a bit out of date.
So it is fine to use an estimate no older than our own runtime.
Right now we generate a new duration estimator during each queueing
round, because it contains a statement handle and we must disconnect
from the db while waiting. So the internal memo table gets thrown
away each time and is useless.
To actually memoise, pass our own hash which lives as long as we do.
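The pattern of handing the estimator a caller-owned memo table, so that the cache survives the estimator being rebuilt, can be sketched in Python as below. The names `make_duration_estimator` and `query` are hypothetical; `query` stands in for the database lookup.

```python
def make_duration_estimator(query, memo=None):
    """Build an estimator; results are cached in the caller's memo dict."""
    if memo is None:
        memo = {}                 # private table: lost with this estimator

    def estimate(job):
        if job not in memo:
            memo[job] = query(job)
        return memo[job]

    return estimate
```

If the caller keeps one dict for its whole lifetime and passes it to every newly built estimator, the second and later queueing rounds reuse the earlier answers instead of querying again.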
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 19 Aug 2020 11:13:23 +0000 (12:13 +0100)]
resource allocation: Provide OSSTEST_ALLOC_FAKE_PLAN test facility
Set this variable (to a data-plan.final.pl, say) and it becomes
possible to test host allocation programs without actually allocating
anything and without engaging with the queue system.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 19 Aug 2020 12:05:22 +0000 (13:05 +0100)]
ts-hosts-allocate-Executive: Fix broken call to $duration_estimator
The debug subref is passed to the constructor (and indeed we do that).
The final argument to the actual estimator is $uptoincl_testid (but we
didn't say $will_uptoincl_testid, so it is ignored).
The code was wrong, but with no effect. So no functional change.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>