Ian Jackson [Wed, 10 Feb 2021 16:41:12 +0000 (16:41 +0000)]
Disable updates for snapshot.debian.org
security updates are a separate apt source.
The point of using snapshot is to avoid pulling in uncontrolled
updates, so we need to disable security updates.
The non-security SUITE-updates are disabled by this too. But
everything is on fire and I don't want another iteration while I
figure out the proper syntax for disabling only the security updates.
Ian Jackson [Tue, 9 Feb 2021 13:03:32 +0000 (13:03 +0000)]
mg-debian-installer-update: Honour redirect for dtbs
When using snapshots, we can get a redirect and then we don't
recurse. There doesn't seem to be a suitable option for wget, so do
this by hand before we call wget -m.
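The by-hand redirect resolution can be sketched as a small loop (a sketch only; `head` here is a stand-in for issuing a real HTTP HEAD request, and the actual script drives wget):

```python
def resolve_redirects(url, head, limit=10):
    """Follow HTTP redirects by hand and return the final URL.

    head(url) -> (status, location_or_None); in real use it would
    issue a HEAD request.  We resolve the redirect ourselves before
    invoking `wget -m`, which does not recurse through redirects.
    """
    for _ in range(limit):
        status, location = head(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url
        url = location
    raise RuntimeError("too many redirects")
```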
Ian Jackson [Fri, 22 Jan 2021 15:11:01 +0000 (15:11 +0000)]
make-flight: Stripy xenstored
Previously, we let the Xen build system and startup scripts choose
which xenstored to use. Before we upgraded to Debian buster, that
gave us C xenstored tests on ARM. Since then, armhf and arm64 have
both had enough ocaml support and we haven't been testing C xenstored
at all!
Change this, by selecting between C xenstored and Ocaml xenstored
"at random". Actually, this is based on the job name. So the same
job in different branches will use the same xenstored - which helps
avoid confusion.
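The job-name-based selection might look roughly like this (a sketch; the hash and value names osstest actually uses may differ):

```python
import hashlib

def choose_xenstored(job_name):
    # Deterministic "coin flip" from the job name: the same job in
    # different branches always picks the same xenstored, which helps
    # avoid confusion when comparing results across branches.
    h = hashlib.sha256(job_name.encode()).digest()[0]
    return "oxenstored" if h & 1 else "xenstored"
```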
I have diffed the output of standalone-generate-dump-flight-runvars.
As expected, this adds a variable all_host_xenstored to every job.
To make sure we have enough diversity, I eyeballed the results. In
particular:
* The smoke tests now have 2 C and 2 Ocaml, one of each on
ARM and x86.
* XTF tests have 2 oxenstored and 3 C xenstored.
* The ovmf flight has one of each
* The seabios and libvirt flights look reasonably mixed.
Most other flights have enough jobs that I think things are diverse
enough without looking at them all in detail.
I think this lack of testing needs fixing for the Xen 4.15 release.
So after review I intend to push this to osstest pretest, and may
force push it even if it shows regressions.
CC: Edwin Török <edvin.torok@citrix.com>
CC: Andrew Cooper <Andrew.Cooper3@citrix.com>
CC: Jürgen Groß <jgross@suse.com>
CC: Wei Liu <wl@xen.org>
Signed-off-by: Ian Jackson <iwj@xenproject.org>
Release-Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Ian Jackson [Thu, 19 Nov 2020 16:55:48 +0000 (16:55 +0000)]
sg-report-flight: Actually look at retest flights (part 2)
To avoid going down ratholes (especially for jobs which reuse outputs
from their previous selves), the primary flight/job finder in
sg-report-flight does not recurse indefinitely through build jobs.
Instead, it restricts the build jobs investigated to those within the
same flight as the job which might be of interest.
As a result, retest jobs are, unfortunately, discarded at this stage
because we insist that the build jobs we find did use the tree
revision we are investigating.
Fix this by recursing into the corresponding primary flight too.
In the $flightsq->fetchrow loop that's $xflight.
For the primary flight, ie the first half of the UNION, that's just
the flight itself. So there this makes no change.
For the retest flights, it is the flight that all the build jobs refer
to. Despite the CROSS JOIN, this will be unique for any particular
"retest flight", because the query on the runvars insists that all of
f1's buildjob runvars (and there is at least one) point to f0. Ie, f1 has
no build jobs and refers to f0 for build jobs; so it can't refer to
any other f0' in the cross join.
With this change, a -retest flight can now actually be used to justify
a push.
Ian Jackson [Thu, 19 Nov 2020 16:24:32 +0000 (16:24 +0000)]
sg-report-flight: Actually look at retest flights (part 1)
The existing approach does not find retest flights. This is because
it starts by looking for flights whose runvars say they built the
version in question, and then looks to see if they contain the
relevant job.
Retest flights don't contain build jobs; they reuse the builds from
the primary flight.
Rather than making a fully general recursion scheme (which would
involve adding another index so we could quickly find all flights
which refer to this one), we add a one-level recursion.
This recursion is restricted to the flights named on the command line.
This means it takes nearly no time (as opposed to searching the whole
db for things that might be relevant - see above re need for an
index).
We filter the command line flights, looking for ones which refer only
to the primarily found flights for build jobs.
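In outline, the filtering amounts to something like this (a Python sketch of the logic, not the actual Perl/SQL; names are illustrative):

```python
def retest_flights(cmdline_flights, primary_flights, buildjob_refs):
    # buildjob_refs maps a flight to the set of flights its build-job
    # runvars refer to.  A retest flight has no build jobs of its own:
    # every build-job runvar points into a primarily-found flight.
    primary = set(primary_flights)
    return [f for f in cmdline_flights
            if buildjob_refs.get(f) and buildjob_refs[f] <= primary]
```

Because only the command-line flights are examined, this one-level recursion is cheap, with no need for a new db index.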
Ian Jackson [Fri, 13 Nov 2020 17:34:32 +0000 (17:34 +0000)]
cr-daily-branch: Sort out retest build reuse
Retest flights ought to reuse precisely the builds from the flight
which prompts the retests.
mg-adjust-flight-makexrefs is the wrong tool for this job. It can
often leave the retry flights with no build jobs and no references to
the main flights' build jobs, so the results are just broken jobs.
Ian Jackson [Fri, 23 Oct 2020 16:08:02 +0000 (17:08 +0100)]
host reuse fixes: Properly clear out old static tasks from history
The algorithm for clearing out old lifecycle entries was wrong: it
would delete all entries for non-live tasks.
In practice this would properly remove all the old entries for
non-static tasks, since owned tasks typically don't release things
until the task ends (and it becomes non-live). And it wouldn't remove
more than it should do unless some now-not-live task had an allocation
overlapping with us, which is not supposed to be possible if we are
doing a host wipe. But it would not remove static tasks ever, since
they are always live.
Change to a completely different algorithm:
* Check that only us (ie, $ttaskid) has (any shares of) this host
allocated. There's a function resource_check_allocated_core which
already does this and since we're conceptually part of Executive
it is proper for us to call it. This is just a sanity check.
* Delete all lifecycle entries predating the first entry made by
us. (We could just delete all entries other than ours, but in
theory maybe some future code could result in a situation where
someone else could have had another share briefly at some point.)
This removes old junk from the "Tasks that could have affected" in
reports.
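The deletion step can be sketched like this (entries in time order; a sketch, not osstest's actual code):

```python
def prune_lifecycle(entries, our_taskid):
    # entries: time-ordered (seq, taskid) pairs for this host.
    # Keep everything from our first entry onwards; delete all
    # lifecycle entries predating it, including old static tasks.
    for i, (_, taskid) in enumerate(entries):
        if taskid == our_taskid:
            return entries[i:]
    return entries
```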
Ian Jackson [Thu, 22 Oct 2020 14:38:12 +0000 (15:38 +0100)]
starvation: Do not count more than half a flight as starved
This seems like a sensible rule.
This also prevents the following bizarre behaviour: when a flight has
a handful of jobs that cannot be run at all (eg because it's a
commissioning flight for only hosts of a particular arch), those jobs
can complete quite quickly. Even with a high X value because only a
smallish portion of the flight has finished, this can lead to a modest
threshold value. This combines particularly badly with commissioning
flights, where the duration estimates are often nonsense.
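The rule itself is a one-liner; something like this (sketch, names illustrative):

```python
def capped_starved(n_starved, n_jobs_in_flight):
    # Never count more than half the flight's jobs as starved,
    # regardless of what the threshold computation says.
    return min(n_starved, n_jobs_in_flight // 2)
```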
Ian Jackson [Thu, 22 Oct 2020 14:02:18 +0000 (15:02 +0100)]
host reuse fixes: Do not break host-reuse if no host allocated
If host allocation failed, or our dependency jobs failed, then we
won't have allocated a host. The host runvar will not be set.
In this case, we want to do nothing.
Ian Jackson [Mon, 19 Oct 2020 11:41:27 +0000 (12:41 +0100)]
Prefix guest LV names with the job name
This means that a subsequent test which reuses the same host will not
use the same LVs. This is a good idea because reusing the same LV
names in a subsequent job means relying on the "ad hoc run" cleanup
code. This is a bad idea because that code is rarely tested.
And because, depending on the situation, the old LVs may even still be
in use. For example, in a pair test, the guest's LVs will still be
set up for use with nbd.
It seems better to fix this by using a fresh LV rather than adding
more teardown code.
The "wear limit" on host reuse is what prevents the disk filling up
with LVs from old guests.
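A hypothetical naming helper illustrating the prefixing (the separator and osstest's actual scheme may differ):

```python
def guest_lv_name(job, guest, suffix="disk"):
    # Prefixing with the job name means a later job reusing this host
    # allocates fresh LVs instead of relying on ad-hoc cleanup of old
    # ones (which may even still be in use, eg exported over nbd).
    return f"{job}--{guest}--{suffix}"
```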
ts-debian-fixup needs special handling, because Debian's xen-tools'
xen-create-image utility hardcodes its notion of LV name construction.
We need to rename the actual LVs (perhaps overwriting old ones from a
previous ad-hoc run) and also update the config.
Ian Jackson [Wed, 21 Oct 2020 17:38:51 +0000 (18:38 +0100)]
host reuse fixes: Fix runvar entry for adhoc tasks
When processing an item from the host lifecycle table into the runvar,
we don't want to do all the processing of flight and job. Instead, we
should simply put the ?<taskid> into the runvar.
Previously this would produce ?<taskid>: which the flight reporting
code would choke on.
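In outline (a sketch; the real runvar syntax in osstest may differ in detail):

```python
def lifecycle_item(taskid, flight=None, job=None):
    # In-flight uses are recorded via their flight and job; ad-hoc
    # uses have no flight/job and are recorded as just ?<taskid>.
    # The bug fixed here appended a trailing colon to the ?<taskid>
    # form, which the flight reporting code choked on.
    if flight is None:
        return f"?{taskid}"
    return f"{flight}.{job}"
```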
Ian Jackson [Wed, 21 Oct 2020 16:28:07 +0000 (17:28 +0100)]
host reuse fixes: Fix running of steps adhoc
When a ts script is run by hand, for adhoc testing, there is no
OSSTEST_TESTID variable in the environment and the script does not
know its own step number. Such adhoc runs are not tracked as steps
in the steps table.
For host lifecycle purposes, treat these as ad-hoc out-of-flight uses,
based only on the taskid (which will usually be a person's personal
static task).
Without this, these adhoc runs fail with a constraint violation when trying
to insert a flight/job/step row into the host lifecycle table: the
constraint requires the step to be specified but it is NULL.
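The detection can be sketched like this (illustrative only; the real code is Perl inside osstest):

```python
import os

def lifecycle_identity(environ=None):
    # Under sg-run-job, OSSTEST_TESTID identifies the current step.
    # Run by hand, the variable is absent: record an ad-hoc
    # out-of-flight use keyed only on the task id, so we never try
    # to insert a lifecycle row with a NULL step.
    env = os.environ if environ is None else environ
    testid = env.get("OSSTEST_TESTID")
    if testid is None:
        return {"kind": "adhoc"}
    return {"kind": "step", "testid": testid}
```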
Ian Jackson [Wed, 21 Oct 2020 15:14:17 +0000 (16:14 +0100)]
PDU/MSU: Retransmit on/off until PDU has changed
The main effect of this is that the transcript will actually show the
new PDU state. Previously we would call show(), but APC PDUs would
normally not change immediately, so the transcript would show the old
state.
This also guards against an unresponsive PDU or a packet getting lost.
I don't think we have ever seen that.
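The retransmit loop amounts to roughly the following (a sketch; `send_onoff` and `read_state` stand in for the real PDU protocol calls):

```python
import time

def pdu_set(port, want_on, send_onoff, read_state,
            attempts=5, delay=1.0):
    # Keep resending until the PDU actually reports the new state;
    # APC PDUs often lag, and this also covers a lost packet or an
    # unresponsive PDU (up to the attempt limit).
    for _ in range(attempts):
        send_onoff(port, want_on)
        time.sleep(delay)
        if read_state(port) == want_on:
            return True
    return False
```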
Ian Jackson [Wed, 21 Oct 2020 15:01:23 +0000 (16:01 +0100)]
share in jobdb: Move out-of-flight special case higher up
This avoids running the runvar computation loop outside flights.
This is good amongst other things because that loop prints warnings
about undef $flight and $job.
Ian Jackson [Fri, 16 Oct 2020 15:28:58 +0000 (16:28 +0100)]
known hosts handling: Ensure things are good for multi-host jobs
When a multi-host job reuses host(s) from earlier jobs, the set of
hosts set up in the on-host known_hosts files may be insufficient,
since the hosts we are using now may not have been in any of the
flight's runvars when the earlier job set them up.
So we need to update the known_hosts. We use the flight's current
set, which will include all of our hosts.
Ian Jackson [Fri, 16 Oct 2020 12:33:15 +0000 (13:33 +0100)]
host reuse: Reuse hosts only in same role (for now)
This is a workaround. There is a problem with host key setup in a
group of hosts, which means that when a pair test reuses a host set up
by a different test, we can get
Host key verification failed.
during the src-to-dst migration.
Ian Jackson [Thu, 8 Oct 2020 17:35:43 +0000 (18:35 +0100)]
cri-args-hostlists: Move flight_html_dir variable
This is only used in report_flight. We are going to want to call
report_flight from outside start_email, without having to set that
variable ourselves.
Put the call to this debugging/testing variable inside an eval. This
allows a wider variety of stunts. The one in-tree reference is
already compatible with this new semantics.
Ian Jackson [Thu, 15 Oct 2020 13:15:26 +0000 (14:15 +0100)]
sg-report-flight: Consider all blessings for "never pass"
$anypassq is used for the "never pass" check; the distinction between
this and simply "fail" is cosmetic (although it can be informative).
On non-"real" flights, it can easily happen that the flight never
passed *on this branch with this blessing* but has passed on real. So
the steps subquery does not find us an answer within reasonable time.
Work around this by always searching for "real". This keeps the
performance within acceptable bounds even during ad-hoc testing.
We don't actually use the row from this query, so the only effect is
that when the job passed in a "real" flight, we go on to the full
regression analysis rather than short-circuiting and reporting "never
pass".
Ian Jackson [Mon, 5 Oct 2020 17:47:12 +0000 (18:47 +0100)]
sg-report-flight: Sharing reports: more task finished info
Other steps from jobs affecting this host either started after we
were running, and therefore didn't affect the stuff we're reporting,
or are already in the db. Furthermore, any such effects for steps
which have finished must have completed by the max finished time. But
if there are unfinished steps, we don't know the finish time.
Ian Jackson [Thu, 3 Sep 2020 10:58:30 +0000 (11:58 +0100)]
host reuse: New protocol between sg-run-job and ts-host-reuse
Abolish post-test-ok (which runs only if successful) and replace it
with final (which sets the runvar to indicate finality, and runs
regardless).
This allows a subsequent job which reuses the host to see that this
job had finished using the host. This is relevant for builds, where a
host can be reused even after a failed job.
"Lies", where we claim the use of the host was done, are
avoided (barring unlikely races) because selecthost de-finalises the
runvar.
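The protocol can be sketched with a pair of helpers (the runvar name here is made up for illustration):

```python
FINAL_RUNVAR = "host_lifecycle_final"  # illustrative name

def mark_final(runvars):
    # Run by sg-run-job whether or not the job succeeded, replacing
    # the old post-test-ok hook: later jobs can see this job has
    # finished with the host.
    runvars[FINAL_RUNVAR] = "y"

def definalise(runvars):
    # selecthost clears the marker, so a host actually in use is
    # never falsely claimed to be done with (barring unlikely races).
    runvars.pop(FINAL_RUNVAR, None)
```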
Ian Jackson [Fri, 28 Aug 2020 13:38:17 +0000 (14:38 +0100)]
resource reporting: Report host reuse/sharing in job report
Compatibility: in principle this might generate erroneous reports
which omit sharing/reuse information for allocations made by jobs
using older versions of osstest.
However, we do not share or reuse hosts across different osstest
versions, so this cannot occur.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 7 Oct 2020 16:36:50 +0000 (17:36 +0100)]
host lifecycle: Fix detection of concurrent jobs
The previous algorithm was wrong here.
This commit was originally considerably later than the previous one.
I'm avoiding squashing this commit, to make future archaeology easier.
The effect of the bug is to report other tasks as live too often, so
hosts show up as shared rather than reused.
Ian Jackson [Tue, 25 Aug 2020 17:34:42 +0000 (18:34 +0100)]
host lifecycle: Machinery, db, for tracking relevant events
When we reuse test hosts, we want to be able to give a list of the
other jobs which might be responsible for any problem.
In principle it would be possible to do this by digging into the
db's history tables like sg-report-host-history does, but this is
quite slow and also I don't have enough confidence in that approach to
use it for this application.
So instead, track the host lifecycle explicitly.
The approach taken is a hybrid one. I first considered two and a half
approaches:
1. Permanently record all host/share allocations and share state
changes in a host history table. But it is nontrivial to update
all the allocation machinery to keep this table up to date. It is
also nontrivial to extract the necessary information from such a
table: the allocation information would have to be correlated,
using timestamps, with the steps table. That's slow and complex.
We had such a table but it was never used for these reasons;
I dropped that empty table recently.
1b. Like 1 but explicitly put a lifecycle sequence number in the
allocations table. This would make it easy to find relevant
events but would involve even more complicated logic during
allocation.
2. Record the host's lifecycle information in a file on the host.
This means it gets wiped whenever the host does and makes finding
the relevant jobs easy: read the file during logs capture, and
we'll find everything of relevance. It then has to be permanently
stored somewhere it can be used for logging and archaeology: a
per-job runvar giving the relevant host history, up to the point
where that job finished, does that job nicely. However, this
has a serious problem: if the host crashes hard, we may not be
able to recover the complete information about why! We really
want the information recorded outside the host in question.
So I've taken a hybrid approach: effectively replicate the per-host
file from (2), but put the information in the database. This
necessitates a call to clear the host lifecycle history, which we make
at the *end* of the host install. As a bonus this might let us more
easily identify if there are particular jobs that leave hosts in
states that are hard to recover from, and it will make total host
failure quite obvious because the host install log report will have a
list of the failed attempts (longer in each successive job).
For build jobs we only record the setup job, and concurrent jobs, in
the runvar. This does not seem to have been a problem so far, and
this avoids having to do work on other allocations (eg, mg-allocate).
It also avoids having very long lists of previous builds listed in
every build job.
Test jobs are only shared within a flight and with much more limited
scope so the same considerations don't arise. But by the same token,
we also do not need to adjust mg-allocate etc., since the user ought
not to allocate shares of test hosts unless they know what they are
doing.
In this commit we introduce:
* The database table
* The runvar syntax
* The function for recording the lifecycle events
We have what amounts to an ad-hoc compression scheme for the
information in the lifecycle runvars. Otherwise this data might get
quite voluminous, which can make various other db queries slow.
There isn't a very good way to represent out-of-job tasks in the
lifecycle runvar. We could maybe put in something from the tasks
table, but the entry in the tasks table might be gone by now and that
would involve quoting (and it might be quite large).
But this will only matter when a shared/reused host has been manually
messed with, and recording the task is sufficient to
(1) note the fact of such interference
(2) if the task is static, or still going when the job reports,
can actually be put in the report.
(3) failing that provide something which could be grepped for in logs
We do not call the recording function yet, so the db update is merely
preparatory.
There is a bug in this patch: the calculation of $olive is wrong.
This will be fixed in a moment.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 21 Aug 2020 14:22:19 +0000 (15:22 +0100)]
test host reuse: Switch to principled sharing scope runvar scheme
* When selecthost is passed an @host ident, indicating prep work,
engage restricted runvar access. If no call to sharing_for_build
was made, this means it can access only the runvars in
the default value of @accessible_runvar_pats.
* Make the sharetype for host reuse be based on the values of
precisely those same runvars, rather than using an adhoc scheme.
The set of covered runvars is bigger now as a result of testing...
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
The syslog server, and its port, is used for things that happen in
this job, but the syslog server is torn down and a new one started,
when the host is reused.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 21 Aug 2020 11:47:02 +0000 (12:47 +0100)]
runvar access: Provide runvar_glob
We will need this because when runvar access is restricted, accessing
via %r directly won't work. We want to see what patterns the code is
interested in (so that interest in a nonexistent runvar is properly
tracked).
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>