xenbits.xensource.com Git - osstest.git/log
4 years ago pdu-snmp: Refactor model handling
Ian Jackson [Tue, 27 Oct 2020 11:51:26 +0000 (11:51 +0000)]
pdu-snmp: Refactor model handling

This makes it easier to see what is going on and to add new model(s).

No functional change.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago pdu-snmp: Centralise base OIDs
Ian Jackson [Tue, 27 Oct 2020 11:43:12 +0000 (11:43 +0000)]
pdu-snmp: Centralise base OIDs

Do not hardcode .3 and .4 in the main logic.

No functional change.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago pdu-snmp: Rename from pdu-msw
Ian Jackson [Tue, 27 Oct 2020 11:39:17 +0000 (11:39 +0000)]
pdu-snmp: Rename from pdu-msw

We are going to make this script control PDUs other than APC ones.

No overall functional change for internal callers.  Anyone out-of-tree
using this script will need to change the name of the program they run.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago README: Fix a typo
Ian Jackson [Tue, 27 Oct 2020 11:37:06 +0000 (11:37 +0000)]
README: Fix a typo

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago ts-xen-build-prep: Install ninja
Anthony PERARD [Tue, 20 Oct 2020 09:35:49 +0000 (10:35 +0100)]
ts-xen-build-prep: Install ninja

QEMU upstream now requires ninja to build. (Probably since QEMU commit
09e93326e448 ("build: replace ninjatool with ninja"))

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago production-config-cambridge: Set TftpDiVersion for buster
Ian Jackson [Tue, 8 Sep 2020 11:03:24 +0000 (12:03 +0100)]
production-config-cambridge: Set TftpDiVersion for buster

Just ran ./mg-debian-installer-update-all.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>
4 years ago host reuse fixes: Properly clear out old static tasks from history
Ian Jackson [Fri, 23 Oct 2020 16:08:02 +0000 (17:08 +0100)]
host reuse fixes: Properly clear out old static tasks from history

The algorithm for clearing out old lifecycle entries was wrong: it
would delete all entries for non-live tasks.

In practice this would properly remove all the old entries for
non-static tasks, since ownd tasks typically don't release things
until the task ends (and it becomes non-live).  And it wouldn't remove
more than it should do unless some now-not-live task had an allocation
overlapping with us, which is not supposed to be possible if we are
doing a host wipe.  But it would not remove static tasks ever, since
they are always live.

Change to a completely different algorithm:

 * Check that only we (ie, $ttaskid) have (any shares of) this host
   allocated.  There's a function resource_check_allocated_core which
   already does this and since we're conceptually part of Executive
   it is proper for us to call it.  This is just a sanity check.

 * Delete all lifecycle entries predating the first entry made by
   us.  (We could just delete all entries other than ours, but in
   theory maybe some future code could result in a situation where
   someone else could have had another share briefly at some point.)

This removes old junk from the "Tasks that could have affected" in
reports.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
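
The pruning rule described in the commit above can be sketched as
follows.  This is a minimal Python illustration under stated
assumptions, not the osstest code: LifecycleEntry, its seq field and
prune_lifecycle are hypothetical names, and leaving the history alone
when we have no entry of our own is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LifecycleEntry:
    seq: int      # hypothetical ordering key: lower means older
    taskid: int   # task that made this entry


def prune_lifecycle(entries, our_taskid, holders):
    """Return the entries to keep after a host wipe by our_taskid."""
    # Sanity check: nobody but us may hold (any share of) the host.
    others = set(holders) - {our_taskid}
    if others:
        raise RuntimeError(f"host also allocated to tasks {sorted(others)}")

    ours = [e.seq for e in entries if e.taskid == our_taskid]
    if not ours:
        # Assumption: with no entry of our own there is no cut-off point.
        return list(entries)

    first_ours = min(ours)
    # Keep our first entry and everything after it, whoever made it.
    return [e for e in entries if e.seq >= first_ours]
```

Because the cut-off is a position in the history rather than task
liveness, an old entry from an always-live static task is dropped like
any other.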
4 years ago starvation: Do not count more than half a flight as starved
Ian Jackson [Thu, 22 Oct 2020 14:38:12 +0000 (15:38 +0100)]
starvation: Do not count more than half a flight as starved

This seems like a sensible rule.

This also prevents the following bizarre behaviour: when a flight has
a handful of jobs that cannot be run at all (eg because it's a
commissioning flight for only hosts of a particular arch), those jobs
can complete quite quickly.  Even with a high X value because only a
smallish portion of the flight has finished, this can lead to a modest
threshold value.  This combines particularly badly with commissioning
flights, where the duration estimates are often nonsense.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse fixes: Do not break host-reuse if no host allocated
Ian Jackson [Thu, 22 Oct 2020 14:02:18 +0000 (15:02 +0100)]
host reuse fixes: Do not break host-reuse if no host allocated

If host allocation failed, or our dependency jobs failed, then we
won't have allocated a host.  The host runvar will not be set.
In this case, we want to do nothing.

But we forgot to pass $noneok to selecthost.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago reporting: Minor fix to reporting of tasks with no subtask
Ian Jackson [Wed, 21 Oct 2020 18:22:33 +0000 (19:22 +0100)]
reporting: Minor fix to reporting of tasks with no subtask

subtask can be NULL.  If so, do not include it.

This change fixes a warning and a minor cosmetic defect.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Prefix guest LV names with the job name
Ian Jackson [Mon, 19 Oct 2020 11:41:27 +0000 (12:41 +0100)]
Prefix guest LV names with the job name

This means that a subsequent test which reuses the same host will not
use the same LVs.  This is a good idea because reusing the same LV
names in a subsequent job means relying on the "ad hoc run" cleanup
code.  This is a bad idea because that code is rarely tested.

And because, depending on the situation, the old LVs may even still be
in use.  For example, in a pair test, the guest's LVs will still be
set up for use with nbd.

It seems better to fix this by using a fresh LV rather than adding
more teardown code.

The "wear limit" on host reuse is what prevents the disk filling up
with LVs from old guests.

ts-debian-fixup needs special handling, because Debian's xen-tools'
xen-create-image utility hardcodes its notion of LV name construction.
We need to rename the actual LVs (perhaps overwriting old ones from a
previous ad-hoc run) and also update the config.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Introduce guest_mk_lv_name
Ian Jackson [Mon, 19 Oct 2020 11:40:23 +0000 (12:40 +0100)]
Introduce guest_mk_lv_name

This changes the way the disk name is constructed but not to any
overall effect.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse fixes: Fix runvar entry for adhoc tasks
Ian Jackson [Wed, 21 Oct 2020 17:38:51 +0000 (18:38 +0100)]
host reuse fixes: Fix runvar entry for adhoc tasks

When processing an item from the host lifecycle table into the runvar,
we don't want to do all the processing of flight and job.  Instead, we
should simply put the ?<taskid> into the runvar.

Previously this would produce ?<taskid>: which the flight reporting
code would choke on.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse fixes: Fix running of steps adhoc
Ian Jackson [Wed, 21 Oct 2020 16:28:07 +0000 (17:28 +0100)]
host reuse fixes: Fix running of steps adhoc

When a ts script is run by hand, for adhoc testing, there is no
OSSTEST_TESTID variable in the environment and the script does not
know its own step number.  Such adhoc runs are not tracked as steps
in the steps table.

For host lifecycle purposes, treat these as ad-hoc out-of-flight uses,
based only on the taskid (which will usually be a person's personal
static task).

Without this, these adhoc runs fail with a constraint violation trying
to insert a flight/job/step row into the host lifecycle table: the
constraint requires the step to be specified but it is NULL.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/MSU: Retransmit on/off until PDU has changed
Ian Jackson [Wed, 21 Oct 2020 15:14:17 +0000 (16:14 +0100)]
PDU/MSU: Retransmit on/off until PDU has changed

The main effect of this is that the transcript will actually show the
new PDU state.  Previously we would call show(), but APC PDUs would
normally not change immediately, so the transcript would show the old
state.

This also guards against an unresponsive PDU or a packet getting lost.
I don't think we have ever seen that.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
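
The retransmit-until-changed behaviour described above can be sketched
as a small loop.  This is an illustrative Python sketch only: the
send/get callables, the attempt count and the delay are hypothetical
and are not taken from the PDU scripts.

```python
import time


def set_until_changed(send, get, want, attempts=5, delay=1.0, sleep=time.sleep):
    """Resend the request until the device reports the wanted state.

    Returns the number of transmissions that were needed.
    """
    for attempt in range(1, attempts + 1):
        send(want)              # retransmit on every iteration, not just once
        if get() == want:       # re-read the state so the caller sees the new one
            return attempt
        sleep(delay)            # give a slow device time before trying again
    raise TimeoutError(f"device not in state {want!r} after {attempts} attempts")
```

Polling the state after each transmission covers both cases mentioned
in the commit: a device that is slow to report the change, and a
request that was lost.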
4 years ago PDU/MSW: Make show() return the value from get()
Ian Jackson [Wed, 21 Oct 2020 15:13:26 +0000 (16:13 +0100)]
PDU/MSW: Make show() return the value from get()

No-one uses this return value yet, so NFC.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/MSW: Actually implement delayed-*
Ian Jackson [Wed, 21 Oct 2020 15:05:50 +0000 (16:05 +0100)]
PDU/MSW: Actually implement delayed-*

Nothing in our tree uses this but having it here is useful docs for
the protocol so I shan't just delete it.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/MSW: Break out action_value()
Ian Jackson [Wed, 21 Oct 2020 15:05:36 +0000 (16:05 +0100)]
PDU/MSW: Break out action_value()

This is going to be useful in a moment.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/MSW: Break out get()
Ian Jackson [Wed, 21 Oct 2020 15:03:05 +0000 (16:03 +0100)]
PDU/MSW: Break out get()

This is going to be useful in a moment.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/MSW: Warn that SNMP status is often not immediately updated
Ian Jackson [Wed, 21 Oct 2020 14:41:33 +0000 (15:41 +0100)]
PDU/MSW: Warn that SNMP status is often not immediately updated

If you don't know this, it's very confusing.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago PDU/IPMI: Retransmit, don't just wait
Ian Jackson [Wed, 21 Oct 2020 14:39:57 +0000 (15:39 +0100)]
PDU/IPMI: Retransmit, don't just wait

We have a system for which
   ipmitool -H sabro0m -U root -P XXXX -I lanplus power on
seems to work but doesn't take effect the first time.

Retransmit each retry.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago share in jobdb: Move out-of-flight special case higher up
Ian Jackson [Wed, 21 Oct 2020 15:01:23 +0000 (16:01 +0100)]
share in jobdb: Move out-of-flight special case higher up

This avoids running the runvar computation loop outside flights.
This is good amongst other things because that loop prints warnings
about undef $flight and $job.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago share in jobdb: Break out $checkconstraints and move call
Ian Jackson [Wed, 21 Oct 2020 14:54:20 +0000 (15:54 +0100)]
share in jobdb: Break out $checkconstraints and move call

This must happen after we introduce our new row or it is not
effective!

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Do not mark hosts used for pair test as reusable
Ian Jackson [Fri, 16 Oct 2020 16:49:38 +0000 (17:49 +0100)]
Do not mark hosts used for pair test as reusable

We do not currently tear down the nbd, and that means the next test
cannot remove our LVs.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-run-job: Allow per-job control of test host reuse
Ian Jackson [Fri, 16 Oct 2020 16:48:04 +0000 (17:48 +0100)]
sg-run-job: Allow per-job control of test host reuse

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Revert "host reuse: Reuse hosts only in same role (for now)"
Ian Jackson [Fri, 16 Oct 2020 16:35:22 +0000 (17:35 +0100)]
Revert "host reuse: Reuse hosts only in same role (for now)"

This workaround is no longer needed because I have fixed the problem
properly.

Also, it didn't work anyway, because at that point $ho isn't set, so
all this did was produce some Perl warnings.

This reverts commit f3668acae2c6201c680dc7b4e9085ab184136d7e.

4 years ago known hosts handling: Ensure things are good for multi-host jobs
Ian Jackson [Fri, 16 Oct 2020 15:28:58 +0000 (16:28 +0100)]
known hosts handling: Ensure things are good for multi-host jobs

When a multi-host job reuses host(s) from earlier jobs, the set of
hosts set up in the on-host known_hosts files may be insufficient,
since the hosts we are using now may not have been in any of the
flight's runvars when the earlier job set them up.

So we need to update the known_hosts.  We use the flight's current
set, which will include all of our hosts.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago known_hosts handling: Fix over-broad SQL query
Ian Jackson [Fri, 16 Oct 2020 15:28:48 +0000 (16:28 +0100)]
known_hosts handling: Fix over-broad SQL query

This should match only "*_host" and "host".  We don't want it matching
"*host" without a "_".

Signed-off-by: Ian Jackson <iwj@xenproject.org>
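
The difference between the over-broad and the intended match can be
shown with a small query.  This sketch is an assumption-laden
illustration: it uses SQLite only so that it is self-contained, and the
table layout and runvar names are made up, not copied from osstest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runvars (name TEXT)")
conn.executemany(
    "INSERT INTO runvars VALUES (?)",
    [("host",), ("src_host",), ("dst_host",), ("xenhost",), ("hostflags",)],
)

# Over-broad: '%host' matches anything ending in "host", including "xenhost".
broad = [row[0] for row in conn.execute(
    "SELECT name FROM runvars WHERE name LIKE '%host' ORDER BY name")]

# Narrow: exactly "host", or anything ending in a literal "_host".
# In LIKE, '_' is itself a single-character wildcard, so it is escaped.
narrow = [row[0] for row in conn.execute(
    r"SELECT name FROM runvars WHERE name = 'host' "
    r"OR name LIKE '%\_host' ESCAPE '\' ORDER BY name")]

print("broad: ", broad)
print("narrow:", narrow)
```

Note that an unescaped '%_host' would still be too broad, because the
underscore would match any one character.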
4 years ago host reuse: Reuse hosts only in same role (for now)
Ian Jackson [Fri, 16 Oct 2020 12:33:15 +0000 (13:33 +0100)]
host reuse: Reuse hosts only in same role (for now)

This is a workaround.  There is a problem with host key setup in a
group of hosts, which means that when a pair test reuses a host set up
by a different test, we can get
   Host key verification failed.
during the src-to-dst migration.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago cr-daily-branch: Heuristics for when to do immediate retest flight
Ian Jackson [Mon, 12 Oct 2020 15:48:26 +0000 (16:48 +0100)]
cr-daily-branch: Heuristics for when to do immediate retest flight

Do not do a retest if it would involve retesting more than 10% of the
original flight, or if it wouldn't get a push even if the retests
pass.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
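
The two conditions above can be written as a small predicate.  This is
a sketch of the stated rule only; the function and parameter names are
hypothetical, and how cr-daily-branch obtains the counts is not shown.

```python
def should_retest(total_jobs, jobs_to_retest, would_push_if_retests_pass):
    """Decide whether an immediate retest flight is worthwhile."""
    if jobs_to_retest <= 0:
        # Assumption: nothing failed, so there is nothing to retest.
        return False
    # "More than 10% of the original flight", in integer arithmetic
    # to avoid floating-point rounding at the boundary.
    if jobs_to_retest * 10 > total_jobs:
        return False
    # A retest is pointless if even a clean retest would not yield a push.
    return bool(would_push_if_retests_pass)
```

For example, with a 100-job flight, retesting 10 jobs is allowed but
retesting 11 is not.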
4 years ago sg-report-flight: Include count of blockers, and of jobs, in mro
Ian Jackson [Mon, 12 Oct 2020 15:26:15 +0000 (16:26 +0100)]
sg-report-flight: Include count of blockers, and of jobs, in mro

The mro will now contain exactly one of "blockers" or "tolerable".

Nothing uses this yet.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago cr-daily-branch: Do not do immediate retry of failing xtf flights
Ian Jackson [Thu, 8 Oct 2020 19:02:33 +0000 (20:02 +0100)]
cr-daily-branch: Do not do immediate retry of failing xtf flights

CC: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Honour OSSTEST_SIMULATE_FAIL_RETRY for immediate retries
Ian Jackson [Thu, 8 Oct 2020 16:46:39 +0000 (17:46 +0100)]
Honour OSSTEST_SIMULATE_FAIL_RETRY for immediate retries

This is primarily useful for debugging the immediate-retry logic, but
it seems churlish to delete it again.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago cr-daily-branch: Immediately retry failing tests
Ian Jackson [Thu, 16 Jul 2020 14:51:41 +0000 (15:51 +0100)]
cr-daily-branch: Immediately retry failing tests

We exclude the self-tests because we don't want to miss breakage, and
the Xen smoke tests because they will be run again RSN anyway.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago cri-args-hostlists: Move flight_html_dir variable
Ian Jackson [Thu, 8 Oct 2020 17:35:43 +0000 (18:35 +0100)]
cri-args-hostlists: Move flight_html_dir variable

This is only used in report_flight.  We are going to want to call
report_flight from outside start_email, without having to set that
variable ourselves.

The variable isn't actually used in start_email.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Introduce real-retry blessing
Ian Jackson [Fri, 17 Jul 2020 16:43:37 +0000 (17:43 +0100)]
Introduce real-retry blessing

Nothing produces this yet.  (There's play-retry as well of course but
we don't need to document that really.)

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Nicer output for --refer-to-flight option
Ian Jackson [Thu, 8 Oct 2020 18:02:14 +0000 (19:02 +0100)]
sg-report-flight: Nicer output for --refer-to-flight option

Sort the flight summary lines together, before the URLs.  This makes
it considerably easier to read.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Provide --refer-to-flight option
Ian Jackson [Fri, 17 Jul 2020 17:01:13 +0000 (18:01 +0100)]
sg-report-flight: Provide --refer-to-flight option

This just generates an extra heading and URL at the top of the output.
In particular, it doesn't affect the algorithms which calculate
regressions.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago sg-report-flight: Break out printout_flightheader
Ian Jackson [Fri, 17 Jul 2020 16:54:27 +0000 (17:54 +0100)]
sg-report-flight: Break out printout_flightheader

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago cri-args-hostlists: Break out report_flight and publish_logs
Ian Jackson [Fri, 17 Jul 2020 16:50:31 +0000 (17:50 +0100)]
cri-args-hostlists: Break out report_flight and publish_logs

NFC.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago cri-args-hostlists: New debug var $OSSTEST_REPORT_JOB_HISTORY_RUN
Ian Jackson [Thu, 16 Jul 2020 15:09:21 +0000 (16:09 +0100)]
cri-args-hostlists: New debug var $OSSTEST_REPORT_JOB_HISTORY_RUN

No effect if this is empty.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-job-history: eval $DAILY_BRANCH_PREEXEC_HOOK
Ian Jackson [Thu, 16 Jul 2020 14:59:17 +0000 (15:59 +0100)]
sg-report-job-history: eval $DAILY_BRANCH_PREEXEC_HOOK

Put the call to this debugging/testing variable inside an eval.  This
allows a wider variety of stunts.  The one in-tree reference is
already compatible with this new semantics.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago mg-execute-flight: Do not include the transcript in reports
Ian Jackson [Thu, 15 Oct 2020 13:16:02 +0000 (14:16 +0100)]
mg-execute-flight: Do not include the transcript in reports

These are large and not very useful.  A copy is in the tree if needed.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Consider all blessings for "never pass"
Ian Jackson [Thu, 15 Oct 2020 13:15:26 +0000 (14:15 +0100)]
sg-report-flight: Consider all blessings for "never pass"

$anypassq is used for the "never pass" check; the distinction between
this and simply "fail" is cosmetic (although it can be informative).

On non-"real" flights, it can easily happen that the flight never
passed *on this branch with this blessing* but has passed on real.  So
the steps subquery does not find us an answer within reasonable time.

Work around this by always searching for "real".  This keeps the
performance within acceptable bounds even during ad-hoc testing.

We don't actually use the row from this query, so the only effect is
that when the job passed in a "real" flight, we go on to the full
regression analysis rather than short-circuiting and reporting "never
pass".

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago Honour OSSTEST_SIMULATE_FAIL in sg-run-job
Ian Jackson [Thu, 8 Oct 2020 16:15:25 +0000 (17:15 +0100)]
Honour OSSTEST_SIMULATE_FAIL in sg-run-job

This is a Tcl list of globs for <job>.<step>, and allows for
simulating particular test failures.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
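
The matching of <job>.<step> against a list of globs can be sketched as
below.  This is an approximation in Python, not the Tcl in sg-run-job:
fnmatch globbing is assumed to be close enough to Tcl's for simple
patterns, the list is split on whitespace rather than parsed as a real
Tcl list, and the job and step names used in examples are illustrative.

```python
from fnmatch import fnmatchcase


def simulated_failure(patterns, job, step):
    """Return True if <job>.<step> matches any glob in the pattern list."""
    target = f"{job}.{step}"
    # Assumption: a simple whitespace-separated list of globs.
    return any(fnmatchcase(target, pat) for pat in patterns.split())
```

With a setting such as "build-amd64.* test-*.ts-guest-start", every
step of the build-amd64 job would be treated as failing, but only the
ts-guest-start step of the test jobs.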
4 years ago Honour OSSTEST_SIMULATE=2 to actually run dummy flight
Ian Jackson [Thu, 8 Oct 2020 16:01:50 +0000 (17:01 +0100)]
Honour OSSTEST_SIMULATE=2 to actually run dummy flight

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Word-wrapping improvements to job and step names
Ian Jackson [Mon, 5 Oct 2020 16:23:54 +0000 (17:23 +0100)]
sg-report-flight: Word-wrapping improvements to job and step names

Use <wbr>.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Sharing reports: more task finished info
Ian Jackson [Mon, 5 Oct 2020 17:47:12 +0000 (18:47 +0100)]
sg-report-flight: Sharing reports: more task finished info

Other steps from jobs affecting this host either started after we are
running, and therefore didn't affect the stuff we're reporting, or
are already in the db.  Furthermore, any such effects for steps which
have finished must have completed by the max finished time.  But if
there are unfinished steps, we don't know the finish time.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago tsreadconfig: Change misleading "setting" message
Ian Jackson [Fri, 2 Oct 2020 16:13:34 +0000 (17:13 +0100)]
tsreadconfig: Change misleading "setting" message

These are the *existing* runvars and it is confusing that we print
"setting" for them.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago flight other job reporting: Further improvements to ordering
Ian Jackson [Thu, 3 Sep 2020 17:59:23 +0000 (18:59 +0100)]
flight other job reporting: Further improvements to ordering

We want to definitely put these NULLs last.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago flight other job reporting: Put nulls last in the report
Ian Jackson [Thu, 3 Sep 2020 15:33:14 +0000 (16:33 +0100)]
flight other job reporting: Put nulls last in the report

Cosmetic change only, but this makes the results easier to understand.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Improvements to other job (share/reuse) reporting
Ian Jackson [Fri, 2 Oct 2020 15:19:29 +0000 (16:19 +0100)]
sg-report-flight: Improvements to other job (share/reuse) reporting

* Prefer to show "prep" (purple) rather than "share".
* Show our own relationship, in particular to show if it was prep.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago sg-report-flight: Reformat slightly
Ian Jackson [Fri, 2 Oct 2020 15:19:04 +0000 (16:19 +0100)]
sg-report-flight: Reformat slightly

This is more regular and will make the next commit easier to
understand.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse: New protocol between sg-run-job and ts-host-reuse
Ian Jackson [Thu, 3 Sep 2020 10:58:30 +0000 (11:58 +0100)]
host reuse: New protocol between sg-run-job and ts-host-reuse

Abolish post-test-ok (which runs only if successful) and replace it
with final (which sets the runvar to indicate finality, and runs
regardless).

This allows a subsequent job which reuses the host to see that this
job had finished using the host.  This is relevant for builds, where a
host can be reused even after a failed job.

"Lies", where we claim the use of the host was done, are
avoided (barring unlikely races) because selecthost de-finalises the
runvar.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse: ts-host-reuse: Prepare for argument handling
Ian Jackson [Thu, 3 Sep 2020 10:57:29 +0000 (11:57 +0100)]
host reuse: ts-host-reuse: Prepare for argument handling

No functional change.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host reuse: sg-run-job: Reanme post-test-ok parameter
Ian Jackson [Thu, 3 Sep 2020 10:47:55 +0000 (11:47 +0100)]
host reuse: sg-run-job: Reanme post-test-ok parameter

This is more accurate.

No overall functional change.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago resource reporting: Report host reuse/sharing in job report
Ian Jackson [Fri, 28 Aug 2020 13:38:17 +0000 (14:38 +0100)]
resource reporting: Report host reuse/sharing in job report

Compatibility: in principle this might generate erroneous reports
which omit sharing/reuse information for allocations made by jobs
using older versions of osstest.

However, we do not share or reuse hosts across different osstest
versions, so this cannot occur.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago resource reporting, nfc: split a here document
Ian Jackson [Fri, 28 Aug 2020 13:07:57 +0000 (14:07 +0100)]
resource reporting, nfc: split a here document

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago sg-report-flight: Refactor runvar access
Ian Jackson [Thu, 27 Aug 2020 18:11:37 +0000 (19:11 +0100)]
sg-report-flight: Refactor runvar access

Collect the runvars query into local perl variables.  This will allow
us to reuse the information without going back to the db.

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host lifecycle: Record lifecycle in db and runvar
Ian Jackson [Tue, 25 Aug 2020 19:13:22 +0000 (20:13 +0100)]
host lifecycle: Record lifecycle in db and runvar

This is just the calls to host_update_lifecycle_info.
Now the db table is Needed.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host lifecycle: Prevent referential integrity violation
Ian Jackson [Thu, 27 Aug 2020 17:48:36 +0000 (18:48 +0100)]
host lifecycle: Prevent referential integrity violation

We can't use normal constraints for either of these, sadly.

We can make the constraints into a single query which says "OK".

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host lifecycle: Fix detection of concurrent jobs
Ian Jackson [Wed, 7 Oct 2020 16:36:50 +0000 (17:36 +0100)]
host lifecycle: Fix detection of concurrent jobs

The previous algorithm was wrong here.

This commit was originally considerably later than the previous one.
I'm avoiding squashing this commit, to make future archaeology easier.
The effect of the bug is to report other tasks as live too often, so
hosts show up as shared rather than reused.

Signed-off-by: Ian Jackson <iwj@xenproject.org>
4 years ago host lifecycle: Machinery, db, for tracking relevant events
Ian Jackson [Tue, 25 Aug 2020 17:34:42 +0000 (18:34 +0100)]
host lifecycle: Machinery, db, for tracking relevant events

When we reuse test hosts, we want to be able to give a list of the
other jobs which might be responsible for any problem.

In principle it would be possible to do this by digging into the
db's history tables like sg-report-host-history does, but this is
quite slow and also I don't have enough confidence in that approach to
use it for this application.

So instead, track the host lifecycle explicitly.

The approach taken is a hybrid one.  I first considered two and a half
approaches:

 1. Permanently record all host/share allocations and share state
    changes in a host history table.  But it is nontrivial to update
    all the allocation machinery to keep this table up to date.  It is
    also nontrivial to extract the necessary information from such a
    table: the allocation information would have to be correlated,
    using timestamps, with the steps table.  That's slow and complex.
    We had such a table but it was never used for these reasons;
    I dropped that empty table recently.

 1b. Like 1 but explicitly put a lifecycle sequence number in the
    allocations table.  This would make it easy to find relevant
    events but would involve even more complicated logic during
    allocation.

 2. Record the host's lifecycle information in a file on the host.
    This means it gets wiped whenever the host does and makes finding
    the relevant jobs easy: read the file during logs capture, and
    we'll find everything of relevance.  It then has to be permanently
    stored somewhere it can be used for logging and archaeology: a
    per-job runvar giving the relevant host history, up to the point
    where that job finished, does that job nicely.  However, this
    has a serious problem: if the host crashes hard, we may not be
    able to recover the complete information about why!  We really
    want the information recorded outside the host in question.

So I've taken a hybrid approach: effectively replicate the per-host
file from (2), but put the information in the database.  This
necessitates a call to clear the host lifecycle history, which we make
at the *end* of the host install.  As a bonus this might let us more
easily identify if there are particular jobs that leave hosts in
states that are hard to recover from, and it will make total host
failure quite obvious because the host install log report will have a
list of the failed attempts (longer in each successive job).

For build jobs we only record the setup job, and concurrent jobs, in
the runvar.  This does not seem to have been a problem so far, and
this avoids having to do work on other allocations (eg, mg-allocate).
It also avoids having very long lists of previous builds listed in
every build job.

Test jobs are only shared within a flight and with much more limited
scope so the same considerations don't arise.  But by the same token,
we also do not need to adjust mg-allocate etc., since the user ought
not to allocate shares of test hosts unless they know what they are
doing.

In this commit we introduce:
 * The database table
 * The runvar syntax
 * The function for recording the lifecycle events

We have what amounts to an ad-hoc compression scheme for the
information in the lifecycle runvars.  Otherwise this data might get
quite voluminous, which can make various other db queries slow.

There isn't a very good way to represent out-of-job tasks in the
lifecycle runvar.  We could maybe put in something from the tasks
table, but the entry in the tasks table might be gone by now and that
would involve quoting (and it might be quite large).

But this will only matter when a shared/reused host has been manually
messed with, and recording the task is sufficient to
 (1) note the fact of such interference
 (2) if the task is static, or still going when the job reports,
      can actually be put in the report.
 (3) failing that provide something which could be grepped for in logs

We do not call the recording function yet, so the db update is merely
Preparatory.

There is a bug in this patch: the calculation of $olive is wrong.
This will be fixed in a moment.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago hsot reuse: Make share type hash more easily greppable
Ian Jackson [Fri, 21 Aug 2020 14:44:58 +0000 (15:44 +0100)]
hsot reuse: Make share type hash more easily greppable

Use - and _ to make up the base64 alphabet instead of + and /

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
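
Hashing a set of runvar values and encoding the digest with the
- and _ alphabet can be sketched as below.  This is an illustration
under assumptions: the hash function (SHA-256 here), the
serialisation of the runvars and the function name are all
hypothetical and not taken from the osstest source.

```python
import base64
import hashlib


def share_type_hash(runvars):
    """Hash runvar names and values into a short, grep-friendly token."""
    # Hypothetical serialisation: one sorted name=value line per runvar,
    # so the result does not depend on dictionary ordering.
    blob = "\n".join(f"{name}={value}" for name, value in sorted(runvars.items()))
    digest = hashlib.sha256(blob.encode("utf-8")).digest()
    # URL-safe base64 uses '-' and '_' instead of '+' and '/', which avoids
    # characters that are awkward in regexps, shells and URLs.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

The token has a fixed length however many runvars go into it, which is
what keeps the runvars table from growing.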
4 years ago hsot reuse: Hash the share type
Ian Jackson [Fri, 21 Aug 2020 10:25:06 +0000 (11:25 +0100)]
hsot reuse: Hash the share type

We don't really want to duplicate (triplicate, actually) lots of the
runvars.  This will make the runvars table needlessly bloated.

So hash the values.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago test host reuse: Switch to principled sharing scope runvar scheme
Ian Jackson [Fri, 21 Aug 2020 14:22:19 +0000 (15:22 +0100)]
test host reuse: Switch to principled sharing scope runvar scheme

* When selecthost is passed an @host ident, indicating prep work,
  engage restricted runvar access.  If no call to sharing_for_build
  was made, this means it can access only the runvars in
  the default value of @accessible_runvar_pats.

* Make the sharetype for host reuse be based on the values of
  precisely those same runvars, rather than using an adhoc scheme.

The set of covered runvars is bigger now as a result of testing...

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago runvar access: Introduce effects_gone_before_share_reuse
Ian Jackson [Fri, 21 Aug 2020 11:36:10 +0000 (12:36 +0100)]
runvar access: Introduce effects_gone_before_share_reuse

The syslog server, and its port, is used for things that happen in
this job, but the syslog server is torn down and a new one started,
when the host is reused.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago runvar access: Introduce sharing_for_build
Ian Jackson [Fri, 21 Aug 2020 11:47:44 +0000 (12:47 +0100)]
runvar access: Introduce sharing_for_build

Builds don't have so much contingent setup.  We don't track the
runvars; we just rely on the share-* hostflag set in the job.

But selecthost() is going to automatically enable runvar access
control for shared/reused hosts.  So, provide a way to disable that.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago runvar access: Use runvar_glob for dmrestrict runvar search
Ian Jackson [Thu, 20 Aug 2020 16:39:58 +0000 (17:39 +0100)]
runvar access: Use runvar_glob for dmrestrict runvar search

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago runvar access: Provide runvar_glob
Ian Jackson [Fri, 21 Aug 2020 11:47:02 +0000 (12:47 +0100)]
runvar access: Provide runvar_glob

We will need this because when runvar access is restricted, accessing
via %r directly won't work.  We want to see what patterns the code is
interested in (so that interest in a nonexistent runvar is properly
tracked).

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago runvar access: Introduce access control machinery
Ian Jackson [Fri, 21 Aug 2020 11:43:31 +0000 (12:43 +0100)]
runvar access: Introduce access control machinery

This will allow us to trap accesses, during test host setup, to
runvars which weren't included in the calculation of the sharing
scope.
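
A minimal Python sketch of trapping such accesses follows. The wrapper class, the exception name and the pattern syntax are assumptions for illustration; the real machinery is Perl and may work quite differently.

```python
import fnmatch


class RunvarAccessError(KeyError):
    """Raised when code reads a runvar outside the permitted patterns."""


class RestrictedRunvars:
    def __init__(self, runvars, allowed_patterns):
        self._runvars = dict(runvars)
        self._allowed = list(allowed_patterns)

    def __getitem__(self, name):
        # Check permission first, so that even a nonexistent runvar
        # outside the sharing scope is trapped.
        if not any(fnmatch.fnmatchcase(name, pat) for pat in self._allowed):
            raise RunvarAccessError(name)
        # An allowed but unset runvar reads as None.
        return self._runvars.get(name)
```

Failing loudly here is the point: a silent read of an untracked runvar would let host setup depend on something the sharing scope did not account for.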

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago TestSupport: Provide runvar_is_synth
Ian Jackson [Thu, 20 Aug 2020 20:32:48 +0000 (21:32 +0100)]
TestSupport: Provide runvar_is_synth

Internally we use a hash %r_notsynth.  This allows us to avoid
adding code to store_runvar etc.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago subst_netboot_template: Do not use all of %r
Ian Jackson [Thu, 20 Aug 2020 16:49:31 +0000 (17:49 +0100)]
subst_netboot_template: Do not use all of %r

Instead of copying all of %r into %v, have the template substitutor
fall back to %r from %v.

This is going to be important when we have host-reuse-related access
control to %r.
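
The fallback lookup can be sketched in Python with a ChainMap, which consults the second mapping only when the first lacks the key. The %name% placeholder syntax and the function name are invented for this example and are not the real template format.

```python
import re
from collections import ChainMap


def subst_template(template, v, r):
    """Substitute %name% placeholders from v, falling back to r."""
    lookup = ChainMap(v, r)  # v takes priority; r is read only on a miss

    def replace(match):
        return str(lookup[match.group(1)])

    return re.sub(r"%(\w+)%", replace, template)
```

Only the keys a template actually names are ever read from r, which is what makes the approach compatible with per-runvar access control.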

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host reuse: Bump host share reuse bonus
Ian Jackson [Wed, 22 Nov 2017 11:39:39 +0000 (11:39 +0000)]
host reuse: Bump host share reuse bonus

In test jobs this is now contending with the variation bonus.

If we fail to vary properly this time, we get another go in the next
flight, so this is not so critical.

This increases the amount of test host reuse.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago host reuse: Use literal for the hosts_infraprioritygroup runvar
Ian Jackson [Mon, 24 Aug 2020 11:03:11 +0000 (12:03 +0100)]
host reuse: Use literal for the hosts_infraprioritygroup runvar

At some point this might make the database smarter about indexing.
It's certainly clearer.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host reuse: Jiggle the infra-priority a bit, within a flight
Ian Jackson [Wed, 22 Nov 2017 11:38:05 +0000 (11:38 +0000)]
host reuse: Jiggle the infra-priority a bit, within a flight

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago host allocation: Group jobs by their reuse parameters
Ian Jackson [Fri, 17 Nov 2017 16:49:42 +0000 (16:49 +0000)]
host allocation: Group jobs by their reuse parameters

This promotes reuse by arranging that jobs that can reuse a host get
to run consecutively.
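
One way to picture the grouping, as a Python sketch only: jobs are tuples of (name, reuse key), and jobs with the same key are moved next to each other while groups keep the order in which they first appeared. The data shape and function name are assumptions; the real allocator works through priorities rather than a simple sort.

```python
def order_for_reuse(jobs):
    """Reorder (name, reuse_key) pairs so equal keys are adjacent."""
    first_seen = {}
    for index, (_name, reuse_key) in enumerate(jobs):
        first_seen.setdefault(reuse_key, index)
    # sorted() is stable, so jobs within a group keep their order.
    return sorted(jobs, key=lambda job: first_seen[job[1]])
```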

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago host reuse: Reuse test hosts within a flight
Ian Jackson [Tue, 21 May 2019 16:06:24 +0000 (17:06 +0100)]
host reuse: Reuse test hosts within a flight

Mark the host shareable, and unshareable, as appropriate.

There is still a lot more cleanup and improvement to do.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago shared/reuse: Rely on @ for ts-host-ping-check
Ian Jackson [Fri, 21 Aug 2020 16:40:07 +0000 (17:40 +0100)]
shared/reuse: Rely on @ for ts-host-ping-check

Remove the check for SharedReady.

The existence of this check is perplexing.  It was introduced in
  ts-host-ping-check: Do not run if host is being reused
in 8f1dc3f7c401 (from 2015).

At that time we only shared build hosts, and build hosts never ran this
script.  So I don't understand what that was hoping to achieve.  Maybe
it made some difference in a now-lost pre-rebase situation.

Anyway, in our current tree I think we want to rerun the
ts-host-ping-check when we reuse a test host.  My change to add @ to
parts of per-host-prep in sg-run-job deliberately omitted the step
with testid host-ping-check-xen/@.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago host reuse: sg-run-job: per-host prep: Use @ for per-host-ts
Ian Jackson [Thu, 20 Aug 2020 14:13:12 +0000 (15:13 +0100)]
host reuse: sg-run-job: per-host prep: Use @ for per-host-ts

These are the steps that will be skipped when we reuse a test host.

No functional change yet since we don't allocate the host shared yet.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago shared/reuse: Use @ for freebsd host prep
Ian Jackson [Tue, 21 May 2019 16:37:42 +0000 (17:37 +0100)]
shared/reuse: Use @ for freebsd host prep

These are all the relevant call sites for ts-freebsd-host-install and
ts-freebsd-build-prep.  (There's a ts-freebsd-host-install in
ts-memdisk-try-append but that's for host examination and does not
use or want sharing or reuse.)

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago shared/reuse: Use @ for ts-host-install
Ian Jackson [Mon, 30 Oct 2017 18:09:41 +0000 (18:09 +0000)]
shared/reuse: Use @ for ts-host-install

Pass @ from sg-run-job.  These are all the call sites for
ts-host-install-*, so we can lose the open-coded test for SharedReady.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago shared/reuse: Use @ for ts-xen-build-prep
Ian Jackson [Wed, 22 May 2019 15:44:40 +0000 (16:44 +0100)]
shared/reuse: Use @ for ts-xen-build-prep

Pass @ from sg-run-job.  This is the only call site for
ts-xen-build-prep, so it can lose the open-coded test for SharedReady.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago sg-run-job: Detect improper use of @ iffail with run-ts
Ian Jackson [Wed, 22 May 2019 16:42:16 +0000 (17:42 +0100)]
sg-run-job: Detect improper use of @ iffail with run-ts

Only per-host-ts understands this.  This is a bit of a bear trap, so
arrange to bail rather than putting strange step status values with
`@' at the front in the database...
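
The guard amounts to rejecting an iffail value that starts with `@' before it can reach the database. A Python sketch of that check follows; the function name is hypothetical, the iffail values in the usage are only illustrative, and the real check lives in the Tcl of sg-run-job.

```python
def check_run_ts_iffail(iffail):
    """Return iffail unchanged, or raise if it carries the @ tag."""
    if iffail.startswith("@"):
        # Only per-host-ts knows how to interpret a leading @.
        raise ValueError(f"run-ts does not understand @ in iffail: {iffail!r}")
    return iffail
```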

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago sg-run-job: New @ iffail tag for prep tasks
Ian Jackson [Wed, 22 May 2019 15:34:42 +0000 (16:34 +0100)]
sg-run-job: New @ iffail tag for prep tasks

Currently no use sites, so no functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago ts-hosts-allocate-Executive print sharing info in debug output
Ian Jackson [Mon, 6 Nov 2017 18:07:39 +0000 (18:07 +0000)]
ts-hosts-allocate-Executive print sharing info in debug output

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago host allocation: selecthost(): Support @IDENT for reuse
Ian Jackson [Tue, 21 May 2019 16:30:43 +0000 (17:30 +0100)]
host allocation: selecthost(): Support @IDENT for reuse

This is the first part of a central way to control host reuse, rather
than having to write code in each ts-* script to check Shared etc.

No functional change with existing callers.
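
As an illustration of handling this centrally, the Python sketch below assumes that a leading @ on the ident marks a preparation step, and that such a step can be skipped when the host is already shared and ready. Both function names and the exact skip condition are assumptions, not taken from selecthost().

```python
def parse_ident(ident):
    """Split an ident into (name, is_prep); a leading @ marks prep."""
    is_prep = ident.startswith("@")
    return (ident[1:] if is_prep else ident), is_prep


def should_skip_prep(ident, shared_ready):
    """A prep step is skipped only when the host is already set up."""
    _name, is_prep = parse_ident(ident)
    return is_prep and shared_ready
```

An ident without @ behaves exactly as before, which matches the claim of no functional change for existing callers.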

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago ts-host-reuse: Add some missing runvars to the host sharing control
Ian Jackson [Mon, 20 Nov 2017 16:12:56 +0000 (16:12 +0000)]
ts-host-reuse: Add some missing runvars to the host sharing control

Add some missing runvars to the host sharing control.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago ts-host-reuse: Do not depend on bios
Ian Jackson [Mon, 20 Nov 2017 16:07:32 +0000 (16:07 +0000)]
ts-host-reuse: Do not depend on bios

Weirdly, this is only used for guests.  Really, it should be a
target_var, not a raw runvar applying to all guests, since it can be
guest-specific.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago ts-host-reuse: tolerate unremoveable lv
Ian Jackson [Fri, 17 Nov 2017 14:05:34 +0000 (14:05 +0000)]
ts-host-reuse: tolerate unremoveable lv

It might be a symlink in the pair tests.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago ts-host-reuse: New script, to do reuse state changes
Ian Jackson [Tue, 21 May 2019 16:06:50 +0000 (17:06 +0100)]
ts-host-reuse: New script, to do reuse state changes

This will be made part of the test job recipes.

We calculate the sharing scope (sharetype) by reference to a lot of
runvars, etc.

This version of the script is rather far from the finished working
one, but it seems better to preserve the actual history for how it got
the way it is.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago ts-hosts-allocate-Executive: Better message for hosts abandoned mid-test
Ian Jackson [Mon, 6 Nov 2017 17:23:34 +0000 (17:23 +0000)]
ts-hosts-allocate-Executive: Better message for hosts abandoned mid-test

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago resource reporting, nfc: Break out report_rogue_task_description
Ian Jackson [Fri, 28 Aug 2020 15:53:18 +0000 (16:53 +0100)]
resource reporting, nfc: Break out report_rogue_task_description

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago resource reporting: Print username when listing "rogue tasks"
Ian Jackson [Fri, 28 Aug 2020 15:45:53 +0000 (16:45 +0100)]
resource reporting: Print username when listing "rogue tasks"

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
4 years ago plan_search: Track last sharing state to determine $share_reuse
Ian Jackson [Wed, 8 Nov 2017 16:29:07 +0000 (16:29 +0000)]
plan_search: Track last sharing state to determine $share_reuse

What matters for the purpose of $share_reuse is not whether the host
is actually being _shared_ (ie, there are other concurrent allocations
and therefore a concurrent Event with Share information).  What we
really want to know is whether the *last* use of this host was a
suitable sharing setup - because we actually want to know if we will
be able to skip our setup.

So track that explicitly.  (The slightly odd structure, where there
are two loops in one, means that we reset $last_eshare when we go onto
the next $req ie the next host to check.)
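
The distinction between "currently shared" and "last used in a compatible sharing setup" can be sketched in Python as below. The plan layout (a dict from host to a list of event dicts with an optional "share" entry) and the plain equality test are assumptions for the example; the real compatibility check is $share_compat_ok and is more involved.

```python
def share_reuse_by_host(plan, wanted_sharetype):
    """Map each host to whether its last use matches wanted_sharetype."""
    result = {}
    for host, events in plan.items():
        last_eshare = None  # reset when moving on to the next host
        for event in events:
            # Each later event overwrites the record, so only the
            # final use of the host decides the outcome.
            last_eshare = event.get("share")
        result[host] = last_eshare == wanted_sharetype
    return result
```

A host that was shared earlier but then used unshared does not count, because the setup that could have been skipped is no longer in place.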

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago plan search: Move $share_compat_ok further up the file
Ian Jackson [Wed, 8 Nov 2017 16:43:34 +0000 (16:43 +0000)]
plan search: Move $share_compat_ok further up the file

We are going to want to use this outside the loop.

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago plan_search: Use plan's Wear information rather than tracking it ourselves
Ian Jackson [Wed, 8 Nov 2017 16:39:37 +0000 (16:39 +0000)]
plan_search: Use plan's Wear information rather than tracking it ourselves

There is no reason not to use this information from the plan.
Not computing it ourselves saves some confusing logic here.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago plan_search: Improve debugging of $share_compat_ok->()
Ian Jackson [Wed, 8 Nov 2017 16:36:07 +0000 (16:36 +0000)]
plan_search: Improve debugging of $share_compat_ok->()

No change other than to debugging output.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago plan_search: Break out $share_compat_ok
Ian Jackson [Wed, 8 Nov 2017 16:16:29 +0000 (16:16 +0000)]
plan_search: Break out $share_compat_ok

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
4 years ago host allocation: *_shared_mark_ready: Only prod when $newstate is ready
Ian Jackson [Mon, 30 Oct 2017 17:25:43 +0000 (17:25 +0000)]
host allocation: *_shared_mark_ready: Only prod when $newstate is ready

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>