Objectives
The new dispatcher design is intended to make it easier to adapt the
dispatcher flow to new boards, new mechanisms and new deployments. It
also shifts the balance of support: the dispatcher does less work, the
dispatcher configuration makes fewer assumptions about the test, and
more flexibility is put into the hands of the test writer.
Note
The new code is still developing; some areas are absent and
some areas will change substantially before they work.
All details here need to be seen only as examples and the
specific code may well change independently. This documentation
is aimed at LAVA developers - although some content covers user
facing actions, the syntax and parameters for these actions
are still subject to change and do not constitute an API. In
particular, the sample jobs supporting the unit tests do not
represent a submission format, rather a generated format based
on (as yet unwritten) server-side conversions.
Design
Start with a Job which is broken up into a Deployment, a Boot, a Test
and a Submit class:
Job
  Deployment
    DeployAction
      DownloadAction
      ChecksumAction
      MountAction
      CustomiseAction
      TestDefAction
      UnmountAction
  BootAction
  TestAction
  SubmitAction
The Job manages the Actions using a Pipeline structure. Actions can be
specialised using internal pipelines, and an Action can include support
for retries and other logical functions:
DownloadAction
  HttpDownloadAction
  FileDownloadAction
If a Job includes one or more Test definitions, the Deployment can be
extended to overlay the LAVA test scripts without needing to mount the
image twice:
DeployAction
  OverlayAction
    MultinodeOverlayAction
    LMPOverlayAction
The TestDefinitionAction has a similar structure with specialist tasks
being handed off to cope with particular tools:
TestDefinitionAction
  RepoAction
    GitRepoAction
    BzrRepoAction
    TarRepoAction
    UrlRepoAction
Following the code flow
lava/dispatcher/commands.py
  Command line arguments, call to the YAML parser
lava_dispatcher/pipeline/device.py
  YAML parser to create the Device object
lava_dispatcher/pipeline/parser.py
  YAML parser to create the Job object
....pipeline/actions/deploy/
  Handlers for different deployment strategies
....pipeline/actions/boot/
  Handlers for different boot strategies
....pipeline/actions/test/
  Handlers for different LavaTestShell strategies
....pipeline/actions/deploy/image.py
  DeployImage strategy creates DeployImageAction
....pipeline/actions/deploy/image.py
  DeployImageAction.populate adds deployment actions to the Job pipeline
*repeat for each strategy*
  Each populate function adds more Actions
....pipeline/action.py
  Pipeline.run_actions() to start
The deployment is determined from the device_type specified in the Job
(or the device_type of the specified target) by reading the list of
supported methods from the device_types YAML configuration.
Each Action can define an internal pipeline and add sub-actions in the
Action.populate function.
Particular Logic Actions (like RetryAction) require an internal pipeline
so that all actions added to that pipeline can be retried in the same
order. (Remember that actions must be idempotent.) Actions which fail
with a JobError or InfrastructureError can trigger Diagnostic actions.
See Logical actions.
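As a rough illustration of the two preceding paragraphs, the sketch
below shows an Action owning an internal pipeline and a retry wrapper
re-running it. The class names follow the prose above but the
signatures are simplified and do not represent the actual
lava_dispatcher API:

class JobError(Exception):
    pass


class Pipeline:
    def __init__(self):
        self.actions = []

    def add_action(self, action):
        self.actions.append(action)

    def run_actions(self, connection):
        for action in self.actions:
            connection = action.run(connection)
        return connection


class Action:
    def __init__(self):
        self.internal_pipeline = None

    def populate(self, parameters):
        # sub-actions are added to self.internal_pipeline here,
        # driven only by the job parameters
        pass

    def run(self, connection):
        if self.internal_pipeline:
            return self.internal_pipeline.run_actions(connection)
        return connection


class RetryAction(Action):
    # re-runs its internal pipeline in order; this is only safe because
    # every action added to that pipeline is idempotent
    max_retries = 3

    def run(self, connection):
        for _ in range(self.max_retries):
            try:
                return self.internal_pipeline.run_actions(connection)
            except JobError:
                continue
        raise JobError('all retries failed')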
actions:
  deploy:
    allow:
      - image
  boot:
    allow:
      - image
This then matches the python class structure:
actions/
  deploy/
    image.py
The class defines the list of Action classes needed to implement this
deployment. See also Dispatcher actions.
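A simplified sketch of such a strategy class, matching the allow list
from the device_type YAML against the job parameters. The accepts and
action method names are illustrative, not a guaranteed API:

class DeployImageAction:
    # stub standing in for the top level deploy action
    def populate(self, parameters):
        pass


class DeployImage:
    # strategy class: selected when the device_type allows the 'image' method

    @classmethod
    def accepts(cls, device, parameters):
        # 'allow' mirrors the device_type YAML shown above
        if 'image' not in device['actions']['deploy']['allow']:
            return False
        return parameters.get('to') == 'image'

    @classmethod
    def action(cls):
        return DeployImageAction()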
Pipeline construction and flow
- One device per job; one top level pipeline per job.
  - Loads only the configuration required for this one job.
- A NewDevice is built from the specified target (commands.py).
- A Job is generated from the YAML by the parser.
- The top level Pipeline is constructed by the parser.
- Strategy classes are initialised by the parser.
  - Strategy classes add the top level Action for that strategy to the
    top level pipeline.
- The top level pipeline calls populate() on each top level Action added.
  - Each Action.populate() function may construct one internal
    pipeline, based on parameters.
  - Internal pipelines call populate() on each Action added.
- The parser iterates over each Strategy.
- The parser adds the FinalizeAction to the top level pipeline.
- Log handlers are set up.
- The Job validates the completed pipeline.
  - Dynamic data can be added to the context.
- If --validate is not specified, the job runs.
  - Each run() function can add dynamic data to the context and/or
    results to the pipeline.
  - The Pipeline iterates through the actions.
- The Job ends and is checked for errors.
- The completed pipeline is available.
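A rough sketch of this flow, reduced to plain functions. This is not
the real parser code, only an illustration of the ordering described in
the list above; the function and class names are examples:

class FinalizeAction:
    def populate(self, parameters):
        pass


def build_pipeline(job_actions, device, strategies):
    # for each action block in the job YAML, pick a matching strategy,
    # add its top level Action and let populate() build any internal pipeline
    pipeline = []
    for parameters in job_actions:
        for strategy in strategies:
            if strategy.accepts(device, parameters):
                action = strategy.action()
                action.populate(parameters)
                pipeline.append(action)
    # the parser adds the FinalizeAction last
    finalize = FinalizeAction()
    finalize.populate({})
    pipeline.append(finalize)
    return pipeline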
Using strategy classes
Strategies are ways of meeting the requirements of the submitted job within
the limits of available devices and code support.
If an internal pipeline would need to allow for optional actions, those
actions still need to be idempotent. Therefore, the pipeline can include
all actions, with each action being responsible for checking whether
anything actually needs to be done. The populate function should avoid
using conditionals. An explicit select function can be used instead.
Whenever there is a need for a particular job to use a different Action
based solely on job parameters or device configuration, that decision
should occur in the Strategy selection using classmethod support.
Where a class is used in lots of different strategies, identify whether
there is a match between particular strategies always needing particular
options within the class. At this point, the class can be split and
particular strategies use a specialised class implementing the optional
behaviour and calling down to the base class for the rest.
If there is no clear match, for example in testdef.py where any
particular job could use a different VCS or URL without actually being
a different strategy, a select function is preferable. A select handler
allows the pipeline to contain only classes supporting git repositories
when only git repositories are in use for that job.
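As an illustration of the select pattern, using the class names from
the diagram earlier; the select implementation here is only a sketch,
not the code in testdef.py:

class RepoAction:
    vcs = None

    @classmethod
    def select(cls, repo_type):
        # return the one subclass which handles this repository type
        candidates = [sub for sub in cls.__subclasses__() if sub.vcs == repo_type]
        if len(candidates) != 1:
            raise ValueError('unsupported repository type: %s' % repo_type)
        return candidates[0]


class GitRepoAction(RepoAction):
    vcs = 'git'


class BzrRepoAction(RepoAction):
    vcs = 'bzr'


handler_class = RepoAction.select('git')   # GitRepoAction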
The list of available strategies can be determined in the codebase from
the module imports in the strategies.py file for each action type.
This results in more classes but a cleaner (and more predictable)
pipeline construction.
Lava test shell scripts
Note
See Refactoring review criteria - it is a mistake to think of the LAVA
test support scripts as an overlay - the scripts are an
extension to the test. Wherever possible, current
deployments are being changed to supply the extensions
alongside the deployment instead of overlaying, and thereby
altering, the deployment.
The LAVA scripts are a standard addition to a LAVA test and are handled as
a single unit. Using idempotent actions, the test script extension can
support LMP or MultiNode or other custom requirements without requiring
this support to be added to all tests. The extensions are created during
the deploy strategy and specific deployments can override the
ApplyExtensionAction to unpack the extension tarball alongside the
test during the deployment phase and then mount the extension inside the
image. The tarball itself remains in the output directory and becomes
part of the test records. The checksum of the overlay is added to the
test job log.
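A hedged sketch of that packing step, using hypothetical helper names;
only the behaviour (a single tarball kept in the output directory plus
a recorded checksum) comes from the text above:

import hashlib
import os
import tarfile


def create_overlay_tarball(overlay_dir, output_dir):
    # pack the LAVA test scripts into one tarball
    tarball = os.path.join(output_dir, 'overlay.tar.gz')
    with tarfile.open(tarball, 'w:gz') as tar:
        tar.add(overlay_dir, arcname='lava')
    # record the checksum for the test job log; the tarball itself
    # stays in the output directory as part of the test records
    sha256 = hashlib.sha256()
    with open(tarball, 'rb') as handle:
        for block in iter(lambda: handle.read(65536), b''):
            sha256.update(block)
    return tarball, sha256.hexdigest()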
Result bundle identifiers
Old-style result bundles are assigned a text-based UUID during submission.
This has several issues:
- The UUID is not sequential or predictable, so finding this one, the
next one or the previous one requires a database lookup for each. The
new dispatcher model will not have a persistent database connection.
- The UUID is not available to the dispatcher whilst running the job, so
cannot be cross-referenced to logs inside the job.
- The UUID makes the final URL of individual test results overly long,
unmemorable and complex, especially as the test run is also given
a separate UUID in the old dispatcher model.
The new dispatcher creates a pipeline where every action within the
pipeline is guaranteed to have a unique level string which is strictly
sequential, related directly to the type of action and shorter than a
UUID. To make a pipeline result unique on a per instance basis, the only
requirement is that the result includes the JobID which is a sequential
number, passed to the job in the submission YAML. This could also have
been a UUID but the JobID is already a unique ID for this instance.
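A minimal sketch of the level idea: dotted, strictly sequential strings
per action, with the JobID making a result unique per instance. The
exact format used by the pipeline may differ:

def assign_levels(actions, prefix=''):
    # give each action a level like '1', '1.1', '1.2', '2', ...
    for index, action in enumerate(actions, start=1):
        level = '{}.{}'.format(prefix, index) if prefix else str(index)
        action['level'] = level
        assign_levels(action.get('children', []), level)


pipeline = [
    {'name': 'deploy', 'children': [{'name': 'download'}, {'name': 'mount'}]},
    {'name': 'boot'},
]
assign_levels(pipeline)
# a per-instance unique reference then only needs the sequential JobID,
# e.g. job 4077, action level 1.2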
When bundles are downloaded, the database query will need to assign a
UUID to that downloaded file but the file will also include the job
number and the query can also insert the source of the bundle in a
comment in the YAML. This will allow bundles to be uploaded to a different
instance using lava-tool without the risk of collisions.
It is also possible that the results could provide a link back to the
original job log file and other data - if the original server is visible
to users of the server to which the bundle was later uploaded.
Refactoring review criteria
The refactored dispatcher has different objectives to the original and
any assumptions in the old code must be thrown out. It is very easy to
fall into the old way of writing dispatcher code, so these criteria are
to help developers control the development of new code. Any of these
criteria can be cited in a code review as reasons for a review to be
improved.
Keep the dispatcher dumb
There is a temptation to make the dispatcher clever but this only
restricts the test writer from doing their own clever tests by hard
coding commands into the dispatcher codebase. If the dispatcher needs
some information about the test image, that information must be
retrieved from the job submission parameters, not by calculating
in the dispatcher or running commands inside the test image. Exceptions
to this are the metrics already calculated during download, like file
size and checksums. Any information about the test image which is
permanent within that image, e.g. the partition UUID strings or the
network interface list, can be identified by the process creating that
image or by a script which is run before the image is compressed and
made available for testing. If a test uses a tarball instead of an image,
the test must be explicit about the filesystem to use when
unpacking that tarball for use in the test as well as the size and
location of the partition to use.
LAVA will need to implement some safeguards for tests which still need
to deploy any test data to the media hosting the bootloader (e.g. fastboot,
SD card or UEFI) in order to avoid overwriting the bootloader itself.
Therefore, although SD card partitions remain available for LAVA tests
where no other media are supportable by the device, those tests can
only use tarballs and pre-defined partitions on the SD card. The
filesystem to use on those partitions needs to be specified by the test
writer.
Avoid defaults in dispatcher code
Constants and defaults are going to need an override somewhere for some
device or test, eventually. Code defensively and put constants into
the utilities module to support modification. Put defaults into the
YAML, not the python code. It is better to have an extra line in the
device_type than a string in the python code as this can later be
extended to a device or a job submission.
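A sketch of the intended pattern, with illustrative names: the constant
lives in a utilities module as a last resort and the YAML (device type,
device or job submission) provides the override:

# utilities module: fallback only; expect the YAML to override it
BOOTLOADER_PROMPT = 'U-Boot'


def bootloader_prompt(device, job_parameters):
    # prefer the job submission, then the device YAML, then the constant
    return (
        job_parameters.get('bootloader_prompt')
        or device.get('parameters', {}).get('bootloader_prompt')
        or BOOTLOADER_PROMPT
    )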
Let the test fail and diagnose later
Avoid guessing in LAVA code. If any operation in the dispatcher
could follow more than one path, those paths must be made explicit to the
test writer. Report the available data, proceed according to the job
definition and diagnose the state of the device afterwards, where
appropriate.
Avoid trying to be helpful in the test image. Anticipating an error
and trying to code around it is a mistake. Possible solutions include
but are not limited to:
- Provide an optional, idempotent, class which only acts if a specific
option is passed in the job definition. e.g. AutoLoginAction.
- Provide a diagnostic class which triggers if the expected problem
arises. Report on the actual device state and document how to improve
the job submission to avoid the problem in future.
- Split the deployment strategy to explicitly code for each possible
path.
AutoLogin is a good example of the problem here. For too long, LAVA has
made assumptions about the incoming image, requiring hacks like
linaro-overlay packages to be added to basic bootstrap images or
disabling passwords for the root user. These helpful steps act to
make it harder to use unchanged third party images in LAVA tests.
AutoLogin is the de facto default for non-Linaro images.
Another example is the assumption in various parts of LAVA that the
test image will raise a network interface and repeatedly calling ping
on the assumption that the interface will appear, somehow, eventually.
Treat the deployment as a black box
LAVA has claimed to do this for a long time but the refactored
dispatcher is pushing this further. Do not think of the LAVA scripts
as an overlay, the LAVA scripts are extensions. When a test wants
an image deployed, the LAVA extensions should be deployed alongside the
image and then mounted to create a /lava-$hostname/ directory. Images
for testing within LAVA are no longer broken up or redeployed but must
be deployed intact. This avoids LAVA needing to know anything about
issues like SELinux or specific filesystems but may involve multiple
images for systems like Android where data may exist on different physical
devices.
Only protect the essential components
LAVA has had a tendency to hardcode commands and operations and there
are critical areas which must still be protected from changes in the
test but these critical areas are restricted to:
- The dispatcher.
- Unbricking devices.
Any process which has to run on the dispatcher itself must be
fully protected from mistakes within tests. This means that all
commands to be executed by the dispatcher are hardcoded into the dispatcher
python code with only limited support for overriding parameters or
specifying tainted user data.
Tests are prevented from requiring new software to be installed on any
dispatcher which is not already a dependency of lava-dispatcher.
Issues arising from this need to be resolved using MultiNode.
Until such time as there is a general and reliable method of deploying
and testing new bootloaders within LAVA tests, the bootloader / firmware
installed by the lab admin is deemed sacrosanct and must not be altered
or replaced in a test job. However, bootloaders are generally resilient
to errors in the commands, so the commands given to the bootloader remain
accessible to test writers.
It is not practical to scan all test definitions for potentially harmful
commands. If a test inadvertently corrupts the SD card in such a way that
the bootloader is corrupted, that is an issue for the lab admins to
take up with the test submitter.
Give the test writer enough rope
Within the provisos of Only protect the essential components, the test writer
needs to be given enough rope and then let LAVA diagnose issues
after the event.
There is no reason to restrict the test writer to using LAVA commands
inside the test image - as long as the essential components remain
protected.
Examples:
- KVM devices need to protect the QEMU command line because these
commands run on the dispatcher
- VM devices running on an arndale do not need the command line
to be coded within LAVA. There have already been bug reports on this
issue.
Diagnostic subclasses report on the state of the device after some
kind of error. This reporting can include:
- The presence or absence of expected files (like /dev/disk/by-id/
or /proc/net/pnp).
- Data about running processes or interfaces, e.g. ifconfig
It is a mistake to attempt to calculate data about a test image - instead,
require that the information is provided and diagnose the actual
information if the attempt to use the specified information fails.
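A hedged sketch of a Diagnostic class along these lines; it only
reports, never repairs, and the trigger attribute and method names here
are illustrative rather than the actual API:

class DiagnoseNetwork:
    # triggered after a failure to report interface state, nothing more
    trigger = 'network-unreachable'   # matched against the error raised

    def run(self, connection):
        # report the available data; do not attempt to repair the device
        connection.sendline('ip addr')
        connection.sendline('cat /proc/net/pnp')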
Guidance
- If the command is to run inside a deployment, require that the
full command line can be specified by the test writer. Remember:
Avoid defaults in dispatcher code. It is recommended to have default commands where
appropriate but these defaults need to support overrides in the job
submission. This includes using a locally built binary instead of an
executable installed in /usr/bin or similar.
- If the command is run on a dispatcher, require that the binary
to be run on the dispatcher is actually installed on the dispatcher.
If /usr/bin/git does not exist, this is a validation error. There
should be no circumstances where a tool required on the dispatcher
cannot be identified during validation of the pipeline.
- An error from running the command on the dispatcher with user-specified
parameters is a JobError.
- Where it is safe to do so, offer overrides for supportable
commandline options.
The codebase itself will help identify how much control is handed over
to the test writer. self.run_command() is a dispatcher call and
needs to be protected. connection.sendline() is a deployment
call and does not need to be protected.
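A sketch of that distinction with illustrative helper functions; the
validation check and error handling are examples of the policy, not the
actual dispatcher code:

import shutil
import subprocess


def run_command(argv):
    # dispatcher-side execution: hardcoded binary, validated before the job runs
    binary = argv[0]
    if shutil.which(binary) is None:
        # a tool missing on the dispatcher is a validation error
        raise RuntimeError('validation error: {} not installed'.format(binary))
    # a non-zero exit with user-specified parameters would be a JobError
    return subprocess.check_output(argv)


def send_test_command(connection, command_line):
    # device-side execution: the test writer controls the full command line
    connection.sendline(command_line)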
Providing gold standard images
Test writers are strongly recommended to only use a known working
setup for their job. A set of gold standard jobs will be defined in
association with the QA team. These jobs will provide a known baseline
for test definition writers, just as the existing QA test definitions
provide a base for more elaborate testing.
There will be a series of images provided for as many device types as
practical, covering the basic deployments. Test definitions will be
required to be run against these images before the LAVA team will spend
time investigating bugs arising from tests. These images will provide a
measure of reassurance around the following issues:
- Kernel fails to load NFS or ramdisk.
- Kernel panics when asked to use secondary media.
- Image containing a different kernel to the gold standard fails
to deploy.
Note
It is imperative that test writers understand that a gold
standard deployment for one device type is not necessarily
supported for a second device type. Some devices will
never be able to support all deployment methods due to
hardware constraints or the lack of kernel support. This is
not a bug in LAVA.
If a particular deployment is supported but not stable on a
device type, there will not be a gold standard image for that
deployment. Any issues in the images using such deployments
on that type are entirely down to the test writer to fix.
The refactoring will provide Diagnostic subclasses which point at
these issues and recommend that the test is retried using the standard
kernel, dtb, initramfs, rootfs and other components.
The reason to give developers enough rope is precisely so that kernel
developers are able to fix issues in the test images before problems
show up in the gold standard images. Test writers need to work with the
QA team, using the gold standard images.
Creating a gold standard image
Part of the benefit of a standard image is that the methods for building
the image - and therefore the methods for updating it, modifying it and
preparing custom images based upon it - must be documented clearly.
Where possible, standard tools familiar to developers of the OS concerned
should be used, e.g. debootstrap for Debian based images. The image can
also be a standard OS install. Gold standard images are not “Linaro”
images and should not require Linaro tools. Use AutoLogin support where
required instead of modifying existing images to add Linaro-specific
tools.
All gold standard images need to be kept up to date with the base OS as
many tests will want to install extra software on top and it will waste
time during the test if a lot of other packages need to be updated at
the same time. An update of a gold standard image still needs to be
tested for equivalent or improved performance compared to the current
image before replacing it.
The documentation for building and updating the image needs to be
provided alongside the image itself as a README. This text file should
also be reproduced on a wiki page and contain a link to that page. Any
wiki can be used - if a suitable page does not already exist elsewhere,
use wiki.linaro.org.
Other gold standard components
The standard does not have to be a complete OS image - a kernel with a
DTB (and possibly an initrd) can also count as a standard ramdisk image.
Similarly, a combination of kernel and rootfs can count as a standard
NFS configuration.
The same requirement exists for documenting how to build, modify and
update all components of the “image”, and the set of components needs
to be tested as a whole to represent a test using the standard.
Connections
A Connection is approximately equivalent to an automated login session
on the device or within a virtual machine hosted by a device.
Each connection needs to be supported by a TestJob; the output of each
connection is viewed as the output of that TestJob.
Typically, LAVA provides a serial connection to the board but other
connections can be supported, including SSH or USB. Each connection
method needs to be supported by software in LAVA, services within the
software running on the device and other infrastructure, e.g. a serial
console server.
Note
Avoid defaults in dispatcher code - although serial is the traditional and
previously default way of connecting to LAVA devices, it must be
specified in the test job YAML.
The action which is responsible for creating the connection must
specify the connection method.
- boot:
    method: qemu
    media: tmpfs
    connection: serial
    failure_retry: 2
Support for particular connection methods needs to be implemented at a
device level, so the device also declares support for particular
connection methods.
deploy:
  methods:
    tftp
    ssh
boot:
  connections:
    - serial
    - ssh
  methods:
    qemu:
Most devices are capable of supporting SSH connections, as long as:
- the device can be configured to raise a usable network interface
- the device is booted into a suitable software environment
USB connections are planned for Android support but are not yet
implemented.
Primary and Secondary connections
Primary connection
A Primary Connection is roughly equivalent to having an SSH login on a
running machine. The device needs to be powered on, running an appropriate
daemon and with appropriate keys enabled for access. The TestJob for
a primary connection then skips the deploy stage and uses a boot method
to establish the connection. A device providing a primary connection
in LAVA only provides access to that connection via a single submitted
TestJob at a time - a Multinode job can make multiple connections but
other jobs will see the device as busy and not be able to start their
connections.
Warning
All primary connections raise issues of
Persistence - the test writer is solely responsible
for deleting any sensitive data copied, prepared or
downloaded using a primary connection. Do not leave
sensitive data for the next TestJob to find.
It is not necessarily required that a device offering a primary
connection is permanently powered on as the only connections being
made to the device are done via the scheduler which ensures that only
one TestJob can use any one device at a time. Depending on the amount
of time required to boot the device, it is supported to have a device
offering primary connections which is powered down between jobs.
A Primary Connection is established by the dispatcher and is therefore
constrained in the options which are available to the client requesting
the connection and the TestJob has no control over the arguments
passed to the daemon.
Both Primary and Secondary connections are affected by Security
issues due to the requirements of automation.
Secondary connection
Secondary connections are a way to have two simultaneous connections
to the same physical device, equivalent to two logins. Each connection
needs to be supported by a TestJob, so a Multinode group needs to be
created so that the output of each connection can be viewed as the output
of a single TestJob, just as if you had two terminals. The second
connection does not have to use the same connection method as the current
connection and many devices can only support secondary connections over
a network interface, for example SSH or telnet.
A Secondary Connection has a deploy step and the device is already
providing output over the primary connection, typically serial, before
the secondary connection is established. This is closer to having the
machine on your desk. The TestJob supplies the kernel and rootfs or
image to boot the device and can optionally use the secondary connection
to push other files to the device (for example, an ssh secondary
connection would use scp).
A Secondary Connection can have control over the daemon via the deployment
using the primary connection. The client connection is still made by the
dispatcher.
Both Primary and Secondary connections are affected by Security
issues due to the requirements of automation.
The device providing a Secondary Connection is running a TestJob and
the deployment will be erased when the job completes.
Connections and hacking sessions
A hacking session using a Secondary connection is the only
situation where the client is configurable by the user and the
daemon can be controlled by the test image. It is possible to adjust
the hacking session test definitions to use different commands and
options - as long as both daemon and client use compatible options.
As such, a hacking session user retains security over their private
keys at the cost of the loss of automation.
Hacking sessions can be used with primary or secondary connections,
depending on the use case.
Warning
Remember that in addition to issues related to the
Persistence of a primary connection device, hacking
sessions on primary connections also have all of the issues
of a shared access device - do not copy, prepare or download
sensitive data when using a shared access device.
Devices supporting Primary Connections
A device offering a primary connection needs a particular configuration
in the device dictionary table:
- Only primary connection deployment methods defined in the
  deploy_methods parameter, e.g. ssh.
- Support in the device_type template to replace the list of deployment
methods with the list supplied in the deploy_methods parameter.
- No serial connection support in the boot connections list.
- No methods in the boot parameters.
This prevents other jobs being submitted which would cause the device
to be rebooted or have a different deployment prepared. This can be
further enhanced with device tag support.
Devices supporting Secondary Connections
There are fewer requirements of a device supporting secondary
connections:
- Primary and Secondary connections are mutually exclusive, so one
device cannot serve primary and secondary.
- The physical device must support the connection hardware requirements.
- The test image deployed needs to install and run the software
  required by the connection; a failure to do so is a
  JobError Exception.
SSH as the primary connection
Certain devices can support SSH as the primary connection - the
filesystems on such devices are not erased at the end of a TestJob and
provide Persistence for certain tasks. (This is the equivalent
of the dummy-ssh device in the old dispatcher.) These devices declare
this support in the device configuration:
deploy:
  # primary connection device has only connections as deployment methods
  methods:
    ssh
boot:
  connections:  # not serial
    - ssh
TestJobs then use SSH as a boot method which simply acts as a login to
establish a connection:
- deploy:
    to: ssh
    os: debian
- boot:
    method: ssh
    connection: ssh
    failure_retry: 2
The deploy action in this case simply prepares the LAVA overlay
containing the test shell definitions and copies those to a
pre-determined location on the device. This location will be removed
at the end of the TestJob.
Security
A primary SSH connection from the dispatcher needs to be controlled through
the device configuration, allowing the use of a private SSH key which
is at least hidden from test writers. (Only protect the essential components).
The key is declared as a path on the dispatcher, so it is device-specific.
Devices on the same dispatcher can share the same key or each have a
unique key - no key can have a passphrase - as long as all devices
supported by the SSH host have the relevant keys configured as
authorized for login as root.
LAVA provides a default (completely insecure) private key which can be
used for these connections. This key is installed within lava-dispatcher
and is readable by anyone inspecting the lava-dispatcher codebase in git.
(This has not been changed in the refactoring.)
It is conceivable that a test image could be suitably configured before
being submitted to LAVA, with a private key included inside a second job
which deploys normally and executes the connection instead of
running a test definition. However, anyone with access to the test image
would still be able to obtain the private key. Keys generated on a per
job basis would still be open for the lifetime of the test job itself,
up to the job timeout specified. Whilst this could provide test writers
with the ability to control the options and commands used to create the
connection, any additional security is minimal and support for this has
not been implemented, yet.
Persistence
Devices supporting primary SSH connections have persistent deployments
and this has implications, some positive, some negative - depending on
your use case.
- Fixed OS - the OS you get is the OS of the device and this
must not be changed or upgraded.
- Package interference - if another user installs a conflicting
package, your test can fail.
- Process interference - another process could restart (or crash)
a daemon upon which your test relies, so your test will fail.
- Reusable scripts - scripts and utilities your test leaves behind
  can be reused by (or can interfere with) subsequent tests.
- Lack of reproducibility - an artifact from a previous test can
  make it impossible to rely on the results of a subsequent test, leading
  to wasted effort with false positives and false negatives.
Only use persistent deployments when essential and always take
great care to avoid interfering with other tests. Users who deliberately
or frequently interfere with other tests can have their submit privilege
revoked.
Disposable chroot deployments
Some devices can support mechanisms like LVM snapshots which allow
for a self-contained environment to be unpacked for a single session
and then discarded at the end of the session. These deployments do not
suffer the same entanglement issues as simple SSH deployments and can
provide multiple environments, not just the OS installed on the SSH
host system.
This support is similar to how distributions can offer “porter boxes”
which allow upstream teams and community developers to debug platform
issues in a native environment. It also allows tests to be run on a
different operating system or different release of an operating system.
Unlike distribution “porter boxes”, LAVA does not allow more than one
TestJob to have access to any one device at the same time.
A device supporting disposable chroots will typically follow the
configuration of Devices supporting Primary Connections. The device
will show as busy whenever a job is active, but although it is
possible to use a secondary connection as well, the deployment
methods of the device would have to disallow access to the media upon
which the chroots are installed or deployed or upon which the software
to manage the chroots is installed. e.g. a device offering disposable
chroots on SATA could offer ramdisk or NFS tests.
LAVA support for disposable chroots is implemented via schroot
(forming the replacement for the dummy-schroot device in the old
dispatcher).
Typical device configuration
deploy:
  # list of deployment methods which this device supports
  methods:
    ssh:
    schroot:
      - unstable
      - trusty
      - jessie
boot:
  connections:
    - ssh
Optional device configuration allowing secondary connections:
deploy:
  # list of deployment methods which this device supports
  methods:
    tftp:
    ssh:
    schroot:
      - unstable
      - trusty
      - jessie
boot:
  connections:
    - serial
    - ssh
The test job YAML would simply specify:
- deploy:
    to: ssh
    chroot: unstable
    os: debian
- boot:
    method: ssh
    connection: ssh
    failure_retry: 2
Note
The OS still needs to be specified; LAVA does not guess based on the
chroot name. There is nothing to stop an schroot being named testing
but actually being upgraded or replaced with something else.
The deployment of an schroot involves unpacking the schroot into a
logical volume with LVM. It is an InfrastructureError Exception
if this step fails, for example if the volume group has insufficient
available space.
schroot also supports directories and tarballs but LVM is recommended
as it avoids problems of Persistence.
Using secondary connections with VM groups
One example of the use of a secondary connection is to launch a VM on
a device already running a test image. This allows the test writer to
control both the kernel on the bare metal and the kernel in the VM.
The implementation of VMGroups created a role for a delayed start
Multinode job. This would allow one job to operate over serial, publish
the IP address, start an SSH server and signal the second job that a
connection is ready to be established. This may be useful for situations
where a debugging shell needs to be opened around a virtualisation
boundary.
There is an option for downloading or preparing the guest VM image on the
host device within a test shell, prior to the VM delayed start. Alternatively,
a deploy stage can be used which would copy a downloaded image from the
dispatcher to the host device.
Each connection is a different job in a multinode group so that the output
of each connection is tracked separately and can be monitored separately.
Sequence
The host device is deployed with a test image and booted.
LAVA then manages the download of the files necessary to create
the secondary connection.
- e.g. for QEMU, this would be a bootable image file
LAVA also creates a suitable overlay containing the test definitions
to be run inside the virtual machine.
The test image must start whatever servers are required to
provide the secondary connections, e.g. ssh. It does not matter
whether this is done using install steps in the test definition or
pre-existing packages in the test image or manual setup. The server
must be configured to allow the (insecure) LAVA automation SSH
private key to login as authorized - this key is available in the
/usr/lib/python2.7/dist-packages/lava_dispatcher/device/dynamic_vm_keys
directory when lava-dispatcher is installed or in the lava-dispatcher
git tree.
The test image on the host device starts a test definition over the
existing (typically serial) connection. At this point, the image file
and overlay for the guest VM are available on the host for the
host device test definition to inspect, although only the image
file should actually be modified.
The test definition includes a signal to the LAVA MultiNode API
which allows the VM to start. The signal includes an identifier for
which VM to start, if there is more than one.
The second job in the multinode group waits until the signal is
received from the coordinator. Upon receipt of the signal, the
lava dispatch process running the second job will initiate the
secondary connection to the host device, e.g. over SSH, using the
specified private key. The connection is used to run a set of
commands in the test image running on the host device. It is a
TestError if any of these commands fail. The last of these commands
must hold the connection open for as long as the test writer
needs to execute the task inside the VM. Once those tasks are
complete, the test definition running in the test image on the host
device signals that the VM has completed.
The test writer is given full control over the commands issued inside the
test image on the host device, including those commands which are responsible
for launching the VM. The test writer is also responsible for making the
overlay available inside the VM. This could be by passing arguments
to the commands to mount the overlay alongside the VM or by unpacking
the overlay inside the VM image before calling QEMU. If set in the job
definition, the test writer can ask LAVA to unpack the overlay inside the
image file for the VM and this will be done on the host device before
the host device boots the test image - however, this will require an
extra boot of the host device, e.g. using the dynamic master support.
Basic use cases
Prebuilt files (kernel, ramdisk, dtb, rootfs or a complete image) can be
downloaded. These will be downloaded to the host device and the
paths to these files substituted into the commands issued to start the
VM, in the same way as with a bootloader like u-boot. This provides support
for tests within the VM using standard, packaged tools. To simplify
these tests further, it is recommended to use NFS for the root
filesystem of the host device boot - it leads to a quicker deployment
as the files for the VM can be downloaded directly to the NFS share
by the dispatcher. Deployments of the host device system to secondary
media, e.g. SATA, require additional steps and the job will take
longer to get to a point where the VM can be started.
The final launch of the VM will occur using a shell script (which will
then be preserved in the results alongside the overlay), containing the
parsed commands.
Advanced use cases
It is possible to use a test shell to build files to be used when
launching the VM. This allows for a test shell to operate on the
host device, building, downloading or compiling whatever files are
necessary for the operation of the VM, directly controlled by the
test shell.
To avoid confusion and duplication, LAVA does not support downloading
some files via the dispatcher and some via the test shell. If there
are files needed for the test job which are not to be built or generated
within the test shell, the test shell will need to use wget or
curl or some other tool present in the test image to obtain the
files. This also means that LAVA is not able to verify that such
URLs are correct during the validation of the job, so test writers need
to be aware that LAVA will not be able to fail a job early if the URL
is incorrect as would happen in the basic use case.
Any overlay containing the test definitions and LAVA test scripts which
are to be executed inside the VM after the VM has booted still needs to
be downloaded from the dispatcher. The URL of this overlay (a single
tarball containing all files in a self-contained directory) will be
injected into the test shell files on the host device, in a similar
way to how the MultiNode API provides dynamic data from other
devices in the group.
The test writer is responsible for extracting this tarball so that it
is present or is bind mounted into the root directory of the VM so that
the scripts can be launched immediately after login.
The test shell needs to create the final shell script, just as the
basic use case does. This allows the dispatcher running the VM to connect
to the host device and use a common interface to launch the VM in each
use case.
LAVA initiates and controls the connection to the VM, using this script,
so that all output is tracked in the multinode job assigned to the VM.
Sample job definition for the VM job
# second half of a new-style VM group job
# each connection is a different job
# even if only one physical device is actually powered up.
device_type: kvm-arm
job_name: wandboard-qemu
timeouts:
  job:
    minutes: 15
  action:
    minutes: 5
priority: medium
target_group: asd243fdgdfhgf-45645hgf
group_size: 2
parameters:
  # the test definition on the host device manages how
  # the overlay is applied to the VM image.
  overlay: manual  # use automatic for LAVA to do the overlay
  # An ID appended to the signal to start this VM to distinguish
  # it from any other VMs which may start later or when this one
  # completes.
  vm_id: gdb_session
actions:
  - boot:
      # as kvm-arm, this happens in a test image via
      # the other half of this multinode job
      timeout:
        minutes: 3
      # alternative to u-boot
      connection: ssh
      method: vm
      # any way to launch a vm
      commands:
        # full access to the commands to run on the other device
        - qemu-system-arm -hda {IMAGE}
      type: qemu
  - test:
      name: kvm-basic-singlenode
      timeout:
        minutes: 5
      definitions:
        - repository: git://git.linaro.org/qa/test.git
          from: git
          path: ubuntu/smoke-tests-basic.yaml
          name: smoke-tests
Device configuration design
Device configuration, as received by lava_dispatch has moved to YAML
and the database device configuration has moved to Jinja2 templates.
This method has a much larger scope of possible methods, related to the
pipeline strategies as well as allowing simple overrides and reuse of
common device configuration stanzas.
There is no need for the device configuration to include the
hostname in the YAML as there is nothing on the dispatcher to check
against - the dispatcher uses the command line arguments and the
supplied device configuration. The configuration includes all the data
the dispatcher needs to be able to run the job on the device attached
to the specified ports.
The device type configuration on the dispatcher is replaced by a
device type template on the server which is used to generate the
YAML device configuration sent to the dispatcher.
Device Dictionary
The normal admin flow for individual devices will be to make changes
to the device dictionary of that device. In time, an editable
interface will exist within the admin interface. Initially, changes
to the dictionary are made from the command line with details being
available in a read-only view in the admin interface.
The device dictionary acts as a set of variables inside the template,
in a very similar manner to how Django handles HTML templates. In turn,
a device type template will extend a base template.
It is a bug in the template if a missing value causes a broken device
configuration to be generated. Values which are not included in the
specified template will be ignored.
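A small illustration of this model using plain jinja2; the file names,
variable name and template bodies are examples only:

from jinja2 import DictLoader, Environment

templates = {
    'base.yaml': 'commands:\n  connect: {{ connection_command }}\n',
    'panda.yaml': "{% extends 'base.yaml' %}",
    # the device dictionary is itself a template extending the device type
    'mypanda': ("{% extends 'panda.yaml' %}"
                "{% set connection_command = 'telnet droopy 4001' %}"),
}

env = Environment(loader=DictLoader(templates))
print(env.get_template('mypanda').render())
# a value missing from the dictionary renders as blank output rather
# than an error - a broken configuration caused by a missing value is
# a bug in the template, not the dictionary

Rendering the device dictionary through the templates produces the YAML
device configuration sent to the dispatcher.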
Once the device dictionary has been populated, the scheduler can be
told that the device is a pipeline device in the admin interface.
Note
Several parts of this process still need helpers and tools
or may give unexpected errors - there is a lot of ongoing
work in this area.
Exporting an existing device dictionary
If the local instance has a working pipeline device called mypanda,
the device dictionary can be exported:
$ sudo lava-server manage device-dictionary --hostname mypanda --export
{% extends 'panda.yaml' %}
{% set power_off_command = '/usr/bin/pduclient --daemon tweetypie --hostname pdu --command off --port 08' %}
{% set hard_reset_command = '/usr/bin/pduclient --daemon tweetypie --hostname pdu --command reboot --port 08' %}
{% set connection_command = 'telnet droopy 4001' %}
{% set power_on_command = '/usr/bin/pduclient --daemon tweetypie --hostname pdu --command on --port 08' %}
This dictionary declares that the device inherits the rest of the device
configuration from the panda device type. Settings specific to this
one device are then specified.
Reviewing an existing device dictionary
To populate the full configuration using the device dictionary and the
associated templates, use the review option:
$ sudo lava-server manage device-dictionary --hostname mypanda --review
Example device configuration review
device_type: beaglebone-black
commands:
  connect: telnet localhost 6000
  hard_reset: /usr/bin/pduclient --daemon localhost --hostname pdu --command reboot --port 08
  power_off: /usr/bin/pduclient --daemon localhost --hostname pdu --command off --port 08
  power_on: /usr/bin/pduclient --daemon localhost --hostname pdu --command on --port 08
parameters:
  bootm:
    kernel: '0x80200000'
    ramdisk: '0x81600000'
    dtb: '0x815f0000'
  bootz:
    kernel: '0x81000000'
    ramdisk: '0x82000000'
    dtb: '0x81f00000'
actions:
  deploy:
    # list of deployment methods which this device supports
    methods:
      # - image # not ready yet
      - tftp
  boot:
    # list of boot methods which this device supports.
    methods:
      - u-boot:
          parameters:
            bootloader_prompt: U-Boot
            boot_message: Booting Linux
            send_char: False
            # interrupt: # character needed to interrupt u-boot, single whitespace by default
          # method specific stanza
          oe:
            commands:
              - setenv initrd_high '0xffffffff'
              - setenv fdt_high '0xffffffff'
              - setenv bootcmd 'fatload mmc 0:3 0x80200000 uImage; fatload mmc 0:3 0x815f0000 board.dtb; bootm 0x80200000 - 0x815f0000'
              - setenv bootargs 'console=ttyO0,115200n8 root=/dev/mmcblk0p5 rootwait ro'
              - boot
          nfs:
            commands:
              - setenv autoload no
              - setenv initrd_high '0xffffffff'
              - setenv fdt_high '0xffffffff'
              - setenv kernel_addr_r '{KERNEL_ADDR}'
              - setenv initrd_addr_r '{RAMDISK_ADDR}'
              - setenv fdt_addr_r '{DTB_ADDR}'
              - setenv loadkernel 'tftp ${kernel_addr_r} {KERNEL}'
              - setenv loadinitrd 'tftp ${initrd_addr_r} {RAMDISK}; setenv initrd_size ${filesize}'
              - setenv loadfdt 'tftp ${fdt_addr_r} {DTB}'
              # this could be a pycharm bug or a YAML problem with colons. Use : for now.
              # alternatively, construct the nfsroot argument from values.
              - setenv nfsargs 'setenv bootargs console=ttyO0,115200n8 root=/dev/nfs rw nfsroot={SERVER_IP}:{NFSROOTFS},tcp,hard,intr ip=dhcp'
              - setenv bootcmd 'dhcp; setenv serverip {SERVER_IP}; run loadkernel; run loadinitrd; run loadfdt; run nfsargs; {BOOTX}'
              - boot
          ramdisk:
            commands:
              - setenv autoload no
              - setenv initrd_high '0xffffffff'
              - setenv fdt_high '0xffffffff'
              - setenv kernel_addr_r '{KERNEL_ADDR}'
              - setenv initrd_addr_r '{RAMDISK_ADDR}'
              - setenv fdt_addr_r '{DTB_ADDR}'
              - setenv loadkernel 'tftp ${kernel_addr_r} {KERNEL}'
              - setenv loadinitrd 'tftp ${initrd_addr_r} {RAMDISK}; setenv initrd_size ${filesize}'
              - setenv loadfdt 'tftp ${fdt_addr_r} {DTB}'
              - setenv bootargs 'console=ttyO0,115200n8 root=/dev/ram0 ip=dhcp'
              - setenv bootcmd 'dhcp; setenv serverip {SERVER_IP}; run loadkernel; run loadinitrd; run loadfdt; {BOOTX}'
              - boot
Importing configuration using a known template
To add or update the device dictionary, a file using the same syntax as
the export content can be imported into the database:
$ sudo lava-server manage device-dictionary --hostname mypanda --import mypanda.yaml
(The file extension is unnecessary and the content is not actually YAML
but will be rendered as YAML when the templates are used.)
Creating a new template
Start with the base.yaml template and use the structure of that
template to ensure that your template remains valid YAML.
Start with a complete device configuration (in YAML) which works on the
lava-dispatch command line, then iterate over changes in the template
to produce the same output.
Note
A helper is being planned for this step.
Running lava-dispatch directly
lava-dispatch only accepts a YAML file for pipeline jobs. The old
behaviour of looking up the device configuration from the device
hostname has been dropped: the absolute or relative path to the device
configuration file must be given to the --target option, and the path
to the job YAML file is passed as an argument. --output-dir must also be
specified:
sudo lava-dispatch --target devices/fred.conf panda-ramdisk.yaml --output-dir=/tmp/test