Summary and Setup
Prerequisite
This tutorial guides you through the fundamentals of designing and building an analysis workflow. It assumes no previous knowledge or experience of workflows or Common Workflow Language (CWL), but does assume some experience with the Unix command line.
Before following this tutorial, you should be comfortable working in
a Unix command line environment and familiar with fundamental commands
(cd, mv, mkdir, etc.), piping and redirection, and simple Bash
scripting, such as might be gained from following the Software
Carpentry lesson, The Unix Shell.
You might also have some experience with running tasks on a remote
machine (by ssh connection) and in a cluster (high performance
computing) environment.
CWL is based upon YAML. If at any time you find yourself confused by the YAML syntax, consider reviewing this guide on the subset of YAML used in CWL.
If you have previously written a workflow description, in CWL or another language, you may want to look instead at the User Guide.
Target Audience
This tutorial is aimed at researchers and research software engineers who would like to begin automating their analyses in workflows.
If you’re unsure whether this tutorial is a good fit for you, check the prerequisites listed above.
You may also find our learner profiles helpful.
These are also a useful resource during the lesson design process.
OS Setup
These lessons assume that you are using the freely available Visual
Studio Code application with the Benten extension along with the CWL
reference runner (cwltool).
This tutorial requires three pieces of software to run and visualize the workflows: Docker, cwltool, and graphviz.
Please follow instructions for your OS by clicking on the relevant link below.
Windows Setup
WSL2 Installation
Prerequisite
Ensure you’re running Windows 10 or higher.
To check your Windows version and build number, press the Windows
logo key + R, type winver, and select OK. You can update to the latest
Windows version by selecting “Start” > “Settings” > “Windows Update” >
“Check for updates”.
You are also expected to install the ‘Terminal’ app from the Microsoft Store.
Follow the WSL installation instructions.
You may also wish to go through Getting started with WSL2.
Choosing your Linux Flavour
For this tutorial, we expect you to use Ubuntu as your WSL2 distribution.
Confirm WSL2 is installed
Open PowerShell as Administrator and list the installed distributions, for example with wsl --list --verbose.
You should see the Linux distribution you installed in the previous step.
Installing apt tools
For this tutorial, we will require a few Linux tools to be installed.
Open the ‘Terminal’ app and open a new tab for the Ubuntu distribution you have just installed.
Install Docker Desktop
Install Docker Desktop by following the instructions on the Docker Desktop Installation Page
- Accept the terms and conditions, if prompted
- Wait for Docker Desktop to finish starting
- Skip the tutorial, if prompted
VSCode Installation
Download and install VSCode
Install VSCode Extensions
Install the WSL Integration extension
Open WSL2 Extension in the marketplace
You should now see ‘WSL Targets’ under the ‘Remote Explorer’ tab on the left hand side of the screen.
Right-click Ubuntu to set it as the default distribution
Install Benten VSCode Extension
Open Benten in the marketplace and click the Install button.
If you are given the option to enable the extension on ‘WSL: Ubuntu’, please do so.
Install Red Hat YAML VSCode Extension
Open Red Hat YAML in the marketplace and click the Install button.
If you are given the option to enable the extension on ‘WSL: Ubuntu’, please do so.
Associate CWL files with the YAML file type
Add a files.associations entry that maps *.cwl to yaml in the VSCode user settings JSON, so that CWL files are treated as YAML.
Install cwltool (latest)
First we will make a Python virtual environment by running the following commands in the terminal.
BASH
python3 -m venv env # Create a virtual environment named 'env' in the current directory
source env/bin/activate # Activate the 'env' environment
You will know that this worked as the terminal prompt will now have
(env) at the beginning.
Reactivating the Python virtual environment
Every time you launch VS Code or launch a new terminal, you must run
source env/bin/activate to re-enable access to this Python virtual
environment.
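If you are not sure whether the environment is active in the current terminal, one way to check is to look at the VIRTUAL_ENV variable, which the activate script sets. The following is a minimal sketch; the function name venv_status is made up for this illustration and is not part of Python or cwltool.

```shell
# Report whether a Python virtual environment is active in this shell.
# The 'activate' script exports VIRTUAL_ENV; 'deactivate' unsets it.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: ${VIRTUAL_ENV}"
    else
        echo "inactive: run 'source env/bin/activate'"
    fi
}

venv_status
```

If the function reports inactive, run the activation command from the directory that contains the env folder.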
Next, install cwltool by running pip install cwltool in the terminal.
Continue to Confirm Software Installations after completing the setup for Windows.
MacOS Setup
VSCode Extensions
Install Benten Extension
Open Benten in the marketplace and click the Install button, or follow the directions.
Install Red Hat YAML VSCode Extension
Open Red Hat YAML in the marketplace and click the Install button.
Associate CWL files with the YAML file type
Add a files.associations entry that maps *.cwl to yaml in the VSCode user settings JSON, so that CWL files are treated as YAML.
Conda Setup
Install prerequisites via conda
Reactivating the conda virtual environment
The virtual environment needs to be activated every time you start a
terminal, using conda activate cwltutorial.
Continue to Confirm Software Installations after completing the setup for MacOS.
Linux Setup
VSCode extensions
Install Benten VSCode Extension
Open Benten in the marketplace and click the Install button.
Install Red Hat YAML VSCode Extension
Open Red Hat YAML in the marketplace and click the Install button.
Associate CWL files with the YAML file type
Add a files.associations entry that maps *.cwl to yaml in the VSCode user settings JSON, so that CWL files are treated as YAML.
Install Docker
Go to the Docker Engine installation instructions and select your Linux distribution (e.g. Ubuntu) to install Docker Engine.
Enable Docker usage as a non-root user
Follow the instructions in the Docker documentation to enable Docker usage as a non-root user.
You will need to log out and back in for this to take effect.
Install cwltool
First we will make a Python virtual environment by running the following commands in the terminal.
BASH
python3 -m venv env # Create a virtual environment named 'env' in the current directory
source env/bin/activate # Activate the 'env' environment
You will know that this worked as the terminal prompt will now have
(env) at the beginning.
Reactivating the Python virtual environment
Every time you launch VS Code or launch a new terminal, you must run
source env/bin/activate to re-enable access to this Python virtual
environment.
Next, install cwltool by running pip install cwltool in the terminal.
Later we will make visualisations of our workflows. To support that
we need to install graphviz.
Install graphviz
On Debian-based Linux systems, graphviz and other helpful programs are
installed with apt; for graphviz alone, sudo apt-get install graphviz
is enough.
For other Linux systems, check the graphviz download page.
Continue to Confirm Software Installations after completing the setup for Linux.
Confirm Software Installations
Docker
To confirm Docker is installed, run docker version to display the client and server version details.
You should see something similar to the output shown below.
Client: Docker Engine - Community
Version: 20.10.13
API version: 1.41
Go version: go1.16.15
Git commit: a224086
Built: Thu Mar 10 14:08:15 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.13
API version: 1.41 (minimum version 1.12)
Go version: go1.16.15
Git commit: 906f57f
Built: Thu Mar 10 14:06:05 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.5.10
GitCommit: 2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
runc:
Version: 1.0.3
GitCommit: v1.0.3-0-gf46b6ba
docker-init:
Version: 0.19.0
GitCommit: de40ad0
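Beyond Docker, the same kind of check can be applied to all three tools this tutorial relies on. The sketch below assumes the executables are named docker, cwltool and dot (the Graphviz layout program); check_tools is a hypothetical helper, not part of any of these packages.

```shell
# Print one line per tool saying whether it is on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# 'dot' is the Graphviz executable used to draw workflow graphs.
check_tools docker cwltool dot || echo "Install the missing tools before continuing."
```

This only confirms that the commands can be found; it does not check that the Docker daemon is running.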
Containers Requirements
To avoid having to wait during the class, please run the following which will download all the required software containers.
GitHub Requirements
This part assumes you have a GitHub.com account.
If you do not have a GitHub.com account, you can create one on GitHub.com.
The following steps will take you through the process of creating an SSH key pair to connect your terminal to GitHub. If you already have this set up, you may skip this part.
Opening files from the terminal
Whenever you are required to open a file from the terminal, you may
type code /path/to/file.
This will also create the file if it does not yet exist.
SSH Key
After creating an account you will need to generate an SSH key so that you can pull and push code from your terminal to GitHub.
In your terminal, run the following commands:
# Create the .ssh directory, accessible only by you
mkdir -p -m 700 ~/.ssh
# Generate an ed25519 key pair at ~/.ssh/github with an empty passphrase
ssh-keygen -f ~/.ssh/github -t ed25519 -q -N ""
Add your public key to GitHub
Head to GitHub SSH and GPG keys.
Click New SSH key.
Set the title to the name of your computer (this can be anything) and,
in the ‘Key’ field, paste the contents of ~/.ssh/github.pub from your
terminal.
Data Requirements
You will need to download some example files for this lesson. In this tutorial we will use RNA sequencing data.
Setting up a practice Git repository
For this tutorial some existing tools are needed to build the workflow. These existing tools will be imported via GitHub.
Make a repository on GitHub.com
Head to github.com/new and create a new repository.
Name it novice-tutorial-exercises.
Make sure the repository is public.
Check ‘Add a README file’, and choose GNU General Public License v3.0
as your license.
Downloading sample and reference data
Create a new directory inside the novice-tutorial-exercises directory
and download the data:
Using subshells
Wrapping the following chunk in parentheses runs it in a subshell, so the console returns to the original working directory after the download is complete.
BASH
mkdir rnaseq
(
cd rnaseq
wget --quiet https://zenodo.org/record/4541751/files/GSM461177_1_subsampled.fastqsanger
wget --quiet https://zenodo.org/record/4541751/files/GSM461177_2_subsampled.fastqsanger
wget --quiet https://zenodo.org/record/4541751/files/Drosophila_melanogaster.BDGP6.87.gtf
wget --quiet https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
gunzip dm6.fa.gz # STAR index requires an uncompressed reference genome
)
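To see the subshell behaviour in isolation before downloading anything, the following sketch changes directory inside parentheses and shows that the outer shell does not move. It uses a throwaway directory from mktemp, and the variable names are illustrative only.

```shell
start_dir=$(pwd)
scratch=$(mktemp -d)        # throwaway directory for the demonstration

(
    cd "$scratch"
    echo "inside the subshell: $(pwd)"
)

# The cd happened in a child shell, so the parent shell never moved.
echo "back in the parent:  $(pwd)"
rmdir "$scratch"
```

The second line printed should show the directory you started in, not the scratch directory.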
Ignore data files in Git
Git is not designed to track large data files.
We should exclude the rnaseq folder from being tracked in this repository.
echo 'rnaseq/' >> .gitignore
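Because >> appends every time it runs, repeating the command above adds a duplicate line. One possible way to avoid this is sketched below; ignore_once is a hypothetical helper name, and the demonstration writes to a scratch directory rather than your repository.

```shell
# Append a pattern to an ignore file only if that exact line is not there yet.
ignore_once() {
    pattern=$1
    file=${2:-.gitignore}
    touch "$file"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$pattern" "$file" || echo "$pattern" >> "$file"
}

demo=$(mktemp -d)
ignore_once 'rnaseq/' "$demo/.gitignore"
ignore_once 'rnaseq/' "$demo/.gitignore"   # second call changes nothing
cat "$demo/.gitignore"
```

In the repository itself, the equivalent call would be ignore_once 'rnaseq/', which defaults to the .gitignore in the current directory.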
Add our changes to git
.gitignore is itself a file that Git tracks like any other, so it
needs to be added and committed.
Likewise, we need to let Git know that we would like to keep bio-cwl-tools as a submodule.
git add .gitignore bio-cwl-tools
git commit -m "Don't track files in rnaseq dir, added bio-cwl-tools as a submodule"
git push
STAR Genome index
To run the STAR aligner tool, index files generated from the reference genome are needed. The index files can either be downloaded or generated from the unindexed reference genome.
These two options are detailed below – choose the one most appropriate to your set-up.
At least 9 GB of memory is required to generate the index, which will occupy 3.3 GB of disk.
If your computer doesn’t have that much memory, then you can download
the index directory by running the following in the rnaseq directory:
To generate the genome index yourself, create a new file named
dm6-star-index.yaml in the novice-tutorial-exercises directory with
the following contents:
YAML
InputFiles:
  - class: File
    location: rnaseq/dm6.fa
    format: http://edamontology.org/format_1929  # FASTA
IndexName: 'dm6-STAR-index'
Overhang: 36
Gtf:
  class: File
  location: rnaseq/Drosophila_melanogaster.BDGP6.87.gtf
Next, use the CWL reference runner cwltool that you installed above,
together with the CWL description for the indexing mode of the STAR
aligner from the bio-cwl-tools directory, to index the genome and
place the result in the rnaseq directory alongside the other files:
cwltool --outdir rnaseq/ bio-cwl-tools/STAR/STAR-Index.cwl dm6-star-index.yaml
It should take 10-15 minutes for the index to be generated.
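Since the indexing run takes several minutes, it may be worth confirming beforehand that the files it reads are in place. The sketch below checks the three paths used in this section (the tool description and the two inputs named in dm6-star-index.yaml); missing_inputs is a hypothetical helper, not a cwltool feature.

```shell
# Print each path that is not an existing regular file.
# Returns non-zero if at least one path is missing.
missing_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Run from the novice-tutorial-exercises directory.
if missing_inputs bio-cwl-tools/STAR/STAR-Index.cwl \
                  rnaseq/dm6.fa \
                  rnaseq/Drosophila_melanogaster.BDGP6.87.gtf; then
    echo "inputs present; ready to run cwltool"
fi
```

The check only looks at whether the files exist; it says nothing about whether their contents are complete.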
STAR Index Memory Requirements
To generate the genome index you will need at least 9 GB of RAM.
If you do not allocate enough RAM the tool will not crash, but the process will stick on the following step:
... sorting Suffix Array chunks and saving them to disk...
If this step does not finish within 10 minutes then it is likely the process has failed, and should be cancelled.
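Before starting the index generation on Linux or inside WSL2, it can help to check how much memory the system actually reports. The sketch below assumes the Linux /proc/meminfo layout, so it does not apply to a macOS terminal; mem_total_gb is a hypothetical helper and the 9 GB threshold is taken from the requirement above. The result is rounded down, and the kernel reports slightly less than the installed amount, so treat it as a rough guide.

```shell
# Print total memory in whole GB, read from a meminfo-style file.
# Assumes the Linux layout: "MemTotal:   <number> kB".
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

required_gb=9
if [ -r /proc/meminfo ]; then
    total_gb=$(mem_total_gb)
    if [ "$total_gb" -lt "$required_gb" ]; then
        echo "Only ${total_gb} GB visible; consider downloading the index instead."
    else
        echo "${total_gb} GB visible; enough to try generating the index."
    fi
fi
```

Inside WSL2 or Docker Desktop the number reflects the memory allocated to the virtual machine, which may be lower than what the computer has installed.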
MacOS users can configure Docker Desktop to allocate more memory by opening “Preferences” and then selecting “Resources”.
Windows users can configure WSL 2 to allocate more memory by opening PowerShell and entering the following:
POWERSHELL
# turn off all wsl instances such as docker-desktop
wsl --shutdown
notepad "$env:USERPROFILE/.wslconfig"
In .wslconfig, add the following:
[wsl2]
memory=9GB
Save the file, then right-click the Docker icon in the notifications area (or system tray) and click “Restart Docker…”.
Validating your dataset
Let’s make sure the data we’ve downloaded is correct and in the right
structure.
We can do this with a checksum.
Setting the checksum
Under the rnaseq folder, create the file rnaseq.md5 with the following
contents.
BASH
3c8458cf67d71c22c4c420dd37e23ef2 Drosophila_melanogaster.BDGP6.87.gtf
036b68a2a8fe8725d48fb5fd89e2b8b2 GSM461177_1_subsampled.fastqsanger
87b09607057743ecf3e38448630421c9 GSM461177_2_subsampled.fastqsanger
80f57c0a3537d2e4cd9f1748e7b7da91 dm6-STAR-index/Genome
e1a0b37b3cae8af4871cfb8c293a98cb dm6-STAR-index/SA
eec51bc2096fbfd71f2cdb970529a6f9 dm6-STAR-index/SAindex
d8f7048f5f882af92c7fda0c711c70ad dm6-STAR-index/chrLength.txt
d727129afa13f91ae22bd35e29afd0c0 dm6-STAR-index/chrName.txt
63645d2aab7fd4c9ed356e071d7b9c2a dm6-STAR-index/chrNameLength.txt
f897e16cb9a8d9a4f1cf242938f97ac6 dm6-STAR-index/chrStart.txt
d5212320af99792dc517f1d4fa830ec1 dm6-STAR-index/exonGeTrInfo.tab
5ef2835f5c116f55dd20522567152bf8 dm6-STAR-index/exonInfo.tab
11aa105726dbb95e8f27714fdbd8e897 dm6-STAR-index/geneInfo.tab
faa97745744f087d06879800a531edea dm6-STAR-index/sjdbInfo.txt
8cd08db8cb269461db29d9eb23ad690b dm6-STAR-index/sjdbList.fromGTF.out.tab
fd0b412f47ca662ce68e08ca6939a2c7 dm6-STAR-index/sjdbList.out.tab
bdf26dee24f8cd1ca7b1bd854c882411 dm6-STAR-index/transcriptInfo.tab
5aadf7ccab5a6b674e76516bf75eaa09 dm6.fa
The following code chunk may be of assistance. You may also use a text editor.
BASH
(
cd rnaseq && \
{
echo "3c8458cf67d71c22c4c420dd37e23ef2 Drosophila_melanogaster.BDGP6.87.gtf"
echo "036b68a2a8fe8725d48fb5fd89e2b8b2 GSM461177_1_subsampled.fastqsanger"
echo "87b09607057743ecf3e38448630421c9 GSM461177_2_subsampled.fastqsanger"
echo "80f57c0a3537d2e4cd9f1748e7b7da91 dm6-STAR-index/Genome"
echo "e1a0b37b3cae8af4871cfb8c293a98cb dm6-STAR-index/SA"
echo "eec51bc2096fbfd71f2cdb970529a6f9 dm6-STAR-index/SAindex"
echo "d8f7048f5f882af92c7fda0c711c70ad dm6-STAR-index/chrLength.txt"
echo "d727129afa13f91ae22bd35e29afd0c0 dm6-STAR-index/chrName.txt"
echo "63645d2aab7fd4c9ed356e071d7b9c2a dm6-STAR-index/chrNameLength.txt"
echo "f897e16cb9a8d9a4f1cf242938f97ac6 dm6-STAR-index/chrStart.txt"
echo "d5212320af99792dc517f1d4fa830ec1 dm6-STAR-index/exonGeTrInfo.tab"
echo "5ef2835f5c116f55dd20522567152bf8 dm6-STAR-index/exonInfo.tab"
echo "11aa105726dbb95e8f27714fdbd8e897 dm6-STAR-index/geneInfo.tab"
echo "faa97745744f087d06879800a531edea dm6-STAR-index/sjdbInfo.txt"
echo "8cd08db8cb269461db29d9eb23ad690b dm6-STAR-index/sjdbList.fromGTF.out.tab"
echo "fd0b412f47ca662ce68e08ca6939a2c7 dm6-STAR-index/sjdbList.out.tab"
echo "bdf26dee24f8cd1ca7b1bd854c882411 dm6-STAR-index/transcriptInfo.tab"
echo "5aadf7ccab5a6b674e76516bf75eaa09 dm6.fa"
} > rnaseq.md5   # overwrite, so re-running does not duplicate entries
)
Checking your files
(
cd rnaseq && \
md5sum -c rnaseq.md5
)
We expect the following output from this command, one line for each file listed in rnaseq.md5:
Drosophila_melanogaster.BDGP6.87.gtf: OK
GSM461177_1_subsampled.fastqsanger: OK
GSM461177_2_subsampled.fastqsanger: OK
dm6-STAR-index/Genome: OK
dm6-STAR-index/SA: OK
dm6-STAR-index/SAindex: OK
dm6-STAR-index/chrLength.txt: OK
dm6-STAR-index/chrName.txt: OK
dm6-STAR-index/chrNameLength.txt: OK
dm6-STAR-index/chrStart.txt: OK
dm6-STAR-index/exonGeTrInfo.tab: OK
dm6-STAR-index/exonInfo.tab: OK
dm6-STAR-index/geneInfo.tab: OK
dm6-STAR-index/sjdbInfo.txt: OK
dm6-STAR-index/sjdbList.fromGTF.out.tab: OK
dm6-STAR-index/sjdbList.out.tab: OK
dm6-STAR-index/transcriptInfo.tab: OK
dm6.fa: OK
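To get a feel for what md5sum -c reports when a file does not match, the following sketch records a checksum for a throwaway file, verifies it, then alters the file and verifies again. The file names are invented for the demonstration, and it assumes the same md5sum command used above.

```shell
scratch=$(mktemp -d)
echo "example data" > "$scratch/sample.txt"

# Record the checksum next to the file, as rnaseq.md5 does for the dataset.
( cd "$scratch" && md5sum sample.txt > sample.md5 )

before=$( cd "$scratch" && md5sum -c sample.md5 )
echo "$before"

# Simulate a corrupted download by changing the file, then check again.
echo "changed" >> "$scratch/sample.txt"
after=$( cd "$scratch" && md5sum -c sample.md5 2>/dev/null ) || echo "mismatch detected"
echo "$after"

rm -rf "$scratch"
```

A FAILED line for any of the tutorial files usually means the download was interrupted and should be repeated.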
md5 Troubleshooting: Windows users only
If you see an error such as
md5sum: 'Drosophila_melanogaster.BDGP6.87.gtf'$'\r': No such file or directory
: FAILED open or reader.BDGP6.87.gtf
it may be because rnaseq.md5 has Windows-style line endings.
Install the dos2unix command-line tool (on Ubuntu, sudo apt-get
install dos2unix) and run dos2unix rnaseq.md5 to convert the file to
Unix-style line endings.
Then retry the md5sum command.
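If installing dos2unix is not an option, the same conversion can be approximated with tr, which is available on most Unix systems. This is a sketch under that assumption; strip_cr is a hypothetical helper and the two checksum-like lines are placeholder data, not real checksums.

```shell
# Strip carriage returns so each line ends in a plain newline (LF).
# The file is rewritten through a temporary copy.
strip_cr() {
    tmp=$(mktemp)
    tr -d '\r' < "$1" > "$tmp" && mv "$tmp" "$1"
}

demo=$(mktemp)
printf 'abc123 file_one.txt\r\ndef456 file_two.txt\r\n' > "$demo"   # placeholder lines
strip_cr "$demo"
cat "$demo"
```

Running strip_cr rnaseq.md5 inside the rnaseq directory would apply the same fix to the real checksum file.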