Summary and Schedule

Prerequisite

This tutorial guides you through the fundamentals of designing and building an analysis workflow. It assumes no previous knowledge or experience of workflows or Common Workflow Language (CWL), but does assume some experience with the Unix command line.

Before following this tutorial, you should be comfortable working in a Unix command line environment and familiar with fundamental commands (cd, mv, mkdir, etc), piping and redirection, and simple Bash scripting, such as might be gained from following the Software Carpentry lesson, The Unix Shell.

You might also have some experience with running tasks on a remote machine (by ssh connection) and in a cluster (high performance computing) environment.

CWL is based upon YAML. If at any time you find yourself confused by the YAML syntax, consider reviewing this guide on the subset of YAML used in CWL.

If you have previously written a workflow description, in CWL or another language, you may want to look instead at the User Guide.

Target Audience

This tutorial is aimed at researchers and research software engineers who would like to begin automating their analyses in workflows.

If you’re unsure whether this tutorial is a good fit for you, check the prerequisites listed above.

You may also find our learner profiles helpful.

These are also a useful resource during the lesson design process.

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

OS Setup


These lessons assume that you are using the freely available Visual Studio Code application with the Benten extension along with the CWL reference runner (cwltool).

This tutorial requires three pieces of software to run and visualize the workflows: Docker, cwltool, and graphviz.

Please follow instructions for your OS by clicking on the relevant link below.

Windows Setup


WSL2 Installation

Prerequisite

Ensure you’re running Windows 10 or higher

To check your Windows version and build number, press the Windows logo key + R, type winver, select OK. You can update to the latest Windows version by selecting “Start” > “Settings” > “Windows Update” > “Check for updates”.

You are also expected to install the ‘Terminal’ app from the Microsoft Store.

Follow the wsl installation instructions.

You may also wish to go through Getting started with WSL2.

Choosing your Linux Flavour of Choice

For this tutorial, we expect you to use Ubuntu as your WSL2 distribution.

Confirm WSL2 is installed

Open PowerShell as Administrator and type in the following

BASH

wsl --list

You should see the Linux distribution you installed in the previous step.

Installing apt tools

For this tutorial, we will require a few linux tools to be installed.

Open up the ‘terminal’ app and select a new tab of the Ubuntu version you have just installed

BASH

sudo apt-get update -y -q && \
sudo apt-get install -y -q \
  python3-venv \
  wget \
  graphviz \
  nodejs \
  wslu

Install Docker Desktop

Install Docker Desktop by following the instructions on the Docker Desktop Installation Page

  • Accept the terms and conditions, if prompted
  • Wait for Docker Desktop to finish starting
  • Skip the tutorial, if prompted

Ensure Docker Desktop is Using the WSL2 Backend

  • From the top menu choose “Settings” > “General”

  • Make sure ‘Use the WSL 2 based engine’ is selected


VSCode Installation

Download and install VSCode

Install VSCode Extensions

Install the WSL Integration extension

Open WSL2 Extension in the marketplace

You should now see ‘WSL Targets’ under the ‘Remote Explorer’ tab on the left hand side of the screen.

Right-click Ubuntu to set it as the default distribution

Install Benten VSCode Extension

Open Benten in the marketplace and click the Install button.

If you are given the option to enable the extension on ‘WSL: Ubuntu’ please do so.

Install Red Hat YAML VSCode Extension

Open Red Hat YAML in the marketplace and click the Install button.

If you are given the option to enable the extension on ‘WSL: Ubuntu’ please do so.

Attribute CWL files to the yaml file type

Add the following chunk to the VSCode user settings json to attribute CWL to the YAML file type.

JSON

{
    "files.associations": {
        "*.cwl": "yaml"
    }
}

Install cwltool (latest)

First we will make a Python virtual environment by running the following commands in the terminal.

BASH

python3 -m venv env       # Create a virtual environment named 'env' in the current directory
source env/bin/activate   # Activate the 'env' environment

You will know that this worked as the terminal prompt will now have (env) at the beginning.

Reactivating the python virtual environment

Every time you launch VS Code or launch a new terminal, you must run source env/bin/activate to re-enable access to this Python Virtual Environment.
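If you are unsure whether the environment is active in a given terminal, one option is to inspect the VIRTUAL_ENV variable, which the activate script exports. The following is a minimal sketch; the function name venv_status is our own and is not part of the lesson or of Python.

```bash
# Report whether a Python virtual environment is active in this shell.
# VIRTUAL_ENV is exported by the venv 'activate' script.
venv_status() {
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "active: ${VIRTUAL_ENV}"
  else
    echo "inactive: run 'source env/bin/activate'"
  fi
}

venv_status
```

The (env) prefix in the prompt carries the same information; the variable is simply easier to test from a script.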

Next, install cwltool by running the following in the terminal:

BASH

pip install cwltool

Continue to Confirm Software Installations after completing the setup for Windows.


MacOS Setup


VSCode Installation

Download and install VSCode


VSCode Extensions

Install Benten Extension

Open Benten in the marketplace and click the Install button or follow the directions.

Install Red Hat YAML VSCode Extension

Open Red Hat YAML in the marketplace and click the Install button.

Attribute CWL files to the yaml file type

Add the following chunk to the VSCode user settings json to attribute CWL to the YAML file type.

JSON

{
    "files.associations": {
        "*.cwl": "yaml"
    }
}


Docker Installation

Install docker


Install Conda

Install Miniconda


Conda Setup

Update Conda sources configuration

Tell conda about which channels (sources) we will use

BASH

conda config --add channels conda-forge


Create a virtual environment

Create a virtual environment using conda

BASH

conda create --name cwltutorial python=3.11


Activate the virtual environment

BASH

conda activate cwltutorial


Install prerequisites via conda

BASH

conda install --yes \
  cwltool \
  graphviz \
  wget \
  git \
  nodejs

Reactivating the conda virtual environment

The virtual environment needs to be activated every time you start a terminal using conda activate cwltutorial.
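To check from a script whether the right environment is active, you can compare the CONDA_DEFAULT_ENV variable, which conda activate sets, against the environment name. This is an illustrative sketch; the helper name conda_env_is_active is hypothetical.

```bash
# Succeeds only if the named conda environment is the active one.
# CONDA_DEFAULT_ENV is set by 'conda activate'.
conda_env_is_active() {
  [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if conda_env_is_active cwltutorial; then
  echo "cwltutorial is active"
else
  echo "cwltutorial is not active: run 'conda activate cwltutorial'"
fi
```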

Continue to Confirm Software Installations after completing the setup for MacOS.

Linux Setup



Install VSCode

Download and install VSCode


VSCode extensions

Install Benten VSCode Extension

Open Benten in the marketplace and click the Install button.

Install Red Hat YAML VSCode Extension

Open Red Hat YAML in the marketplace and click the Install button.

Attribute CWL files to the yaml file type

Add the following chunk to the VSCode user settings json to attribute CWL to the YAML file type.

JSON

{
    "files.associations": {
        "*.cwl": "yaml"
    }
}


Install Docker

Click here and then click on your Linux flavour (e.g. Ubuntu) to install the Docker engine.


Enable docker usage as a non-root user

Follow the instructions in the Docker documentation to enable docker usage as a non-root user

You will need to log out and back in for this to take effect.

Install cwltool

First we will make a Python virtual environment by running the following commands in the terminal.

BASH

python3 -m venv env       # Create a virtual environment named 'env' in the current directory
source env/bin/activate   # Activate the 'env' environment

You will know that this worked as the terminal prompt will now have (env) at the beginning.

Reactivating the python virtual environment

Every time you launch VS Code or launch a new terminal, you must run source env/bin/activate to re-enable access to this Python Virtual Environment.

Next, install cwltool by running the following in the terminal:

BASH

pip install cwltool

Later we will make visualisations of our workflows. To support that we need to install graphviz.

Install graphviz

Here is the command for Debian-based Linux systems to install graphviz and other helpful programs:

BASH

sudo apt-get install -y graphviz wget git nodejs

For other Linux systems, check the graphviz download page

Continue to Confirm Software Installations after completing the setup for Linux.

Confirm Software Installations


Docker

To confirm docker is installed, run the following command to display the version number:

BASH

docker version

You should see something similar to the output shown below.

Client: Docker Engine - Community
 Version:           20.10.13
 API version:       1.41
 Go version:        go1.16.15
 Git commit:        a224086
 Built:             Thu Mar 10 14:08:15 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.13
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.15
  Git commit:       906f57f
  Built:            Thu Mar 10 14:06:05 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.10
  GitCommit:        2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 runc:
  Version:          1.0.3
  GitCommit:        v1.0.3-0-gf46b6ba
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Confirm cwltool is installed

To confirm cwltool is installed, run the following command to display the version number:

BASH

cwltool --version

You should see something similar to the output shown below.

/home/learner/env/bin/cwltool 3.1.20220224085855

GraphViz

To confirm graphviz is installed, run the following command to display the version number:

BASH

dot -V

You should see something similar to the output shown below.

dot - graphviz version 2.43.0 (0)
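The three checks above can also be combined into one step that only reports what is missing. The sketch below relies on the shell built-in command -v; the function name missing_tools is our own choice for illustration.

```bash
# Print the names of any commands from the argument list that are not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# 'dot' is the Graphviz executable used in the check above.
missing=$(missing_tools docker cwltool dot)
if [ -z "$missing" ]; then
  echo "All required tools were found."
else
  # $missing is left unquoted so the names print on a single line.
  echo "Not found on PATH:" $missing
fi
```

This only confirms that each command can be found; it does not check versions or whether the Docker daemon is running.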

Containers Requirements


To avoid having to wait during the class, please run the following commands, which will download all the required software containers.

BASH

docker pull quay.io/biocontainers/star:2.7.5c--0
docker pull quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1
docker pull quay.io/biocontainers/cutadapt:3.7--py39hbf8eff0_1
docker pull quay.io/biocontainers/samtools:1.14--hb421002_0
docker pull quay.io/biocontainers/subread:2.0.6--he4a0461_0
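To confirm that every container was downloaded, you could compare the list of local images against the names above. This is a rough sketch: the helper images_not_pulled is hypothetical, and it reads the image list from standard input so that it can be fed by docker image ls.

```bash
# Read 'repository:tag' names on standard input (one per line) and print
# each image named in the arguments that is absent from that list.
images_not_pulled() {
  local_images=$(cat)
  for image in "$@"; do
    printf '%s\n' "$local_images" | grep -qxF -- "$image" || echo "$image"
  done
}

# Example (requires Docker to be running):
#   docker image ls --format '{{.Repository}}:{{.Tag}}' | images_not_pulled \
#     quay.io/biocontainers/star:2.7.5c--0 \
#     quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1
```

No output means every image named in the arguments is already available locally.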

GitHub Requirements


This part assumes you have a GitHub.com account.

If you do not have a GitHub.com account you can create one here

The following steps will take you through the process of creating an SSH key pair to connect your terminal to GitHub. If you already have this set up, you may skip this part.

Opening files from the terminal

Whenever you are required to open a file from the terminal, you may type code /path/to/file.

This will also create the file if it does not yet exist.

SSH Key

After creating an account you will need to generate an SSH key so that you can pull and push code between your terminal and GitHub.

In your terminal run the following command

# Create the .ssh directory
mkdir -p -m 700 ~/.ssh   # -m is accepted by both GNU (Linux) and BSD (macOS) mkdir

ssh-keygen -f ~/.ssh/github -t ed25519 -q -N ""

Add your public key to GitHub

Head to GitHub SSH and GPG keys.

Click New SSH key.
Set the title to the name of your computer (this can be anything), and under ‘Key’, paste the contents of ~/.ssh/github.pub from your terminal.

Setting your SSH Config

Open up ~/.ssh/config and add in the following code chunk:

Host github.com
  IdentityFile ~/.ssh/github
  User git
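If you prefer to make this change from the terminal, the entry can be appended with a small helper. This is a sketch under assumptions: the function name add_github_host is hypothetical, and it treats an existing line reading exactly "Host github.com" as a sign that the entry is already present.

```bash
# Append a github.com entry to an SSH config file unless one already exists.
# Arguments: 1 = config file path, 2 = private key path.
add_github_host() {
  config_file=$1
  key_file=$2
  if [ -f "$config_file" ] && grep -q '^Host github\.com$' "$config_file"; then
    echo "github.com entry already present in $config_file"
    return 0
  fi
  printf 'Host github.com\n  IdentityFile %s\n  User git\n' "$key_file" >> "$config_file"
  chmod 600 "$config_file"   # keep the config readable by the owner only
}

# Example: add_github_host ~/.ssh/config ~/.ssh/github
```

Running it twice is safe, because the second call leaves the file unchanged.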

Test your connection

ssh -T git@github.com

You should see a message like so

Hi alexiswl! You've successfully authenticated, but GitHub does not provide shell access.

Data Requirements


You will need to download some example files for this lesson. In this tutorial we will use RNA sequencing data.

Setting up a practice Git repository

For this tutorial some existing tools are needed to build the workflow. These existing tools will be imported via GitHub.

Make a repository on GitHub.com

Head to github.com/new and create a new repository.

Name it novice-tutorial-exercises.

Make sure the repository is public.

Check ‘Add a README file’, and choose GNU General Public License v3.0 as your license.

Clone the repository

BASH

git clone git@github.com:YOUR_GITHUB_USERNAME/novice-tutorial-exercises.git

Next, we need to move into our git repo:

BASH

cd novice-tutorial-exercises


Add in bio-cwl-tools as a submodule

Then import bio-cwl-tools with this command:

BASH

git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git


Downloading sample and reference data

Create a new directory inside the novice-tutorial-exercises directory and download the data:

Using subshells

Wrapping the following chunk in parentheses runs it in a subshell, so the console returns to the original working directory after the download is complete.

BASH

mkdir rnaseq
(
  cd rnaseq
  wget --quiet https://zenodo.org/record/4541751/files/GSM461177_1_subsampled.fastqsanger
  wget --quiet https://zenodo.org/record/4541751/files/GSM461177_2_subsampled.fastqsanger
  wget --quiet https://zenodo.org/record/4541751/files/Drosophila_melanogaster.BDGP6.87.gtf
  wget --quiet https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
  gunzip dm6.fa.gz  # STAR index requires an uncompressed reference genome
)
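The subshell behaviour used above can be seen in isolation with a short experiment. This sketch uses /tmp only as an example directory and leaves your files untouched.

```bash
start_dir=$PWD

# Parentheses run the enclosed commands in a subshell, so the 'cd'
# affects only the subshell and not the calling shell.
(
  cd /tmp
  echo "inside the subshell:  $PWD"
)

echo "after the subshell:   $PWD"
```

The second line printed should show the directory you started in, which is why no cd back is needed after the download.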


Ignore data files in Git

Git is not designed to track large data files.
We should exclude the rnaseq folder from being tracked in this repository.

echo 'rnaseq/' >> .gitignore
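To confirm that the new rule works, you can ask Git whether it would ignore a given path using git check-ignore. The wrapper name is_ignored below is our own, added for illustration.

```bash
# Succeeds if Git would ignore the given path in the current repository.
is_ignored() {
  git check-ignore --quiet "$1"
}

# Example, run from the top of novice-tutorial-exercises:
#   is_ignored rnaseq/dm6.fa && echo "rnaseq/dm6.fa is ignored"
```

Running git status afterwards should likewise no longer list the rnaseq directory as untracked.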

Add our changes to git

.gitignore is itself a file that Git tracks like any other, so it needs to be added and committed.

Likewise, we also need to let git know we would like to keep bio-cwl-tools as a submodule.

git add .gitignore bio-cwl-tools
git commit -m "Don't track files in rnaseq dir, added bio-cwl-tools as a submodule"
git push


STAR Genome index

To run the STAR aligner tool, index files generated from the reference genome are needed. The index files can be downloaded, or generated yourself from the unindexed reference genome.

These two options are detailed below – choose the one most appropriate to your set-up.

At least 9 GB of memory is required to generate the index, which will occupy 3.3 GB of disk space.

If your computer doesn’t have that much memory, you can instead download the pre-built index by running the following from the novice-tutorial-exercises directory (the archive is unpacked into rnaseq):

BASH

wget --quiet --output-document - https://dataverse.nl/api/access/datafile/266295 | \
tar -C rnaseq -x --xz

To generate the genome index yourself, create a new file named dm6-star-index.yaml in the novice-tutorial-exercises directory with the following contents:

YAML

InputFiles:
 - class: File
   location: rnaseq/dm6.fa
   format: http://edamontology.org/format_1929  # FASTA
IndexName: 'dm6-STAR-index'
Overhang: 36
Gtf:
 class: File
 location: rnaseq/Drosophila_melanogaster.BDGP6.87.gtf

Next, index the genome with cwltool, the CWL reference runner you installed above, using the CWL description of the STAR indexing mode that was downloaded in the bio-cwl-tools directory. The result is placed in the rnaseq directory alongside the other files:

cwltool --outdir rnaseq/ bio-cwl-tools/STAR/STAR-Index.cwl dm6-star-index.yaml

It should take 10-15 minutes for the index to be generated.

STAR Index Memory Requirements

To generate the genome index you will need at least 9 GB of RAM.

If you do not allocate enough RAM the tool will not crash, but the process will stick on the following step:

... sorting Suffix Array chunks and saving them to disk...

If this step does not finish within 10 minutes then it is likely the process has failed, and should be cancelled.
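On Linux and inside WSL2, you can check how much memory is visible before starting the index build. The sketch below reads /proc/meminfo, which does not exist on macOS; the function name total_memory_gb is hypothetical, and the value is rounded down, so a machine sold as having 9 GB may report 8.

```bash
# Print total memory in whole GB, read from a Linux-style meminfo file.
# Argument 1 (optional): path to the file, defaulting to /proc/meminfo.
# MemTotal is reported in kB, so divide twice by 1024.
total_memory_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Example: warn when less than the 9 GB needed for indexing is visible.
#   [ "$(total_memory_gb)" -ge 9 ] || echo "Less than 9 GB of RAM visible"
```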

MacOS users can configure Docker Desktop to allocate more memory from the menu “Preferences” and then selecting “Resources”.

Windows users can configure WSL 2 to allocate more memory by opening PowerShell and entering the following:

BASH

# turn off all wsl instances such as docker-desktop
wsl --shutdown

notepad "$env:USERPROFILE/.wslconfig"

In .wslconfig add the following

[wsl2]
memory=9GB

Save the file and right-click the Docker icon in the notifications area (or System tray) and then click “Restart Docker…”


Validating your dataset

Let’s make sure the data we’ve downloaded is correct and in the right structure.
We can do this with a checksum.

Setting the checksum

Under the rnaseq folder create the file rnaseq.md5 with the following contents.

BASH

3c8458cf67d71c22c4c420dd37e23ef2  Drosophila_melanogaster.BDGP6.87.gtf
036b68a2a8fe8725d48fb5fd89e2b8b2  GSM461177_1_subsampled.fastqsanger
87b09607057743ecf3e38448630421c9  GSM461177_2_subsampled.fastqsanger
80f57c0a3537d2e4cd9f1748e7b7da91  dm6-STAR-index/Genome
e1a0b37b3cae8af4871cfb8c293a98cb  dm6-STAR-index/SA
eec51bc2096fbfd71f2cdb970529a6f9  dm6-STAR-index/SAindex
d8f7048f5f882af92c7fda0c711c70ad  dm6-STAR-index/chrLength.txt
d727129afa13f91ae22bd35e29afd0c0  dm6-STAR-index/chrName.txt
63645d2aab7fd4c9ed356e071d7b9c2a  dm6-STAR-index/chrNameLength.txt
f897e16cb9a8d9a4f1cf242938f97ac6  dm6-STAR-index/chrStart.txt
d5212320af99792dc517f1d4fa830ec1  dm6-STAR-index/exonGeTrInfo.tab
5ef2835f5c116f55dd20522567152bf8  dm6-STAR-index/exonInfo.tab
11aa105726dbb95e8f27714fdbd8e897  dm6-STAR-index/geneInfo.tab
faa97745744f087d06879800a531edea  dm6-STAR-index/sjdbInfo.txt
8cd08db8cb269461db29d9eb23ad690b  dm6-STAR-index/sjdbList.fromGTF.out.tab
fd0b412f47ca662ce68e08ca6939a2c7  dm6-STAR-index/sjdbList.out.tab
bdf26dee24f8cd1ca7b1bd854c882411  dm6-STAR-index/transcriptInfo.tab
5aadf7ccab5a6b674e76516bf75eaa09  dm6.fa

The following code chunk may be of assistance. You may also use a text editor.

BASH

(
  cd rnaseq && \
  {
     echo "3c8458cf67d71c22c4c420dd37e23ef2  Drosophila_melanogaster.BDGP6.87.gtf"
     echo "036b68a2a8fe8725d48fb5fd89e2b8b2  GSM461177_1_subsampled.fastqsanger"
     echo "87b09607057743ecf3e38448630421c9  GSM461177_2_subsampled.fastqsanger"
     echo "80f57c0a3537d2e4cd9f1748e7b7da91  dm6-STAR-index/Genome"
     echo "e1a0b37b3cae8af4871cfb8c293a98cb  dm6-STAR-index/SA"
     echo "eec51bc2096fbfd71f2cdb970529a6f9  dm6-STAR-index/SAindex"
     echo "d8f7048f5f882af92c7fda0c711c70ad  dm6-STAR-index/chrLength.txt"
     echo "d727129afa13f91ae22bd35e29afd0c0  dm6-STAR-index/chrName.txt"
     echo "63645d2aab7fd4c9ed356e071d7b9c2a  dm6-STAR-index/chrNameLength.txt"
     echo "f897e16cb9a8d9a4f1cf242938f97ac6  dm6-STAR-index/chrStart.txt"
     echo "d5212320af99792dc517f1d4fa830ec1  dm6-STAR-index/exonGeTrInfo.tab"
     echo "5ef2835f5c116f55dd20522567152bf8  dm6-STAR-index/exonInfo.tab"
     echo "11aa105726dbb95e8f27714fdbd8e897  dm6-STAR-index/geneInfo.tab"
     echo "faa97745744f087d06879800a531edea  dm6-STAR-index/sjdbInfo.txt"
     echo "8cd08db8cb269461db29d9eb23ad690b  dm6-STAR-index/sjdbList.fromGTF.out.tab"
     echo "fd0b412f47ca662ce68e08ca6939a2c7  dm6-STAR-index/sjdbList.out.tab"
     echo "bdf26dee24f8cd1ca7b1bd854c882411  dm6-STAR-index/transcriptInfo.tab"
     echo "5aadf7ccab5a6b674e76516bf75eaa09  dm6.fa"
  } > rnaseq.md5   # overwrite, so re-running this chunk does not duplicate entries
)

Checking your files

(
  cd rnaseq && \
  md5sum -c rnaseq.md5
)

We expect the following outputs from this command:

Drosophila_melanogaster.BDGP6.87.gtf: OK
GSM461177_1_subsampled.fastqsanger: OK
GSM461177_2_subsampled.fastqsanger: OK
dm6-STAR-index/Genome: OK
dm6-STAR-index/SA: OK
dm6-STAR-index/SAindex: OK
dm6-STAR-index/chrLength.txt: OK
dm6-STAR-index/chrName.txt: OK
dm6-STAR-index/chrNameLength.txt: OK
dm6-STAR-index/chrStart.txt: OK
dm6-STAR-index/exonGeTrInfo.tab: OK
dm6-STAR-index/exonInfo.tab: OK
dm6-STAR-index/geneInfo.tab: OK
dm6-STAR-index/sjdbInfo.txt: OK
dm6-STAR-index/sjdbList.fromGTF.out.tab: OK
dm6-STAR-index/sjdbList.out.tab: OK
dm6-STAR-index/transcriptInfo.tab: OK
dm6.fa: OK
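With this many files it can be easier to see only the failures. GNU md5sum accepts a --quiet option with -c that suppresses the OK lines. The following self-contained sketch demonstrates this in a temporary directory rather than on the lesson data; the file names sample.txt and demo.md5 are made up for the demonstration.

```bash
# Self-contained demonstration in a temporary directory.
demo_dir=$(mktemp -d)
(
  cd "$demo_dir"
  echo "example data" > sample.txt
  md5sum sample.txt > demo.md5      # record the checksum

  # Nothing is printed by md5sum when every file matches.
  md5sum -c --quiet demo.md5 && echo "all files match"

  echo "changed" >> sample.txt      # simulate a corrupted download
  md5sum -c --quiet demo.md5 2>/dev/null || echo "at least one file differs"
)
```

For the lesson data, the equivalent check would be md5sum -c --quiet rnaseq.md5 run inside the rnaseq directory.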

md5 Troubleshooting: Windows users only

If you see an error such as

md5sum: 'Drosophila_melanogaster.BDGP6.87.gtf'$'\r': No such file or directory
: FAILED open or reader.BDGP6.87.gtf

it may be because rnaseq.md5 has Windows-style (CRLF) line endings.

The following code will install the dos2unix cli tool and then convert the rnaseq.md5 file to unix-based line endings.

BASH

sudo apt update -yq
sudo apt install dos2unix -yq

dos2unix rnaseq/rnaseq.md5

Then retry the md5sum command
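If you cannot install dos2unix, the standard tr utility can detect and remove the carriage returns instead. This is a sketch of an alternative, not the method the lesson uses; the function names has_cr and strip_cr and the output file name in the example are hypothetical.

```bash
# Succeeds if the file contains at least one carriage return (CR) character.
has_cr() {
  [ "$(tr -cd '\r' < "$1" | wc -c)" -gt 0 ]
}

# Write a copy of file $1 to $2 with all carriage returns removed.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Example:
#   has_cr rnaseq/rnaseq.md5 && strip_cr rnaseq/rnaseq.md5 rnaseq/rnaseq.unix.md5
```

The cleaned copy is written to a new file because redirecting tr back onto its own input file would empty it.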