This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to Workflows with Common Workflow Language

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Common Workflow Language?

  • How are CWL workflows written?

  • How do CWL workflows compare to shell workflows?

  • What are the advantages of using CWL workflows?

Objectives
  • First learning objective. (FIXME)

Common Workflow Language

Computational workflows are widely used for data analysis, enabling rapid innovation and decision making. Workflow thinking is a form of “conceptualizing processes as recipes and protocols, structured as [work- or] dataflow graphs with computational steps, and subsequently developing tools and approaches for formalizing, analyzing and communicating these process descriptions” (Gryk & Ludascher, 2017).

However, the rise in popularity of workflows has been matched by a rise in the number of disparate workflow managers, each with its own standards for describing tools and workflows. This reduces the portability and interoperability of these workflows.

CWL is a free and open standard for describing command-line tool based workflows [1]. These standards provide a common, but reduced, set of abstractions that are both used in practice and implemented in many popular workflow systems. The CWL language is declarative, enabling computational workflows to be constructed from diverse software tools, executing each through its command-line interface.

Previously, researchers might write shell scripts to link together these command-line tools. Although these scripts might provide a direct means of accessing the tools, writing and maintaining them requires specific knowledge of the system that they will be used on. Shell scripts are not easily portable, and so researchers can easily end up spending more time maintaining the scripts than carrying out their research. The aim of CWL is to lower the barrier that researchers face when using these tools.

CWL workflows are written in a subset of YAML, with a syntax that does not restrict the amount of detail provided for a tool or workflow. The execution model is explicit: all required elements of a tool’s runtime environment must be specified by the CWL tool-description author. On top of these basic requirements, authors can also add hints or requirements to the tool description, helping to guide users (and workflow engines) on what resources are needed for a tool.

The CWL standards explicitly support the use of software container technologies, helping ensure that the execution of tools is reproducible. Data locations are explicitly defined, and working directories are kept separate for each tool invocation. These standards ensure the portability of tools and workflows, allowing the same workflows to be run on your local machine, or in an HPC or cloud environment, with minimal changes required.

RNA sequencing example

In this tutorial a bioinformatics RNA-sequencing analysis is used as an example. However, no specific knowledge of bioinformatics is needed for this tutorial. RNA-sequencing is a technique which examines the quantity and sequences of RNA in a sample using next-generation sequencing. The RNA reads are analyzed to measure the relative numbers of different RNA molecules in the sample. This type of analysis is called differential gene expression analysis.

The process looks like this:

During this tutorial, only the middle analytical steps will be performed. The adapter trimming is skipped. These steps will be done:

The different tools necessary for this analysis are already available. In this tutorial a workflow will be set up to connect these tools and generate the desired output files.

  1. M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, C. Goble, The CWL Community (2021): Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language. Communications of the ACM. https://doi.org/10.1145/3486897

Key Points

  • First key point. Brief Answer to questions. (FIXME)


CWL and Shell Tools

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is the difference between a CWL tool description and a CWL workflow?

  • How can we create a tool descriptor?

  • How can we use this in a single step workflow?

Objectives
  • describe the relationship between a tool and its corresponding CWL document

  • exercise good practices when naming inputs and outputs

  • understand how to reference files for input and output

  • explain that only files explicitly mentioned in a description will be included in the output of a step/workflow

  • implement bulk capturing of all files produced by a step/workflow for debugging purposes

  • use STDIN and STDOUT as input and output

  • capture output written to a specific directory, the working directory, or the same directory where input is located

learning objectives

By the end of this episode, learners should be able to explain how a workflow document describes the input and output of a workflow, describe all the requirements for running a tool, and define the files that will be included as output of a workflow.

CWL workflows are written in the YAML syntax. This short tutorial explains the parts of YAML used in CWL. A CWL document contains the workflow and the requirements for running that workflow. All CWL documents should start with two lines of code:

cwlVersion: v1.2
class: 

The cwlVersion string defines which standard of the language is required for the tool or workflow. The most recent version is v1.2.

The class field defines what this particular document is. The majority of CWL documents will fall into one of two classes: CommandLineTool or Workflow. The CommandLineTool class is used for describing the interface of a command-line tool, while the Workflow class is used for connecting those tool descriptions into a workflow. This lesson explains the differences between these two classes, how to pass data to and from command-line tools and specify working environments for them, and finally how to use a tool description within a workflow.

Our first CWL script

To demonstrate the basic requirements for a tool descriptor, a CWL description for the popular “Hello world!” demonstration will be examined.

echo.cwl

cwlVersion: v1.2
class: CommandLineTool

baseCommand: echo

inputs:
  message_text:
    type: string
    inputBinding:
      position: 1

outputs: []

Next, the input file: hello_world.yml.

hello_world.yml

message_text: Hello world!

We will use the reference CWL runner, cwltool, to run this CWL document (the .cwl workflow file) along with the .yml input file.

cwltool echo.cwl hello_world.yml
INFO Resolved 'echo.cwl' to 'file:///.../echo.cwl'
INFO [job echo.cwl] /private/tmp/docker_tmprm65mucw$ echo \
    'Hello world!'
Hello world!
INFO [job echo.cwl] completed success
{}
INFO Final process status is success

The output displayed above shows that the program has run successfully and printed its output, Hello world!.

Let’s take a look at the echo.cwl script in more detail.

As explained above, the first two lines are always present: they define the CWL version and the class of the document. In this example the class is CommandLineTool, because the document describes a command-line tool, the echo command. The next line, baseCommand, contains the command that will be run (echo).

inputs:
  message_text:
    type: string
    inputBinding:
      position: 1

This block of code contains the inputs section of the tool description. This section provides all the inputs that are needed for running this specific tool. To run this example we will need to provide a string which will be included on the command line. Each of the inputs has a name, to help us tell them apart; this first input has the name message_text. The field inputBinding is one way to specify how the input should appear on the command line. Here the position field indicates at which position the input will be on the command line; in this case the message_text value will be the first thing added to the command line (after the baseCommand, echo).
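To see how position orders several inputs, the tool description could be extended as in the sketch below. This is an illustration only and not part of the lesson's files: the input names greeting and recipient are assumptions made for this example.

```yaml
cwlVersion: v1.2
class: CommandLineTool

baseCommand: echo

inputs:
  greeting:              # hypothetical input name, chosen for this sketch
    type: string
    inputBinding:
      position: 1        # placed first after the baseCommand
  recipient:             # hypothetical input name, chosen for this sketch
    type: string
    inputBinding:
      position: 2        # placed after the greeting value

outputs: []
```

With an input file that sets greeting to Hello and recipient to world, the command line built from this description would be echo Hello world.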

outputs: []

The last part is the outputs section of the tool description. This example doesn’t have a formal output, because the text is printed directly in the terminal. So an empty YAML list ([]) is used as the output.
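If the printed text should instead be captured as a formal output, one possible approach is to redirect the standard output of the tool to a file. The sketch below is an illustration under stated assumptions: the file name message.txt and the output name echoed_message do not appear in this lesson and were chosen for the example. The stdout field and the stdout output type are part of the CWL CommandLineTool standard.

```yaml
cwlVersion: v1.2
class: CommandLineTool

baseCommand: echo

# Redirect everything echo prints into this file (file name is an assumption)
stdout: message.txt

inputs:
  message_text:
    type: string
    inputBinding:
      position: 1

outputs:
  echoed_message:        # hypothetical output name, chosen for this sketch
    type: stdout         # collects the file named by the stdout field above
```

With this variant, the final JSON object printed by the runner would no longer be empty; it should describe the captured file under echoed_message.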

Script order

To make the script more readable, the inputs field is put in front of the outputs field. However, CWL syntax requires only that each field is properly defined; it does not require them to be in a particular order.

Changing input text

What do you need to change to print a different text on the command line?

Solution

To change the text on the command line, you only have to change the text in the hello_world.yml file.

For example:

message_text: Good job!

CWL single step workflow

The RNA-seq data from the introduction episode will be used for the first CWL workflow. The first step of RNA-sequencing analysis is a quality control of the RNA reads using the fastqc tool. This tool is already available to use so there is no need to write a new CWL tool description.

This is the workflow file (rna_seq_workflow.cwl).

rna_seq_workflow.cwl

cwlVersion: v1.2
class: Workflow

inputs:
  rna_reads_human: File
  
steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: rna_reads_human
    out: [html_file]

outputs: 
  qc_html:
    type: File
    outputSource: quality_control/html_file

In a workflow the steps field must always be present. The workflow tasks or steps that you want to run are listed in this field. At the moment the workflow only contains one step: quality_control. In the next episodes more steps will be added to the workflow.

Let’s take a closer look at the workflow. First the inputs field will be explained.

inputs:
  rna_reads_human: File

Looking at the CWL script of the fastqc tool, it needs a fastq file as its input. In this example the fastq file consists of human RNA reads. So we call the variable rna_reads_human and it has File as its type. To make this workflow interpretable for other researchers, self-explanatory and sensible variable names are used.

Input and output names

It is very important to give inputs and outputs sensible names. Try not to use variable names like inputA or inputB, because others might not understand what is meant by them.

The next part of the script is the steps field.

steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: rna_reads_human
    out: [html_file]

Every step of a workflow needs a name; the first step of the workflow is called quality_control. Each step needs a run field, an in field and an out field. The run field contains the location of the CWL file of the tool to be run. The in field connects the workflow inputs to the inputs of the fastqc tool. The fastqc tool has an input parameter called reads_file, so reads_file is connected to rna_reads_human. Lastly, the out field is a list of output parameters from the tool to be used. In this example, the fastqc tool produces an output file called html_file.

The last part of the script is the outputs field.

outputs: 
  qc_html:
    type: File
    outputSource: quality_control/html_file

Each output in the outputs field needs its own name. In this example the output is called qc_html. Inside qc_html the type of output is defined. The output of the quality_control step is a file, so the qc_html type is File. The outputSource field refers to where the output comes from; in this example it comes from the step quality_control, where it is called html_file.

When you want to run this workflow, you need to provide a file with the inputs the workflow needs. This file is similar to the hello_world.yml file in the previous section. The input file is called workflow_input.yml.

workflow_input.yml

rna_reads_human:
  class: File
  location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
  format: http://edamontology.org/format_1930

In the input file the values for the inputs that are declared in the inputs section of the workflow are provided. The workflow takes rna_reads_human as an input parameter, so we use the same variable name in the input file. When setting inputs, the class of the object needs to be defined, for example class: File or class: Directory. The location field contains the location of the input file. In this example the last line is needed to provide a format for the fastq file.
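As an illustration of the two classes mentioned above, an input file for a workflow that also declared a Directory input might look like the sketch below. The reference_index parameter and its path are hypothetical and are not part of the workflow in this lesson; they are shown only to contrast class: Directory with class: File.

```yaml
# File input, as used by the workflow in this episode
rna_reads_human:
  class: File
  location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
  format: http://edamontology.org/format_1930

# Hypothetical Directory input (name and path are assumptions for this sketch)
reference_index:
  class: Directory
  location: rnaseq/reference_index
```

Each top-level key has to match the name of an input declared in the inputs section of the workflow being run.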

Now you can run the workflow using the following command:

cwltool rna_seq_workflow.cwl workflow_input.yml

Exercise

Needs some exercises

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Developing Multi-Step Workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can we expand to a multi-step workflow?

  • Iterative workflow development

  • Workflows as dependency graphs

  • How to use sketches for workflow design?

Objectives
  • graph objectives:

  • explain that a workflow is a dependency graph

  • sketch objectives:

  • use cwlviewer online

  • generate Graphviz diagram using cwltool

  • exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper

  • iterate objectives:

  • recognise that workflow development can be iterative i.e. that it doesn’t have to happen all at once

By the end of this episode, learners should be able to explain that a workflow is a dependency graph, sketch their workflow both by hand and with an automated visualizer, and recognise that workflow development can be iterative, i.e. that it doesn’t have to happen all at once.
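The idea of a workflow as a dependency graph can be sketched with two steps, where the second step consumes an output of the first. Everything specific in this sketch is an assumption made for illustration: the tool descriptions tools/first_tool.cwl and tools/second_tool.cwl, the step names, and the parameter names are hypothetical and do not refer to real files in this lesson.

```yaml
cwlVersion: v1.2
class: Workflow

inputs:
  rna_reads_human: File

steps:
  first_step:
    run: tools/first_tool.cwl        # hypothetical tool description
    in:
      reads_file: rna_reads_human    # edge: workflow input -> first_step
    out: [processed_reads]

  second_step:
    run: tools/second_tool.cwl       # hypothetical tool description
    in:
      input_file: first_step/processed_reads   # edge: first_step -> second_step
    out: [summary_file]

outputs:
  summary:
    type: File
    outputSource: second_step/summary_file     # edge: second_step -> workflow output
```

The order in which the steps are written does not determine the order of execution. A workflow engine derives it from the connections: second_step can only start once first_step has produced processed_reads.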

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Resources for Reusing Tools and Scripts

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to find other tools/solutions for awkward problems?

Objectives
  • tools objectives:

  • know good resources for finding solutions to common problems

By the end of this episode, learners should be aware of where they can look for CWL recipes and more help for common, but awkward, tasks.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Documentation and Citation in Workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to document your workflow?

  • How to cite research software in your workflow?

Objectives
  • Documentation Objectives:

  • explain the importance of documenting a workflow

  • use description fields to document purpose, intent, and other factors at multiple levels within their workflow

  • recognise when it is appropriate to include this documentation

  • Citation Objectives:

  • explain the importance of correctly citing research software

  • give credit for all the tools used in their workflow(s)

By the end of this episode, learners should be able to document their workflows to increase reusability and explain the importance of correctly citing research software.

TODO (CITE): define some specific objectives to capture the skills being taught in this section.

See this page.

Finding an identifier for the tool

(Something about permanent identifiers insert here)

When your workflow uses a pre-existing command-line tool, it is good practice to provide a citation for the tool, in addition to the command line it is executed with.

The SoftwareRequirement hint can list named packages that should be installed in order to run the tool. For instance, if you installed the tool through a package manager with apt install bamtools, the package bamtools can be cited in CWL as:

hints:
  SoftwareRequirement:
    packages:
      bamtools: {}

Adding version

Q: bamtools --version prints out blablabla 2.3.1 - how would you indicate in CWL that this is the version of BAMTools the workflow was tested against?

A:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        version: ["2.3.1"]

Adding Permanent identifiers

To help identify the tool across package management systems we can also add permanent identifiers and URLs, for instance to:

These can be added to the specs list:

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]

How to find a RRID permanent identifier

RRID provides identifiers for many commonly used resources and tools in bioinformatics. For instance, a search for BAMtools finds an entry for BAMtools with identifier RRID:SCR_015987 and additional information.

We can transform the RRID into a Permanent Identifier (PID) for use in CWL using http://identifiers.org/ by appending the RRID to https://identifiers.org/rrid/. This makes the PID https://identifiers.org/rrid/RRID:SCR_015987, which resolves to the same SciCrunch entry and can be added to our specs list:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_015987" ]

Note that, as CWL is based on YAML, we use "quotes" to escape these identifiers because they include the : character.

Finding bio.tools identifiers

As an alternative to RRID we can add identifiers from the ELIXIR Tools Registry https://bio.tools/ - for instance https://bio.tools/bamtools

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"

Package manager identifiers

Q: You have used apt install bamtools in the Linux distribution Debian 10.8 “Buster”. How would you identify the Debian package recipe in a CWL SoftwareRequirement, and with which version?

A:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"
          - "https://packages.debian.org/buster/bamtools"
        version: ["2.5.1", "2.5.1+dfsg-3"]

This package repository has a URI for each installable package, which depends on the distribution; here we pick "buster". While the upstream GitHub repository of bamtools has release version v2.5.1, the Debian packaging adds +dfsg-3 to indicate the 3rd repackaging with additional patches, in this case to make the software comply with the Debian Free Software Guidelines (dfsg).

Under version list in CWL we’ll include 2.5.1 which is the upstream version, ignoring everything after + or - according to semantic versioning rules. As an optional extra you can also include the Debian-specific version "2.5.1+dfsg-3" to indicate which particular packaging we tested the workflow with at the time.

Exercise: There is an “obvious” DOI

Q: You have a workflow using bowtie2, how would you add a citation?

A:

hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://doi.org/10.1038/nmeth.1923" ]
        version: [ "1.x.x" ]

RRID for bowtie2

RRID:SCR_005476 -> https://scicrunch.org/resolver/RRID:SCR_005476 (note: this is bowtie, not bowtie2)

https://identifiers.org/rrid/ + RRID -> https://identifiers.org/rrid/RRID:SCR_005476 (PID)

https://bio.tools/bowtie2

http://bioconda.github.io/recipes/bowtie2/README.html vs. https://anaconda.org/bioconda/bowtie2

Giving clues to reader

Authorship/citation of a tool vs the CWL file itself (particularly of a workflow)

Add identifiers under requirements? https://www.commonwl.org/user_guide/20-software-requirements/index.html

SciCrunch - looking up RRID for Bowtie2 Then bio.tools

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829",
                 "http://somethingelse"]
        version: [ "5.21-60" ]

Trickier: Only Github and homepage

s:codeRepository:
hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://github.com/BenLangmead/bowtie2"]
        version: [ "fb688f7264daa09dd65fdfcb9d0f008a7817350f" ]

No version, add commit ID or date instead as version

–> (How to make Your own tool citable?)

Getting credit for your CWL files

NOTE: Difference between credit for this CWL file vs credit for the tool it calls.

s:author: "Me"
s:dateModified: "2020-10-6"
s:version: "2.4.2"
s:license: https://spdx.org/licenses/GPL-3.0

https://www.commonwl.org/user_guide/17-metadata/index.html

Using s:citation?

something like..

s:citation: https://dx.doi.org/10.1038/nmeth.1923

s:url: http://example.com/tools/

s:codeRepository: https://github.com/BenLangmead/bowtie2
$namespaces:
  s: https://schema.org/

$schemas:
 - http://schema.org/version/9.0/schemaorg-current-http.rdf

—> Need new guidance on how to publish workflows, making DOIs in Zenodo, Dockstore etc. https://docs.bioexcel.eu/cwl-best-practice-guide/devpractice/publishing.html https://guides.github.com/activities/citable-code/

How to do it properly to improve findability.

How to publicize CWL tools

CWL workflow descriptions

About how to wire together CommandLineTool steps in a cwl Workflow file.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Debugging Workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • introduce within above lessons?

Objectives
  • interpret commonly encountered error messages

  • solve these common issues

By the end of this episode, learners should be able to recognize and fix simple bugs in their workflow code.

(non-exhaustive) list of possible examples:

Key Points

  • First key point. Brief Answer to questions. (FIXME)