Working with Nextflow

How to create reproducible workflows and computational pipelines

Published

27-Nov-2024

1 Introduction

Nextflow is a workflow management system (WfMS), and is one of the most common such systems within the bioinformatics and academic communities. These systems are important for scientific reproducibility in that they greatly facilitate keeping track of which files have been processed in what way throughout an entire project.

Nextflow is built from the ground up to be portable, scalable, reproducible and platform-agnostic. This means that any workflow you write in Nextflow can be run locally on your laptop, on a computer cluster or on a cloud service (as long as the platform has the necessary computational resources). You can also define the compute environment in which each task is carried out on a per-task basis. You might thus develop your workflow on your local computer using a minimal test dataset, but run the full analyses with all samples on e.g. a computer cluster. Nextflow can work on both files and arbitrary values, often connected in useful and advanced ways.

Nextflow can easily work with dynamic inputs where the exact output is unknown, e.g. the exact number of files or which samples pass some arbitrary quality control threshold. While Nextflow is based on the Groovy language, you don’t need to know how to code Groovy to be able to write good Nextflow workflows. Nextflow has a large community centred around it, including the nf-core curated collection of high quality pipelines used by e.g. the National Genomics Infrastructure.

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/nextflow and activate your nextflow-env Conda environment.

2 The basics

We’ll start by creating a very simple workflow from scratch, to show how Nextflow works: it will take two input files and convert them to UPPERCASE letters.

  • Start by running the following commands:
touch main.nf
echo "This is a.txt" > a.txt
echo "This is b.txt" > b.txt

Open the main.nf file with an editor of your choice. This is the main workflow file used in Nextflow, where workflows and their processes are defined.

  • Copy the following code into your main.nf file:
// Workflow definition
workflow {
    // Define input files
    ch_input = Channel.fromPath( "a.txt" )

    // Run workflow
    CONVERT_TO_UPPER_CASE( ch_input )
}

// Process definition
process CONVERT_TO_UPPER_CASE {
    publishDir "results/",
        mode: "copy"

    input:
    path(file)

    output:
    path("a.upper.txt")

    script:
    """
    tr [a-z] [A-Z] < ${file} > a.upper.txt
    """
}

Here we have two separate parts. The first is the workflow definition, while the second is a process definition. Let’s go through them both in more detail!

Nextflow comments

Double-slashes (//) are used for comments in Nextflow.

Nextflow and whitespace

Nextflow is not indentation-sensitive. In fact, Nextflow doesn’t care at all about whitespace, so go ahead and use it in whatever manner you think is easiest to read and work with! Do keep in mind that indentation and other types of whitespace do improve readability, so it’s generally not a good idea to forego them entirely, even though you can.

2.1 Workflow definitions

workflow {
    // Define input files
    ch_input = Channel.fromPath( "a.txt" )

    // Run workflow
    CONVERT_TO_UPPER_CASE( ch_input )
}

The workflow definition here has two parts, each doing an important job for any Nextflow workflow. The first part defines a channel, which is an asynchronous first-in-first-out stream of data that connects a workflow’s various inputs and outputs. In simpler terms, channels contain the data that you want to process with the workflow and can be passed between the various parts of the workflow.

Channels can be created in various different ways using channel factories, depending on what type of data you want to put into them and where this data is stored. In this particular case we define our ch_input channel using the .fromPath channel factory, which takes a file path as input - here we use the a.txt file. You can thus read ch_input = Channel.fromPath("a.txt") as “create the channel ch_input and send the file a.txt into it”.
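To give a feel for the range of channel factories available, here is a brief, hedged sketch of two other common ones (the channel names are made up for illustration): Channel.of for putting arbitrary values into a channel, and fromPath with a glob pattern for matching several files at once.

// A channel of arbitrary values
ch_values = Channel.of( 1, 2, 3 )

// A channel of all .txt files matching a glob pattern
ch_files = Channel.fromPath( "data/*.txt" )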

Naming channels

A channel can be named anything you like, but it is good practice to prefix channel names with ch_, as that makes it clear which variables are channels and which are just normal variables.

How do we use these channels then? Channels pass data to and from processes through our workflow. By providing channels as arguments to processes, we describe how we want data to flow. This is exactly what we do in the second part: we call our CONVERT_TO_UPPER_CASE process with the ch_input as input argument - this is very similar to functional programming.

This is our entire workflow, for now: the creation of a channel followed by using the contents of that channel as input to a single process. Let’s look at how processes themselves are defined!

2.2 Process definitions

process CONVERT_TO_UPPER_CASE {
    publishDir "results/",
        mode: "copy"

    input:
    path(file)

    output:
    path("a.upper.txt")

    script:
    """
    tr [a-z] [A-Z] < ${file} > a.upper.txt
    """
}

Looking at the process in the code above, we can see several parts. The process block starts with its name, in this case CONVERT_TO_UPPER_CASE, followed by several sections, or directives as Nextflow calls them: publishDir, input, output and script.

Naming processes

A process can be named using any case, but a commonly used convention is to use UPPERCASE letters for processes to visually distinguish them in the workflow. You do not have to follow this if you don’t want to, but we do so here.

Let’s start with the first directive: publishDir. This tells Nextflow where the output of the process should be placed when it is finished. Setting mode to "copy" just means that we want to copy the output files to the publishing directory, rather than using a symbolic link (which is the default).

The input and output directives describe the data expected to come through this specific process. Each line of input describes the data expected for each process argument, in the order used in the workflow. In this case, CONVERT_TO_UPPER_CASE expects a single channel (one line of input), and expects the data to be filenames (i.e. of type path). The script directive is where you put the code that the process should execute.

Notice that there is a difference between how the inputs and outputs are declared: the output is an explicit string (i.e. surrounded by quotes), while the input is a variable named file. This means inputs can be referenced in the process without naming the data explicitly, unlike the output where the name needs to be explicit. We’ll get back to exactly how this works in just a moment. While the name of the input variable here is chosen to be the descriptive file, we could also have chosen something completely different, e.g. banana (we’d also have to change its reference in the script directive).

2.3 Executing workflows

Let’s try running the workflow we just created!

  • Type the following in your terminal:
nextflow run main.nf

This will make Nextflow run the workflow specified in your main.nf file. You should see something along these lines:

N E X T F L O W  ~  version 22.10.6
Launching `./main.nf` [mad_legentil] - revision: 87f0c253ed
executor >  local (1)
[32/9124a1] process > CONVERT_TO_UPPER_CASE (1) [100%] 1 of 1 ✔

The first few lines are information about this particular run, including the Nextflow version used, which workflow definition file was used, a randomly generated run name (an adjective and a scientist), the revision ID as well as where the processes were executed (locally, in this case, as opposed to e.g. SLURM or AWS).

What follows next is a list of all the various processes for this particular workflow. The order does not necessarily reflect the order of execution (depending on each process’ input and output dependencies), but they are in the order they were defined in the workflow file - there’s only the one process here, of course. The first part (e.g. [32/9124a1]) is the process ID, which is also the first part of the subdirectory in which the process is run (the full subdirectory will be something like 32/9124a1dj56n2346236245i2343, so just a longer hash). We then get the process and its name. Lastly, we get how many instances of each process are currently running or have finished. Here we only have the one process, of course, but this will soon change.

  • Let’s check that everything worked: type ls results/ and see that it contains the output we expected.

  • Let’s explore the working directory: change into whatever directory is specified by the process ID (your equivalent to work/32/9124a1[...]).

What do you see when you list the contents of this directory? You should see a symbolic link named a.txt pointing to the real location of this file, plus a normal file a.upper.txt, which is the output of the process that was run in this directory. You generally only move into these work directories when debugging errors in your workflow, and Nextflow has some tricks to make this process a lot easier - more on this later.

So, in summary, we have three components: a set of inputs stored in a channel, a set of processes, and a workflow that defines which processes should be run in what order. We tell Nextflow to push the inputs through the entire workflow, so to speak.

  • Now it’s your turn! Move back to the workflow root and change the workflow so that it uses only the b.txt input file and gives you b.upper.txt instead (you’ll need to change both the channel definition and the hard-coded output filename in the process).

  • Run your workflow and make sure it works before you move on; check below if you’re having trouble.

ch_input = Channel.fromPath( "b.txt" )

2.4 Viewing channel contents

Something that’s highly useful during development of Nextflow workflows is to view the contents of channels, which can be done with the view() operator.

  • Add the following to your workflow definition (on a new line) and execute the workflow: ch_input.view(). What do you see?

  • Remove the view() operator once you’re done.

It can be quite helpful to view the channel contents whenever you’re unsure of what a channel contains or if you’ve run into some kind of bug or error, or even just when you’re adding something new to your workflow. Remember to view the channel contents whenever you need to during the rest of this tutorial!
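As a sketch of what this looks like in context (assuming your workflow currently uses b.txt from the previous exercise), the view() call simply goes on its own line inside the workflow definition:

workflow {
    // Define input files
    ch_input = Channel.fromPath( "b.txt" )

    // Print the channel contents to the terminal
    ch_input.view()

    // Run workflow
    CONVERT_TO_UPPER_CASE( ch_input )
}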

2.5 Files and sample names

One powerful feature of Nextflow is that it can handle complex data structures as input, and not only filenames. One of the more useful things this allows us to do is to couple sample names with their respective data files inside channels.

  • Change the channel definition to the following:
ch_input = Channel
    .fromPath ( "a.txt" )
    .map      { file -> tuple(file.getBaseName(), file) }

Here we use the map operator to create a tuple (something containing multiple parts) from the base name of the file (a) and the file path (a.txt). The statement .map{ file -> tuple(file.getBaseName(), file) } can thus be read as “replace the channel’s contents with a tuple containing the base name and the file path”. The contents of the channel thus change from [a.txt] to [a, a.txt]. Passing the sample name or ID together with the sample data in this way is extremely useful in a workflow context and can greatly simplify downstream processes.

Before this will work, however, we have to change the process itself to make use of this new information contained in the ch_input channel.

  • Change the process definition to the following:
process CONVERT_TO_UPPER_CASE {
    publishDir "results/",
        mode: "copy"

    input:
    tuple val(sample), path(file)

    output:
    path("${sample}.upper.txt")

    script:
    """
    tr [a-z] [A-Z] < ${file} > ${sample}.upper.txt
    """
}

Notice how the input now is aware that we’re passing a tuple as input, which allows us to use both the file variable (as before) and the new sample variable. All that’s left now is to change the input to our pipeline!

  • Change the channel definition line from .fromPath ( "a.txt" ) to .fromPath ( ["a.txt", "b.txt"] ) and try running the pipeline. Make sure it works before you move on! Remember to use the view() operator if you want to inspect the channel contents in detail.

2.6 Input from samplesheets

So far we’ve been specifying inputs using strings inside the workflow itself, but hard-coding inputs like this is not ideal. A better solution is to use samplesheets instead, e.g. comma- or tab-separated data files; this is standard for many pipelines, including nf-core. Take, for example, the following CSV file:

a,a.txt
b,b.txt

This specifies the samples and their respective files on each row. Using such a file is much more portable, scalable and overall easier to use than simply hard-coding things in the workflow definition itself. We might also include an arbitrary number of additional metadata columns, useful for downstream processing and analyses. Using contents of files as input can be done using the .splitCsv() and .map{} operators, like so:

ch_input = Channel
    .fromPath ( "first_samplesheet.csv" )
    .splitCsv ( )
    .map      { row -> tuple(row[0], file(row[1])) }

The .splitCsv() operator lets the channel know the input is a CSV file, while the .map{} operator turns the CSV content into a tuple from the first and second elements of each row.

  • Change the input channel definition to the code above and create the first_samplesheet.csv file as shown above.

  • Add the .view() operator somewhere to show the contents of ch_input.

  • Execute the pipeline. Do you see what you expect? Remove the .view() operator before moving on.

Note

While we are still hard-coding the name of the samplesheet, it is much better to edit a samplesheet than to edit the pipeline itself - there are also convenient ways to work around this using parameters, which we’ll talk more about later in this tutorial.

We can also specify a header in our samplesheet like so: .splitCsv(header: true). This will allow us to reference the columns using their names instead of their index, e.g. row.col1 instead of row[0].

  • Add an appropriate header to your samplesheet, make sure your workflow can read it and execute. Use .view() to see what’s going on, if needed.
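If you need a hint, here is one possible solution (the column names sample and file are just one choice):

// [ first_samplesheet.csv ]
sample,file
a,a.txt
b,b.txt

// [ main.nf ]
ch_input = Channel
    .fromPath ( "first_samplesheet.csv" )
    .splitCsv ( header: true )
    .map      { row -> tuple(row.sample, file(row.file)) }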

2.7 Adding more processes

It’s time to add more processes to our workflow! We have the two files a.upper.txt and b.upper.txt; the next part of the workflow is a step that concatenates the content of all these UPPERCASE files.

We already have a channel containing the two files we need: the output of the CONVERT_TO_UPPER_CASE process called CONVERT_TO_UPPER_CASE.out. We can use this output as input to a new process using the syntax: CONVERT_TO_UPPER_CASE.out.collect(). The collect() operator groups all the outputs in the channel into a single data object for the next process. This is a many-to-one type of operation: a stream with several files (many) is merged into a lone list of files (one). If collect() was not used, the next process would try to run a task for each file in the output channel.
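If you want to convince yourself of this behaviour, a quick (and temporary) sketch is to view the output channel with and without collect() inside the workflow definition:

// One item per file: a downstream process would get one task per file
CONVERT_TO_UPPER_CASE.out.view()

// A single item containing all files: a downstream process gets a single task
CONVERT_TO_UPPER_CASE.out.collect().view()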

Let’s put this in use by adding a new process to the workflow definition. We’ll call this process CONCATENATE_FILES and it will take the output from CONVERT_TO_UPPER_CASE as input, grouped using the collect() operator.

  • Add a line to your workflow definition for this new process with the appropriate input - remember that you can use .view() to check channel contents; click below if you’re having trouble.
CONCATENATE_FILES( CONVERT_TO_UPPER_CASE.out.collect() )

Now all we have to do is define the actual CONCATENATE_FILES process in the process definition section.

  • Copy the following code as a new process into your workflow:
process CONCATENATE_FILES {
    publishDir "results/",
        mode: "copy"

    input:
    path(files)

    output:
    path("*.txt")

    script:
    """
    cat ${files} > concat.txt
    """
}
  • Run your workflow again and check the results/ directory. At this point you should have three files there: a.upper.txt, b.upper.txt and concat.txt.

  • Inspect the contents of concat.txt - do you see everything as you expected?

Note the use of path(files) as input. Although we pass a list of files as input, the list is considered a single object, and so the files variable references a list. Each file in that list can be individually accessed using an index, e.g. ${files[0]}, or, as we do here, the variable can be used without an index to list all the input files.

Quick recap

In this section we’ve learnt:

  • How to create, execute and extend workflows
  • How to explore the work directory and channel contents
  • How to couple sample names to sample data files
  • How to use samplesheets as input
  • How to collect multiple files as single inputs for processes

3 Executing workflows

It’s time to start working with a more realistic workflow using the MRSA case study of this course! We’ve created a bare-bones version of this pipeline for you, but we’ll work our way through it as we go along and learn more about Nextflow’s features and functionality. The MRSA workflow looks like this:

workflow {

    // Workflow for generating count data for the MRSA case study

    // Get input files from a samplesheet
    ch_input = Channel
        .fromPath ( "samplesheet.csv" )
        .splitCsv ( header: true)

    // Define the workflow
    DOWNLOAD_FASTQ_FILES (
        ch_input
    )
    RUN_FASTQC (
        DOWNLOAD_FASTQ_FILES.out
    )
    RUN_MULTIQC (
        RUN_FASTQC.out[1].collect()
    )
    GET_GENOME_FASTA ()
    INDEX_GENOME (
        GET_GENOME_FASTA.out.fasta
    )
    ALIGN_TO_GENOME (
        DOWNLOAD_FASTQ_FILES.out,
        INDEX_GENOME.out.index
    )
    SORT_BAM (
        ALIGN_TO_GENOME.out.bam
    )
    GET_GENOME_GFF3 ()
    GENERATE_COUNTS_TABLE (
        SORT_BAM.out.bam.collect(),
        GET_GENOME_GFF3.out.gff
    )
}

The workflow has one input channel named ch_input, which reads input from the samplesheet.csv file. We then define the processes to be executed by this workflow, nine in total. The first process (DOWNLOAD_FASTQ_FILES) takes the ch_input channel as input, while the rest of the processes take the output of previous processes as input. Before we go into more detail regarding the ins and outs of this workflow, let’s start with some specifics of how workflows are executed and what you can get from them.

3.1 Reports and visualisations

Let’s start with running the workflow plus getting some reports and visualisation while we’re at it!

  • Run the workflow using the following command: nextflow run main_mrsa.nf -with-report report.html -with-timeline timeline.html -with-dag dag.png.

After successful execution, you will find three more files in your current directory: report.html, timeline.html and dag.png. The first file contains a workflow report, which includes various information regarding execution such as runtime, resource usage and details about the different processes. The second file contains a timeline showing how long each individual process took to execute, while the last contains a visualisation of the workflow itself.

Take a few minutes to browse these files for yourself. When running a workflow you can of course choose which of these additional files you want to include by picking which ones are important or interesting to you - or don’t include any!

3.2 Logs

Nextflow keeps a log of all the workflows that have been executed. Let’s check it out!

  • Type nextflow log to get a list of all the executions.

Here we get information about when the workflow was executed, how long it ran, its run name, whether it succeeded or not and what command was used to run it. You can also use nextflow log <run name> to show the working directories of each task that was executed in that run, and you can supply the -f (or -fields) flag along with additional fields to show.

  • Run nextflow log <run name> -f hash,name,exit,status

This shows us not only the beginning of each task’s working directory, but also its name, exit code and status (i.e. if it completed successfully or failed in some manner).

Listing fields

If you want to see a complete list of all the fields you might explore using the log, just type nextflow log -l or nextflow log -list-fields. This is highly useful for debugging when there’s some specific information about a run you’re particularly interested in!

We can also get even more detailed information about the latest run by looking into the .nextflow.log file!

  • Look into the latest log by typing less .nextflow.log.

You’ll be greeted by a wealth of debugging information, which may even seem a bit overkill at this point! This level of detail is, however, quite useful both as a history of what you’ve attempted and as an additional help when you run into errors! Also, it helps with advanced debugging - which we’ll get into later.

3.3 Re-running workflows

Something you often want to do in Nextflow (or any WfMS for that matter) is to re-run the workflow after you’ve changed some input files or some of the code for its analyses, but without having to re-run the entire workflow from start to finish. Let’s find out how this works in Nextflow!

  • Run the same nextflow run main_mrsa.nf command again.

You’ll notice that Nextflow actually re-ran the entire workflow from scratch, even though we didn’t change anything. This is the default behaviour of Nextflow.

  • Let’s try that again: nextflow run main_mrsa.nf -resume instead.

Now you can see that Nextflow didn’t actually re-run anything. The -resume flag instructed Nextflow to use the cached results from the previous run!

Nextflow automatically keeps track of not only changes to input files, but also changes to code, process definitions and scripts. You can thus change anything relating to your workflow and just re-run with the -resume flag and be sure that only processes relevant to your changes are executed again!

  • Use tree work/ to list the contents of the work directory.

Because Nextflow keeps track of all the runs, we’ve now got two sets of files in the work directory. One set from the first run, and another from the second run. This can take up valuable space, so let’s clean that up.

  • Use nextflow clean -n -before <run_name> to show which work directories will be cleaned up (use nextflow log to find the run name if you don’t remember it). Then delete those directories by changing -n (dry-run) to -f (force).

Nextflow’s clean subcommand can be used to clean up old working directories, e.g. those of failed tasks or runs you no longer need. Here we used the -before flag, meaning that any runs before the specified run are removed; use nextflow help clean to see other options for cleaning. This is the preferred way to clean up the working directory.

  • Remove the results directory and re-run the workflow again using the -resume flag.

We removed all the results we used before, but we still managed to resume the workflow and use its cache - how come? Remember that Nextflow uses the work directory to run all of its tasks, while the results directory is just where we have chosen to publish our outputs. We can thus delete the results directory as often as we like (a necessity when output filenames are changed) and still get everything back without having to re-run anything. If we were to delete the work directory, however…

  • Delete the work directory and re-run the workflow using the -resume flag.

There is no longer any cache for Nextflow to use, so it re-runs from the start! This is good to keep in mind: you can always delete the output directories of your workflow, but if you mess with work you’ll lose, well… work!

Quick recap

In this section we’ve learnt:

  • How to get automatic reports and visualisations
  • How to check the Nextflow logs
  • How to re-run workflows
  • How to clean the Nextflow cache

4 Working with processes

Now that we’ve gone through the specifics of executing workflows in a bit more detail, let’s go through working with processes. While there are numerous process directives that can be used, we’ll go through some of the more commonly used ones here.

4.1 Tags

Let’s look at the command line output we got during the workflow’s execution, which should look something like this:

N E X T F L O W  ~  version 22.10.6
Launching `./main.nf` [friendly_bhaskara] - revision: b4490b9201
executor >  local (17)
[c9/e5f818] process > DOWNLOAD_FASTQ_FILES (SRR935092) [100%] 3 of 3 ✔
[d5/b5f24e] process > RUN_FASTQC (SRR935092)           [100%] 3 of 3 ✔
[91/2cea54] process > RUN_MULTIQC                      [100%] 1 of 1 ✔
[e0/b4fd37] process > GET_GENOME_FASTA                 [100%] 1 of 1 ✔
[87/32ce10] process > INDEX_GENOME                     [100%] 1 of 1 ✔
[56/e9a460] process > ALIGN_TO_GENOME (SRR935092)      [100%] 3 of 3 ✔
[ed/d8c223] process > SORT_BAM (SRR935092)             [100%] 3 of 3 ✔
[e7/4a6bda] process > GET_GENOME_GFF3                  [100%] 1 of 1 ✔
[e9/84f093] process > GENERATE_COUNTS_TABLE            [100%] 1 of 1 ✔

Have you noticed that there are SRA IDs after some of the processes? Well, if you look at which processes show these SRA IDs you might see that it’s only those processes that are executed three times, i.e. once per SRA ID. This doesn’t happen automatically, however, and comes from something called tags. Let’s look at the DOWNLOAD_FASTQ_FILES process:

process DOWNLOAD_FASTQ_FILES {

    // Download a single-read FASTQ file from the SciLifeLab Figshare remote

    tag "${sra_id}"
    publishDir "results/data",
        mode: "copy"

    input:
    tuple val(sra_id), val(figshare_link)

    output:
    tuple val(sra_id), path("*.fastq.gz")

    script:
    """
    wget ${figshare_link} -O ${sra_id}.fastq.gz
    """
}

You can see the tag directive at the very top of the process definition. Tags can be used to e.g. show information about the sample currently being analysed by the process. This is useful both during run-time (allowing you to see which sample is being processed) but also for debugging or finding problematic samples in case of errors or odd output. There is, naturally, no need to use tags for processes which are only run once.

  • Comment out (prefix with //) the tag directive from the DOWNLOAD_FASTQ_FILES process and run the workflow again. What do you see?

Without the tag directive you should instead see the numbers 1 through 3, representing the input files (of which there are three). Nextflow still tells us that it’s working on one of the input files, but it’s generally much more useful to actually see the sample name or ID, rather than just a number.

  • Uncomment the tag directive before you move on.

4.2 Named outputs

Let’s move on to the next process! It looks like this:

process RUN_FASTQC {

    // Run FastQC on a FASTQ file.

    tag "${sample}"
    publishDir "results/",
        mode: "copy"

    input:
    tuple val(sample), path(fastq)

    output:
    path("*.html")
    path("*.zip")

    script:
    """
    fastqc ${fastq} -q
    """
}

Here is a process with two output channels! One contains all the .html files, while the other contains all the .zip files. How is this handled in the workflow definition of downstream processes that use the outputs? The RUN_MULTIQC process uses this output, and its part in the workflow definition looks like this:

RUN_MULTIQC (
    RUN_FASTQC.out[1].collect()
)

We already know about .out and .collect(), but we have something new here: the RUN_MULTIQC process is taking the second channel of the output from the RUN_FASTQC process - [1] is the index for the second channel, as Groovy is zero-based (the first channel is indexed by [0]).

This comes with some issues, however. What if we accidentally changed the order of the outputs in the process, or added a new one? Using positions like this is easy to mess up, but there is a better solution: named outputs! This can be achieved by adding the emit option for some or all of the outputs, like so:

output:
path("*.txt"), emit: text

Instead of referring to the output by its position in an array as before, we refer to the channel with a label of our choosing (.out.text). This benefits us in that we can infer more about the channel’s contents when it is called text rather than [1], and it also makes us less error-prone when rewriting parts of a workflow.

  • Your turn! Add named outputs to the RUN_FASTQC process and make RUN_MULTIQC use those outputs. You’ll have to change both the output section of the RUN_FASTQC process, and the workflow definition section for RUN_MULTIQC. If you need help, see the hint below.
// Workflow definition for RUN_MULTIQC
RUN_MULTIQC (
    RUN_FASTQC.out.zip.collect()
)

// Output section of RUN_FASTQC
output:
path("*.html"), emit: html
path("*.zip"),  emit: zip

Check if it works by executing the workflow.

4.3 Advanced publishing

So far we’ve only used the publishDir directive in a very simple way: specifying a directory and the mode to use when publishing (to copy the files rather than symbolically link them). There are more things you can do, however, especially for processes with more than one output. For example, we can publish outputs in separate directories, like so:

publishDir "results/tables",
    pattern: "*.tsv",
    mode: "copy"
publishDir "results/logs",
    pattern: "*.log",
    mode: "copy"

In this example, *.tsv files are copied to the folder results/tables/, while *.log files are copied to the folder results/logs. The publishDir directive can be used multiple times in a single process, allowing one to separate output as above, or publish the same output to multiple folders.

  • Edit the RUN_FASTQC process to place the HTML and compressed files in separate directories. Remove the results directory and re-run the workflow to check that it worked - click below if you’re having trouble.
process RUN_FASTQC {

    (...)

    publishDir "results/fastqc/html",
        pattern: "*.html",
        mode: "copy"
    publishDir "results/fastqc/zip",
        pattern: "*.zip",
        mode: "copy"

    (...)
}

Note that an output and a published output are different things: something can be an output of a process without being published. In fact, the RUN_FASTQC process is a prime example of this! Think about the compressed output: this output is only used by the downstream process RUN_MULTIQC and is never meant to be viewed by a human or used by a human in some downstream task not part of the pipeline itself. We would thus like to keep the compressed files as an output, but not publish said output. How do we do this? Just remove the corresponding publishDir directive!
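For the RUN_FASTQC process this could look something like the following sketch, building on the named outputs and publishing directories from above: both outputs are kept, but only the HTML reports are published.

process RUN_FASTQC {

    (...)

    publishDir "results/fastqc/html",
        pattern: "*.html",
        mode: "copy"

    output:
    path("*.html"), emit: html
    path("*.zip"),  emit: zip

    (...)
}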

The MRSA workflow we’ve made here was refactored directly from its original version in the Snakemake tutorial of this course, which means that its output structure is not fully taking advantage of some of Nextflow’s functionality. The compressed output we’ve already talked about above is one example.

  • See if you can find any other processes in the current implementation of the MRSA workflow that you could optimise like this!

Think about whether all processes actually need to have published outputs. Make sure you test executing the workflow after you’ve made any changes; click below if you want a hint.

The GET_GENOME_FASTA and GET_GENOME_GFF3 processes both download reference files which are only needed by the workflow itself and do not need to be published; the same goes for the genome index generated by the INDEX_GENOME process.

One could argue that neither of the BAM files generated by the ALIGN_TO_GENOME and SORT_BAM processes are needed by the user if only the final counts table is of interest, but BAM files can also be useful for exploring the alignments in e.g. IGV. There is, however, no need to publish both BAM files: only the sorted one should be published if one is interested in BAM files.

4.4 Debugging

It is, sadly, inevitable that we all make mistakes while coding - nobody’s perfect! Nextflow helps you quite a bit when this happens, not just with its logs but also with informative error messages. Let’s introduce an error and look at what we get:

  • Change the final output line in the RUN_MULTIQC process to the following and re-run the workflow: path("multiqc_general_stats.csv") - notice the usage of .csv rather than .txt as before.

We got an error! We also get a fair amount of information, including (in order from the top) the name of the process that gave the error, the likely cause, the command that was executed, its exit status, its output and error, and the work directory that the task was run in. Let’s focus on the Caused by: part at the top, which should look something like this:

Caused by:
  Missing output file(s) `multiqc_general_stats.csv` expected by process `RUN_MULTIQC`

We can also see that the command’s exit status is 0, which means that the command was successful; any exit status other than 0 means there was an error of some kind. We can thus infer that the command (1) worked, but (2) failed to give us the output expected by Nextflow. Thankfully, Nextflow graciously prints the work directory for us so that we may check out what happened in more detail.

  • Copy the working directory path, cd into it and list its contents using ls.

You might already have spotted the error in the message above; the error we introduced here was that the expected output file has a .csv extension, rather than the correct .txt. Nextflow is expecting the .csv output, but the process script directive is (correctly) giving us the .txt file, which we can see inside the process’ work directory.

  • Go back to the root directory, revert the error you introduced and re-run the workflow to make sure it works again.

This might have seemed like a trivial error, but a lot of errors in Nextflow can be solved in the same manner, i.e. by just following the debugging output reported by Nextflow and inspecting the specific subdirectory in question.

A note about Bash

If you are using Bash variables inside the script directive you have to be careful to prepend them with a backslash, e.g. \${BASH_VARIABLE}. This is because the dollar sign is used by Nextflow, so you have to tell Nextflow explicitly when you’re using a Bash variable. This is a common source of errors when using Bash variables, so keeping it in mind can save you some debugging time!
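Here is a minimal sketch of the difference, assuming a made-up process that has a sample input variable:

script:
"""
# Nextflow variable, evaluated by Nextflow before the script is run
echo "Processing ${sample}"

# Bash variable, escaped with a backslash so that Bash evaluates it at run-time
THREADS=4
echo "Using \${THREADS} threads"
"""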

Quick recap

In this section we’ve learnt:

  • How to use the tag directive
  • How to use named output with emit
  • How to publish outputs into different directories
  • How to debug errors and mistakes

5 Workflow configuration

We’ve so far been working with a relatively non-generalised workflow: it’s got hard-coded inputs, paths and genome references. This is perfectly fine for a project that is purely aimed at getting reproducible results (which is the full extent of what you want in a lot of cases), but it can be made a lot more generalisable. Let’s go through the MRSA workflow and see what can be improved!

5.1 Parameters

One of the things that allow generalisability of Nextflow workflows is parameters, which hold information and values that can be changed directly on the command-line at the time of execution. One use of parameters in our MRSA workflow is to remove the hard-coded results output directory, for example. Parameters can be written in the following form:

params {
    parameter_1 = "some/data/path"      // A string parameter
    parameter_2 = 42                    // A value parameter
    parameter_3 = ["a", "b", "c", "d"]  // A list parameter
}

You would then refer to these parameters using e.g. params.parameter_1 anywhere you need to in the workflow. Although parameters can be defined in main_mrsa.nf, it is preferable to define them in a separate configuration file. The default name of this file is nextflow.config and if such a file is present it will be used automatically by Nextflow (to supply a config file with another name use nextflow -c <path-to-config-file> run main_mrsa.nf).

  • Create a configuration file and add a parameter for the results output directory.

  • Use your newly created parameter in the publishDir directive of a process, then run your workflow to see if it worked; click below if you need help.

// Configuration file
params {
    outdir = "results"
}

// A publishDir directive in a process
publishDir "${params.outdir}",
    mode: "copy"

5.2 Command line parameters

Workflow parameters can be assigned on the command-line by executing workflows like so: nextflow run main_mrsa.nf --parameter_name 'some_value'. The workflow parameter parameter_name is prefixed by a double dash -- to tell Nextflow this is a parameter to the workflow (a single dash is a parameter to Nextflow, e.g. -resume). The value is also quoted (this is important for parameters that take file paths as values).

  • Run your workflow using the parameter you previously created, but pick something other than the default value!

You should now have a new directory containing all the results! This is highly useful if you want to keep track of separate runs of a workflow with different software parameters, for example: nextflow run main.nf --important_param 'value1' --resultsdir 'results-value1', or simply want to keep the results of separate versions of the same workflow. You can also change parameters by using the -params-file option or by using another configuration file (and using -c), rather than on the command line!
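As a sketch of the -params-file approach (the file name and contents are just examples, assuming you named your output directory parameter outdir as in the hint above), you can collect parameters in a YAML or JSON file:

# Contents of params.yml
outdir: "results-params-file"

and then execute the workflow with nextflow run main_mrsa.nf -params-file params.yml.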

5.3 Configuring inputs

Remember the input for the MRSA workflow, the ch_input channel? This input (the samplesheet.csv file) is hard-coded inside the main_mrsa.nf file. This could also be made into a parameter!

  • Change the definition of the ch_input channel to take the value of a new parameter of your choice, defined in the configuration file.

You should now have a more generalised input to your workflow! Try to run it to make sure it works - look below if you need some help.

// Channel definition
ch_input = Channel
    .fromPath ( params.input )
    .splitCsv ( header: true )

// Configuration file
input = "samplesheet.csv"

By specifying inputs from sample sheets like this we can change the inputs of a workflow execution by creating another sample sheet and specifying e.g. --input samplesheet-2.csv on the command line. This is highly useful when you want to run a single sample, e.g. when testing a workflow, or when you want to keep track of all the different inputs you’ve used historically.

5.4 Other configuration scopes

There are lots of things that you might want to add to your configuration, not just parameters! The workflow manifest, for example, which might look like this:

manifest {
    name        = "My Workflow"
    description = "My awesome workflow, created by me"
    author      = "Me"
    mainScript  = "main.nf"
    version     = "1.0.0"
}
  • Go ahead and add a workflow manifest to your nextflow.config file!

The manifest is useful when you’re publishing or sharing the workflow through e.g. GitHub or similar. There are many more such configuration scopes that you might want to use - read more about them in the documentation.

Quick recap

In this section we learnt:

  • How to create parameters in a configuration file
  • How to specify parameters on the command line
  • How to add workflow manifest and other configuration scopes

6 Optimising the MRSA workflow

We just added several parameters and configurations to our MRSA workflow, but we didn’t do anything about the reference genomes: those are still hard-coded. The current MRSA workflow is, in fact, not very well-optimised for Nextflow at all, being a refactor from the Snakemake tutorial of this course.

All of the processes are basically unchanged, excluding some minor alterations. For example, the run_fastqc rule in Snakemake used the -o flag to specify that the results should be in the current directory, followed by moving the output files to their respective output directory. The first part is not needed in Nextflow (as everything is run in its own subdirectory), and the second part is done by the publishDir directive. These are just minor alterations, though; we can do much more if we fully utilise Nextflow’s features!

6.1 Remote files

One of these features is the ability to automatically download remote files, without needing to explicitly do so! The path input type can handle either file paths (like we’ve done so far) or URIs with a supported protocol (such as http://, s3://, ftp://, etc.). This would be highly useful for e.g. the GET_GENOME_FASTA process - in fact, we don’t even need that process at all! All we need to do is to change the input to the INDEX_GENOME and ALIGN_TO_GENOME processes.

  • Create a new input channel using the fromPath() channel factory and the absolute path (the FTP address) to the genome FASTA.

  • Make the INDEX_GENOME process use that input channel instead of the previously used output of the GET_GENOME_FASTA process.

  • Remove the GET_GENOME_FASTA process, as it is not needed anymore.

Re-run the workflow to see if it worked. Check the code below for an example if you’re stuck:

// Channel creation
ch_genome_fasta = Channel.fromPath( "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz" )

// Workflow definition
INDEX_GENOME (
    ch_genome_fasta
)

We could also do this using parameters from our config file, of course!

  • Now change the input to the GENERATE_COUNTS_TABLE process to use the remote GFF3 file and remove the GET_GENOME_GFF3 process in the same manner as above, but use a new parameter instead.

Re-run the workflow again to make sure it worked; check below if you’re stuck.

// [ nextflow.config ]
params {
    genome_gff3 = "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz"
}

// [ main.nf ]
// Channel creation
ch_genome_gff3 = Channel.fromPath ( params.genome_gff3 )

// Workflow definition
GENERATE_COUNTS_TABLE (
    SORT_BAM.out.bam.collect(),
    ch_genome_gff3
)

If we want to get detailed we can also change the hard-coded “NCT8325” naming in e.g. the INDEX_GENOME process and put that in another parameter, or grab the base name from the channel with getBaseName() and make a [prefix, file] tuple using the map{} operator like we did previously; check below if you’re curious about how this could be done.

// Channel definition
ch_genome_fasta = Channel
    .fromPath( "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz" )
    .map     { file -> tuple(file.getBaseName(), file) }

// INDEX_GENOME process definition
process INDEX_GENOME {

    publishDir "results/bowtie2/",
        mode: "copy"

    input:
    tuple val(fasta_name), path(fasta)

    output:
    path("*.bt2"), emit: index

    script:
    """
    # Bowtie2 cannot use .gz, so unzip to a temporary file first
    gunzip -c ${fasta} > tempfile
    bowtie2-build tempfile ${fasta_name}
    """
}

6.2 Subworkflows

The DSL2 allows highly modular workflow design, where a workflow may contain multiple subworkflows. A subworkflow is just like a normal workflow, but it can be called inside other workflows, similar to a process. There is thus no special difference between a subworkflow and a workflow; the only difference is how you use them in practice. Let’s take a look at a toy example:

workflow {
    ch_input = Channel.fromPath ( params.input )
    SUBWORKFLOW (
        ch_input
    )
}

workflow SUBWORKFLOW {

    take:
    input_file

    main:
    ALIGN_READS( input_file )

    emit:
    bam = ALIGN_READS.out.bam
}

Here we have an unnamed, main workflow like before, plus a named subworkflow. A workflow can have inputs specified by the take directive, which is the equivalent of process input for workflows. The main part is the workflow body, which contains how to run which processes in which order. The last part, emit, also works the same as for processes, in that we name the different outputs of the workflow so that we may use them in other workflows or processes. Nextflow will run the unnamed workflow by default, unless the -entry flag is specified, like so:

nextflow run main.nf -entry SUBWORKFLOW

This will run the workflow named SUBWORKFLOW, but nothing else. You can also store subworkflows in separate files, so that everything doesn’t have to be crammed into a single main.nf file. A subworkflow named SUBWORKFLOW contained in the file subworkflow.nf can be loaded into a main.nf file like so:

include { SUBWORKFLOW } from "./subworkflow.nf"

If you have a complex workflow with several subworkflows you might thus store them in a separate directory, e.g. subworkflows/. This allows you to have fine-grained control over the general architecture of your Nextflow workflows, organising them in a manner that is easy to code and maintain. A process can also be treated in the same manner, and defined separately in another file.
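For example, a single process stored in its own file could be included in much the same way as a subworkflow (the file path here is hypothetical):

include { RUN_FASTQC } from "./processes/run_fastqc.nf"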

  • Now it’s your turn! Separate the RUN_FASTQC and RUN_MULTIQC processes out of the main workflow and into a subworkflow. Check below if you’re having trouble.
// [ main.nf ]
// Include subworkflow
include { QUALITY_CONTROLS } from "./subworkflows/quality_controls.nf"

// Main workflow
QUALITY_CONTROLS (
    DOWNLOAD_FASTQ_FILES.out
)

// [ subworkflows/quality_controls.nf ]
// Quality controls subworkflow
workflow QUALITY_CONTROLS {

    take:
    fastq

    main:
    RUN_FASTQC (
        fastq
    )
    RUN_MULTIQC (
        RUN_FASTQC.out.zip.collect()
    )

    emit:
    html          = RUN_MULTIQC.out.html
    general_stats = RUN_MULTIQC.out.general_stats
}

// [ Include RUN_FASTQC and RUN_MULTIQC processes here ]

If you want to challenge yourself, try to do the same with the INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes! Be careful of where you get your inputs and outputs; check below if you want one of the ways in which you can do this:

// [ main.nf ]
// Include subworkflow
include { ALIGNMENT } from "./subworkflows/alignment.nf"

// Main workflow
ALIGNMENT (
    ch_genome_fasta,
    DOWNLOAD_FASTQ_FILES.out
)

// [ subworkflows/alignment.nf ]
// Alignment subworkflow
workflow ALIGNMENT {

    take:
    fasta
    fastq

    main:
    INDEX_GENOME (
        fasta
    )
    ALIGN_TO_GENOME (
        fastq,
        INDEX_GENOME.out.index
    )
    SORT_BAM (
        ALIGN_TO_GENOME.out.bam
    )

    emit:
    bam = SORT_BAM.out.bam
}

// [ Include INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes here ]
Quick recap

In this section we learnt:

  • How to automatically download remote files
  • How to create and work with subworkflows

7 Extra material

There are many more things you can do with Nextflow than covered here. If you are interested in learning more about Nextflow, this section briefly shows some of its more advanced features.

7.1 Using containers in Nextflow

Nextflow has built-in support for using both Docker and Apptainer containers (and others too), either with a single container for the workflow as a whole or separate containers for each individual process. The simplest way to do it is to have a single container for your entire workflow, in which case you simply run the workflow and specify the image you want to use, like so:

# Run with docker
nextflow run main.nf -with-docker image-name

# Run with Apptainer
nextflow run main.nf -with-apptainer image.sif

If you don’t want to supply this at every execution, you can also add it directly to your configuration file:

// Docker configuration
process.container = 'image-name'
docker.enabled = true

// Apptainer configuration
process.container = 'path/to/image.sif'
apptainer.enabled = true

If you instead would like to have each process use a different container you can use the container directive in your processes:

process PROCESS_01 {
    (...)
    container 'image_01'
    (...)
}

process PROCESS_02 {
    (...)
    container 'image_02'
    (...)
}

Regardless of which solution you go for, Nextflow will execute all the processes inside the specified container. In practice, this means that Nextflow will automatically wrap your processes and run them by executing the Docker or Apptainer command with the image you have provided.

7.2 Using Conda in Nextflow

While you can execute Nextflow inside Conda environments just like you would any other type of software, you can also use Conda with Nextflow in the same way as for Docker and Apptainer above. You can either supply an environment.yml file, the path to an existing environment or the packages and their versions directly in the conda directive, like so:

process PROCESS_01 {
    (...)
    conda 'mrsa-environment.yml'
    (...)
}
process PROCESS_02 {
    (...)
    conda 'path/to/mrsa-env'
    (...)
}
process PROCESS_03 {
    (...)
    conda 'bioconda::bwa=0.7.17 bioconda::samtools=1.13'
    (...)
}

You can use either of the methods described above with your configuration file as well, here exemplified using an environment.yml file:

process.conda = 'mrsa-environment.yml'

7.3 Running Nextflow on Uppmax

A lot of researchers in Sweden are using the Uppmax computer cluster in Uppsala, which is easily handled by Nextflow. What you need to do is to add the following profile to your nextflow.config file:

profiles {
    // Uppmax general profile
    uppmax {
        params {
            account        = null
        }
        process {
            executor       = 'slurm'
            clusterOptions = "-A '${params.account}'"
            memory         = { 6.GB * task.attempt }
            cpus           = { 1 * task.attempt }
            time           = { 10.h * task.attempt }
            scratch        = '$SNIC_TMP'
            errorStrategy  = 'retry'
            maxRetries     = 1
        }
    }
}

This will add a profile to your workflow, which you can access by running the workflow with -profile uppmax. You will also have to supply an extra parameter account which corresponds to your SNIC project account, but the rest you can leave as-is, unless you want to tinker with e.g. compute resource specifications. That’s all you need! Nextflow will take care of communications with SLURM (the system used by Uppmax, specified by the executor line) and will send off jobs to the cluster for you, and everything will look exactly the same way as if you were executing the pipeline locally.

The memory, cpus and time lines define the various resources Nextflow will use as well as how much to automatically increase them by if re-trying failed tasks; this, in turn, is specified by the errorStrategy and maxRetries variables. The scratch variable defines where each node’s local storage is situated, which gives Nextflow the most optimal access to the Uppmax file system for temporary files.
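Executing the MRSA workflow with this profile could then look something like the following, using the same placeholder convention as before for the account:

nextflow run main_mrsa.nf -profile uppmax --account '<your-project-id>'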

7.4 Advanced channel creation

The input data shown in the MRSA example workflow is not that complex, but Nextflow channels can do much more than that. A common scenario in high-throughput sequencing is that you have pairs of reads for each sample. Nextflow has a special, built-in way to create channels for this data type: the fromFilePairs channel factory:

ch_raw_reads = Channel
    .fromFilePairs ( "data/*_R{1,2}.fastq.gz" )

This will create a channel containing all the reads in the data/ directory in the format <sample>_R1.fastq.gz and <sample>_R2.fastq.gz and will pair them together into a nested tuple looking like this:

[sample, [data/sample_R1.fastq.gz, data/sample_R2.fastq.gz]]

The first element of the tuple ([0]) thus contains the value sample, while the second element ([1]) contains another tuple with paths to both read files. This nested tuple can be passed into processes for e.g. read alignment, and it makes the entire procedure of going from read pairs (i.e. two separate files, one sample) into a single alignment file (one file, one sample) very simple. For more methods of reading in data see the Nextflow documentation on Channel Factories.
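As a hedged sketch of how such a [sample, [R1, R2]] structure could be consumed by a process (the process and the my_aligner tool are made up for illustration):

process ALIGN_PAIRED_READS {

    input:
    tuple val(sample), path(reads)

    output:
    path("${sample}.sam")

    script:
    """
    # my_aligner is a placeholder; reads[0] and reads[1] are the R1 and R2 files
    my_aligner --in1 ${reads[0]} --in2 ${reads[1]} > ${sample}.sam
    """
}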

We can also do quite advanced things to manipulate data in channels, such as this:

samples_and_treatments = Channel
    .fromPath ( params.metadata )
    .splitCsv ( sep: "\t", header: true )
    .map      { row -> tuple("${row.sample_id}", "${row.treatment}") }
    .filter   { id, treatment -> treatment != "DMSO" }
    .unique   ( )

That’s a bit of a handful! But what does it do? The first line specifies that we want to read some data from a file specified by the metadata parameter, and the second line actually reads that data using tab as delimiter, including a header. The map operator takes each entire row and subsets it to only two columns: the sample_id and treatment columns (discarding the other columns). This subset is stored as a tuple. The filter operator is then used to remove any tuples where the second entry (treatment) is not equal to the string "DMSO" (i.e. untreated cells, in this example). Finally, we only keep unique tuple values. Let’s say that this is the metadata we’re reading:

sample_id     dose    group     treatment
sample_1      0.1     control   DMSO
sample_1      1.0     control   DMSO
sample_1      2.0     control   DMSO
sample_2      0.1     case      vorinostat
sample_2      1.0     case      vorinostat
sample_2      2.0     case      vorinostat
sample_3      0.1     case      fulvestrant
sample_3      1.0     case      fulvestrant
sample_3      2.0     case      fulvestrant

Given the channel creation strategy above, we would get the following result:

[sample_2, vorinostat]
[sample_3, fulvestrant]

In this way, you can perform complex operations on input files or input metadata and send the resulting content to your downstream processes in a simple way. Composing data manipulations in Nextflow like this can be half the fun of writing the workflow. Check out Nextflow’s documentation on Channel operators to see the full list of channel operations at your disposal.

7.5 Using Groovy in processes

You don’t have to use Bash or external scripts inside your processes unless you want to: Nextflow is based on Groovy, which allows you to use both Groovy and Bash in the same process. For example, have a look at this:

process index_fasta {
    tag "${fasta_name}"

    input:
    tuple val(fasta), path(fasta_file)

    output:
    path("${fasta_name}.idx"), emit: fasta

    script:
    fasta_name = fasta.substring(0, fasta.lastIndexOf("."))
    """
    index --ref ${fasta_file},${fasta_name}
    """
}

Here we have some command index that, for whatever reason, requires both the path to a FASTA file and the name of that file without the .fasta extension. We can use Groovy in the script directive together with normal Bash, mixing and matching as we like. The first line of the script directive gets the name of the FASTA file without the extension by removing anything after the dot, while the second calls the index command like normal using bash.

7.6 The nf-core pipeline collection

You may have heard of the nf-core pipeline collection previously, which is a large, collaborative bioinformatics community dedicated to building, developing and maintaining Nextflow workflows. In fact, if you have sequenced data at e.g. the National Genomics Infrastructure (NGI), you can be sure that the data processing has been run using one of the nf-core pipelines! While the community only started in 2018 (with a Nature Biotechnology paper in 2020), it already has over 30 production-ready pipelines covering everything from genomics and transcriptomics to proteomics and metagenomics - and more are being developed all the time.

The nf-core pipelines all work in the same way, in that they have the same exact base for inputs, parameters and arguments, making them all highly similar to run. Since you’ve already learnt the basics of Nextflow in this course, you should now be able to also run the nf-core pipelines! It might be that you have a data type that you can analyse using one of the pipelines in nf-core, meaning you don’t need to do anything other than find out what parameters you should run it with.
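As an illustration of what this can look like in practice (a hedged example - the exact parameters vary between pipelines, so always check the documentation of the pipeline you want to run), executing the nf-core RNA-seq pipeline on its bundled test data could be as simple as:

nextflow run nf-core/rnaseq -profile test,docker --outdir results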

Each pipeline comes with extensive documentation and test datasets that you can use to practice on, and can be run on HPCs like Uppmax, on cloud services like AWS, or locally on your own computer. All pipelines support Conda, Docker and Apptainer, and you can additionally run specific versions of the pipelines, allowing for full reproducibility of your analyses. If you want to check nf-core out, simply head over to their list of pipelines and see what’s available! Who knows, you might even write your own nf-core pipeline in the future?