Working with Snakemake

How to create reproducible workflows and computational pipelines

Published

29-Nov-2024

1 Introduction

A workflow management system (WfMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. “a workflow”). Snakemake is a WfMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.

First of all the language you use to formulate your workflows is based on Python, which is a language with strong standing in academia. However, users are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, maybe using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment with which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.

Snakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It’s also a good fit for a scientific research setting, where the exact specifications of the final workflow aren’t always known at the beginning of a project.

Lastly, a WfMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent “paper trail” from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it’s done.

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/snakemake and activate your snakemake-env Conda environment.

2 The basics

In this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as inputs, a.txt and b.txt, and the purpose is to convert the text in the files to upper case and then to concatenate them.

Run the following shell commands. The first one will make an empty file named Snakefile, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.

touch Snakefile
echo "This is a.txt" > a.txt
echo "This is b.txt" > b.txt

Then open Snakefile in your favourite text editor. A Snakemake workflow is based on rules which take some file(s) as input, performs some type of operation on them, and generate some file(s) as outputs. Here is a very simple rule that produces a.upper.txt as an output, using a.txt as input. Copy this rule to your Snakefile and save it.

rule convert_to_upper_case:
    output:
        "a.upper.txt"
    input:
        "a.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """
Caution

Indentation is important in Snakefiles, so make sure that you have the correct number of spaces before input/output/shell and their respective subsections. The number of spaces per level doesn’t matter as long as you’re consistent. Here we use four, but you could just as well use two for a more compact look. Don’t use tabs (unless your editor automatically converts them to spaces).

Rules can be given names, here it’s convert_to_upper_case. While rule names are not strictly necessary we encourage you to use them and to make an effort to name your rules in a way that makes it easy to understand the purpose of the rule, as rule names are one of the main ways to interact with the workflow. The shell section (or directive) contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output} and to the input file by specifying {input}. If you’re not very familiar with Bash, this particular command can be read like “send the contents of a.txt to the program tr, which will convert all characters in the set [a-z] to the corresponding character in the set [A-Z], and then send the output to a.upper.txt”.

Now let’s run our first Snakemake workflow. When a workflow is executed Snakemake tries to generate a set of target files. Target files can be specified via the command line or, as you will see later, in several other ways. But importantly you will always have to specify what you want Snakemake to generate, in one way or another. The basic Snakemake command line looks like this:

snakemake <arguments> [target files ...]

where <arguments> are flags that modify the behaviour of Snakemake and [target files ...] are the files that you want Snakemake to generate.

Let’s ask Snakemake to generate the file a.upper.txt, this is our target file. For the arguments part we’ll use the -s flag to specify the file containing our rules and the -c flag to set the number of cores to use. It’s good practice to first run with the flag -n (or --dry-run), which will show what Snakemake plans to do without actually running anything. You can also use the flag -p, for showing the shell commands that it will execute. snakemake --help will show you all available flags.

The command thus becomes:

snakemake -s Snakefile -c 1 -n -p a.upper.txt

You should see something like this:

Building DAG of jobs...
Job stats:
job                      count
---------------------  -------
convert_to_upper_case        1
total                        1


[Wed Nov 13 21:59:01 2024]
rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 0
    reason: Missing output files: a.upper.txt
    resources: tmpdir=<TBD>


        tr [a-z] [A-Z] < a.txt > a.upper.txt
        
Job stats:
job                      count
---------------------  -------
convert_to_upper_case        1
total                        1

Reasons:
    (check individual jobs above for details)
    output files have to be generated:
        convert_to_upper_case

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

You can see that Snakemake plans to run one job: the rule convert_to_upper_case with a.txt as input and a.upper.txt as output. The reason for doing this is that it’s missing the file a.upper.txt. Now run the command without the -n flag and check that the contents of a.upper.txt is as expected.

We used the -s flag to point to the file containing the rules. In fact, Snakemake will look for a file named Snakefile in the current directory (or in a subdirectory called workflow) by default, which means we can omit this flag for this example. If we also omit the -c flag Snakemake will use all available cores, which won’t make a difference for this small workflow. The command can thus be simplified to:

snakemake -p a.upper.txt

Try running this simplified command. What do you see? It turns out that Snakemake only reruns jobs if there have been changes to either the input files, or the workflow itself. Here neither the a.txt input, nor the Snakefile have been changed so Snakemake determines that everything in the workflow is up to date and there is nothing to do.

What if we ask Snakemake to generate the file b.upper.txt? Run the following:

snakemake -n -p b.upper.txt

You will see that Snakemake complains that there is no rule to produce b.upper.txt. This is because we have only defined a rule for a.upper.txt. We could copy the rule to make a similar one for b.txt, but that would be a bit cumbersome. Here is where named wildcards come in; one of the most powerful features of Snakemake. Simply change the input and output directives from

    output:
        "a.upper.txt"
    input: 
        "a.txt"

to

    output:
        "{some_name}.upper.txt"
    input: 
        "{some_name}.txt"

Now try asking for b.upper.txt again.

Tada! What happens here is that Snakemake looks at all the rules it has available (actually only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case you can see that it says wildcards: some_name=b, which means that the some_name wildcard was assigned the value b. This was quite simple but for large workflows and multiple wildcards it can get much more complex. Named wildcards is what enables a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.

Remember that we’re trying to make a workflow that will convert two files to uppercase and concatenate them. It seems we have the first part of our workflow working, now it’s time to make the second rule for concatenating the outputs from convert_to_upper_case.

The syntax for concatenating two files in Bash is :

cat first_file.txt second_file.txt > output_file.txt`

The structure of this second rule will be similar to the first but here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:

input:
    firstFile="...",
    secondFile="..."
shell:
    """
    some_function {input.firstFile} {input.secondFile}
    """

Or with indexes like this:

input:
    "...",
    "..."
shell:
    """
    some_function {input[0]} {input[1]}
    """
Caution

If you have multiple inputs or outputs they need to be delimited with a comma (as seen above). This is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.

Now try to construct this second rule yourself. Add it to the Snakefile and name the rule concatenate_a_and_b. Use a.upper.txt and b.upper.txt as input and call the output c.txt. Check the solution below if you get stuck.

rule concatenate_a_and_b:
    output:
        "c.txt"
    input:
        "a.upper.txt",
        "b.upper.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """

Now try to generate the file c.txt with your workflow by running:

snakemake -p c.txt

Validate that the output looks as expected.

Good job! You’ve now created a simple Snakemake workflow that converts two files to uppercase and concatenates them.

Wouldn’t it be nice if our workflow could be used for any files, not just a.txt and b.txt? We can achieve this by generalizing the concatenate_a_and_b rule using named wildcards.

As we’ve mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way so that it contains information about which input files it should be based on. Try to figure out how to modify the concatenate_a_and_b rule yourself. If you’re stuck you can look at the solution below, but spend some time on it before you look. Also rename the concatenate_a_and_b rule to concatenate_files to reflect its new more general use.

rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """

We can now control which input files to use by the name of the file we ask Snakemake to generate. Run the workflow to create a file a_b.txt:

snakemake -p a_b.txt

You will see that Snakemake assigned the values a and b to the wildcards {first} and {second}, respectively. This is because the output of the concatenate_files rule is defined as "{first}_{second}.txt", which means that when we ask for a_b.txt, the {first} wildcard is assigned a and the {second} wildcard is assigned b. Try to generate the file b_a.txt and see what happens.

Tip

You can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much nicer if the filename reflects the file’s path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam.

The input to Snakemake rules have to be strings or lists of strings, however you don’t have to specify these strings directly in the input: section of rules. Instead, you can specify Python functions that return strings or lists of strings. This allows you to supply input to rules that can vary depending on the wildcards being used. We’ll get to why that’s useful in a sec, but first let’s put it to use for the conatenate_files rule. Because Snakemake is based on Python we can mix rule definitions with standard python code in the same file. Add a function just above the concatenate_files that looks like this:

def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    return files

This is the syntax to define a function in Python. The def concat_input(wildcards): line shows the name of the function (concat_input) and the variable passed to the function (the wildcards object). In the second line we add two items to a list that we call files and add the ‘.upper.txt’ suffix to each item. Finally, the function returns the list. Because the concatenate_files rule has two wildcards {first} and {second} we can access the actual strings in the wildcards object using wildcards.first and wildcards.second. When we ask for the file a_b.txt then wildcards.first == 'a' and wildcards.second == 'b'. This means that the files list returned by the function will be ['a.upper.txt', 'b.upper.txt']. To see for yourself you can add the following line to the function, just before the return statement: print (wildcards.first, wildcards.second, files). This way the wildcard values and the list will be printed to the terminal when you run Snakemake.

Now that we’ve defined the function to use as input, we can use it in the concatenate_files rule. Update the rule so that it looks like this:

rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        concat_input
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """

You see that the name of the function concat_input is added in place of the input strings. When using the wildcards object in input functions like this we have to call the function without any arguments (simply concat_input) and the function has to be defined to accept a single argument (here def concat_input(wildcards):). Let’s run the workflow with the updated rule. Remove the file a_b.txt or add -f to the Snakemake command to force a re-run:

snakemake -f a_b.txt

If you added the print statement to the function you should see the following printed to your terminal:

Building DAG of jobs...
a b ['a.upper.txt', 'b.upper.txt']

Followed by the rest of the workflow output.

There are a number of possible use-cases for input functions. For example, say that you have an experiment where you’ve sequenced three samples: sample1, sample2 and sample3 with the corresponding FASTQ files under data/ and you want to write a rule that outputs the statistics of all sequences within each sample. However, samples sample1 and sample2 have been sequenced with single-end technology while sample3 have paired-end reads. The single-end samples will have only one FASTQ file whereas the paired-end sample will have two (one for each sequenced end). Thus, depending on the name of the sample the input to the function will either be one file or two. With input functions we can write a generalized rule that can handle both types:

def fastq_input(wildcards):
    if wildcards.sample_id in ["sample1", "sample2"]:
        return "data/" + wildcards.sample_id + ".fastq.gz"
    else:
        return ["data/" + wildcards.sample_id + ".R1.fastq.gz",
                "data/" + wildcards.sample_id + ".R2.fastq.gz"]

rule fastq_stats:
    output:
        "{sample_id}.stats.txt"
    input:
        fastq_input
    shell:
        """
        seqtk comp {input} > {output}
        """

As you can see, the fastq_stats rule outputs one file {sample_id}.stats.txt and takes as input the value returned from the fastq_input function. In this function the sample id is evaluated and if it is either sample1 or sample2 (our single-end samples) then the function returns a single string which is the path to the FASTQ file for that sample. Otherwise, the function returns a list containing both the R1 and R2 files for the sample. In the shell: directive of the rule the seqtk comp command is run on the input and the output is sent to the output file.

Quick recap

In this section we’ve learned:

  • How a simple Snakemake rule looks.
  • How to define target files when executing a workflow.
  • How to use named wildcards for writing generic and flexible rules.
  • How to use input functions in rules

3 Visualising workflows

All that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate three types of graphs, one that shows how the rules are connected, one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected, and finally one that shows rules with their respective input/output files.

First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you’re having troubles displaying PNG files you could use SVG or JPG instead).

Caution

If you added the print(wildcards.first,wildcards.second,files) statement to the concat_input function in the previous section you need to remove that line before running the commands below.

snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png

This looks simple enough, the output from the rule convert_to_upper_case will be used as input to the rule concatenate_files.

For a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.

While saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.

The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).

snakemake --dag a_b.txt | dot -Tpng > jobgraph.png

The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this rule doesn’t need to be rerun in order to generate a_b.txt. Validate this by running snakemake -n a_b.txt and it should say that there is nothing to be done.

We’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. What do you think will happen if you run snakemake -n a_b.txt again?

Building DAG of jobs...
Job stats:
job                      count
---------------------  -------
concatenate_files            1
convert_to_upper_case        1
total                        2


[Fri Nov 15 10:27:22 2024]
rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 1
    reason: Updated input files: a.txt
    wildcards: some_name=a
    resources: tmpdir=<TBD>


[Fri Nov 15 10:27:22 2024]
rule concatenate_files:
    input: a.upper.txt, b.upper.txt
    output: a_b.txt
    jobid: 0
    reason: Input files updated by another job: a.upper.txt
    wildcards: first=a, second=b
    resources: tmpdir=<TBD>

Job stats:
job                      count
---------------------  -------
concatenate_files            1
convert_to_upper_case        1
total                        2

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        concatenate_files
    updated input files:
        convert_to_upper_case

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Were you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n and validate that a_b.txt contains the new text. Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.

We’ve seen that Snakemake keeps track of if files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What about if the rules themselves are changed? Since version 7.8.0 Snakemake keeps track of this automatically.

Modify the rule concatenate_files to also include which files were concatenated.

rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        concat_input
    shell:
        """
        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
        """
Note

It’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space delimited list of the files in input. This string is then sent to the program cat where it’s concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.

If you now run:

snakemake -n a_b.txt

you should see:

rule concatenate_files:
    input: a.upper.txt, b.upper.txt
    output: a_b.txt
    jobid: 0
    reason: Code has changed since last execution
    wildcards: first=a, second=b

Although no files involved in the workflow have been changed, Snakemake recognizes that the workflow code itself has been modified and this triggers a re-run.

Snakemake is aware of changes to five categories of such “rerun-triggers”:

  • “input” (changes to rule input files),
  • “params” (changes to the rule params section)
  • “software-env” (changes to Conda environment files specified by the conda: directive)
  • “code” (changes to code in the shell:, run:, script: and notebook: directives)
  • “mtime” (changes to the modification time of input files)

Using the flag --rerun-triggers you can specify which of these categories should trigger a re-run. For example, --rerun-triggers input will only re-run the workflow if the specified input files have changed and --rerun-triggers input params will re-run if either the input files or the params section of the rule has changed.

Prior to version 7.8.0, only changes to the modification time of input files would trigger automatic re-runs. To run Snakemake with this previous behaviour you can use the setting --rerun-triggers mtime at the command line. Try running: snakemake --rerun-triggers mtime -n a_b.txt and you should again see Nothing to be done (all requested files are present and up to date).

You can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:

snakemake a_b.txt -D > summary.tsv

The content of summary.tsv is shown in the table below:

summary.tsv
output_file date rule log-file(s) input-file(s) shellcmd status plan
a_b.txt Fri Nov 15 10:32:04 2024 concatenate_files a.upper.txt,b.upper.txt cat a.upper.txt b.upper.txt > a_b.txt rule implementation changed update pending
a.upper.txt Fri Nov 15 10:32:04 2024 convert_to_upper_case a.txt tr [a-z] [A-Z] < a.txt > a.upper.txt ok no update
b.upper.txt Thu Nov 14 00:04:06 2024 convert_to_upper_case b.txt tr [a-z] [A-Z] < b.txt > b.upper.txt ok no update

You can see in the second last column that the rule implementation for a_b.txt has changed. The last column shows if Snakemake plans to regenerate the files when it’s next executed. You can see that for the concatenate_files the plan is update pending because we generated the summary with the default behaviour of using all rerun-triggers.

You might wonder where Snakemake keeps track of all these things? It stores all information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with git.

Let’s take a look at the final type of graph that Snakemake can generate, this is done with the --filegraph flag and will show the rules and their input and output files.

snakemake --filegraph a_b.txt | dot -Tpng > filegraph.png

In terms of details/complexity the filegraph is an intermediate between the rulegraph (--rulegraph) and the jobgraph (--dag). It shows the rules to run with their respective input and output files (shown with wildcards), but doesn’t show the jobs that will be executed.

By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.

Quick recap

In this section we’ve learned:

  • How to use --rulegraph, --dag and --filegraph for visualizing the workflow.
  • How Snakemake reruns relevant parts of the workflow after there have been changes.
  • How Snakemake tracks changes to files and code in a workflow

4 The MRSA workflow

As you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacteria MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq of three strains, one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. The rest of the Snakemake tutorial will be spent improving and making this workflow more flexible!

Tip

This section will leave a little more up to you compared to the previous one. If you get stuck at some point the final workflow after all the modifications is available in tutorials/containers/Snakefile.

You are probably already in your snakemake-env environment, otherwise activate it (use conda info --envs if you are unsure).

Let’s start by generating the rule graph so that we get an overview of the workflow. Here we have to specify the file with the rules using the -s flag to Snakemake since the path to the file differs from the default.

snakemake -s snakefile_mrsa.smk --rulegraph | dot -T png > rulegraph_mrsa.png

There’s another difference in this command compared to the one we’ve used before, namely that we don’t define a target. In the toy example we used a_b.txt as a target, and the wildcards were resolved based on that. How come that we don’t need to do that here? This is because by default Snakemake uses the first rule in a workflow as its target, unless a target is specified on the command line. By convention, this rule is called all and serves as a rule for aggregating the main outputs of the workflow.

MRSA_rulegraph 0 all 1 generate_count_table 1->0 2 sort_bam 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession 4->3 9 fastqc 4->9 5 index_genome 5->3 6 get_genome_fasta 6->5 7 get_genome_gff3 7->1 8 multiqc 8->0 9->8

As you can see from the snakefile_mrsa.smk and the rulegraph, the input to the all rule is the output from the rules multiqc and generate_count_table. These rules in turn have their own inputs from other rules and by targeting the all rule Snakemake determines what needs to be run by resolving the graph from bottom to top.

Now take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.

Also generate the job graph in the same manner. Here you can see that three samples will be downloaded: SRR935090, SRR935091, and SRR935092. The original sample files contain tens of millions of reads but for the purpose of this course we have sub-sampled them to 100,000 reads per sample, so that they are easy to manage, and made them available at the SciLifeLab Data Repository. These FASTQ files will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on.

MRSA_jobgraph 0 all 1 generate_count_table 1->0 2 sort_bam 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 15 fastqc 4->15 5 index_genome 5->3 8 align_to_genome 5->8 11 align_to_genome 5->11 6 get_genome_fasta 6->5 7 sort_bam 7->1 8->7 9 get_SRA_by_accession sample_id: SRR935091 9->8 16 fastqc 9->16 10 sort_bam 10->1 11->10 12 get_SRA_by_accession sample_id: SRR935092 12->11 17 fastqc 12->17 13 get_genome_gff3 13->1 14 multiqc 14->0 15->14 16->14 17->14

Now let’s run the whole workflow. As before, specify -p to have Snakemake print the shell commands for rules and use -c to control how many cores to run with. If you don’t specify number of cores Snakemake will use all available cores on your machine. Let’s run the workflow with one core:

snakemake -s snakefile_mrsa.smk -c 1 -p

After everything is done, the workflow will have generated fastq files for each sample under data/ and reference genome files under data/ref. The results/ directory will contain the output from the different steps in the workflow, divided into subdirectories. Take some time to look through the structure.

Quick recap

In this section we’ve learned:

  • How the MRSA workflow looks.
  • How to run the MRSA workflow.
  • Which output files the MRSA workflow produces.

5 Parameters and configuration

In a typical bioinformatics project, considerable efforts are spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to change in the shell scripts themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params keyword.

rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    params:
        cutoff=2.5
    shell:
        """
        some_program --cutoff {params.cutoff} {input} {output}
        """

In the toy example above, the params directive has one keyword called cutoff which is set to 2.5 and this value can be accessed in the shell command with {params.cutoff}.

In the MRSA workflow most of the programs are run with default settings and don’t use the params: directive. However, the get_SRA_by_accession rule is an exception. Let’s take a look at this part of the workflow:

snakefile_mrsa.smk
def get_sample_url(wildcards):
    samples = {
        "SRR935090": "https://figshare.scilifelab.se/ndownloader/files/39539767",
        "SRR935091": "https://figshare.scilifelab.se/ndownloader/files/39539770",
        "SRR935092": "https://figshare.scilifelab.se/ndownloader/files/39539773"
    }
    return samples[wildcards.sample_id]

rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file
    """
    output:
        "data/{sample_id}.fastq.gz"
    params:
        url = get_sample_url
    shell:
        """
        wget -q -O - {params.url} | seqtk sample - 25000 | gzip -c > {output[0]}
        """

Here params has a keyword called url which is set to a function called get_sample_url defined just above the rule. The function returns a URL for the sample that the rule is supposed to download and this URL is then accessed in the shell: directive with {params.url}.

You may recognize this from The basics of this tutorial where we used input functions to generate strings and lists of strings for the input: section of a rule. Using a function to return values based on the wildcards also works for params:. Here sample_id is a wildcard which in this specific workflow can be either SRR935090, SRR935091, or SRR935092. The wildcards object is passed to the function get_sample_url and depending on what output the rule is supposed to generate, wildcards.sample_id will take the value of either of the three sample ids. The samples variable defined in the function is a Python dictionary that has the URLs for each sample_id hard-coded. This dictionary is used to convert the value of the sample_id wildcard to a URL, which is returned by the function. Finally, in the shell: directive we access the url parameter with {params.url}. (We could have written three separate rules to download the samples, but it’s easy to see how that can become impractical.)

Let’s add another parameter to the get_SRA_by_accession rule. As you can see in the shell command the FASTQ file downloaded by wget gets piped directly (the -O - part means send contents to STDOUT) to the seqtk sample command which reads from STDIN and outputs 25000 randomly sampled reads (out of the 100,000 contained in the example FASTQ file). Change in the rule to use the parameter max_reads instead and set the value to 20000. If you need help, click to show the solution below.

snakefile_mrsa.smk
rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file
    """
    output:
        "data/{sample_id}.fastq.gz"
    params:
        url = get_sample_url,
        max_reads = 20000
    shell:
        """
        wget -q -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
        """

Now run the workflow again with:

snakemake -s snakefile_mrsa.smk -p -c 1

Because there’s been changes to the get_SRA_by_accession rule this will trigger a re-run of the rule for all three accessions. In addition all downstream rules that depend on output from get_SRA_by_accession are re-run.

As you can see the parameter values we set in the params section don’t have to be static, they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config. Let’s modify get_SRA_by_accession in order to make use of this dictionary:

snakefile_mrsa.smk
rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file
    """
    output:
        "data/{sample_id}.fastq.gz"
    params:
        url = get_sample_url,
        max_reads = config["max_reads"]
    shell:
        """
        wget -q -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
        """

Note that Snakemake now expects there to be a key named max_reads in the config dictionary. If we don’t populate the dictionary somehow the dictionary will be empty so if you were to run the workflow now it would trigger a KeyError (try running snakemake -s snakefile_mrsa.smk -n to see for yourself). In order to populate the config dictionary with data for the workflow we could use the snakemake --config KEY=VALUE syntax directly from the command line (e.g. snakemake --config max_reads=20000 -s snakefile_mrsa.smk). However, from a reproducibility perspective, it’s not optimal to set parameters from the command line, since it’s difficult to keep track of which parameter values that were used.

A much better alternative is to use the --configfile FILE option to supply a configuration file to Snakemake. In this file we can collect all the project-specific settings, sample ids and so on. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in YAML format. Create the file below and save it as config.yml (here we change back to using 25000 reads):

config.yml
max_reads: 25000

If we now run Snakemake with --configfile config.yml, it will parse this file to form the config dictionary. Try it out by running:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1

Configuration settings that you specify with --config KEY=VALUE on the command line will overwrite settings specified in a config file. This can be useful if you want to run the workflow with a different setting for a specific parameter for testing purposes.

Tip

Rather than supplying the config file from the command line you could also add the line configfile: "config.yml" to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the file config.yml is not present.

Quick recap

In this section we’ve learned:

  • How to set parameter values with the params directive.
  • How to run Snakemake with the config variable and with a configuration file.

6 Logs

As you probably noticed it was difficult to follow how the workflow progressed since some rules printed a lot of output to the terminal. In some cases this also contained important information, such as statistics on the sequence alignments or genome indexing. This could be valuable for example if you later in the project get weird results and want to debug. It’s also important from a reproducibility perspective that the “paper trail” describing how the outputs were generated is saved. Luckily, Snakemake has a feature that can help with this. Just as we define input and output in a rule we can also define log.

rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    log:
        "logs/some_log.txt"
    shell:
        """
        some_program {input} {output} > {log}
        """

A log file is not different from any other output file, but it’s dealt with a little differently by Snakemake. For example, it’s shown in the file summary when using -D and unlike other output files it’s not deleted if jobs fail which of course is necessary for debugging purposes. It’s also a good way to clarify the purpose of the file. We probably don’t need to save logs for all the rules, only the ones with interesting output:

  • multiqc aggregates quality control data for all the samples into one html report, and the log contains information about which samples were aggregated.
  • get_genome_fasta and get_genome_gff3 would be good to log since they are dependent on downloading files from an external server.
  • index_genome outputs some statistics about the genome indexing.
  • generate_count_table outputs information about input files, settings and a summary of results.
  • align_to_genome outputs important statistics about the alignments. This is probably the most important log to save.

Now add a log file to some or all of the rules above. A good place to save them to would be results/logs/<rule_name>/. In order to avoid that multiple jobs write to the same files Snakemake requires that all output and log files contain the same wildcards, so be sure to include any wildcards used in the rule in the log name as well, e.g. {some_wildcard}.log.

You also have to specify in the shell section of each rule what you want the log to contain. Some of the programs we use send their log information to standard out, some to standard error and some let us specify a log file via a flag.

For example, bowtie2 used in the align_to_genome rule writes log info to standard error so we use 2>{log} to redirect this to the log file:

snakefile_mrsa.smk
rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    output:
        "results/bam/{sample_id,\\w+}.bam"
    input:
        "data/{sample_id}.fastq.gz",
        "results/bowtie2/NCTC8325.1.bt2",
        "results/bowtie2/NCTC8325.2.bt2",
        "results/bowtie2/NCTC8325.3.bt2",
        "results/bowtie2/NCTC8325.4.bt2",
        "results/bowtie2/NCTC8325.rev.1.bt2",
        "results/bowtie2/NCTC8325.rev.2.bt2"
    log:
        "results/logs/align_to_genome/{sample_id}.log"
    shell:
        """
        bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
        """

For programs used in the rules multiqc, get_genome_fasta, get_genome_gff3, index_genome and generate_count_table you can use the table below if you need help with how to redirect output to the log file.

Rule Program Redirect flag
multiqc multiqc 2>{log}
get_genome_fasta/get_genome_gff3 wget -o {log}
index_genome bowtie2-build >{log}
generate_count_table featureCounts 2>{log}

If you need help, click to show the solution below for the rules.

snakefile_mrsa.smk
rule multiqc:
    """
    Aggregate all FastQC reports into a MultiQC report.
    """
    output:
        html = "results/multiqc/multiqc.html",
        stats = "results/multiqc/multiqc_general_stats.txt"
    input:
        "results/fastqc/SRR935090_fastqc.zip",
        "results/fastqc/SRR935091_fastqc.zip",
        "results/fastqc/SRR935092_fastqc.zip"
    log:
        "results/logs/multiqc.log"
    shell:
        """
        # Run multiQC and keep the html report
        multiqc -n multiqc.html {input} 2>{log}
        mv multiqc.html {output.html}
        mv multiqc_data/multiqc_general_stats.txt {output.stats}

        # Remove the other directory that multiQC creates
        rm -rf multiqc_data
        """

rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/ref/NCTC8325.fa.gz"
    log:
        "results/logs/get_genome_fasta.log"
    shell:
        """
        wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}
        """

rule get_genome_gff3:
    """
    Retrieve annotation in gff3 format for a genome.
    """
    output:
        "data/ref/NCTC8325.gff3.gz"
    log:
        "results/logs/get_genome_gff3.log"
    shell:
        """
        wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz -O {output}
        """

rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        "results/bowtie2/NCTC8325.1.bt2",
        "results/bowtie2/NCTC8325.2.bt2",
        "results/bowtie2/NCTC8325.3.bt2",
        "results/bowtie2/NCTC8325.4.bt2",
        "results/bowtie2/NCTC8325.rev.1.bt2",
        "results/bowtie2/NCTC8325.rev.2.bt2"
    input:
        "data/ref/NCTC8325.fa.gz"
    log:
        "results/logs/index_genome.log"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}

        # Remove the temporary file
        rm tempfile
        """

rule generate_count_table:
    """
    Generate a count table using featureCounts.
    """
    output:
        "results/tables/counts.tsv"
    input:
        bams = ["results/bam/SRR935090.sorted.bam",
                "results/bam/SRR935091.sorted.bam",
                "results/bam/SRR935092.sorted.bam"],
        annotation = "data/ref/NCTC8325.gff3.gz"
    log:
        "results/logs/generate_count_table.log"
    shell:
        """
        featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
        """

After adding logfiles to the rules, rerun the whole workflow with:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1

Do the logs contain what they should? Note how much easier it is to follow the progression of the workflow when the rules write to logs instead of to the terminal.

Tip

If you have a rule with a shell directive in which several commands are run and you want to save stdout and stderr for all commands into the same log file you can add exec &>{log} as the first line of the shell directive.

If you run with -D (or -S for a simpler version) you will see that the summary table now also contains the log file for each of the files in the workflow.

Quick recap

In this section we’ve learned:

  • How to redirect output to log files with the log directive.

7 Temporary files

It’s not uncommon that workflows contain temporary files that should be kept for some time and then deleted once they are no longer needed. A typical case could be that some operation generates a file, which is then compressed to save space or indexed to make searching faster. There is then no need to save the original output file. Take a look at the job graph for our workflow again. The output from align_to_genome is a BAM file, which contains information about all the reads for a sample and where they map in the genome. For downstream processing we need this file to be sorted by genome coordinates. This is what the rule sort_bam is for. We therefore end up with both results/bam/{sample_id}.bam and results/bam/{sample_id}.sorted.bam as you can see if you list the contents of results/bam.

In Snakemake we can mark an output file as temporary like this:

rule some_rule:
    output:
        temp("results/some_output.txt")

The file will then be deleted as soon as all jobs where it’s an input have finished. Let’s use the temp notation for for the output of align_to_genome, modify the rule so that the output: directive looks like this:

snakefile_mrsa.smk
    output:
        temp("results/bam/{sample_id,\\w+}.bam")

We have to force a rerun of the rule for it to trigger, so we use --forcerun align_to_genome to do this. Run the workflow with:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun align_to_genome

Take a look at the output printed to your terminal. After the rule sort_bam you should see something like this:

Removing temporary output results/bam/SRR935092.bam.

If you list the contents of results/bam you should see that it now only contains the sorted BAM files.

Tip

If you want to prevent Snakemake from deleting files marked as temporary you can run with the --notemp command line flag which will cause Snakemake to ignore any temp() declarations. If you at any point want to delete remaining files that are declared as temporary in your workflow you can instead run with the --delete-temp-output which will delete all remaining temp files.

Snakemake has a number of additional options for marking files:

  • protected("..."): The output file should be write-protected. Typically used to protect files that require a huge amount of computational resources from being accidentally deleted.
  • ancient("..."): The timestamp of the input file is ignored and it’s always assumed to be older than any of the output files.
  • touch("..."): The output file should be “touched”, i.e. created or updated, when the rule has finished. Typically used as “flag files” to enforce some rule execution order without real file dependencies.
  • directory("..."): The output is a directory rather than a file.
Quick recap

In this section we’ve learned:

  • How to mark an output file as temporary for automatic removal.

8 The expand function

Now that we have a working workflow with logs and temporary files, let’s look at a way to make the code more readable.

Consider the rule align_to_genome below.

snakefile_mrsa.smk
rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        "data/{sample_id}.fastq.gz",
        "results/bowtie2/NCTC8325.1.bt2",
        "results/bowtie2/NCTC8325.2.bt2",
        "results/bowtie2/NCTC8325.3.bt2",
        "results/bowtie2/NCTC8325.4.bt2",
        "results/bowtie2/NCTC8325.rev.1.bt2",
        "results/bowtie2/NCTC8325.rev.2.bt2"
    log:
        "results/logs/align_to_genome/{sample_id}.log"
    shell:
        """
        bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
        """

Here we have seven inputs; the FASTQ file with the reads and six files with similar file names from the Bowtie2 genome indexing. Instead of writing all the filenames we can tidy this up by using the built-in function expand().

expand("results/bowtie2/NCTC8325.{substr}.bt2", 
    substr = ["1", "2", "3", "4", "rev.1", "rev.2"])

Here expand will take the string "results/bowtie2/NCTC8325.{substr}.bt2" and insert each of the values in the substr list: ["1", "2", "3", "4", "rev.1", "rev.2"] in place of {substr}. The result is a list of strings that is used as input to the rule. This is a very powerful feature that can save you a lot of typing and make your code more readable.

Now modify the rules index_genome and align_to_genome to use the expand() expression in the output and input, respectively. Click the solution below if you need help.

snakefile_mrsa.smk
rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        expand("results/bowtie2/NCTC8325.{substr}.bt2",
            substr=["1", "2", "3", "4","rev.1", "rev.2"])
    input:
        "data/ref/NCTC8325.fa.gz"
    log:
        "results/logs/index_genome/NCTC8325.log"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}

        # Remove the temporary file
        rm tempfile
        """

rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        "data/{sample_id}.fastq.gz",
        expand("results/bowtie2/NCTC8325.{substr}.bt2", 
                substr=["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        "results/logs/align_to_genome/{sample_id}.log"
    shell:
        """
        bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
        """

Great! Let’s put the expand function to more use!

The multiqc and generate_count_table rules take as input the fastqc zip files, and the sorted bam files for all three samples, respectively. See the excerpt below:

snakefile_mrsa.smk
rule multiqc:
    ...
    input:
        "results/fastqc/SRR935090_fastqc.zip",
        "results/fastqc/SRR935091_fastqc.zip",
        "results/fastqc/SRR935092_fastqc.zip"

...

rule generate_count_table:
    ...
    input:
        bams = ["results/bam/SRR935090.sorted.bam",
                "results/bam/SRR935091.sorted.bam",
                "results/bam/SRR935092.sorted.bam"],
...

Having the files for each sample written explicitly in these rules is how Snakemake knows to run the workflow for all three samples. However, this is also a potential source of errors since it’s easy to change in one place and forget to change in the other. Because we can use Python code “everywhere” let’s instead define a list of sample ids and put at the top of the Snakefile, just before the rule all:

snakefile_mrsa.smk
sample_ids = ["SRR935090", "SRR935091", "SRR935092"]

Now use expand() in multiqc and generate_count_table to use sample_ids for the sample ids. For the multiqc rule it could look like this:

    input:
        expand("results/fastqc/{sample_id}_fastqc.zip", 
            sample_id = sample_ids)

See if you can update the generate_count_table rule in the same manner. If you need help, click the solution below.

snakefile_mrsa.smk
rule generate_count_table:
    """
    Generate a count table using featureCounts.
    """
    output:
        "results/tables/counts.tsv"
    input:
        bams = expand("results/bam/{sample_id}.sorted.bam", sample_id = sample_ids),
        annotation = "data/ref/NCTC8325.gff3.gz"
    log:
        "results/logs/generate_count_table.log"
    shell:
        """
        featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
        """

Now run the workflow again with:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun align_to_genome index_genome multiqc generate_count_table

to force a rerun of the modified rules and make sure that the workflow still works as expected.

Quick recap

In this section we’ve learned:

  • How to use the expand() expression to make the code more readable.

9 Shadow rules

Take a look at the index_genome rule below:

snakefile_mrsa.smk
rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        expand("results/bowtie2/NCTC8325.{substr}.bt2",
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    input:
        "data/NCTC8325.fa.gz"
    log:
        "results/logs/index_genome/NCTC8325.log"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}

        # Remove the temporary file
        rm tempfile
        """

There is a temporary file here called tempfile which is the uncompressed version of the input fasta file. This file is created when the rule starts ( since Bowtie2 cannot use compressed files) but is not needed upon completion. There are a number of drawbacks with having files that aren’t explicitly part of the workflow as input/output files to rules:

  • Snakemake cannot clean up these files if the job fails, as it would do for normal output files.
  • If several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.
  • Sometimes we don’t know the names of all the files that a program can generate and we may only be interested in a few of them.

All of these issues can be dealt with by using the shadow option for a rule. The shadow option results in that each execution of the rule is run in an isolated temporary directory (located in .snakemake/shadow/ by default). There are a few options for shadow (for the full list of these options see the Snakemake docs).

The most simple is shadow: "minimal", which means that the rule is executed in an empty shadow directory that the input files to the rule have been symlinked into. For the rule below, that means that some_input.txt would be symlinked into a shadow directory under .snakemake/shadow/. The shell commands would generate the file some_other_junk_file in this shadow directory, as well as the output file results/some_output.txt. When the rule finishes, the shadow directory is removed which means we wouldn’t have to think about manually removing some_other_junk_file.

rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    shadow: "minimal"
    shell:
        """
        touch some_other_junk_file
        some_program {input} {output}
        """

Try this out for the rules multiqc and index_genome where we have to “manually” deal with files that aren’t tracked by Snakemake. Add shadow: "minimal" to these rules and also delete the shell commands that remove temporary files from those rules (the last lines in the shell directives that begin with rm), as they are no longer needed. Check the solution below if you need help.

snakefile_mrsa.smk
rule multiqc:
    """
    Aggregate all FastQC reports into a MultiQC report.
    """
    output:
        html = "results/multiqc/multiqc.html",
        stats = "results/multiqc/multiqc_general_stats.txt"
    input:
        expand("results/fastqc/{sample_id}_fastqc.zip", 
            sample_id = sample_ids)
    log:
        "results/logs/multiqc.log"
    shadow: "minimal"
    shell:
        """
        # Run multiQC and keep the html report
        multiqc -n multiqc.html {input} 2>{log}
        mv multiqc.html {output.html}
        mv multiqc_data/multiqc_general_stats.txt {output.stats}
        """

rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        expand("results/bowtie2/NCTC8325.{substr}.bt2",
            substr=["1", "2", "3", "4","rev.1", "rev.2"])
    input:
        "data/ref/NCTC8325.fa.gz"
    log:
        "results/logs/index_genome/NCTC8325.log"
    shadow: "minimal"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
        """

Now run the workflow again with:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1

and make sure that you see neither tempfile, nor the multiqc_data directory in your current working directory after the workflow has finished.

Tip

Some people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you’re running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix flag).

Quick recap

In this section we’ve learned:

  • How to use the shadow option to handle files that are not tracked by Snakemake.

10 Generalizing the MRSA workflow

You’ve now improved the MRSA workflow by making it more readable, adding logfiles, a config file etc. However, the workflow is not very general since it’s hard-coded to work with the three samples SRR935090, SRR935091 and SRR935092 and the genome NCTC8325. If someone would want to use the workflow with other samples or genomes they would have to change the code in the Snakefile, which is not optimal. Instead it’s a good idea to separate project-specific parameters from the actual implementation of the workflow. This allows anyone using the workflow to modify its behaviour without changing the underlying code, making the workflow more general.

In this section we will go through how to make the workflow more general and flexible by moving the sample ids and URLs to a configuration file and turning the reference genome into a wildcard.

10.1 Specifying samples in a configuration file

Remove the sample_ids = ["SRR935090", "SRR935091", "SRR935092"] line we added to the top of snakefile_mrsa.smk and add the following to config.yml:

config.yml
samples:
  SRR935090: "https://figshare.scilifelab.se/ndownloader/files/39539767"
  SRR935091: "https://figshare.scilifelab.se/ndownloader/files/39539770"
  SRR935092: "https://figshare.scilifelab.se/ndownloader/files/39539773"

Now when passing --configfile config.ymlon the command line config["samples"] will be a dictionary where the keys are the sample ids and the values are the URLs to the FASTQ files.

Next, modify the get_sample_url function to use this dictionary instead of the hard-coded values:

snakefile_mrsa.smk
def get_sample_url(wildcards):
    return config["samples"][wildcards.sample_id]

Now change the multiqc and generate_count_table rules that use the expand function in the input directive to use config["samples"].keys() to expand the sample_id wildcard. This is how it should look for the multiqc rule:

    input:
        expand("results/fastqc/{sample_id}_fastqc.zip", 
            sample_id = config["samples"].keys()

Try to change the generate_count_table rule in the same way. Check the solution below if you need help.

snakefile_mrsa.smk
rule generate_count_table:
    """
    Generate a count table using featureCounts.
    """
    output:
        "results/tables/counts.tsv"
    input:
        bams = expand("results/bam/{sample_id}.sorted.bam", 
            sample_id = config["samples"].keys()),
        annotation = "data/ref/NCTC8325.gff3.gz"
    log:
        "results/logs/generate_count_table.log"
    shell:
        """
        featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
        """

To see that your new implementation works you can do a forced rerun of the modified rules:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun generate_count_table multiqc get_SRA_by_accession
Tip

If you were to scale up this workflow with more samples it could become impractical to have to define the sample ids and URLs by hand in the config file. A tip then is to have a separate file where samples are listed in one column and the URLs (or file paths) in another column. With a few lines of python code you could then read that list at the start of the workflow and add each sample to the config dictionary. See below for an example of how to do this.

10.2 Generalizing the genome reference

Now we’ll take a look at how to generalize the genome reference used in the workflow.

In the get_genome_fasta and get_genome_gff3 rules we have hard-coded FTP paths to the FASTA and GFF annotation file for the genome NCTC8325. Let’s move this information to the configuration file, and also add information for another genome, ST398. Add the following to config.yml:

config.yml
genomes:
  NCTC8325:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz
  ST398:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection//staphylococcus_aureus_subsp_aureus_st398/dna/Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.dna.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_st398//Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.37.gff3.gz

As you can see this is very similar to what we did with samples, it’s just that we have one more nested level. Now to access the FTP path to the FASTA file for genome id NCTC8325 we can use config["genomes"]["NCTC8325"]["fasta"].

Let’s now look at how to do the mapping from genome id to FASTA path in the rule get_genome_fasta. This is how the rule currently looks (if you have added the log section as previously described).

snakefile_mrsa.smk
rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/raw_external/NCTC8325.fa.gz"
    log:
        "results/logs/get_genome_fasta/NCTC8325.log"
    shell:
        """
        wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}
        """

Replace the hard-coded NCTC8325 with a wildcard {genome_id} in both the input and log directives. We now need to supply the remote paths to the FASTA file for a given genome id. Because we’ve added this information to the config file we just need to pass it to the rule in some way, and just like in the get_SRA_by_accession rule we’ll use a function to do the job:

snakemfile_mrsa.smk
def get_fasta_path(wildcards):
    return config["genomes"][wildcards.genome_id]["fasta"]

rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/ref/{genome_id}.fa.gz"
    log:
        "results/logs/get_genome_fasta/{genome_id}.log"
    params:
        fasta_path = get_fasta_path
    shell:
        """
        wget -o {log} {params.fasta_path} -O {output}
        """

Now change the get_genome_gff3 rule in a similar manner. Click to see the solution below if you’re having trouble.

def get_gff_path(wildcards):
    return config["genomes"][wildcards.genome_id]["gff3"]

rule get_genome_gff3:
    """
    Retrieve annotation in gff3 format for a genome.
    """
    output:
        "data/ref/{genome_id}.gff3.gz"
    log:
        "results/logs/get_genome_gff3/{genome_id}.log"
    params:
        gff3_path = get_gff_path
    shell:
        """
        wget -o {log} {params.gff3_path} -O {output}
        """

Also replace NCTC8325 with {genome_id} in the log, input and output of the index_genome rule. Here you will run into a complication if you have followed the previous instructions and use the expand() expression. We want the list of output files be ["results/bowtie2/{genome_id}.1.bt2", "results/bowtie2/{genome_id}.2.bt2", ...], i.e. only expanding the substr wildcard in the Bowtie2 index. To prevent Snakemake from trying to expand the genome_id wildcard we have to “mask” it with double curly brackets: {genome_id}.

In addition, we need to replace the hard-coded results/bowtie2/NCTC8325 in the shell directive of the rule with the genome id wildcard. Inside the shell directive the wildcard object is accessed with this syntax: {wildcards.genome_id}, so the Bowtie2-build command should be:

bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}

The final rule should look like this:

snakefile_mrsa.smk
rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        expand("results/bowtie2/{{genome_id}}.{substr}.bt2",
            substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    input:
        "data/ref/{genome_id}.fa.gz"
    log:
        "results/logs/index_genome/{genome_id}.log"
    shadow: "minimal"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} >{log}
        """

Good job! The rules get_genome_fasta, get_genome_gff3 and index_genome can now download and index any genome as long as we provide valid links in the config file.

Only two rules remain where the NCTC8325 genome id is hard-coded: align_to_genome and generate_count_table. We will use these rules to define which genome id we actually want to use when running the workflow. First add a parameter to config.yml called "genome_id" and set it to NCTC8325:

config.yml
genome_id: "NCTC8325"

Now we can resolve the genome_id wildcard from the config. For align_to_genome we change the input directive to:

    input:
        "data/{sample_id}.fastq.gz",
        expand("results/bowtie2/{genome_id}.{substr}.bt2",
            genome_id = config["genome_id"],
            substr = ["1", "2", "3", "4", "rev.1", "rev.2"])

Here the substr wildcard gets expanded from a list while genome_id gets expanded from the config dictionary. Also change the hard-coded NCTC8325 in the shell: directive of align_to_genome so that the genome_id is inserted directly from the config, like this:

shell:
    """
    bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}
    """

The final rule should look like this:

snakefile_mrsa.smk
rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        "data/{sample_id}.fastq.gz",
        expand("results/bowtie2/{genome_id}.{substr}.bt2",
           genome_id = config["genome_id"],
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        "results/logs/align_to_genome/{sample_id}.log"
    shell:
        """
        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}
        """

Now let’s change the hard-coded genome id in the generate_count_table input in a similar manner. The final rule should look like this:

snakefile_mrsa.smk
rule generate_count_table:
    """
    Generate a count table using featureCounts.
    """
    output:
        "results/tables/counts.tsv",
        "results/tables/counts.tsv.summary"
    input:
        bams=expand("results/bam/{sample_id}.sorted.bam",
            sample_id = config["samples"].keys()),
        annotation=expand("data/ref/{genome_id}.gff3.gz",
            genome_id = config["genome_id"])
    log:
        "results/logs/generate_count_table.log"
    shell:
        """
        featureCounts -t gene -g gene_id -a {input.annotation} -o {output[0]} {input.bams} 2>{log}
        """

To make sure that the workflow works as expected, run it with:

snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1

Try changing the genome_id in the config file to ST398 and rerun the workflow. You should now see that the workflow downloads the genome files for ST398 and aligns the samples to that genome instead.

In general, we want the rules as far downstream as possible in the workflow to be the ones that determine what the wildcards should resolve to. In our case align_to_genome and generate_count_table. You can think of it like the rule that really “needs” the file asks for it, and then it’s up to Snakemake to determine how it can use all the available rules to generate it. Here the align_to_genome rule says “I need this genome index to align my sample to” and then it’s up to Snakemake to determine how to download and build the index.

Summary

Well done! You now have a complete Snakemake workflow with a number of excellent features:

  • A general RNA-seq pipeline which can easily be reused between projects, thanks to clear separation between code and settings.
  • Great traceability due to logs and summary tables.
  • Clearly defined the environment for the workflow using Conda.
  • The workflow is neat and free from temporary files due to using temp() and shadow.
  • A logical directory structure which makes it easy to separate data and results of different software packages.
  • A project set up in a way that makes it very easy to distribute and reproduce either via Git, Snakemake’s --archive option or a Docker image.

11 Extra material

If you want to read more about Snakemake in general you can find several resources here:

11.1 Reading samples from a file instead of hard-coding them

So far we’ve specified the samples to use in the workflow either as a hard-coded list in the Snakefile, or as a list in the configuration file. This is of course impractical for large real-world examples. Here we’ll just quickly show how you could supply the samples to the MRSA workflow via a tab-separated file. For example you could create a file called samples.csv with the following content:

samples.csv
SRR935090,https://figshare.scilifelab.se/ndownloader/files/39539767
SRR935091,https://figshare.scilifelab.se/ndownloader/files/39539770
SRR935092,https://figshare.scilifelab.se/ndownloader/files/39539773

The first column has the sample id and the second column has the url to the fastq file. Now in order to read this into the workflow we need to use a few lines of python code. Since you can mix python code with rule definitions in Snakemake we’ll just add the following lines to the top of the Snakefile:

snakefile_mrsa.smk
# define an empty 'samples' dictionary
samples = {}
# read the sample list file and populate the dictionary
with open("samples.csv", "r") as fhin:
    for line in fhin:
        # strip the newline character from the end of the line
        # then split by tab character to get the sample id and url
        sample_id, url = line.strip().split(",")
        # store the url in the dictionary with the sample id as key
        samples[sample_id] = url

Now we can use the samples dictionary in the workflow. For example, to get the url for SRR935090 we can use samples["SRR935090"].

For example, the get_sample_url function can now be written as:

snakefile_mrsa.smk
def get_sample_url(wildcards):
    return samples[wildcards.sample_id]

We can also use the samples dictionary in expand(), for example in the multiqc rule:

    input:
        expand("results/fastqc/{sample_id}_fastqc.zip", 
            sample_id = samples.keys())

Now this depends on there being a samples.csv file in the working directory. To make this a configurable parameter we can add it to the config file:

sample_list: "samples.csv"

and update the code for populating the samples dictionary:

# define an empty 'samples' dictionary
samples = {}
# read the sample list file and populate the dictionary
with open(config["sample_list"], "r") as fhin:
    for line in fhin:
        # strip the newline character from the end of the line
        # then split by tab character to get the sample id and url
        sample_id, url = line.strip().split(",")
        # store the url in the dictionary with the sample id as key
        samples[sample_id] = url

This way, anyone can take our Snakefile and just update the path to their own sample_list using the config file.

11.2 Using containers in Snakemake

Snakemake also supports defining an Apptainer or Docker container for each rule (you will have time to work on the Containers tutorial later during the course). Analogous to using a rule-specific Conda environment, specify container: "docker://some-account/rule-specific-image" in the rule definition. Instead of a link to a container image, it is also possible to provide the path to an image file somwhere on your filesystem. When executing Snakemake, add the --software-deployment-method apptainer (or the shorthand --sdm apptainer) flag to the command line. For the given rule, an Apptainer container will then be created from the image or file that is provided in the rule definition on the fly by Snakemake and the rule will be run in this container.

You can find pre-made Apptainer or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.

Here is an example for a rule and its execution:

rule align_to_genome:
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        fastq = "data/{sample_id}.fastq.gz",
        index = expand("results/bowtie2/{genome_id}.{substr}.bt2",
            genome_id=config["genome_id"],
            substr=["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        expand("results/logs/align_to_genome/{{sample_id}}_{genome_id}.log",
            genome_id = config["genome_id"])
    container: "docker://quay.io/biocontainers/bowtie2:2.5.0--py310h8d7afc0_0"
    shell:
        """
        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input.fastq} > {output} 2>{log}
        """

Start your Snakemake workflow with the following command:

snakemake --configfile config.yml  -s snakefile_mrsa.smk -c 1 --software-deployment-method apptainer

Feel free to modify the MRSA workflow according to this example. As Apptainer is a container software that was developed for HPC clusters with only limited support on e.g. Macs, it might not work to run your updated Snakemake workflow with Apptainer locally on your computer. In the next section we explain how you can run Snakemake workflows on UPPMAX where Apptainer is pre-installed.

11.3 Running Snakemake workflows on HPC clusters

If you need to run a Snakemake workflow on a high-performance computing (HPC) cluster you have a wide range of options at your disposal. Via the plugin catalogue you can find plugins that will add support for various HPC schedulers to Snakemake.

Here we will focus on how to run Snakemake workflows on clusters with SLURM, a workload manager commonly used on HPC clusters in Sweden such as Rackham, Tetralith and Dardel.

Tip

When running on remote clusters we highly recommend to use a session manager like tmux or screen so that you can run your workflow in a session in the background while doing other things on the cluster or even logging out of the cluster.

11.3.1 Option 1: Run the entire workflow as a single job

For short workflows with only a few rules that need the same compute resources in terms of CPU (cores) and memory, you can submit the entire workflow as a job directly to the SLURM scheduler, or start an interactive job (in your tmux or screen session), and run your Snakemake workflow as you would on your local machine. Make sure to give your job enough time to finish running all rules of your Snakemake workflow.

If you choose this option, you don’t need to install anything from the plugin catalogue. However, your workflow may not run as efficiently as it could if you were to add SLURM support in Snakemake.

11.3.2 Option 2: Use built-in SLURM support

For workflows with long run times and/or where each rule requires different compute resources, Snakemake comes with built in functionality for interacting with the SLURM workload manager and send each rule as a job to the SLURM queue and to track the status of each job.

In this case, you can start the workflow on the login node and let it run there until all jobs have finished. Given that workflows often consist of many rules, some of which may be highly resource demanding, this is the option we recommend when running most Snakemake workflows on HPC clusters.

To add SLURM support to Snakemake you first need to install the SLURM plugin from the plugin catalogue. This can be done with conda:

conda install -c conda-forge snakemake-executor-plugin-slurm

Once installed, adding the --executor slurm flag to your Snakemake command line call will enable the plugin. You also need to specify how many jobs Snakemake can submit to the SLURM queue at the same time with the -j flag. For example, to allow up to 100 jobs to be put into the queue at any given time, you would run Snakemake with the following command:

snakemake --executor slurm -j 100 <other flags>

11.4 Specifying resources for SLURM

Depending on the cluster you are using, you will need to specify some resource requirements for the rules in your workflow, such as the number of CPUs, memory, runtime and account id. This can be done either:

  1. directly on the command line with the --default-resources flag which sets default resource settings for all rules
  2. in the rule definition of your workflow using the resources: directive, or
  3. in a configuration profile, a folder with a config.yaml file that contains the resource settings.

You can also use a combination of these methods. For example, the SLURM account id (_e.g. naiss-2023-01-001), which will most likely be the same for all rules, can be set with --default-resources:

snakemake --executor slurm -j 100 --default-resources slurm_account=naiss-2023-01-001

Rule-specific resources such as runtime, memory and number of CPUs can be set in the rule definition, for example:

rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = 60,
        mem_mb = 16000,
        tasks = 4
    shell:
        """
        uname -a > {output}
        """

This rule uses the standard resource runtime to set the maximum allowed time (in minutes) for the rule, sets the memory requirement with mem_mb and the number of requested CPUs with tasks. In this example the rule will have a time limit of 60 minutes, will require 16G of RAM and 4 CPUs.

Some clusters also require you to specify the partition you want to run your job on. The partition name will differ between clusters, for example the Rackham cluster uses core and node partitions, while Dardel uses e.g. shared and main. See the documentation for the cluster you are using for more information.

The partition can be set with the slurm_partition resource, for example like so:

rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = 60,
        mem_mb = 16000,
        tasks = 4,
        slurm_partition: "shared"
    shell:
        """
        uname -a > {output}
        """

To make it easy to adapt your workflow to different compute clusters it is recommended to define resource settings in a configuration profile. A configuration profile is a folder with a config.yaml file that contains values for Snakemake command line arguments, allowing you to modify the behavior of Snakemake without changing the workflow code. For example, you could create a dardel folder (e.g. in the root of your workflow) with a config.yaml file that contains the following:

dardel/config.yaml
executor: "slurm"
jobs: 100
default-resources:
  slurm_account: "naiss-2023-01-001"
  slurm_partition: "shared"
  mem_mb: 16000
  tasks: 1
  runtime: 60

This yaml-formatted file contains Snakemake command line arguments that will be used when running the workflow. You can then run Snakemake with the --profile flag pointing to the folder containing the config.yaml file:

snakemake --profile dardel

This greatly simplifies running the workflow on different clusters, and makes the command line call much more succinct.

To set rule-specific resources in the configuration profile, you can add a set_resources: section to the config.yaml file:

dardel/config.yaml
executor: "slurm"
jobs: 100
default-resources:
  slurm_account: "naiss-2023-01-001"
  slurm_partition: "shared"
  mem_mb: 16000
  tasks: 4
  runtime: 60
set_resources:
  index_genome:
    runtime: 240
    mem_mb: 32000
    tasks: 8
  align_to_genome:
    runtime: 120
    mem_mb: 24000
    tasks: 6

In this example, the index_genome rule will have a runtime of 240 minutes, will require 32G of RAM and 8 CPUs, while the align_to_genome rule will have a runtime of 120 minutes, will require 24G of RAM and 6 CPUs. Both rules will use the slurm_account and slurm_partition settings from the default_resources section, unless overridden in the rule-specific settings.

You can still define resources in the rule definition, but the values in the configuration profile will take precedence.

Now, when you run your Snakemake workflow with:

snakemake --profile dardel

Snakemake will submit each job to the SLURM queue and inform you about both the local jobid and the SLURM jobid by writing something similar to this to your terminal:

Job 0 has been submitted with SLURM jobid 37099380 (log: .snakemake/slurm_logs/rule_name/37099380.log).

In this example the log output from the job will be in .snakemake/slurm_logs/rule_name/37099380.log.

You can read more details about running Snakemake on compute clusters in the Snakemake docs.