1 Introduction
A workflow management system (WfMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. “a workflow”). Snakemake is a WfMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.
First of all, the language you use to formulate your workflows is based on Python, which is a language with a strong standing in academia. However, users are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, maybe using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment in which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.
Snakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It’s also a good fit for a scientific research setting, where the exact specifications of the final workflow aren’t always known at the beginning of a project.
Lastly, a WfMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent “paper trail” from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it’s done.
This tutorial depends on files from the course GitHub repo. Take a look at the setup instructions if you haven’t done so already, then open up a terminal, go to workshop-reproducible-research/tutorials/snakemake and activate your snakemake-env Conda environment.
2 The basics
In this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as inputs, a.txt
and b.txt
, and the purpose is to convert the text in the files to upper case and then to concatenate them.
Run the following shell commands. The first one will make an empty file named Snakefile
, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.
touch Snakefile
echo "This is a.txt" > a.txt
echo "This is b.txt" > b.txt
Then open Snakefile
in your favourite text editor. A Snakemake workflow is based on rules, which take some file(s) as input, perform some type of operation on them, and generate some file(s) as output. Here is a very simple rule that produces a.upper.txt
as an output, using a.txt
as input. Copy this rule to your Snakefile
and save it.
rule convert_to_upper_case:
    output:
        "a.upper.txt"
    input:
        "a.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """
Indentation is important in Snakefiles, so make sure that you have the correct number of spaces before input
/output
/shell
and their respective subsections. The number of spaces per level doesn’t matter as long as you’re consistent. Here we use four, but you could just as well use two for a more compact look. Don’t use tabs (unless your editor automatically converts them to spaces).
Rules can be given names, here it’s convert_to_upper_case
. While rule names are not strictly necessary we encourage you to use them and to make an effort to name your rules in a way that makes it easy to understand the purpose of the rule, as rule names are one of the main ways to interact with the workflow. The shell
section (or directive) contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output}
and to the input file by specifying {input}
. If you’re not very familiar with Bash, this particular command can be read like “send the contents of a.txt
to the program tr
, which will convert all characters in the set [a-z]
to the corresponding character in the set [A-Z]
, and then send the output to a.upper.txt
”.
Now let’s run our first Snakemake workflow. When a workflow is executed Snakemake tries to generate a set of target files. Target files can be specified via the command line or, as you will see later, in several other ways. But importantly you will always have to specify what you want Snakemake to generate, in one way or another. The basic Snakemake command line looks like this:
snakemake <arguments> [target files ...]
where <arguments>
are flags that modify the behaviour of Snakemake and [target files ...]
are the files that you want Snakemake to generate.
Let’s ask Snakemake to generate the file a.upper.txt
, this is our target file. For the arguments part we’ll use the -s
flag to specify the file containing our rules and the -c
flag to set the number of cores to use. It’s good practice to first run with the flag -n
(or --dry-run
), which will show what Snakemake plans to do without actually running anything. You can also use the flag -p
, for showing the shell commands that it will execute. snakemake --help
will show you all available flags.
The command thus becomes:
snakemake -s Snakefile -c 1 -n -p a.upper.txt
You should see something like this:
Building DAG of jobs...
Job stats:
job count
--------------------- -------
convert_to_upper_case 1
total 1
[Wed Nov 13 21:59:01 2024]
rule convert_to_upper_case:
input: a.txt
output: a.upper.txt
jobid: 0
reason: Missing output files: a.upper.txt
resources: tmpdir=<TBD>
tr [a-z] [A-Z] < a.txt > a.upper.txt
Job stats:
job count
--------------------- -------
convert_to_upper_case 1
total 1
Reasons:
(check individual jobs above for details)
output files have to be generated:
convert_to_upper_case
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
You can see that Snakemake plans to run one job: the rule convert_to_upper_case
with a.txt
as input and a.upper.txt
as output. The reason for doing this is that it’s missing the file a.upper.txt
. Now run the command without the -n
flag and check that the contents of a.upper.txt
is as expected.
We used the -s
flag to point to the file containing the rules. In fact, Snakemake will look for a file named Snakefile
in the current directory (or in a subdirectory called workflow
) by default, which means we can omit this flag for this example. If we also omit the -c
flag Snakemake will use all available cores, which won’t make a difference for this small workflow. The command can thus be simplified to:
snakemake -p a.upper.txt
Try running this simplified command. What do you see? It turns out that Snakemake only reruns jobs if there have been changes to either the input files, or the workflow itself. Here neither the a.txt
input, nor the Snakefile
have been changed so Snakemake determines that everything in the workflow is up to date and there is nothing to do.
What if we ask Snakemake to generate the file b.upper.txt
? Run the following:
snakemake -n -p b.upper.txt
You will see that Snakemake complains that there is no rule to produce b.upper.txt
. This is because we have only defined a rule for a.upper.txt
. We could copy the rule to make a similar one for b.txt
, but that would be a bit cumbersome. Here is where named wildcards come in; one of the most powerful features of Snakemake. Simply change the input and output directives from
output:"a.upper.txt"
input:
"a.txt"
to
output:"{some_name}.upper.txt"
input:
"{some_name}.txt"
Now try asking for b.upper.txt
again.
Tada! What happens here is that Snakemake looks at all the rules it has available (actually only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case you can see that it says wildcards: some_name=b
, which means that the some_name
wildcard was assigned the value b
. This was quite simple, but for large workflows with multiple wildcards it can get much more complex. Named wildcards are what enable a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.
Remember that we’re trying to make a workflow that will convert two files to uppercase and concatenate them. It seems we have the first part of our workflow working, now it’s time to make the second rule for concatenating the outputs from convert_to_upper_case
.
The syntax for concatenating two files in Bash is:
cat first_file.txt second_file.txt > output_file.txt
The structure of this second rule will be similar to the first but here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:
input:
    firstFile = "...",
    secondFile = "..."
shell:
    """
    some_function {input.firstFile} {input.secondFile}
    """
Or with indexes like this:
input:
    "...",
    "..."
shell:
    """
    some_function {input[0]} {input[1]}
    """
If you have multiple inputs or outputs they need to be delimited with a comma (as seen above). Forgetting the comma is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.
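To see why this can be confusing, note that Python silently joins adjacent string literals. The snippet below is only an illustrative sketch, not part of the workflow: if the comma between the two input files is left out, Snakemake sees a single, nonsensical file name and will typically report it as missing.

# Hypothetical illustration: Python joins adjacent string literals, so an
# input section written without the comma, like
#
#     input:
#         "a.upper.txt"
#         "b.upper.txt"
#
# is equivalent to the single input file "a.upper.txtb.upper.txt".
print("a.upper.txt" "b.upper.txt")  # prints: a.upper.txtb.upper.txt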
Now try to construct this second rule yourself. Add it to the Snakefile
and name the rule concatenate_a_and_b
. Use a.upper.txt
and b.upper.txt
as input and call the output c.txt
. Check the solution below if you get stuck.
rule concatenate_a_and_b:
    output:
        "c.txt"
    input:
        "a.upper.txt",
        "b.upper.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """
Now try to generate the file c.txt
with your workflow by running:
snakemake -p c.txt
Validate that the output looks as expected.
Good job! You’ve now created a simple Snakemake workflow that converts two files to uppercase and concatenates them.
Wouldn’t it be nice if our workflow could be used for any files, not just a.txt
and b.txt
? We can achieve this by generalizing the concatenate_a_and_b
rule using named wildcards.
As we’ve mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way so that it contains information about which input files it should be based on. Try to figure out how to modify the concatenate_a_and_b
rule yourself. If you’re stuck you can look at the solution below, but spend some time on it before you look. Also rename the concatenate_a_and_b
rule to concatenate_files
to reflect its new more general use.
rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """
We can now control which input files to use by the name of the file we ask Snakemake to generate. Run the workflow to create a file a_b.txt
:
snakemake -p a_b.txt
You will see that Snakemake assigned the values a
and b
to the wildcards {first}
and {second}
, respectively. This is because the output of the concatenate_files
rule is defined as "{first}_{second}.txt"
, which means that when we ask for a_b.txt
, the {first}
wildcard is assigned a
and the {second}
wildcard is assigned b
. Try to generate the file b_a.txt
and see what happens.
You can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much more nicely if the filename reflects the file’s path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam
.
The input to Snakemake rules has to be strings or lists of strings, but you don’t have to specify these strings directly in the input: section of rules. Instead, you can specify Python functions that return strings or lists of strings. This allows you to supply input to rules that can vary depending on the wildcards being used. We’ll get to why that’s useful in a bit, but first let’s put it to use for the concatenate_files rule. Because Snakemake is based on Python we can mix rule definitions with standard Python code in the same file. Add a function just above the concatenate_files rule that looks like this:
def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    return files
This is the syntax to define a function in Python. The def concat_input(wildcards):
line shows the name of the function (concat_input
) and the variable passed to the function (the wildcards
object). In the second line we add two items to a list that we call files
and add the ‘.upper.txt’ suffix to each item. Finally, the function returns the list. Because the concatenate_files
rule has two wildcards {first}
and {second}
we can access the actual strings in the wildcards
object using wildcards.first
and wildcards.second
. When we ask for the file a_b.txt
then wildcards.first == 'a'
and wildcards.second == 'b'
. This means that the files
list returned by the function will be ['a.upper.txt', 'b.upper.txt']
. To see for yourself you can add the following line to the function, just before the return statement: print(wildcards.first, wildcards.second, files). This way the wildcard values and the list will be printed to the terminal when you run Snakemake.
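With the print call added, the function could look like the sketch below; the extra line is only for inspection and can be removed again afterwards.

def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    # Only for debugging: show the wildcard values and the resulting input list
    print(wildcards.first, wildcards.second, files)
    return files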
Now that we’ve defined the function to use as input, we can use it in the concatenate_files
rule. Update the rule so that it looks like this:
rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        concat_input
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """
You see that the name of the function concat_input
is added in place of the input strings. When using the wildcards
object in input functions like this we have to call the function without any arguments (simply concat_input
) and the function has to be defined to accept a single argument (here def concat_input(wildcards):
). Let’s run the workflow with the updated rule. Remove the file a_b.txt
or add -f
to the Snakemake command to force a re-run:
snakemake -f a_b.txt
If you added the print
statement to the function you should see the following printed to your terminal:
Building DAG of jobs...
a b ['a.upper.txt', 'b.upper.txt']
Followed by the rest of the workflow output.
There are a number of possible use-cases for input functions. For example, say that you have an experiment where you’ve sequenced three samples: sample1
, sample2
and sample3
with the corresponding FASTQ files under data/
and you want to write a rule that outputs the statistics of all sequences within each sample. However, samples sample1
and sample2
have been sequenced with single-end technology while sample3
has paired-end reads. The single-end samples will have only one FASTQ file whereas the paired-end sample will have two (one for each sequenced end). Thus, depending on the name of the sample, the input to the rule will be either one file or two. With input functions we can write a generalized rule that can handle both types:
def fastq_input(wildcards):
    if wildcards.sample_id in ["sample1", "sample2"]:
        return "data/" + wildcards.sample_id + ".fastq.gz"
    else:
        return ["data/" + wildcards.sample_id + ".R1.fastq.gz",
                "data/" + wildcards.sample_id + ".R2.fastq.gz"]

rule fastq_stats:
    output:
        "{sample_id}.stats.txt"
    input:
        fastq_input
    shell:
        """
        seqtk comp {input} > {output}
        """
As you can see, the fastq_stats
rule outputs one file {sample_id}.stats.txt
and takes as input the value returned from the fastq_input
function. In this function the sample id is evaluated and if it is either sample1
or sample2
(our single-end samples) then the function returns a single string which is the path to the FASTQ file for that sample. Otherwise, the function returns a list containing both the R1
and R2
files for the sample. In the shell:
directive of the rule the seqtk comp
command is run on the input and the output is sent to the output file.
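To convince yourself that the function resolves the input differently for the two cases, you could do dry-runs for one single-end and one paired-end target (assuming the FASTQ files exist under data/):

snakemake -n -p sample1.stats.txt   # input resolves to data/sample1.fastq.gz
snakemake -n -p sample3.stats.txt   # input resolves to data/sample3.R1.fastq.gz and data/sample3.R2.fastq.gz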
In this section we’ve learned:
- How a simple Snakemake rule looks.
- How to define target files when executing a workflow.
- How to use named wildcards for writing generic and flexible rules.
- How to use input functions in rules.
3 Visualising workflows
All that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate three types of graphs, one that shows how the rules are connected, one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected, and finally one that shows rules with their respective input/output files.
First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot
, which in turn will save a visualization of the graph as a PNG file (if you’re having troubles displaying PNG files you could use SVG or JPG instead).
If you added the print(wildcards.first,wildcards.second,files)
statement to the concat_input
function in the previous section you need to remove that line before running the commands below.
snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png
This looks simple enough, the output from the rule convert_to_upper_case
will be used as input to the rule concatenate_files
.
For a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.
While saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.
The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag
instead of --rulegraph
).
snakemake --dag a_b.txt | dot -Tpng > jobgraph.png
The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this rule doesn’t need to be rerun in order to generate a_b.txt
. Validate this by running snakemake -n a_b.txt
and it should say that there is nothing to be done.
We’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt
to some other text and save it. What do you think will happen if you run snakemake -n a_b.txt
again?
Building DAG of jobs...
Job stats:
job count
--------------------- -------
concatenate_files 1
convert_to_upper_case 1
total 2
[Fri Nov 15 10:27:22 2024]
rule convert_to_upper_case:
input: a.txt
output: a.upper.txt
jobid: 1
reason: Updated input files: a.txt
wildcards: some_name=a
resources: tmpdir=<TBD>
[Fri Nov 15 10:27:22 2024]
rule concatenate_files:
input: a.upper.txt, b.upper.txt
output: a_b.txt
jobid: 0
reason: Input files updated by another job: a.upper.txt
wildcards: first=a, second=b
resources: tmpdir=<TBD>
Job stats:
job count
--------------------- -------
concatenate_files 1
convert_to_upper_case 1
total 2
Reasons:
(check individual jobs above for details)
input files updated by another job:
concatenate_files
updated input files:
convert_to_upper_case
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Were you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n
and validate that a_b.txt
contains the new text. Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.
We’ve seen that Snakemake keeps track of if files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What about if the rules themselves are changed? Since version 7.8.0 Snakemake keeps track of this automatically.
Modify the rule concatenate_files
to also include which files were concatenated.
rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        concat_input
    shell:
        """
        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
        """
It’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space delimited list of the files in input
. This string is then sent to the program cat
where it’s concatenated with input[0]
and input[1]
(the parameter -
means that it should read from standard input). Lastly, the output from cat
is sent to {output}
.
If you now run:
snakemake -n a_b.txt
you should see:
rule concatenate_files:
input: a.upper.txt, b.upper.txt
output: a_b.txt
jobid: 0
reason: Code has changed since last execution
wildcards: first=a, second=b
Although no files involved in the workflow have been changed, Snakemake recognizes that the workflow code itself has been modified and this triggers a re-run.
Snakemake is aware of changes to five categories of such “rerun-triggers”:
- “input” (changes to rule input files)
- “params” (changes to the rule params section)
- “software-env” (changes to Conda environment files specified by the conda: directive)
- “code” (changes to code in the shell:, run:, script: and notebook: directives)
- “mtime” (changes to the modification time of input files)
Using the flag --rerun-triggers
you can specify which of these categories should trigger a re-run. For example, --rerun-triggers input
will only re-run the workflow if the specified input files have changed and --rerun-triggers input params
will re-run if either the input files or the params
section of the rule has changed.
Prior to version 7.8.0, only changes to the modification time of input files would trigger automatic re-runs. To run Snakemake with this previous behaviour you can use the setting --rerun-triggers mtime
at the command line. Try running: snakemake --rerun-triggers mtime -n a_b.txt
and you should again see Nothing to be done (all requested files are present and up to date).
You can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:
snakemake a_b.txt -D > summary.tsv
The content of summary.tsv
is shown in the table below:
output_file | date | rule | log-file(s) | input-file(s) | shellcmd | status | plan
---|---|---|---|---|---|---|---
a_b.txt | Fri Nov 15 10:32:04 2024 | concatenate_files | - | a.upper.txt,b.upper.txt | cat a.upper.txt b.upper.txt > a_b.txt | rule implementation changed | update pending
a.upper.txt | Fri Nov 15 10:32:04 2024 | convert_to_upper_case | - | a.txt | tr [a-z] [A-Z] < a.txt > a.upper.txt | ok | no update
b.upper.txt | Thu Nov 14 00:04:06 2024 | convert_to_upper_case | - | b.txt | tr [a-z] [A-Z] < b.txt > b.upper.txt | ok | no update
You can see in the second last column that the rule implementation for a_b.txt
has changed. The last column shows if Snakemake plans to regenerate the files when it’s next executed. You can see that for the concatenate_files
the plan is update pending
because we generated the summary with the default behaviour of using all rerun-triggers.
You might wonder where Snakemake keeps track of all these things. It stores all information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with Git.
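For example, the relevant line in a .gitignore file could simply be:

# Snakemake's bookkeeping directory
.snakemake/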
Let’s take a look at the final type of graph that Snakemake can generate, this is done with the --filegraph
flag and will show the rules and their input and output files.
snakemake --filegraph a_b.txt | dot -Tpng > filegraph.png
In terms of details/complexity the filegraph is an intermediate between the rulegraph (--rulegraph
) and the jobgraph (--dag
). It shows the rules to run with their respective input and output files (shown with wildcards), but doesn’t show the jobs that will be executed.
By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.
In this section we’ve learned:
- How to use --rulegraph, --dag and --filegraph for visualizing the workflow.
- How Snakemake reruns relevant parts of the workflow after there have been changes.
- How Snakemake tracks changes to files and code in a workflow.
4 The MRSA workflow
As you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacteria MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq of three strains, one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. The rest of the Snakemake tutorial will be spent improving and making this workflow more flexible!
This section will leave a little more up to you compared to the previous one. If you get stuck at some point the final workflow after all the modifications is available in tutorials/containers/Snakefile
.
You are probably already in your snakemake-env
environment, otherwise activate it (use conda info --envs
if you are unsure).
Let’s start by generating the rule graph so that we get an overview of the workflow. Here we have to specify the file with the rules using the -s
flag to Snakemake since the path to the file differs from the default.
snakemake -s snakefile_mrsa.smk --rulegraph | dot -T png > rulegraph_mrsa.png
There’s another difference in this command compared to the one we’ve used before, namely that we don’t define a target. In the toy example we used a_b.txt
as a target, and the wildcards were resolved based on that. How come that we don’t need to do that here? This is because by default Snakemake uses the first rule in a workflow as its target, unless a target is specified on the command line. By convention, this rule is called all
and serves as a rule for aggregating the main outputs of the workflow.
As you can see from the snakefile_mrsa.smk
and the rulegraph, the input to the all
rule is the output from the rules multiqc
and generate_count_table
. These rules in turn have their own inputs from other rules and by targeting the all
rule Snakemake determines what needs to be run by resolving the graph from bottom to top.
Now take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.
Also generate the job graph in the same manner. Here you can see that three samples will be downloaded: SRR935090, SRR935091, and SRR935092. The original sample files contain tens of millions of reads but for the purpose of this course we have sub-sampled them to 100,000 reads per sample, so that they are easy to manage, and made them available at the SciLifeLab Data Repository. These FASTQ files will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on.
Now let’s run the whole workflow. As before, specify -p
to have Snakemake print the shell commands for rules and use -c
to control how many cores to run with. If you don’t specify the number of cores, Snakemake will use all available cores on your machine. Let’s run the workflow with one core:
snakemake -s snakefile_mrsa.smk -c 1 -p
After everything is done, the workflow will have generated fastq files for each sample under data/
and reference genome files under data/ref
. The results/
directory will contain the output from the different steps in the workflow, divided into subdirectories. Take some time to look through the structure.
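As a rough sketch, based on the paths used by the rules (exact contents depend on your run), the project directory should now contain something like:

data/                # downloaded FASTQ files, e.g. SRR935090.fastq.gz
data/ref/            # reference genome FASTA and GFF3 files
results/fastqc/      # FastQC reports per sample
results/multiqc/     # aggregated MultiQC report
results/bowtie2/     # Bowtie 2 genome index
results/bam/         # alignments in BAM format
results/tables/      # the gene count table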
In this section we’ve learned:
- How the MRSA workflow looks.
- How to run the MRSA workflow.
- Which output files the MRSA workflow produces.
5 Parameters and configuration
In a typical bioinformatics project, considerable effort is spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to edit the shell commands themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params
keyword.
rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    params:
        cutoff = 2.5
    shell:
        """
        some_program --cutoff {params.cutoff} {input} {output}
        """
In the toy example above, the params directive has one keyword called cutoff
which is set to 2.5
and this value can be accessed in the shell command with {params.cutoff}
.
In the MRSA workflow most of the programs are run with default settings and don’t use the params:
directive. However, the get_SRA_by_accession
rule is an exception. Let’s take a look at this part of the workflow:
snakefile_mrsa.smk
def get_sample_url(wildcards):
    samples = {
        "SRR935090": "https://figshare.scilifelab.se/ndownloader/files/39539767",
        "SRR935091": "https://figshare.scilifelab.se/ndownloader/files/39539770",
        "SRR935092": "https://figshare.scilifelab.se/ndownloader/files/39539773"
    }
    return samples[wildcards.sample_id]

rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file
    """
    output:
        "data/{sample_id}.fastq.gz"
    params:
        url = get_sample_url
    shell:
        """
        wget -q -O - {params.url} | seqtk sample - 25000 | gzip -c > {output[0]}
        """
Here params
has a keyword called url
which is set to a function called get_sample_url
defined just above the rule. The function returns a URL for the sample that the rule is supposed to download and this URL is then accessed in the shell:
directive with {params.url}
.
You may recognize this from The basics of this tutorial where we used input functions to generate strings and lists of strings for the input:
section of a rule. Using a function to return values based on the wildcards also works for params:
. Here sample_id
is a wildcard which in this specific workflow can be either SRR935090
, SRR935091
, or SRR935092
. The wildcards object is passed to the function get_sample_url
and depending on what output the rule is supposed to generate, wildcards.sample_id
will take the value of either of the three sample ids. The samples
variable defined in the function is a Python dictionary that has the URLs for each sample_id hard-coded. This dictionary is used to convert the value of the sample_id
wildcard to a URL, which is returned by the function. Finally, in the shell:
directive we access the url
parameter with {params.url}
. (We could have written three separate rules to download the samples, but it’s easy to see how that can become impractical.)
Let’s add another parameter to the get_SRA_by_accession
rule. As you can see in the shell command the FASTQ file downloaded by wget
gets piped directly (the -O -
part means send contents to STDOUT) to the seqtk sample
command which reads from STDIN and outputs 25000 randomly sampled reads (out of the 100,000 contained in the example FASTQ file). Change the rule to use a parameter max_reads instead and set the value to 20000. If you need help, click to show the solution below.
snakefile_mrsa.smk
rule get_SRA_by_accession:"""
Retrieve a single-read FASTQ file
"""
output:"data/{sample_id}.fastq.gz"
params:= get_sample_url,
url = 20000
max_reads
shell:"""
wget -q -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
"""
Now run the workflow again with:
snakemake -s snakefile_mrsa.smk -p -c 1
Because there’s been changes to the get_SRA_by_accession
rule this will trigger a re-run of the rule for all three accessions. In addition all downstream rules that depend on output from get_SRA_by_accession
are re-run.
As you can see the parameter values we set in the params
section don’t have to be static, they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config
. Let’s modify get_SRA_by_accession
in order to make use of this dictionary:
snakefile_mrsa.smk
rule get_SRA_by_accession:"""
Retrieve a single-read FASTQ file
"""
output:"data/{sample_id}.fastq.gz"
params:= get_sample_url,
url = config["max_reads"]
max_reads
shell:"""
wget -q -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
"""
Note that Snakemake now expects there to be a key named max_reads
in the config dictionary. If we don’t populate the dictionary somehow it will be empty, so if you were to run the workflow now it would trigger a KeyError
(try running snakemake -s snakefile_mrsa.smk -n
to see for yourself). In order to populate the config dictionary with data for the workflow we could use the snakemake --config KEY=VALUE
syntax directly from the command line (e.g. snakemake --config max_reads=20000 -s snakefile_mrsa.smk
). However, from a reproducibility perspective, it’s not optimal to set parameters from the command line, since it’s difficult to keep track of which parameter values were used.
A much better alternative is to use the --configfile FILE
option to supply a configuration file to Snakemake. In this file we can collect all the project-specific settings, sample ids and so on. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in YAML format. Create the file below and save it as config.yml
(here we change back to using 25000 reads):
config.yml
max_reads: 25000
If we now run Snakemake with --configfile config.yml
, it will parse this file to form the config
dictionary. Try it out by running:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1
Configuration settings that you specify with --config KEY=VALUE
on the command line will overwrite settings specified in a config file. This can be useful if you want to run the workflow with a different setting for a specific parameter for testing purposes.
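For example, to test a different subsampling depth without touching config.yml you could run something like the command below (the value 1000 is just an arbitrary test value):

snakemake -s snakefile_mrsa.smk --configfile config.yml --config max_reads=1000 -n -p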
Rather than supplying the config file from the command line you could also add the line configfile: "config.yml"
to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the file config.yml
is not present.
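As a minimal sketch, the top of the Snakefile would then start with something like:

# Populate the global config dictionary from the project's config file.
# Snakemake will raise an error if config.yml does not exist.
configfile: "config.yml"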
In this section we’ve learned:
- How to set parameter values with the params directive.
- How to run Snakemake with the config variable and with a configuration file.
6 Logs
As you probably noticed it was difficult to follow how the workflow progressed since some rules printed a lot of output to the terminal. In some cases this also contained important information, such as statistics on the sequence alignments or genome indexing. This could be valuable for example if you later in the project get weird results and want to debug. It’s also important from a reproducibility perspective that the “paper trail” describing how the outputs were generated is saved. Luckily, Snakemake has a feature that can help with this. Just as we define input
and output
in a rule we can also define log
.
rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    log:
        "logs/some_log.txt"
    shell:
        """
        some_program {input} {output} > {log}
        """
A log file is not different from any other output file, but it’s dealt with a little differently by Snakemake. For example, it’s shown in the file summary when using -D
and unlike other output files it’s not deleted if jobs fail, which of course is necessary for debugging purposes. It’s also a good way to clarify the purpose of the file. We probably don’t need to save logs for all the rules, only the ones with interesting output:
- multiqc aggregates quality control data for all the samples into one html report, and the log contains information about which samples were aggregated.
- get_genome_fasta and get_genome_gff3 would be good to log since they are dependent on downloading files from an external server.
- index_genome outputs some statistics about the genome indexing.
- generate_count_table outputs information about input files, settings and a summary of results.
- align_to_genome outputs important statistics about the alignments. This is probably the most important log to save.
Now add a log file to some or all of the rules above. A good place to save them to would be results/logs/<rule_name>/
. To avoid having multiple jobs write to the same files, Snakemake requires that all output and log files contain the same wildcards, so be sure to include any wildcards used in the rule in the log name as well, e.g. {some_wildcard}.log
.
You also have to specify in the shell
section of each rule what you want the log to contain. Some of the programs we use send their log information to standard out, some to standard error and some let us specify a log file via a flag.
For example, bowtie2
used in the align_to_genome
rule writes log info to standard error so we use 2>{log}
to redirect this to the log file:
snakefile_mrsa.smk
rule align_to_genome:"""
Align a fastq file to a genome index using Bowtie 2.
"""
output:"results/bam/{sample_id,\\w+}.bam"
input:
"data/{sample_id}.fastq.gz",
"results/bowtie2/NCTC8325.1.bt2",
"results/bowtie2/NCTC8325.2.bt2",
"results/bowtie2/NCTC8325.3.bt2",
"results/bowtie2/NCTC8325.4.bt2",
"results/bowtie2/NCTC8325.rev.1.bt2",
"results/bowtie2/NCTC8325.rev.2.bt2"
log:"results/logs/align_to_genome/{sample_id}.log"
shell:"""
bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
"""
For programs used in the rules multiqc
, get_genome_fasta
, get_genome_gff3
, index_genome
and generate_count_table
you can use the table below if you need help with how to redirect output to the log file.
Rule | Program | Redirect flag
---|---|---
multiqc | multiqc | 2>{log}
get_genome_fasta / get_genome_gff3 | wget | -o {log}
index_genome | bowtie2-build | >{log}
generate_count_table | featureCounts | 2>{log}
If you need help, click to show the solution below for the rules.
snakefile_mrsa.smk
rule multiqc:"""
Aggregate all FastQC reports into a MultiQC report.
"""
output:= "results/multiqc/multiqc.html",
html = "results/multiqc/multiqc_general_stats.txt"
stats input:
"results/fastqc/SRR935090_fastqc.zip",
"results/fastqc/SRR935091_fastqc.zip",
"results/fastqc/SRR935092_fastqc.zip"
log:"results/logs/multiqc.log"
shell:"""
# Run multiQC and keep the html report
multiqc -n multiqc.html {input} 2>{log}
mv multiqc.html {output.html}
mv multiqc_data/multiqc_general_stats.txt {output.stats}
# Remove the other directory that multiQC creates
rm -rf multiqc_data
"""
rule get_genome_fasta:"""
Retrieve the sequence in fasta format for a genome.
"""
output:"data/ref/NCTC8325.fa.gz"
log:"results/logs/get_genome_fasta.log"
shell:"""
wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}
"""
rule get_genome_gff3:"""
Retrieve annotation in gff3 format for a genome.
"""
output:"data/ref/NCTC8325.gff3.gz"
log:"results/logs/get_genome_gff3.log"
shell:"""
wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz -O {output}
"""
rule index_genome:"""
Index a genome using Bowtie 2.
"""
output:"results/bowtie2/NCTC8325.1.bt2",
"results/bowtie2/NCTC8325.2.bt2",
"results/bowtie2/NCTC8325.3.bt2",
"results/bowtie2/NCTC8325.4.bt2",
"results/bowtie2/NCTC8325.rev.1.bt2",
"results/bowtie2/NCTC8325.rev.2.bt2"
input:
"data/ref/NCTC8325.fa.gz"
log:"results/logs/index_genome.log"
shell:"""
# Bowtie2 cannot use .gz, so unzip to a temporary file first
gunzip -c {input} > tempfile
bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
# Remove the temporary file
rm tempfile
"""
rule generate_count_table:"""
Generate a count table using featureCounts.
"""
output:"results/tables/counts.tsv"
input:
= ["results/bam/SRR935090.sorted.bam",
bams "results/bam/SRR935091.sorted.bam",
"results/bam/SRR935092.sorted.bam"],
= "data/ref/NCTC8325.gff3.gz"
annotation
log:"results/logs/generate_count_table.log"
shell:"""
featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
"""
After adding logfiles to the rules, rerun the whole workflow with:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1
Do the logs contain what they should? Note how much easier it is to follow the progression of the workflow when the rules write to logs instead of to the terminal.
If you have a rule with a shell directive in which several commands are run and you want to save stdout and stderr for all commands into the same log file you can add exec &>{log}
as the first line of the shell directive.
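As a sketch of what that could look like (some_program_1, some_program_2 and the intermediate file are hypothetical placeholders):

rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    log:
        "logs/some_log.txt"
    shell:
        """
        # Send stdout and stderr of all commands below to the log file
        exec &>{log}
        some_program_1 {input} some_intermediate_file
        some_program_2 some_intermediate_file {output}
        """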
If you run with -D
(or -S
for a simpler version) you will see that the summary table now also contains the log file for each of the files in the workflow.
In this section we’ve learned:
- How to redirect output to log files with the log directive.
7 Temporary files
It’s not uncommon that workflows contain temporary files that should be kept for some time and then deleted once they are no longer needed. A typical case could be that some operation generates a file, which is then compressed to save space or indexed to make searching faster. There is then no need to save the original output file. Take a look at the job graph for our workflow again. The output from align_to_genome
is a BAM file, which contains information about all the reads for a sample and where they map in the genome. For downstream processing we need this file to be sorted by genome coordinates. This is what the rule sort_bam
is for. We therefore end up with both results/bam/{sample_id}.bam
and results/bam/{sample_id}.sorted.bam
as you can see if you list the contents of results/bam
.
In Snakemake we can mark an output file as temporary like this:
rule some_rule:
    output:
        temp("results/some_output.txt")
The file will then be deleted as soon as all jobs where it’s an input have finished. Let’s use the temp
notation for the output of align_to_genome
, modify the rule so that the output:
directive looks like this:
snakefile_mrsa.smk
output:"results/bam/{sample_id,\\w+}.bam") temp(
We have to force a rerun of the rule for it to trigger, so we use --forcerun align_to_genome
to do this. Run the workflow with:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun align_to_genome
Take a look at the output printed to your terminal. After the rule sort_bam
you should see something like this:
Removing temporary output results/bam/SRR935092.bam.
If you list the contents of results/bam
you should see that it now only contains the sorted BAM files.
If you want to prevent Snakemake from deleting files marked as temporary you can run with the --notemp
command line flag which will cause Snakemake to ignore any temp()
declarations. If you at any point want to delete remaining files that are declared as temporary in your workflow you can instead run with the --delete-temp-output
flag, which will delete all remaining temp
files.
Snakemake has a number of additional options for marking files (a short sketch of how these are used follows the list):
- protected("..."): The output file should be write-protected. Typically used to protect files that require a huge amount of computational resources from being accidentally deleted.
- ancient("..."): The timestamp of the input file is ignored and it’s always assumed to be older than any of the output files.
- touch("..."): The output file should be “touched”, i.e. created or updated, when the rule has finished. Typically used as “flag files” to enforce some rule execution order without real file dependencies.
- directory("..."): The output is a directory rather than a file.
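As a minimal sketch of how these markers are used (the rule, file names and the some_program command are hypothetical):

rule mark_files_example:
    output:
        protected("results/expensive_result.txt"),
        directory("results/many_unknown_files/"),
        touch("results/done.flag")
    input:
        ancient("some_input.txt")
    shell:
        """
        some_program {input} results/expensive_result.txt results/many_unknown_files/
        """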
In this section we’ve learned:
- How to mark an output file as temporary for automatic removal.
8 The expand function
Now that we have a working workflow with logs and temporary files, let’s look at a way to make the code more readable.
Consider the rule align_to_genome
below.
snakefile_mrsa.smk
rule align_to_genome:"""
Align a fastq file to a genome index using Bowtie 2.
"""
output:"results/bam/{sample_id,\\w+}.bam")
temp(input:
"data/{sample_id}.fastq.gz",
"results/bowtie2/NCTC8325.1.bt2",
"results/bowtie2/NCTC8325.2.bt2",
"results/bowtie2/NCTC8325.3.bt2",
"results/bowtie2/NCTC8325.4.bt2",
"results/bowtie2/NCTC8325.rev.1.bt2",
"results/bowtie2/NCTC8325.rev.2.bt2"
log:"results/logs/align_to_genome/{sample_id}.log"
shell:"""
bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
"""
Here we have seven inputs; the FASTQ file with the reads and six files with similar file names from the Bowtie2 genome indexing. Instead of writing all the filenames we can tidy this up by using the built-in function expand()
.
"results/bowtie2/NCTC8325.{substr}.bt2",
expand(= ["1", "2", "3", "4", "rev.1", "rev.2"]) substr
Here expand
will take the string "results/bowtie2/NCTC8325.{substr}.bt2"
and insert each of the values in the substr
list: ["1", "2", "3", "4", "rev.1", "rev.2"]
in place of {substr}
. The result is a list of strings that is used as input to the rule. This is a very powerful feature that can save you a lot of typing and make your code more readable.
Now modify the rules index_genome
and align_to_genome
to use the expand()
expression in the output and input, respectively. Click the solution below if you need help.
snakefile_mrsa.smk
rule index_genome:"""
Index a genome using Bowtie 2.
"""
output:"results/bowtie2/NCTC8325.{substr}.bt2",
expand(=["1", "2", "3", "4","rev.1", "rev.2"])
substrinput:
"data/ref/NCTC8325.fa.gz"
log:"results/logs/index_genome/NCTC8325.log"
shell:"""
# Bowtie2 cannot use .gz, so unzip to a temporary file first
gunzip -c {input} > tempfile
bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
# Remove the temporary file
rm tempfile
"""
rule align_to_genome:"""
Align a fastq file to a genome index using Bowtie 2.
"""
output:"results/bam/{sample_id,\\w+}.bam")
temp(input:
"data/{sample_id}.fastq.gz",
"results/bowtie2/NCTC8325.{substr}.bt2",
expand(=["1", "2", "3", "4", "rev.1", "rev.2"])
substr
log:"results/logs/align_to_genome/{sample_id}.log"
shell:"""
bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
"""
Great! Let’s put the expand
function to more use!
The multiqc
and generate_count_table
rules take as input the fastqc zip files, and the sorted bam files for all three samples, respectively. See the excerpt below:
snakefile_mrsa.smk
rule multiqc:
    ...
    input:
        "results/fastqc/SRR935090_fastqc.zip",
        "results/fastqc/SRR935091_fastqc.zip",
        "results/fastqc/SRR935092_fastqc.zip"
    ...

rule generate_count_table:
    ...
    input:
        bams = ["results/bam/SRR935090.sorted.bam",
                "results/bam/SRR935091.sorted.bam",
                "results/bam/SRR935092.sorted.bam"],
    ...
Having the files for each sample written explicitly in these rules is how Snakemake knows to run the workflow for all three samples. However, this is also a potential source of errors since it’s easy to change in one place and forget to change in the other. Because we can use Python code “everywhere”, let’s instead define a list of sample ids and put it at the top of the Snakefile, just before the rule all
:
snakefile_mrsa.smk
= ["SRR935090", "SRR935091", "SRR935092"] sample_ids
Now use expand()
in multiqc
and generate_count_table
to use sample_ids
for the sample ids. For the multiqc
rule it could look like this:
input:
    expand("results/fastqc/{sample_id}_fastqc.zip",
           sample_id = sample_ids)
See if you can update the generate_count_table
rule in the same manner. If you need help, click the solution below.
snakefile_mrsa.smk
rule generate_count_table:"""
Generate a count table using featureCounts.
"""
output:"results/tables/counts.tsv"
input:
= expand("results/bam/{sample_id}.sorted.bam", sample_id = sample_ids),
bams = "data/ref/NCTC8325.gff3.gz"
annotation
log:"results/logs/generate_count_table.log"
shell:"""
featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
"""
Now run the workflow again with:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun align_to_genome index_genome multiqc generate_count_table
to force a rerun of the modified rules and make sure that the workflow still works as expected.
In this section we’ve learned:
- How to use the expand() expression to make the code more readable.
9 Shadow rules
Take a look at the index_genome
rule below:
snakefile_mrsa.smk
rule index_genome:"""
Index a genome using Bowtie 2.
"""
output:"results/bowtie2/NCTC8325.{substr}.bt2",
expand(= ["1", "2", "3", "4", "rev.1", "rev.2"])
substr input:
"data/NCTC8325.fa.gz"
log:"results/logs/index_genome/NCTC8325.log"
shell:"""
# Bowtie2 cannot use .gz, so unzip to a temporary file first
gunzip -c {input} > tempfile
bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
# Remove the temporary file
rm tempfile
"""
There is a temporary file here called tempfile
which is the uncompressed version of the input fasta file. This file is created when the rule starts (since Bowtie2 cannot use compressed files) but is not needed upon completion. There are a number of drawbacks to having files that aren’t explicitly part of the workflow as input/output files to rules:
- Snakemake cannot clean up these files if the job fails, as it would do for normal output files.
- If several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.
- Sometimes we don’t know the names of all the files that a program can generate and we may only be interested in a few of them.
All of these issues can be dealt with by using the shadow
option for a rule. With the shadow option, each execution of the rule is run in an isolated temporary directory (located in .snakemake/shadow/
by default). There are a few options for shadow
(for the full list of these options see the Snakemake docs).
The simplest option is shadow: "minimal"
, which means that the rule is executed in an empty shadow directory that the input files to the rule have been symlinked into. For the rule below, that means that some_input.txt
would be symlinked into a shadow directory under .snakemake/shadow/
. The shell commands would generate the file some_other_junk_file
in this shadow directory, as well as the output file results/some_output.txt
. When the rule finishes, the shadow directory is removed which means we wouldn’t have to think about manually removing some_other_junk_file
.
rule some_rule:
    output:
        "results/some_output.txt"
    input:
        "some_input.txt"
    shadow: "minimal"
    shell:
        """
        touch some_other_junk_file
        some_program {input} {output}
        """
Try this out for the rules multiqc
and index_genome
where we have to “manually” deal with files that aren’t tracked by Snakemake. Add shadow: "minimal"
to these rules and also delete the shell commands that remove temporary files from those rules (the last lines in the shell directives that begin with rm
), as they are no longer needed. Check the solution below if you need help.
snakefile_mrsa.smk
rule multiqc:"""
Aggregate all FastQC reports into a MultiQC report.
"""
output:= "results/multiqc/multiqc.html",
html = "results/multiqc/multiqc_general_stats.txt"
stats input:
"results/fastqc/{sample_id}_fastqc.zip",
expand(= sample_ids)
sample_id
log:"results/logs/multiqc.log"
"minimal"
shadow:
shell:"""
# Run multiQC and keep the html report
multiqc -n multiqc.html {input} 2>{log}
mv multiqc.html {output.html}
mv multiqc_data/multiqc_general_stats.txt {output.stats}
"""
rule index_genome:"""
Index a genome using Bowtie 2.
"""
output:"results/bowtie2/NCTC8325.{substr}.bt2",
expand(=["1", "2", "3", "4","rev.1", "rev.2"])
substrinput:
"data/ref/NCTC8325.fa.gz"
log:"results/logs/index_genome/NCTC8325.log"
"minimal"
shadow:
shell:"""
# Bowtie2 cannot use .gz, so unzip to a temporary file first
gunzip -c {input} > tempfile
bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
"""
Now run the workflow again with:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1
and make sure that you see neither tempfile
, nor the multiqc_data
directory in your current working directory after the workflow has finished.
Some people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you’re running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix
flag).
In this section we’ve learned:
- How to use the shadow option to handle files that are not tracked by Snakemake.
10 Generalizing the MRSA workflow
You’ve now improved the MRSA workflow by making it more readable, adding logfiles, a config file etc. However, the workflow is not very general since it’s hard-coded to work with the three samples SRR935090
, SRR935091
and SRR935092
and the genome NCTC8325
. If someone wanted to use the workflow with other samples or genomes they would have to change the code in the Snakefile, which is not optimal. Instead it’s a good idea to separate project-specific parameters from the actual implementation of the workflow. This allows anyone using the workflow to modify its behaviour without changing the underlying code, making the workflow more general.
In this section we will go through how to make the workflow more general and flexible by moving the sample ids and URLs to a configuration file and turning the reference genome into a wildcard.
10.1 Specifying samples in a configuration file
Remove the sample_ids = ["SRR935090", "SRR935091", "SRR935092"]
line we added to the top of snakefile_mrsa.smk
and add the following to config.yml
:
config.yml
samples:
  SRR935090: "https://figshare.scilifelab.se/ndownloader/files/39539767"
  SRR935091: "https://figshare.scilifelab.se/ndownloader/files/39539770"
  SRR935092: "https://figshare.scilifelab.se/ndownloader/files/39539773"
Now when passing --configfile config.yml
on the command line config["samples"]
will be a dictionary where the keys are the sample ids and the values are the URLs to the FASTQ files.
Next, modify the get_sample_url
function to use this dictionary instead of the hard-coded values:
snakefile_mrsa.smk
def get_sample_url(wildcards):
    return config["samples"][wildcards.sample_id]
Now change the multiqc
and generate_count_table
rules that use the expand
function in the input directive to use config["samples"].keys()
to expand the sample_id
wildcard. This is how it should look for the multiqc
rule:
input:
    expand("results/fastqc/{sample_id}_fastqc.zip",
           sample_id = config["samples"].keys())
Try to change the generate_count_table
rule in the same way. Check the solution below if you need help.
snakefile_mrsa.smk
rule generate_count_table:"""
Generate a count table using featureCounts.
"""
output:"results/tables/counts.tsv"
input:
= expand("results/bam/{sample_id}.sorted.bam",
bams = config["samples"].keys()),
sample_id = "data/ref/NCTC8325.gff3.gz"
annotation
log:"results/logs/generate_count_table.log"
shell:"""
featureCounts -t gene -g gene_id -a {input.annotation} -o {output} {input.bams} 2>{log}
"""
To see that your new implementation works you can do a forced rerun of the modified rules:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --forcerun generate_count_table multiqc get_SRA_by_accession
If you were to scale up this workflow with more samples it could become impractical to define the sample ids and URLs by hand in the config file. A tip then is to have a separate file where samples are listed in one column and the URLs (or file paths) in another column. With a few lines of Python code you could then read that list at the start of the workflow and add each sample to the config dictionary. See below for a sketch of how this could be done.
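As a sketch, assuming a hypothetical tab-separated file called samples.tsv with the sample id in the first column and the URL in the second, the top of the Snakefile could read the file like this:

# Sketch: populate config["samples"] from a tab-separated file instead of
# listing every sample in config.yml. The file name samples.tsv and its
# two-column layout are assumptions for this example.
config["samples"] = {}
with open("samples.tsv") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        sample_id, url = line.split("\t")
        config["samples"][sample_id] = url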
10.2 Generalizing the genome reference
Now we’ll take a look at how to generalize the genome reference used in the workflow.
In the get_genome_fasta
and get_genome_gff3
rules we have hard-coded FTP paths to the FASTA and GFF annotation file for the genome NCTC8325
. Let’s move this information to the configuration file, and also add information for another genome, ST398
. Add the following to config.yml
:
config.yml
genomes:
  NCTC8325:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz
  ST398:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection//staphylococcus_aureus_subsp_aureus_st398/dna/Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.dna.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_st398//Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.37.gff3.gz
As you can see this is very similar to what we did with samples
, it’s just that we have one more nested level. Now to access the FTP path to the FASTA file for genome id NCTC8325
we can use config["genomes"]["NCTC8325"]["fasta"]
.
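If you want to convince yourself of the structure, you could (purely as an illustration) loop over the genomes in a Python session, or temporarily at the top of the Snakefile, assuming the config has been loaded with --configfile:
# list all genome ids and their download URLs from the config
for genome_id, urls in config["genomes"].items():
    print(genome_id, urls["fasta"], urls["gff3"])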
Let’s now look at how to do the mapping from genome id to FASTA path in the rule get_genome_fasta
. This is how the rule currently looks (if you have added the log section as previously described).
snakefile_mrsa.smk
rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/ref/NCTC8325.fa.gz"
    log:
        "results/logs/get_genome_fasta/NCTC8325.log"
    shell:
        """
        wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}
        """
Replace the hard-coded NCTC8325
with a wildcard {genome_id}
in both the output
and log
directives. We now need to supply the remote paths to the FASTA file for a given genome id. Because we’ve added this information to the config file we just need to pass it to the rule in some way, and just like in the get_SRA_by_accession
rule we’ll use a function to do the job:
snakefile_mrsa.smk
def get_fasta_path(wildcards):
    return config["genomes"][wildcards.genome_id]["fasta"]

rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/ref/{genome_id}.fa.gz"
    log:
        "results/logs/get_genome_fasta/{genome_id}.log"
    params:
        fasta_path = get_fasta_path
    shell:
        """
        wget -o {log} {params.fasta_path} -O {output}
        """
Now change the get_genome_gff3
rule in a similar manner. Check the solution below if you’re having trouble.
def get_gff_path(wildcards):
    return config["genomes"][wildcards.genome_id]["gff3"]

rule get_genome_gff3:
    """
    Retrieve annotation in gff3 format for a genome.
    """
    output:
        "data/ref/{genome_id}.gff3.gz"
    log:
        "results/logs/get_genome_gff3/{genome_id}.log"
    params:
        gff3_path = get_gff_path
    shell:
        """
        wget -o {log} {params.gff3_path} -O {output}
        """
Also replace NCTC8325
with {genome_id}
in the log
, input
and output
of the index_genome
rule. Here you will run into a complication if you have followed the previous instructions and use the expand()
expression. We want the list of output files to be ["results/bowtie2/{genome_id}.1.bt2", "results/bowtie2/{genome_id}.2.bt2", ...]
, i.e. only expanding the substr
wildcard in the Bowtie2 index. To prevent Snakemake from trying to expand the genome_id
wildcard we have to “mask” it with double curly brackets: {{genome_id}}.
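If you want to see the masking in action, expand() can be imported from snakemake.io and tried in an interactive Python session (the two-element substr list here is just to keep the output short):
from snakemake.io import expand

# {substr} is filled in, while {{genome_id}} is kept as the literal wildcard {genome_id}
expand("results/bowtie2/{{genome_id}}.{substr}.bt2", substr = ["1", "2"])
# -> ['results/bowtie2/{genome_id}.1.bt2', 'results/bowtie2/{genome_id}.2.bt2']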
In addition, we need to replace the hard-coded results/bowtie2/NCTC8325
in the shell directive of the rule with the genome id wildcard. Inside the shell directive the wildcard object is accessed with this syntax: {wildcards.genome_id}
, so the Bowtie2-build command should be:
bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
The final rule should look like this:
snakefile_mrsa.smk
rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        expand("results/bowtie2/{{genome_id}}.{substr}.bt2",
               substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    input:
        "data/ref/{genome_id}.fa.gz"
    log:
        "results/logs/index_genome/{genome_id}.log"
    shadow: "minimal"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} >{log}
        """
Good job! The rules get_genome_fasta
, get_genome_gff3
and index_genome
can now download and index any genome as long as we provide valid links in the config file.
Only two rules remain where the NCTC8325
genome id is hard-coded: align_to_genome
and generate_count_table
. We will use these rules to define which genome id we actually want to use when running the workflow. First add a parameter to config.yml
called "genome_id"
and set it to NCTC8325
:
config.yml
genome_id: "NCTC8325"
Now we can resolve the genome_id
wildcard from the config. For align_to_genome
we change the input directive to:
input:
    "data/{sample_id}.fastq.gz",
    expand("results/bowtie2/{genome_id}.{substr}.bt2",
           genome_id = config["genome_id"],
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
Here the substr
wildcard gets expanded from a list while genome_id
gets expanded from the config dictionary. Also change the hard-coded NCTC8325
in the shell:
directive of align_to_genome
so that the genome_id is inserted directly from the config, like this:
shell:
    """
    bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}
    """
The final rule should look like this:
snakefile_mrsa.smk
rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        "data/{sample_id}.fastq.gz",
        expand("results/bowtie2/{genome_id}.{substr}.bt2",
               genome_id = config["genome_id"],
               substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        "results/logs/align_to_genome/{sample_id}.log"
    shell:
        """
        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}
        """
Now let’s change the hard-coded genome id in the generate_count_table
input in a similar manner. The final rule should look like this:
snakefile_mrsa.smk
rule generate_count_table:
    """
    Generate a count table using featureCounts.
    """
    output:
        "results/tables/counts.tsv",
        "results/tables/counts.tsv.summary"
    input:
        bams = expand("results/bam/{sample_id}.sorted.bam",
                      sample_id = config["samples"].keys()),
        annotation = expand("data/ref/{genome_id}.gff3.gz",
                            genome_id = config["genome_id"])
    log:
        "results/logs/generate_count_table.log"
    shell:
        """
        featureCounts -t gene -g gene_id -a {input.annotation} -o {output[0]} {input.bams} 2>{log}
        """
To make sure that the workflow works as expected, run it with:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1
Try changing the genome_id
in the config file to ST398
and rerun the workflow. You should now see that the workflow downloads the genome files for ST398
and aligns the samples to that genome instead.
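As a side note, you don’t have to edit the config file just to test this: Snakemake’s --config flag lets you override individual config values on the command line, for example:
snakemake -s snakefile_mrsa.smk --configfile config.yml -p -c 1 --config genome_id="ST398"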
In general, we want the rules as far downstream as possible in the workflow to be the ones that determine what the wildcards should resolve to. In our case align_to_genome
and generate_count_table
. You can think of it like the rule that really “needs” the file asks for it, and then it’s up to Snakemake to determine how it can use all the available rules to generate it. Here the align_to_genome
rule says “I need this genome index to align my sample to” and then it’s up to Snakemake to determine how to download and build the index.
Well done! You now have a complete Snakemake workflow with a number of excellent features:
- A general RNA-seq pipeline which can easily be reused between projects, thanks to clear separation between code and settings.
- Great traceability due to logs and summary tables.
- A clearly defined software environment for the workflow, specified with Conda.
- A workflow that is neat and free from temporary files, thanks to the use of temp() and shadow.
- A logical directory structure which makes it easy to separate data and results of different software packages.
- A project set up in a way that makes it very easy to distribute and reproduce either via Git, Snakemake’s
--archive
option or a Docker image.
11 Extra material
If you want to read more about Snakemake in general you can find several resources here:
- The Snakemake documentation is available on ReadTheDocs.
- Here is another (quite in-depth) tutorial.
- If you have questions, check out Stack Overflow.
11.1 Reading samples from a file instead of hard-coding them
So far we’ve specified the samples to use in the workflow either as a hard-coded list in the Snakefile, or as a list in the configuration file. This is of course impractical for large real-world examples. Here we’ll just quickly show how you could supply the samples to the MRSA workflow via a comma-separated file. For example you could create a file called samples.csv
with the following content:
samples.csv
SRR935090,https://figshare.scilifelab.se/ndownloader/files/39539767
SRR935091,https://figshare.scilifelab.se/ndownloader/files/39539770
SRR935092,https://figshare.scilifelab.se/ndownloader/files/39539773
The first column has the sample id and the second column has the url to the fastq file. Now in order to read this into the workflow we need to use a few lines of python code. Since you can mix python code with rule definitions in Snakemake we’ll just add the following lines to the top of the Snakefile:
snakefile_mrsa.smk
# define an empty 'samples' dictionary
samples = {}
# read the sample list file and populate the dictionary
with open("samples.csv", "r") as fhin:
    for line in fhin:
        # strip the newline character from the end of the line
        # then split by comma to get the sample id and url
        sample_id, url = line.strip().split(",")
        # store the url in the dictionary with the sample id as key
        samples[sample_id] = url
Now we can use the samples
dictionary in the workflow. For example, to get the url for SRR935090
we can use samples["SRR935090"]
.
For example, the get_sample_url
function can now be written as:
snakefile_mrsa.smk
def get_sample_url(wildcards):
    return samples[wildcards.sample_id]
We can also use the samples
dictionary in expand()
, for example in the multiqc
rule:
input:
    expand("results/fastqc/{sample_id}_fastqc.zip",
           sample_id = samples.keys())
Now this depends on there being a samples.csv
file in the working directory. To make this a configurable parameter we can add it to the config file:
sample_list: "samples.csv"
and update the code for populating the samples
dictionary:
# define an empty 'samples' dictionary
samples = {}
# read the sample list file and populate the dictionary
with open(config["sample_list"], "r") as fhin:
    for line in fhin:
        # strip the newline character from the end of the line
        # then split by comma to get the sample id and url
        sample_id, url = line.strip().split(",")
        # store the url in the dictionary with the sample id as key
        samples[sample_id] = url
This way, anyone can take our Snakefile and just update the path to their own sample_list
using the config file.
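If you want something slightly more robust than the minimal loop above, here is a sketch that uses Python’s built-in csv module instead; it assumes the same two-column comma-separated layout and additionally skips empty lines:
import csv

samples = {}
with open(config["sample_list"], "r") as fhin:
    for row in csv.reader(fhin):
        # skip empty lines
        if not row:
            continue
        # first column is the sample id, second column is the URL
        sample_id, url = row[0].strip(), row[1].strip()
        samples[sample_id] = url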
11.2 Using containers in Snakemake
Snakemake also supports defining an Apptainer or Docker container for each rule (you will have time to work on the Containers tutorial later during the course). Analogous to using a rule-specific Conda environment, specify container: "docker://some-account/rule-specific-image"
in the rule definition. Instead of a link to a container image, it is also possible to provide the path to an image file somewhere on your filesystem. When executing Snakemake, add the --software-deployment-method apptainer
(or the shorthand --sdm apptainer
) flag to the command line. For the given rule, Snakemake will then create an Apptainer container on the fly from the image or file provided in the rule definition, and run the rule in that container.
You can find pre-made Apptainer or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.
Here is an example for a rule and its execution:
rule align_to_genome:
    output:
        temp("results/bam/{sample_id,\\w+}.bam")
    input:
        fastq = "data/{sample_id}.fastq.gz",
        index = expand("results/bowtie2/{genome_id}.{substr}.bt2",
                       genome_id = config["genome_id"],
                       substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        expand("results/logs/align_to_genome/{{sample_id}}_{genome_id}.log",
               genome_id = config["genome_id"])
    container: "docker://quay.io/biocontainers/bowtie2:2.5.0--py310h8d7afc0_0"
    shell:
        """
        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input.fastq} > {output} 2>{log}
        """
Start your Snakemake workflow with the following command:
snakemake --configfile config.yml -s snakefile_mrsa.smk -c 1 --software-deployment-method apptainer
Feel free to modify the MRSA workflow according to this example. Since Apptainer is container software developed for HPC clusters, with only limited support on e.g. Macs, running your updated Snakemake workflow with Apptainer locally on your own computer might not work. In the next section we explain how you can run Snakemake workflows on UPPMAX, where Apptainer is pre-installed.
11.3 Running Snakemake workflows on HPC clusters
If you need to run a Snakemake workflow on a high-performance computing (HPC) cluster you have a wide range of options at your disposal. Via the plugin catalogue you can find plugins that will add support for various HPC schedulers to Snakemake.
Here we will focus on how to run Snakemake workflows on clusters with SLURM, a workload manager commonly used on HPC clusters in Sweden such as Rackham, Tetralith and Dardel.
11.3.1 Option 1: Run the entire workflow as a single job
For short workflows with only a few rules that need the same compute resources in terms of CPU (cores) and memory, you can submit the entire workflow as a job directly to the SLURM scheduler, or start an interactive job (in your tmux
or screen
session), and run your Snakemake workflow as you would on your local machine. Make sure to give your job enough time to finish running all rules of your Snakemake workflow.
If you choose this option, you don’t need to install anything from the plugin catalogue. However, your workflow may not run as efficiently as it could if you were to add SLURM support in Snakemake.
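To make option 1 concrete, a batch script for submitting the whole workflow as a single job could look something like the sketch below; the account, partition, core count and time limit are placeholders that you need to replace with values valid on your cluster, and it assumes that Conda environments can be activated from within batch jobs on your system:
#!/bin/bash
#SBATCH -A naiss-2023-01-001   # project/account id (placeholder)
#SBATCH -p shared              # partition (placeholder, cluster-specific)
#SBATCH -c 4                   # cores for the whole workflow
#SBATCH -t 04:00:00            # time limit for the whole workflow

# activate the environment that has Snakemake installed
conda activate snakemake-env

# run the workflow with as many cores as requested above
snakemake -s snakefile_mrsa.smk --configfile config.yml -c 4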
11.3.2 Option 2: Use built-in SLURM support
For workflows with long run times and/or where each rule requires different compute resources, Snakemake comes with built-in functionality for interacting with the SLURM workload manager, sending each rule as a job to the SLURM queue and tracking the status of each job.
In this case, you can start the workflow on the login node and let it run there until all jobs have finished. Given that workflows often consist of many rules, some of which may be highly resource demanding, this is the option we recommend when running most Snakemake workflows on HPC clusters.
To add SLURM support to Snakemake you first need to install the SLURM plugin from the plugin catalogue. This can be done with conda:
conda install -c conda-forge snakemake-executor-plugin-slurm
Once installed, adding the --executor slurm
flag to your Snakemake command line call will enable the plugin. You also need to specify how many jobs Snakemake can submit to the SLURM queue at the same time with the -j
flag. For example, to allow up to 100 jobs to be put into the queue at any given time, you would run Snakemake with the following command:
snakemake --executor slurm -j 100 <other flags>
11.4 Specifying resources for SLURM
Depending on the cluster you are using, you will need to specify some resource requirements for the rules in your workflow, such as the number of CPUs, memory, runtime and account id. This can be done either:
- directly on the command line with the --default-resources flag, which sets default resource settings for all rules
- in the rule definition of your workflow, using the resources: directive, or
- in a configuration profile, a folder with a config.yaml file that contains the resource settings.
You can also use a combination of these methods. For example, the SLURM account id (e.g. naiss-2023-01-001
), which will most likely be the same for all rules, can be set with --default-resources
:
snakemake --executor slurm -j 100 --default-resources slurm_account=naiss-2023-01-001
Rule-specific resources such as runtime, memory and number of CPUs can be set in the rule definition, for example:
rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = 60,
        mem_mb = 16000,
        tasks = 4
    shell:
        """
        uname -a > {output}
        """
This rule uses the standard resource runtime
to set the maximum allowed time (in minutes) for the rule, sets the memory requirement with mem_mb
and the number of requested CPUs with tasks
. In this example the rule will have a time limit of 60 minutes, will require 16G of RAM and 4 CPUs.
Some clusters also require you to specify the partition you want to run your job on. The partition name will differ between clusters, for example the Rackham cluster uses core
and node
partitions, while Dardel uses e.g. shared
and main
. See the documentation for the cluster you are using for more information.
The partition can be set with the slurm_partition
resource, for example like so:
rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = 60,
        mem_mb = 16000,
        tasks = 4,
        slurm_partition = "shared"
    shell:
        """
        uname -a > {output}
        """
To make it easy to adapt your workflow to different compute clusters it is recommended to define resource settings in a configuration profile. A configuration profile is a folder with a config.yaml
file that contains values for Snakemake command line arguments, allowing you to modify the behavior of Snakemake without changing the workflow code. For example, you could create a dardel
folder (e.g. in the root of your workflow) with a config.yaml
file that contains the following:
dardel/config.yaml
executor: "slurm"
jobs: 100
default-resources:
  slurm_account: "naiss-2023-01-001"
  slurm_partition: "shared"
  mem_mb: 16000
  tasks: 1
  runtime: 60
This yaml-formatted file contains Snakemake command line arguments that will be used when running the workflow. You can then run Snakemake with the --profile
flag pointing to the folder containing the config.yaml
file:
snakemake --profile dardel
This greatly simplifies running the workflow on different clusters, and makes the command line call much more succinct.
To set rule-specific resources in the configuration profile, you can add a set-resources:
section to the config.yaml
file:
dardel/config.yaml
executor: "slurm"
jobs: 100
default-resources:
  slurm_account: "naiss-2023-01-001"
  slurm_partition: "shared"
  mem_mb: 16000
  tasks: 4
  runtime: 60
set-resources:
  index_genome:
    runtime: 240
    mem_mb: 32000
    tasks: 8
  align_to_genome:
    runtime: 120
    mem_mb: 24000
    tasks: 6
In this example, the index_genome
rule will have a runtime of 240 minutes, will require 32G of RAM and 8 CPUs, while the align_to_genome
rule will have a runtime of 120 minutes, will require 24G of RAM and 6 CPUs. Both rules will use the slurm_account
and slurm_partition
settings from the default-resources
section, unless overridden in the rule-specific settings.
You can still define resources in the rule definition, but the values in the configuration profile will take precedence.
Now, when you run your Snakemake workflow with:
snakemake --profile dardel
Snakemake will submit each job to the SLURM queue and inform you about both the local jobid and the SLURM jobid by writing something similar to this to your terminal:
Job 0 has been submitted with SLURM jobid 37099380 (log: .snakemake/slurm_logs/rule_name/37099380.log).
In this example the log output from the job will be in .snakemake/slurm_logs/rule_name/37099380.log
.
You can read more details about running Snakemake on compute clusters in the Snakemake docs.