Making reproducible workflows with

29-Oct-2024

Why do we need workflow managers?

  • As projects grow or age, it becomes increasingly difficult to keep track of all the parts and how they fit together.

snakemake_dag 0 fastqc 1 get_SRA_by_accession sample_id: SRR935090 1->0

Why do we need workflow managers?

  • As projects grow or age, it becomes increasingly difficult to keep track of all the parts and how they fit together.

snakemake_dag 0 fastqc 1 get_SRA_by_accession sample_id: SRR935090 1->0 2 fastqc 3 get_SRA_by_accession sample_id: SRR935091 3->2 4 fastqc 5 get_SRA_by_accession sample_id: SRR935092 5->4

Why do we need workflow managers?

  • As projects grow or age, it becomes increasingly difficult to keep track of all the parts and how they fit together.

snakemake_dag 0 align_to_genome 1 get_SRA_by_accession sample_id: SRR935091 1->0 2 index_genome 2->0 4 align_to_genome 2->4 6 align_to_genome 2->6 3 get_genome_fasta genome_id: NCTC8325 3->2 5 get_SRA_by_accession sample_id: SRR935092 5->4 7 get_SRA_by_accession sample_id: SRR935090 7->6

Why do we need workflow managers?

  • A workflow manager helps you scale up both in complexity and dataset size

snakemake_dag 0 all 1 generate_count_table 1->0 19 make_supplementary 1->19 2 sort_bam prefix: results/bam/SRR935090 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 15 fastqc 4->15 5 index_genome 5->3 8 align_to_genome 5->8 11 align_to_genome 5->11 6 get_genome_fasta genome_id: NCTC8325 6->5 7 sort_bam prefix: results/bam/SRR935091 7->1 8->7 9 get_SRA_by_accession sample_id: SRR935091 9->8 16 fastqc 9->16 10 sort_bam prefix: results/bam/SRR935092 10->1 11->10 12 get_SRA_by_accession sample_id: SRR935092 12->11 17 fastqc 12->17 13 get_genome_gff3 genome_id: NCTC8325 13->1 14 multiqc 14->0 14->19 15->14 16->14 17->14 18 generate_rulegraph 18->0 18->19 19->0

Workflow managers for bioinformatics

Most common

  • Snakemake
  • Nextflow

Others

  • Makeflow
  • Bpipe
  • Pachyderm

Snakemake workflows

  • Automatically track input/output file dependencies
  • Are built from rules
  • Are generalized with wildcards
  • Use a Python-based definition language
  • Easily scale from laptops to HPC clusters

Reproducible…

…and scalable workflows

Example: sequence trimming

Goal: Create workflow to trim and compress FASTQ files

./
 ├── a.fastq
 └── b.fastq

Example: sequence trimming

Using a bash-script:

for input in *.fastq
do
   sample=$(echo ${input} | sed 's/.fastq//')
   # 1. Trim fastq file (trim 5 bp from left, 10 bp from right)
   seqtk trimfq -b 5 -e 10 $input > ${sample}.trimmed.fastq
   # 2. Compress fastq file
   gzip -c ${sample}.trimmed.fastq > ${sample}.trimmed.fastq.gz
   # 3. Remove intermediate files
   rm ${sample}.trimmed.fastq
done

Execution:

$ bash trimfastq.sh

Example: sequence trimming

Using Snakemake rules:

rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    shell:
        "seqtk trimfq -b 5 -e 10 {input} > {output}"
rule gzip:
    output: "{sample}.trimmed.fastq.gz"
    input: "{sample}.trimmed.fastq"
    shell:
        "gzip -c {input} > {output}"

Execution:

$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz

$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz

$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
2       gzip
2       trim_fastq
4
rule trim_fastq:
    input: a.fastq
    output: a.trimmed.fastq
    wildcards: sample=a
    1 of 4 steps (25%) done

rule gzip:
    input: a.trimmed.fastq
    output: a.trimmed.fastq.gz
    wildcards: sample=a
Removing temporary output file a.trimmed.fastq.
2 of 4 steps (50%) done

rule trim_fastq:
    input: b.fastq
    output: b.trimmed.fastq
    wildcards: sample=b
3 of 4 steps (75%) done

rule gzip:
    input: b.trimmed.fastq
    output: b.trimmed.fastq.gz
    wildcards: sample=b
Removing temporary output file b.trimmed.fastq.
4 of 4 steps (100%) done

Getting into the Snakemake mindset

From the Snakemake documentation:


“A Snakemake workflow is defined by specifying rules in a Snakefile.”

“Rules decompose the workflow into small steps.”

“Snakemake automatically determines the dependencies between the rules by matching file names.”

  • By themselves, rules only define what files can be generated

snakemake_dag 0               gzip               ↪ input {sample}.trimmed.fastq output → {sample}.trimmed.fastq.gz 1       trim_fastq        ↪ input {sample}.fastq output → {sample}.trimmed.fastq 1->0

  • By themselves, rules only define what files can be generated
  • The actual rules to run are determined automatically from the files you want, so called targets
$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz

snakemake_dag 0        gzip        ↪ input a.trimmed.fastq output → a.trimmed.fastq.gz 1 trim_fastq ↪ input a.fastq output → a.trimmed.fastq 1->0 2        gzip        ↪ input b.trimmed.fastq output → b.trimmed.fastq.gz 3 trim_fastq ↪ input b.fastq output → b.trimmed.fastq 3->2

  • By themselves, rules only define what files can be generated
  • The actual rules to run are determined automatically from the files you want, so-called targets
$ snakemake -c 1 a.trimmed.fastq.gz

snakemake_dag 0        gzip        ↪ input a.trimmed.fastq output → a.trimmed.fastq.gz 1 trim_fastq ↪ input a.fastq output → a.trimmed.fastq 1->0

  • By themselves, rules only define what files can be generated
  • The actual rules to run are determined automatically from the files you want, so called targets
  • It can therefore be helpful to think of Snakemake workflows in a bottom-up manner, starting with the output
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    shell:
        "seqtk trimfq -b 5 -e 10 {input} > {output}"
rule gzip:
    output: "{sample}.trimmed.fastq.gz"
    input: "{sample}.trimmed.fastq"
    shell:
        "gzip -c {input} > {output}"

  • By themselves, rules only define what files can be generated
  • The actual rules to run are determined automatically from the files you want, so called targets
  • It can therefore be helpful to think of Snakemake workflows in a bottom-up manner, starting with the output
  • If no target is passed at the command line, Snakemake will use the first defined rule in the Snakefile as a target
rule all:
    input:
        "a.trimmed.fastq.gz",
        "b.trimmed.fastq.gz"
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    shell:
        "seqtk trimfq -b 5 -e 10 {input} > {output}"
[...]

How does Snakemake keep track of what files to generate?

Example from the practical tutorial

  • The tutorial contains a workflow to download and map RNA-seq reads against a reference genome.

snakemake_dag 0 make_supplementary 1 generate_count_table 1->0 2 sort_bam prefix: results/bam/SRR935090 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 9 fastqc 4->9 5 index_genome 5->3 6 get_genome_fasta genome_id: NCTC8325 6->5 7 get_genome_gff3 genome_id: NCTC8325 7->1 8 multiqc 8->0 9->8 10 generate_rulegraph 10->0

  • The tutorial contains a workflow to download and map RNA-seq reads against a reference genome.
  • Here we ask for results/supplementary.html, which is an Quarto report generated by the rule make_supplementary:
$ snakemake -c 1 results/supplementary.html

snakemake_dag 0 make_supplementary 1 generate_count_table 1->0 2 sort_bam prefix: results/bam/SRR935090 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 9 fastqc 4->9 5 index_genome 5->3 6 get_genome_fasta genome_id: NCTC8325 6->5 7 get_genome_gff3 genome_id: NCTC8325 7->1 8 multiqc 8->0 9->8 10 generate_rulegraph 10->0

  • The tutorial contains a workflow to download and map RNA-seq reads against a reference genome.
  • Here we ask for results/supplementary.html, which is an Quarto report generated by the rule make_supplementary:
  • If the timestamp of a file upstream in the workflow is updated…
$ touch results/bowtie2/NCTC8325.1.bt2

snakemake_dag 0 make_supplementary 1 generate_count_table 1->0 2 sort_bam prefix: results/bowtie2/SRR935090 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 9 fastqc 4->9 5 index_genome* 5->3 6 get_genome_fasta genome_id: NCTC8325 6->5 7 get_genome_gff3 genome_id: NCTC8325 7->1 8 multiqc 8->0 9->8 10 generate_rulegraph 10->0

  • The tutorial contains a workflow to download and map RNA-seq reads against a reference genome.
  • Here we ask for results/supplementary.html, which is an Quarto report generated by the rule make_supplementary:
  • If the timestamp of a file upstream in the workflow is updated…
  • Snakemake detects a file change and only reruns the necessary rules.
$ snakemake -c 1 results/supplementary.html

snakemake_dag 0 make_supplementary 1 generate_count_table 1->0 2 sort_bam prefix: results/bowtie2/SRR935090 2->1 3 align_to_genome 3->2 4 get_SRA_by_accession sample_id: SRR935090 4->3 9 fastqc 4->9 5 index_genome 5->3 6 get_genome_fasta genome_id: NCTC8325 6->5 7 get_genome_gff3 genome_id: NCTC8325 7->1 8 multiqc 8->0 9->8 10 generate_rulegraph 10->0

Anatomy of a Snakemake rule

  • Rules are typically named and have input and/or output directives
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    shell:
        """
        seqtk trimfq -b 5 -e 10 {input} > {output}
        """

  • Rules are typically named and have input and/or output directives
  • Logfiles help with debugging and leave a “paper trail”
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    shell:
        """
        seqtk trimfq -b 5 -e 10 {input} > {output} 2> {log}
        """

  • Rules are typically named and have input and/or output directives
  • Logfiles help with debugging and leave a “paper trail”
  • Params can be used to pass on settings
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    params:
        leftTrim=5,
        rightTrim=10
    shell:
      """
      seqtk trimfq -b {params.leftTrim} -e {params.rightTrim} {input} > {output} 2> {log}
      """

  • Rules are typically named and have input and/or output directives
  • Logfiles help with debugging and leave a “paper trail”
  • Params can be used to pass on settings
  • The threads directive specify maximum number of threads for a rule
  • You can also define resources such as disk/memory requirements and runtime
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    params:
        leftTrim=5,
        rightTrim=10
    threads: 8
    resources:
        mem_mb=64,
        runtime=120
    shell:
      """
      seqtk trimfq -t {threads} -b {params.leftTrim} -e {params.rightTrim} {input} > {output} 2> {log}
      """

  • Rules are typically named and have input and/or output directives
  • Logfiles help with debugging and leave a “paper trail”
  • Params can be used to pass on settings
  • The threads directive specify maximum number of threads for a rule
  • You can also define resources such as disk/memory requirements and runtime
  • Rules can be executed in separate software environments using either the conda or container directive
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    params:
        leftTrim=5,
        rightTrim=10
    threads: 8
    resources:
        mem_mb=64,
        runtime=120
    conda: "envs/seqtk.yaml"
    container: "docker://quay.io/biocontainers/seqtk"
    shell:
      """
      seqtk trimfq -t {threads} -b {params.leftTrim} -e {params.rightTrim} {input} > {output} 2> {log}
      """

  • Rules are typically named and have input and/or output directives
  • Logfiles help with debugging and leave a “paper trail”
  • Params can be used to pass on settings
  • The threads directive specify maximum number of threads for a rule
  • You can also define resources such as disk/memory requirements and runtime
  • Rules can be executed in separate software environments using either the conda or container directive

envs/seqtk.yaml

name: seqtk
channels:
  - bioconda
dependencies:
  - seqtk

See more in the Snakemake documentation

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

Snakemake commandline

  • Generate the output of the first rule in Snakefile
$ snakemake -s Snakefile
  • Run the workflow in dry mode and print shell commands
$ snakemake -n -p
  • Execute the workflow with 8 cores
$ snakemake --cores 8
  • Specify a configuration file
$ snakemake --configfile config.yaml
  • Run rules with specific conda environments or Docker/Apptainer containers
$ snakemake --software-deployment-method conda
$ snakemake --software-deployment-method apptainer

Questions?