Variant calling workflow

Perform variant calling and genotyping. Introduction to workflow manager systems.

Author

Per Unneberg

Published

18-Sep-2025

Intended learning outcomes

Learn how workflow managers can automate complex tasks
Get familiar with the Snakemake manager

Tools

bcftools (Danecek et al., 2021)
bwa (Li, 2013)
fastqc
Genome Analysis ToolKit (GATK) (DePristo et al., 2011)
multiqc (Ewels et al., 2016)
picard (“Picard Toolkit,” 2019)
qualimap (Okonechnikov et al., 2016)
samtools (Danecek et al., 2021)
snakemake (Mölder et al., 2021)

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load bioinfo-tools \ 
    bcftools/1.20 bwa/0.7.18 fastqc/0.12.1 \ 
    gatk/4.5.0.0 multiqc/1.30 picard/3.3.0 \ 
    QualiMap/2.2.1 samtools/1.20 snakemake/9.9.0

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]

[dependencies]
bcftools = ">=1.22,<2"
bwa = ">=0.7.19,<0.8"
fastqc = ">=0.12.1,<0.13"
gatk4 = ">=4.6.2.0,<5"
multiqc = ">=1.30,<2"
picard = ">=3.4.0,<4"
qualimap = ">=2.3,<3"
samtools = ">=1.22.1,<2"
snakemake = ">=9.9.0,<10"

Data setup

PDC
Local

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .

Using pgip client

pgip exercises setup e-variant-calling

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --cut-dirs=5 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/

Workflow managers

The advent of next-generation sequencing and other high-throughput technologies have contributed to increasing data complexity and data volumes, leading to scalability and reproducibility issues (Wratten et al., 2021). A number of workflow managers have beed developed to meet these needs, including Snakemake (Mölder et al., 2021) and Nextflow (Di Tommaso et al., 2017).

In this exercise, we will use Snakemake to run a variant calling workflow from start to end. We urge the reader to briefly skim the Snakefile¹, the Snakemake command file. We will briefly describe how Snakemake works in the next section, but going into any details is out of the scope of this exercise. See the Snakemake documentation for more information, and if you want to learn more, there are NBIS courses on reproducible research and Snakemake.

A very brief overview of a Snakefile

A Snakemake workflow consists of rules that determine how inputs are connected to outputs. Rules are defined in a so-called Snakefile. Below is an example of a bare minimum rule:

rule samtools_index:
    output:
        "ref/M_aurantiacus_v1.fasta.fai"
    input:
        "ref/M_aurantiacus_v1.fasta"
    shell:
        "samtools faidx {input} -o {output}"

The rule consists of a name (samtools_faidx) and keywords (output, input, shell). The shell keyword defines a shell command to be run (samtools faidx), which will take the input (ref/M_aurantiacus_v1.fasta) and produce an output (ref/M_aurantiacus_v1.fasta.fai). Note here the curly brackets; these are Snakemake wildcards which makes it possible to generalize rules to match file name patterns.

Provided the input file exists, running snakemake would produce the output, unless the output already exists. This is one neat feature of workflow managers - they are designed to detect whether input files are newer than output files, and only then will they forcefully regenerate the output².

Exercise

Copy the samtools_index rule to a file called Snakefile and run snakemake --dry-run --printshellcmds --force (alternatively snakemake -n -p). What happens?

Answer

Snakemake will output what jobs it will run, the reason, and which shell command.

A workflow is built by connecting outputs from one rule to inputs of another. A rule can depend on multiple inputs, as well as produce multiple outputs.

Exercise

In the above Snakefile, add a rule count_lines that uses the input ref/M_aurantiacus_v1.fasta.fai to generate the output file wc.txt, and where the shell command is uses wc -l to count lines in the input and redirect (>) to output. Then run the command snakemake wc.txt and look at the contents of the file.

Answer

rule samtools_index:
    output:
        "ref/M_aurantiacus_v1.fasta.fai"
    input:
        "ref/M_aurantiacus_v1.fasta"
    shell:
        "samtools faidx {input} -o {output}"

rule count_lines:
    output: "wc.txt"
    input: "ref/M_aurantiacus_v1.fasta.fai"
    shell: "wc -l {input} > {output}"

Important

Make sure to remove the Snakefile before proceeding as it otherwise will take precedence over the workflow/Snakefile.

A look into the variant calling Snakefile

Open workflow/Snakefile and look briefly at the contents. The top portion contains code to read the sampleinfo file and defines a variable REFERENCE that can be used throughout³:

import pandas as pd

# Read sampleinfo and subset to
sampleinfo = pd.read_csv("sampleinfo.csv")
sampleinfo = sampleinfo.loc[sampleinfo.SampleAlias.str.startswith("PUN")]\
                       .set_index("SampleAlias")
samples = sampleinfo.index.values

REFERENCE = "ref/M_aurantiacus_v1.fasta"

By default, Snakemake runs the argument if no filename is provided when running. By convention, the first rule is called all:

rule all:
    input: "multiqc_report.html",

This is a so-called pseudo-rule which are used to list the final desired output file. The workflow will figure out how to generate necessary inputs.

The remainder of the file contain the “regular” rule definitions. They have been kept as simple as possible, but you will notice that we have made use of some additional code constructs not mentioned above. Skim the file, and look at the multiqc rule at the bottom. Notice how it is used to “collect” necessary inputs which all have to be generated before the report is written.

It can be difficult to get an overview of the workflow by simply looking at the Snakefile. Therefore, we end by showing a rulegraph of the workflow, which shows how rules are connected:

snakemake --rulegraph | dot -Tpng | display

:::

Running the workflow

Now we turn to actually running the workflow. First use the options snakemake -n -p to check what the actual command flow looks like⁴. If everything looks ok, launch Snakemake, adding the --cores option to run jobs in parallel:

snakemake --cores 10

That’s all there is to it! Now you can take a break / listen to the next lecture while the workflow (hopefully) runs to completion without interruptions.

Exercise

Once the workflow has finished, open and have a look at multiqc_report.html. Also check the output variant files in directory gatk-genotype-gvcfs.

References

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806

Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. https://doi.org/10.1038/nbt.3820

Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997

Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., & Köster, J. (2021). Sustainable data analysis with Snakemake (10:33). F1000Research. https://doi.org/10.12688/f1000research.29032.2

Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566

Picard toolkit. (2019). In Broad Institute, GitHub repository. https://broadinstitute.github.io/picard/; Broad Institute.

Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161–1168. https://doi.org/10.1038/s41592-021-01254-9

Footnotes

Snakemake borrows much of its terminology and philosophy from Make, which was originally designed to automate software builds.↩︎
You can also provide the --force flag to regenerate an output, regardless of whether the input file is younger or not.↩︎
Snakemake is written in Python. If you’re familiar with Python, you will recognize much of the syntax.↩︎
You can also add the flag --forceall/-F to trigger a rerun of all outputs.↩︎