Variant calling introduction

Introduction to variant calling and the command line interface.

Author

Per Unneberg

Published

18-Sep-2025

Intendend learning outcomes

Navigate directory tree
Look at FASTQ sequence file

Tools

fastqc

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load \ 
    fastqc/0.12.1

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]

[dependencies]
fastqc = ">=0.12.1,<0.13"

Data setup

PDC
Local

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .

Using pgip client

pgip exercises setup e-variant-calling

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --cut-dirs=5 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/

Variant calling

A generic variant calling workflow consists of the following basic steps:

read quality control and filtering
read mapping
removal / marking of duplicate reads
joint / sample-based variant calling and genotyping

There are different tweaks and additions to each of these steps, depending on application and method.

1. Read quality control

Figure 1: Per base quality scores, read 1 (upper) and read 2 (lower panel), obtained from the FastQC program. Quality values are on the \(y\)-axis, base position in sequence read on \(x\)-axis.

DNA sequencers score the quality of each sequenced base as phred quality scores, which is equivalent to the probability \(P\) that the call is incorrect. The base quality scores, denoted \(Q\), are defined as

\[ Q = -10 \log_{10} P \]

which for \(P=0.01\) gives \(Q=20\). Converting from quality to probability is done by solving for \(P\):

\[ P = 10^{-Q/10} \]

Hence, a base quality score \(Q=20\) (somtimes written Q20) corresponds to a 1% probability that the call is incorrect, Q30 a 0.1% probability, and so on, where the higher the quality score, the better. Bases with low quality scores are usually discarded from downstream analyses, but what is a good threshold? The human genome has approximately 1 SNP per 1,000 bp, which means sequencing errors will be ten times as probable in a single read for Q20 base calls. A reasonable threshold is therefore around Q20-Q30 for many purposes.

The base qualities typically drop towards the end of the reads (Figure 1). Prior to mapping it may therefore be prudent to remove reads that display too high drop in quality, too low mean quality, or on some other quality metric reported by the qc software.

The quality scores are encoded using ASCII codes. An example of a FASTQ sequence is given below. The code snippet shows an example of shell commands¹ that are separated by a so-called pipe (|) character which takes the output from one process and sends it as input to the next².

Note that we use the long option names to clarify commands, and we aim to do so consistently when a new command is introduced. Once you feel confident you know what a command does, you will probably want to switch to short option names, and we may do so in the instructions for some commonly used commands (e.g., head -n) without warning. Remember to use --help to examine command options.

# Command using long (-- prefix) option names
zcat fastq/PUN-Y-INJ_R1.fastq.gz | head --lines 4 | cut --characters -30
# Equivalent command using short (single -, single character) option names
# zcat fastq/PUN-Y-INJ_R1.fastq.gz | head -n 4 | cut -c -30

@SRR9309790.10003134
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJ

Consequently, a FASTQ entry consists of four lines:

sequence id (prefixed by @)
DNA sequence
separator (+)
phred base quality scores

Exercise

Use the command wc to determine how many sequences are in fastq/PUN-Y-INJ_R1.fastq.gz.

Hint

Use the --help option to show documentation for the wc command (wc --help). This will show that wc prints newline, word and byte counts for a file, where newline is what we’re after. We can restrict the output to newline characters with the --lines option. Use zcat to print the contents of fastq/PUN-Y-INJ_R1.fastq.gz to the screen, piping (|) the output to wc --lines.

Answer

zcat fastq/PUN-Y-INJ_R1.fastq.gz | wc --lines

Since there are four lines per sequence (id, sequence, + separator, qualities) you need to divide the final number by four (622744 / 4).

2. Read mapping

Read mapping consists of aligning sequence reads, typically from individuals in a population (a.k.a. resequencing) to a reference sequence. The choice of read mapper depends, partly on preference, but mostly on the sequencing read length and application. For short reads, a common choice is bwa-mem, and for longer reads minimap2.

In what follows, we will assume that the sequencing protocol generates paired-end short reads (e.g., from Illumina). In practice, this means a DNA fragment has been sequenced from both ends, where fragment sizes have been selected such that reads do not overlap (i.e., there is unsequenced DNA between the reads of a given insert size).

The final output of read mapping is an alignment file in binary alignment map (BAM) format or variants thereof.

3. Removal / marking of duplicate reads

During sample preparation or DNA amplification with PCR, it may happen that a single DNA fragment is present in multiple copies and therefore produces redundant sequencing reads. This shows up as alignments with identical start and stop coordinates. These so-called duplicate reads should be marked prior to any downstream analyses. The most commonly used tools for this purpose are samtools markdup and picard MarkDuplicates.

4. Variant calling and genotyping

Once BAM files have been produced, it is time for variant calling, which is the process of identifying sites where there sequence variation. There are many different variant callers, of which we will mention four.

bcftools is a toolkit to process variant call files, but also has a variant caller command. We will use bcftools to look at and summarize the variant files.

freebayes uses a Bayesian model to call variants. It may be time-consuming in high-coverage regions, and one therefore may have to mask repetitive and other low-complexity regions.

ANGSD is optimized for low-coverage data. Genotypes aren’t called directly; rather, genotype likelihoods form the basis for all downstream analyses, such as calculation of diversity or other statistics.

Finally, GATK HaplotypeCaller performs local realignment around variant candidates, which avoids the need to run the legacy GATK IndelRealigner. Realignment improves results but requires more time to run. GATK is optimized for human data. For instance, performance drops dramatically if the reference sequence consists of many short scaffolds/contigs, and there is a size limit to how large the chromosomes can be. It also requires some parameter optimization and has a fairly complicated workflow (Hansen, 2016).

GATK best practice variant calling

We will base our work on the GATK Germline short variant discovery workflow. In addition to the steps outlined above, there is a step where quality scores are recalibrated in an attempt to correct errors produced by the base calling procedure itself.

GATK comes with a large set of tools. For a complete list and documentation, see the “Tool Documentation Index” (2024).

References

Hansen, N. F. (2016). Variant Calling From Next Generation Sequence Data. In E. Mathé & S. Davis (Eds.), Statistical Genomics: Methods and Protocols (pp. 209–224). Springer. https://doi.org/10.1007/978-1-4939-3578-9_11

Tool Documentation Index. (2024). In GATK.https://gatk.broadinstitute.org/hc/en-us/articles/21904996835867--Tool-Documentation-Index

Footnotes

For any shell command, use the option --help to print information about the commands and its options. zcat is a variant of the cat command that prints the contents of a file on the terminal; the z prefix shows the command works on compressed files, a common naming convention. head views the first lines of a file, and cut can be used to cut out columns from a tab-delimited file, or in this case, cut the longest strings to 30 characters width.↩︎
For more information, see unix pipelines ↩︎