Population Genomics in Practice 2025
  • Slides
  • Exercises
  • Code recipes
  1. Exercises
  2. Variant calling
  3. Data quality control
  • Slides
    • Listing
    • Introduction
      • Population genomics in practice
    • Population genetics foundations
      • Listing
      • Data and definitions
      • Alleles and genealogies
      • Linkage disequilibrium
      • The Wright-Fisher model
      • Genetic diversity
      • Selection
    • Variant calling
      • Listing
      • DNA sequencing data
      • Read mapping
      • Variant calling and genotyping
      • Variant calling workflows
    • Variant filtering
      • Listing
      • Variant filtering
      • Depth filtering
    • Genetic diversity
      • Listing
      • Genetic diversity
    • Population structure
      • Listing
      • Principal component analysis
      • Admixture
    • Demography
      • Listing
    • Selection
      • Listing
    • Simulation
      • Listing
      • Brief introduction to simulation packages and stdpopsim
      • Primer on the coalescent and forward simulation
      • Ancestral recombination graph inference
  • Exercises
    • Listing
    • Data
      • Compute environment
      • Monkeyflowers dataset
    • Variant calling
      • Listing
      • Variant calling introduction
      • Data quality control
      • Read mapping and duplicate removal
      • Variant calling workflow
    • Variant filtering
      • Listing
      • Basic variant filtering
      • Depth filtering on invariant sites
    • Recombination and linkage
      • Listing
      • Linkage disequilibrium decay
    • Genetic diversity
      • Listing
      • Genetic diversity landscapes
    • Population structure
      • Listing
      • Principal component analysis
      • Admixture
      • D-statistics
    • Simulation
      • Listing
      • HOWTO
      • Introduction to stdpopsim
      • Simulating selection with stdpopsim
      • Introduction to simulation with msprime
  • Code recipes
    • Code recipes

On this page

  • Preparation: reference sequence index and read QC
  1. Exercises
  2. Variant calling
  3. Data quality control

Data quality control

Introduction to the command line interface. Preparation of data, raw data quality control and filtering for downstream analyses.
Author

Per Unneberg

Published

18-Sep-2025

In this exercise we will familiarize ourselves with the command line and compile some basic quality statistics from raw sequence data files.

Note

This is a technical exercise, where the focus is not so much on the biology as on getting programs to run and interpreting the output.

Commands have been run on a subset of the data

The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.

Intended learning outcomes
  • Run fastqc from command line
  • Perform qc on sequencing reads and interpret results
Tools
  • Listing
  • PDC
  • pixi
  • bwa (Li, 2013)
  • fastqc
  • picard (“Picard Toolkit,” 2019)
  • samtools (Danecek et al., 2021)

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load \ 
    bwa/0.7.18 fastqc/0.12.1 picard/3.3.0 \ 
    samtools/1.20
Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]

[dependencies]
bwa = ">=0.7.19,<0.8"
fastqc = ">=0.12.1,<0.13"
picard = ">=3.4.0,<4"
samtools = ">=1.22.1,<2"

Data setup
  • PDC
  • Local

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling
Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .
Using pgip client
pgip exercises setup e-variant-calling

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --cut-dirs=5 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/

Preparation: reference sequence index and read QC

Prior to mapping we need to create a database index. We also generate a fasta index and a sequence dictionary for use with the picard toolkit.

samtools faidx ref/M_aurantiacus_v1.fasta
picard CreateSequenceDictionary --REFERENCE ref/M_aurantiacus_v1.fasta
bwa index ref/M_aurantiacus_v1.fasta

With the program fastqc we can generate quality control reports for all input FASTQ files simultaneously, setting the output directory with the -o flag:

# Make fastqc output directory; --parents makes parent directories as
# needed
mkdir --parents fastqc
fastqc --outdir fastqc fastq/*fastq.gz
Exercise

cd to the output directory and look at the html reports. Do you notice any difference between read 1 (R1) and read 2 (R2)?

Answer
cd fastqc
open PUN-Y-INJ_R1_fastqc.html
open PUN-Y-INJ_R2_fastqc.html

The traffic light summary indicates whether a given quality metric has passed or not. Typically, read 2 has slightly lower quality and more quality metrics with warnings. Since these reads have been deposited in the Sequence Read Archive (SRA), it is likely they were filtered prior to upload, and we will not take any further action here.

We will use MultiQC later on to combine the results from several output reports.

References

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997
Picard toolkit. (2019). In Broad Institute, GitHub repository. https://broadinstitute.github.io/picard/; Broad Institute.

2025 NBIS | GPL-3 License

 

Published with Quarto v