Data quality control

Introduction to the command line interface. Preparation of data, raw data quality control and filtering for downstream analyses.

Author

Per Unneberg

Published

18-Sep-2025

In this exercise we will familiarize ourselves with the command line and compile some basic quality statistics from raw sequence data files.

Note

This is a technical exercise, where the focus is not so much on the biology as on getting programs to run and interpreting the output.

Commands have been run on a subset of the data

The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.

Intended learning outcomes

Run fastqc from command line
Perform qc on sequencing reads and interpret results

Tools

bwa (Li, 2013)
fastqc
picard (“Picard Toolkit,” 2019)
samtools (Danecek et al., 2021)

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load \ 
    bwa/0.7.18 fastqc/0.12.1 picard/3.3.0 \ 
    samtools/1.20

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]

[dependencies]
bwa = ">=0.7.19,<0.8"
fastqc = ">=0.12.1,<0.13"
picard = ">=3.4.0,<4"
samtools = ">=1.22.1,<2"

Data setup

PDC
Local

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .

Using pgip client

pgip exercises setup e-variant-calling

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --cut-dirs=5 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/

Preparation: reference sequence index and read QC

Prior to mapping we need to create a database index. We also generate a fasta index and a sequence dictionary for use with the picard toolkit.

samtools faidx ref/M_aurantiacus_v1.fasta
picard CreateSequenceDictionary --REFERENCE ref/M_aurantiacus_v1.fasta
bwa index ref/M_aurantiacus_v1.fasta

With the program fastqc we can generate quality control reports for all input FASTQ files simultaneously, setting the output directory with the -o flag:

# Make fastqc output directory; --parents makes parent directories as
# needed
mkdir --parents fastqc
fastqc --outdir fastqc fastq/*fastq.gz

Exercise

cd to the output directory and look at the html reports. Do you notice any difference between read 1 (R1) and read 2 (R2)?

Answer

cd fastqc
open PUN-Y-INJ_R1_fastqc.html
open PUN-Y-INJ_R2_fastqc.html

The traffic light summary indicates whether a given quality metric has passed or not. Typically, read 2 has slightly lower quality and more quality metrics with warnings. Since these reads have been deposited in the Sequence Read Archive (SRA), it is likely they were filtered prior to upload, and we will not take any further action here.

We will use MultiQC later on to combine the results from several output reports.

References

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997

Picard toolkit. (2019). In Broad Institute, GitHub repository. https://broadinstitute.github.io/picard/; Broad Institute.