samtools faidx ref/M_aurantiacus_v1.fasta
picard CreateSequenceDictionary --REFERENCE ref/M_aurantiacus_v1.fasta
bwa index ref/M_aurantiacus_v1.fasta
Data quality control
In this exercise we will familiarize ourselves with the command line and compile some basic quality statistics from raw sequence data files.
This is a technical exercise, where the focus is not so much on the biology as on getting programs to run and interpreting the output.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- Run
fastqc
from command line - Perform qc on sequencing reads and interpret results
Choose one of Modules and Virtual environment to access relevant tools.
Modules
Execute the following command to load modules:
module load \
bwa/0.7.18 fastqc/0.12.1 picard/3.3.0 \
samtools/1.20
Virtual environment
Run the pgip
initialization script and activate the pgip
default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate
Then activate the full
environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shell
Copy the contents to a file pixi.toml
in directory variant-calling
, cd
to directory and activate environment with pixi shell
:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]
[dependencies]
bwa = ">=0.7.19,<0.8"
fastqc = ">=0.12.1,<0.13"
picard = ">=3.4.0,<4"
samtools = ">=1.22.1,<2"
Make an analysis directory variant-calling
and cd to it:
mkdir -p variant-calling && cd variant-calling
Using rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv
to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .
Using pgip client
pgip exercises setup e-variant-calling
Make an analysis directory variant-calling
and cd to it:
mkdir -p variant-calling && cd variant-calling
Then use wget to retrieve data to analysis directory.
wget -r -np -nH -N --cut-dirs=5 \
https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/
Preparation: reference sequence index and read QC
Prior to mapping we need to create a database index. We also generate a fasta index and a sequence dictionary for use with the picard
toolkit.
With the program fastqc
we can generate quality control reports for all input FASTQ
files simultaneously, setting the output directory with the -o
flag:
# Make fastqc output directory; --parents makes parent directories as
# needed
mkdir --parents fastqc
fastqc --outdir fastqc fastq/*fastq.gz
cd
to the output directory and look at the html
reports. Do you notice any difference between read 1 (R1) and read 2 (R2)?
cd fastqc
open PUN-Y-INJ_R1_fastqc.html
open PUN-Y-INJ_R2_fastqc.html
The traffic light summary indicates whether a given quality metric has passed or not. Typically, read 2 has slightly lower quality and more quality metrics with warnings. Since these reads have been deposited in the Sequence Read Archive (SRA), it is likely they were filtered prior to upload, and we will not take any further action here.
We will use MultiQC
later on to combine the results from several output reports.