samtools faidx ref/M_aurantiacus_v1.fasta
picard CreateSequenceDictionary --REFERENCE ref/M_aurantiacus_v1.fasta
bwa index ref/M_aurantiacus_v1.fastaData quality control
In this exercise we will familiarize ourselves with the command line and compile some basic quality statistics from raw sequence data files.
This is a technical exercise, where the focus is not so much on the biology as on getting programs to run and interpreting the output.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- Run
fastqcfrom command line - Perform qc on sequencing reads and interpret results
Choose one of Modules and Virtual environment to access relevant tools.
Modules
Execute the following command to load modules:
module load \
bwa/0.7.18 fastqc/0.12.1 picard/3.3.0 \
samtools/1.20
Virtual environment
Run the pgip initialization script and activate the pgip default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activateThen activate the full environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shellCopy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]
[dependencies]
bwa = ">=0.7.19,<0.8"
fastqc = ">=0.12.1,<0.13"
picard = ">=3.4.0,<4"
samtools = ">=1.22.1,<2"
Make an analysis directory variant-calling and cd to it:
mkdir -p variant-calling && cd variant-callingUsing rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .Using pgip client
pgip exercises setup e-variant-callingMake an analysis directory variant-calling and cd to it:
mkdir -p variant-calling && cd variant-callingThen use wget to retrieve data to analysis directory.
wget -r -np -nH -N --cut-dirs=5 \
https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/Preparation: reference sequence index and read QC
Prior to mapping we need to create a database index. We also generate a fasta index and a sequence dictionary for use with the picard toolkit.
With the program fastqc we can generate quality control reports for all input FASTQ files simultaneously, setting the output directory with the -o flag:
# Make fastqc output directory; --parents makes parent directories as
# needed
mkdir --parents fastqc
fastqc --outdir fastqc fastq/*fastq.gzcd to the output directory and look at the html reports. Do you notice any difference between read 1 (R1) and read 2 (R2)?
cd fastqc
open PUN-Y-INJ_R1_fastqc.html
open PUN-Y-INJ_R2_fastqc.htmlThe traffic light summary indicates whether a given quality metric has passed or not. Typically, read 2 has slightly lower quality and more quality metrics with warnings. Since these reads have been deposited in the Sequence Read Archive (SRA), it is likely they were filtered prior to upload, and we will not take any further action here.
We will use MultiQC later on to combine the results from several output reports.