Read mapping and duplicate removal

Read mapping to reference sequence and removal of duplicate reads.

Author

Per Unneberg

Published

18-Sep-2025

Intended learning outcomes

Map reads to a reference genome
Mark duplicate read mappings

Tools

bwa (Li, 2013)
qualimap (Okonechnikov et al., 2016)
samtools (Danecek et al., 2021)

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load bioinfo-tools \ 
    bwa/0.7.18 QualiMap/2.2.1 samtools/1.20

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory variant-calling, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "variant-calling"
platforms = ["linux-64", "osx-64"]

[dependencies]
bwa = ">=0.7.19,<0.8"
qualimap = ">=2.3,<3"
samtools = ">=1.22.1,<2"

Data setup

PDC
Local

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/variant-calling/ .

Using pgip client

pgip exercises setup e-variant-calling

Make an analysis directory variant-calling and cd to it:

mkdir -p variant-calling && cd variant-calling

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --cut-dirs=5 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/variant-calling/

Read mapping

We will start by mapping FASTQ read pairs to the reference. We will use the bwa read mapper together with samtools to process the resulting output.

Read group information identifies sequence sets

Some of the downstream processes require that reads have been assigned read groups (“Read Groups,” 2023), which is a compact string representation of a set of reads that originate from a sequencing unit and sample. Assigning read groups becomes particularly important for multiplexing protocols, or when a sample has been sequenced on different lanes or platform units, as it allows for the identification of sequence batch issues (e.g., poor sequence quality). Here, we want to assign a unique ID, the sample id (SM), and the sequencing platform (PL), where the latter informs the algorithms on what error model to apply. The read group is formatted as @RG\tID:uniqueid\tSM:sampleid\tPU:platform, where \t is the tab character. More fields can be added; see the SAM specification, section 1.3 (HTS Format Specifications, 2023) for a complete list.

Sample information is available in the sampleinfo.csv file:

head -n 3 sampleinfo.csv

Sample,Run,ScientificName,SampleName,AuthorSample,SampleAlias,Taxon,Latitude,Longitude,% Reads aligned,Seq. Depth
SRS4979271,SRR9309782,Diplacus longiflorus,LON-T33_1,T33,LON-T33,ssp. longiflorus,34.3438,-118.5099,94.6,18.87
SRS4979267,SRR9309785,Diplacus longiflorus,LON-T8_8,T8,LON-T8,ssp. longiflorus,34.1347,-118.6452,82.6,25.11

The sample information is a combination of the run information obtained from the SRA (BioProject PRJNA549183) and the sample sheet provided with the article. An additional column SampleAlias has been added that names samples using a three-letter abbreviation for population hyphenated with the sample identifier. For the ssp. puniceus, an additional one-letter character denoting the color ecotype is inserted between population and sample id. PUN-Y-BCRD then is a sample from the puniceus subspecies with the yellow ecotype. We will use the Run column as unique ID, SampleAlias as the sample id SM, and ILLUMINA as the platform PL.

Read mapping with bwa and conversion to BAM format with samtools

Let’s map the FASTQ files corresponding to sample PUN-Y-BCRD:

bwa mem -R "@RG\tID:SRR9309788\tSM:PUN-Y-BCRD\tPL:ILLUMINA" -t 4 \
    -M ref/M_aurantiacus_v1.fasta \
    fastq/PUN-Y-BCRD_R1.fastq.gz \
    fastq/PUN-Y-BCRD_R2.fastq.gz | \
    samtools sort - | \
    samtools view --with-header --output bam/PUN-Y-BCRD.bam

There’s a lot to unpack here. First, the -R flag to bwa mem passes the read group information to the mapper. -t sets the number of threads, and -M marks shorter split hits as secondary, which is for Picard compatibility¹. The first positional argument is the reference sequence, followed by the FASTQ files for read 1 and 2, respectively.

The output would be printed on the screen (try running the bwa mem command alone!), but we pipe the output to samtools sort to sort the mappings (by default by coordinate). The - simply means “read from the pipe”.

Finally, samtools view converts the text output to binary format (default), including the header information (short option -h). You can use the same command to view the resulting output on your screen:

samtools view bam/PUN-Y-BCRD.bam | head -n 2

SRR9309788.7313829  129 LG4 29  60  103M2I46M   =   83824   83796   GTCAATTTCATGTTTGACTTTTAGATTTTTAATTAATTATATATTTTTTGCAATTTGTAACCTCTTTAACCTTTATTTAATTTTTTGAATTTCTTTTTTATTTTATTTTCAAATACAATTCACCCCAATTAATTATTTTAATTATAACAAT AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJAFJJFJJJF<JJJJJJJJ<AJJJJJJJJJ-FFJJAJJJJA-7-7----<A---7<-)7AAA-AA7-<-7--<A---7---7 NM:i:6  MD:Z:107T25A1C6A6   MC:Z:88M8D63M   MQ:i:60 AS:i:121    XS:i:80 RG:Z:SRR9309788
SRR9309788.9554822  99  LG4 58  60  74M2I75M    =   256 321 TAATTAATTATATATTTTTTGCAATTTGTAACCTCTTTAACCTTTATTTAATTTTTTGAATTTCTTTTTTATTTTATTTTCAAATACAATTCACCCCAATTAATTAATCTAATTAAAACAATTAAATAATCAACCCGAATGATTAACCAAT AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAAJJJJJJJJFJJJFJJ<FJJAJJJFJJJJ< NM:i:5  MD:Z:78T54G0C14 MC:Z:21S82M2I9M5I32M    MQ:i:60 AS:i:126    XS:i:81 RG:Z:SRR9309788

Exercise

Look at the header information of the output BAM file. What does the @SQ tag stand for, and what does the information on that line tell you?

Hint

To get a list of options, type samtools view. The -H or --header-only option views the header only.

Answer

samtools view -H bam/PUN-Y-BCRD.bam

@HD VN:1.5  SO:coordinate
@SQ SN:LG4  LN:100000
@RG ID:SRR9309788   SM:PUN-Y-BCRD   PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.19-r1273 CL:bwa mem -R @RG\tID:SRR9309788\tSM:PUN-Y-BCRD\tPL:ILLUMINA -t 4 -M ref/M_aurantiacus_v1.fasta fastq/PUN-Y-BCRD_R1.fastq.gz fastq/PUN-Y-BCRD_R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa  VN:1.22.1   CL:samtools sort -
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.22.1   CL:samtools view --with-header --output bam/PUN-Y-BCRD.bam
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.22.1   CL:samtools view -H bam/PUN-Y-BCRD.bam

Although you can probably figure it out by looking at the data, do have a glance at the SAM format specification mentioned above. The @SQ tag corresponds to the reference sequence dictionary and tells you what region you are looking at (chromosome LG4, which has a length LN 100000 bases; the example reference sequence was created by extracting the region on LG4 from position 12000000 to 12100000).

Mark duplicate reads with Picard MarkDuplicates

Once mapping is completed, we must find and mark duplicate reads as these can distort the results of downstream analyses, such as variant calling. We here use Picard MarkDuplicates.

To facilitate downstream processing, we will from now on make use of environment variables ² to refer to a sample and the reference sequence. Retrieve the SRR id from the sampleinfo file.

export SRR=SRR9309790
export SAMPLE=PUN-Y-INJ
export REF=ref/M_aurantiacus_v1.fasta

picard MarkDuplicates --INPUT bam/${SAMPLE}.bam \
    --METRICS_FILE md/${SAMPLE}.dup_metrics.txt \
    --OUTPUT md/${SAMPLE}.bam

The metrics output file contains information on the rate of duplication. We will include the output in the final MultiQC report.

An additional mapping quality metric of interest is percentage mapped reads and average read depth. We can use qualimap bamqc to collect mapping statistics from a BAM file:

qualimap bamqc -bam bam/${SAMPLE}.bam -outdir qualimap/${SAMPLE}_stats

A summary of the results is exported to qualimap/${SAMPLE}_stats/genome_results.txt; we show percent mapping and average coverage below as examples:

grep "number of mapped reads" qualimap/${SAMPLE}_stats/genome_results.txt
grep "mean coverageData" qualimap/${SAMPLE}_stats/genome_results.txt

     number of mapped reads = 7,643 (96.94%)
     mean coverageData = 11.0246X

Moving on

By now it should become clear that it quickly becomes tedious to manually write commands for each and every step. We would like to speed things up, and in the interest of time, the following exercise will introduce a workflow manager (e.g., Wratten et al. (2021)). However, we stress that you should not blindly run workflows without understanding the programs and their options. The only way to investigate the effects of parameters and settings is to manually run the programs. Hopefully, you have gained some insight into how this is done with this exercise.

References

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

HTS format specifications. (2023). https://samtools.github.io/hts-specs/

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997

Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566

Read groups. (2023). In GATK. https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups

Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161–1168. https://doi.org/10.1038/s41592-021-01254-9

Footnotes

bwa consistently uses short option names. Also, there is no --help option. To get a list of options, at the command line simply type bwa mem, or man bwa mem for general help and a complete list of options.↩︎
Briefly, environment variables are a great way to generalise commands. To reuse the command, one only needs to modify the value of the variable.↩︎