We will start by mapping FASTQ read pairs to the reference. We will use the bwa read mapper together with samtools to process the resulting output.
Read group information identifies sequence sets
Some of the downstream processes require that reads have been assigned read groups(“Read Groups,” 2023), which is a compact string representation of a set of reads that originate from a sequencing unit and sample. Assigning read groups becomes particularly important for multiplexing protocols, or when a sample has been sequenced on different lanes or platform units, as it allows for the identification of sequence batch issues (e.g., poor sequence quality). Here, we want to assign a unique ID, the sample id (SM), and the sequencing platform (PL), where the latter informs the algorithms on what error model to apply. The read group is formatted as @RG\tID:uniqueid\tSM:sampleid\tPU:platform, where \t is the tab character. More fields can be added; see the SAM specification, section 1.3 (HTS Format Specifications, 2023) for a complete list.
Sample information is available in the sampleinfo.csv file:
The sample information is a combination of the run information obtained from the SRA (BioProject PRJNA549183) and the sample sheet provided with the article. An additional column SampleAlias has been added that names samples using a three-letter abbreviation for population hyphenated with the sample identifier. For the ssp. puniceus, an additional one-letter character denoting the color ecotype is inserted between population and sample id. PUN-Y-BCRD then is a sample from the puniceus subspecies with the yellow ecotype. We will use the Run column as unique ID, SampleAlias as the sample id SM, and ILLUMINA as the platform PL.
Read mapping with bwa and conversion to BAM format with samtools
Let’s map the FASTQ files corresponding to sample PUN-Y-BCRD:
There’s a lot to unpack here. First, the -R flag to bwa mem passes the read group information to the mapper. -t sets the number of threads, and -M marks shorter split hits as secondary, which is for Picard compatibility1. The first positional argument is the reference sequence, followed by the FASTQ files for read 1 and 2, respectively.
The output would be printed on the screen (try running the bwa mem command alone!), but we pipe the output to samtools sort to sort the mappings (by default by coordinate). The - simply means “read from the pipe”.
Finally, samtools view converts the text output to binary format (default), including the header information (short option -h). You can use the same command to view the resulting output on your screen:
Although you can probably figure it out by looking at the data, do have a glance at the SAM format specification mentioned above. The @SQ tag corresponds to the reference sequence dictionary and tells you what region you are looking at (chromosome LG4, which has a length LN 100000 bases; the example reference sequence was created by extracting the region on LG4 from position 12000000 to 12100000).
Mark duplicate reads with Picard MarkDuplicates
Once mapping is completed, we must find and mark duplicate reads as these can distort the results of downstream analyses, such as variant calling. We here use Picard MarkDuplicates.
To facilitate downstream processing, we will from now on make use of environment variables2 to refer to a sample and the reference sequence. Retrieve the SRR id from the sampleinfo file.
The metrics output file contains information on the rate of duplication. We will include the output in the final MultiQC report.
An additional mapping quality metric of interest is percentage mapped reads and average read depth. We can use qualimap bamqc to collect mapping statistics from a BAM file:
A summary of the results is exported to qualimap/${SAMPLE}_stats/genome_results.txt; we show percent mapping and average coverage below as examples:
grep"number of mapped reads" qualimap/${SAMPLE}_stats/genome_results.txtgrep"mean coverageData" qualimap/${SAMPLE}_stats/genome_results.txt
number of mapped reads = 7,643 (96.94%)
mean coverageData = 11.0246X
Moving on
By now it should become clear that it quickly becomes tedious to manually write commands for each and every step. We would like to speed things up, and in the interest of time, the following exercise will introduce a workflow manager (e.g., Wratten et al. (2021)). However, we stress that you should not blindly run workflows without understanding the programs and their options. The only way to investigate the effects of parameters and settings is to manually run the programs. Hopefully, you have gained some insight into how this is done with this exercise.
References
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997
Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292–294. https://doi.org/10.1093/bioinformatics/btv566
Wratten, L., Wilm, A., & Göke, J. (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161–1168. https://doi.org/10.1038/s41592-021-01254-9
Footnotes
bwa consistently uses short option names. Also, there is no --help option. To get a list of options, at the command line simply type bwa mem, or man bwa mem for general help and a complete list of options.↩︎
Briefly, environment variables are a great way to generalise commands. To reuse the command, one only needs to modify the value of the variable.↩︎