Population Genomics in Practice 2025

Sequencing technologies

Illumina NovaSeq 6000

Scale up and down with a tunable output of up to 6 Tb and 20B single reads in < 2 days.

Up to 2 × 250 bp read length. Price example: 8,000 SEK total for resequencing a 3 Gbp genome to 30X.

https://www.illumina.com/systems/sequencing-platforms/novaseq.html
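As a rough check on what a price example like this implies, the number of reads needed for a target depth follows from depth = (number of reads × read length) / genome size. The sketch below assumes uniform coverage and ignores duplicates and unmapped reads. The helper name reads_for_depth and the 150 bp read length are illustrative assumptions, not values from the vendor page.

```shell
# Hypothetical helper: reads needed so that reads * read_length / genome_size = depth.
# usage: reads_for_depth <genome_size_bp> <target_depth> <read_length_bp>
reads_for_depth() {
    awk -v g="$1" -v d="$2" -v l="$3" 'BEGIN { printf "%.0f\n", g * d / l }'
}

# 3 Gbp genome at 30X with 150 bp reads (the read length is an assumed value)
reads_for_depth 3000000000 30 150
```

For paired-end data, halve the result to get the number of read pairs.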

PacBio Revio

Up to 360 Gb of HiFi reads per day, equivalent to 1,300 human whole genomes per year.

HiFi reads tens of kilobases long. Price example (Sequel II): ~35 kSEK per library and SMRT cell.

https://www.pacb.com/revio/

Sequencing approaches

Figure 1: DNA sequencing costs (Wetterstrand, KA)

Despite the price drop, choices still have to be made regarding depth and breadth of sequencing coverage and the number of samples.

Figure 2: Comparison of common sequencing approaches. Restriction site-associated DNA sequencing (RAD-seq, top) targets regions flanking given restriction sites, but misses much of the genome. Pooled sequencing (Pool-seq, middle) is cost-effective, but loses information about individuals. Low-coverage whole genome sequencing (lcWGS, bottom) is increasing in popularity, but genotyping low-coverage data is problematic.

Lou et al. (2021)

Our focus will be on Whole Genome reSequencing (WGS), mostly high-coverage.

Although sequencing costs have dropped dramatically (Figure 1), choices still have to be made about how to distribute costs across 1) sequencing coverage depth, i.e., the mean depth of sequencing; 2) sequencing coverage breadth, i.e., whether to do targeted or whole-genome resequencing; and 3) sample size, i.e., how many individuals to sample. Whole-genome resequencing of individuals from populations to sufficient depth (30X) is still very expensive, but often needed to understand mechanisms of adaptation (Lou et al., 2021, p. 5967). Various protocols have been developed to meet the challenges that cost imposes:

RAD-seq, restriction site-associated DNA sequencing, targets regions flanking given restriction sites. Downside: much of the genome is missed.
Pool-seq, pooled sequencing, is cost-effective, but loses information about individuals.
lcWGS, low-coverage whole genome sequencing, is increasing in popularity. However, genotyping low-coverage data is problematic.

Our focus here is WGS (whole-genome resequencing), primarily high-coverage, despite the cost it may incur.
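To make the depth versus sample size trade-off concrete, the following minimal sketch computes how many individuals fit into a fixed amount of sequencing output at different target depths. It assumes that output is shared evenly among samples and that every sequenced base contributes to coverage, which is optimistic. The helper name samples_for_budget is hypothetical. The 6 Tb output and 3 Gb genome size are borrowed from the examples earlier in these notes.

```shell
# Hypothetical helper: how many samples fit in a fixed sequencing output?
# usage: samples_for_budget <total_output_Gb> <genome_size_Gb> <target_depth>
samples_for_budget() {
    awk -v total="$1" -v genome="$2" -v depth="$3" \
        'BEGIN { printf "%d\n", int(total / (genome * depth)) }'
}

# 6 Tb (6000 Gb) of output and a 3 Gb genome, at three target depths
for depth in 30 10 2; do
    echo "depth ${depth}X: $(samples_for_budget 6000 3 "$depth") samples"
done
```

Under these assumptions, dropping from 30X to 2X raises the number of individuals from 66 to 1000, which is the main appeal of lcWGS.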

Genome assembly and population resequencing

Genome assembly

Allendorf et al. (2022)

Population resequencing

Figure 4: Overview of short reads mapped to a reference sequence. Reads have been colored according to population (red or yellow). There are three individuals from each population. Note the variation in sequence coverage. Mismatches in reads are highlighted as different colors with respect to population color.

DNA sequences in FASTQ format

ls -l --human-readable fastq/*.fastq.gz

-rw-r--r-- 1 runner runner 302K Sep 18 22:10 fastq/PUN-Y-INJ_R1.fastq.gz
-rw-r--r-- 1 runner runner 393K Sep 18 22:10 fastq/PUN-Y-INJ_R2.fastq.gz

Count number of lines:

zcat fastq/PUN-Y-INJ_R1.fastq.gz | wc --lines
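Because a FASTQ record spans four lines (see the format description that follows), the line count divided by four gives the number of reads. A minimal sketch, assuming no wrapped sequence lines. The helper name count_reads is hypothetical, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Hypothetical helper: number of reads in a gzipped FASTQ file,
# assuming exactly four lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Example usage with the file listed above (requires the file to exist):
# count_reads fastq/PUN-Y-INJ_R1.fastq.gz
```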

Format:

sequence id (prefixed by @)
DNA sequence
separator (+)
Phred base quality scores

zcat fastq/PUN-Y-INJ_R1.fastq.gz | head --lines 8 | cut --characters -30

@SRR9309790.10003134
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJ
@SRR9309790.10003222
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJ
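The four-line layout makes it easy to pull out individual fields by line position. The sketch below prints the read id and read length for each record. It assumes four lines per record, and the helper name fastq_lengths is hypothetical.

```shell
# Hypothetical helper: print "id <TAB> read length" for each FASTQ record,
# assuming four lines per record.
fastq_lengths() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 { id = substr($1, 2) }                 # header line: drop the leading @
        NR % 4 == 2 { printf "%s\t%d\n", id, length($0) }  # sequence line
    '
}

# Example usage (requires the file to exist):
# fastq_lengths fastq/PUN-Y-INJ_R1.fastq.gz | head
```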

DNA sequence quality control

Quality values represent the probability P that the call is incorrect. They are coded as Phred quality scores Q. Here, Q=20 implies 1% probability of error, Q=30 implies 0.1% and so on. Typically you should not rely on quality values below 20.

Q = -10 \log_{10} P
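Inverting the formula gives P = 10^(-Q/10). The sketch below evaluates this for a few scores and decodes quality characters to Phred scores. The decoding assumes the Phred+33 offset (Sanger / Illumina 1.8+ encoding), which these notes do not state explicitly. Both helper names are hypothetical.

```shell
# Error probability for a Phred score: P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Phred score for one quality character, assuming the Phred+33 offset
char_to_phred() {
    echo $(( $(printf '%d' "'$1") - 33 ))
}

for q in 10 20 30 40; do
    echo "Q=$q -> P=$(phred_to_prob "$q")"
done
echo "J -> Q$(char_to_phred J), F -> Q$(char_to_phred F), A -> Q$(char_to_phred A)"
```

Under the Phred+33 assumption, the characters J, F and A seen in the FASTQ excerpt correspond to Q41, Q37 and Q32, all well above the Q20 rule of thumb.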

A common way to do QC is with fastqc:

mkdir -p fastqc
fastqc --outdir fastqc --extract fastq/*fastq.gz

For Illumina paired-end sequencing, the second read of each pair (R2) usually shows a larger drop in quality towards the 3' end.
Trimming the sequences for adapter sequence and low-quality bases is good practice (e.g., with Cutadapt (Martin, 2011)).
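To illustrate the idea behind quality trimming, the sketch below removes trailing bases whose quality falls below a threshold. This is a deliberately simplified stand-in and not Cutadapt's algorithm. It assumes Phred+33 encoding and four lines per record. The helper name trim_tail and the output file name in the usage comment are hypothetical.

```shell
# Simplified 3-prime quality trimming (NOT the Cutadapt algorithm): drop trailing
# bases whose Phred+33 quality is below a threshold.
# usage: trim_tail <fastq.gz> <min_quality>
trim_tail() {
    gzip -dc "$1" | awk -v minq="$2" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
        NR % 4 == 1 { id = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            n = length($0)
            # walk back from the read end while quality is below the threshold
            while (n > 0 && ord[substr($0, n, 1)] - 33 < minq) n--
            printf "%s\n%s\n+\n%s\n", id, substr(seq, 1, n), substr($0, 1, n)
        }
    '
}

# Example usage (requires the input file to exist; output name is hypothetical):
# trim_tail fastq/PUN-Y-INJ_R2.fastq.gz 20 | gzip -c > PUN-Y-INJ_R2.trimmed.fastq.gz
```

Unlike a dedicated trimmer, this sketch leaves low-quality bases inside the read untouched, does not remove adapters, and keeps reads that end up empty. In practice Cutadapt, which exposes quality trimming through its --quality-cutoff option, is the better choice.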

Bibliography

Allendorf, F. W., Funk, W. C., Aitken, S. N., Byrne, M., & Luikart, G. (2022). Population Genomics. In F. W. Allendorf, W. C. Funk, S. N. Aitken, M. Byrne, G. Luikart, & A. Antunes (Eds.), Conservation and the Genomics of Populations. Oxford University Press. https://doi.org/10.1093/oso/9780198856566.003.0004

Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology, 30(23), 5966–5993. https://doi.org/10.1111/mec.16077

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10–12. https://doi.org/10.14806/ej.17.1.200

Wetterstrand, KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). www.genome.gov/sequencingcostsdata

DNA sequencing data