Variant calling

From sequence data to variant call set

Per Unneberg

NBIS

15-Nov-2023

Yesterday

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
T	T	A	C	A	A	T	C	C	G	A	T	C	G	T
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
T	C	A	C	A	A	T	G	C	G	A	T	G	G	A
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
	*			*			*			*		*		*

\[ \begin{align} \pi & = \sum_{j=1}^S h_j = \sum_{j=1}^{S} \frac{n}{n-1}\left(1 - \sum_i p_i^2 \right) \\ & \stackrel{S=6,\\ n=4}{=} \sum_{j=1}^{6} \frac{4}{3}\left(1 - \sum_i p_i^2\right) \\ & = \frac{4}{3}\left(\mathbf{\color{#a7c947}{4}}\left(1-\frac{1}{16}- \frac{9}{16}\right) + \mathbf{\color{#a7c947}{2}}\left(1 - \frac{1}{4} - \frac{1}{4}\right)\right) = \frac{10}{3} \end{align} \]

\[ \begin{align} \pi & = \frac{\sum_{i=1}^{n-1}i(n-i)\xi_i}{n(n-1)/2} \\ & \stackrel{n=4}{=} \frac{1*(4-1)*4 + 2*(4-2)*2}{6} = \frac{10}{3} \end{align} \]

This is not how real data looks like from the beginning…

The real data

@SRR9309790.10003134
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJ
@SRR9309790.10003222
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJ

LG4:30430

 30431     30441     30451     30461           
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTG
...............M...............................
...............A...                            
...............A.................              
.............................................. 
.............................................. 
.............................................. 
...............A...............................
...............A...............................
,,,,,,,,,,,,a,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,a,,,,,aa,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  ,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

The process

Variant and genotype calling

Nielsen et al. (2011)

SNP calling: Identification of polymorphic sites (\(>1\)% allele frequency)
Variant calling: Identification of variant sites (sufficient that any allele differs)
Genotype calling: Determine the allele combination for each individual (aa, aA, or AA for bi-allelic variants)

Knowing variant sites informs us of possible genotypes and improves genotyping.

Example: knowing a site has A or C limits possible genotype calls to AA, AC, or CC

Figure caption from (Nielsen et al., 2011):

Pre-processing steps (shown in yellow) transform the raw data from next-generation sequencing technology into a set of aligned reads that have a measure of confidence, or quality score, associated with the bases of each read. The per-base quality scores produced by base-calling algorithms may need to be recalibrated to accurately reflect the true error rates. Depending on the number of samples and the depth of coverage, either a multi-sample calling procedure (green) or a single-sample calling procedure (orange) may then be applied to obtain SNP or genotype calls and associated quality scores. Note that the multi-sample procedure may include a linkage-based analysis, which can substantially improve the accuracy of SNP or genotype calls. Finally, post-processing (purple) uses both known data and simple heuristics to filter the set of SNPs and/or improve the associated quality scores. Optional, although recommended, steps are shown in dashed lines.

Sequencing DNA

Sequencing technologies

Illumina NovaSeq 600

Scale up and down with a tunable output of up to 6 Tb and 20B single reads in < 2 days.

Up to 2X250 bp read length. Price example: 8,000 SEK total for resequencing 3Gbp genome to 30X

https://www.illumina.com/systems/sequencing-platforms/novaseq.html

PacBio Revio

Up to 360 Gb of HiFi reads per day, equivalent to 1,300 human whole genomes per year.

Tens of kilobases long HiFi reads. Price example (Sequel II): ~35kSEK per library and SMRT cell

https://www.pacb.com/revio/

DNA sequences in FASTQ format

ls --long --human *.fastq.gz

-rw-r--r-- 1 runner docker 302K Nov 15 11:10 PUN-Y-INJ_R1.fastq.gz
-rw-r--r-- 1 runner docker 393K Nov 15 11:10 PUN-Y-INJ_R2.fastq.gz

Count number of lines:

zcat PUN-Y-INJ_R1.fastq.gz | wc --lines

Format:

sequence id (prefixed by @)
DNA sequence
separator (+)
Phred base quality scores

zcat PUN-Y-INJ_R1.fastq.gz | head --lines 8 | cut --characters -30

@SRR9309790.10003134
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJ
@SRR9309790.10003222
TAAATCGATTCGTTTTTGCTATCTTCGTCT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJ

DNA sequence quality control

Quality values represent the probability \(P\) that the call is incorrect. They are coded as Phred quality scores \(Q\). Here, \(Q=20\) implies 1% probability of error, \(Q=30\) 0.1% and so on. Typically you should not rely on quality values below \(20\).

\[ Q = -10 \log_{10} P \]

A common way to do QC is with fastqc:

fastqc --outdir . --extract *fastq.gz

Sequencing approaches

Despite price drop, still need to make choices regarding depth and breadth of sequencing coverage and number of samples.

Lou et al. (2021)

Here focus on Whole Genome reSequencing (WGS), mostly high-coverage.

Despite the fact that sequencing costs have dropped dramatically (left), there still are choices to be made regarding the distribution of costs along 1) sequencing coverage depth, i.e., the mean depth of sequencing 2) sequencing coverage breadth, i.e., whether or not to do targeted or whole-genome resequencing or 3) sample size; how many individuals to sample. Whole-genome resequencing of individuals from populations to sufficient depth (30X) is still very expensive, but often needed to understand mechanisms of adaptation (Lou et al., 2021, p. 5967). Various protocols have been developed to meet the challenges that cost imposes:

RAD-seq, restriction site-associated DNA sequencing, targets regions flanking given restriction sites. Downside: much of genome is missed
pool-seq, pooled sequencing. Cost-effective, but loses information about individuals
lcWGS, low-coverage whole genome sequencing increasing in popularity. Genotyping low coverage is problematic however.

Our focus here is WGS (whole-genome resequencing), primarily high-coverage, despite the cost it may incur.

Genome assembly and population resequencing

Genome assembly

Allendorf et al. (2022)

Population resequencing

Read mapping

Sequence alignment maps reads to a reference

Figure 1: Screenshot of reference sequence (top) and aligned reads (bottom). Second line with `.` characters is the consensus sequence. Bases are colored by nucleotide. Letter case indicates forward (upper-case) or reverse (lower-case) alignment. `*` is placeholder for deleted base.

Aim of sequence alignment (read mapping) is to determine source in reference sequence. Some commonly used read mappers for resequencing are

BWA, BWA-MEM (H. Li, 2013)
Novoalign (https://www.novocraft.com/)
Minimap2 (H. Li, 2018)

For a recent comprehensive comparison see Donato et al. (2021)

Alignments are stored in BAM format

Header information

samtools view --header-only PUN-Y-INJ.sort.dup.bam | head --lines 4

@HD VN:1.6  SO:coordinate
@SQ SN:LG4  LN:100000
@RG ID:SRR9309790   SM:PUN-Y-INJ    PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:SRR9309790\tSM:PUN-Y-INJ\tPL:ILLUMINA -p -t 14 -M tiny/M_aurantiacus_v1_splitline_ordered.fasta -

Format: metadata record types prefixed with @, e.g., @RG is the read group

Alignments

samtools view PUN-Y-INJ.sort.dup.bam | head --lines 1

SRR9309790.7750070  65  LG4 1   39  45S9M1I96M  =   83782   83781   TGAACTATAGTCGATGGGACGAATACCCCCCTGAACTTGCGAAGGGGACAATTACCCCCCTCTGTTATGTTTCAGTCAATTTCATGTTTGATTTTTAGATTTTTAATTAATTATATATTTTTTGCAATTTGTAACCTCTTTAACCTTTATT AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFFJFJJJJJJJJJJFJJJJJJJJFJJJJJJJJJAJJJJJJJJJJJJJJJJJFA<FFJJ7FF SA:Z:LG4,83018,+,30M2I28M91S,37,5;  MC:Z:12S10M2I123M4S MD:Z:16C28C59   PG:Z:MarkDuplicates RG:Z:SRR9309790 NM:i:3  MQ:i:50 AS:i:88 XS:i:70 ms:i:5185

Some important columns: 1:QUERY, 3:REFERENCE, 4:POSITION, 5:MAPQ, 6:CIGAR. The CIGAR string compiles information on the alignment, such as match (M), soft clipping (S), and insertion to reference (I)

cf https://samtools.github.io/hts-specs/SAMv1.pdf

Mapped alignments can be viewed with `samtools tview`

samtools tview -p LG4:30430 -d H -w 60 \
   PUN-Y-INJ.sort.dup.bam \
   M_aurantiacus_v1.fasta

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............M............................................
...............A...                               ,,,,,,,,,,
...............A.................                 ,,,,,,,,,,
..............................................      ,,,,,,,,
..............................................              
..............................................              
...............A............................................
...............A............................................
,,,,,,,,,,,,a,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,a,,,,,aa,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  ,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

samtools tview -p LG4:30430 -d H -w 60 \
   PUN-R-ELF.sort.dup.bam \
   M_aurantiacus_v1.fasta

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............A............................................
,,,,,,,                           ..........................
...............A.....                  .....................
...............A...A....               .....................
...............A.........                                   
...............A...........C.............                   
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,g,,,,,,,a,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  .............A............................................

Aka pileup format. Forward (.) and backward (,) mapping reads. Mismatches shown as letters.

Potential error corrections and pitfalls

Instrument

PCR duplicates -> MarkDuplicates
systematic errors from sequencing machine -> Base Quality Score Recalibration (BQSR)

Reference

quality of reference sequence!
repetitive sequence

Variant calling and genotyping

Variants show up in pileup alignments

Sample PUN-Y-INJ

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............M............................................
...............A...                               ,,,,,,,,,,
...............A.................                 ,,,,,,,,,,
..............................................      ,,,,,,,,
..............................................              
..............................................              
...............A............................................
...............A............................................
,,,,,,,,,,,,a,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,a,,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,a,,,,,aa,a,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  ,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Sample PUN-R-ELF

LG4:30430

 30431     30441     30451     30461     30471              
CATTGGCAATGGCATCAGTTGAGCATCTTAGTACGAACTAAAAGCTGCGAAAAAATATTT
...............A............................................
,,,,,,,                           ..........................
...............A.....                  .....................
...............A...A....               .....................
...............A.........                                   
...............A...........C.............                   
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...............A............................................
,,,g,,,,,,,a,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  .............A............................................

Potential variants show up as multiple mismatches in a column. Two questions arise:

how do we detect variant sites?
how do we distinguish variants from sequencing error?

Simple approach: filter bases on quality (e.g., Q20), call heterozygous if 20-80% bases non-reference.

Issues: undercalls heterozygotes, no measure of uncertainty

Solution: probabilistic methods!

(Nielsen et al., 2011)

We can calculate likelihoods of observed data

Example (exluding sequencing error!)

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

Goal: calculate likelihood of observing \(X=\)G,g,T,T,G,g,t from a genotype \(G\). We assume each observation \(X_i\) can be treated independently. We restrict possible genotypes to the observed alleles (i.e., G, T). Some observations:

Prob(\(X_1=\)G assuming genotype T,T) = P(\(X_1=\)G|T,T) = \(0\)

Prob(\(X_1=\)G assuming genotype G,G) = P(\(X_1=\)G|G,G) = \(1\)

Prob(\(X_1=\)G assuming genotype G,T) = P(\(X_1=\)G|G,T) = \(0.5\)

To get total likelihood \(P(X|G)\) assuming a genotype \(G\) (here G,T), we can multiply over all observations (reads):

\[ \begin{align} P(X|G,T) & = P(X_1=G|G,T)P(X_2=g|G,T)P(X_3=T|G,T)P(X_4=T|G,T) \\ & P(X_5=G|G,T)P(X_6=g|G,T)P(X_7=t|G,T) = 0.5^7 \end{align} \]

We can use Bayes’ theorem to genotype

Example

   
TGC
.K.
...
,,,
.T.
.T.
...
,,,
,t,

For a given site, we have a number of observations \(X\). We have shown we can calculate the likelihood of observing \(X\) given a genotype \(G\), \(P(X|G)\).

However; what we really want to know is the most likely genotype \(G\) given the data \(X\), or \(P(G|X)\).

Apply Bayes’ theorem:

\[ P(G|X) \sim P(X|G)\cdot P(G) \]

\[ \text{posterior} \sim \text{likelihood} \cdot \text{prior} \]

Consequently we need to set a prior on \(G\). If allele frequencies are known, we can constrain the frequencies; for example, if A is known to be low (\(\sim1\)%) AA genotype is very unlikely. Otherwise, could set all equal (flat prior).

cf https://gatk.broadinstitute.org/hc/en-us/articles/360035890511

R. Li et al. (2009), Table 1, shows a nice numerical example of one way of setting priors. The authors assume a specific allele, G, in the reference sequence:

–G–

They start by calculating the frequency of haploid genotypes. They first determine \(p_G\) by assuming a heterozygous SNP rate \(f=0.001\), which means 1 in a 1000 sites has G/G genotype mutated to G/X, where X is one of {A,C,T}. They assume a transition to transversion (ts/tv) ratio of 4, meaning X=A four times as often as C or T (there is an error in the text where C is taken to be the transition; the numbers in the table are correct however). This gives the following haploid genotype frequencies:

\[\begin{align} p_G & = 1-f = 0.999 \\ p_A & = 4f/6 = 6.67\times10^{-4} \\ p_C & = f/6 = 1.67\times10^{-4}\\ p_T & = f/6 =1.67\times10^{-4} \end{align}\]

To get the diploid genotypes, we simply multiply the corresponding entries, e.g., \(p_{AC} = p_Ap_C\). For homozygote ALT, we need to account for the homozygous SNP rate \(r = 0.0005\), where G/G mutates to X/X, for X one of {A,C,T}:

\[\begin{align} p_{AA} & = p_Ap_A + 4r/6 = 3.33\times10^{-4} \\ p_{AC} & = p_Ap_C = 1.11\times10^{-7}\\ p_{AT} & = p_Ap_T = 1.11\times10^{-7}\\ p_{CC} & = p_Cp_C + r/6 = 8.34\times10^{-5} \\ p_{CG} & = p_Cp_G = 1.67\times10^{-4}\\ p_{CT} & = p_Cp_T = 2.78\times10^{-8}\\ p_{GG} & = 1 - f - r = 0.9985 \\ p_{GT} & = p_Gp_T = 1.67\times10^{-4}\\ p_{TT} & = p_Tp_T + r = 8.34\times10^{-5}\\ \end{align}\]

Genotype likelihoods

We have outlined a probabilistic approach to variant calling where we obtain a posterior probability of observing a genotype \(G\) given data \(X\):

\[ P(G|X) \sim P(X|G)P(G) \]

Assuming a bi-allelic site, and letting \(H_1, H_2\) denote the two alleles, we have three possible genotype likelihoods \(P(H_1H_1|X)\), \(P(H_1H_2|X)\), and \(P(H_2H_2|X)\).

The highest posterior probability is typically chosen as the genotype call, with a measure confidence represented by the genotype probability or ratio between the two most probable calls.

Genotype likelihoods are often represented as Phred-scaled likelihoods (again!):

\[ \text{QUAL} = -10 \log_{10} P(G|X) \]

Variant Call Format (VCF) - header

bcftools view --header-only allsites.vcf.gz | head --lines 1
bcftools view --header-only allsites.vcf.gz | grep "##FILTER"
bcftools view -h allsites.vcf.gz | grep "##INFO" | head -n 4
bcftools view -h allsites.vcf.gz | grep "##FORMAT" | head -n 8

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

FILTER defines applied filters , INFO fields provide additional information to genotypes, FORMAT specification fields define genotype entries, and more. NB: PL format definition.

https://samtools.github.io/hts-specs/VCFv4.4.pdf

Variant Call Format (VCF) - data

bcftools view --header-only --samples PUN-R-ELF,PUN-Y-INJ allsites.vcf.gz |\
 tail --lines 1
bcftools view --no-header  --samples PUN-R-ELF,PUN-Y-INJ allsites.vcf.gz LG4:6886

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   722.43  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;InbreedingCoeff=-0.155;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=18.52;ReadPosRankSum=0.577;SOR=0.44   GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:50:.:.:189,0,50:.    0/1:2,2:4:45:.:.:45,0,45:.

QUAL: Phred-scaled quality score for Prob(ALT is wrong): \(722.43\) (\(p=10^{-Q/10}=5.7e-73\))

INFO field summarizes data for all samples. For instance:

allele count 2 (AC=2)
allele frequency minor allele 0.222 (AF=0.222)

https://samtools.github.io/hts-specs/VCFv4.4.pdf

Variant Call Format (VCF) - data

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PUN-R-ELF   PUN-Y-INJ
LG4 6886    .   C   T   722.43  .   AC=2;AF=0.222;AN=4;BaseQRankSum=0;DP=82;ExcessHet=1.8123;FS=1.309;InbreedingCoeff=-0.155;MLEAC=4;MLEAF=0.222;MQ=60;MQRankSum=0;QD=18.52;ReadPosRankSum=0.577;SOR=0.44   GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:3,8:11:50:.:.:189,0,50:.    0/1:2,2:4:45:.:.:45,0,45:.

Genotypes (`GT:AD:DP:GQ:PGT:PID:PL:PS`)

PUN-R-ELF: 0/1:3,8:11:50:.:.:189,0,50:.

GT=0/1, AD=3,8 => 3 REF, 8 ALT, DP=11 => sequence depth = 11, PL=189,0,50

PUN-Y-INJ: 0/1:2,2:4:45:.:.:45,0,45:.

GT=0/1, AD=2,2 => 2 REF, 2 ALT, DP=4 => sequence depth = 4, PL=45,0,45

Relative genotype probabilities

Can convert Phred-scaled quality scores to probabilities as

\[ p = 10^{-Q/10} \]

For PUN-R-ELF the relative probabilities are \(10^{-189/10}\approx1.26e-9\), \(10^{0}=1\), \(10^{50}=10^{-5}\).

Interpretation: 0/1 10,000 times more likely than 1/1 (\(1/10^{-5}\))

ELF

   
TCT
.Y.
.T.
...
.T.
,,,
.T.
,,,
.T.
...
...
...
,t,
,t,
,t,
,t,

INJ

   
TCT
.Y.
.T.
...
,,,
,,,
,,,
.T.

:::::

::::::

GATK best practice

https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

Pros

Best practices
Large documentation
Variant quality score recalibration

Cons

Human-centric - very slow runtime on genomes with many sequences
Complicated setup

Alternative variant callers

freebayes

Bayesian genetic variant detector. Simpler setup.

May struggle in high-coverage regions.

(Garrison & Marth, 2012)

bcftools

Utilities for variant calling and manipulating VCFs and BCFs.

(Danecek et al., 2021)

ANGSD

For low-coverage sequencing. Doesn’t do explicit genotyping; most methods take genotype uncertainty into account.

(Korneliussen et al., 2014)

Exercise on variant calling

The monkeyflower system

The allure of monkeyflowers (DOI: 10.1126/science.365.6456.854)

From https://jgi.doe.gov/csp-2021-genomic-resources-for-mimulus/

Plants in the genus Mimulus inhabit highly variable habitats and are famous for their extraordinary ecological diversity. Mimulus is now a powerful system for ecological genomic studies, thanks to its experimental tractability, rapidly growing research community, and the JGI-generated reference genome for M. guttatus.

Learning outcomes

Follow GATK best practices to

Perform qc on sequencing reads and interpret results
Prepare reference for read mapping
Map reads to reference
Mark duplicates
Perform raw variant calling to generate a set of sites to exclude from recalibration
Perform base quality score recalibration
Perform variant calling on base recalibrated data
Do genotyping on all samples and combine results to a raw variant call set

Bibliography

Allendorf, F. W., Funk, W. C., Aitken, S. N., Byrne, M., & Luikart, G. (2022). Population Genomics. In F. W. Allendorf, W. C. Funk, S. N. Aitken, M. Byrne, G. Luikart, & A. Antunes (Eds.), Conservation and the Genomics of Populations (p. 0). Oxford University Press. https://doi.org/10.1093/oso/9780198856566.003.0004

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806

Donato, L., Scimone, C., Rinaldi, C., D’Angelo, R., & Sidoti, A. (2021). New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: An updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Computing and Applications, 33(22), 15669–15692. https://doi.org/10.1007/s00521-021-06188-z

Garrison, E., & Marth, G. (2012). Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 [q-Bio]. http://arxiv.org/abs/1207.3907

Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. https://doi.org/10.1186/s12859-014-0356-4

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-Bio]. https://arxiv.org/abs/1303.3997

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843–2851. https://doi.org/10.1093/bioinformatics/btu356

Li, H. (2018). Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. https://doi.org/10.1093/bioinformatics/bty191

Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., & Wang, J. (2009). SNP detection for massively parallel whole-genome resequencing. Genome Research, 19(6), 1124–1132. https://doi.org/10.1101/gr.088013.108

Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology, 30(23), 5966–5993. https://doi.org/10.1111/mec.16077

Maruki, T., & Lynch, M. (2017). Genotype Calling from Population-Genomic Sequencing Data. G3 Genes|Genomes|Genetics, 7(5), 1393–1404. https://doi.org/10.1534/g3.117.039008

Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451. https://doi.org/10.1038/nrg2986

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
T	T	A	C	A	A	T	C	C	G	A	T	C	G	T
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
T	C	A	C	A	A	T	G	C	G	A	T	G	G	A
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
	*			*			*			*		*		*

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
T	T	A	C	A	A	T	C	C	G	A	T	C	G	T
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
T	C	A	C	A	A	T	G	C	G	A	T	G	G	A
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
	*			*			*			*		*		*

Variant calling

From sequence data to variant call set

Per Unneberg

NBIS

15-Nov-2023

Yesterday

The real data

The process

Variant and genotype calling

Sequencing DNA

Sequencing technologies

Illumina NovaSeq 600

PacBio Revio

DNA sequences in FASTQ format

DNA sequence quality control

Sequencing approaches

Genome assembly and population resequencing

Genome assembly

Population resequencing

Read mapping

Sequence alignment maps reads to a reference

Alignments are stored in BAM format

Header information

Alignments

Mapped alignments can be viewed with samtools tview

Potential error corrections and pitfalls

Instrument

Reference

Variant calling and genotyping

Variants show up in pileup alignments

Sample PUN-Y-INJ

Sample PUN-R-ELF

We can calculate likelihoods of observed data

Example (exluding sequencing error!)

We can use Bayes’ theorem to genotype

Example

Genotype likelihoods

Variant Call Format (VCF) - header

Variant Call Format (VCF) - data

Variant Call Format (VCF) - data

Genotypes (GT:AD:DP:GQ:PGT:PID:PL:PS)

Relative genotype probabilities

GATK best practice

Pros

Cons

Alternative variant callers

freebayes

bcftools

ANGSD

Exercise on variant calling

The monkeyflower system

Learning outcomes

Bibliography

Mapped alignments can be viewed with `samtools tview`

Genotypes (`GT:AD:DP:GQ:PGT:PID:PL:PS`)

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
T	T	A	C	A	A	T	C	C	G	A	T	C	G	T
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
T	C	A	C	A	A	T	G	C	G	A	T	G	G	A
T	T	A	C	G	A	T	G	C	G	C	T	C	G	T
	*			*			*			*		*		*