Variant filtering

From raw variant calls to high-quality call sets

Per Unneberg

NBIS

15-Nov-2023

Variant filtering

Why we need to filter variants

Figure 1: Overlap in **raw** variant calls for different combinations of read mappers and variant callers.

Error rate of variant calls (SNPs and INDELs) largely unknown. Two major sources of error are

erroneous realignment in low-complexity regions
incomplete reference sequence

Figure 2: Overlap in **filtered** variant calls for different combinations of read mappers and variant callers

Li (2014)

Manual filtering sets thresholds on context statistics

Table 1: Key data filters (Table 3 Lou et al., 2021, p. 5974)
Category	Filter	Recommendation (examples)
General filters	Base quality	Recalibrate / <Q20
	Mapping quality	MAPQ < 20 / improper pairs
	Minimum depth and/or number of individuals	Varies; e.g. <50% individuals, <0.8X average depth
	Maximum depth	1-2 sd above median depth
	Duplicate reads	Remove
	Indels	Realign reads / haplotype-based caller / exclude bases flanking indels
	Overlapping sections of paired-end reads	Soft-clip to avoid double-counting
Filters on polymorphic sites	\(p\)-value	\(10^{-6}\)
	SNPs with more than two alleles	Filter; methods often assume bi-allelic sites
	Minimum minor allele frequency (MAF)	1%-10% for some analyses (PCA/admixture/LD/\(\mathsf{F_{ST}}\))
Restricting analysis to a predefined site list	List of global SNPs	Use global call set for analyses requiring shared sites

Procedure

Look at annotations (context statistics) and set thresholds.

Example: filter all sites with MAF<1%

NB: bypassing recommendations often means doing custom analyses. For instance, Talla et al. (2019) include GATK tri-allelic sites due to different bi-allelic pairs segregating in different subpopulations (e.g. A/G in pop 1, A/T in pop 2)

Verbose explanations of filters

Source: (Lou et al., 2021)

Base quality scores are factored into the calculation of genotype likelihoods, so if they accurately reflect the probability of sequencing error, bases with low scores also carry useful information. However, base quality scores are sometimes miscalibrated, so noise may be reduced if bases with scores below a threshold (e.g., 20) are either trimmed off prior to analysis or ignored. Alternatively, all base quality scores can be recalibrated based on estimated error profiles in the data (see Section 3.1).

Mapping quality is not considered in genotype likelihood estimation in currently available tools, so it is often advisable to remove low-confidence and/or nonuniquely mapped reads prior to analysis (e.g., reads with mapping quality <20). Filtering out reads that do not map in proper pairs should also further increase confidence in reads being mapped to the correct location, but could cause biases in regions with structural variation.

To avoid sites with low or confounding data support in downstream analysis, minimum depth and/or minimum number of individual filters can be used to exclude sites with much reduced sequencing coverage compared to the rest of the genome (e.g., regions with low unique mapping rates, such as repetitive sequences). Appropriate thresholds will vary between data sets, but could, for example, exclude sites with read data for <50% of individuals (globally or within each population), or with <0.8× average depth across individuals (after filtering on mapping quality)

Maximum depth filters are used to exclude sites with exceptionally high coverage (e.g., regions that are susceptible to dubious mapping, such as copy number variants). Common maximum depth thresholds could be one or two standard deviations above the median genome-wide depth.

PCR and optical duplicates can give inflated impressions of how many unique molecules have been sequenced, which—particularly in the presence of preferential amplification of one allele— could bias genotype likelihood estimation. We therefore recommend removing duplicate reads prior to any analysis.

Reads mapped across indels are frequently misaligned, especially if the ends of reads span an indel. To avoid false SNP calls, we recommend either using dedicated tools to realign reads covering indels, using a haplotype-based variant caller (e.g., freebayes or gatk) to estimate genotype likelihoods, or excluding bases flanking indels.

If the DNA insert in a library fragment is shorter than the combined length of paired reads, there will be a section of overlap between the forward and reverse reads. While some variant callers (e.g., gatk) account for the pseudoreplication in overlapping ends of read pairs, the current implementation of angsd treats each end of a read pair as independent (this may change in a future release (T. Korneliussen, personal communication)). When treated as independent, read support for overlapping sections will be “double counted,” which may bias genotype likelihoods. A conservative approach is to soft-clip one of the overlapping read ends.

The significance threshold (often in the form of maximum p-value) can be adjusted to fine-tune the sensitivity of polymorphism detection, with lower p-values leading to fewer, but higher confidence, SNP calls. A commonly used cut-off is \(10^{-6}\).

Most software programs for downstream analyses assume that all SNPs are biallelic, so SNPs with more than two alleles can be filtered out in the SNP identification step to avoid violation of such assumptions.

For many types of analysis, such as PCA, admixture analysis, detection of FST outliers and estimation of LD, low-frequency SNPs are uninformative and can even bias results (e.g. Linck & Battey, 2019; Roesti et al., 2012). For those types of analysis, imposing a minimum MAF filter of 1%–10% can substantially speed up computation time. Appropriate thresholds depend on coverage, sample size (how many copies does an MAF threshold correspond to) and the type of downstream analysis.

For comparison of parameter estimates for multiple populations, it is important to ensure that data are obtained for a shared set of sites and that SNP polarization (which allele we track the frequency of) is consistent. For programs such as angsd where population-specific estimates are obtained by analysing the data from each population separately, a good strategy is to first conduct a global SNP calling with all samples and then restrict population-specific analysis to those SNPs with consistent major and minor allele designations (-doMajorMinor 3 in angsd) and no MAF or SNP p-value filter (because that would incorrectly generate “missing data” if a site is fixed in a particular population).

Guidelines? What guidelines?

GATK hard filters

However, because we want to help, we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

SNPs

QualByDepth (QD) < 2.0
RMSMappingQuality (MQ) < 40.0
FisherStrand (FS) > 60.0
StrandOddsRatio (SOR) > 3.0
MappingQualityRankSumTest (MQRankSum) < -12.5
ReadPosRankSumTest (ReadPosRankSum) < -8.0

Indels

QualByDepth (QD) < 2.0
ReadPosRankSumTest (ReadPosRankSum) < -20.0
InbreedingCoeff < -0.8
FisherStrand (FS) > 200.0
StrandOddsRatio (SOR) > 10.0

That said, you ABSOLUTELY SHOULD NOT expect to run these commands and be done with your analyses.

https://gatk.broadinstitute.org/hc/en-us/articles/360037499012
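
As a rough illustration of how the SNP thresholds listed above act on a single site, the sketch below checks a dictionary of annotations against each threshold. It assumes the INFO annotations have already been parsed into a Python dict; the function name `failed_snp_filters` and the example values are hypothetical and not part of GATK. A missing annotation is assumed here not to trigger its filter.

```python
import operator

# SNP hard-filter thresholds as listed above: a site FAILS a filter
# when compare(annotation_value, threshold) is true.
SNP_HARD_FILTERS = {
    "QD": (operator.lt, 2.0),
    "MQ": (operator.lt, 40.0),
    "FS": (operator.gt, 60.0),
    "SOR": (operator.gt, 3.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}


def failed_snp_filters(info):
    """Return the names of the hard filters that a site fails.

    `info` is a hypothetical dict of INFO annotations for one site.
    Assumption: a missing annotation does not trigger its filter.
    """
    failed = []
    for name, (compare, threshold) in SNP_HARD_FILTERS.items():
        value = info.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


# Made-up annotation values for two sites
good_site = {"QD": 25.3, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
             "MQRankSum": 0.1, "ReadPosRankSum": -0.4}
bad_site = {"QD": 1.5, "MQ": 35.0, "FS": 75.0, "SOR": 0.8}

print(failed_snp_filters(good_site))
print(failed_snp_filters(bad_site))
```

Keeping the thresholds in one table makes it easy to experiment with alternative values, which is what the GATK recommendations encourage.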

On RAD-seq filtering

… the effects of SNP filtering practices on population genetic inference have received much less attention

There Is No ‘Rule of Thumb’: Genomic Filter Settings for a Small Plant Population to Obtain Unbiased Gene Flow Estimates (Nazareno & Knowles, 2021)

General guidelines on manual filters are rarely discussed in the literature, because there are no rules of thumb. Every problem requires its own settings, as the GATK developers maintain.

GATK guidelines explained (see https://gatk.broadinstitute.org/hc/en-us/articles/360035890471):

QualByDepth (QD): variant confidence (QUAL) divided by unfiltered depth
FisherStrand (FS): checks for strand bias (i.e., if minor allele occurs more often on one strand)
StrandOddsRatio (SOR): alternative strand bias test
RMSMappingQuality (MQ): root mean square mapping quality over all reads
MappingQualityRankSumTest (MQRankSum): compares mapping qualities of reads supporting ref and alt alleles
ReadPosRankSumTest (ReadPosRankSum): compares the positions of ref and alt alleles within reads
InbreedingCoeff: population-level statistic that requires at least 10 individuals

What about machine learning?

DePristo et al. (2011)

Variant Quality Score Recalibration

Motivation: look at context statistics and integrate over multiple dimensions

training data: subset of known variants (from validated resources, e.g. 1000 Genomes)
compile multiple statistics (allele depth, read count, quality, …)
fit Gaussian mixture model
reassign quality scores to variant call set

Caveat: a database of known variants is often not available for non-model organisms.

Key take-home: thresholds that previously were binary yes/no filters now depend on context; for instance, a site with an AD (allele depth) of 4 will sometimes pass VQSR and sometimes not, depending on its other annotations.
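
To make the context dependence concrete, here is a deliberately simplified sketch. It fits a single Gaussian with independent dimensions to made-up (QD, FS) values of "known" variants and scores two candidate sites by their log density. This stands in for the Gaussian mixture model mentioned above and is not GATK's implementation; all function names and numbers are illustrative assumptions.

```python
import math
import statistics


def fit_gaussian(training):
    """Fit an independent (diagonal) Gaussian to each annotation column."""
    columns = list(zip(*training))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def log_density(model, site):
    """Log density of one site's annotations under the fitted model."""
    total = 0.0
    for (mu, sd), x in zip(model, site):
        total += -0.5 * math.log(2 * math.pi * sd ** 2)
        total -= (x - mu) ** 2 / (2 * sd ** 2)
    return total


# Made-up (QD, FS) values for "known" training variants
training = [(20.0, 1.0), (22.0, 2.0), (18.0, 0.5), (25.0, 1.5), (21.0, 3.0)]
model = fit_gaussian(training)

# Two candidate sites with the SAME QD but different strand bias (FS)
site_a = (19.0, 1.5)
site_b = (19.0, 40.0)

score_a = log_density(model, site_a)
score_b = log_density(model, site_b)
print(score_a, score_b)
```

Both sites would be treated identically by a hard filter on QD alone, but the joint score separates them because `site_b` has a strand bias far outside the range of the training variants.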

Figure legend:

(a) Relationship in the HiSeq call set between strand bias and quality by depth for genomic locations in HapMap3 (red) and dbSNP (orange) used for training the variant quality score recalibrator (left), and (b) the same annotations applied to differentiate likely true positive (green) from false positive (purple) new SNPs. (c–e) Quality tranches in the recalibrated HiSeq (c), exome (d) and low-pass CEU (e) calls beginning with (top) the highest quality but smallest call set with an estimated false positive rate among new SNP calls of <1/1000 to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche’s true- and false-positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.

Filtering VCF with variant sites

Monkeyflower variants

bcftools stats variantsites.vcf.gz | grep "^SN"

SN  0   number of samples:  10
SN  0   number of records:  12673
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 10403
SN  0   number of MNPs: 0
SN  0   number of indels:   2291
SN  0   number of others:   0
SN  0   number of multiallelic sites:   1042
SN  0   number of multiallelic SNP sites:   210
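
If the summary numbers are needed programmatically, the `SN` lines can be collected into a dictionary. The sketch below assumes the fields are tab-separated, as in the file that bcftools writes; the function name `parse_summary_numbers` is made up for this example, and the input is a subset of the lines shown above.

```python
def parse_summary_numbers(stats_text):
    """Collect the SN lines of `bcftools stats` output into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        # Fields: "SN", id, "key:", value (assumed tab-separated)
        _, _, key, value = line.split("\t")[:4]
        summary[key.rstrip(":")] = int(value)
    return summary


# A few lines copied from the output above
example = (
    "SN\t0\tnumber of samples:\t10\n"
    "SN\t0\tnumber of records:\t12673\n"
    "SN\t0\tnumber of SNPs:\t10403\n"
    "SN\t0\tnumber of indels:\t2291\n"
    "SN\t0\tnumber of multiallelic sites:\t1042\n"
)
summary = parse_summary_numbers(example)

# Fraction of records that are multiallelic
frac_multiallelic = (
    summary["number of multiallelic sites"] / summary["number of records"]
)
print(f"{frac_multiallelic:.3f}")
```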

Use vcftools to compile data to generate summary statistics

Plot and select thresholds

Mean depth and variant quality distribution

Code

library(ggplot2)

vcf <- "variantsites.vcf.gz"
# vcftools writes per-site depth (summed over individuals) to out.ldepth
system(paste("vcftools --gzvcf", vcf, "--site-depth 2>/dev/null"))
data <- read.table("out.ldepth", header = TRUE)
x <- as.data.frame(table(data$SUM_DEPTH))
# Example thresholds: 0.8X median depth and median depth + 1 sd
lower <- 0.8 * median(data$SUM_DEPTH)
upper <- median(data$SUM_DEPTH) + 1 * sd(data$SUM_DEPTH)
xupper <- ceiling(upper/100) * 100
ggplot(x, aes(x = as.numeric(as.character(Var1)), y = Freq)) + geom_line() + xlab("Depth") +
    ylab("bp") + xlim(0, xupper) + geom_vline(xintercept = lower, color = "red",
    size = 1.3) + geom_vline(xintercept = upper, color = "red", size = 1.3) + ggtitle("Example threshold: 0.8X median depth, median depth + 1sd")

Depth is uneven. High coverage often indicates repetitive sequence. Too low coverage will bias SNP calling due to undersampling of alleles.

Code

system(paste("vcftools --gzvcf", vcf, "--site-quality 2>/dev/null"))
data <- read.table("out.lqual", header = TRUE)
ggplot(subset(data, QUAL < 1000), aes(x = QUAL)) + geom_histogram(fill = "white",
    color = "black", bins = 50) + xlab("Quality value") + ylab("Count") + geom_vline(xintercept = 30,
    color = "red", size = 1.3) + ggtitle("Example threshold: Q30")

Filter variants with too low quality (Q30 = 0.1% chance of being wrong)
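
The relation between a Phred-scaled quality Q and the error probability P is P = 10^(-Q/10). A minimal sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g} ({p * 100:g}%)")
```

Each step of 10 in Q corresponds to a tenfold decrease in error probability.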

Missing data per individual and site

Code

system(paste("vcftools --gzvcf", vcf, "--missing-indv 2>/dev/null"))
data <- read.table("out.imiss", header = TRUE)
ggplot(data, aes(x = F_MISS, y = INDV)) + geom_point(size = 3) + ggtitle("Missing data per individual")

Fraction of missing sites per individual. A high fraction would indicate poor sample quality.

Code

system(paste("vcftools --gzvcf", vcf, "--missing-site 2>/dev/null"))
data <- read.table("out.lmiss", header = TRUE)
ggplot(data, aes(x = F_MISS)) + geom_histogram(fill = "white", color = "black", bins = 10) +
    xlab("F_MISS") + ylab("Count") + geom_vline(xintercept = 0.25, color = "red",
    size = 1.3) + ggtitle("Missing data per site: example threshold F_MISS=0.25")

Fraction missing calls per site. Could warrant separate filters when comparing populations (e.g., total missing 0.2, but population A has 0.1 missing, population B 0.4).
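
The population example can be reproduced with a small sketch. It assumes a hypothetical site with 20 individuals in population A and 10 in population B, missing genotypes encoded as `None`, and the example threshold F_MISS=0.25 used in the plot; none of these values come from the monkeyflower data.

```python
def missing_fraction(genotypes):
    """Fraction of missing calls; a missing genotype is given as None."""
    return sum(g is None for g in genotypes) / len(genotypes)


# Hypothetical site: 20 individuals in population A, 10 in population B
pop_a = [None] * 2 + ["0/1"] * 18  # 2 of 20 missing
pop_b = [None] * 4 + ["0/0"] * 6   # 4 of 10 missing

total = missing_fraction(pop_a + pop_b)
per_pop = {"A": missing_fraction(pop_a), "B": missing_fraction(pop_b)}

max_missing = 0.25
passes_global = total <= max_missing
passes_per_pop = all(f <= max_missing for f in per_pop.values())
print(total, per_pop, passes_global, passes_per_pop)
```

The site passes a global filter (0.2 missing overall) but fails when the same threshold is applied within each population, because population B has 0.4 missing.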

Minor allele frequency and heterozygosity

Code

system(paste("vcftools --gzvcf", vcf, "--freq2 --max-alleles 2 2>/dev/null"))
data <- read.table("out.frq", skip = 1)
colnames(data) <- c("CHROM", "POS", "N_ALLELES", "N_CHR", "FREQ1", "FREQ2")
# Minor allele frequency is the smaller of the two allele frequencies
data$MAF <- pmin(data$FREQ1, data$FREQ2)
ggplot(data, aes(x = MAF)) + geom_histogram(fill = "white", color = "black", bins = 10) +
    xlab("MAF") + ylab("Count") + geom_vline(xintercept = 0.1, color = "red", size = 1.3) +
    ggtitle("Minor allele frequency: example threshold MAF=0.1")

n=12; mutations 0, 4, 5 (red) are singletons (frequency 1/12 ≈ 0.083) and would be removed by a MAF threshold of 0.1

Reasonable cutoff 0.05-0.1 for PCA, population structure.

But! Statistics based on diversity or the SFS should not be filtered on MAF
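
How many allele copies a MAF threshold corresponds to depends on sample size. The sketch below assumes n = 12 sampled chromosomes, as in the singleton example, and a threshold of 0.1; the function names are illustrative.

```python
def minor_allele_frequency(alt_count, n_chromosomes):
    """Frequency of the rarer of the two alleles at a bi-allelic site."""
    freq = alt_count / n_chromosomes
    return min(freq, 1 - freq)


def passes_maf(alt_count, n_chromosomes, min_maf):
    """True if the site is kept by a minimum MAF filter."""
    return minor_allele_frequency(alt_count, n_chromosomes) >= min_maf


n = 12  # assumed number of sampled chromosomes
for count in (1, 2, 6, 11):
    maf = minor_allele_frequency(count, n)
    print(count, round(maf, 3), passes_maf(count, n, 0.1))
```

With 12 chromosomes, a threshold of 0.1 removes exactly the sites where the minor allele is seen once; with larger samples the same threshold removes sites with several copies of the minor allele.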

Code

system(paste("vcftools --gzvcf", vcf, "--het 2>/dev/null"))
data <- read.table("out.het", header = TRUE)
ggplot(data, aes(x = F, y = INDV)) + geom_point(size = 3) + ggtitle("Inbreeding coefficient")

F=0: Hardy-Weinberg Equilibrium
F>0: deficit of heterozygotes; inbreeding, Wahlund effect (population substructure), allele dropout
F<0: surplus of heterozygotes; could be sample contamination, poor sequence quality (mismapping)
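
For intuition about the sign of F, the sketch below assumes the method-of-moments estimator F = (O − E) / (N − E), where O and E are the observed and expected numbers of homozygous sites and N is the number of sites (corresponding to the O(HOM), E(HOM) and N_SITES columns of out.het). The counts are made up.

```python
def inbreeding_coefficient(obs_hom, exp_hom, n_sites):
    """Method-of-moments F from observed and expected homozygous counts."""
    return (obs_hom - exp_hom) / (n_sites - exp_hom)


# Made-up counts for three individuals genotyped at 1000 sites
print(inbreeding_coefficient(700, 700, 1000))  # as many homozygotes as expected
print(inbreeding_coefficient(800, 700, 1000))  # excess homozygotes, F > 0
print(inbreeding_coefficient(600, 700, 1000))  # excess heterozygotes, F < 0
```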

Filtering VCF with invariant sites

Monkeyflower call set with invariant sites

bcftools stats allsites.vcf.gz | grep "^SN"

SN  0   number of samples:  10
SN  0   number of records:  10195
SN  0   number of no-ALTs:  9330
SN  0   number of SNPs: 429
SN  0   number of MNPs: 0
SN  0   number of indels:   134
SN  0   number of others:   0
SN  0   number of multiallelic sites:   62
SN  0   number of multiallelic SNP sites:   8

Filtering as before but excluding MAF, variant quality filters

Filtering on bam files

Motivation

Some organisms generate a lot of data…

Total variant file size: 7.4T!!!

Without invariant sites!

Solution: sequence masks

…it may be possible for more advanced users to achieve similar results with existing tools. For example, with the inclusion of a user-created “accessibility mask”, it should be possible to avoid the “missing sites” effect…

(Korunes & Samuk, 2021)

Spruce variant files, chromosome 1

48G     PA_chr01_10.vcf.gz
45G     PA_chr01_11.vcf.gz
51G     PA_chr01_12.vcf.gz
50G     PA_chr01_13.vcf.gz
45G     PA_chr01_14.vcf.gz
51G     PA_chr01_15.vcf.gz
51G     PA_chr01_16.vcf.gz
35G     PA_chr01_17.vcf.gz
49G     PA_chr01_1.vcf.gz
50G     PA_chr01_2.vcf.gz
52G     PA_chr01_3.vcf.gz
51G     PA_chr01_4.vcf.gz
54G     PA_chr01_5.vcf.gz
51G     PA_chr01_6.vcf.gz
37G     PA_chr01_7.vcf.gz
51G     PA_chr01_8.vcf.gz
47G     PA_chr01_9.vcf.gz

Coverage tracks and sequence masks

Filters and masks


>LG4 LG4:12000001-12100000
Reference:     GGACAATTACCCCCTCCGTTATGTTTCAGTCAATTTCATGTTTGACTTTTAGATTTTTAA
Coverage mask: 000000000011111111110000000000000011111111111000000000000110

A mask could also represent annotation features, such as exons or four-fold degenerate sites, to be combined with the coverage mask:


>LG4 LG4:12000001-12100000
Reference:     GGACAATTACCCCCTCCGTTATGTTTCAGTCAATTTCATGTTTGACTTTTAGATTTTTAA
Coverage mask: 111111111100000000001111111111111100000000000111111111111001
Exons:         111110000000000000000000000000111111111100000000001111111111

Combined:      111111111100000000001111111111111111111100000111111111111111

Use with vcftools --mask to restrict analyses to certain positions.

NB! Here 0 marks a position that is unmasked (kept), >0 a position that is masked (excluded)
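
The combined mask shown above is consistent with a position-wise OR of the coverage and exon masks: a position is masked if either mask is non-zero there. A minimal sketch, with `combine_masks` as a made-up helper operating on the mask strings from the example:

```python
def combine_masks(*masks):
    """Combine mask strings: a position is masked (1) if ANY mask is non-zero."""
    lengths = {len(m) for m in masks}
    if len(lengths) != 1:
        raise ValueError("all masks must have the same length")
    return "".join(
        "1" if any(ch != "0" for ch in column) else "0"
        for column in zip(*masks)
    )


coverage = "111111111100000000001111111111111100000000000111111111111001"
exons = "111110000000000000000000000000111111111100000000001111111111"

combined = combine_masks(coverage, exons)
print(combined)
```

Only positions that are unmasked (0) in every input remain unmasked in the result, so each additional mask can only shrink the set of accessible sites.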

Bibliography

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., & Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. https://doi.org/10.1038/ng.806

Korunes, K. L., & Samuk, K. (2021). Pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Molecular Ecology Resources, 21(4), 1359–1368. https://doi.org/10.1111/1755-0998.13326

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843–2851. https://doi.org/10.1093/bioinformatics/btu356

Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology, 30(23), 5966–5993. https://doi.org/10.1111/mec.16077

Nazareno, A. G., & Knowles, L. L. (2021). There Is No “Rule of Thumb”: Genomic Filter Settings for a Small Plant Population to Obtain Unbiased Gene Flow Estimates. Frontiers in Plant Science, 12. https://doi.org/10.3389/fpls.2021.677009

Talla, V., Soler, L., Kawakami, T., Dincă, V., Vila, R., Friberg, M., Wiklund, C., & Backström, N. (2019). Dissecting the Effects of Selection and Mutation on Genetic Diversity in Three Wood White (Leptidea) Butterfly Species. Genome Biology and Evolution, 11(10), 2875–2886. https://doi.org/10.1093/gbe/evz212