Base quality scores are factored into the calculation of genotype likelihoods, so if they accurately reflect the probability of sequencing error, bases with low scores also carry useful information. However, base quality scores are sometimes miscalibrated, so noise may be reduced if bases with scores below a threshold (e.g., 20) are either trimmed off prior to analysis or ignored. Alternatively, all base quality scores can be recalibrated based on estimated error profiles in the data (see Section 3.1).
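As a minimal sketch of what a base quality threshold implies, the snippet below converts Phred-scaled scores to error probabilities using the standard relation Q = -10 log10(p), and masks bases below a cut-off. The function names and the choice to mask with "N" rather than trim are illustrative assumptions, not part of any particular tool.

```python
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality to the implied error probability."""
    return 10 ** (-q / 10)


def mask_low_quality(bases: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Replace bases with quality below min_q by 'N' (hypothetical helper)."""
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, quals))


# A threshold of 20 corresponds to an expected error rate of about 1 in 100.
print(phred_to_error_prob(20))

# Bases with quality 12 and 5 fall below the threshold and are masked.
print(mask_low_quality("ACGT", [30, 12, 20, 5]))  # ANGN
```

If the scores are well calibrated, the masked bases would still have contributed (down-weighted) information, which is the trade-off described above.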
Mapping quality is not considered in genotype likelihood estimation in currently available tools, so it is often advisable to remove low-confidence and/or nonuniquely mapped reads prior to analysis (e.g., reads with mapping quality <20). Filtering out reads that do not map in proper pairs should further increase confidence that reads are mapped to the correct location, but could cause biases in regions with structural variation.
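The read-level filter described above can be sketched as a simple predicate on the FLAG and MAPQ fields of an alignment record. The flag bits are those defined in the SAM specification; the function name, its defaults, and the decision to leave unpaired reads untouched by the proper-pair check are assumptions made for illustration.

```python
# SAM FLAG bits (from the SAM specification)
FLAG_PAIRED = 0x1        # read is part of a pair
FLAG_PROPER_PAIR = 0x2   # both mates aligned as a proper pair


def keep_read(flag: int, mapq: int, min_mapq: int = 20,
              require_proper_pair: bool = True) -> bool:
    """Return True if a read passes the mapping quality and pairing filters.

    Hypothetical helper: not taken from any specific tool.
    """
    if mapq < min_mapq:
        return False
    # Only paired reads can fail the proper-pair requirement.
    if require_proper_pair and (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return False
    return True


print(keep_read(flag=99, mapq=60))   # properly paired, high MAPQ
print(keep_read(flag=99, mapq=10))   # fails the MAPQ threshold
print(keep_read(flag=65, mapq=60))   # paired but not a proper pair
```

Setting `require_proper_pair=False` would retain discordant pairs, which may be preferable where structural variation is expected.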
To avoid sites with low or confounding data support in downstream analysis, minimum depth and/or minimum number of individuals filters can be used to exclude sites with much lower sequencing coverage than the rest of the genome (e.g., regions with low unique mapping rates, such as repetitive sequences). Appropriate thresholds will vary between data sets, but could, for example, exclude sites with read data for <50% of individuals (globally or within each population), or with <0.8× average depth across individuals (after filtering on mapping quality).
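To make the example thresholds concrete, the sketch below derives a minimum number of individuals and a minimum depth from the sample size and the average depth, then applies them to a site. It assumes that "depth" means read depth summed across individuals at a site and that the average is taken over sites; the function names and example values are hypothetical.

```python
import math


def min_filter_thresholds(n_individuals: int, mean_depth: float,
                          min_ind_fraction: float = 0.5,
                          min_depth_factor: float = 0.8):
    """Derive minimum-individual and minimum-depth cut-offs (illustrative)."""
    # Sites with data for fewer than this many individuals are excluded.
    min_ind = math.ceil(min_ind_fraction * n_individuals)
    # Sites with summed depth below this value are excluded.
    min_depth = min_depth_factor * mean_depth
    return min_ind, min_depth


def passes_min_filters(n_ind_with_data: int, site_depth: float,
                       min_ind: int, min_depth: float) -> bool:
    """Return True if a site meets both minimum thresholds."""
    return n_ind_with_data >= min_ind and site_depth >= min_depth


# Example values (assumed): 60 individuals, mean summed depth of 75 reads.
min_ind, min_depth = min_filter_thresholds(n_individuals=60, mean_depth=75.0)
print(min_ind, min_depth)

print(passes_min_filters(42, 68.0, min_ind, min_depth))  # enough data
print(passes_min_filters(25, 68.0, min_ind, min_depth))  # too few individuals
```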
Maximum depth filters are used to exclude sites with exceptionally high coverage (e.g., regions that are susceptible to dubious mapping, such as copy number variants). Common maximum depth thresholds could be one or two standard deviations above the median genome-wide depth.
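One way to compute such a maximum depth threshold is sketched below with Python's standard `statistics` module. The function name and the toy depth values are assumptions for illustration; in practice the input would be the per-site depth distribution across the genome.

```python
import statistics
from typing import Sequence


def max_depth_threshold(site_depths: Sequence[float], n_sd: float = 2.0) -> float:
    """Median depth plus n_sd sample standard deviations (illustrative)."""
    median = statistics.median(site_depths)
    sd = statistics.stdev(site_depths)
    return median + n_sd * sd


# Toy example: five sites near 10x and one collapsed region at 60x.
depths = [8, 10, 10, 12, 10, 60]
threshold = max_depth_threshold(depths, n_sd=2.0)
flagged = [d for d in depths if d > threshold]
print(threshold, flagged)
```

Because extreme sites inflate the standard deviation themselves, it may be worth inspecting the depth distribution before settling on the number of standard deviations.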
PCR and optical duplicates can give an inflated impression of how many unique molecules have been sequenced, which could bias genotype likelihood estimation, particularly if one allele has been preferentially amplified. We therefore recommend removing duplicate reads prior to any analysis.
Reads mapped across indels are frequently misaligned, especially if the ends of reads span an indel. To avoid false SNP calls, we recommend either using dedicated tools to realign reads covering indels, using a haplotype-based variant caller (e.g., freebayes or gatk) to estimate genotype likelihoods, or excluding bases flanking indels.
If the DNA insert in a library fragment is shorter than the combined length of paired reads, there will be a section of overlap between the forward and reverse reads. While some variant callers (e.g., gatk) account for the pseudoreplication in overlapping ends of read pairs, the current implementation of angsd treats each end of a read pair as independent, although this may change in a future release (T. Korneliussen, personal communication). When treated as independent, read support for overlapping sections will be “double counted,” which may bias genotype likelihoods. A conservative approach is to soft-clip one of the overlapping read ends.
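The amount of double-counted sequence follows directly from the insert size and read lengths. The sketch below is a simplified calculation that assumes adapter sequence has been trimmed, so that no read extends beyond the insert, and that reads align without gaps; the function name is hypothetical.

```python
def overlap_length(insert_size: int, read1_len: int, read2_len: int) -> int:
    """Number of insert bases covered by both reads of a pair (illustrative)."""
    # After adapter trimming, a read cannot cover more than the insert itself.
    covered1 = min(read1_len, insert_size)
    covered2 = min(read2_len, insert_size)
    return max(0, covered1 + covered2 - insert_size)


print(overlap_length(250, 150, 150))  # 50: reads overlap in the middle
print(overlap_length(400, 150, 150))  # 0: insert longer than both reads combined
print(overlap_length(100, 150, 150))  # 100: both reads cover the whole insert
```

Soft-clipping one read end over this span would leave each insert base supported by at most one read of the pair.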
The significance threshold (often in the form of maximum p-value) can be adjusted to fine-tune the sensitivity of polymorphism detection, with lower p-values leading to fewer, but higher confidence, SNP calls. A commonly used cut-off is 10⁻⁶.
Most software programs for downstream analyses assume that all SNPs are biallelic, so SNPs with more than two alleles can be filtered out in the SNP identification step to avoid violation of such assumptions.
For many types of analysis, such as PCA, admixture analysis, detection of FST outliers and estimation of LD, low-frequency SNPs are uninformative and can even bias results (e.g. Linck & Battey, 2019; Roesti et al., 2012). For those types of analysis, imposing a minimum MAF filter of 1%–10% can substantially reduce computation time. Appropriate thresholds depend on coverage, sample size (i.e., how many allele copies a given MAF threshold corresponds to) and the type of downstream analysis.
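The dependence on sample size can be illustrated by converting a MAF threshold into the expected number of minor allele copies in the sample. The sketch assumes diploid individuals by default and complete data for every individual; the function name is hypothetical.

```python
def maf_to_minor_copies(maf: float, n_individuals: int, ploidy: int = 2) -> float:
    """Expected number of minor allele copies implied by a MAF threshold."""
    return maf * ploidy * n_individuals


# The same 1% threshold means very different things at different sample sizes.
for n in (20, 50, 500):
    print(n, maf_to_minor_copies(0.01, n))
```

With 20 diploid individuals, a 1% threshold corresponds to less than one allele copy, so it would remove few if any variable sites, whereas with 500 individuals it requires about ten copies.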
For comparison of parameter estimates for multiple populations, it is important to ensure that data are obtained for a shared set of sites and that SNP polarization (which allele we track the frequency of) is consistent. For programs such as angsd, where population-specific estimates are obtained by analysing the data from each population separately, a good strategy is to first conduct a global SNP calling with all samples and then restrict population-specific analysis to those SNPs, with consistent major and minor allele designations (-doMajorMinor 3 in angsd) and without any MAF or SNP p-value filter (because such a filter would incorrectly generate “missing data” if a site is fixed in a particular population).