The usual questions:
What evolutionary forces maintain genetic diversity in natural populations? How do diversity levels relate to census population sizes…? Do low levels of diversity limit adaptation to selective pressures?
Leffler et al. (2012)
After allozyme era, the study of genetic diversity was largely neglected due to lack of genome-wide data, but with advent of population genomics becoming a hot topic again.
Ellegren & Galtier (2016)
Lewontin’s paradox: genetic diversity range smaller than variation among species in population size
Reduces diversity at loss \(\propto \frac{1}{N}\)
Adaptive selection decreases variation, more so if acting on new mutations compared to standing variation.
Balancing selection may increase variation.
Low recombination rates lead to less “reshuffling” of variation and hence lower diversity.
Genetic variation patterns are informative of evolutionary and demographic processes. We often use summary statistics to describe the patterns, and to estimate parameters such as effective population size and mutation rate from genetic variation data (\(\theta = 4N_e\mu\))
Often critical first step of analysis, such as
…no global relationship between numerically coded IUCN extinction risk categories and estimated heterozygosity…
Low genetic diversity symptom of past genetic drift inbreeding (higher levels of homozygosity), caused by low \(N_e\)
García-Dorado & Caballero (2021)
However: if population decline is rapid, may be too little time for inbreeding to occur \(\Rightarrow\) genetic diversity within species not necessarily aligned to extinction risk
Lewis (2023)
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
Diversity: for each site, count and sum differences between all (unique) pairs of samples, and divide by unique pairs. For \(n\) samples, there are \(n \choose 2\) such pairs.
Example: for site 0, start comparing samples 0-1 (0 diffs), samples 0-2 (0), samples 0-3 (1), samples 1-2 (0) and so on. Call these differences \(k_{ij}\). Then
\[ \pi = \frac{\sum_{i<j}k_{ij}}{n \choose 2} \]
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
Alternative measure of diversity: simply count the number of segregating sites (\(S\)). However, must correct for the number of samples \(n\) as we expect that more samples \(\Rightarrow\) more sites
\[ \theta_W = \frac{S}{a} = \frac{S}{\sum_{i=1}^{n-1}\frac{1}{i}} \]
Important: under neutrality, \(\theta = E(\pi) = E(\theta_W)\). Difference between two the basis for Tajima’s D that is a test for selection
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
from trees import ts_small_mut as ts
# Calculate correction factor a for Watterson's
# theta: the larger the sample size, the more
# segregating sites we expect to see
a = sum([1/i for i in range(1, ts.num_samples)])
pi = ts.diversity()
thetaW = ts.num_sites / a / ts.sequence_length
print(f"Diversity: {pi:.6f}",
f"Watterson's theta: {thetaW:.6f}",
f"Sequence length: {ts.sequence_length:.0f}",
sep="\n")
Diversity: 0.000267
Watterson's theta: 0.000273
Sequence length: 10000
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
Divergence: for each site, count and sum differences between all pairs of samples between two populations
Example: for site 0, compare samples 0-2 (0 diffs), samples 0-3 (1 diff), samples 1-2 (0 diffs), samples 1-3 (1 diff), and so on. Call differences \(k_ij\), let \(n_X\), \(n_Y\) be sample size in populations \(X\), \(Y\). Then
\[ d_{XY} = \frac{1}{n_Xn_Y}\sum_{i=1}^{n_X}\sum_{i=1}^{n_Y}k_{ij} \]
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
Allele Frequency Difference (AFD) proposed as intuitive alternative to \(F_{\mathrm{ST}}\). For each site, count the difference in allele frequency between two populations.
Example: site 0, frequency in blue is 0, in black 1/2, so difference=1/2, site 1, frequency in blue 0, in black 1/2, and so on
\[ AFD = \frac{1}{2}\sum_{i=1}^n| (f_{i1} - f_{i2})| \]
where \(n\) is the number of different alleles (\(n=2\) for biallelic SNPs), \(f_{ij}\) is the frequency of allele \(i\) in population \(j\)
Berner (2019)
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
\(F_{\mathrm{ST}}\) is another measure of differention among subpopulations (Wright, 1931). It ranges between 0 and 1 and has the interpretation 0: no differentiation, 1: complete fixation of alternate alleles in subpopulations
Example: site 3 has \(F_{\mathrm{ST}}\)=1 as it is fixed in black, not present in blue
There are many ways to express and calculate \(F_{\mathrm{ST}}\). Example:
\[ F_{\mathrm{ST}} = \frac{h_{\mathrm{T}} - h_{\mathrm{S}}}{h_{\mathrm{T}}} \]
where \(h_{\mathrm{T}}\) is the expected heterozygosity in the total population, \(h_{\mathrm{S}}\) the average of expected heterozygosities across subpopulations. For site 3, \(h_{\mathrm{S}}=0\), \(h_{\mathrm{T}}=2/3\).
Caveats: strongly influenced by within subpopulation levels of variation. Therefore considered relative measure (cf \(d_{\mathrm{XY}}\), which is an absolute measure).
sample 0: 00100
sample 1: 00001
sample 2: 01010
sample 3: 10010
from trees import ts_small_mut as ts
sample_sets = [[0, 1], [2, 3]]
win = [0] + [int(x.position)+1 for x in ts.sites()]
_ = win.pop()
win = win + [ts.sequence_length]
fst = ts.Fst(sample_sets=sample_sets)
fst_sites = ts.Fst(sample_sets=sample_sets,
windows=win)
print(f"Site id: {list(range(5))}",
f"Fst per site: {fst_sites}",
f"Overall Fst: {fst:.6f}",
sep="\n")
Site id: [0, 1, 2, 3, 4]
Fst per site: [0. 0. 0. 1. 0.]
Overall Fst: 0.200000
Diversity:
\[ \pi = \frac{\sum_{i<j}k_{ij}}{n \choose 2} \]
Divergence:
\[ d_{XY} = \frac{1}{n_Xn_Y}\sum_{i=1}^{n_X}\sum_{j=1}^{n_Y}k_{ij} \]
Here, \(n\) is the number of samples, \(k_{ij}\) tally of allelic differences between two haplotypes within (\(\pi\)) a population or between (\(d_{XY}\)) populations
Fundamental questions:
Storz (2005), Fig 1
vcftools --gzvcf allsites.vcf.gz --weir-fst-pop PUN-Y.txt \
--weir-fst-pop PUN-R.txt \
--fst-window-size 1000
csvtk plot line --tabs out.windowed.weir.fst \
-x BIN_START -y MEAN_FST \
--point-size 2 --xlab "Position (bp)" \
--ylab "Fst" --title "LG4: PUN-Y vs PUN-R" \
--width 9.0 --height 3.5 --scatter \
> out.windowed.weir.fst.mean.png
Raw data can be converted to Z-scores to highlight outliers. A Z-score is a measure of how far a data point is from the mean in terms of the number of standard deviations:
\[ Z = \frac{X - \mu}{\sigma} \]
Threshold of a couple of standard deviations common.
Properties of genetic variation and inferred demographic history in sampled A. millepora. Fuller et al. (2020), Figure 2. Upper left plot illustrates LD as a function of physical distance. Here, choosing a window size 20-30kb would ensure that most windows are independent.
csvtk filter2 --tabs annotation.gff --filter ' $3 == "CDS" ' |\
csvtk mutate2 --tabs -H -e '$4 - 12000000' -w 0 |\
csvtk mutate2 --tabs -H -e '$5 - 12000000' -w 0 |\
csvtk cut --tabs --fields 1,10,11 | bedtools sort | bedtools merge \
> CDS.bed 2>/dev/null
head -n 3 CDS.bed
LG4 12032 12121
LG4 12214 12658
LG4 12774 12830
Genetic diversity