Introduction to foundations of population genetics with an emphasis on genealogies
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
The main data for molecular population genetics are DNA sequences. The alignment above shows a sample of four DNA sequences. Each sequence has 15 nucleotides (sites) “from the same locus (location) on a chromosome” (Hahn, 2019, p. 2).
Alternative names for sequence:
We will preferentially use sequence or chromosome to refer to an entire sequence, and allele to refer to individual nucleotides that differ.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
* |   | * | * |   | * | * |   | * | * |   | * |   | * |   |
The alignment has 4 DNA sequences where each sequence has length \(L=15\). A site where all nucleotides (alleles) are identical is called a monomorphic site (indicated with asterisks above). There are 9 monomorphic sites.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
|   | * |   |   | * |   |   | * |   |   | * |   | * |   | * |
A site where there are different nucleotides (alleles) is called a segregating site (indicated with asterisks above), often denoted S. There are \(S=6\) segregating sites.
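The counts of monomorphic and segregating sites can be checked with a few lines of R. This is a minimal sketch, assuming the alignment is typed in as four strings; the names `seqs`, `aln` and `segregating` are illustrative and not part of the original material.

```r
# The four aligned sequences from the table above, one string per sequence.
seqs <- c(
  "TTACAATCCGATCGT",
  "TTACGATGCGCTCGT",
  "TCACAATGCGATGGA",
  "TTACGATGCGCTCGT"
)

# Split into a 4 x 15 character matrix (rows = sequences, columns = sites).
aln <- do.call(rbind, strsplit(seqs, ""))

# A site is segregating if it carries more than one distinct nucleotide.
n_alleles <- apply(aln, 2, function(site) length(unique(site)))
segregating <- which(n_alleles > 1)

S <- length(segregating)          # number of segregating sites
n_monomorphic <- ncol(aln) - S    # number of monomorphic sites

print(segregating)
print(c(S = S, monomorphic = n_monomorphic))
```

The positions returned in `segregating` are the columns that the asterisks mark in the table.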
Alternative names for segregating site are:
Here and onwards, mutation refers both to the process that generates new variation and to the new variants generated by this process.
In contrast to a mutation, which corresponds to within-species variation, a substitution refers to a DNA difference between species.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
|   | * |   |   | * |   |   | * |   |   | * |   | * |   | * |
Much of the nucleotide variation we study consists of bi-allelic SNPs. The most common variant is called the major allele, and the least common the minor allele.
The set of alleles found on a single sequence is called a haplotype.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
|   | * |   |   | * |   |   | * |   |   | * |   | * |   | * |
Once we have a sample of sequences we want to describe the observed variation. At any position the ith allele has sample frequency \(p_i\), where the sum of all allele frequencies is 1. For instance, at site 1, \(p_T=1\) (and by extension \(p_A=p_C=p_G=0\)), and at site 2 \(p_C=1/4\) and \(p_T=3/4\).
The heterozygosity at a site \(j\) is given by
\[ h_j = \frac{n}{n-1}\left(1 - \sum_i p_i^2\right) \]
where \(n\) is the number of sequences in the sample (here \(n=4\)), the summation is over all alleles, and \(p_i\) is the frequency of the \(i\)-th allele at site \(j\).
\[ \begin{aligned} h_1 &= \frac{4}{3} \left(1 - p_T^2 \right) = 0 \\ h_2 &= \frac{4}{3} \left(1 - \left(p_C^2 + p_T^2\right) \right) = \frac{4}{3} \left( 1 - \left(\frac{1}{16} + \frac{9}{16}\right)\right) = \frac{1}{2}\\ h_5 &= \frac{4}{3} \left(1 - \left(p_A^2 + p_G^2\right) \right) = \frac{4}{3} \left( 1 - \left(\frac{1}{4} + \frac{1}{4}\right)\right) = \frac{2}{3} \end{aligned} \]
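The three worked values can be reproduced numerically. The sketch below assumes a site is given as a character vector of nucleotides; `site_heterozygosity` is a hypothetical helper name introduced only for illustration.

```r
# Site heterozygosity with the n / (n - 1) sample-size correction:
# h = n / (n - 1) * (1 - sum(p_i^2)).
site_heterozygosity <- function(site) {
  n <- length(site)
  p <- table(site) / n   # sample allele frequencies at this site
  n / (n - 1) * (1 - sum(p^2))
}

# Columns 1, 2 and 5 of the alignment, read top to bottom.
h1 <- site_heterozygosity(c("T", "T", "T", "T"))
h2 <- site_heterozygosity(c("T", "T", "C", "T"))
h5 <- site_heterozygosity(c("A", "G", "A", "G"))

print(c(h1 = h1, h2 = h2, h5 = h5))
```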
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | T | A | C | A | A | T | C | C | G | A | T | C | G | T |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
T | C | A | C | A | A | T | G | C | G | A | T | G | G | A |
T | T | A | C | G | A | T | G | C | G | C | T | C | G | T |
|   | * |   |   | * |   |   | * |   |   | * |   | * |   | * |
The nucleotide diversity is the sum of site heterozygosities:
\[ \pi = \sum_{j=1}^S h_j \]
where \(S\) is the number of segregating sites (monomorphic sites contribute \(h_j=0\)).
Observation: in this sample \(h_j\) is either 1/2 (minor allele seen once) or 2/3 (for sites with \(p_{major}=p_{minor}\)).
\[ \pi = \frac{1}{2} + \frac{2}{3} + \frac{1}{2} + \frac{2}{3} + \frac{1}{2} + \frac{1}{2} = \frac{10}{3} \approx 3.33 \]
Often we provide \(\pi\) per site:
\[ \pi = 3.33/15 \approx 0.222 \]
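As a cross-check, \(\pi\) can be computed in two ways: as the sum of site heterozygosities, and as the mean number of pairwise differences between sequences. The second route is not used in the text above; it is included here as an assumption-labelled sanity check, since the two quantities agree when the \(n/(n-1)\) correction is applied. Variable names are illustrative.

```r
seqs <- c(
  "TTACAATCCGATCGT",
  "TTACGATGCGCTCGT",
  "TCACAATGCGATGGA",
  "TTACGATGCGCTCGT"
)
aln <- do.call(rbind, strsplit(seqs, ""))
n <- nrow(aln)

# Route 1: sum of site heterozygosities (monomorphic sites add 0).
h <- apply(aln, 2, function(site) {
  p <- table(site) / n
  n / (n - 1) * (1 - sum(p^2))
})
pi_sum <- sum(h)

# Route 2 (assumed equivalent definition): mean pairwise differences.
pairs <- combn(n, 2)
diffs <- apply(pairs, 2, function(ij) sum(aln[ij[1], ] != aln[ij[2], ]))
pi_pairwise <- mean(diffs)

# Per-site value: divide by the alignment length L = 15.
pi_per_site <- pi_sum / ncol(aln)

print(c(pi_sum = pi_sum, pi_pairwise = pi_pairwise, per_site = pi_per_site))
```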
Recall: alleles refer to different variants of a sequence at a locus (genomic position).
Whatever the underlying molecular nature (gene, chromosome, nucleotide, protein), let’s represent a locus by a letter, e.g., \(A\) (\(B\) if two loci, and so on).
If a locus has many alleles \(1, 2, \ldots\), we could use indexing \(A_1, A_2, \ldots\).
We will use the combination \(A\), \(a\) for bi-allelic loci from now on.
Example: gene coding for flower color
Genotype: \(aa\), \(Aa\), \(AA\)
Phenotype: [figure: flower colour for each genotype]
Heterozygote has intermediate color phenotype (pink).
We will be interested in looking at the dynamics of alleles, i.e., how their abundances in the population change over time. Therefore we want to measure the frequencies of alleles \(A\) and \(a\).
Example
Assume following population (\(n=10\), with \(n_{AA}=5\), \(n_{Aa}=4\), \(n_{aa}=1\)):
Let \(p\) be frequency of \(A\) alleles, \(q=1-p\) frequency of \(a\) alleles; then
5 \(AA\) individuals, 4 \(Aa\) individuals \(\Rightarrow p=\frac{5\cdot2 + 4\cdot1}{10\cdot2}=\frac{14}{20}=0.7\)
and \(q=1-p=\frac{6}{20}=0.3\)
Inserting frequencies into Punnett square gives expected frequency of offspring genotypes.
|   | \(A\) (\(p=0.7\)) | \(a\) (\(q=0.3\)) |
---|---|---|
\(A\) (\(p=0.7\)) | \(p\cdot p = 0.49\) | \(p\cdot q = 0.21\) |
\(a\) (\(q=0.3\)) | \(q\cdot p = 0.21\) | \(q\cdot q = 0.09\) |
Expected allele frequencies after mating: \(p=p^2 + pq=0.7\), \(q=1-p=0.3\)
For a locus, let \(A\) and \(a\) be two different alleles and let \(p\) be the frequency of the \(A\) allele and \(q=1-p\) the frequency of the \(a\) allele. In the absence of mutation, drift, migration, and other evolutionary processes, the equilibrium state is given by the Hardy-Weinberg equilibrium (HWE).
|   | \(A\) (\(p\)) | \(a\) (\(q\)) |
---|---|---|
\(A\) (\(p\)) | \(p^2\) | \(pq\) |
\(a\) (\(q\)) | \(qp\) | \(q^2\) |

Genotype: | \(AA\) | \(Aa\) | \(aa\) |
---|---|---|---|
Frequency: | \(p^2\) | \(2pq\) | \(q^2\) |
|   | \(f_{AA}\) | \(f_{Aa}\) | \(f_{aa}\) |
Under HWE assumption, neither allele nor genotype frequencies change over time.
Importantly, we can calculate allele frequencies from genotype frequencies and vice versa:
\[ \begin{aligned} p &= f_{AA} + \frac{f_{Aa}}{2} = p^2 + pq\\ q &= f_{aa} + \frac{f_{Aa}}{2} = q^2 + pq \end{aligned} \]
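The round trip from genotype counts to allele frequencies and back to Hardy-Weinberg genotype frequencies can be sketched as follows, using the counts from the example population (5 \(AA\), 4 \(Aa\), 1 \(aa\)). Variable names are illustrative.

```r
# Genotype counts from the example population (n = 10 individuals).
counts <- c(AA = 5, Aa = 4, aa = 1)
n <- sum(counts)

# Allele frequencies: each AA carries two A alleles, each Aa carries one.
p <- unname((2 * counts["AA"] + counts["Aa"]) / (2 * n))
q <- 1 - p

# Expected genotype frequencies after random mating (HWE proportions).
hwe <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Recover the allele frequency from the genotype frequencies:
# p = f_AA + f_Aa / 2, which should equal the starting p.
p_next <- unname(hwe["AA"] + hwe["Aa"] / 2)

print(round(hwe, 2))
print(c(p = p, p_next = p_next))
```

Note that the observed genotype frequencies (0.5, 0.4, 0.1) differ from the expected HWE proportions, while the allele frequency is unchanged by one round of random mating.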
Population genetics is about (Gillespie, 2004)
Questions to ponder:
\(p=0.1 \rightarrow p=0.5 \rightarrow p=0.9\)
Model of populations that describes genealogical relationships of genes (chromosomes) in a population under the following assumptions (Hein et al., 2005):
Let’s formalise the sampling process of the Wright-Fisher model. We assume
Each generation we sample \(2N\) new chromosomes from the previous generation. The probability of choosing a given chromosome \(v\) in one draw is \(1/(2N)\) (a coin flip with probability of success \(1/(2N)\)). Since the trials are independent, and we perform \(2N\) trials, the number of offspring \(k\) of a given chromosome \(v\) is binomially distributed \(\mathrm{Bin}(m, p)\), with parameters \(m=2N\) and probability of success \(p=\frac{1}{2N}\). For large \(N\) this is well approximated by a Poisson distribution with mean \(mp=1\):
\[ P(v=k) \approx \frac{1}{k!}e^{-1} \]
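The quality of the Poisson limit (mean \(2N \cdot \frac{1}{2N} = 1\)) can be checked numerically. A minimal sketch, assuming a hypothetical population of \(N=1000\) diploid individuals; the value of `N` is an arbitrary choice for illustration.

```r
N <- 1000   # hypothetical number of diploid individuals
k <- 0:5    # offspring numbers to compare

# Exact offspring distribution: Bin(2N, 1 / (2N)).
binom <- dbinom(k, size = 2 * N, prob = 1 / (2 * N))

# Poisson approximation with mean 1, i.e. exp(-1) / k!.
pois <- dpois(k, lambda = 1)

max_abs_diff <- max(abs(binom - pois))

print(round(cbind(k, binom, pois), 5))
print(max_abs_diff)
```

About 37% of chromosomes (\(e^{-1}\)) leave no descendants in the next generation under this approximation.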
Mutation
Selection
Recombination
Drift
Alleles can randomly fix or be lost through a process called genetic drift.
Wright-Fisher model showing the evolution of a population of 10 genes over 16 generations. Allele variants are shown in white and black. The starting frequency of the black variant is 0.3.
We assume two alleles \(A\), \(a\), with \(i\) and \(j=2N-i\) copies, respectively, in generation \(t\).
\(i=8\), \(j=2\cdot 6-8=4\)
Let \(p_t=i/2N\) be the frequency of \(A\) in generation \(t\), and \(q_t=1-p_t\) the frequency of \(a\).
\(p_t = 8/12\)
\(p_{t+1} = 4/12\)
The number \(k\) of \(A\) alleles in the next generation is distributed as \(\mathrm{Bin}(2N, \frac{i}{2N})\).
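For the small example with \(N=6\) and \(i=8\), the full distribution of the number of \(A\) alleles in the next generation can be tabulated directly. This is a sketch using the example's numbers; the variable names are illustrative.

```r
N <- 6   # diploid individuals, so 2N = 12 gene copies
i <- 8   # copies of A in generation t (p_t = 8/12)

# Probability of k copies of A in generation t + 1, for k = 0, ..., 2N.
k <- 0:(2 * N)
prob_k <- dbinom(k, size = 2 * N, prob = i / (2 * N))

# Probability of the particular transition in the example (8 -> 4 copies).
p_four <- prob_k[k == 4]

# The expected number of copies is unchanged: drift has no direction.
expected_k <- sum(k * prob_k)

print(p_four)
print(expected_k)
```

The jump from 8 to 4 copies is possible but unlikely in a single generation, while the expected count stays at \(i\).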
To capture dynamics, follow allele frequency trajectory (\(p_t\)) as function of time.
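##' Wright-Fisher model: follow an allele frequency trajectory
##'
##' @param p0 Starting frequency of the focal allele
##' @param n Population size in gene copies (2N for N diploid individuals)
##' @param generations Number of generations to simulate
##' @return Numeric vector of allele frequencies, one per generation
##'
wright_fisher <- function(p0, n, generations) {
  x <- vector(mode = "numeric", length = generations)
  x[1] <- p0
  # seq_len(generations)[-1] is empty when generations == 1,
  # whereas seq(2, 1) would wrongly iterate over c(2, 1)
  for (i in seq_len(generations)[-1]) {
    # Binomial sampling of n gene copies, converted back to a frequency
    x[i] <- rbinom(1, size = n, prob = x[i - 1]) / n
  }
  x
}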
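To see drift in action, one can run many replicate trajectories and count how often the allele fixes or is lost. The sketch below restates the sampler under the hypothetical name `simulate_wf` so that it runs on its own; the seed, the number of replicates and the parameter values are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Compact restatement of the binomial sampler (hypothetical name).
simulate_wf <- function(p0, n, generations) {
  x <- numeric(generations)
  x[1] <- p0
  for (i in seq_len(generations)[-1]) {
    x[i] <- rbinom(1, size = n, prob = x[i - 1]) / n
  }
  x
}

# 200 replicate populations of 20 gene copies, 200 generations each.
# Each column of traj is one trajectory.
traj <- replicate(200, simulate_wf(p0 = 0.3, n = 20, generations = 200))

final <- traj[nrow(traj), ]
frac_fixed <- mean(final == 1)   # replicates where the allele fixed
frac_lost <- mean(final == 0)    # replicates where the allele was lost

print(c(fixed = frac_fixed, lost = frac_lost))
```

With a starting frequency of 0.3, the fraction of replicates ending in fixation should be in the neighbourhood of 0.3, in line with the neutral fixation probability \(\pi(p)=p\).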
Instead of looking at frequencies let’s switch to distributions of alleles for one individual, one locus. Then there are three possible genotypes (states) \(aa\), \(aA\), and \(AA\). Let \(n=0,1,2\) be an integer corresponding to each genotype (i.e., it counts the number of \(A\) alleles).
Assume the individual mates with itself at random(!), starting in any of the three states. How does the distribution evolve?
[Figure panels: genotype distribution at \(t=0\), \(t=1\), and \(t=2\)]
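One way to follow the distribution over time is to treat the three genotypes as states of a Markov chain. The transition matrix below is an assumption based on Mendelian segregation under selfing (an \(Aa\) parent gives \(aa\), \(Aa\), \(AA\) offspring with probabilities 1/4, 1/2, 1/4); it is a sketch of the idea, not necessarily the exact setup behind the original figure.

```r
# States: number of A alleles (0 = aa, 1 = Aa, 2 = AA).
# Rows are the parental genotype, columns the offspring genotype.
states <- c("aa", "Aa", "AA")
P <- matrix(
  c(1,    0,   0,
    0.25, 0.5, 0.25,
    0,    0,   1),
  nrow = 3, byrow = TRUE,
  dimnames = list(from = states, to = states)
)

# Start with certainty in the heterozygous state at t = 0.
d0 <- c(aa = 0, Aa = 1, AA = 0)

# One matrix product per generation.
d1 <- d0 %*% P
d2 <- d1 %*% P

print(rbind(t0 = d0, t1 = as.numeric(d1), t2 = as.numeric(d2)))
```

The probability of being heterozygous halves each generation, and the mass accumulates in the two absorbing states \(aa\) and \(AA\).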
Mathematical treatment of drift can become complicated: easier to study dynamics of heterozygosity
Let \(\mathcal{H}_t\) be the probability that two alleles are different by state. One can show that the time course evolution of \(\mathcal{H}_t\) in a randomly mating population consisting of \(N\) diploid hermaphroditic individuals is
\[ \mathcal{H}_t = \mathcal{H}_0 \left( 1 - \frac{1}{2N} \right)^t \]
Important consequence: heterozygosity in a WF population is lost at rate \(1/2N\) per generation.
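The decay formula is easy to evaluate. A minimal sketch, assuming a hypothetical population of \(N=50\) diploid individuals with starting heterozygosity 0.5; `heterozygosity` is an illustrative helper name.

```r
# H_t = H_0 * (1 - 1 / (2N))^t
heterozygosity <- function(h0, N, t) h0 * (1 - 1 / (2 * N))^t

N <- 50     # hypothetical population size (diploid individuals)
h0 <- 0.5   # hypothetical starting heterozygosity

t <- c(0, 10, 100)
h <- heterozygosity(h0, N, t)

# Generations until heterozygosity has halved; close to 2N * log(2).
t_half <- log(0.5) / log(1 - 1 / (2 * N))

print(round(h, 4))
print(t_half)
```

The half-life scales linearly with \(N\), which is why small populations lose variation quickly.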
Example of how rapid decline in population size can affect heterozygosity.
Population size influences genetic diversity!
However, census population size not (always) the correct measure.
Assumptions underlying Wright-Fisher model seldom fulfilled for natural populations. In particular
Therefore, the magnitude of drift experienced by a population differs from that predicted by its census population size.
Technically correct definition (but see Waples (2022)):
\(N_e\) is the size of an ideal population that would experience the same rate of genetic drift as the population in question.
Genetic drift “moves” frequencies to the point that variation is lost via allele fixation or loss. New variation is introduced through mutation. We typically assume mutations are described by a Poisson process with rate \(\mu\) (per generation).
The mutation rate is denoted \(\mu\), and the population scaled mutation rate is \(2N_e\mu\) for haploid populations, \(4N_e\mu\) for diploid, where \(N_e\) is the effective population size.
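Mutation-drift balance is reached when the diversity lost due to drift equals the diversity gained due to mutation.
Observation: most mutations are in fact lost.
Recall: the fixation probability of a neutral allele at frequency \(p\) is \(\pi(p)=p\), so a new mutation (\(p=1/2N\)) is lost with probability \(1-1/2N\).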
Drift removes variation. Mutation reintroduces it. At equilibrium the change in variation by definition is 0. In terms of \(\mathcal{H}_t\) (the probability that two alleles are not identical by state), \(\Delta\mathcal{H}=0\).
One can show that the equilibrium heterozygosity is given by the classical formula
\[ \hat{\mathcal{H}} = \frac{4N_e\mu}{1 + 4N_e\mu} \]
\(\mu\) is often assumed known, and heterozygosity is easily calculated from data, which provides a way of estimating \(N_e\).
The compound parameter \(4N_e\mu\) is called the population scaled mutation rate and is commonly named \(\theta\) such that
\[ \hat{\mathcal{H}} = \frac{\theta}{1 + \theta} \]
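Solving \(\hat{\mathcal{H}} = \theta/(1+\theta)\) for \(\theta\) gives \(\theta = \hat{\mathcal{H}}/(1-\hat{\mathcal{H}})\), and hence \(N_e = \hat{\mathcal{H}}/(4\mu(1-\hat{\mathcal{H}}))\). The sketch below illustrates the round trip; the mutation rate and population size are hypothetical values chosen for illustration, and the helper names are not from the original material.

```r
# Equilibrium heterozygosity as a function of theta = 4 * Ne * mu.
equilibrium_h <- function(theta) theta / (1 + theta)

# Inverse: estimate Ne from heterozygosity, assuming mu is known.
ne_from_h <- function(h, mu) h / (4 * mu * (1 - h))

mu <- 1e-8   # hypothetical mutation rate per site per generation
Ne <- 1e4    # hypothetical effective population size

theta <- 4 * Ne * mu
h_hat <- equilibrium_h(theta)
ne_back <- ne_from_h(h_hat, mu)

print(c(theta = theta, h_hat = h_hat, ne_back = ne_back))
```

For small \(\theta\), \(\hat{\mathcal{H}} \approx \theta\), so heterozygosity itself is a rough estimate of the population scaled mutation rate.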
Mutation-drift balance, together with the observation during the 1950s and 1960s that polymorphism was more common than expected, is the foundation of the neutral theory of evolution (Kimura, 1983): allele frequencies may change and fix due to chance alone and not selection; most mutations behave as if they are neutral.
The nearly neutral theory (Ohta, 1973) was later developed to explain the failure of the neutral theory to predict how polymorphism scales with population size: most mutations are not neutral but slightly deleterious and purged from the population by natural selection.
Mutations enter populations and may be fixed by drift. Therefore, with time there will be fixed differences, or substitutions, between populations or species (typically in the evolution of species). In molecular evolution, the substitution rate, \(\rho\), is the most interesting quantity.
The total number of new mutations in every generation is \(2N\mu\) (total number of gametes times mutation rate).
Each new neutral mutation fixes with probability \(1/2N\).
Therefore, the average rate of substitution, \(\rho\), is \(2N\mu\times1/2N\), or
\[ \rho=\mu \]
which is independent of population size!
Practical implication: we can estimate mutation rate from the substitution rate at neutrally evolving sites (e.g., Kumar & Subramanian (2002))
Much confusion exists in the literature regarding how various types of selection are defined, in particular because some of the terminology is used slightly differently within different scientific communities (Nielsen, 2005).
\[ \begin{matrix} \mathrm{Genotype} & AA & Aa & aa \\ \mathrm{Frequency\ in\ newborns} & p^2 & 2pq & q^2\\ \mathrm{Viability} & w_{AA} & w_{Aa} & w_{aa}\\ \mathrm{Frequency\ after\ selection} & p^2w_{AA} / \bar{w} & 2pqw_{Aa} / \bar{w} & q^2w_{aa} / \bar{w} \\ \mathrm{Relative\ fitness} & 1 & 1-hs & 1-s\\ \end{matrix} \]
where \(\bar{w} = p^2w_{AA} + 2pqw_{Aa} + q^2w_{aa}\) is the mean fitness.
\(h=0\) | \(A\) dominant, \(a\) recessive |
\(h=1\) | \(a\) dominant, \(A\) recessive |
\(0<h<1\) | incomplete dominance |
\(h<0\) | overdominance (heterozygote advantage) |
\(h>1\) | underdominance |
Notation follows Gillespie (2004), pp. 61–64.
\[ p^\prime - p = \Delta_sp = \frac{pq[p(w_{AA} - w_{Aa}) + q(w_{Aa} - w_{aa})]}{p^2w_{AA} + 2pqw_{Aa} + q^2w_{aa}} \]
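The recursion can be iterated to follow an allele under selection. A minimal sketch using the relative fitnesses \(1\), \(1-hs\), \(1-s\) from the table; the values of `s`, `h` and the starting frequency are hypothetical, and `delta_p` is an illustrative helper name.

```r
# One-generation change in the frequency of A under viability selection.
delta_p <- function(p, w_AA, w_Aa, w_aa) {
  q <- 1 - p
  w_bar <- p^2 * w_AA + 2 * p * q * w_Aa + q^2 * w_aa  # mean fitness
  p * q * (p * (w_AA - w_Aa) + q * (w_Aa - w_aa)) / w_bar
}

s <- 0.1   # hypothetical selection coefficient against aa
h <- 0.5   # hypothetical dominance coefficient (incomplete dominance)

# Iterate p' = p + delta_p for 200 generations from a rare allele.
p <- numeric(200)
p[1] <- 0.01
for (t in seq_along(p)[-1]) {
  p[t] <- p[t - 1] + delta_p(p[t - 1], 1, 1 - h * s, 1 - s)
}

print(round(p[c(1, 50, 100, 200)], 4))
```

With all three fitnesses equal, `delta_p` returns 0 and the allele frequency stays put, as expected under Hardy-Weinberg conditions.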
In the red region (\(|N_es|<0.05\)) the probability of fixation is within about 10% of the neutral fixation probability.
Consequence: for any population size there exists a range of selection coefficients where mutant alleles behave as if they were neutral (effective neutrality).
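The size of the effectively neutral range can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation. This is an assumption: the text does not state which formula underlies the figure, and the sketch further assumes \(N=N_e\) and a parametrisation in which the fixation probability relative to the neutral value \(1/2N\) is \(4N_es/(1-e^{-4N_es})\). `relative_fixation` is a hypothetical helper name.

```r
# Fixation probability of a new mutation relative to the neutral value
# 1 / (2N), under the assumed diffusion approximation with N = Ne.
relative_fixation <- function(Nes) {
  x <- 4 * Nes
  # The limit as x -> 0 is 1 (strict neutrality).
  ifelse(x == 0, 1, x / (1 - exp(-x)))
}

Nes <- c(-0.05, 0, 0.05)
ratio <- relative_fixation(Nes)

print(round(ratio, 3))
```

Under these assumptions the ratio stays roughly between 0.90 and 1.10 at the edges of the range \(|N_es|<0.05\), which is consistent with the statement about the red region.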
For genes, the ratio of nonsynonymous to synonymous substitutions can tell us about protein evolution:
Synonymous substitution
Protein L
DNA --- CTT ---
*
DNA --- CTC ---
Protein L
Nonsynonymous substitution
Protein L
DNA --- CTT ---
*
DNA --- CAT ---
Protein H
Not all mutations fall in genes. Methods for detecting direct selection are not applicable to studying selection on a single mutation or, e.g., balancing selection. This requires looking for specific patterns of diversity surrounding the locus under selection.
Example of a selective sweep. If a sweep completes at a locus, it will become monomorphic, as will the neighbouring sites. Mutation could reintroduce variation. Recombination could increase diversity in neighbourhood, but in a manner that depends on the distance from the locus under selection.
Miller (2020), Fig. 5.12.3
Once per chromosome! But: rates vary between loci (hotspots) and between sex chromosomes and autosomes, and in some species recombination only occurs in one sex (e.g., D. melanogaster).
Main effect: association between loci breaks up.
Association between loci can be written as:
\[ D_{AB} = p_{AB} - p_Ap_B \]
Similar expressions hold for other pairs; only need to know one \(D_{ij}\) (e.g., \(D_{AB}\)) so drop subscript and rewrite:
\[ p_{AB}= p_Ap_B + D \]
If \(D\neq0\) the loci are in linkage disequilibrium.
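One can show that \(D\) decays over time as
\[ D_t = (1-r)^tD_0 \]
where \(r\) is the recombination rate between the two loci per generation.
Recombination decreases \(D\) (linkage disequilibrium).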
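Both the definition of \(D\) and its decay are simple to compute. A minimal sketch with hypothetical haplotype and allele frequencies and a hypothetical recombination rate; `ld_decay` is an illustrative helper name.

```r
# D = p_AB - p_A * p_B for hypothetical frequencies.
p_AB <- 0.4
p_A <- 0.5
p_B <- 0.6
D0 <- p_AB - p_A * p_B

# Decay of D over t generations with recombination rate r.
ld_decay <- function(d0, r, t) d0 * (1 - r)^t

r <- 0.01   # hypothetical recombination rate between the loci
t <- c(0, 10, 100)
D_t <- ld_decay(D0, r, t)

# Generations until D has halved.
t_half <- log(0.5) / log(1 - r)

print(round(D_t, 4))
print(t_half)
```

Tightly linked loci (small \(r\)) keep their association for many generations, while unlinked loci (\(r=0.5\)) lose half of it every generation.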
The amount of diversity depends on fixation time. A neutral allele that fixes takes on average \(4N_e\) generations to do so; for an advantageous allele with \(s=0.0001\), it takes approximately \(0.29N_e\) generations.
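The fixation time under selection depends on \(N_e\) as well as \(s\). A commonly used approximation for the duration of a sweep is \(2\ln(2N_e)/s\) generations; this formula and the value \(N_e=10^6\) below are assumptions, not stated in the text, but together they give a number close to the one quoted for \(s=0.0001\). `sweep_time` is a hypothetical helper name.

```r
# Assumed approximation for the fixation time of a beneficial allele.
sweep_time <- function(Ne, s) 2 * log(2 * Ne) / s

Ne <- 1e6   # hypothetical effective population size
s <- 1e-4   # selection coefficient from the text

t_sel <- sweep_time(Ne, s)   # generations to fixation under selection
t_neutral <- 4 * Ne          # average neutral fixation time

# Express both in units of Ne generations.
print(c(selected = t_sel / Ne, neutral = t_neutral / Ne))
```

Under these assumptions the selected allele fixes more than ten times faster than a neutral one, leaving less time for new variation to accumulate around it.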
Selection changes the genealogy (different topology, shorter branches), an aspect used in many linkage-based tests for selection.
We have looked at the Wright-Fisher model as a model of populations and genealogies.
Genetic drift moves allele frequencies up and down at random and removes variation at rate \(\propto 1/2N\)
Mutation reintroduces variation. The neutral theory posits that most mutations are neutral and that their dynamics follow mutation-drift equilibrium.
Methods to detect selection are based on direct selection or studying patterns of variation caused by linked selection.