Population genetics

Foundations

Per Unneberg

NBIS

15-Nov-2023

Intended learning outcomes

Introduction to foundations of population genetics with an emphasis on genealogies

  • Description of DNA variation data
  • Wright-Fisher population model and genealogies
  • Genetic drift
  • Wright-Fisher model with mutation
  • Mutation-drift balance
  • Neutral theory
  • Selection basics

DNA variation

DNA variation

Sequence alignment of four DNA sequences (Hahn, 2019, Fig 1.1).
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 T  T  A  C  A  A  T  C  C  G  A  T  C  G  T
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T
 T  C  A  C  A  A  T  G  C  G  A  T  G  G  A
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T

The main data for molecular population genetics are DNA sequences. The alignment above shows a sample of four DNA sequences. Each sequence has 15 nucleotides (sites) “from the same locus (location) on a chromosome” (Hahn, 2019, p. 2).

Alternative names for sequence:

  • chromosome
  • gene
  • allele (different by origin)
  • sample
  • cistron

We will preferentially use sequence or chromosome to refer to an entire sequence, and allele to refer to individual nucleotides that differ.

DNA variation - monomorphic sites

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 T  T  A  C  A  A  T  C  C  G  A  T  C  G  T
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T
 T  C  A  C  A  A  T  G  C  G  A  T  G  G  A
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T
 *     *  *     *  *     *  *     *     *

The alignment has 4 DNA sequences where each sequence has length \(L=15\). A site where all nucleotides (alleles) are identical is called a monomorphic site (indicated with asterisks above). There are 9 monomorphic sites.

DNA variation - segregating sites

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 T  T  A  C  A  A  T  C  C  G  A  T  C  G  T
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T
 T  C  A  C  A  A  T  G  C  G  A  T  G  G  A
 T  T  A  C  G  A  T  G  C  G  C  T  C  G  T
    *        *        *        *     *     *

A site where there are different nucleotides (alleles) is called a segregating site (indicated with asterisks above), often denoted S. There are \(S=6\) segregating sites.
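The classification of sites can be checked with a few lines of R. This is a minimal sketch: the variable names (`seqs`, `aln`, `n_alleles`) are illustrative choices, and the sequences are typed in from the alignment above.

```r
# The four aligned sequences from the example alignment
seqs <- c(
    "TTACAATCCGATCGT",
    "TTACGATGCGCTCGT",
    "TCACAATGCGATGGA",
    "TTACGATGCGCTCGT"
)
# One row per sequence, one column per site
aln <- do.call(rbind, strsplit(seqs, ""))
# Number of distinct alleles observed at each site
n_alleles <- apply(aln, 2, function(site) length(unique(site)))
monomorphic <- which(n_alleles == 1)
segregating <- which(n_alleles > 1)
print(monomorphic)
print(segregating)
```

The length of `segregating` is the number of segregating sites \(S\).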

Alternative names for segregating site are:

  • polymorphism
  • mutation
  • single nucleotide polymorphism (SNP)

Mutation here and onwards refers both to the process that generates new variation and to the new variants generated by this process.

In contrast to mutation which corresponds to within-species variation, a substitution refers to DNA differences between species.

DNA variation - major and minor alleles

Much of the nucleotide variation we study consists of bi-allelic SNPs. The most common variant is called the major allele, and the least common the minor allele.

The set of alleles found on a single sequence is called a haplotype.

Describing DNA variation - heterozygosity

Once we have a sample of sequences we want to describe the observed variation. At any position the ith allele has sample frequency \(p_i\), where the sum of all allele frequencies is 1. For instance, at site 1, \(p_T=1\) (and by extension \(p_A=p_C=p_G=0\)), and at site 2 \(p_C=1/4\) and \(p_T=3/4\).

Heterozygosity

The heterozygosity at a site \(j\) is given by

\[ h_j = \frac{n}{n-1}\left(1 - \sum_i p_i^2\right) \]

where the summation is over all alleles, \(p_i\) is the frequency of the \(i\)-th allele, and \(n\) is the number of sequences in the sample

Exercise: calculate the heterozygosity at sites 1, 2 and 5

\[ h_1 = \frac{4}{3} \left(1 - p_T^2 \right) = 0 \\ h_2 = \frac{4}{3} \left(1 - \left(p_C^2 + p_T^2\right) \right) = \frac{4}{3} \left( 1 - \left(\frac{1}{16} + \frac{9}{16}\right)\right) = \frac{1}{2}\\ h_5 = \frac{4}{3} \left(1 - \left(p_A^2 + p_G^2\right) \right) = \frac{4}{3} \left( 1 - \left(\frac{1}{4} + \frac{1}{4}\right)\right) = \frac{2}{3} \]
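The exercise can be reproduced in R. The sketch below assumes the four sequences of the example alignment; the helper name `site_heterozygosity` is made up for illustration.

```r
# The four aligned sequences from the example alignment
seqs <- c(
    "TTACAATCCGATCGT",
    "TTACGATGCGCTCGT",
    "TCACAATGCGATGGA",
    "TTACGATGCGCTCGT"
)
aln <- do.call(rbind, strsplit(seqs, ""))

# h_j = n / (n - 1) * (1 - sum_i p_i^2) for the alleles at one site
site_heterozygosity <- function(site) {
    n <- length(site)
    p <- as.numeric(table(site)) / n
    n / (n - 1) * (1 - sum(p^2))
}

h <- apply(aln, 2, site_heterozygosity)
# Heterozygosity at sites 1, 2 and 5
print(h[c(1, 2, 5)])
```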

Describing DNA variation - nucleotide diversity

Nucleotide diversity \(\pi\)

The nucleotide diversity is the sum of site heterozygosities:

\[ \pi = \sum_{j=1}^S h_j \]

where \(S\) is the number of segregating sites

Calculate the nucleotide diversity

Observation: \(h_j\) is either 1/2 or 2/3 (the latter for sites with \(p_{major}=p_{minor}\)).

\[ \pi = \frac{1}{2} + \frac{2}{3} + \frac{1}{2} + \frac{2}{3} + \frac{1}{2} + \frac{1}{2} = \frac{10}{3} \approx 3.33 \]

Often we provide \(\pi\) per site:

\[ \pi = 3.33/15 = 0.222 \]
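Nucleotide diversity can equivalently be computed as the average number of differences between all pairs of sequences, which gives the same value as the sum of site heterozygosities. The sketch below checks this for the example alignment; the names `pi_total` and `pi_site` are illustrative (R already defines `pi` as a constant, so that name is avoided).

```r
# The four aligned sequences from the example alignment
seqs <- c(
    "TTACAATCCGATCGT",
    "TTACGATGCGCTCGT",
    "TCACAATGCGATGGA",
    "TTACGATGCGCTCGT"
)
aln <- do.call(rbind, strsplit(seqs, ""))

# All pairs of sequences (columns of a 2 x 6 matrix)
pairs <- combn(nrow(aln), 2)
# Number of differing sites for each pair
diffs <- apply(pairs, 2, function(ij) sum(aln[ij[1], ] != aln[ij[2], ]))

pi_total <- mean(diffs)           # average pairwise differences
pi_site <- pi_total / ncol(aln)   # per-site nucleotide diversity
print(c(pi_total = pi_total, pi_site = pi_site))
```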

Alleles as algebraic entities

Recall: alleles refer to different variants of a sequence at a locus (genomic position).

Whatever the underlying molecular nature (gene, chromosome, nucleotide, protein), let’s represent a locus by a letter, e.g., \(A\) (\(B\) if two loci, and so on)

If locus has many alleles \(1, 2, ...\) , could use indexing \(A_1, A_2, ...\).

Will use combination \(A\), \(a\) for bi-allelic loci from now on

Example: gene coding for flower color

\(A\) red color

\(a\) white color

Punnett square

\          A     a
A         AA    Aa
a         aA    aa

Genotype     aa       Aa      AA
Phenotype    white    pink    red

Heterozygote has intermediate color phenotype (pink).

Alleles and frequencies

We will be interested in looking at the dynamics of alleles, i.e., how their abundances in the population change over time. Therefore we want to measure the frequencies of alleles \(A\) and \(a\).

Example

Assume following population (\(n=10\), with \(n_{AA}=5\), \(n_{Aa}=4\), \(n_{aa}=1\)):

Let \(p\) be frequency of \(A\) alleles, \(q=1-p\) frequency of \(a\) alleles; then

5 \(AA\) individuals, 4 \(Aa\) individuals \(\Rightarrow p=\frac{5\cdot2 + 4\cdot1}{10\cdot2}=\frac{14}{20}=0.7\)

and \(q=1-p=\frac{6}{20}=0.3\)

Inserting frequencies into Punnett square gives expected frequency of offspring genotypes.

\    \(A\) (\(p=0.7\))    \(a\) (\(q=0.3\))
\(A\) (\(p=0.7\))    \(p\cdot p = 0.49\)    \(p\cdot q = 0.21\)
\(a\) (\(q=0.3\))    \(q\cdot p = 0.21\)    \(q\cdot q = 0.09\)

Expected allele frequencies after mating: \(p=p^2 + pq=0.7\), \(q=1-p=0.3\)
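The arithmetic of this example can be written out in R. This is a small sketch using the genotype counts given above; the variable names are illustrative.

```r
# Genotype counts from the example population (n = 10 individuals)
counts <- c(AA = 5, Aa = 4, aa = 1)
n <- sum(counts)

# Each AA individual carries two A alleles, each Aa individual one
p <- unname((2 * counts["AA"] + counts["Aa"]) / (2 * n))
q <- 1 - p

# Expected offspring genotype frequencies under random mating
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Allele frequency of A among the offspring
p_next <- unname(expected["AA"] + expected["Aa"] / 2)
print(expected)
print(c(p = p, p_next = p_next))
```

The allele frequency is unchanged after one round of random mating, even though the genotype frequencies differ from those in the starting population.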

In the absence of evolutionary forces, allele frequencies are in equilibrium

The Hardy-Weinberg equilibrium

For a locus, let \(A\) and \(a\) be two different alleles and let \(p\) be the frequency of the \(A\) allele and \(q=1-p\) the frequency of the \(a\) allele. In the absence of mutation, drift, migration, and other evolutionary processes, the equilibrium state is given by the Hardy-Weinberg equilibrium (HWE).

\    \(A\) (\(p\))    \(a\) (\(q\))
\(A\) (\(p\))    \(p^2\)    \(pq\)
\(a\) (\(q\))    \(qp\)    \(q^2\)

Genotype:     \(AA\)    \(Aa\)    \(aa\)
Frequency:    \(p^2\)    \(2pq\)    \(q^2\)
Notation:     \(f_{AA}\)    \(f_{Aa}\)    \(f_{aa}\)

Under HWE assumption, neither allele nor genotype frequencies change over time.

Importantly, we can calculate allele frequencies from genotype frequencies and vice versa:

\[ p = f_{AA} + \frac{f_{Aa}}{2} = p^2 + pq\\ q = f_{aa} + \frac{f_{Aa}}{2} = q^2 + pq \]
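The step that makes the allele frequency constant is simply \(p+q=1\). Writing \(p^\prime\) for the frequency of \(A\) in the next generation (a symbol introduced here for illustration):

\[ p^\prime = f_{AA} + \frac{f_{Aa}}{2} = p^2 + pq = p(p + q) = p \]

and likewise \(q^\prime = q^2 + pq = q(q + p) = q\).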

Do natural populations mate randomly?

Figure 1: Hardy-Weinberg proportions in 10,000 SNPs on chromosome 22 from three populations based on 1000 genomes data. For each SNP, genotypes are given as counts (minor/heterozygote/major), converted to frequencies and plotted on the y-axis. Allele frequencies are obtained from genotype frequencies and plotted on the x-axis. Most observations follow HWE proportions. Deviations from HWE can indicate sample QC issues, or that there is population structure. Illustration inspired by cooplab (2011).

The obsession of population genetics

Population genetics is about (Gillespie, 2004)

  1. describing the genetic structure of populations
  2. constructing theories on the forces that influence genetic variation

Questions to ponder:

  • why does variation look the way it does?
  • how is variation maintained?
  • how does variation change over time (\(\Delta p\))?
  • what forces shape the genetic structure of populations?

\(p=0.1\) \(\rightarrow\) \(p=0.5\) \(\rightarrow\) \(p=0.9\)

Models of populations

Wright-Fisher model

Model of populations that describes genealogical relationships of genes (chromosomes) in a population under the following assumptions (Hein et al., 2005):

  • discrete and non-overlapping generations
  • haploid individuals or two subpopulations (males and females)
  • constant population size
  • all individuals are equally fit
  • population has no geographical or social structure
  • no recombination

Algorithm

  1. Set up starting population at time zero
  2. Add offspring (same size) at time one
  3. Select a parent for each offspring at random
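The algorithm can be sketched in R as follows. The population size, number of generations, and the helper `trace_lineage` are assumptions chosen for illustration; repeating steps 2 and 3 for every generation gives a full genealogy.

```r
set.seed(1)
n_chrom <- 10       # haploid population size (assumed for illustration)
generations <- 16   # number of offspring generations to add

# Steps 2 and 3, repeated: row t holds, for each of the n_chrom offspring in
# generation t, the index of its randomly chosen parent in generation t - 1
parents <- t(replicate(generations, sample(n_chrom, n_chrom, replace = TRUE)))

# Follow one present-day chromosome k back to the starting population.
# Element t + 1 of the result is the ancestor's index in generation t.
trace_lineage <- function(parents, k) {
    g <- nrow(parents)
    lineage <- integer(g + 1)
    lineage[g + 1] <- k
    for (t in rev(seq_len(g))) {
        lineage[t] <- parents[t, lineage[t + 1]]
    }
    lineage
}

l1 <- trace_lineage(parents, 1)
l2 <- trace_lineage(parents, 2)
# Generations (0-based) in which the two lineages share an ancestor, if any
print(which(l1 == l2) - 1)
```

Looking backwards in time, two lineages that pick the same parent merge and share all earlier ancestors; this is the genealogical view of the model.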

Wright-Fisher model

Figure 2: Wright-Fisher model

Wright-Fisher model

Figure 3: WF model indicating time direction from past (top) to present (bottom).

Figure 4: WF model tracing the genealogies of three extant chromosomes

The Wright-Fisher sampling model

Let’s formalise the sampling process of the Wright-Fisher model. We assume

  1. a single locus in a haploid population of size \(2N\) (or a diploid population of size \(N\) with random mating)
  2. no mutation and no selection
  3. discrete generations

Each generation we sample \(2N\) new chromosomes from the previous generation. The probability of choosing a chromosome \(v\) is \(1/2N\) (coin flip with probability of success \(1/2N\)). Since the trials are independent, and we perform \(2N\) trials, the number of offspring \(k\) of a given chromosome \(v\) is binomially distributed \(\mathrm{Bin}(m, p)\), with parameters \(m=2N\) and probability of success \(p=\frac{1}{2N}\).

Properties of Wright-Fisher sampling

The expected number of offspring is one

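Poisson approximation (mean 1) for large \(N\)

\[ P(v=k) \approx \frac{1}{k!}e^{-1} \]

Prob(pick same parent) = 1/2N

Time for two sequences to coalesce: geometrically distributed with success probability \(1/2N\), so the expected time is \(2N\) generations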
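These properties are easy to check numerically. The sketch below compares the binomial offspring distribution with a Poisson distribution with mean one; the value of `N` is an arbitrary choice for illustration.

```r
N <- 500   # diploid population size (assumed for illustration)
k <- 0:5   # number of offspring of a given chromosome

# Exact binomial probabilities and their Poisson(1) approximation
binom <- dbinom(k, size = 2 * N, prob = 1 / (2 * N))
pois <- dpois(k, lambda = 1)
print(round(cbind(k, binom, pois), 4))

# Expected number of offspring: number of trials times success probability
expected_offspring <- 2 * N * (1 / (2 * N))

# Two lineages pick the same parent with probability 1/2N per generation,
# so the waiting time is geometric with mean 2N generations
expected_coalescence_time <- 1 / (1 / (2 * N))
print(c(expected_offspring, expected_coalescence_time))
```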

Origin and change of variation

  • Mutation
  • Selection
  • Recombination
  • Drift

Wright-Fisher model with alleles

Alleles can randomly fix or be lost through a process called genetic drift

Wright-Fisher model showing the evolution of a population of 10 genes over 16 generations. Allele variants are shown in white and black. The starting frequency of the black variant is 0.3.

Binomial process models allele sampling

We assume two alleles \(A\) and \(a\), with \(i\) and \(j=2N-i\) copies, respectively, in generation \(t\).

\(i=8\), \(j=2\cdot 6-8=4\)

Let \(p_t=i/2N\) be the frequency of \(A\) in generation \(t\), and \(q_t=1-p_t\) the frequency of \(a\).

\(p_t = 8/12\)

\(p_{t+1} = 4/12\)

Prob(\(k\) \(A\) alleles in next generation) is \(\mathsf{Bin}(2N, \frac{i}{2N})\)
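For the example above (\(N=6\), \(i=8\)) the full distribution of the number of \(A\) alleles in the next generation can be tabulated with `dbinom`. A minimal sketch; the variable names are illustrative.

```r
N <- 6   # diploid population size, so 2N = 12 gene copies
i <- 8   # current number of A alleles

k <- 0:(2 * N)
# Probability of k copies of A in the next generation
prob_k <- dbinom(k, size = 2 * N, prob = i / (2 * N))

# Probability of the transition from 8 to 4 copies shown in the example
print(prob_k[k == 4])
# The expected number of copies stays at i: drift has no preferred direction
print(sum(k * prob_k))
```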

Genetic drift

To capture dynamics, follow allele frequency trajectory (\(p_t\)) as function of time.

##' Wright-Fisher model - follow the allele frequency trajectory
##'
##' @param p0 Starting frequency
##' @param n Haploid population size (number of gene copies)
##' @param generations Number of generations to simulate
##' @return Numeric vector of allele frequencies, one per generation
##'
wright_fisher <- function(p0, n, generations) {
    x <- numeric(generations)
    x[1] <- p0
    # seq_len() gives an empty loop when generations is 1
    for (i in seq_len(generations - 1) + 1) {
        # Binomial sampling of n gene copies from the previous generation
        x[i] <- rbinom(1, size = n, prob = x[i - 1]) / n
    }
    x
}

# Example simulation and plot
set.seed(1223)
generations <- 100
n <- 100  # NB: haploid population size!
plot(seq_len(generations), wright_fisher(0.5, n, generations), type = "l",
    ylab = "frequency", xlab = "generation", ylim = c(0, 1))

Figure 5: Genetic drift for different haploid(!) population sizes, starting frequency \(p_0\)=0.5. Note the dependency of the variance on population size N.

Genetic drift

Figure 6: Genetic drift for different combinations of starting frequency and population size for n=50 repetitions per parameter combination. Note how variation and time to fixation depend on population size and starting frequency.

  • fate of allele: fixation or loss \(\rightarrow\) eventually loss of variation
  • probability of fixation \(\pi(p)=p\), where \(p\) is the current frequency
  • rate of drift (loss of variation) \(\propto \frac{1}{2N}\)
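The claim that the fixation probability equals the current frequency can be checked by simulation. The sketch below is self-contained; the helper `fixes`, the population size, and the number of replicates are assumptions chosen for illustration.

```r
set.seed(42)

# Run one Wright-Fisher population until the allele is fixed or lost;
# return TRUE if it fixed
fixes <- function(p0, n) {
    count <- round(p0 * n)
    while (count > 0 && count < n) {
        count <- rbinom(1, size = n, prob = count / n)
    }
    count == n
}

p0 <- 0.3   # starting frequency
n <- 20     # haploid population size (small, so runs finish quickly)
fixed <- replicate(2000, fixes(p0, n))

# Fraction of replicates in which the allele fixed; should be close to p0
print(mean(fixed))
```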

Allele frequency distribution for N=1

Instead of looking at frequencies let’s switch to distributions of alleles for one individual, one locus. Then there are three possible genotypes (states) \(aa\), \(aA\), and \(AA\). Let \(n=0,1,2\) be an integer corresponding to each genotype (i.e., it counts the number of \(A\) alleles).

Assume individual mates with itself at random(!) starting in either of the three states. How does distribution evolve?

[Figure panels: distribution over the three states at t=0, t=1, and t=2]
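One way to answer the question is to treat the genotype as a Markov chain. The sketch below assumes that the two gene copies of the offspring are drawn by Wright-Fisher (binomial) sampling from the selfing parent, and that the starting state is the heterozygote; both are assumptions made for illustration.

```r
states <- 0:2   # number of A alleles: aa, aA, AA

# Transition matrix: row i is Bin(2, i/2), the offspring distribution
# of a parent carrying i copies of A
P <- t(sapply(states, function(i) dbinom(states, size = 2, prob = i / 2)))
dimnames(P) <- list(from = c("aa", "aA", "AA"), to = c("aa", "aA", "AA"))
print(P)

# Start in the heterozygous state and propagate the distribution
x0 <- c(aa = 0, aA = 1, AA = 0)
x1 <- x0 %*% P
x2 <- x1 %*% P
print(rbind(t0 = x0, t1 = as.numeric(x1), t2 = as.numeric(x2)))
```

The states aa and AA are absorbing, so probability mass accumulates there over time: this is loss of variation through drift in its simplest form.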

Probability distributions of allele frequencies

Figure 7: Histogram showing the course of change of the allele frequency distribution with time (Kimura, 1983, fig. 3.4). When N large (\(\gtrsim 100\)) histogram can be approximated by continuous distribution (diffusion theory). Try recipe for different values of N.

Figure 8: Frequency distributions of the brown eye (\(bw^{75}\)) allele in replicate experimental populations (\(n\sim 100\)) of Drosophila melanogaster (8 males, 8 females) (Buri, 1956)

Mathematical treatment of drift can become complicated: easier to study dynamics of heterozygosity

Heterozygosity dynamics

Figure 9: Illustration of identity by descent (IBD) and state (IBS). Alleles in generation \(n\) are IBD but not IBS.

Let \(\mathcal{H}_t\) be the probability that two alleles are different by state. One can show that the time course evolution of \(\mathcal{H}_t\) in a randomly mating population consisting of \(N\) diploid hermaphroditic individuals is

\[ \mathcal{H}_t = \mathcal{H}_0 \left( 1 - \frac{1}{2N} \right)^t \]

Important consequence: heterozygosity in a WF population is lost at rate \(1/2N\) per generation.
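The decay formula is straightforward to evaluate. The sketch below uses made-up values for \(\mathcal{H}_0\) and \(N\); the function names `heterozygosity` and `half_life` are illustrative.

```r
# H_t = H_0 * (1 - 1/(2N))^t
heterozygosity <- function(h0, N, t) h0 * (1 - 1 / (2 * N))^t

t <- 0:200
sizes <- c(10, 100, 1000)
# One column per population size
H <- sapply(sizes, function(N) heterozygosity(0.5, N, t))
colnames(H) <- paste0("N=", sizes)
print(round(H[c(1, 51, 201), ], 3))

# Number of generations until heterozygosity is halved
half_life <- function(N) log(0.5) / log(1 - 1 / (2 * N))
print(half_life(sizes))
```

For large \(N\) the half-life is close to \(2N\ln 2 \approx 1.39N\) generations, so small populations lose variation much faster.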

Heterozygosity dynamics

Figure 10: Plot of \(\mathcal{H}_t\) illustrating dependency on population size

Figure 11: Heterozygosity in black-footed ferret (Wisely et al., 2002). Example from Graham Coop (2020), Fig. 4.5

Example of how rapid decline in population size can affect heterozygosity.

Population size influences genetic diversity!

However, census population size not (always) the correct measure.

Effective population size

Assumptions underlying Wright-Fisher model seldom fulfilled for natural populations. In particular

  • non-random mating (population structure)
  • fluctuations of population census size

Therefore, the magnitude of drift experienced by a population differs from that predicted by its census size

Technically correct definition (but see Waples (2022)):

\(N_e\) is the size of an ideal population that would experience the same rate of genetic drift as the population in question.

Mutation

  • Two-allele: derive popgen stats
  • Finite sites: recurrent mutations
  • Infinite alleles: protein electrophoresis
  • Infinite sites: DNA sequences

Mutation and drift

Genetic drift “moves” frequencies to the point that variation is lost via allele fixation or loss. New variation is introduced through mutation. We typically assume mutations are described by a Poisson process with rate \(\mu\) (per generation).

The mutation rate is denoted \(\mu\), and the population scaled mutation rate is \(2N_e\mu\) for haploid populations, \(4N_e\mu\) for diploid, where \(N_e\) is the effective population size.

The mutation-drift balance is reached when the diversity lost due to drift equals the diversity gained due to mutation.

Figure 12: Variation is introduced by mutations (black) at rate \(\mu=10^{-4}\) and is occasionally lost through genetic drift.

Tracing the evolution of mutations

Figure 13: Different mutations suffer different fates. Most mutations are lost in a couple of generations. Mutant alleles are colored black and their genealogies are highlighted with thicker edges.

Observation: most mutations are in fact lost

Recall: fixation probability \(\pi(p)=p\)

Mutation drift balance

Drift removes variation. Mutation reintroduces it. At equilibrium the change in variation by definition is 0. In terms of \(\mathcal{H}_t\) (the probability that two alleles are not identical by state), \(\Delta\mathcal{H}=0\).

One can show that the equilibrium heterozygosity is given by the classical formula

\[ \hat{\mathcal{H}} = \frac{4N_e\mu}{1 + 4N_e\mu} \]

\(\mu\) is often assumed known, and heterozygosity is easily calculated from data, which provides a way of estimating \(N_e\).

The compound parameter \(4N_e\mu\) is called the population scaled mutation rate and is commonly named \(\theta\) such that

\[ \hat{\mathcal{H}} = \frac{\theta}{1 + \theta} \]
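Inverting the formula gives \(\theta = \hat{\mathcal{H}} / (1 - \hat{\mathcal{H}})\) and hence \(N_e = \theta / 4\mu\). The sketch below illustrates the round trip; the values of \(N_e\) and \(\mu\) are made-up inputs, and the function names are illustrative.

```r
# Equilibrium heterozygosity for a diploid population
equilibrium_h <- function(Ne, mu) {
    theta <- 4 * Ne * mu
    theta / (1 + theta)
}

# Invert the formula: theta = H / (1 - H), then Ne = theta / (4 * mu)
estimate_ne <- function(H, mu) {
    theta <- H / (1 - H)
    theta / (4 * mu)
}

mu <- 1e-8   # assumed mutation rate per generation
H <- equilibrium_h(Ne = 1e4, mu = mu)
print(H)
print(estimate_ne(H, mu))
```

In practice \(\mathcal{H}\) is estimated from data, and the estimate of \(N_e\) is only as good as the assumed mutation rate and the equilibrium assumption.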

The neutral theory of evolution

Mutation-drift balance, together with the observation during the 1950s and 1960s that polymorphism was more common than expected, is the foundation of the neutral theory of evolution (Kimura, 1983): allele frequencies may change and fix due to chance alone and not selection; most mutations behave as if they are neutral.

Nearly neutral theory (Ohta, 1973) was later developed to explain failure to predict scaling of polymorphism with population size: most mutations are not neutral but slightly deleterious and purged from population by natural selection.

Figure 14: Heterozygosity \(H=4N_e\mu/(1+4N_e\mu)\) predicted by the neutral theory. Shaded region shows typical heterozygosities in animals (y-axis). The observed \(N_e\mu\) range is higher than predicted from plot. From Hurst (2009), Fig 1.

Mutation rate can be estimated from substitution rate

Mutations enter populations and may be fixed by drift. Therefore, with time there will be fixed differences, or substitutions, between populations or species. In molecular evolution, the substitution rate, \(\rho\), is the most interesting quantity.

The total number of new mutations in every generation is \(2N\mu\) (total number of gametes times mutation rate)

Each new neutral mutation fixes with probability \(1/2N\) (its initial frequency)

Therefore, the average rate of substitution, \(\rho\), is \(2N\mu\times1/2N\), or

\[ \rho=\mu \]

which is independent of population size!

Practical implication: we can estimate mutation rate from the substitution rate at neutrally evolving sites (e.g., Kumar & Subramanian (2002))

Selection

Selection and fitness

Figure 15: The life cycle used in the fundamental model of selection (Gillespie, 2004, fig. 3.2)

Much confusion exists in the literature regarding how various types of selection are defined, in particular because some of the terminology is used slightly differently within different scientific communities (Nielsen, 2005).

\[ \begin{matrix} \mathrm{Genotype} & AA & Aa & aa \\ \mathrm{Frequency\ in\ newborns} & p^2 & 2pq & q^2\\ \mathrm{Viability} & w_{AA} & w_{Aa} & w_{aa}\\ \mathrm{Frequency\ after\ selection} & p^2w_{AA} / \bar{w} & 2pqw_{Aa} / \bar{w} & q^2w_{aa} / \bar{w} \\ \mathrm{Relative\ fitness} & 1 & 1-hs & 1-s\\ \end{matrix} \]

where \(\bar{w} = p^2w_{AA} + 2pqw_{Aa} + q^2w_{aa}\) is the mean fitness.

\(h=0\) \(A\) dominant, \(a\) recessive
\(h=1\) \(a\) dominant, \(A\) recessive
\(0<h<1\) incomplete dominance
\(h<0\) overdominance (heterozygote advantage)
\(h>1\) underdominance

Notation follows Gillespie (2004), pp. 61–64.

The most important equation in population genetics

\[ p^\prime - p = \Delta_sp = \frac{pq[p(w_{AA} - w_{Aa}) + q(w_{Aa} - w_{aa})]}{p^2w_{AA} + 2pqw_{Aa} + q^2w_{aa}} \]
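The equation can be iterated to follow the allele frequency over time. The sketch below uses the relative fitnesses \(1\), \(1-hs\), \(1-s\) from the table above; the function names and parameter values are assumptions chosen for illustration.

```r
# Change in the frequency of A after one generation of selection
delta_p <- function(p, w_AA, w_Aa, w_aa) {
    q <- 1 - p
    w_bar <- p^2 * w_AA + 2 * p * q * w_Aa + q^2 * w_aa   # mean fitness
    p * q * (p * (w_AA - w_Aa) + q * (w_Aa - w_aa)) / w_bar
}

# Deterministic trajectory with relative fitnesses 1, 1 - hs, 1 - s
trajectory <- function(p0, s, h, generations) {
    p <- numeric(generations + 1)
    p[1] <- p0
    for (t in seq_len(generations)) {
        p[t + 1] <- p[t] + delta_p(p[t], 1, 1 - h * s, 1 - s)
    }
    p
}

directional <- trajectory(p0 = 0.1, s = 0.1, h = 0.5, generations = 200)
balancing <- trajectory(p0 = 0.1, s = 0.1, h = -0.5, generations = 2000)
print(c(directional = tail(directional, 1), balancing = tail(balancing, 1)))
```

With \(0<h<1\) the favoured allele goes to fixation, whereas with \(h<0\) (heterozygote advantage) the frequency settles at an intermediate equilibrium.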

Figure 16: Allele frequency change over time for directional, balancing, and disruptive selection, for different values of \(p_0\).

Figure 17: Rate of allele frequency change as a function of allele frequency for directional, balancing, and disruptive selection.

Selection and drift - population size matters

Figure 18: The fixation probability relative to the neutral probability of fixation (\(p=1/2N\)) under the assumption \(s<0.1\). Red highlights region where \(|N_es|<0.05\). Adapted from Lynch (2007), Fig. 4.2.

In red region (\(|N_es|<0.05\)) the probability of fixation is within 10% of neutral fixation.

Consequence: for any population size there exists range of selection coefficients where mutant alleles \(\approx\) neutral (effective neutrality).
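Effective neutrality can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation, \((1-e^{-2s})/(1-e^{-4Ns})\). Note the assumptions in this sketch: the formula is not taken from the slides, \(N_e\) is set equal to \(N\), and \(s\) here denotes the selective advantage of the mutant with additive effects, which is a different parametrisation from the relative fitness table above.

```r
# Kimura's approximation for a new mutation at initial frequency 1/(2N).
# Assumes Ne = N and additive (genic) selection with coefficient s.
fixation_prob <- function(N, s) {
    ifelse(s == 0, 1 / (2 * N), (1 - exp(-2 * s)) / (1 - exp(-4 * N * s)))
}

# Fixation probability relative to the neutral expectation 1/(2N)
relative_fixation <- function(N, s) fixation_prob(N, s) * 2 * N

N <- 1000
s <- c(-0.05, 0, 0.05) / N   # values on the boundary |N s| = 0.05
print(relative_fixation(N, s))
```

At \(|Ns|=0.05\) the relative fixation probability deviates from one by roughly 10%, in line with the red region of the figure.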

Direct selection can be inferred from protein substitutions

For genes, the ratio of nonsynonymous to synonymous substitutions can tell us about protein evolution:

Synonymous substitution

Protein          L
DNA         --- CTT ---
                  *
DNA         --- CTC ---
Protein          L

Nonsynonymous substitution

Protein          L
DNA         --- CTT ---
                 *
DNA         --- CAT ---
Protein          H

\(\mathbf{d_N/d_S \ll 1}\): negative (purifying) selection
\(\mathbf{d_N/d_S < 1}\): majority of nonsynonymous mutations deleterious, some advantageous
\(\mathbf{d_N/d_S = 1}\): neutral, or a mix of neutral, advantageous, and deleterious mutations
\(\mathbf{d_N/d_S > 1}\): positive selection

Figure 19: \(d_N/d_S\) comparisons for human-rat orthologs. For most genes, \(d_N/d_S \ll 1\), indicating purifying selection. A handful of genes (n=9) have \(d_N/d_S > 1.0\), which could indicate positive selection.

Not all mutations fall in genes. Methods for detecting direct selection are not applicable to studying selection on a single mutation or, e.g., balancing selection. This requires looking for specific patterns of diversity surrounding the locus under selection.

Linked selection reduces diversity at neighbouring loci

Figure 20: A selective sweep of an advantageous mutation (gray dot). Adapted from Charlesworth & Charlesworth (2010), Fig. 8.13

Example of a selective sweep. If a sweep completes at a locus, it will become monomorphic, as will the neighbouring sites. Mutation could reintroduce variation. Recombination could increase diversity in neighbourhood, but in a manner that depends on the distance from the locus under selection.

Recombination breaks association between loci

Miller (2020), Fig. 5.12.3

Roughly once per chromosome per meiosis! But: rates vary between loci (hotspots) and between sex chromosomes and autosomes, and in some species recombination only occurs in one sex (e.g., D. melanogaster).

Main effect: association between loci breaks up.

Linkage disequilibrium and its decay

Association between loci can be written as:

\[ D_{AB} = p_{AB} - p_Ap_B \]

Similar expressions hold for other pairs; only need to know one \(D_{ij}\) (e.g., \(D_{AB}\)) so drop subscript and rewrite:

\[ p_{AB}= p_Ap_B + D \]

If \(D\neq0\) the loci are in linkage disequilibrium.

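One can show that \(D\) decays over time as

\[ D_t = (1-r)^tD_0 \]

where \(r\) is the recombination rate between the two loci per generation.

Recombination decreases \(D\) (linkage disequilibrium)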
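A short numerical example of the definition and the decay formula. The haplotype and allele frequencies below are made-up values, and the function names are illustrative.

```r
# D = p_AB - p_A * p_B
ld <- function(p_AB, p_A, p_B) p_AB - p_A * p_B

# D_t = (1 - r)^t * D_0
ld_decay <- function(D0, r, t) (1 - r)^t * D0

# Made-up frequencies: haplotype AB, allele A, allele B
D0 <- ld(p_AB = 0.5, p_A = 0.6, p_B = 0.7)
print(D0)

# Unlinked loci (r = 0.5) halve D every generation
print(ld_decay(D0, r = 0.5, t = 1))

# Tightly linked loci (r = 0.01) need many more generations to halve D
print(log(0.5) / log(1 - 0.01))
```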

LD between pairs of autosomal SNPs for human and mouse. From (Laurie et al., 2007, fig. 2)

The effect of a selective sweep on diversity

Code
pgip-slim --seed 42 -n 1000 -r 1e-6 -m 1e-7 --threads 12 recipes/slim/selective_sweep.slim -l 1000000 --outdir results/slim
pgip-tsstat results/slim/slim*.trees -n 10 --seed 31 -s pi -s S -s TajD -w 500 --threads 10 | gzip -v - > results/slim/selective_sweep.w500.csv.gz

Figure 21: The effect of a selective sweep on diversity. The arrow points to the site under selection. The y-axis shows Tajima’s D which is proportional to the difference between two measures of diversity, nucleotide diversity \(\pi\) and Watterson’s \(\theta_W\).
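To make the two diversity measures concrete, the sketch below computes Watterson's \(\theta_W = S / a_n\), with \(a_n = \sum_{i=1}^{n-1} 1/i\), for the four-sequence alignment from the start of the lecture (\(n=4\), \(S=6\), \(\pi=10/3\)). Only the difference \(\pi - \theta_W\), the numerator of Tajima's D, is shown; the normalising variance term is left out of this sketch.

```r
n <- 4              # number of sequences in the example alignment
S <- 6              # number of segregating sites
pi_total <- 10 / 3  # nucleotide diversity computed earlier

# Watterson's estimator: S divided by the harmonic number a_n
a_n <- sum(1 / seq_len(n - 1))
theta_w <- S / a_n

# Numerator of Tajima's D; its sign shows which estimator is larger
d <- pi_total - theta_w
print(c(theta_w = theta_w, d = d))
```

Under neutrality and constant population size both estimators have the same expectation; after a sweep an excess of rare variants lowers \(\pi\) relative to \(\theta_W\), which makes the difference negative.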

Figure 22: The effect of a selective sweep on diversity. The arrow points to the site under selection. The y-axis shows Tajima’s D which is proportional to the difference between two measures of diversity, nucleotide diversity \(\pi\) and Watterson’s \(\theta_W\).

The effect of a selective sweep on diversity

Figure 23: The effect of a selective sweep on diversity. The figure shows the mean of 1000 simulations with the selected locus indicated with an arrow.

The phases of a selective sweep

Figure 24: Time goes from left to right. As sweep progresses, tree topology changes. Adapted from Hahn (2019), Figure 8.1

Amount of diversity depends on fixation time. A neutral allele that reaches fixation takes on average \(4N_e\) generations to do so; for \(s=0.0001\), it takes approximately \(0.29N_e\) generations.

Selection changes the genealogy (different topology, shorter branches), an aspect used in many linkage-based tests for selection.

Linked selection may constrain levels of diversity

Figure 25: Hitchhiking (left) versus background selection (right).

Hitchhiking

Background selection

  • variation at loci linked to a deleterious mutation is purged from the population along with it, which reduces diversity (Charlesworth et al., 1993)
  • similar patterns to hitchhiking

Summary

We have looked at the Wright-Fisher model as a model of populations and genealogies.

Genetic drift moves allele frequencies up and down at random and removes variation at rate \(\propto 1/2N\)

Mutation reintroduces variation. The neutral theory posits that most mutations are neutral and that dynamics follow mutation-drift equilibrium.

Methods to detect selection are based on direct selection or studying patterns of variation caused by linked selection.

Bibliography

Barton, N. H., Briggs, D. E. G., Eisen, J. A., Goldstein, D. B., & Patel, N. H. (2007). Evolution. Cold Spring Harbor Laboratory Press.
Buri, P. (1956). Gene Frequency in Small Populations of Mutant Drosophila. Evolution, 10(4), 367–402. https://doi.org/10.1111/j.1558-5646.1956.tb02864.x
Casillas, S., & Barbadilla, A. (2017). Molecular Population Genetics. Genetics, 205(3), 1003–1035. https://doi.org/10.1534/genetics.116.196493
Charlesworth, B., & Charlesworth, D. (2010). Elements of Evolutionary Genetics. Roberts and Company Publishers.
Charlesworth, B., Morgan, M. T., & Charlesworth, D. (1993). The Effect of Deleterious Mutations on Neutral Molecular Variation. Genetics, 134(4), 1289–1303.
cooplab. (2011). Population genetics course resources: Hardy-Weinberg Eq. In gcbias. https://gcbias.org/2011/10/13/population-genetics-course-resources-hardy-weinberg-eq/
Corbett-Detig, R. B., Hartl, D. L., & Sackton, T. B. (2015). Natural Selection Constrains Neutral Diversity across A Wide Range of Species. PLOS Biology, 13(4), e1002112. https://doi.org/10.1371/journal.pbio.1002112
Ewens, W. J. (2004). Mathematical Population Genetics (S. S. Antman, J. E. Marsden, L. Sirovich, & S. Wiggins, Eds.; Vol. 27). Springer. https://doi.org/10.1007/978-0-387-21822-9
Gillespie, J. H. (2004). Population Genetics: A Concise Guide (2nd edition). Johns Hopkins University Press.
Graham Coop. (2020). Notes on Population Genetics. https://github.com/cooplab/popgen-notes
Hahn, M. (2019). Molecular Population Genetics (1st edition). Oxford University Press.
Hein, J., Schierup, M. H., & Wiuf, C. (2005). Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford University Press. https://books.google.se/books?id=CCmLNAEACAAJ
Hein, J., Schierup, M., & Wiuf, C. (2004). Gene genealogies, variation and evolution. A primer in coalescent theory. In Systematic Biology - SYST BIOL (Vol. 54).
Hermisson, J. (2017). Mathematical population genetics. https://www.mabs.at/fileadmin/user_upload/p_mabs/Lecture_Notes_2017
Hubisz, M., & Siepel, A. (2020). Inference of Ancestral Recombination Graphs Using ARGweaver. In J. Y. Dutheil (Ed.), Statistical Population Genomics (pp. 231–266). Springer US. https://doi.org/10.1007/978-1-0716-0199-0_10
Hurst, L. D. (2009). Genetics and the understanding of selection. Nature Reviews Genetics, 10(2), 83–93. https://doi.org/10.1038/nrg2506
Kimura, M. (1983). The neutral theory of molecular evolution. Cambridge University Press. https://doi.org/10.1017/CBO9780511623486
Kimura, M., & Ohta, T. (1971). Protein Polymorphism as a Phase of Molecular Evolution. Nature, 229(5285), 467–469. https://doi.org/10.1038/229467a0
Kumar, S., & Subramanian, S. (2002). Mutation rates in mammalian genomes. Proceedings of the National Academy of Sciences, 99(2), 803–808. https://doi.org/10.1073/pnas.022629899
Laurie, C. C., Nickerson, D. A., Anderson, A. D., Weir, B. S., Livingston, R. J., Dean, M. D., Smith, K. L., Schadt, E. E., & Nachman, M. W. (2007). Linkage Disequilibrium in Wild Mice. PLOS Genetics, 3(8), e144. https://doi.org/10.1371/journal.pgen.0030144
Leffler, E. M., Bullaughey, K., Matute, D. R., Meyer, W. K., Ségurel, L., Venkat, A., Andolfatto, P., & Przeworski, M. (2012). Revisiting an Old Riddle: What Determines Genetic Diversity Levels within Species? PLOS Biology, 10(9), e1001388. https://doi.org/10.1371/journal.pbio.1001388
Lynch, M. (2007). The origins of genome architecture. Sinauer Associates.
Miller, C. (2020). Human Biology. Thompson Rivers University.
Nei, M., & Kumar, S. (2000). Molecular Evolution and Phylogenetics. Oxford University Press.
Nielsen, R. (2005). Molecular Signatures of Natural Selection. Annual Review of Genetics, 39(1), 197–218. https://doi.org/10.1146/annurev.genet.39.073003.112420
Ohta, T. (1973). Slightly Deleterious Mutant Substitutions in Evolution. Nature, 246(5428), 96. https://doi.org/10.1038/246096a0
Smith, J. M., & Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genetics Research, 23(1), 23–35. https://doi.org/10.1017/S0016672300014634
Waples, R. S. (2022). What Is Ne, Anyway? Journal of Heredity, 113(4), 371–379. https://doi.org/10.1093/jhered/esac023
Wisely, S. M., Buskirk, S. W., Fleming, M. A., McDonald, D. B., & Ostrander, E. A. (2002). Genetic Diversity and Fitness in Black-Footed Ferrets Before and During a Bottleneck. Journal of Heredity, 93(4), 231–237. https://doi.org/10.1093/jhered/93.4.231