Population structure

Principal Component Analysis, Admixture and F-statistics

Nikolay Oskolkov

(uses material from lectures of Ben Peter, Anders Albrechtsen, Ida Moltke, Graham Coop)

NBIS

15-Nov-2023

PCA in population genetics: Cavalli-Sforza

(Menozzi P, 1978) noticed that genetic distances seem to correlate with geographical distances

PCA in population genetics: John Novembre

  • genetics seems to mirrow the map of Europe
  • caveate: it is not always the case
  • J. Novembre selected samples

PCA and MDS are matrix factorization techniques

Toy example of PCA for population genetics

gen <- t(matrix(c(1, 0, 2, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1,
    0, 1, 0, 2, 0, 1, 1, 0, 2, 1, 2, 0, 1, 0), 5, by = TRUE))
colnames(gen) <- paste0("Ind", 1:5)
rownames(gen) <- paste0("SNP", 1:7)
print(gen)
     Ind1 Ind2 Ind3 Ind4 Ind5
SNP1    1    1    1    0    0
SNP2    0    1    2    1    2
SNP3    2    1    1    0    1
SNP4    0    0    1    2    2
SNP5    2    1    1    0    0
SNP6    0    0    1    1    1
SNP7    2    2    1    1    0
gen_centered <- scale(gen, center = TRUE, scale = FALSE)
covariance <- t(gen_centered) %*% gen_centered
eig <- eigen(covariance)
barplot(eig$values/sum(eig$values), names = paste0("PC", 1:5))

plot(eig$vectors[, 1:2], xlab = "PC1", ylab = "PC2", pch = 16, cex = 5, col = 2:6)
points(eig$vectors[, 1:2], pch = as.character(1:5))

Idea: project the data into a low dimensional space that explains the largest amount of variance

Variance maximization and eigen value decomposition

Toy example of MDS for population genetics

gen <- t(matrix(c(1, 0, 2, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1,
    0, 1, 0, 2, 0, 1, 1, 0, 2, 1, 2, 0, 1, 0), 5, by = TRUE))
colnames(gen) <- paste0("Ind", 1:5)
rownames(gen) <- paste0("SNP", 1:7)
print(gen)
     Ind1 Ind2 Ind3 Ind4 Ind5
SNP1    1    1    1    0    0
SNP2    0    1    2    1    2
SNP3    2    1    1    0    1
SNP4    0    0    1    2    2
SNP5    2    1    1    0    0
SNP6    0    0    1    1    1
SNP7    2    2    1    1    0
my_dist <- dist(t(gen), method = "manhattan", upper = TRUE, diag = TRUE)
my_dist
     Ind1 Ind2 Ind3 Ind4 Ind5
Ind1    0    3    7   10   11
Ind2    3    0    4    7    8
Ind3    7    4    0    5    4
Ind4   10    7    5    0    3
Ind5   11    8    4    3    0
my_mds <- cmdscale(my_dist)
my_mds
          [,1]       [,2]
Ind1 -6.114866  0.2977272
Ind2 -3.077094  0.2402932
Ind3  0.621230 -1.7944048
Ind4  3.723118  2.1483083
Ind5  4.847612 -0.8919238
plot(my_mds, xlab = "MDS1", ylab = "MDS2", pch = 16, cex = 5, col = 6:2)
points(my_mds, pch = as.character(5:1))

Idea: project the data into a low dimensional space that preserves distances

Example of PCA plot from 1000G project

We will use ANGSD (Korneliussen T, 2014) and compute genotype likelihoods on 435 bam-files from the 1000G project.

cd /home/nikolay/Documents/Teaching/PopGen_2023/data/1000G_bam_files
find $PWD -name '*.bam' > 1000G_bam_list.txt
angsd -bam 1000G_bam_list.txt -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 \
-minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out 1000G -P 5

Next, we will run PCAANGSD to infer admixures proportions:

pcangsd -b 1000G.beagle.gz -o pca_1000G -t 4

and finally we will plot the results using custom R scripts:

C <- as.matrix(read.table("pca_1000G.cov"))
e <- eigen(C)
pops <- readLines("1000G_bam_list.txt")
pops <- sapply(strsplit(pops, "\\."), function(x) x[6])
mycolor <- rep("red", length(pops))
mycolor[pops == "CEU"] <- "blue"
mycolor[pops == "CHB"] <- "green"
mycolor[pops == "MXL"] <- "brown"
mycolor[pops == "ASW"] <- "magenta"
plot(e$vectors[, 1:2], xlab = "PC1", ylab = "PC2", main = "PCA 1000G Project",
    col = mycolor, pch = 19)
legend("topright", c("YRI", "CEU", "CHB", "MXL", "ASW"), fill = c("red", "blue",
    "green", "brown", "magenta"), cex = 2, inset = 0.02)

ANGSD: genotype likelihood tool for low-coverage data

(Korneliussen T, 2014)

Hard genotype calls vs. genotype likelihoods

  • Minor allele frequency based on hard calls: \[\rm{MAF_{hc}}=\frac{1+1+2}{2*5}=\frac{4}{10}\]
  • Minor allele frequency based on genotype likelihoods: \[\rm{MAF_{gl}}=\frac{0*(0.98+0.20+0.89+0.10+0.15)+1*(0.02+0.75+0.10+0.80+0.20)+2*(0.00+0.05+0.01+0.10+0.65)}{2*5}=\frac{3.49}{10}\]

A known pitfall of PCA: uneven sampling

β€œThe results described here provide an explanation. First, from Equation 10 it can be seen that the matrix M is influenced by the relative sample size from each population through the components \(t_i\). For instance, even if all populations are equally divergent from each other, those for which there are fewer samples will have larger values of \(t_i\) because relatively more pairwise comparisons are between populations.”

\[M=XX^T=\frac{1}{N}\sum_{ij}x_ix_j\] \[N=N_{pop1}+N_{pop2}+N_{pop3}+...=\sum_k N_k\] \[M_{uneven}=\sum_{ijk}\frac{1}{N_k}x_{ik}x_{jk}\]

  • Potential solution: normalize each sample by its population size before computing the covariance matrix.
  • Is it still an unsupervised technique?

Example of uneven sampling from 1000G project

Downsampled Europeans

Downsampled Asians

Admixture: underlying assumtions

  • genetic clustering algorithm
  • at least two unadmixed populations
  • at least one admixed population
  • Maximum Likelihood: ADMIXTURE
  • Bayesian: STRUCTURE
  • delivers admixture proportions Q
  • delivers allele frequencies F for all loci for all K populations

Q and F outputs of admixture analysis

NGSadmix vs. ADMIXTURE

Example of admixture plot from 1000G project

We will use ANGSD (Korneliussen T, 2014) and compute genotype likelihoods on 435 bam-files from the 1000G project.

cd /home/nikolay/Documents/Teaching/PopGen_2023/data/1000G_bam_files
find $PWD -name '*.bam' > 1000G_bam_list.txt
angsd -bam 1000G_bam_list.txt -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 \
-minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out 1000G -P 5

Next, we will run NGSadmix to infer admixures proportions:

NGSadmix -likes 1000G.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o 1000G

and finally we will plot the results using custom R scripts:

pops <- readLines("1000G_bam_list.txt")
pops <- sapply(strsplit(pops, "\\."), function(x) x[6])
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")
qopts <- read.table("1000G.qopt")
ord <- orderInds(pop = pops, q = qopts, popord = c("YRI", "ASW", "CEU", "MXL",
    "CHB"))
barplot(t(qopts)[, ord], col = c(3, 2, 4), las = 2, space = 0, border = NA)
text(sort(tapply(1:length(pops), pops[ord], mean)), -0.05, unique(pops[ord]))
abline(v = cumsum(sapply(unique(pops[ord]), function(x) {
    sum(pops[ord] == x)
})), col = 1, lwd = 1.2)

Admixture bar plots should not be over-interpreted

(Lawson DJ, 2018)

Haplotype clustering

  • Ancestry assignments in STRUCTURE / ADMIXTURE do not identify where admixture has occurred

  • Haplotype-based methods explore local ancestry, LD, recombination and subsequently fine-scale patterns of population structure at different scales

  • Chromosome painting: the genome is a mosaic of LD blocks separated by recombination

  • ChromoPainter and fineSTRUCTURE are tools for resolving subtle differences between populations.

(D. Lawson, 2012)

(Schraiber & Akey, 2015)

Fixation index \(F_{st}\)

  • \(F_{st}\) is a measure of population differentiation due to population structure
  • Expressed through nucleotide diversity \(\pi\) as relative nucleotide diversity between populations minus average nucleotide diversity within populations. If we have two populations 1 and 2, then:

\[F_{st}=\frac{\pi_{12}-\frac{\pi_{1}+\pi_{2}}{2}}{\pi_{12}} \sim \frac{\sigma_{subpops}^2}{\sigma_{total}^2}\]

library("admixtools")
library("tidyverse")

admixtools::extract_f2("AADR", outdir = "f2_AADR")

f2_aadr <- read_f2("f2_AADR")
fst_aadr <- fst(f2_aadr)
fst_aadr

mat <- f2(f2_aadr, unique_only = F) %>%
    select(-se) %>%
    pivot_wider(names_from = pop2, values_from = est) %>%
    column_to_rownames("pop1") %>%
    as.matrix()

library("pheatmap")
pheatmap(mat, cluster_rows = TRUE, cluster_cols = TRUE)

Allen Ancient DNA Resource (AADR) dataset

head data/DavidReich/AADR.ind
             ALT-116 M    Tubalar
             ALT-117 M    Tubalar
             ALT-165 M    Tubalar
             ALT-845 M    Tubalar
             ALT-846 M    Tubalar
             GEO-002 M   Georgian
             GEO-005 M   Georgian
             GEO-010 M   Georgian
             GEO-015 M   Georgian
             GEO-020 M   Georgian
head -n 8 data/DavidReich/AADR.snp
           rs3094315     1        0.020130          752566 G A
           rs7419119     1        0.022518          842013 T G
          rs13302957     1        0.024116          891021 G A
           rs6696609     1        0.024457          903426 C T
              rs8997     1        0.025727          949654 A G
           rs9442372     1        0.026288         1018704 A G
         rs147606383     1        0.026665         1045331 G A
           rs4970405     1        0.026674         1048955 A G
head data/DavidReich/AADR.geno
12109000100001100012110220021200101020000110010200001011011000011010000011000111000011011010212912100101100100011102112122221212221110000010100000001002291001111100000011000010000000100000000001100001100000111102222220111211202221121121120000001210110100001000100000001111111021222210000000000110000001011122010100011000101120101211000000010100000000000000101010100111100110011000100000011001100011001111111110110211120012021102222101112011222221122222222221212110011001100011010000000001100101011001000111011009001122112010011121112112102101010000001000000010100001000000000000101100011001000010000000000000002000000001011000000001911001
12122211222121212210121002100111200122211022221121010222112122001012002201122111211111221221222222112211222122111111212202221212212112221222222222212222202122112221222222212212212102222222122222222221122222221222112222222222222222212012222212222111210222221222122221222111111211011122222222222122222201111222222121212210222121112222222122222222212211112222122222212122121122122221222021001112122210210212212122222121111222222222222112122212221212212212222222022202222212122122112121112112211212122212222212212122222222212221222212101021220022021222222121122222222222222222221222222212220122222122202222121222122222222121221212001121102210
22000100100000000010011000001111001120001111001110001010021001111200020010000000100001110000001112000000000010012020211201001012111022221222222112200121000010000100000000000000110001000101001102101121011029011011112000001110010000101101110001000111211022121221111200020001000000000000000000000100001001002110000100001100000001000101201010001100010201111210000110001010100000000000010200210111100010001010101001001101111112020011000110200011010200000000111001000020100000100000000200012000101111201000001000002000100000011011112000201110200100110100101000100010100000010000000101000000010000000111000111011001110001100111011010111021111110
20212111212120212212102212121100202122211121121202122222122221121022022222022112100111210121122221122212222212210201012100221100021120000001011001121122210112121111222222210212102101221121221020121212201121210000210210111002192111010100111211220001210200101001111020202122222221222222222122221221222100210001211121221211222111112101021112212122212121111122121112112022222222122221112221222122922211211100111121112121000000100001219011001100009000110000110011001902211211222222222022211221121010121221222212211012212100000020221112221221101121021222212122122212222222111222212112122222210121222112222111210221122222222111221212011101201111
00000000011111900000000000000000000000000000000100010000000000000000000000000010000010001000000000000010000000011120121122110122212120000000000000010101200000000010200000100020000000000010000010000000090900102100110112101209110010121111220000000212911100000900000000002000000000000000000000000000000101202102000100010000011001001111000001001000000000001000100100000100000000000010000000000000000000000211211001010201021101200111102112121911221121110112102112111101000010010000010000000002000100012000000100010012000112111110100001001001000011000000001000000001000010001000000001000110000000001000000000000000001001000000000000000100000000
00001100010112211012210220121110121000000120020200101001101000111011100011000100021110210010101211210010011110000110010002000122102000000010101001121220011221100122210001100100010110000110000000000000009000011111202211100122211112201121121101012111000000000000000001000110101201010111111100100000001121111200000210011121111101000110010001010001011000001200101121122210100101000121000001011002100101001201021110211111112211112012111121111120112011201112111111121112011011011000110000210001000210210012000210021122012110021211100011012111122211112100011110001010000020100000000000000010001101000000100001001002111000000000010000000100000111
22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222211222222222222222222222222222222222222222222222222222222292222222222222222222222222222222222222222222222222222222222222222222222221222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222122222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
22222212222122112220021200001212221221222101222121221222211121212222222112222222222222121222121122122212222222222222222222222222222221212222221221102221212121222201122222112222222112222122222122222222222222222222222222222222222222222222221122211222222222222222222221222112111020212211222222122212112222222222022122922122222221121222222212212221222222221022221222112122222212122221212221212210221222222222222222222222222222222222222222222222222222222222222222222222222212221222222222212222222222112212222222221212222222222212222211110122121011122222120112221212211212122222211222212222221121122222222221221221112222222222222222222212222222
22222102212120012220021200001212201221222101222121121222111121212221222111221121201112021212111012012201211122121122121101221111112121212222221221102201201000122100022222022122222112222112222122292222222222112000220011221211121129121212221121210122002122222222222221210112111020212210121122122211111101211110022122121111111120121110922211212221211222221022221121111021122111121111212221211210221922222212122221211220222002221119222012110201202122221211202212222201222211211222212222112222222212110211222222211201220011112212212211110121101011120122110012221212211212122222211222112212221121122222122221221220011222229222222222222112222221
22222222222222222222220220222222222222222222221222212222221222222222222222222212122221212222222222222222222222211221101220111211121210212222112221222222222212222222222222222222222222222222222222222222222222111110110012100212012222111202211222222111110022222222222222221012919990910202222122222222222222120100222222221222222222222200222222222222222222229222221222222221122222222222222222122212221222222221212111112211021110111112122002101211201122021211011121121101122222122221222221222221122212221121222222212122222110221202212221222222122212221222222222122222222222222122222222222222222222222212222212222221212222222222222222222222222222

Example of Fst computation from 1000G project

We will use ANGSD (Korneliussen T, 2014) and compute Fst between CEU and YRI populations from the 1000G project.

angsd -b 1000G_CEU_bam_list.txt -anc anc.fa -out CEU -dosaf 1 -gl 1
angsd -b 1000G_YRI_bam_list.txt -anc anc.fa -out YRI -dosaf 1 -gl 1
realSFS CEU.saf.idx YRI.saf.idx > CEU.YRI.ml
realSFS fst index CEU.saf.idx YRI.saf.idx -sfs CEU.YRI.ml -fstout CEU_YRI
realSFS fst stats  CEU_YRI.fst.idx
realSFS fst stats2 CEU_YRI.fst.idx -win 50 -step 10 > slidingwindow_win50_step10

and finally we will plot the results using custom R scripts:

df <- read.delim("slidingwindow_win50_step10", header = FALSE)
df <- df[-1, ]
df$V2 <- as.numeric(df$V2)
df$V3 <- as.numeric(df$V3)
df$V5 <- as.numeric(df$V5)
df <- df[order(df$V2, df$V3), ]
rownames(df) <- seq(1, dim(df)[1], 1)
plot(df$V5, col = df$V2, xlab = "Chromosomes", ylab = "Fst", xaxt = "n")
myticks <- as.numeric(rownames(df[!duplicated(df$V2), ]))
axis(side = 1, at = myticks, labels = seq(1, 21, 1))

F-statistics: underlying assumtions

  • Treeness test: are populations related in a tree-like fashion (Reich et al. 2009)?

  • Admixture test: ss a particular population descended from multiple ancestral populations (Reich et al. 2009)?

  • Admixture proportions: what are the contributions from different populations to a focal population (Green et al. 2010; Haak et al. 2015)?

  • Number of founders: how many founder populations are there for a certain region (Reich et al. 2012; Lazaridis et al. 2014)?

  • Complex demography: how can mixtures and splits of population explain demography (Patterson et al. 2012; Lipson et al. 2013)?

  • Closest relative: what is the closest relative to a contemporary or ancient population (Raghavan et al. 2014)?

(Peter, 2016)

F2-statistic and F3-statistic

  • \(F_2\) is a covarinace in contrast to \(F_{st}\) which is a correlation coefficeint, therefore \(F_2\) is tree-additive by its definition

\[F_2=2\pi_{12}-\pi_{1}-\pi_{2}\]

  • \(F_3\) can be expressed through a combination of \(F_2\)-statistic

\[F_3(P_X; P_1, P_2)=\frac{1}{2}\left(F_2(P_X, P_1)+F_2(P_X, P_2)-F_2(P_1, P_2)\right)\]

  • Outgroup \(F_3\)-statistic: for an unknown population \(P_X\) find the closest population from a panel \(P_i\) computing the distances via an outgroup \(P_O\).

Example of F3-outgroup-statistic for an Unknown population

We will use the AADR dataset and compute F3-statistic for a (presumably!) European population, which is marked as Unknown in the dataset, with respect to a panel of other World populations using a Mbuti as an outgroup population.

library("admixtools")
library("tidyverse")

admixtools::extract_f2("AADR", outdir = "f2_AADR")

f2_aadr <- read_f2("f2_AADR")
fst_aadr <- fst(f2_aadr)
fst_aadr

f3_aadr <- f3(f2_aadr, pop1 = "Mbuti", pop2 = "Unknown", pop3 = unique(ind$pop)) %>%
    arrange(est)
f3_aadr


viz_outgroup_f3 <- function(f3_res) {
    f3_res %>%
        mutate(pop3 = fct_reorder(pop3, est)) %>%
        ggplot(aes(x = pop3, y = est, ymin = est - 3 * se, ymax = est + 3 * se)) +
        geom_point() + geom_errorbar() + coord_flip() + theme_bw(25) + xlab(NULL) +
        ylab("f3")
}
viz_outgroup_f3(f3_aadr)

Can you guess what was the Unknown population?

Relation between F-statistics and PCA

(Peter, 2022)

F4-statistic, D-statistic: ABBA-BABA test

  • Four populations considered: Ape is an outgroup, Europeans and Africans are tested for their gene flow with Neanderthals

  • A – ancestral allele, B – derived allele

  • Count number of sites corresponding to ABBA and BABA situations

\[D=\frac{N_{ABBA}-N_{BABA}}{N_{ABBA}+N_{BABA}}\]

  • \(N_{ABBA}\) the total counts of ABBA patterns, \(N_{BABA}\) the total counts of BABA patterns

  • Exess of shared derived alleles between Europeans and Neanderthals indicates a gene flow between them

  • Positive values of D imply Neanderthals are closer to Europeans

  • Negative values of D imply Neanderthals are closer to Africans

Example of D-statistic computation from 1000G project

We will use ANGSD (Korneliussen T, 2014) and compute D-statistic in order to test for potential Neanderthal introgression to European (CEU) and African (YRI) populations from the 1000G project.

echo Neandertal.bam > bams
find $PWD -name '*YRI*.bam' | head -10 >> bams
find $PWD -name '*CEU*.bam' | head -10 >> bams
angsd -out out -doAbbababa 1 -bam bams -doCounts 1 -anc anc.fa
Rscript jackKnife.R file=out.abbababa indNames=bams outfile=out_abbababa_results

and, finally, we will display the resulting table of D-statistics, and plot the ABBA and BABA counts (note the very low, ~100, number of sites):

df <- read.delim("data/out_abbababa_results.txt", header = T, sep = "\t")
df <- df[grepl("Neandertal", as.character(df$H3)), ]
df <- df[grepl("CEU", as.character(df$H2)), ]
df <- df[grepl("YRI", as.character(df$H1)), ]
YRI <- matrix(unlist(strsplit(as.character(df$H1), "\\.")), ncol = 9, byrow = T)
df$H1 <- paste(YRI[, 6], YRI[, 2], sep = "_")
CEU <- matrix(unlist(strsplit(as.character(df$H2), "\\.")), ncol = 9, byrow = T)
df$H2 <- paste(CEU[, 6], CEU[, 2], sep = "_")
df$H3 <- "NEAND"
df[1:8, 1:8]
            H1          H2    H3 nABBA nBABA       Dstat     jackEst        SE
11 YRI_NA19102 CEU_NA12342 NEAND    51    46  0.05154639  0.05154639 0.1220787
12 YRI_NA19213 CEU_NA12342 NEAND    38    31  0.10144930  0.10144930 0.1286760
13 YRI_NA18504 CEU_NA12342 NEAND    50    53 -0.02912621 -0.02912621 0.1353893
14 YRI_NA19108 CEU_NA12342 NEAND    38    42 -0.05000000 -0.05000000 0.1393287
15 YRI_NA19238 CEU_NA12342 NEAND    44    38  0.07317073  0.07317073 0.1305890
16 YRI_NA19102 CEU_NA12348 NEAND    46    45  0.01098901  0.01098901 0.1076116
17 YRI_NA19213 CEU_NA12348 NEAND    37    34  0.04225352  0.04225352 0.1122396
18 YRI_NA18504 CEU_NA12348 NEAND    42    49 -0.07692308 -0.07692308 0.1068372
boxplot(df$nABBA, df$nBABA, names = c("ABBA", "BABA"), ylab = "COUNTS", col = 2)

wilcox.test(df$nABBA, df$nBABA)

    Wilcoxon rank sum test with continuity correction

data:  df$nABBA and df$nBABA
W = 367, p-value = 0.2933
alternative hypothesis: true location shift is not equal to 0

D-statistic computation for higher coverage samples

D-statistic analysis requires large number of genomic sites, therefore the low-coverage 1000G data are not optimal. Now, we are going to use French and Yoruba samples from the Draft Neanderthal genome project where Nenandertal introgression was first detected:

echo Neandertal.bam > bams
echo Yoruba.bam >> bams
echo French.bam >> bams
angsd -out out -doAbbababa 1 -bam bams -doCounts 1 -anc anc.fa
Rscript jackKnife.R file=out.abbababa indNames=bams outfile=out_abbababa_highcov
df <- read.delim("data/out_abbababa_highcov.txt", header = T, sep = "\t")
df
              H1         H2             H3  nABBA  nBABA      Dstat    jackEst
1     Yoruba.bam French.bam Neandertal.bam 267387 249721 0.03416308 0.03416308
2 Neandertal.bam French.bam     Yoruba.bam 732198 249721 0.49136130 0.49136130
3 Neandertal.bam Yoruba.bam     French.bam 732198 267387 0.46500400 0.46500400
           SE         Z
1 0.002764548  12.35756
2 0.002449853 200.56760
3 0.002868187 162.12470

First line of the output corresponds to the tree displayed on the previous slide, i.e. where we test for potential gene flow from Neanderthals to modern Europeans or Africans. Large number of genomic sites ensures confident positivity of the D-statistic.

Positive D-statistic with erors bars not overlapping zero implies a gene flow between Neandertals and modern Europeans.

Refs

D. Lawson, S. M., G. Hellenthal. (2012). Inference of population structure using dense haplotype data. PLoS Genet. 8(1):e1002453. https://doi.org/doi: 10.1371/journal.pgen.1002453
Korneliussen T, N. R., Albrechtsen A. (2014). ANGSD: Analysis of next generation sequencing data. BMC Bioinformatics 15(1):356. https://doi.org/doi: 10.1186/s12859-014-0356-4
Lawson DJ, F. D., van Dorp L. (2018). A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat Commun. 14;9(1):3258. https://doi.org/doi: 10.1038/s41467-018-05257-7
McVean. (2009). A genealogical interpretation of principal components analysis. PLoS Genetics Oct;5(10):e1000686. https://doi.org/doi: 10.1371/journal.pgen.1000686
Menozzi P, C.-S. L., Piazza A. (1978). Synthetic maps of human gene frequencies in europeans. Science 1;201(4358):786-92. https://doi.org/doi: 10.1126/science.356262
Novembre J, B. K., Johnson T. (2008). Genes mirror geography within europe. Nature. 2008 Nov 6;456(7218):98-101. https://doi.org/doi: 10.1038/nature07331
Peter, B. (2016). Admixture, population structure, and f-statistics. Genetics 202(4):1485-501. https://doi.org/doi: 10.1534/genetics.115.183913
Peter, B. (2022). A geometric relationship of F2, F3 and F4-statistics with principal component analysis. Philos Trans R Soc Lond B Biol Sci. 377(1852):20200413. https://doi.org/doi: 10.1098/rstb.2020.0413
Schraiber, J., & Akey, J. (2015). Methods and models for unravelling human evolutionary history. Nat Rev Genet. 16(12):727-40. https://doi.org/doi: 10.1038/nrg4005