D-statistics

Infer introgression and hybridization with D-statistics

Author

Per Unneberg

Published

18-Sep-2025

About

Populations that diverge often exchange migrants during the splitting process. Populations may also come into secondary contact and hybridize after being separated over long times. In this exercise, we look at f- and D-statistics to detect patterns of hybridization and migration (introgression), as implemented in the package admixtools2.

Commands have been run on a subset of the data

The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.

Intended learning outcomes

use D-statistics to detect hybridization and introgression

Tools

r
plink2 (Chang et al., 2015)
bcftools (Danecek et al., 2021)
admixtools2 (Maier et al., 2023)

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load \ 
    PDC/24.11 R/4.4.2-cpeGNU-24.11 plink/2.00a5.14 bcftools/1.20

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory population-structure, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "population-structure"
platforms = ["linux-64", "osx-64"]

[dependencies]
r = ">=4.3,<4.5"
plink2 = ">=2.0.0a.6.9,<3"
bcftools = ">=1.22,<2"
r-tidyverse = ">=2.0.0,<3"
r-devtools = ">=2.4.5,<3"
r-plotly = ">=4.11.0,<5"

Data setup

PDC
Local

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .

Using pgip client

pgip exercises setup e-population-structure

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --accept-regex all.variantsites* --cut-dirs=7 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/

On D-stastitics and f-statistics

D and f-statistics are tools used in explorations of multiple-population histories, particularly in instances involving migration or hybridization that form reticulate networks, as opposed to bifurcating trees. D-statistics, also known as ABBA-BABA tests, are designed to infer signs of introgression.

On the other hand, f-statistics are crucial for detecting admixture from data. Specifically, f2 measures the total branch length between populations, while f3 tests whether one population (P1) is an admixture of two others (P2 and P3). Furthermore, f4 tests for the independence between two pairs of populations (P1, P2 and P3, P4).

These statistics can be effectively implemented using the comprehensive admixtools and admixtools2 packages. The packages offer an array of capabilities, the extent of which will only be slightly explored in this context. A more exhaustive overview can be found in the the admixtools tutorial.

Data setup

First initialize some variables.

VCF="variants.vcf.gz"
DATAPFX="monkeyflower_dstats"

Here we need to invoke R from the get-go. Load relevant libraries and register the variables.

library(tidyr)
library(ggplot2)
library(viridis)
bw <- theme_bw(base_size = 18) %+replace% theme(axis.text.x = element_text(angle = 45,
    hjust = 1, vjust = 1))
theme_set(bw)

library(dplyr)
library(tibble)
library(admixtools)

VCF <- Sys.getenv("VCF")
DATAPFX <- Sys.getenv("DATAPFX")

admixtools2 needs a population input file consisting of two columns and is picky with regards to the format. Therefore, we need to refactor the sampleinfo file and select relevant columns.

sampleinfo <- read.csv("sampleinfo.csv") %>%
    rename(sample = SampleAlias) %>%
    mutate(species = as.factor(gsub("(ssp. |M. )", "", Taxon))) %>%
    mutate(species = as.factor(gsub(", ", "_", species))) %>%
    select(sample, species) %>%
    as_tibble
write.table(sampleinfo, file = paste0(DATAPFX, ".pop"), row.names = FALSE, col.names = FALSE,
    sep = "\t", quote = FALSE)

Preparing `admixtools2` input files

We need to convert our input VCF to a plink bed file. Some of the steps are a bit fiddly and it can be tricky to infer what the file formats look like from the documentation alone. Lucky for you, we have prepared a template sequence of commands so you don’t have to look up the details. Fasten your seatbelt and read on!

Convert VCF to plink ped format

To start with, admixtools2 requires integer-based chromosome names. We therefore reannotate the input VCF with bcftools. Here, chrmap.txt is a tabular file that maps old to new chromosome names.

for i in {3..4}; do echo -e "LG${i} ${i}"; done > chrmap.txt
bcftools annotate --rename-chrs chrmap.txt $VCF -o ${DATAPFX}.vcf.gz -O z -W tbi

We then convert the updated VCF to bed format.

plink2 --vcf ${DATAPFX}.vcf.gz --allow-extra-chr \
       --set-missing-var-ids @:# \
       --make-bed \
       --out ${DATAPFX} > /dev/null 2>&1

Next, we make use of the .pop file we generated at the beginning to add population information to the first column of the .fam file, as well as modify the last column (phenotype) from -9 (missing data) to 1 (control); the latter needs to be set, else the sample is ignored.

mv ${DATAPFX}.fam ${DATAPFX}.orig.fam
csvtk join -H -t ${DATAPFX}.orig.fam ${DATAPFX}.pop -f "2;1" |
    csvtk cut -t -f 7,2,3,4,5,6 |
    csvtk replace -H -t -f 6 -r 1 > ${DATAPFX}.fam
head -n 3 ${DATAPFX}.fam

aridus  ARI-159_83  0   0   0   1
aridus  ARI-159_84  0   0   0   1
aridus  ARI-195_1   0   0   0   1

Next, we convert bed to ped format, which is the required input for the EIGENSOFT package (Patterson et al., 2006) in the next section.

plink2 --bfile ${DATAPFX} --recode 12 ped \
       --out ${DATAPFX} > /dev/null 2>&1

Here, --bfile specifies a set of files with suffixes .bed, .bim and .fam, whereas --recode recodes nucleotides to numbers. Now you should have a ped file, which is the required format for the next step.

Convert PED to EIGENSTRAT

The EIGENSOFT package (Patterson et al., 2006) is a collection of tools for performing eigenanalysis of population data. The convertf program converts between different file formats. We here want to convert from ped to EIGENSTRAT and thefore need to write a control parameter file:

cat <<EOF  > par.PED.EIGENSTRAT
genotypename:    monkeyflower_dstats.ped
snpname:         monkeyflower_dstats.bim
indivname:       monkeyflower_dstats.fam
outputformat:    EIGENSTRAT
genotypeoutname: monkeyflower_dstats.geno
snpoutname:      monkeyflower_dstats.snp
indivoutname:    monkeyflower_dstats.orig.ind
EOF

The top three lines correspond to input files, whereas the bottom three are output files. Note that we add an .orig tag to the indivoutname as we (again) want to slightly modify the output later on. With this parameter file, convertf is run as

convertf -p par.PED.EIGENSTRAT > ${DATAPFX}.convertf.log 2>&1

We reformat the individual file to produce a desired output consisting of sample, gender (unknown) and population group label ¹.

csvtk space2tab ${DATAPFX}.orig.ind |
    csvtk cut -t -H -f 1,2,1 |
    csvtk replace -t -H -f 1 -p "^.*:" |
    csvtk replace -t -H -f 3 -p ":.*" > ${DATAPFX}.ind
head -n 3 ${DATAPFX}.ind

ARI-159_83  U   aridus
ARI-159_84  U   aridus
ARI-195_1   U   aridus

Analyse data with `admixtools2`

Finally, after all the data juggling, we can start with the analyses!

Prepare f2_blocks

The speedup of admixtools2 compared with its predecessor owes much to the efficient use made of \(f_2\)-statistics. So much so in fact that it can be worthwhile to pre-calculate and store \(f_2\)-statistics, as we do below.

f2_dir <- "f2"
extract_f2(DATAPFX, f2_dir, verbose = FALSE, overwrite = TRUE)
f2_blocks <- f2_from_precomp(f2_dir, verbose = FALSE)

extract_f2 will compute and store blocked f2 statistics, where the size of each block is 0.05cM by default.

Plot average \(f_2\)-statistics

\(f_2\)-statistics measure the amount of genetic drift between two populations (i.e., the total branch length). We can quickly summarize the average \(f_2\) over all blocks and make a plot:

apply(f2_blocks, 1:2, mean) %>%
    as_tibble %>%
    mutate(pop1 = colnames(.), .before = 1) %>%
    pivot_longer(!pop1, names_to = "pop2") %>%
    ggplot(aes(x = pop1, y = pop2, fill = value)) + geom_tile() + scale_fill_viridis()

The structure of the plot could be improved by clustering on rows and columns, but clearly population comparisons including clevelandii have the highest values, as would be expected.

Calculate \(F_{\mathrm{ST}}\)

\(f_2\)-statistics is mathematically related to \(F_{\mathrm{ST}}\), and we can easily obtain \(F_{\mathrm{ST}}\) estimates with the fst function and plot the outcome, obtaining a plot similar to the one above:

fst(f2_blocks) %>%
    ggplot(aes(x = pop1, y = pop2, fill = est)) + geom_tile() + scale_fill_viridis()

f3 statistics

FIXME!

f4 statistics (qpdstat)

The \(f_4\) statistic compares four populations P1, P2, P3, P4 and tests for independence (no admixture) between P1, P2 and P3, P4. Here we generate all 56 possible pairs of 4-taxon tests, setting clevelandii as the outgroup (P4).

allpops <- c("clevelandii", "grandiflorus", "aridus", "parviflorus", "longiflorus",
    "calycinus", "aurantiacus", "puniceus_yellow", "puniceus_red")
pops <- matrix(allpops[2:9][combn(8, 3)], , ncol = 3, byrow = TRUE)[, 3:1]
pops <- cbind(pops, "clevelandii")
# f4 is synonym to qpdstat
res <- qpdstat(f2_blocks, pops)
res %>%
    mutate(taxa = paste(pop1, pop2, pop3, sep = ",")) %>%
    as_tibble %>%
    arrange(est) %>%
    mutate(taxa = factor(taxa, levels = as.character(taxa))) %>%
    ggplot(aes(x = taxa, y = est)) + geom_point() + xlab("Taxa (P1, P2, P3)") + ylab("D-statistic (P1, P2; P3, O)")

Figure 1: Genome-wide estimates of Patterson’s f4-statistic (this is similar to the D-statistic or ABBA-BABA test) for all possible pairs of 4-taxon tests. The statistics have been sorted in ascending order.

If no permutations have \(f_4=0\), then there is no unadmixed tree that can explain the data.

qpgraph

FIXME!

g <- matrix(c("R", "R", "n1", "n1", "n2", "n3", "n3", "n2", "n4", "n4", "n5", "n6",
    "n6", "n5", "n7", "n7", "clevelandii", "n1", "grandiflorus", "n2", "n3", "parviflorus",
    "aridus", "n4", "aurantiacus", "n5", "n6", "calycinus", "longiflorus", "n7",
    "puniceus-yellow", "puniceus-red"), , 2) %>%
    edges_to_igraph()
plot_graph(g)

g <- matrix(c("R", "R", "n1", "n1", "n2", "n3", "n3", "n2", "n4", "n4", "n5", "n6",
    "n6", "n5", "n7", "n7", "n6", "clevelandii", "n1", "grandiflorus", "n2", "n3",
    "parviflorus", "aridus", "n4", "aurantiacus", "n5", "n6", "calycinus", "longiflorus",
    "n7", "puniceus-yellow", "puniceus-red", "n7"), , 2) %>%
    edges_to_igraph()
plot_graph(g)

Conclusions

That was a long exercise, but now you should have a feeling for how you can apply f-statistics to your data. Well done!

References

Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience, 4(1), s13742-015-0047-8. https://doi.org/10.1186/s13742-015-0047-8

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

Maier, R., Flegontov, P., Flegontova, O., Işıldak, U., Changmai, P., & Reich, D. (2023). On the limits of fitting complex models of population history to f-statistics. eLife, 12, e85492. https://doi.org/10.7554/eLife.85492

Patterson, N., Price, A. L., & Reich, D. (2006). Population Structure and Eigenanalysis. PLOS Genetics, 2(12), e190. https://doi.org/10.1371/journal.pgen.0020190

Footnotes

For more information about the EIGENSOFT file formats, see https://github.com/DReichLab/EIG/tree/master/CONVERTF ↩︎

About

Modules

Virtual environment

Using rsync

Using pgip client

On D-stastitics and f-statistics

Data setup

Preparing admixtools2 input files

Convert VCF to plink ped format

Convert PED to EIGENSTRAT

Analyse data with admixtools2

Prepare f2_blocks

Plot average \(f_2\)-statistics

Calculate \(F_{\mathrm{ST}}\)

f3 statistics

f4 statistics (qpdstat)

qpgraph

Conclusions

References

Footnotes

Preparing `admixtools2` input files

Analyse data with `admixtools2`