VCF="variants.vcf.gz"
DATAPFX="monkeyflower_dstats"
D-statistics
About
Populations that diverge often exchange migrants during the splitting process. Populations may also come into secondary contact and hybridize after being separated over long times. In this exercise, we look at f- and D-statistics to detect patterns of hybridization and migration (introgression), as implemented in the package admixtools2
.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- use D-statistics to detect hybridization and introgression
Choose one of Modules and Virtual environment to access relevant tools.
Modules
Execute the following command to load modules:
module load \
PDC/24.11 R/4.4.2-cpeGNU-24.11 plink/2.00a5.14 bcftools/1.20
Virtual environment
Run the pgip
initialization script and activate the pgip
default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate
Then activate the full
environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shell
Copy the contents to a file pixi.toml
in directory population-structure
, cd
to directory and activate environment with pixi shell
:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "population-structure"
platforms = ["linux-64", "osx-64"]
[dependencies]
r = ">=4.3,<4.5"
plink2 = ">=2.0.0a.6.9,<3"
bcftools = ">=1.22,<2"
r-tidyverse = ">=2.0.0,<3"
r-devtools = ">=2.4.5,<3"
r-plotly = ">=4.11.0,<5"
Make an analysis directory population-structure
and cd to it:
mkdir -p population-structure && cd population-structure
Using rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv
to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .
Using pgip client
pgip exercises setup e-population-structure
Make an analysis directory population-structure
and cd to it:
mkdir -p population-structure && cd population-structure
Then use wget to retrieve data to analysis directory.
wget -r -np -nH -N --accept-regex all.variantsites* --cut-dirs=7 \
https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/
On D-stastitics and f-statistics
D and f-statistics are tools used in explorations of multiple-population histories, particularly in instances involving migration or hybridization that form reticulate networks, as opposed to bifurcating trees. D-statistics, also known as ABBA-BABA tests, are designed to infer signs of introgression.
On the other hand, f-statistics are crucial for detecting admixture from data. Specifically, f2 measures the total branch length between populations, while f3 tests whether one population (P1) is an admixture of two others (P2 and P3). Furthermore, f4 tests for the independence between two pairs of populations (P1, P2 and P3, P4).
These statistics can be effectively implemented using the comprehensive admixtools and admixtools2 packages. The packages offer an array of capabilities, the extent of which will only be slightly explored in this context. A more exhaustive overview can be found in the the admixtools tutorial.
Data setup
First initialize some variables.
Here we need to invoke R
from the get-go. Load relevant libraries and register the variables.
library(tidyr)
library(ggplot2)
library(viridis)
bw <- theme_bw(base_size = 18) %+replace% theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1))
theme_set(bw)
library(dplyr)
library(tibble)
library(admixtools)
VCF <- Sys.getenv("VCF")
DATAPFX <- Sys.getenv("DATAPFX")
admixtools2
needs a population input file consisting of two columns and is picky with regards to the format. Therefore, we need to refactor the sampleinfo file and select relevant columns.
sampleinfo <- read.csv("sampleinfo.csv") %>%
rename(sample = SampleAlias) %>%
mutate(species = as.factor(gsub("(ssp. |M. )", "", Taxon))) %>%
mutate(species = as.factor(gsub(", ", "_", species))) %>%
select(sample, species) %>%
as_tibble
write.table(sampleinfo, file = paste0(DATAPFX, ".pop"), row.names = FALSE, col.names = FALSE,
sep = "\t", quote = FALSE)
Preparing admixtools2
input files
We need to convert our input VCF to a plink
bed file. Some of the steps are a bit fiddly and it can be tricky to infer what the file formats look like from the documentation alone. Lucky for you, we have prepared a template sequence of commands so you don’t have to look up the details. Fasten your seatbelt and read on!
Convert VCF to plink ped format
To start with, admixtools2
requires integer-based chromosome names. We therefore reannotate the input VCF with bcftools
. Here, chrmap.txt
is a tabular file that maps old to new chromosome names.
for i in {3..4}; do echo -e "LG${i} ${i}"; done > chrmap.txt
bcftools annotate --rename-chrs chrmap.txt $VCF -o ${DATAPFX}.vcf.gz -O z -W tbi
We then convert the updated VCF to bed format.
plink2 --vcf ${DATAPFX}.vcf.gz --allow-extra-chr \
--set-missing-var-ids @:# \
--make-bed \
--out ${DATAPFX} > /dev/null 2>&1
Next, we make use of the .pop
file we generated at the beginning to add population information to the first column of the .fam
file, as well as modify the last column (phenotype) from -9
(missing data) to 1
(control); the latter needs to be set, else the sample is ignored.
mv ${DATAPFX}.fam ${DATAPFX}.orig.fam
csvtk join -H -t ${DATAPFX}.orig.fam ${DATAPFX}.pop -f "2;1" |
csvtk cut -t -f 7,2,3,4,5,6 |
csvtk replace -H -t -f 6 -r 1 > ${DATAPFX}.fam
head -n 3 ${DATAPFX}.fam
aridus ARI-159_83 0 0 0 1
aridus ARI-159_84 0 0 0 1
aridus ARI-195_1 0 0 0 1
Next, we convert bed to ped format, which is the required input for the EIGENSOFT package (Patterson et al., 2006) in the next section.
plink2 --bfile ${DATAPFX} --recode 12 ped \
--out ${DATAPFX} > /dev/null 2>&1
Here, --bfile
specifies a set of files with suffixes .bed
, .bim
and .fam
, whereas --recode
recodes nucleotides to numbers. Now you should have a ped file, which is the required format for the next step.
Convert PED to EIGENSTRAT
The EIGENSOFT package (Patterson et al., 2006) is a collection of tools for performing eigenanalysis of population data. The convertf
program converts between different file formats. We here want to convert from ped to EIGENSTRAT and thefore need to write a control parameter file:
cat <<EOF > par.PED.EIGENSTRAT
genotypename: monkeyflower_dstats.ped
snpname: monkeyflower_dstats.bim
indivname: monkeyflower_dstats.fam
outputformat: EIGENSTRAT
genotypeoutname: monkeyflower_dstats.geno
snpoutname: monkeyflower_dstats.snp
indivoutname: monkeyflower_dstats.orig.ind
EOF
The top three lines correspond to input files, whereas the bottom three are output files. Note that we add an .orig
tag to the indivoutname
as we (again) want to slightly modify the output later on. With this parameter file, convertf is run as
convertf -p par.PED.EIGENSTRAT > ${DATAPFX}.convertf.log 2>&1
We reformat the individual file to produce a desired output consisting of sample, gender (unknown) and population group label 1.
csvtk space2tab ${DATAPFX}.orig.ind |
csvtk cut -t -H -f 1,2,1 |
csvtk replace -t -H -f 1 -p "^.*:" |
csvtk replace -t -H -f 3 -p ":.*" > ${DATAPFX}.ind
head -n 3 ${DATAPFX}.ind
ARI-159_83 U aridus
ARI-159_84 U aridus
ARI-195_1 U aridus
Analyse data with admixtools2
Finally, after all the data juggling, we can start with the analyses!
Prepare f2_blocks
The speedup of admixtools2
compared with its predecessor owes much to the efficient use made of \(f_2\)-statistics. So much so in fact that it can be worthwhile to pre-calculate and store \(f_2\)-statistics, as we do below.
f2_dir <- "f2"
extract_f2(DATAPFX, f2_dir, verbose = FALSE, overwrite = TRUE)
f2_blocks <- f2_from_precomp(f2_dir, verbose = FALSE)
extract_f2
will compute and store blocked f2 statistics, where the size of each block is 0.05cM by default.
Plot average \(f_2\)-statistics
\(f_2\)-statistics measure the amount of genetic drift between two populations (i.e., the total branch length). We can quickly summarize the average \(f_2\) over all blocks and make a plot:
The structure of the plot could be improved by clustering on rows and columns, but clearly population comparisons including clevelandii have the highest values, as would be expected.
Calculate \(F_{\mathrm{ST}}\)
\(f_2\)-statistics is mathematically related to \(F_{\mathrm{ST}}\), and we can easily obtain \(F_{\mathrm{ST}}\) estimates with the fst
function and plot the outcome, obtaining a plot similar to the one above:
f3 statistics
FIXME!
f4 statistics (qpdstat)
The \(f_4\) statistic compares four populations P1, P2, P3, P4 and tests for independence (no admixture) between P1, P2 and P3, P4. Here we generate all 56 possible pairs of 4-taxon tests, setting clevelandii as the outgroup (P4).
allpops <- c("clevelandii", "grandiflorus", "aridus", "parviflorus", "longiflorus",
"calycinus", "aurantiacus", "puniceus_yellow", "puniceus_red")
pops <- matrix(allpops[2:9][combn(8, 3)], , ncol = 3, byrow = TRUE)[, 3:1]
pops <- cbind(pops, "clevelandii")
# f4 is synonym to qpdstat
res <- qpdstat(f2_blocks, pops)
res %>%
mutate(taxa = paste(pop1, pop2, pop3, sep = ",")) %>%
as_tibble %>%
arrange(est) %>%
mutate(taxa = factor(taxa, levels = as.character(taxa))) %>%
ggplot(aes(x = taxa, y = est)) + geom_point() + xlab("Taxa (P1, P2, P3)") + ylab("D-statistic (P1, P2; P3, O)")
If no permutations have \(f_4=0\), then there is no unadmixed tree that can explain the data.
qpgraph
FIXME!
g <- matrix(c("R", "R", "n1", "n1", "n2", "n3", "n3", "n2", "n4", "n4", "n5", "n6",
"n6", "n5", "n7", "n7", "clevelandii", "n1", "grandiflorus", "n2", "n3", "parviflorus",
"aridus", "n4", "aurantiacus", "n5", "n6", "calycinus", "longiflorus", "n7",
"puniceus-yellow", "puniceus-red"), , 2) %>%
edges_to_igraph()
plot_graph(g)
g <- matrix(c("R", "R", "n1", "n1", "n2", "n3", "n3", "n2", "n4", "n4", "n5", "n6",
"n6", "n5", "n7", "n7", "n6", "clevelandii", "n1", "grandiflorus", "n2", "n3",
"parviflorus", "aridus", "n4", "aurantiacus", "n5", "n6", "calycinus", "longiflorus",
"n7", "puniceus-yellow", "puniceus-red", "n7"), , 2) %>%
edges_to_igraph()
plot_graph(g)
Conclusions
That was a long exercise, but now you should have a feeling for how you can apply f-statistics to your data. Well done!
References
Footnotes
For more information about the EIGENSOFT file formats, see https://github.com/DReichLab/EIG/tree/master/CONVERTF↩︎