VCF=variants.vcf.gz
PRUNE_IN=monkeyflower_pca.prune.in
DATAPFX=monkeyflower_adm
Admixture
About
Admixture models model the ancestry components of a set of samples, where the ancestry components consist of a pre-defined number of (source) populations. In this exercise, we will use the software ADMIXTURE
(Alexander et al., 2009) to model ancestry.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- run
ADMIXTURE
to infer population structure and individual ancestries
Choose one of Modules and Virtual environment to access relevant tools.
Modules
Execute the following command to load modules:
module load \
ADMIXTURE/1.3.0 plink/2.00a5.14 PDC/24.11 R/4.4.2-cpeGNU-24.11
Virtual environment
Run the pgip
initialization script and activate the pgip
default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate
Then activate the full
environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shell
Copy the contents to a file pixi.toml
in directory population-structure
, cd
to directory and activate environment with pixi shell
:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "population-structure"
platforms = ["linux-64", "osx-64"]
[dependencies]
admixture = ">=1.3.0,<2"
plink2 = ">=2.0.0a.6.9,<3"
r = ">=4.3,<4.5"
r-tidyverse = ">=2.0.0,<3"
Make an analysis directory population-structure
and cd to it:
mkdir -p population-structure && cd population-structure
Using rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv
to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .
Using pgip client
pgip exercises setup e-population-structure
Make an analysis directory population-structure
and cd to it:
mkdir -p population-structure && cd population-structure
Then use wget to retrieve data to analysis directory.
wget -r -np -nH -N --accept-regex all.variantsites* --cut-dirs=7 \
https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/
Admixture
Although clustering methods like Principal Component Analysis (PCA) cluster individuals, they provide limited information regarding the compositional makeup of genomes. Admixture modelling involves analyzing ancestry from multiple source populations, where individual allele frequency (P) is a mixture (Q) of source population Ps. Two approaches to this include estimating P from known populations and using the STRUCTURE tool, which applies Bayesian estimations directly from ‘G’ as proposed by (Pritchard et al., 2000). Ancestry is essentially the proportion of a genome sourced from specified groups or populations. The Admixture tool (Alexander et al., 2009) is another method used to estimate ancestry components from genotypes based on maximum likelihood. Its development was primarily motivated by the need to address population stratification, a common confounding factor in association studies.
In this exercise, you will use ADMIXTURE
to estimate ancestry components in Monkeyflower.
Data setup
We start by defining some variables. Note that ADMIXTURE
assumes indpendence among markers, which means we must first perform LD pruning. Follow the steps in the PCA to generate the input file monkeyflower_pca.prune.in
defined below.
Running ADMIXTURE
Given an appropriate input file, running ADMIXTURE
can be done in a few steps. ADMIXTURE
takes as input a plink bed (binary biallelic genotype table) file, which we generate with the option --make-bed
:
plink2 --vcf $VCF --allow-extra-chr \
--extract ${PRUNE_IN} \
--set-missing-var-ids @:# \
--make-bed \
--out ${DATAPFX} > /dev/null 2>&1
This command will generate three output files:
-
monkeyflower_adm.bed
– file containing representation of genotype calls at biallelic variants -
monkeyflower_adm.bim
– a map file which is a table of the markers, their positions, and the alleles -
monkeyflower_adm.fam
– a sample information file
Before running ADMIXTURE
, we need to modify the map file such that chromosome names are integers and not as now prefixed with LG
.
# ADMIXTURE only accepts integer chromosome names
sed -i -e "s/^\([A-Z0-9][A-Z0-9]*\)/0/g" ${DATAPFX}.bim
That is all we have to do! We next run ADMIXTURE
with cross-validation (default is 5-fold CV) and set the number of populations K=2
:
admixture --cv ${DATAPFX}.bed 2 > ${DATAPFX}.2.log
The output is two files: .Q
consists of two columns with cluster assignments for each individual, whereas .P
is the population allele frequencies. We want to compare models with different settings for the number of populations K
so we run a for loop from 3 to 10:
for k in {3..10}
do
admixture --cv ${DATAPFX}.bed $k > ${DATAPFX}.${k}.log
done
Each run has generated a cross validation error. We want to select the model with the smallest error and therefore extract and save the CV errors into an output file:
grep CV *log | cut -d " " -f 3,4 | sed -e "s/[()K=:]//g" > ${DATAPFX}.cv.err
R analyses
Now that we have the output data we turn to plotting the results in R. First load the necessary packages:
Next, we plot the CV errors.
df <- read.table(paste0(DATAPFX, ".cv.err"))
colnames(df) <- c("K", "CV")
ggplot(df, aes(x = K, y = CV)) + geom_point() + ylab("CV error")
Here, we look for the lowest value of K
, which for this example is 4 (NB: this may differ from your results since this is based on a smaller dataset!).
Now we can plot the admixture proportions. Before doing so, we add sample population information to the plink sample information file (.fam
):
sampleinfo <- read.csv("sampleinfo.csv") %>%
rename(sample = SampleAlias) %>%
mutate(species = as.factor(gsub("ssp. ", "", Taxon))) %>%
select(sample, ScientificName, Taxon, Latitude, Longitude, species) %>%
as_tibble
fam <- read.table(paste0(DATAPFX, ".fam")) %>%
select(2) %>%
rename(sample = V2) %>%
right_join(sampleinfo) %>%
as_tibble
head(fam)
# A tibble: 6 × 6
sample ScientificName Taxon Latitude Longitude species
<chr> <chr> <chr> <dbl> <dbl> <fct>
1 ARI-159_83 Diplacus aridus ssp. aridus 32.7 -116. aridus
2 ARI-159_84 Diplacus aridus ssp. aridus 32.7 -116. aridus
3 ARI-195_1 Diplacus aridus ssp. aridus 32.6 -116. aridus
4 ARI-T84 Diplacus aridus ssp. aridus 32.7 -116. aridus
5 AUR-T102 Diplacus aurantiacus ssp. aurantiacus 39.0 -123. aurantiac…
6 AUR-T104 Diplacus aurantiacus ssp. aurantiacus 39.2 -124. aurantiac…
Finally, we define a function to plot the admixture proportions. Without going into too much detail, the code below groups samples by populations in a “facet_grid” to facilitate interpretation.
plot_admixture <- function(filename, fam) {
df <- read.table(filename) %>%
rename_with(~paste0("pop", seq_along(.))) %>%
mutate(sample = fam$sample) %>%
left_join(fam) %>%
pivot_longer(cols = starts_with("pop"), names_to = "Population", values_to = "Q") %>%
mutate(across(Population, as.factor)) %>%
as_tibble
colors <- colorRampPalette(brewer.pal(12, "Set3"))(length(levels(df$Population)))
p <- ggplot(df, aes(x = sample, y = Q, fill = factor(Population))) + geom_col(aes(color = Population),
linewidth = 0.1) + facet_grid(~species, switch = "x", scales = "free", space = "free") +
labs(x = "Individual", y = "Q") + scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expansion(add = 1)) + theme(panel.spacing.x = unit(0.1,
"lines"), axis.text.x = element_text(angle = 45), panel.grid = element_blank(),
strip.text.x = element_text(angle = 0)) + scale_fill_manual("Population",
values = colors)
return(p)
}
We here include admixture plots for K=2
and K=4
.
It is important not to overinterpret admixture plots (Lawson et al., 2018), mainly due to the fact that different demographic histories can lead to the same result. The plots are good for detecting recent hybrid events, but fall short for inference of more complex demographic histories.
Also remember that different runs of ADMIXTURE
will generate different results because the initial mixing parameters are chosen at random. There are several methods for evaluating the robustness of the results, including Pong (Behr et al., 2016), Clumpak (Kopelman et al., 2015) and evaladmix (Garcia-Erill & Albrechtsen, 2020).
Things to try
If you have more time over, make some more plots with different values of K
. Do the results make sense?