Admixture

Infer populations and individual ancestries

Author

Per Unneberg

Published

18-Sep-2025

About

Admixture models model the ancestry components of a set of samples, where the ancestry components consist of a pre-defined number of (source) populations. In this exercise, we will use the software ADMIXTURE (Alexander et al., 2009) to model ancestry.

Commands have been run on a subset of the data

The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.

Intended learning outcomes

run ADMIXTURE to infer population structure and individual ancestries

Tools

admixture (Alexander et al., 2009)
plink2 (Chang et al., 2015)
r

Choose one of Modules and Virtual environment to access relevant tools.

Modules

Execute the following command to load modules:

module load \ 
    ADMIXTURE/1.3.0 plink/2.00a5.14 PDC/24.11 R/4.4.2-cpeGNU-24.11

Virtual environment

Run the pgip initialization script and activate the pgip default environment:

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

Copy the contents to a file pixi.toml in directory population-structure, cd to directory and activate environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "population-structure"
platforms = ["linux-64", "osx-64"]

[dependencies]
admixture = ">=1.3.0,<2"
plink2 = ">=2.0.0a.6.9,<3"
r = ">=4.3,<4.5"
r-tidyverse = ">=2.0.0,<3"

Data setup

PDC
Local

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .

Using pgip client

pgip exercises setup e-population-structure

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Then use wget to retrieve data to analysis directory.

wget -r -np -nH -N --accept-regex all.variantsites* --cut-dirs=7 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/

Admixture

Although clustering methods like Principal Component Analysis (PCA) cluster individuals, they provide limited information regarding the compositional makeup of genomes. Admixture modelling involves analyzing ancestry from multiple source populations, where individual allele frequency (P) is a mixture (Q) of source population Ps. Two approaches to this include estimating P from known populations and using the STRUCTURE tool, which applies Bayesian estimations directly from ‘G’ as proposed by (Pritchard et al., 2000). Ancestry is essentially the proportion of a genome sourced from specified groups or populations. The Admixture tool (Alexander et al., 2009) is another method used to estimate ancestry components from genotypes based on maximum likelihood. Its development was primarily motivated by the need to address population stratification, a common confounding factor in association studies.

In this exercise, you will use ADMIXTURE to estimate ancestry components in Monkeyflower.

Data setup

We start by defining some variables. Note that ADMIXTURE assumes indpendence among markers, which means we must first perform LD pruning. Follow the steps in the PCA to generate the input file monkeyflower_pca.prune.in defined below.

VCF=variants.vcf.gz
PRUNE_IN=monkeyflower_pca.prune.in
DATAPFX=monkeyflower_adm

Running `ADMIXTURE`

Given an appropriate input file, running ADMIXTURE can be done in a few steps. ADMIXTURE takes as input a plink bed (binary biallelic genotype table) file, which we generate with the option --make-bed:

plink2 --vcf $VCF --allow-extra-chr \
       --extract ${PRUNE_IN} \
       --set-missing-var-ids @:# \
       --make-bed \
       --out ${DATAPFX} > /dev/null 2>&1

This command will generate three output files:

monkeyflower_adm.bed – file containing representation of genotype calls at biallelic variants
monkeyflower_adm.bim – a map file which is a table of the markers, their positions, and the alleles
monkeyflower_adm.fam – a sample information file

Before running ADMIXTURE, we need to modify the map file such that chromosome names are integers and not as now prefixed with LG.

# ADMIXTURE only accepts integer chromosome names
sed -i -e "s/^\([A-Z0-9][A-Z0-9]*\)/0/g" ${DATAPFX}.bim

That is all we have to do! We next run ADMIXTURE with cross-validation (default is 5-fold CV) and set the number of populations K=2:

admixture --cv ${DATAPFX}.bed 2 > ${DATAPFX}.2.log

The output is two files: .Q consists of two columns with cluster assignments for each individual, whereas .P is the population allele frequencies. We want to compare models with different settings for the number of populations K so we run a for loop from 3 to 10:

for k in {3..10}
do
    admixture --cv ${DATAPFX}.bed $k > ${DATAPFX}.${k}.log
done

Each run has generated a cross validation error. We want to select the model with the smallest error and therefore extract and save the CV errors into an output file:

grep CV *log | cut -d " " -f 3,4 | sed -e "s/[()K=:]//g" > ${DATAPFX}.cv.err

R analyses

Now that we have the output data we turn to plotting the results in R. First load the necessary packages:

library(tidyr)
library(ggplot2)
library(viridis)
bw <- theme_bw(base_size = 18) %+replace% theme(axis.text.x = element_text(angle = 45,
    hjust = 1, vjust = 1))
theme_set(bw)

library(dplyr)
library(tibble)
library(RColorBrewer)

DATAPFX <- "monkeyflower_adm"

Next, we plot the CV errors.

df <- read.table(paste0(DATAPFX, ".cv.err"))
colnames(df) <- c("K", "CV")
ggplot(df, aes(x = K, y = CV)) + geom_point() + ylab("CV error")

Here, we look for the lowest value of K, which for this example is 4 (NB: this may differ from your results since this is based on a smaller dataset!).

Now we can plot the admixture proportions. Before doing so, we add sample population information to the plink sample information file (.fam):

sampleinfo <- read.csv("sampleinfo.csv") %>%
    rename(sample = SampleAlias) %>%
    mutate(species = as.factor(gsub("ssp. ", "", Taxon))) %>%
    select(sample, ScientificName, Taxon, Latitude, Longitude, species) %>%
    as_tibble
fam <- read.table(paste0(DATAPFX, ".fam")) %>%
    select(2) %>%
    rename(sample = V2) %>%
    right_join(sampleinfo) %>%
    as_tibble
head(fam)

# A tibble: 6 × 6
  sample     ScientificName       Taxon            Latitude Longitude species   
  <chr>      <chr>                <chr>               <dbl>     <dbl> <fct>     
1 ARI-159_83 Diplacus aridus      ssp. aridus          32.7     -116. aridus    
2 ARI-159_84 Diplacus aridus      ssp. aridus          32.7     -116. aridus    
3 ARI-195_1  Diplacus aridus      ssp. aridus          32.6     -116. aridus    
4 ARI-T84    Diplacus aridus      ssp. aridus          32.7     -116. aridus    
5 AUR-T102   Diplacus aurantiacus ssp. aurantiacus     39.0     -123. aurantiac…
6 AUR-T104   Diplacus aurantiacus ssp. aurantiacus     39.2     -124. aurantiac…

Finally, we define a function to plot the admixture proportions. Without going into too much detail, the code below groups samples by populations in a “facet_grid” to facilitate interpretation.

plot_admixture <- function(filename, fam) {
    df <- read.table(filename) %>%
        rename_with(~paste0("pop", seq_along(.))) %>%
        mutate(sample = fam$sample) %>%
        left_join(fam) %>%
        pivot_longer(cols = starts_with("pop"), names_to = "Population", values_to = "Q") %>%
        mutate(across(Population, as.factor)) %>%
        as_tibble
    colors <- colorRampPalette(brewer.pal(12, "Set3"))(length(levels(df$Population)))
    p <- ggplot(df, aes(x = sample, y = Q, fill = factor(Population))) + geom_col(aes(color = Population),
        linewidth = 0.1) + facet_grid(~species, switch = "x", scales = "free", space = "free") +
        labs(x = "Individual", y = "Q") + scale_y_continuous(expand = c(0, 0)) +
        scale_x_discrete(expand = expansion(add = 1)) + theme(panel.spacing.x = unit(0.1,
        "lines"), axis.text.x = element_text(angle = 45), panel.grid = element_blank(),
        strip.text.x = element_text(angle = 0)) + scale_fill_manual("Population",
        values = colors)
    return(p)
}

We here include admixture plots for K=2 and K=4.

plot_admixture(paste0(DATAPFX, ".2.Q"), fam)

plot_admixture(paste0(DATAPFX, ".4.Q"), fam)

It is important not to overinterpret admixture plots (Lawson et al., 2018), mainly due to the fact that different demographic histories can lead to the same result. The plots are good for detecting recent hybrid events, but fall short for inference of more complex demographic histories.

Also remember that different runs of ADMIXTURE will generate different results because the initial mixing parameters are chosen at random. There are several methods for evaluating the robustness of the results, including Pong (Behr et al., 2016), Clumpak (Kopelman et al., 2015) and evaladmix (Garcia-Erill & Albrechtsen, 2020).

Things to try

If you have more time over, make some more plots with different values of K. Do the results make sense?

References

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), 1655–1664. https://doi.org/10.1101/gr.094052.109

Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P., & Ramachandran, S. (2016). Pong: Fast analysis and visualization of latent clusters in population genetic data. Bioinformatics, 32(18), 2817–2823. https://doi.org/10.1093/bioinformatics/btw327

Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience, 4(1), s13742-015-0047-8. https://doi.org/10.1186/s13742-015-0047-8

Garcia-Erill, G., & Albrechtsen, A. (2020). Evaluation of model fit of inferred admixture proportions. Molecular Ecology Resources, 20(4), 936–949. https://doi.org/10.1111/1755-0998.13171

Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., & Mayrose, I. (2015). Clumpak: A program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources, 15(5), 1179–1191. https://doi.org/10.1111/1755-0998.12387

Lawson, D. J., van Dorp, L., & Falush, D. (2018). A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nature Communications, 9(1), 3258. https://doi.org/10.1038/s41467-018-05257-7

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1461096/

About

Modules

Virtual environment

Using rsync

Using pgip client

Admixture

Data setup

Running ADMIXTURE

R analyses

Things to try

References

Running `ADMIXTURE`