Population Genomics in Practice 2025

Admixture

Infer populations and individual ancestries
Author: Per Unneberg

Published: 18-Sep-2025

About

Admixture models describe each sample's genome as a mixture of ancestry components drawn from a predefined number of (source) populations. In this exercise, we will use the software ADMIXTURE (Alexander et al., 2009) to model ancestry.

Commands have been run on a subset of the data

The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.

Intended learning outcomes
  • run ADMIXTURE to infer population structure and individual ancestries

Tools
  • admixture (Alexander et al., 2009)
  • plink2 (Chang et al., 2015)
  • r

PDC

Choose either Modules or Virtual environment to access the relevant tools.

Modules

Execute the following command to load modules:

module load \
    ADMIXTURE/1.3.0 plink/2.00a5.14 PDC/24.11 R/4.4.2-cpeGNU-24.11

Virtual environment

Run the pgip initialization script (the separate pgip_activate step is now obsolete):

source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate

Then activate the full environment:

# pgip_shell calls pixi shell -e full --as-is
pgip_shell

pixi

Copy the contents below to a file pixi.toml in the directory population-structure, cd to that directory, and activate the environment with pixi shell:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "population-structure"
platforms = ["linux-64", "osx-64"]

[dependencies]
admixture = ">=1.3.0,<2"
plink2 = ">=2.0.0a.6.9,<3"
r = ">=4.3,<4.5"
r-tidyverse = ">=2.0.0,<3"

Data setup

PDC

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Using rsync

Use rsync to sync data to your analysis directory (hint: first use options -anv to run the command without actually copying data):

# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .

Using pgip client

pgip exercises setup e-population-structure

Local

Make an analysis directory population-structure and cd to it:

mkdir -p population-structure && cd population-structure

Then use wget to retrieve the data to the analysis directory. The regular expression is quoted so that the shell does not expand it as a file glob:

wget -r -np -nH -N --accept-regex 'all.variantsites*' --cut-dirs=7 \
     https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/

Admixture

Although methods like Principal Component Analysis (PCA) group individuals by genetic similarity, they provide limited information about the ancestral composition of genomes. Admixture modelling analyzes ancestry from multiple source populations, where an individual's allele frequency (P) is a mixture, with proportions Q, of the source populations' allele frequencies. Two approaches are to estimate P from known populations, or to use the STRUCTURE tool, which applies Bayesian estimation directly to the genotypes (G), as proposed by Pritchard et al. (2000). Ancestry is essentially the proportion of a genome derived from specified groups or populations. ADMIXTURE (Alexander et al., 2009) is another tool, which estimates ancestry components from genotypes by maximum likelihood. Its development was primarily motivated by the need to address population stratification, a common confounding factor in association studies.
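
The mixture idea in the paragraph above can be written out as a short worked formula. This is a sketch in the notation of Alexander et al. (2009), not something defined elsewhere in this exercise: $g_{ij}$ is the number of copies of one allele carried by individual $i$ at marker $j$, $q_{ik}$ is the fraction of individual $i$'s genome assigned to population $k$ (the Q matrix), and $f_{kj}$ is the frequency of that allele in population $k$ (the P matrix).

```latex
\begin{align}
  \Pr(\text{allele copy at marker } j \text{ in individual } i)
    &= \sum_{k=1}^{K} q_{ik} f_{kj},
    \qquad q_{ik} \ge 0, \quad \sum_{k=1}^{K} q_{ik} = 1 \\
  \ell(Q, F)
    &= \sum_{i} \sum_{j} \left[
         g_{ij} \ln\!\Big( \sum_{k} q_{ik} f_{kj} \Big)
         + (2 - g_{ij}) \ln\!\Big( 1 - \sum_{k} q_{ik} f_{kj} \Big)
       \right]
\end{align}
```

Maximum-likelihood estimation searches for the $Q$ and $F$ that maximize $\ell$ for a fixed number of populations $K$, which is why $K$ has to be supplied on the command line.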

In this exercise, you will use ADMIXTURE to estimate ancestry components in Monkeyflower.

Data setup

We start by defining some variables. Note that ADMIXTURE assumes independence among markers, which means we must first perform LD pruning. Follow the steps in the PCA exercise to generate the input file monkeyflower_pca.prune.in defined below.

VCF=variants.vcf.gz
PRUNE_IN=monkeyflower_pca.prune.in
DATAPFX=monkeyflower_adm

Running ADMIXTURE

Given an appropriate input file, running ADMIXTURE can be done in a few steps. ADMIXTURE takes as input a plink bed (binary biallelic genotype table) file, which we generate with the option --make-bed:

plink2 --vcf $VCF --allow-extra-chr \
       --extract ${PRUNE_IN} \
       --set-missing-var-ids @:# \
       --make-bed \
       --out ${DATAPFX} > /dev/null 2>&1

This command will generate three output files:

  • monkeyflower_adm.bed – file containing representation of genotype calls at biallelic variants
  • monkeyflower_adm.bim – a map file which is a table of the markers, their positions, and the alleles
  • monkeyflower_adm.fam – a sample information file

Before running ADMIXTURE, we need to modify the map file so that chromosome names are integers rather than, as now, prefixed with LG. The command below does this by replacing every chromosome name in the first column with 0.

# ADMIXTURE only accepts integer chromosome names
sed -i -e "s/^\([A-Z0-9][A-Z0-9]*\)/0/g" ${DATAPFX}.bim
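
To see what the substitution does before applying it in place, it can be tried on a small copy. The sketch below uses a made-up two-line .bim file (the variant IDs follow the @:# pattern set with --set-missing-var-ids; the positions and alleles are invented) and writes the result to a new file rather than using -i.

```shell
# Toy .bim file (tab-separated): chromosome, variant ID, cM position, bp position, allele 1, allele 2.
# The contents are made up for illustration.
tmpdir=$(mktemp -d)
printf 'LG4\tLG4:1021\t0\t1021\tA\tG\nLG4\tLG4:5310\t0\t5310\tC\tT\n' > "$tmpdir/toy.bim"

# Same substitution as in the exercise, but written to a new file instead of editing in place
sed -e "s/^\([A-Z0-9][A-Z0-9]*\)/0/g" "$tmpdir/toy.bim" > "$tmpdir/toy.fixed.bim"

# List the distinct chromosome codes and the first variant ID after the edit
chrom_codes=$(cut -f 1 "$tmpdir/toy.fixed.bim" | sort -u)
first_id=$(cut -f 2 "$tmpdir/toy.fixed.bim" | head -n 1)
echo "chromosome codes: $chrom_codes"
echo "first variant ID: $first_id"
```

Because the pattern is anchored at the start of the line, only the first column changes; the variant IDs in the second column keep their LG prefix.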

That is all we have to do! We next run ADMIXTURE with cross-validation (default is 5-fold CV) and set the number of populations K=2:

admixture --cv ${DATAPFX}.bed 2 > ${DATAPFX}.2.log

The output is two files: the .Q file has one row per individual and one column per population (here two) and holds the estimated ancestry proportions, whereas the .P file holds the allele frequencies of the inferred populations. We want to compare models with different values of K, so we run a for loop from 3 to 10:

for k in {3..10}
do
    admixture --cv ${DATAPFX}.bed $k > ${DATAPFX}.${k}.log
done
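
Each row of a .Q file describes one individual, and its values are ancestry proportions that should sum to 1 up to rounding. A quick sanity check can be sketched as follows; the file toy.2.Q and its numbers are invented for illustration, and the tolerance of 1e-4 is an arbitrary choice.

```shell
tmpdir=$(mktemp -d)
# Made-up ancestry proportions for three individuals at K=2 (not real ADMIXTURE output)
printf '0.999990 0.000010\n0.250000 0.750000\n0.512345 0.487655\n' > "$tmpdir/toy.2.Q"

# Count rows whose proportions deviate from 1 by more than a small tolerance
bad_rows=$(awk '{
    s = 0
    for (i = 1; i <= NF; i++) s += $i
    d = s - 1
    if (d < 0) d = -d
    if (d > 1e-4) n++
} END { print n + 0 }' "$tmpdir/toy.2.Q")
echo "rows not summing to 1: $bad_rows"
```

To run the same check on real output, point awk at a file such as ${DATAPFX}.2.Q instead of the toy file.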

Each run has generated a cross validation error. We want to select the model with the smallest error and therefore extract and save the CV errors into an output file:

grep CV *log | cut -d " " -f 3,4 | sed -e "s/[()K=:]//g" > ${DATAPFX}.cv.err
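
The pipeline above can be traced on a few example lines. The sketch below assumes that grep returns lines of the form "file name:CV error (K=...): value"; the error values are invented. It then adds one step that is not part of the exercise: sorting on the error column to report the K with the smallest CV error.

```shell
tmpdir=$(mktemp -d)
# Made-up lines in the format that 'grep CV *log' is assumed to return; the error values are invented
cat > "$tmpdir/grep_output.txt" <<'EOF'
monkeyflower_adm.10.log:CV error (K=10): 0.81234
monkeyflower_adm.2.log:CV error (K=2): 0.65432
monkeyflower_adm.3.log:CV error (K=3): 0.60123
monkeyflower_adm.4.log:CV error (K=4): 0.58765
EOF

# Same cut and sed steps as in the exercise: keep fields 3 and 4, then strip ( ) K = :
cut -d " " -f 3,4 "$tmpdir/grep_output.txt" | sed -e "s/[()K=:]//g" > "$tmpdir/toy.cv.err"
cat "$tmpdir/toy.cv.err"

# Sort numerically on the CV error (column 2) and report the K with the smallest error
best_k=$(LC_ALL=C sort -k2,2n "$tmpdir/toy.cv.err" | head -n 1 | cut -d " " -f 1)
echo "K with lowest CV error: $best_k"
```

Note that the shell lists the log for K=10 before the log for K=2, so the rows of the .cv.err file are not ordered by K. This does not matter for plotting, since K is read as a number.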

R analyses

Now that we have the output data we turn to plotting the results in R. First load the necessary packages:

library(tidyr)
library(ggplot2)
library(viridis)
bw <- theme_bw(base_size = 18) %+replace% theme(axis.text.x = element_text(angle = 45,
    hjust = 1, vjust = 1))
theme_set(bw)

library(dplyr)
library(tibble)
library(RColorBrewer)

DATAPFX <- "monkeyflower_adm"

Next, we plot the CV errors.

df <- read.table(paste0(DATAPFX, ".cv.err"))
colnames(df) <- c("K", "CV")
ggplot(df, aes(x = K, y = CV)) + geom_point() + ylab("CV error")

Here, we look for the value of K with the lowest CV error, which for this example is 4 (NB: this may differ from your results since this is based on a smaller dataset!).

Now we can plot the admixture proportions. Before doing so, we add sample population information to the plink sample information file (.fam):

sampleinfo <- read.csv("sampleinfo.csv") %>%
    rename(sample = SampleAlias) %>%
    mutate(species = as.factor(gsub("ssp. ", "", Taxon))) %>%
    select(sample, ScientificName, Taxon, Latitude, Longitude, species) %>%
    as_tibble
fam <- read.table(paste0(DATAPFX, ".fam")) %>%
    select(2) %>%
    rename(sample = V2) %>%
    # left_join keeps the .fam row order, which matches the row order of the .Q files
    left_join(sampleinfo, by = "sample") %>%
    as_tibble
head(fam)
# A tibble: 6 × 6
  sample     ScientificName       Taxon            Latitude Longitude species   
  <chr>      <chr>                <chr>               <dbl>     <dbl> <fct>     
1 ARI-159_83 Diplacus aridus      ssp. aridus          32.7     -116. aridus    
2 ARI-159_84 Diplacus aridus      ssp. aridus          32.7     -116. aridus    
3 ARI-195_1  Diplacus aridus      ssp. aridus          32.6     -116. aridus    
4 ARI-T84    Diplacus aridus      ssp. aridus          32.7     -116. aridus    
5 AUR-T102   Diplacus aurantiacus ssp. aurantiacus     39.0     -123. aurantiac…
6 AUR-T104   Diplacus aurantiacus ssp. aurantiacus     39.2     -124. aurantiac…

Finally, we define a function to plot the admixture proportions. Without going into too much detail, the code below groups samples by populations in a “facet_grid” to facilitate interpretation.

plot_admixture <- function(filename, fam) {
    df <- read.table(filename) %>%
        rename_with(~paste0("pop", seq_along(.))) %>%
        mutate(sample = fam$sample) %>%
        left_join(fam) %>%
        pivot_longer(cols = starts_with("pop"), names_to = "Population", values_to = "Q") %>%
        mutate(across(Population, as.factor)) %>%
        as_tibble
    colors <- colorRampPalette(brewer.pal(12, "Set3"))(length(levels(df$Population)))
    p <- ggplot(df, aes(x = sample, y = Q, fill = factor(Population))) + geom_col(aes(color = Population),
        linewidth = 0.1) + facet_grid(~species, switch = "x", scales = "free", space = "free") +
        labs(x = "Individual", y = "Q") + scale_y_continuous(expand = c(0, 0)) +
        scale_x_discrete(expand = expansion(add = 1)) + theme(panel.spacing.x = unit(0.1,
        "lines"), axis.text.x = element_text(angle = 45), panel.grid = element_blank(),
        strip.text.x = element_text(angle = 0)) + scale_fill_manual("Population",
        values = colors)
    return(p)
}

Here we include admixture plots for K=2 and K=4.

plot_admixture(paste0(DATAPFX, ".2.Q"), fam)

plot_admixture(paste0(DATAPFX, ".4.Q"), fam)
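
To inspect the numbers behind a bar plot, the sample names in the .fam file can be placed next to the rows of the .Q file, which follow the same order. The sketch below uses toy files: the two sample names are taken from the table above, while the proportions and the file names toy.fam and toy.2.Q are invented.

```shell
tmpdir=$(mktemp -d)
# Toy .fam (family ID, sample ID, father, mother, sex, phenotype) and a matching toy .Q file
printf '0\tARI-159_83\t0\t0\t0\t-9\n0\tAUR-T102\t0\t0\t0\t-9\n' > "$tmpdir/toy.fam"
printf '0.990000 0.010000\n0.200000 0.800000\n' > "$tmpdir/toy.2.Q"

# Column 2 of the .fam file is the sample ID; paste joins the two inputs line by line
awk '{ print $2 }' "$tmpdir/toy.fam" | paste -d ' ' - "$tmpdir/toy.2.Q" > "$tmpdir/toy.2.Q.named"
cat "$tmpdir/toy.2.Q.named"

first_row=$(head -n 1 "$tmpdir/toy.2.Q.named")
```

This relies on the same assumption as plot_admixture, namely that row n of the .Q file belongs to sample n of the .fam file.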

It is important not to overinterpret admixture plots (Lawson et al., 2018), mainly because different demographic histories can lead to the same result. The plots are good for detecting recent hybridization events, but fall short for inference of more complex demographic histories.

Also remember that ADMIXTURE runs started from different random seeds can converge on different solutions, because the initial parameter values are chosen at random. There are several tools for evaluating the robustness of the results, including Pong (Behr et al., 2016), Clumpak (Kopelman et al., 2015) and evaladmix (Garcia-Erill & Albrechtsen, 2020).
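
Tools such as those listed above typically take several replicate runs per K as input. The sketch below only prints the commands for such replicates rather than running them. It rests on two assumptions that should be checked against the ADMIXTURE manual for your version: that the -s option sets the random seed, and that output files are written to the current directory under names that depend only on the input prefix and K. The helper print_replicate_commands and the rep_s directories are hypothetical names introduced here.

```shell
# Hypothetical helper: print (not run) one ADMIXTURE command per replicate seed.
# Assumption: the -s option sets the random seed; check the ADMIXTURE manual.
print_replicate_commands() {
    prefix=$1
    k=$2
    shift 2
    for seed in "$@"
    do
        # Each replicate gets its own directory so that runs do not overwrite each other
        echo "mkdir -p rep_s${seed} && (cd rep_s${seed} && admixture -s ${seed} ../${prefix}.bed ${k} > ${prefix}.${k}.log)"
    done
}

# Three replicates for K=4 with seeds 1, 2 and 3
commands=$(print_replicate_commands monkeyflower_adm 4 1 2 3)
echo "$commands"
n_commands=$(echo "$commands" | wc -l | tr -d ' ')
```

Printing the commands first makes it easy to review them before piping them to a shell or a job scheduler.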

Things to try

If you have time left over, make some more plots with different values of K. Do the results make sense?

References

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), 1655–1664. https://doi.org/10.1101/gr.094052.109
Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P., & Ramachandran, S. (2016). Pong: Fast analysis and visualization of latent clusters in population genetic data. Bioinformatics, 32(18), 2817–2823. https://doi.org/10.1093/bioinformatics/btw327
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience, 4(1), s13742-015-0047-8. https://doi.org/10.1186/s13742-015-0047-8
Garcia-Erill, G., & Albrechtsen, A. (2020). Evaluation of model fit of inferred admixture proportions. Molecular Ecology Resources, 20(4), 936–949. https://doi.org/10.1111/1755-0998.13171
Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., & Mayrose, I. (2015). Clumpak: A program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources, 15(5), 1179–1191. https://doi.org/10.1111/1755-0998.12387
Lawson, D. J., van Dorp, L., & Falush, D. (2018). A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nature Communications, 9(1), 3258. https://doi.org/10.1038/s41467-018-05257-7
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1461096/

2025 NBIS | GPL-3 License