Linkage disequilibrium decay
About
In this exercise we will use plink2 to calculate linkage disequilibrium (LD) decay from a VCF. Among other things, the LD structure informs the choice of an appropriate marker density for SNP chips or window size in genome scans.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- generate a linkage disequilibrium decay plot with plink2
- determine a window size for genome scans
Choose either Modules or Virtual environment to access the relevant tools.
Modules
Execute the following command to load modules:
# NB: the following tools are not available as modules:
# - plink
module load \
PDC/24.11 R/4.4.2-cpeGNU-24.11
Virtual environment
Run the pgip
initialization script and activate the pgip
default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate
Then activate the full
environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shell
Copy the contents to a file pixi.toml in the directory recombination-linkage, cd to the directory, and activate the environment with pixi shell:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "recombination-linkage"
platforms = ["linux-64", "osx-64"]
[dependencies]
plink = ">=1.90b7.7,<2"
r = ">=4.3,<4.5"
r-tidyverse = ">=2.0.0,<3"
Make an analysis directory recombination-linkage
and cd to it:
mkdir -p recombination-linkage && cd recombination-linkage
Using rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv
to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .
Using pgip client
pgip exercises setup e-recombination-linkage
Using wget
Make an analysis directory recombination-linkage and cd to it:
mkdir -p recombination-linkage && cd recombination-linkage
Then use wget to retrieve the data to the analysis directory:
wget -r -np -nH -N --accept-regex "all.variantsites.vcf*" --cut-dirs=7 \
    https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/
Linkage disequilibrium decay [1]
Genetic markers in close proximity are often inherited together; we say that the markers are linked. Genetic linkage creates haplotypes and haplotype structures that contain information about inheritance patterns and evolutionary processes. Linkage can be thought of as a force that keeps markers together. Recombination, on the other hand, breaks down linkage and therefore acts as an opposing force that generates new haplotypes as new combinations of alleles. The farther apart two markers are, the more likely a recombination event between them is.
The distance between two markers is often measured as a genetic distance, expressed in centiMorgan (cM) and defined as the average number of crossovers (recombination events) between the markers per hundred meioses. For instance, two markers 10 cM apart experience on average 0.1 crossovers per meiosis. For short distances, this can be interpreted as the percent probability of a crossover.
The genetic distance can be related to the physical distance, expressed in base pairs, through the recombination rate, measured in cM/Mb. This metric can be interpreted as the probability of a crossover per Mb. For instance, in humans the average recombination rate is 1.2 cM/Mb, meaning there is a 1.2% probability of a crossover per Mb.
Consequently, recombination tends to be the dominating force over longer distances, whereas linkage is stronger over shorter distances. For humans, as a rule of thumb, LD is strong for SNPs ~0.01-0.1 cM apart, whereas recombination dominates linkage for distances > ~0.1 cM (Pritchard, n.d., Chapter 2.3). Exactly where the transition occurs depends on the strength of recombination. Figure 1 shows the theoretical breakdown of LD for different recombination rates. This is the type of plot that we will generate next.
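To get a feel for what such theoretical curves look like, here is a small sketch (not part of the exercise data), assuming Sved's (1971) approximation \(E[r^2] \approx 1/(1 + 4Nc)\) with made-up values for the effective population size and per-bp recombination rates (1e-8 per bp corresponds to 1 cM/Mb):
# Theoretical expectation of r^2 under Sved's (1971) approximation,
# E[r^2] ~ 1/(1 + 4*N*c), where N is the effective population size and c the
# recombination fraction between two markers. All parameter values are made up.
library(ggplot2)
N <- 1e4                              # assumed effective population size
dist_bp <- seq(0, 1e5, by = 100)      # physical distance in bp
rates <- c(1e-9, 1e-8, 1e-7)          # assumed recombination rates per bp
df <- do.call(rbind, lapply(rates, function(r) {
    data.frame(dist = dist_bp, rate = factor(r), r2 = 1 / (1 + 4 * N * r * dist_bp))
}))
ggplot(df, aes(x = dist, y = r2, colour = rate)) +
    geom_line() +
    labs(x = "Distance (bp)", y = expression(E(r^2)), colour = "rate / bp")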
Data variables and R functions
We start by setting up variables and functions to save typing further down. First set a variable to store the VCF
we’re analysing:
VCF=all.variantsites.snp.biallelic.vcf.gz
Use bcftools query -l
to list samples, bcftools stats
to summarize variants, and bcftools view -h
to view the header.
bcftools query -l ${VCF} | wc -l
bcftools stats ${VCF} | grep "number of SNPs:"
bcftools view -h ${VCF} | grep contig
37
SN 0 number of SNPs: 136135
##contig=<ID=LG4,length=3000000>
In R, we load libraries for plotting and data manipulation:
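The exact imports may differ, but the tidyverse meta-package listed in the pixi environment covers both needs and provides dplyr and ggplot2, which are used below:
# tidyverse bundles ggplot2 (plotting) and dplyr (data manipulation)
library(tidyverse)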
Calculate LD with plink2
Finally we can start crunching the numbers! We will use plink2 to generate the desired raw output. There are quite a few options involved and we detail them below [2].
For larger datasets, you may have to lower the value of the --thin option to reduce output size.
# Set thin to a lower value to start with
plink2 --vcf $VCF --double-id --allow-extra-chr \
--set-missing-var-ids @:# \
--thin 0.2 --r2-unphased zs --ld-window 999999 --ld-window-kb 100 \
--mac 3 --geno 0.1 --mind 0.5 \
--ld-window-r2 0 \
--make-bed --out monkeyflower_ld > /dev/null 2>&1
Let’s look at these options.
- --vcf – the name of the input VCF
- --double-id – set both individual ID and family ID to the input (sample) ID, as some analyses need pedigree information (not relevant here)
- --allow-extra-chr – permit chromosome names other than those expected for human (1-22, X)
- --set-missing-var-ids – assign chromosome- and position-based IDs to unnamed variants according to the format CHROM:BP. Necessary since, in contrast to model organisms, our variants typically lack annotated names
- --thin – randomly thin variants, retaining each with probability p. You may need to tweak this value to get the proper output (and reduce output size!)
- --r2-unphased – emulate the plink --r2 option; calculate the squared correlation coefficient \(r^2\) between markers for unphased data. This is the output that is used for the plots below. The zs argument tells plink2 to compress the output in Zstandard (zst) format
- --ld-window – limit the number of variants between which r2 is calculated (NB: differs from plink!)
- --ld-window-kb – maximum distance in kb between markers for which r2 is calculated
- --mac – filter variants on minor allele count. Try out different settings for this parameter and note the difference in output
- --geno – exclude variants with missing call frequencies greater than the threshold
- --mind – exclude samples with missing call frequencies greater than the threshold
- --ld-window-r2 – by default this sets a filter on the final output, but we set it to 0 to keep everything
- --make-bed – create the (old plink-style) binary output dataset
- --out – prefix for output files
- > /dev/null 2>&1 – redirects output and errors to the electronic wastebin. If the command fails, remove this statement to see what is going on
Phew! That’s a lot of options! The results will be written to a file with file extension .vcor.zst
which we will analyse in the next step.
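Before switching to R, you can peek at the compressed output directly from the shell (this assumes the zstd command-line tools are available in your environment); among the columns you should see POS_A, POS_B and UNPHASED_R2, which are used below:
# Inspect the header and first data rows of the compressed LD table
zstdcat monkeyflower_ld.vcor.zst | head -3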
Calculate average LD
In the previous step we calculated LD between individual marker pairs. However, to make a nice-looking LD decay plot, we want average LD binned by distance. To facilitate this calculation, below you'll find an R function that does just that!
#'
#' Calculate average linkage disequilibrium between pairs of SNPs binned by distance
#'
calc_ld <- function(filename, max_bin = 1e+05) {
    ## Decompress the plink2 output and strip the leading "#" from the header line
    ld <- read.table(pipe(paste("zstdcat", filename, "| sed -e \"s/#//\"")), header = TRUE)
    ## Pairwise distance between markers
    ld$dist <- ld$POS_B - ld$POS_A
    ## Assign each pair to a 100-bp distance bin
    ld$distc <- cut(ld$dist, breaks = seq(from = 0, to = max_bin + 1, by = 100))
    ## Summarise r^2 per bin
    ldr2 <- ld %>%
        group_by(distc) %>%
        summarise(mean = mean(UNPHASED_R2), median = median(UNPHASED_R2), count = length(UNPHASED_R2))
    ## Convert each bin label, e.g. "(0,100]", to its numeric right endpoint
    ldr2$dist <- as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", as.character(ldr2$distc)))
    return(ldr2)
}
Copy-paste the function into your R session and run it on the output file:
ld2r <- calc_ld("monkeyflower_ld.vcor.zst")
head(ld2r, 3)
# A tibble: 3 × 5
distc mean median count dist
<fct> <dbl> <dbl> <int> <dbl>
1 (0,100] 0.239 0.0349 2350 100
2 (100,200] 0.203 0.0257 1528 200
3 (200,300] 0.189 0.0243 1387 300
We use head to view the contents of the output tibble. Each row corresponds to an interval (the distc column), along with the mean and median \(r^2\) values, the number of data points in the interval, and a column containing the interval endpoint. We plot the mean against the endpoint below:
ggplot(ld2r, aes(x = dist, y = mean)) + geom_point() + xlim(0, 1e+05)
In this particular figure, we see that LD is high for short distances but decays towards the background value at around 20-25 kb. That gives us a rule of thumb for how large windows to choose in, for instance, genome scans. That's it!
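As an optional extra, here is a small variant of the plot (using the column names from the tibble above) that adds axis labels and overlays the per-bin median:
# Same data, with axis labels and the median r^2 per bin drawn as a line
ggplot(ld2r, aes(x = dist)) +
    geom_point(aes(y = mean), colour = "steelblue") +
    geom_line(aes(y = median), colour = "darkred") +
    xlim(0, 1e+05) +
    labs(x = "Distance (bp)", y = expression(r^2))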
Additional things to try
Calculating LD is very sensitive to small sample sizes [3]. Try different parameter settings (in particular --thin and --mac) and remake the plots to see what happens. The take-home message is that some fine-tuning of parameters may be needed to separate the signal from the noise.
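For instance, a re-run with a stricter minor allele count filter could look like the sketch below; the value 6 and the output prefix monkeyflower_ld_mac6 are made up, and everything else mirrors the command above:
# Hypothetical re-run with --mac 6; a new --out prefix avoids overwriting the first run
plink2 --vcf $VCF --double-id --allow-extra-chr \
    --set-missing-var-ids @:# \
    --thin 0.2 --r2-unphased zs --ld-window 999999 --ld-window-kb 100 \
    --mac 6 --geno 0.1 --mind 0.5 \
    --ld-window-r2 0 \
    --make-bed --out monkeyflower_ld_mac6 > /dev/null 2>&1
Then rerun calc_ld("monkeyflower_ld_mac6.vcor.zst") in R and compare the two decay curves.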
References
Footnotes
[1] This exercise is inspired by and based on https://speciationgenomics.github.io/ld_decay/
[2] You can always type plink2 --help followed by the option you want more information about (e.g., plink2 --help --set-missing-var-ids).
[3] In fact, plink2 warns that the sample size is too small, which you can see if you remove the redirect statement (> /dev/null 2>&1) above.