Linkage disequilibrium decay
About
In this exercise we will use plink2 to calculate linkage disequilibrium (LD) decay from a VCF. Among other things, the LD structure informs the choice of an appropriate marker density for SNP chips or window size in genome scans.
The commands of this document have been run on a subset (a subregion) of the data. Therefore, although you will use the same commands, your results will differ from those presented here.
- generate a linkage disequilibrium decay plot with plink2
- determine a window size for genome scans
Choose either Modules or Virtual environment to access the relevant tools.
Modules
Execute the following command to load modules:
# NB: the following tools are not available as modules:
# - plink
module load \
PDC/24.11 R/4.4.2-cpeGNU-24.11
Virtual environment
Run the pgip
initialization script and activate the pgip
default environment:
source /cfs/klemming/projects/supr/pgip_2025/init.sh
# Now obsolete!
# pgip_activate
Then activate the full
environment:
# pgip_shell calls pixi shell -e full --as-is
pgip_shell
Copy the contents to a file pixi.toml in the directory recombination-linkage, cd to the directory, and activate the environment with pixi shell:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "recombination-linkage"
platforms = ["linux-64", "osx-64"]
[dependencies]
plink = ">=1.90b7.7,<2"
r = ">=4.3,<4.5"
r-tidyverse = ">=2.0.0,<3"
Make an analysis directory recombination-linkage
and cd to it:
mkdir -p recombination-linkage && cd recombination-linkage
Using rsync
Use rsync to sync data to your analysis directory (hint: first use options -anv
to run the command without actually copying data):
# Run rsync -anv first time
rsync -av /cfs/klemming/projects/supr/pgip_2025/data/monkeyflower/selection/large/vcftools-filter-bqsr/ .
Using pgip client
pgip exercises setup e-recombination-linkage
Using wget
Make an analysis directory recombination-linkage and cd to it:
mkdir -p recombination-linkage && cd recombination-linkage
Then use wget to retrieve the data to the analysis directory:
wget -r -np -nH -N --accept-regex "all.variantsites.vcf*" --cut-dirs=7 \
    https://export.uppmax.uu.se/uppstore2017171/pgip/data/monkeyflower/selection/large/vcftools-filter-bqsr/
Linkage disequilibrium decay [1]
Genetic markers in close proximity are often inherited together; we say that the markers are linked. Genetic linkage creates haplotypes and haplotype structures that contain information about inheritance patterns and evolutionary processes. Linkage can be thought of as a force that keeps markers together. Recombination, on the other hand, breaks down linkage and therefore acts as an opposing force that generates new haplotypes as new combinations of alleles. The farther apart two markers are, the more likely a recombination event between them is.
The distance between two markers is often measured as a genetic distance, expressed in centiMorgan (cM) and defined as the average number of crossovers (recombination events) between the markers per hundred meioses. For instance, two markers 10 cM apart experience on average 0.1 crossovers per meiosis. For short distances, this can be interpreted as the percent probability of a crossover.
The genetic distance can be related to the physical distance, expressed in base pairs, through the recombination rate, measured in cM/Mb. This metric can be interpreted as the probability of a crossover per Mb. For instance, in humans the average recombination rate is 1.2 cM/Mb, meaning there is a 1.2% probability of a crossover per Mb.
Consequently, recombination tends to be the dominating force over longer distances, whereas linkage is stronger over shorter distances. For humans, as a rule of thumb, LD is strong for SNPs ~0.01-0.1 cM apart, whereas recombination dominates linkage for distances > ~0.1 cM (Pritchard, n.d., Chapter 2.3). Exactly where the transition occurs depends on the strength of recombination. Figure 1 shows the theoretical breakdown of LD for different recombination rates. This is the type of plot that we will generate next.
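To get a feel for what such theoretical curves look like, here is a small sketch (not part of the exercise data), assuming Sved's (1971) approximation \(E[r^2] \approx 1/(1 + 4Nc)\) with made-up values for the effective population size and per-bp recombination rates (1e-8 per bp corresponds to 1 cM/Mb):
# Theoretical expectation of r^2 under Sved's (1971) approximation,
# E[r^2] ~ 1/(1 + 4*N*c), where N is the effective population size and c the
# recombination fraction between two markers. All parameter values are made up.
library(ggplot2)
N <- 1e4                              # assumed effective population size
dist_bp <- seq(0, 1e5, by = 100)      # physical distance in bp
rates <- c(1e-9, 1e-8, 1e-7)          # assumed recombination rates per bp
df <- do.call(rbind, lapply(rates, function(r) {
    data.frame(dist = dist_bp, rate = factor(r), r2 = 1 / (1 + 4 * N * r * dist_bp))
}))
ggplot(df, aes(x = dist, y = r2, colour = rate)) +
    geom_line() +
    labs(x = "Distance (bp)", y = expression(E(r^2)), colour = "rate / bp")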
Data variables and R functions
We start by setting up variables and functions to save typing further down. First set a variable to store the VCF
we’re analysing:
VCF=all.variantsites.snp.biallelic.vcf.gz
Use bcftools query -l
to list samples, bcftools stats
to summarize variants, and bcftools view -h
to view the header.
bcftools query -l ${VCF} | wc -l
bcftools stats ${VCF} | grep "number of SNPs:"
bcftools view -h ${VCF} | grep contig
37
SN 0 number of SNPs: 136135
##contig=<ID=LG4,length=3000000>
In R, we load libraries for plotting and data manipulation:
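The exact imports may differ, but the tidyverse meta-package listed in the pixi environment covers both needs and provides dplyr and ggplot2, which are used below:
# tidyverse bundles ggplot2 (plotting) and dplyr (data manipulation)
library(tidyverse)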
Calculate LD with plink2
Finally we can start crunching the numbers! We will use plink2 to generate the desired raw output. There are quite a few options involved and we detail them below [2].
For larger datasets, you may have to lower the value of the --thin option to reduce output size.
# Set thin to a lower value to start with
plink2 --vcf $VCF --double-id --allow-extra-chr \
--set-missing-var-ids @:# \
--thin 0.2 --r2-unphased zs --ld-window 999999 --ld-window-kb 100 \
--mac 3 --geno 0.1 --mind 0.5 \
--ld-window-r2 0 \
--make-bed --out monkeyflower_ld > /dev/null 2>&1
Let’s look at these options.
- --vcf – the name of the input VCF
- --double-id – set both individual ID and family ID to the input (sample) ID, as some analyses need pedigree information (not relevant here)
- --allow-extra-chr – permit chromosome names other than those expected for human (1-22, X)
- --set-missing-var-ids – assign chromosome- and position-based IDs to unnamed variants according to the format CHROM:BP. Necessary since, in contrast to model organisms, our variants typically lack annotated names
- --thin – randomly thin variants, retaining each with probability p. You may need to tweak this value to get the proper output (and reduce output size!)
- --r2-unphased – emulate the plink --r2 option; calculate the squared correlation coefficient \(r^2\) between markers for unphased data. This is the output that is used for the plots below. The zs argument tells plink2 to compress the output in Zstandard (zst) format
- --ld-window – limit the number of variants between which r2 is calculated (NB: differs from plink!)
- --ld-window-kb – maximum distance in kb between markers for which r2 is calculated
- --mac – filter variants on minor allele count. Try out different settings for this parameter and note the difference in output
- --geno – exclude variants with missing call frequencies greater than the threshold
- --mind – exclude samples with missing call frequencies greater than the threshold
- --ld-window-r2 – by default this sets a filter on the final output, but we set it to 0 to keep everything
- --make-bed – create the (old plink-style) binary output dataset
- --out – prefix for output files
- > /dev/null 2>&1 – redirects output and errors to the electronic wastebin. If the command fails, remove this statement to see what is going on
Phew! That’s a lot of options! The results will be written to a file with file extension .vcor.zst
which we will analyse in the next step.
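Before switching to R, you can peek at the compressed output directly from the shell (this assumes the zstd command-line tools are available in your environment); among the columns you should see POS_A, POS_B and UNPHASED_R2, which are used below:
# Inspect the header and first data rows of the compressed LD table
zstdcat monkeyflower_ld.vcor.zst | head -3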
Calculate average LD
In the previous step we calculated LD between individual marker pairs. However, to make a nice-looking LD decay plot, we want average LD binned by distance. To facilitate this calculation, below you'll find an R function that does just that!
#'
#' Calculate average linkage disequilibrium between pairs of SNPs binned by distance
#'
calc_ld <- function(filename, max_bin = 1e+05) {
    ## Decompress the plink2 output and strip the leading "#" from the header line
    ld <- read.table(pipe(paste("zstdcat", filename, "| sed -e \"s/#//\"")), header = TRUE)
    ## Pairwise distance between markers
    ld$dist <- ld$POS_B - ld$POS_A
    ## Assign each pair to a 100-bp distance bin
    ld$distc <- cut(ld$dist, breaks = seq(from = 0, to = max_bin + 1, by = 100))
    ## Summarise r^2 per bin
    ldr2 <- ld %>%
        group_by(distc) %>%
        summarise(mean = mean(UNPHASED_R2), median = median(UNPHASED_R2), count = length(UNPHASED_R2))
    ## Convert each bin label, e.g. "(0,100]", to its numeric right endpoint
    ldr2$dist <- as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", as.character(ldr2$distc)))
    return(ldr2)
}
Copy-paste the function into your R session and run it on the output file:
ld2r <- calc_ld("monkeyflower_ld.vcor.zst")
head(ld2r, 3)
# A tibble: 3 × 5
distc mean median count dist
<fct> <dbl> <dbl> <int> <dbl>
1 (0,100] 0.239 0.0349 2350 100
2 (100,200] 0.203 0.0257 1528 200
3 (200,300] 0.189 0.0243 1387 300
We use head to view the contents of the output tibble. Each row corresponds to an interval (the distc column), along with the mean and median \(r^2\) values, the number of data points in the interval, and a column containing the interval endpoint. We plot the mean against the endpoint below:
ggplot(ld2r, aes(x = dist, y = mean)) + geom_point() + xlim(0, 1e+05)
In this particular figure, we see that LD is high for short distances but decays towards the background value at around 20-25 kb. That gives us a rule of thumb for how large windows to choose in, for instance, genome scans. That's it!
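As an optional extra, here is a small variant of the plot (using the column names from the tibble above) that adds axis labels and overlays the per-bin median:
# Same data, with axis labels and the median r^2 per bin drawn as a line
ggplot(ld2r, aes(x = dist)) +
    geom_point(aes(y = mean), colour = "steelblue") +
    geom_line(aes(y = median), colour = "darkred") +
    xlim(0, 1e+05) +
    labs(x = "Distance (bp)", y = expression(r^2))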
Additional things to try
Calculating LD is very sensitive to small sample sizes [3]. Try different parameter settings (in particular --thin and --mac) and remake the plots to see what happens. The take-home message is that some fine-tuning of parameters may be needed to separate the signal from the noise.
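For instance, a re-run with a stricter minor allele count filter could look like the sketch below; the value 6 and the output prefix monkeyflower_ld_mac6 are made up, and everything else mirrors the command above:
# Hypothetical re-run with --mac 6; a new --out prefix avoids overwriting the first run
plink2 --vcf $VCF --double-id --allow-extra-chr \
    --set-missing-var-ids @:# \
    --thin 0.2 --r2-unphased zs --ld-window 999999 --ld-window-kb 100 \
    --mac 6 --geno 0.1 --mind 0.5 \
    --ld-window-r2 0 \
    --make-bed --out monkeyflower_ld_mac6 > /dev/null 2>&1
Then rerun calc_ld("monkeyflower_ld_mac6.vcor.zst") in R and compare the two decay curves.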
References
Footnotes
[1] This exercise is inspired by and based on https://speciationgenomics.github.io/ld_decay/
[2] You can always type plink2 --help followed by the option you want more information about (e.g., plink2 --help --set-missing-var-ids).
[3] In fact, plink2 warns that the sample size is too small, which you can see if you remove the redirect statement (> /dev/null 2>&1) above.