Demographic inference

Author

André Soares

Published

15-Nov-2023

If you haven’t already done so, please read Compute environment for information on how to prepare your working directory.

In this exercise you will:

  • Perform basic ANGSD commands and analyses
  • Look at and interpret SFS plots

Move to your course working directory /proj/naiss2023-22-1084/users/USERNAME, create an exercise directory called demography and cd to it:

cd /proj/naiss2023-22-1084/users/USERNAME
mkdir -p demography
cd demography

You will run most analyses on Rackham, so just go to your directory to start. Afterwards you will need to download the results files to your computer; you can use wget as you have done multiple times by now.

You will run R, either via RStudio or on the command line; the choice is yours.

I recommend downloading the Rmd file that has all the instructions for this exercise and all the R code necessary to generate the plots we want to see today.

Just download it from https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd with wget¹:

wget --user pgip --password PASSWORD https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd

Everything you need is on Uppmax. Execute the following command to load modules:

module load bioinfo-tools samtools PCAngsd ANGSD bcftools plink htslib

Practical information

This exercise is divided into two parts. The first part will happen on Rackham; all the files you need are located in /proj/naiss2023-22-1084/private/demography. The second part will happen in RStudio on your local machine.

We will be using ANGSD (https://github.com/ANGSD/angsd).

What has been prepared for you:

  • Dataset: data from 3 human populations (TSI, PEL and LWK).
  • Files with the extension *.filelist. These contain lists of all the BAM files needed for this exercise (you can list them as shown below).
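
To check exactly what has been prepared, list the dataset directory (this is the same path used throughout this exercise):

ls /proj/naiss2023-22-1084/private/demography/dataset/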

You have two options now:

  1. Create a bash script and submit it to SLURM (see the sketch after this list);
  2. Start an interactive session on Rackham and run each step manually. I recommend this option for today.
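
If you go for option 1, a minimal batch script could look like the sketch below. The account, partition and time limit mirror the interactive command further down; the job step is just an example, so swap in whichever commands from this exercise you want to run:

#!/bin/bash -l
#SBATCH -A naiss2023-22-1084
#SBATCH -p core
#SBATCH -n 4
#SBATCH -t 08:00:00
#SBATCH -J demography

# load everything we need on Rackham
module load bioinfo-tools samtools PCAngsd ANGSD bcftools plink htslib

cd /proj/naiss2023-22-1084/users/USERNAME/demography

# example job step, taken from the first part of this exercise
angsd -b ALL.bamlist -GL 2 -doMajorMinor 1 -doMaf 2 -minMapQ 30 -minQ 20 -nThreads 4

Save it as, say, demography.sh, submit it with sbatch demography.sh, and check on it with squeue -u USERNAME.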

How to start interactive mode on Rackham:

interactive -A naiss2023-22-1084 -n 4 \
   --partition core --time 08:00:00 \
   --reservation=naiss2023-22-1084_4

You need to load all modules on Rackham so you don’t have to worry about installing each program that we will need in this exercise.

You might need to re-index the reference genome once you copy things over, since ANGSD might give you a “the index file is older than the reference file”-type error. Just run the command below if that happens:

samtools faidx ref.fa.gz

First steps

Calculate allele frequencies

You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist).
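
If you have not yet copied the input files into your exercise directory, something along these lines works; the bamlist and ancestral FASTA paths are the ones given in this exercise, and we assume ref.fa.gz sits in the same dataset directory (adjust if not):

cp /proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist .
cp /proj/naiss2023-22-1084/private/demography/dataset/anc.fa.gz .
cp /proj/naiss2023-22-1084/private/demography/dataset/ref.fa.gz .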

Just run:

angsd -b ALL.bamlist -GL 2 -doMajorMinor 1 -doMaf 2 -minMapQ 30 -minQ 20 -nThreads 4
-b = the file listing your BAM files
-GL 2 = This will calculate the genotype likelihood for you. "2" means it
    will use the same method as GATK. If you use -GL 1 it will use the
    same method as Samtools.
-doMajorMinor 1 = Many methods assume that each site is diallelic, so
    this option will use the genotype likelihood to infer the major and
    minor alleles.
-doMaf 2 = It will use the inferred major alleles, but it does not assume
    known minor alleles.
-minMapQ 30 = mapping quality filter
-minQ 20 = base quality filter
-nThreads 4 = will use 4 threads (cores) to perform all calculations.

The output will be angsdput.mafs.gz.
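
The file is gzipped, so take a look inside with zcat, for example:

zcat angsdput.mafs.gz | head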

The file is formatted as follows:

chromo  position        major   minor   unknownEM       nInd
11      61005493        C       A       0.000006        1
11      61005494        A       C       0.000006        1
11      61005495        C       A       0.000006        1
11      61005496        G       A       0.000006        1
11      61005497        G       A       0.000006        1
11      61005498        T       A       0.000006        1
11      61005499        A       C       0.000006        1
11      61005500        C       A       0.000006        1
11      61005501        A       C       0.000006        1
chromo = chromosome name
position = position on the chromosome
major = major allele
minor = minor allele
knownEM = frequency estimated when using -doMaf 1
unknownEM = frequency estimated when using -doMaf 2
pK-EM = p-value for the frequency of the minor allele
nInd = number of individuals with data

Since we ran with -doMaf 2, your file contains the unknownEM column.

Generate the genotype likelihood files

You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist).

Just run:

angsd -GL 2 -doGlf 2 -b ALL.bamlist -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -nThreads 4
-doGlf 2 = will create the beagle likelihood file
-SNP_pval 1e-6 = it will use 10^-6 as the p-value cutoff for the likelihood-ratio test that decides which sites are polymorphic.

The outputs will be angsdput.mafs.gz and angsdput.beagle.gz.
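
It can be useful to check how many sites passed the SNP p-value filter; every data row in the beagle file after the header line is one polymorphic site:

zcat angsdput.beagle.gz | tail -n +2 | wc -l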

You already know what’s inside the mafs.gz file, but let’s take a look inside the Beagle format:

marker  allele1 allele2 Ind0    Ind0    Ind0    Ind1    Ind1    Ind1    Ind2    Ind2    Ind2    Ind3    Ind3    Ind3    Ind4    Ind4    Ind4    Ind5    Ind5    Ind5    Ind6    Ind6    Ind6    Ind7    Ind7    Ind7    Ind8    Ind8    Ind8    Ind9    Ind9    Ind9
1_14000202      2       0       0.000532        0.999468        0.000000        0.333333        0.333333        0.333333        0.333333        0.333333        0.333333        0.666490        0.333333        0.000177        0.333333        0.333333        0.333333        0.002660        0.997340        0.000001        0.799744        0.200255        0.000000        0.000842        0.999157        0.000001        0.888820        0.111180        0.000000        0.799988        0.200012        0.000000
1_14000873      2       0       0.000000        0.030324        0.969676        0.663107        0.333333        0.003560        0.888698        0.111302        0.000000        0.992114        0.007886        0.000000        0.999013        0.000987        0.000000        0.000009        0.991658        0.008334        0.888726        0.111274        0.000000        0.799696        0.200303        0.000001        0.000000        0.030330        0.969670        0.666490        0.333333        0.000177
1_14001018      3       1       0.000000        0.015429        0.984571        0.799823        0.200177        0.000000        0.050636        0.949364        0.000000        0.941120        0.058880        0.000000        0.888842        0.111158        0.000000        0.799970        0.200030        0.000000        0.333333        0.333333        0.333333        0.992231        0.007769        0.000000        0.000140        0.333333        0.666526        0.666622        0.333333        0.000044

You will see that each individual has three columns in the file.

marker = chromosome and position (chr_position)
allele1 = major allele as: 0=A, 1=C, 2=G, 3=T
allele2 = minor allele as: 0=A, 1=C, 2=G, 3=T
Ind0 = 1st column is the genotype likelihood for major/major genotype
Ind0 = 2nd column is the genotype likelihood for major/minor genotype
Ind0 = 3rd column is the genotype likelihood for minor/minor genotype
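
To see this three-columns-per-individual layout for a single sample, pull out just the marker, the two alleles and Ind0's three likelihood columns (the file is tab-separated):

zcat angsdput.beagle.gz | cut -f 1-6 | head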

SFS estimation

Let’s look into the Site Frequency Spectrum (SFS) for these populations. As you saw in the theoretical part earlier today, the SFS shows you the proportions of sites at different allele frequencies. To calculate the unfolded SFS we need an outgroup species to be used as the ancestral state. One important caveat to remember is that during our exercises we are using only a portion of one chromosome, but for real-world data it is important to use the whole genome so you have enough information.

We usually run SFS per population, but let’s try running it for all individuals in our dataset first, just out of curiosity.

You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist). You will also need /proj/naiss2023-22-1084/private/demography/dataset/anc.fa.gz, which has the ancestral state for your individuals. Let’s calculate the unfolded SFS.

First we need the allele frequencies.

Just run:

angsd -bam ALL.bamlist -anc anc.fa.gz -doSaf 1 -GL 2 -nThreads 4
-doSaf 1 = it will calculate allele frequency likelihood based on
    individual genotype likelihoods assuming Hardy-Weinberg.

The output will be angsdput.saf.idx, angsdput.saf.pos.gz, angsdput.saf.gz. If you want to know more about these files, read https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf.

Now let’s repeat the site allele frequency calculation, this time adding quality filters and naming the output files:

Just run:

angsd -bam ALL.bamlist -doSaf 1 -out output_gl -anc anc.fa.gz -GL 2 -P 4 -minMapQ 1 -minQ 20
-doSaf 1 = it will calculate allele frequency likelihoods based on
    individual genotype likelihoods assuming Hardy-Weinberg.
-out = choose a name for your output files
-anc anc.fa.gz = we have to supply a file with the ancestral state
    if we want the unfolded SFS
-P = just another way to say "nThreads"

Now that all calculations have been done, let’s output the SFS based on them. Just run:

realSFS output_gl.saf.idx -maxIter 100 -P 4 >results.sfs
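
realSFS writes the estimated SFS to standard output, which we redirected into results.sfs. The file holds a single line of 2N+1 values for N diploid individuals: the expected number of sites carrying 0, 1, …, 2N derived alleles. You can inspect it with:

cat results.sfs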

Once you have done that, repeat it for each of our populations: PEL, TSI and LWK. Do not forget to add the option -out name to name your output files so you can identify each population’s results (see the sketch below).
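
A minimal sketch of the per-population runs, assuming the population BAM lists are named PEL.filelist, TSI.filelist and LWK.filelist (check the actual names with ls *.filelist), and using the lowercase output names that the R code below expects:

for POP in PEL TSI LWK; do
    # lowercase name for the output files (pel_results.sfs, etc.)
    pop=$(echo "$POP" | tr 'A-Z' 'a-z')
    # site allele frequency likelihoods for this population
    angsd -bam ${POP}.filelist -doSaf 1 -out ${pop}_results \
        -anc anc.fa.gz -GL 2 -P 4 -minMapQ 1 -minQ 20
    # estimate the SFS from the saf index
    realSFS ${pop}_results.saf.idx -maxIter 100 -P 4 > ${pop}_results.sfs
done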

Switching to your local computer

Alright, now you need to have this .Rmd file and all the output files in the same directory. You can copy them to your local machine, or access them remotely; it’s up to you. If you followed the instructions at the top of the page, you already have the .Rmd file on your local machine.
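
One way to fetch the SFS files, run from your local machine and assuming you worked in the demography directory created above (replace USERNAME with your user name):

scp "USERNAME@rackham.uppmax.uu.se:/proj/naiss2023-22-1084/users/USERNAME/demography/*_results.sfs" .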

Plotting SFS

You can run the code below in R to see what the SFS plots look like. Do they look like the plot we saw during the theoretical session?

Let’s look population by population.

First, PEL:

# function to normalize
norm <- function(x) x/sum(x)
# read data and normalize to proportions (realSFS outputs expected counts)
sfs <- norm(scan("pel_results.sfs"))
# the proportion of variable sites, as a percentage
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# keep the variable categories of the sfs and renormalize
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
    main = "SFS plot", col = "blue")

then LWK:

# function to normalize
norm <- function(x) x/sum(x)
# read data and normalize to proportions (realSFS outputs expected counts)
sfs <- norm(scan("lwk_results.sfs"))
# the proportion of variable sites, as a percentage
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# keep the variable categories of the sfs and renormalize
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
    main = "SFS plot", col = "blue")

And TSI:

# function to normalize
norm <- function(x) x/sum(x)
# read data and normalize to proportions (realSFS outputs expected counts)
sfs <- norm(scan("tsi_results.sfs"))
# the proportion of variable sites, as a percentage
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# keep the variable categories of the sfs and renormalize
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
    main = "SFS plot", col = "blue")

They are all a bit different; why do you think that is?

Let’s see them all together:

# function to normalize
nnorm <- function(x) x/sum(x)
# expected number of sites with 1:20 derived alleles
res <- rbind(PEL = scan("pel_results.sfs")[-1], LWK = scan("lwk_results.sfs")[-1],
    TSI = scan("tsi_results.sfs")[-1])
colnames(res) <- 1:20

# density instead of expected counts
res <- t(apply(res, 1, nnorm))

# plot the polymorphic sites.
resPoly <- t(apply(res[, -20], 1, nnorm))
barplot(resPoly, beside = T, legend = c("PEL", "LWK", "TSI"), names = 1:19, main = "Derived allele frequencies")

Footnotes

  1. The password is provided by the course instructor.