module load bioinfo-tools samtools PCAngsd ANGSD bcftools plink htslib
Demographic inference
If you haven’t already done so, please read Compute environment for information on how to prepare your working directory.
- Perform basic ANGSD commands and analyses
- Look and interpret SFS plots
Move to your course working directory /proj/naiss2023-22-1084/users/USERNAME
, create an exercise directory called demography
and cd
to it:
cd /proj/naiss2023-22-1084/users/USERNAME
mkdir -p demography
cd demography
You will run most analyses on Rackham, so just go to your directory to start. You will need to download the results files to you computer afterwards, you can use wget
like you’d done multiple times by now.
You will run R, either via R Studio
or on the command line, you can choose.
I recommend downloading the Rmd
file that has all the instructions for this exercise and all the R code necessary to generate the plots we want to see today.
Just download it from https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd
with wget
1:
wget --user pgip --password PASSWORD https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd
Everything you need is on Uppmax. Execute the following command to load modules:
Practical information
This exercise is divided into two parts. The first part will happen on Rackham. All the files you need are located here: /proj/naiss2023-22-1084/private/demography
. The second part will happen inside R Studio locally.
We will be using ANGSD (https://github.com/ANGSD/angsd).
What has been prepared for you:
- Dataset. Data from 3 human populations (TSI, PEL and LWK).
- You are also being provided with files with the extension
*.filelist
. These files contain a list with all the BAM files needed for this exercise.
You have two options now:
- Create a bash script and submit to SLURM;
- Start an Interactive session on Rackham and run manually each step. I recommend this option for today.
How to start interactive mode on Rackham:
interactive -A {{< var uppmaxproject >}} -n 4 \
--partition core --time 08:00:00 \
--reservation={{< var uppmaxproject >}}_4
You need to load all modules on Rackham so you don’t have to worry about installing each program that we will need in this exercise.
You might need to re-index the reference genome once you copy things over, since ANGSD might give you a “the index file is older than the reference file”-type error. Just run the command below if that happens:
samtools faidx ref.fa.gz
First steps
Calculate allele frequencies
You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist
).
Just run:
angsd -b ALL.bamlist -GL 2 -doMajorMinor 1 -doMaf 2 -minMapQ 30 -minQ 20 -nThreads 4
-b = your file with your BAM files list
-GL 2 = This will calculate the genotype likelihood for you. "2" means it
will use the same method as GATK. If you use -GL 1 it will use the
same method as Samtools.
-doMajorMinor 1 = Many methods assume that each site is diallelic, so
this option will use the genotype likelihood to infer the major and
minor alleles.
-doMaf 2 = It will use the inferred major alleles, but it does not assume
known minor alleles.
-minMapQ 30 = mapping quality filter
-minQ 20 = base quality filter
-nThreads 4 = will use 4 threads (cores) to perform all calculations.
The output will be angsdput.mafs.gz
.
Take a look inside the results file. This file is formatted such as:
chromo position major minor unknownEM nInd
11 61005493 C A 0.000006 1
11 61005494 A C 0.000006 1
11 61005495 C A 0.000006 1
11 61005496 G A 0.000006 1
11 61005497 G A 0.000006 1
11 61005498 T A 0.000006 1
11 61005499 A C 0.000006 1
11 61005500 C A 0.000006 1
11 61005501 A C 0.000006 1
chromo = chromosome name
position = position…
major = major allele
minor = minor allele
knownEM = frequency using -doMaf1
unknownEM = frequency using -doMaf2
pK-EM = p-value for the frequency of the minor allele
nInd = Number of individuals with data
Generate the genotype likelihood files
You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist
).
Just run:
angsd -GL 2 -doGlf 2 -b ALL.bamlist -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -nThreads 4
-doGlf 2 = will create the beagle likelihood file
-SNP_pval = it will use 10^-6 as a p-value for the LRT to calculate which sites are polymorphic.
The output will be angsdput.mafs.gz
and angsdput.beagle.gz
You already know what’s inside the mafs.gz file, but let’s take a look inside the Beagle format:
marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 Ind2 Ind2 Ind2 Ind3 Ind3 Ind3 Ind4 Ind4 Ind4 Ind5 Ind5 Ind5 Ind6 Ind6 Ind6 Ind7 Ind7 Ind7 Ind8 Ind8 Ind8 Ind9 Ind9 Ind9
1_14000202 2 0 0.000532 0.999468 0.000000 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666490 0.333333 0.000177 0.333333 0.333333 0.333333 0.002660 0.997340 0.000001 0.799744 0.200255 0.000000 0.000842 0.999157 0.000001 0.888820 0.111180 0.000000 0.799988 0.200012 0.000000
1_14000873 2 0 0.000000 0.030324 0.969676 0.663107 0.333333 0.003560 0.888698 0.111302 0.000000 0.992114 0.007886 0.000000 0.999013 0.000987 0.000000 0.000009 0.991658 0.008334 0.888726 0.111274 0.000000 0.799696 0.200303 0.000001 0.000000 0.030330 0.969670 0.666490 0.333333 0.000177
1_14001018 3 1 0.000000 0.015429 0.984571 0.799823 0.200177 0.000000 0.050636 0.949364 0.000000 0.941120 0.058880 0.000000 0.888842 0.111158 0.000000 0.799970 0.200030 0.000000 0.333333 0.333333 0.333333 0.992231 0.007769 0.000000 0.000140 0.333333 0.666526 0.666622 0.333333 0.000044
You will see that each individual has three columns in the file.
marker = chromosome and position (chr_position)
allele1 = major allele as: 0=A, 1=C, 2=G, 3=T
allele2 = minor allele as: 0=A, 1=C, 2=G, 3=T
Ind0 = 1st column is the genotype likelihood for major/major genotype
Ind0 = 2nd column is the genotype likelihood for major/minor genotype
Ind0 = 3rd column is the genotype likelihood for minor/minor genotype
SFS estimation
Let’s look into the Site Frequency Spectrum (SFS) for these populations. As you saw in the theoretical part earlier today, the SFS will show you the proportions of sites at different allele frequencies. And to calculate the unfolded SFS we will need an outgroup species to be used as the ancestral state. One important caveat to remember is that during out exercises we are using only a portion of one chromosome, but for real world data usage it is important to use the whole genome so you have enough information.
We usually run SFS per population, but let’s try running it for all individuals in our dataset first, just out of curiosity.
You will need your file that has a list with all BAM files from all individuals from all populations (naiss2023-22-1084/demography/dataset/ALL.bamlist
). You will also need /proj/naiss2023-22-1084/private/demography/dataset/anc.fa.gz
, which has the ancestral state for your individuals. Let’s calculate the unfolded SFS.
First we need the allele frequencies.
Just run:
angsd -bam ALL.bamlist -anc anc.fa.gz -doSaf 1 -GL 2 -nThreads 4
-doSaf 1 = it will calculate allele frequency likelihood based on
individual genotype likelihoods assuming Hardy-Weinberg.
The output will be angsdput.saf.idx
, angsdput.saf.pos.gz
, angsdput.saf.gz
. If you want to know more about these files, read https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf
.
But now that we have calculated allele frequencies, let’s calculate the genotype likelihood:
Just run:
angsd -bam ALL.bamlist -doSaf 1 -out output_gl -anc anc.fa.gz -GL 2 -P 4 -minMapQ 1 -minQ 20
-doSaf 1 = it will calculate allele frequency likelihood based on
individual genotypoe likelihoods assuming Hardy-Weinberg.
-out = choose a name for your output files
-anc chimpHg19.fa.gz = we have to supply a file with the ancestral state
if we want to find unfolded SFS
-P = just another way to use "nThreads"
Now that all calculations have been done, let’s output the SFS based on them. Just run:
realSFS output_gl.saf.idx -maxIter 100 -P 4 >results.sfs
One you have done it, repeat it with our populations: PEL, TSI and LWK. Do not forget to add the option -out name
to name your output files so you can identify each population results.
Switching to your local computer
Alright, now you need to have this .Rmd file and all outputs in the same directory. You can copy to your machine. Or you can remote access it somewhere. It’s up to you. But if you followed the instructions at the top of the page you’ll have all you need already in your local machine.
Plotting SFS
You can run the code below in R to see how the SFS plots look like. Does it look like the plot we’ve seen during the theoretical session?
Let’s look population by population.
First, PEL:
# function to normalize
norm <- function(x) x/sum(x)
# read data
sfs <- (scan("pel_results.sfs"))
# the variability as percentile
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# the variable categories of the sfs
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
main = "SFS plot", col = "blue")
then LWK:
# function to normalize
norm <- function(x) x/sum(x)
# read data
sfs <- (scan("lwk_results.sfs"))
# the variability as percentile
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# the variable categories of the sfs
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
main = "SFS plot", col = "blue")
And TSI:
# function to normalize
norm <- function(x) x/sum(x)
# read data
sfs <- (scan("tsi_results.sfs"))
# the variability as percentile
pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
# the variable categories of the sfs
sfs <- norm(sfs[-c(1, length(sfs))])
barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
main = "SFS plot", col = "blue")
They are all a bit different, why do you think is that?
Let’s see them all together:
# function to normalize
nnorm <- function(x) x/sum(x)
# expected number of sites with 1:20 derived alleles
res <- rbind(PEL = scan("pel_results.sfs")[-1], LWK = scan("LWK_results.sfs")[-1],
TSI = scan("TSI_results.sfs")[-1])
colnames(res) <- 1:20
# density instead of expected counts
res <- t(apply(res, 1, nnorm))
# plot the polymorphic sites.
resPoly <- t(apply(res[, -20], 1, nnorm))
barplot(resPoly, beside = T, legend = c("PEL", "LWK", "TSI"), names = 1:19, main = "Derived allele frequencies")
References
- You can learn a lot by browsing the ANGSD website: http://www.popgen.dk/angsd/index.php/ANGSD
- All data comes from the 1000 Genomes Project https://www.nature.com/articles/nature15393
- You can check Matteo Fumagalli’s ngsTools Tutorial page for more exercises using the same dataset. It will show you how to do other things we didn’t have time to do today, like FST, PCAs, etc, using ANGSD/ngsTools: https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md
- This exercise just scratched the surface of what you can do with this type of data. If you are interested in running more interesting analyses, I recommend that you take a look at dadi: https://dadi.readthedocs.io/en/latest/ They have a nice Jupyterlab notebook (like you have been using in the previous days) which will help you a lot.
- I also recommend Fastsimcoal2: http://cmpg.unibe.ch/software/fastsimcoal2/
Footnotes
The password is provided by the course instructor↩︎