PSMC

Author

André Soares

Published

15-Nov-2023

Compute environment setup

If you haven’t already done so, please read Compute environment for information on how to prepare your working directory.

Learning objectives

Perform basic PSMC commands and analyses
Look and interpret at PSMC plots

Data setup

UPPMAX
Local

Move to your course working directory /proj/naiss2023-22-1084/users/USERNAME, create an exercise directory called demography and cd to it:

cd /proj/naiss2023-22-1084/users/USERNAME
mkdir -p psmc
cd psmc

You will run most analyses on Rackham, so just go to your directory to start. You will need to download the results files to you computer afterwards, you can use wget like you’d done multiple times by now.

You can perform all commands on Rackham, so you will be working entirely on Uppmax.

Tools

UPPMAX modules

Everything you need is on Uppmax. Execute the following command to load modules:

module load bioinfo-tools bcftools psmc samtools htslib biopython

Practical information

All the files you need are located here: /proj/naiss2023-22-1084/private/demography/psmc.

We will be using PSMC (https://github.com/lh3/psmc).

PSMC is a programs to infer demography history and population size over time based on whole genome data.

To run the commands you have two options today:

Create a bash script and submit to SLURM;
Start an Interactive session on Rackham and run manually each step. I recommend this option for today.

How to start interactive mode on Rackham:

#| echo: true
#| eval: false
interactive -A naiss2023-22-1084 -n 4 \
   --partition core --time 08:00:00 \
   --reservation=naiss2023-22-1084_4

You need to load all modules on Rackham so you don’t have to worry about installing each program that we will need in this exercise.

module load bioinfo-tools bcftools psmc samtools htslib

PSMC

When you are running PSMC for your own projects you will need to perform all steps, but today we have a shorter version of the process, so you can look at the results more quickly.

One a regular day this is the process:

You will need a BAM file with reads mapped to a reference genome, just like you did on Tuesday at the Variant Calling exercise (https://nbisweden.github.io/workshop-pgip/exercises/variant_calling/#variant-calling-overview). For PSMC and MSMC it is recommended a minimum of 12x coverage, we discussed this a bit more during the lecture. The organism also needs to be diploid.
For PSMC you need to get a diploid consensus from the BAM file, using a command like:

bcftools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \
      | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz

After that you will need to convert the diploid consensus to a PSMC input file. PSMC comes with a script for that:

fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa

Then you will run PSMC for your sample. It is important downstream to know the generation time and mutation rate for your species.

For today we will be using a pre-made psmcfa file. This file was generated based on a simulated dataset created with ms.

PSMC input file

The .psmcfa file used for PSMC input is located at /proj/naiss2023-22-1084/private/demography/psmc/psmc_input.psmcfa.

Take a look inside with less psmc_input.psmcfa, you will see something like:

>1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTKTTTTTTTKTTTTTTTTTTTTKTTTTTTTTTTTTTTTTKTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTKTTTTTTTTTTTTTTTTTTTT

Where >1 indicates the chromosome, and then every window is assigned a T or a K. Can you guess what T and K means in this context?

Running PSMC

We will be running PSMC as:

psmc -N25 -t15 -r5 -p 4+25*2+4+6 -o output.psmc psmc_input.psmcfa

PSMC has a lot of parameters, and it can feel a bit of a black box for most users.

-N = number of iterations
–p = number of free atomic time intervals. This can be confusing. It
means 64 atomic time intervals and 28 free intervals parameters. The
first population size parameter spans the first 4 atomic time intervals,
each of the next 25 parameters spans 2 intervals, the 27th spans 4
intervals and the last parameter spans the last 6 time intervals.
–t = upper limit of time to most recent common ancestor, or the maximum
coalescence time

The PSMC algorithm breaks down the history of a population into a series of time intervals, each of which corresponds to a different era in the population’s past. The boundaries between these time intervals are estimated using genetic data and coalescent theory, which describes how genealogical relationships between individuals change over time.

Atomic time allows researchers to compare the relative timing of demographic events, such as population size changes or bottlenecks, without making assumptions about the actual calendar dates.

The aim of these parameters is that after 20 rounds of iterations, at least 10 recombinations are inferred to occur in the intervals each parameter spans. These parameters are the default for PSMC and were tested for the analysis of human data.

Please note that PSMC runs for hours, so once you see it is running, you can abort it. If you want, you can submit this as a job to Uppmax and check the results after ~3-4 hours.

Since we don’t have that time, I have the output file ready for you.

The results file is:

/proj/naiss2023-22-1084/private/demography/psmc/results.psmc

You can plot it two ways in this exercise.

Use the plot_results.py script.

Just run ./plot_results.py at the directory with the script and the .psmc file. This script was written to output a PDF file with the real population history (one of the benefits of using simulations) and the PSMC estimate. Once everybody reaches this stage we will stop and talk about the results a bit. This script has baked in all the options from the regular PSMC plotting tool below.

PSMC has a script for plotting the results. Just run the command:

psmc_plot.pl -u 2.5e-8 -g 25 psmc_plot results.psmc

It includes the µ for this “species” (2.5e-8), and the generation time (25). These parameters are fundamental to scale the x axis of the plot. If you provide the plotting tool with the wrong information, your x axis scale will be off and your interpretations will be essentially meaningless. If you are curious about how psmc_plot rescales the time, you can read about it in detail at the program’s paper here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3154645/).

References

The ms PSMC input and the PSMC plotting for simulated data use scripts I modified from Willy Rodríguez (https://github.com/willyrv/ms-PSMC).