Inferring tree sequences for population genomics
Trees are everywhere in biology…
… an ARG (ancestral recombination graph, i.e. a genealogy with recombination) is a generalisation of an evolutionary tree
To the left, an ARG without recombination. To the right, recombination splits lineage 1 going backwards in time, where L_1 takes the left path, and R_1 the right. Consequently, L_1 and R_1 have different MRCAs!
https://github.com/kitchensjn/tskit_arg_visualizer Wong et al. (2024)
Generalises an evolutionary “tree” to genome-wide inheritance
But still early days…
Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Ralph et al. (2020), Fig. 1
Panels: Neutral, Expansion, Bottleneck, Selection
Data compression
…but limited support for major genomic rearrangements (e.g. inversions, large indels): genomes should be (reasonably) aligned => current primary focus = population genetics
Speed
Images from online tutorial “Terminology & concepts” https://tskit.dev/tutorials/terminology_and_concepts.html
Nodes (=genomes)
id | flags | population | individual | time | metadata |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0.00000000 | |
1 | 1 | 0 | 0 | 0.00000000 | |
2 | 1 | 0 | 1 | 0.00000000 | |
3 | 1 | 0 | 1 | 0.00000000 | |
4 | 1 | 0 | 2 | 0.00000000 | |
5 | 1 | 0 | 2 | 0.00000000 | |
6 | 0 | 0 | -1 | 14.70054184 | |
7 | 0 | 0 | -1 | 40.95936939 | |
8 | 0 | 0 | -1 | 72.52965866 | |
9 | 0 | 0 | -1 | 297.22307150 | |
10 | 0 | 0 | -1 | 340.15496436 | |
11 | 0 | 0 | -1 | 605.35907657 | |
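The node table above can be read with a few lines of plain Python. This is only a sketch: it assumes that bit 1 of `flags` marks a sample node (the value tskit uses for `NODE_IS_SAMPLE`) and that `individual = -1` means "no individual"; the variable names are illustrative.

```python
NODE_IS_SAMPLE = 1  # assumed: same bit value tskit uses to flag sample nodes

# (id, flags, individual, time) copied from the node table above
nodes = [
    (0, 1, 0, 0.0), (1, 1, 0, 0.0),
    (2, 1, 1, 0.0), (3, 1, 1, 0.0),
    (4, 1, 2, 0.0), (5, 1, 2, 0.0),
    (6, 0, -1, 14.70054184), (7, 0, -1, 40.95936939),
    (8, 0, -1, 72.52965866), (9, 0, -1, 297.22307150),
    (10, 0, -1, 340.15496436), (11, 0, -1, 605.35907657),
]

# Sample nodes are the genomes we actually observed (all at time 0 here).
samples = [node_id for node_id, flags, _, _ in nodes if flags & NODE_IS_SAMPLE]

# Each diploid individual owns two nodes (two genomes).
genomes_of = {}
for node_id, _, individual, _ in nodes:
    if individual != -1:
        genomes_of.setdefault(individual, []).append(node_id)

print(samples)     # [0, 1, 2, 3, 4, 5]
print(genomes_of)  # {0: [0, 1], 1: [2, 3], 2: [4, 5]}
```

Nodes 6 to 11 are ancestral genomes: they are not samples, belong to no individual, and have times greater than zero.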
Edges
id | left | right | parent | child | metadata |
---|---|---|---|---|---|
0 | 0 | 1000 | 6 | 2 | |
1 | 0 | 1000 | 6 | 5 | |
2 | 0 | 1000 | 7 | 1 | |
3 | 0 | 1000 | 7 | 4 | |
4 | 0 | 1000 | 8 | 3 | |
5 | 0 | 1000 | 8 | 6 | |
6 | 307 | 1000 | 9 | 0 | |
7 | 307 | 1000 | 9 | 7 | |
8 | 0 | 307 | 10 | 0 | |
9 | 0 | 567 | 10 | 8 | |
10 | 307 | 567 | 10 | 9 | |
11 | 0 | 307 | 11 | 7 | |
12 | 567 | 1000 | 11 | 8 | |
13 | 567 | 1000 | 11 | 9 | |
14 | 0 | 307 | 11 | 10 | |
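To see how the edge table encodes several local trees at once, here is a minimal pure-Python sketch that recovers them. It assumes intervals are half-open, `[left, right)`; with tskit itself you would normally iterate over `ts.trees()` rather than do this by hand.

```python
# (left, right, parent, child) rows copied from the edge table above
edges = [
    (0, 1000, 6, 2), (0, 1000, 6, 5), (0, 1000, 7, 1), (0, 1000, 7, 4),
    (0, 1000, 8, 3), (0, 1000, 8, 6), (307, 1000, 9, 0), (307, 1000, 9, 7),
    (0, 307, 10, 0), (0, 567, 10, 8), (307, 567, 10, 9), (0, 307, 11, 7),
    (567, 1000, 11, 8), (567, 1000, 11, 9), (0, 307, 11, 10),
]

# The tree changes wherever some edge starts or ends.
breakpoints = sorted({x for left, right, _, _ in edges for x in (left, right)})

local_trees = []
for left, right in zip(breakpoints, breakpoints[1:]):
    # Keep only the edges that span this whole interval.
    parent_of = {c: p for l, r, p, c in edges if l <= left and right <= r}
    # A root is a parent that is never itself a child in this tree.
    roots = set(parent_of.values()) - set(parent_of)
    local_trees.append((left, right, parent_of, roots))

for left, right, parent_of, roots in local_trees:
    print(f"[{left}, {right}): root(s) {sorted(roots)}, parent of node 0 is {parent_of[0]}")
# [0, 307): root(s) [11], parent of node 0 is 10
# [307, 567): root(s) [10], parent of node 0 is 9
# [567, 1000): root(s) [11], parent of node 0 is 9
```

Fifteen edges describe three trees because edges shared between adjacent trees (for example the six that span 0 to 1000) are stored only once.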
Sites and mutations are how we encode genetic variation. Most genomic positions do not vary between genomes, so we usually don’t bother tracking them.
We can create a site at a given genomic position with a fixed ancestral state.
id | position | ancestral_state | metadata |
---|---|---|---|
0 | 52 | C | |
1 | 200 | A | |
2 | 335 | A | |
3 | 354 | A | |
4 | 474 | G | |
5 | 523 | A | |
6 | 774 | C | |
7 | 796 | C | |
8 | 957 | A | |
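Each site has a single position, so it sits on exactly one local tree. As a small sketch, assuming the tree boundaries 0, 307, 567 and 1000 implied by the edge table and half-open intervals, the sites can be assigned to trees like this:

```python
from bisect import bisect_right

# Tree boundaries implied by the edge table (assumed half-open intervals)
breakpoints = [0, 307, 567, 1000]

# Site positions copied from the site table above, indexed by site id
positions = [52, 200, 335, 354, 474, 523, 774, 796, 957]

# Index of the local tree that covers each site
tree_index = [bisect_right(breakpoints, pos) - 1 for pos in positions]

print(tree_index)  # [0, 0, 1, 1, 1, 1, 2, 2, 2]
```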
Normally, a site is created so that one or more mutations can be placed at it.
id | site | node | time | derived_state | parent | metadata |
---|---|---|---|---|---|---|
0 | 0 | 8 | 247.85988972 | T | -1 | |
1 | 1 | 0 | 169.80687857 | C | -1 | |
2 | 2 | 3 | 31.84262397 | C | -1 | |
3 | 3 | 9 | 326.26095349 | C | -1 | |
4 | 3 | 7 | 71.04212649 | T | 3 | |
5 | 4 | 3 | 42.72352948 | C | -1 | |
6 | 5 | 7 | 55.44045835 | T | -1 | |
7 | 6 | 0 | 259.82567754 | T | -1 | |
8 | 7 | 8 | 169.87040769 | G | -1 | |
9 | 8 | 0 | 42.47396523 | C | -1 | |
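Putting the tables together, the allele carried by any sample can be recovered by walking up the local tree from the sample until a mutation is met, falling back to the ancestral state otherwise. The sketch below does this in plain Python; it assumes half-open edge intervals and at most one mutation per node at a given site (true for the tables here). The function name `allele` is illustrative; in tskit the equivalent information comes from `ts.variants()`.

```python
# (left, right, parent, child) rows from the edge table
edges = [
    (0, 1000, 6, 2), (0, 1000, 6, 5), (0, 1000, 7, 1), (0, 1000, 7, 4),
    (0, 1000, 8, 3), (0, 1000, 8, 6), (307, 1000, 9, 0), (307, 1000, 9, 7),
    (0, 307, 10, 0), (0, 567, 10, 8), (307, 567, 10, 9), (0, 307, 11, 7),
    (567, 1000, 11, 8), (567, 1000, 11, 9), (0, 307, 11, 10),
]
# site id -> (position, ancestral_state), from the site table
sites = {0: (52, "C"), 1: (200, "A"), 2: (335, "A"), 3: (354, "A"),
         4: (474, "G"), 5: (523, "A"), 6: (774, "C"), 7: (796, "C"),
         8: (957, "A")}
# (site, node, derived_state), from the mutation table
mutations = [(0, 8, "T"), (1, 0, "C"), (2, 3, "C"), (3, 9, "C"), (3, 7, "T"),
             (4, 3, "C"), (5, 7, "T"), (6, 0, "T"), (7, 8, "G"), (8, 0, "C")]

def allele(sample, site_id):
    """State of `sample` at `site_id`: nearest mutation above it, else ancestral."""
    position, ancestral = sites[site_id]
    # Local tree at this position, as a child -> parent map
    parent_of = {c: p for l, r, p, c in edges if l <= position < r}
    derived = {node: state for s, node, state in mutations if s == site_id}
    node = sample
    while node is not None:
        if node in derived:
            return derived[node]
        node = parent_of.get(node)  # None once we step above the root
    return ancestral

site3 = [allele(s, 3) for s in range(6)]
haplotype0 = "".join(allele(0, j) for j in sorted(sites))
print(site3)       # ['C', 'T', 'A', 'A', 'T', 'A']
print(haplotype0)  # CCACGATCC
```

Site 3 illustrates the `parent` column of the mutation table: mutation 4 (T above node 7) sits below mutation 3 (C above node 9), so samples 1 and 4 carry T while sample 0 carries C.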
tsinfer/tsdate, Relate, ARGweaver, SINGER, ARG-Needle, Threads
Analysis of tree sequences with
Adapted from slide by Yun Deng.
Population Genomics in Practice