Population Genomics in Practice 2025 – Ancestral recombination graph inference

What are Ancestral Recombination Graphs?

Halanych (2004)

Trees are everywhere in biology…

… an ARG (recombinant genealogy) is a generalisation of an evolutionary tree

The coalescent with recombination

No recombination

With recombination

Left: a genealogy without recombination. Right: going backwards in time, a recombination event splits lineage 1 into two parts, where L_1 takes the left path and R_1 the right. Consequently, L_1 and R_1 have different MRCAs!

ARG visualization

https://github.com/kitchensjn/tskit_arg_visualizer Wong et al. (2024)

Why ARGs?

Generalises an evolutionary “tree” to genome-wide inheritance

Understanding

Describes the processes that generated genomic data
Helps to generate hypotheses

Power

“True” genealogy represents total DNA history

Size & Speed

Tree-like structures make algorithms fast
Can compress variant data: an “evolutionary encoding”

Representation

Duality: an ARG can equivalently be stored as a sequence of trees along a genome, where each tree is defined over a non-recombining interval

But still early days…

Of genotype matrices and genealogical trees

Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Ralph et al. (2020), Fig. 1

Trees capture biology

Neutral

Expansion

Bottleneck

Selection

msprime stores data as succinct tree sequences

Tree sequences (Baumdicker et al., 2022, Figure 2)

Tree sequences compress data and speed up analyses

Compact storage (“domain specific compression”)
Fast, efficient analysis (a “succinct” structure)
Well tested, open source (active dev community)

Data compression

Built-in functionality (well documented: http://tskit.dev)

…but limited support for major genomic rearrangements (e.g. inversions, large indels): genomes should be (reasonably) aligned => current primary focus = population genetics

Speed

Source: What is a tree sequence? (https://tskit.dev/tutorials/what_is.html)

tskit terminology: the basics

Multiple local trees exist along a genome of fixed length (by convention measured in base pairs)
Genomes exist at specific times, and are represented by nodes (the same node can persist across many local trees)
Some nodes are most recent common ancestors (MRCAs) of other nodes
Entities are zero-based: the first node has id 0, the second id 1, …

Images from online tutorial “Terminology & concepts” https://tskit.dev/tutorials/terminology_and_concepts.html

tskit terminology: nodes and edges

Nodes (=genomes)

exist at a specific time
can be flagged as “samples”
can belong to “individuals” (e.g., 2 nodes per individual in humans) and, if useful, “populations”

id	flags	individual	time
0	1	0	0.00000000
1	1	0	0.00000000
2	1	1	0.00000000
3	1	1	0.00000000
4	1	2	0.00000000
5	1	2	0.00000000
6	0	-1	14.70054184
7	0	-1	40.95936939
8	0	-1	72.52965866
9	0	-1	297.22307150
10	0	-1	340.15496436
11	0	-1	605.35907657
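
To make the node columns concrete, here is a small plain-Python sketch that reads the rows of the table above. It assumes that bit 0 of `flags` marks a sample node and that an `individual` value of -1 means "no individual"; the variable names are illustrative only.

```python
# Node table rows copied from the table above: (id, flags, individual, time).
NODE_IS_SAMPLE = 1  # assumption: bit 0 of `flags` marks a sample node
NULL = -1           # assumption: -1 means "not attached to an individual"

nodes = [
    (0, 1, 0, 0.0), (1, 1, 0, 0.0), (2, 1, 1, 0.0), (3, 1, 1, 0.0),
    (4, 1, 2, 0.0), (5, 1, 2, 0.0),
    (6, 0, -1, 14.70054184), (7, 0, -1, 40.95936939),
    (8, 0, -1, 72.52965866), (9, 0, -1, 297.22307150),
    (10, 0, -1, 340.15496436), (11, 0, -1, 605.35907657),
]

# Sample nodes are those with the sample bit set in `flags`.
samples = [nid for nid, flags, _, _ in nodes if flags & NODE_IS_SAMPLE]

# Group node ids by the individual they belong to (diploids have two nodes).
nodes_by_individual = {}
for nid, _, individual, _ in nodes:
    if individual != NULL:
        nodes_by_individual.setdefault(individual, []).append(nid)

print(samples)              # [0, 1, 2, 3, 4, 5]
print(nodes_by_individual)  # {0: [0, 1], 1: [2, 3], 2: [4, 5]}
```

The six sample nodes sit at time 0 and pair up into three diploid individuals; nodes 6 to 11 are ancestral genomes that exist further back in time.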

Edges

Connect a parent & child
Have a left & right genomic coordinate
Usually span multiple trees (e.g., edges connecting nodes 1+7 and 4+7)

id	left	right	parent	child
0	0	1000	6	2
1	0	1000	6	5
2	0	1000	7	1
3	0	1000	7	4
4	0	1000	8	3
5	0	1000	8	6
6	307	1000	9	0
7	307	1000	9	7
8	0	307	10	0
9	0	567	10	8
10	307	567	10	9
11	0	307	11	7
12	567	1000	11	8
13	567	1000	11	9
14	0	307	11	10
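
The edge table is enough to recover the local trees. The plain-Python sketch below is illustrative only: in tskit itself, `ts.trees()` and `tree.mrca(u, v)` do this work. It takes the distinct edge coordinates as tree boundaries, builds a child-to-parent map for each interval, and reports the root and the MRCA of samples 0 and 1. The helper names are hypothetical.

```python
# Edge table rows copied from the table above: (left, right, parent, child).
edges = [
    (0, 1000, 6, 2), (0, 1000, 6, 5), (0, 1000, 7, 1),
    (0, 1000, 7, 4), (0, 1000, 8, 3), (0, 1000, 8, 6),
    (307, 1000, 9, 0), (307, 1000, 9, 7), (0, 307, 10, 0),
    (0, 567, 10, 8), (307, 567, 10, 9), (0, 307, 11, 7),
    (567, 1000, 11, 8), (567, 1000, 11, 9), (0, 307, 11, 10),
]

# Every distinct edge coordinate is a boundary between local trees.
breakpoints = sorted({x for left, right, _, _ in edges for x in (left, right)})

def parent_map(left, right):
    """Return {child: parent} for the edges covering the interval [left, right)."""
    return {c: p for l, r, p, c in edges if l <= left and right <= r}

def path_to_root(parents, node):
    """List the nodes from `node` up to the root of its local tree."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def mrca(parents, u, v):
    """Return the most recent common ancestor of u and v, or None."""
    ancestors_of_u = set(path_to_root(parents, u))
    for node in path_to_root(parents, v):
        if node in ancestors_of_u:
            return node
    return None

local_trees = []
for left, right in zip(breakpoints[:-1], breakpoints[1:]):
    parents = parent_map(left, right)
    root = path_to_root(parents, 0)[-1]
    local_trees.append((left, right, root, mrca(parents, 0, 1)))
    print(f"[{left}, {right}): root={root}, MRCA(0, 1)={local_trees[-1][3]}")
```

With the table above this yields three trees, on [0, 307), [307, 567) and [567, 1000). Samples 0 and 1 coalesce in node 11 in the first tree but in node 9 in the other two, which is the "different MRCAs" effect of recombination.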

tskit terminology: sites and mutations

This is how we can encode genetic variation. Most genomic positions do not vary between genomes: usually we don’t bother tracking these.

We can create a site at a given genomic position with a fixed ancestral state.

id	position	ancestral_state
0	52	C
1	200	A
2	335	A
3	354	A
4	474	G
5	523	A
6	774	C
7	796	C
8	957	A


Normally, a site is created in order to place one or more mutations at that site

id	site	node	time	derived_state	parent
0	0	8	247.85988972	T	-1
1	1	0	169.80687857	C	-1
2	2	3	31.84262397	C	-1
3	3	9	326.26095349	C	-1
4	3	7	71.04212649	T	3
5	4	3	42.72352948	C	-1
6	5	7	55.44045835	T	-1
7	6	0	259.82567754	T	-1
8	7	8	169.87040769	G	-1
9	8	0	42.47396523	C	-1
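
Sites and mutations, combined with the edges, determine the allele carried by every sample. The sketch below shows the logic in plain Python using the tables from these slides (tskit provides this directly, for example through `ts.variants()`). For each sample it walks up the local tree at the site's position and takes the derived state of the first mutation it meets, falling back to the ancestral state. The function names are hypothetical.

```python
# Tables copied from the slides.
edges = [  # (left, right, parent, child)
    (0, 1000, 6, 2), (0, 1000, 6, 5), (0, 1000, 7, 1),
    (0, 1000, 7, 4), (0, 1000, 8, 3), (0, 1000, 8, 6),
    (307, 1000, 9, 0), (307, 1000, 9, 7), (0, 307, 10, 0),
    (0, 567, 10, 8), (307, 567, 10, 9), (0, 307, 11, 7),
    (567, 1000, 11, 8), (567, 1000, 11, 9), (0, 307, 11, 10),
]
sites = {  # site id -> (position, ancestral_state)
    0: (52, "C"), 1: (200, "A"), 2: (335, "A"), 3: (354, "A"), 4: (474, "G"),
    5: (523, "A"), 6: (774, "C"), 7: (796, "C"), 8: (957, "A"),
}
mutations = [  # (site, node, time, derived_state)
    (0, 8, 247.85988972, "T"), (1, 0, 169.80687857, "C"),
    (2, 3, 31.84262397, "C"), (3, 9, 326.26095349, "C"),
    (3, 7, 71.04212649, "T"), (4, 3, 42.72352948, "C"),
    (5, 7, 55.44045835, "T"), (6, 0, 259.82567754, "T"),
    (7, 8, 169.87040769, "G"), (8, 0, 42.47396523, "C"),
]
samples = [0, 1, 2, 3, 4, 5]

def parent_map(position):
    """Return {child: parent} for the local tree covering `position`."""
    return {c: p for l, r, p, c in edges if l <= position < r}

def alleles_at(site_id):
    """Return the allele of each sample at the given site."""
    position, ancestral_state = sites[site_id]
    parents = parent_map(position)
    # Keep the most recent (smallest time) mutation above each node.
    newest = {}
    for site, node, time, derived_state in mutations:
        if site == site_id and (node not in newest or time < newest[node][0]):
            newest[node] = (time, derived_state)
    alleles = []
    for sample in samples:
        node = sample
        # Walk towards the root until a node carrying a mutation is found.
        while node is not None and node not in newest:
            node = parents.get(node)
        alleles.append(newest[node][1] if node is not None else ancestral_state)
    return alleles

for site_id, (position, _) in sites.items():
    print(position, "".join(alleles_at(site_id)))
```

Site 3 (position 354) is the interesting case: the mutation to C above node 9 is inherited by sample 0, while the later mutation to T above node 7 overrides it for samples 1 and 4, and the remaining samples keep the ancestral A.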

Tree sequence inference methods

tsinfer/tsdate, Relate, ARGweaver, SINGER, ARG-Needle, Threads

Analysis of tree sequences with

Adapted from slide by Yun Deng.

Relate

Haplotypes and trees

Relate method

fast, but limited sample sizes (1000s?)
good support for ancient DNA

(Speidel et al., 2019)

Speidel mice

(Speidel, 2019)

tsinfer - tree sequence inference

fast
scales! (millions of samples!)
introduces tree sequence format
only genealogies, no branch lengths (but see tsdate; Wohns et al., 2022)

Tsinfer GNNs

(Kelleher et al., 2019)

GARG workshop Drøbak research station Aug-24

Mapping workflow Monkeyflower

Rulegraph, all

Tree sequence inference in Monkeyflower

raxml

Densitree

pca

Rulegraph, tsinfer

Bibliography

Baumdicker, F., Bisschop, G., Goldstein, D., Gower, G., Ragsdale, A. P., Tsambos, G., Zhu, S., Eldon, B., Ellerman, E. C., Galloway, J. G., Gladstein, A. L., Gorjanc, G., Guo, B., Jeffery, B., Kretzschmar, W. W., Lohse, K., Matschiner, M., Nelson, D., Pope, N. S., … Kelleher, J. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics, 220(3), iyab229. https://doi.org/10.1093/genetics/iyab229

Halanych, K. M. (2004). The New View of Animal Phylogeny. Annual Review of Ecology, Evolution, and Systematics, 35, 229–256. https://doi.org/10.1146/annurev.ecolsys.35.112202.130124

Kelleher, J., Wong, Y., Wohns, A. W., Fadil, C., Albers, P. K., & McVean, G. (2019). Inferring whole-genome histories in large population datasets. Nature Genetics, 51(9), 1330–1338. https://doi.org/10.1038/s41588-019-0483-y

Ralph, P., Thornton, K., & Kelleher, J. (2020). Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Genetics, 215(3), 779–797. https://doi.org/10.1534/genetics.120.303253

Speidel, L. (2019). Genealogy estimation for thousands of samples. University of Oxford.

Speidel, L., Forest, M., Shi, S., & Myers, S. R. (2019). A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics, 51(9), 1321–1329. https://doi.org/10.1038/s41588-019-0484-x

Wohns, A. W., Wong, Y., Jeffery, B., Akbari, A., Mallick, S., Pinhasi, R., Patterson, N., Reich, D., Kelleher, J., & McVean, G. (2022). A unified genealogy of modern and ancient genomes. Science, 375(6583), eabi8264. https://doi.org/10.1126/science.abi8264

Wong, Y., Ignatieva, A., Koskela, J., Gorjanc, G., Wohns, A. W., & Kelleher, J. (2024). A general and efficient representation of ancestral recombination graphs. Genetics, iyae100. https://doi.org/10.1093/genetics/iyae100