For this course, I have used one of my very simple tree from a few different organisms within the Archaea superkingdom. More specifically, it is the concatenated ribosomal protein tree within archaea, with representatives from the different Phyla: Asgard archaea (Thorarchaeota, Odinarchaeota, Lokiarchaeota and Heimdallarchaeota), TACK, Euryarchaeota and DPANN. The maximum likelihood phylogeny was reconstructed in IQ-TREE with the LG+F+I+G4 model. The alignment was trimmed to 6,732 positions with the BMGE (Block Mapping and Gathering with Entropy) tool.
ggtree
is an R package that extends ggplot2 for visualizating and annotating phylogenetic trees with their covariates and other associated data. ggtree
in combination with treeio
supports several file formats, including:
read.tree
for reading Newick files.read.phylip
for reading Phylip files.read.jplace
for reading Jplace files.read.nhx
for reading NHX files.read.beast
for parsing output of BEASTread.codeml
for parsing output of CODEML (rst and mlc files)read.codeml_mlc
for parsing mlc file (output of CODEML)read.hyphy
for parsing output of HYPHYread.jplace
for parsing jplace file including output from EPA and pplacerread.nhx
for parsing NHX file including output from PHYLODOG and RevBayesread.paml_rst
for parsing rst file (output of BASEML and CODEML)read.r8s
for parsing output of r8sread.raxml
for parsing output of RAxMLlibrary(ggtree)
library(treeio)
arch_tree <- read.newick("data/arch_newick.txt")
arch_tree
##
## Phylogenetic tree with 25 tips and 24 internal nodes.
##
## Tip labels:
## DPANN_Pace_RBG_13_36_9, DPANN_Woese_UBA94, EURY_Thermoplasma_volcanium_GSS1, Eury_Halobaculum_gomorrense, Eury_Methanothermococcus_okinawensis, TACK_Bathy_SMTZ80, ...
## Node labels:
## 100, 10.9, 100, 100, 72, 35.9, ...
##
## Rooted; includes branch lengths.
Just like with ggplot2
we created a basic canvas with ggplot(...)
and added layers with +geom_???()
, we can do the same here. The ggtree
package gives us a geom_tree()
function. Because ggtree
is built on top of ggplot2
, you get ggplot2
’s default gray theme with white lines. You can override this with a theme from the ggtree
package.
Because you’ll almost always want to add a tree geom and remove the default background and axes, the ggtree()
function is essentially a shortcut for ggplot(...) + geom_tree() + theme_tree()
.
ggtree(arch_tree)
#Same as:
#ggplot(arch_tree) + geom_tree() + theme_tree()
One can also customize how the tree looks like, just as in ggplot
ggtree(arch_tree, branch.length="none", color="blue", size=2, linetype=3)
You can also use different orientations to show the tree.
ggtree(arch_tree, color="blue", layout = "circular")
Just like in ggplot()
you can add layers to the tree as geoms
.
p <- ggtree(arch_tree)
p + geom_tiplab(color = "blue") + theme_tree2()
Before we can go further into annotating the tree, we need to understand how ggtree is handling the tree structure internally. Some of the functions in ggtree for annotating clades need a parameter specifying the internal node number. To get the internal node number, user can use geom_text
to display it, where the label is an aesthetic mapping to the “node variable” stored inside the tree object (think of this like the continent variable inside the gapminder object).
p + geom_text(aes(label=node), hjust=-.3)
Another way to get the internal node number is using the function MRCA()
which stands for Most Recent Common Ancestor. From the tree above, if you would like to get the number of the particular node that is common to all the “Lokiarchaea”. You can do so like below:
MRCA(arch_tree, .node1 = "Baja_Loki3", .node2 = "Loki_CR4")
## [1] 46
Now, let us see how we can use this information to make it scientifically intuitive.
p +
geom_cladelabel(node=46, label="Lokiarchaeota", color="green")
ALternatively, we can also highlight an entire clade:
p +
geom_hilight(node=39, fill = "purple") +
geom_cladelabel(node=39, label="Thorarchaeota", color="purple") +
geom_hilight(node=46, fill = "green") +
geom_cladelabel(node=46, label="Lokiarchaeota", color="green")
You can make connections to represent processes like gene transfer events on the tree.
p +
geom_hilight(node=30, fill = "orange") +
geom_cladelabel(node=30, label="Euryarchaeota", color="orange") +
geom_hilight(node=46, fill = "green") +
geom_cladelabel(node=46, label="Lokiarchaeota", color="green") +
geom_taxalink("Eury_Methanothermococcus_okinawensis", "Loki_GC14", color = "red")
You can group the different tips using the groupOTU
function or you can group the tips based on a node using groupClade
function. These groups you can then further use it for other geoms and so on.
Let us say, we want to show the Asgrad archaea separately from the other archaea in this tree. One way to this would be:
arch_tree <- groupOTU(arch_tree, c("DPANN_Pace_RBG_13_36_9", "DPANN_Woese_UBA94",
"EURY_Thermoplasma_volcanium_GSS1", "Eury_Halobaculum_gomorrense", "Eury_Methanothermococcus_okinawensis",
"TACK_Bathy_SMTZ80", "TACK_Nitrososphaera_viennensis", "TACK_Sulfolobus_islandicus_LS215"))
ggtree(arch_tree, aes(color=group)) + geom_tiplab()
We can also get more detailed:
cls <- list(Heimdallarchaeota = c("Heim_AB125", "Heim_LC2", "Heim_LC3"),
Thorarchaeota = c("Baja_Thor", "Thor_AB25", "Thor_MP11T", "Thor_SMTZ45", "Thor_SMTZ_145", "Thor_MP8T", "Thor_MP9T", "Thor_SMTZ_183"),
Lokiarchaeota = c("Loki_CR4", "Loki_GC14", "Baja_Loki2", "Baja_Loki1", "Baja_Loki3"),
Odinarchaeota = "Odin_LCB4")
arch_tree <- groupOTU(arch_tree, cls)
ggtree(arch_tree) + geom_tiplab(aes(color=group)) +
scale_color_manual(values=c("black", rainbow(4)), labels = c("OutGroup", names(cls)))
One thing to notice hear that you can group tips from different clades together as well.
baja <- list(Baja = c("Baja_Thor", "Baja_Loki2",
"Baja_Loki1", "Baja_Loki3"))
arch_tree <- groupOTU(arch_tree, baja)
ggtree(arch_tree) + geom_tiplab(aes(color=group)) +
scale_color_manual(values=c("black", "blue"),
labels = c("Public genomes", "Baja genomes"))
Now, if we are only interested in showing the differences based on the clades, we can use groupClade
.
arch_tree <- groupClade(arch_tree, c(39, 46, 35, 27))
ggtree(arch_tree) + geom_tiplab(aes(color=group)) +
scale_color_manual(values=c(rainbow(4), "black"),
labels = c("Odin", "Thor", "Loki", "Heimdall", "OutGroup"))
If you notice the tree object, it actually contains node.label
that are bootstrap values.
arch_tree
##
## Phylogenetic tree with 25 tips and 24 internal nodes.
##
## Tip labels:
## DPANN_Pace_RBG_13_36_9, DPANN_Woese_UBA94, EURY_Thermoplasma_volcanium_GSS1, Eury_Halobaculum_gomorrense, Eury_Methanothermococcus_okinawensis, TACK_Bathy_SMTZ80, ...
## Node labels:
## 100, 10.9, 100, 100, 72, 35.9, ...
##
## Rooted; includes branch lengths.
There are different ways we could show this values on the tree. For example,
ggtree(arch_tree) + geom_nodelab() + geom_tiplab() + theme_tree()
Now, let us say that you only want to show the values that are above 80, then you would have to do some trick here, like:
ggtree(arch_tree) +
geom_nodepoint(aes(subset = as.numeric(label) > 80), size = 3) +
geom_tiplab() +
theme_tree()
Note Notice that the node.label
are in string, so must remember to use as.numeric()
option be able to filter them to show them.
Note If the tree does not have bootstrap values with in the treefile, one can add them separately as a data.frame
using %<+%
function, as shown in the Adding data to tree section below.
It is possible to specific tips
from the tree using drop.tip()
fuction. Let us say, we want to remove Thor_SMTZ45
and Baja_Loki1
, becuase of the reason that their genomes are not close to complete. We can do that by:
to_drop <- c("Thor_SMTZ45", "Baja_Loki1")
arch_tree_reduced <- drop.tip(arch_tree, to_drop)
arch_tree_reduced <- groupClade(arch_tree_reduced, c(37, 43, 33, 25))
ggtree(arch_tree_reduced) + geom_tiplab(aes(color=group)) +
scale_color_manual(values=c(rainbow(4), "black"),
labels = c("Odin", "Thor", "Loki", "Heimdall", "OutGroup"))
One can also subset a tree, by using the internal node number as shown in the exercises earlier. We can do this by using the tree_subset()
function:
asg_tree <- treeio::tree_subset(arch_tree, node=34, levels_back=0)
ggtree(asg_tree) + geom_tiplab()
Note The branch lengths and all the other information from the tree are maintained in the new object, when remove tips or subset a tree.
In this part, let us try to see what kind of ways one could add metadata to a tree.
env_info <- read.table("data/tree_env.tsv", sep="\t",
stringsAsFactor=F, row.names = 1, header = T)
p <- ggtree(arch_tree) +
geom_tiplab(aes(color=group)) +
scale_color_manual(values=c(rainbow(4), "black"),
labels = c("Odin", "Thor", "Loki", "Heimdall", "OutGroup")) +
theme(legend.position = "none")
gheatmap(p, env_info, offset = 4, width=1.5,
colnames_angle = 90, colnames_offset_y = 5,
legend_title = "Environment sampled")
Similarly we can plot a heatmap:
env_hmap <- read.table("data/tree_hmap.tsv", sep="\t",
stringsAsFactor=F, row.names = 1, header = T)
p <- ggtree(arch_tree) +
geom_tiplab(aes(color=group)) +
scale_color_manual(values=c(rainbow(4), "black"),
labels = c("Odin", "Thor", "Loki", "Heimdall", "OutGroup")) +
theme(legend.position = "none")
gheatmap(p, env_hmap, offset = 4, width=1.5,
colnames_angle = 90, colnames_offset_y = 5,
legend_title = "Environment sampled")
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] treeio_1.27.0.002 ggtree_3.11.0 pheatmap_1.0.12
## [4] swemaps_1.0 mapdata_2.3.1 maps_3.4.2
## [7] gridExtra_2.3 jpeg_0.1-10 ggpubr_0.6.0
## [10] cowplot_1.1.3 ggthemes_5.0.0 scales_1.3.0
## [13] ggrepel_0.9.5 wesanderson_0.3.7 forcats_1.0.0
## [16] stringr_1.5.1 purrr_1.0.2 readr_2.1.5
## [19] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
## [22] reshape2_1.4.4 ggplot2_3.4.4 formattable_0.2.1
## [25] kableExtra_1.4.0 dplyr_1.1.4 lubridate_1.9.3
## [28] leaflet_2.2.1 yaml_2.3.8 fontawesome_0.5.2.9000
## [31] captioner_2.2.3 bookdown_0.37 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-155 fs_1.6.3 RColorBrewer_1.1-3
## [4] tools_4.1.3 backports_1.4.1 bslib_0.6.1
## [7] utf8_1.2.4 R6_2.5.1 lazyeval_0.2.2
## [10] mgcv_1.8-39 colorspace_2.1-0 withr_3.0.0
## [13] tidyselect_1.2.0 compiler_4.1.3 cli_3.6.2
## [16] xml2_1.3.6 labeling_0.4.3 sass_0.4.8
## [19] yulab.utils_0.1.4 systemfonts_1.0.5 digest_0.6.34
## [22] rmarkdown_2.25 svglite_2.1.3 pkgconfig_2.0.3
## [25] htmltools_0.5.7 fastmap_1.1.1 highr_0.10
## [28] htmlwidgets_1.6.4 rlang_1.1.3 rstudioapi_0.15.0
## [31] gridGraphics_0.5-1 jquerylib_0.1.4 farver_2.1.1
## [34] generics_0.1.3 jsonlite_1.8.8 crosstalk_1.2.1
## [37] car_3.1-2 magrittr_2.0.3 ggplotify_0.1.2
## [40] patchwork_1.2.0 Matrix_1.6-5 Rcpp_1.0.12
## [43] munsell_0.5.0 fansi_1.0.6 ape_5.7-1
## [46] abind_1.4-5 lifecycle_1.0.4 stringi_1.8.3
## [49] carData_3.0-5 plyr_1.8.9 parallel_4.1.3
## [52] lattice_0.20-45 splines_4.1.3 hms_1.1.3
## [55] pillar_1.9.0 ggsignif_0.6.4 glue_1.7.0
## [58] evaluate_0.23 ggfun_0.1.4 leaflet.providers_2.0.0
## [61] vctrs_0.6.5 tzdb_0.4.0 gtable_0.3.4
## [64] cachem_1.0.8 xfun_0.41 broom_1.0.5
## [67] tidytree_0.4.6 rstatix_0.7.2 viridisLite_0.4.2
## [70] aplot_0.2.2 memoise_2.0.1 timechange_0.3.0
## [73] ellipsis_0.3.2