Omics Integration course 2020

Rui Benfeitas, Scilifelab, NBIS National Bioinformatics Infrastructure Sweden

rui.benfeitas@scilifelab.se

Abstract
In this notebook we will explore how to generate and analyse a multi-omic network comprising metabolite quantifications and gene expression. We will compare these networks against randomly generated networks and compute different network metrics. At the end we will also perform a community analysis and a functional characterization at the gene level.


Contents

Biological network topology analysis

Objective
In this notebook you will learn how to build and analyse a network derived from a gene-metabolite association analysis. Other mixed networks may be analysed similarly, differing only in whether and how you can apply the final functional analysis.

Data
As a test case we will be using the file met_genes.tsv which contains abundances for 125 metabolites and 1992 genes, for 24 samples.

Software
This notebook relies on python's igraph for most of the analyses. Most, if not all, functions used here can also be applied with R's igraph. Other packages exist for network analysis including networkx and graph-tool. Snap.py is also a good alternative for large networks.

We will build our network through an association analysis, but there are other methods to do this including Graphical Lasso or linear SVR.

Data preparation

You will use a gene expression dataset (RNAseq, expressed as TPM) from a disease group with 24 samples, to keep memory and time requirements reasonable for this lesson. It is assumed that batch effects and other possible technical artifacts are absent, and that all data is ready for analysis. However, there are several important considerations when preprocessing your data:

Preprocessing will depend on the type of data that you have and what you want to do with it, and it will severely affect downstream results. It is thus important to think about this carefully.

No duplicated features are present.

A very quick view shows that several gene clusters are found, including two major groups. However, the analysis below does not perform any statistical filtering.

Association analysis

Our initial network analysis will be performed on an association analysis using Spearman correlations. Because this analysis has a high chance of producing false positives, we will use the Bonferroni correction to control the family-wise error rate, as well as FDR.


We will perform a gene-gene, gene-metabolite, and metabolite-metabolite association analysis by computing pairwise Spearman correlations. The choice between non-parametric (Kendall, Spearman) and parametric (Pearson) methods depends on your data. Here, because we have a small sample size and want to save computational time, we choose Spearman.

The following takes a few minutes to run.
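A minimal sketch of this step, assuming the data has been loaded into a pandas DataFrame with features as rows and samples as columns (the layout and variable names are illustrative assumptions):

```python
import pandas as pd
from scipy import stats

# features (genes + metabolites) as rows, samples as columns (assumed layout)
data = pd.read_csv('met_genes.tsv', sep='\t', index_col=0)

# spearmanr treats each column as a variable, so we transpose:
# R and P are p x p matrices of correlation coefficients and P values
R, P = stats.spearmanr(data.T)
R = pd.DataFrame(R, index=data.index, columns=data.index)
P = pd.DataFrame(P, index=data.index, columns=data.index)
```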

If we look at the matrix of P values, we can already see that many of the correlations in the top right columns are not significant even before multiple hypothesis correction:

We will now adjust the P values based on the number of comparisons done. The heatmaps above comprise a total of $2117^2 \approx 4.5M$ correlations. However, this count includes each correlation twice (gene A vs gene B, and gene B vs gene A) as well as each feature's correlation with itself. We are only interested in the unique pairs, excluding self-correlations.
This means $\frac{2117!}{2!(2117-2)!} \approx 2.2M$ correlations. At an error rate of 0.05, the probability of finding at least one false positive among these is essentially 1: $1-0.95^{2.2 \times 10^6} \approx 1$. We thus need to correct the P values.

In the following cell, we convert the matrix of $p \times p$ features to long format, concatenate both R and P for each correlation, and correct the P values with Bonferroni (Padj) and FDR (Benjamini-Hochberg).
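A sketch of what such a cell might look like, continuing from the R and P matrices above (statsmodels' multipletests does the corrections):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

# upper triangle keeps each pair once and drops self-correlations
iu = np.triu_indices_from(R.values, k=1)
long_df = pd.DataFrame({
    'feature_a': R.index.values[iu[0]],
    'feature_b': R.columns.values[iu[1]],
    'R': R.values[iu],
    'P': P.values[iu],
})

# family-wise error (Bonferroni) and FDR (Benjamini-Hochberg) corrections
long_df['Padj'] = multipletests(long_df['P'], method='bonferroni')[1]
long_df['FDR'] = multipletests(long_df['P'], method='fdr_bh')[1]
```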

Considering the Bonferroni correction we find 16305 correlations that are statistically significant at $\alpha < 0.01$. If we instead consider FDR as the correction method, we find 402368 significant correlations, which at an FDR of 0.01 implies $0.01 \times 402368 \approx 4024$ expected false positives.

Let's add two additional columns, where we assign R=0 for those correlations that are not statistically significant (adjP > 0.01, and FDR > 0.01).
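In pandas this is a one-liner per column (a sketch, continuing from long_df above):

```python
# zero out correlations that do not pass each significance filter
long_df['R_bonferroni'] = long_df['R'].where(long_df['Padj'] < 0.01, 0)
long_df['R_fdr'] = long_df['R'].where(long_df['FDR'] < 0.01, 0)
```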

Now we can see how the initial heatmap of correlations looks. The next cell converts the long-format matrix back to a square matrix so that we can generate the heatmap. We will plot those features that show statistically significant associations with more than 5% of the features after FDR correction, and compare them between the FDR- and Bonferroni-corrected datasets.

The following plot shows the heatmap of the Spearman rank correlation coefficients after Bonferroni correction - correlations where Padj > 0.01 are shown as 0.

The plots above show that the Bonferroni correction is only selecting very high (absolute) correlations. This should remove false positives, but it may also remove weaker correlations that are biologically relevant and true positives. The Bonferroni correction also removes most of the negatively-associated features. Notice this from the distribution of correlation coefficients:

We can also observe that the Bonferroni correction yields a more homogeneous number of associated features per feature (i.e. first neighbors) compared to the FDR filtering. This has consequences for the network structure.

Note that the Bonferroni correction leads to many nodes being associated with only 1 or 2 other nodes, whereas the FDR correction leads to a substantially higher number of associations for some of the nodes. This also raises questions about the biological plausibility of such a high number of associations - is it biologically meaningful that a gene is co-expressed with 800 other genes/metabolites?

We can also be a bit more strict on the FDR that we consider as statistically significant. Let's compare the number of potential false positives at different FDR. The following plot further highlights an FDR = 0.01 (dashed gray line).
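The expected number of false positives at a given cut-off is simply the cut-off multiplied by the number of edges retained; a sketch of this comparison, continuing from long_df above:

```python
# expected false positives = FDR cut-off x number of retained edges
for cutoff in [0.001, 0.005, 0.01, 0.05, 0.1]:
    n_edges = (long_df['FDR'] < cutoff).sum()
    print(f'FDR < {cutoff}: {n_edges} edges, '
          f'~{cutoff * n_edges:.0f} expected false positives')
```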

Note how the number of potential false positives increases substantially with the number of edges, showing that large networks can harbour a high number of false positives. Selecting an appropriate cut-off thus needs to take both quantities into account.
By being more conservative, one can be more confident in the identified associations, but a network that is too small may miss biologically important associations (i.e. have many false negatives) and become too sparse to be representative of the underlying biology. This may result in very small communities of nodes, and many dyads and isolated vertices. On the other hand, a network that is too big may display a very high number of associations that are not observed biologically (false positives). Such networks are likely to display large communities and high density. Think of the balance between Type I and Type II errors. Unfortunately, because this is an unsupervised problem, we have no way to compute the false negative rate. This is not always the case: if you are dealing with human proteomics, one possible way to identify false negatives is to use a reference map of human protein-protein interactions.

In choosing an appropriate cut-off we will thus rely on the number of potential false positives and the network dimensions. In another section below, we will compare different networks and explore how their structure differs both in network properties (e.g. centrality) and communities. Bear in mind that false positives should not have an extensive impact on the network communities if these spurious associations are randomly distributed throughout the network, since community detection algorithms usually identify regions of high density.

We will also look at how the statistically significant correlation coefficients change with FDR.

Predictably, by being more conservative we select higher absolute correlation coefficients. In this test case we are considering only 24 samples. Larger sample sizes lead to lower nominal and adjusted P values, and thus a higher number of statistically significant but milder correlation coefficients; in such cases, one may want to be more conservative with the significance threshold. Henceforth, we will consider as statistically significant those edges where FDR < 0.01.

One last point comes from comparing statistically significant associations within and between omics. Note how few inter-omic associations are identified, and that the Bonferroni correction misses them entirely.

Questions

In building a graph from an association analysis:

Network construction and preliminary analysis

We will now build 4 different networks to analyse further:

  1. The full network, containing all statistically significant associations (FDR < 0.01).
  2. The positive association network, keeping only the positive significant correlations.
  3. The k-nearest neighbour graph (kNN-G), built from distances between features.
  4. A random network with dimensions matched to the full network.

The random network will be a null model for our analyses. The idea is that if a property found in one of our graphs is also reproduced in a random graph, then chance alone is a sufficient explanation for it. In other words, if a property of a graph (e.g. clustering) is not found in a random network, we can assume that its presence in our biological network is not due to randomness.

We will now build the kNN-G, using distances as input to determine the nearest neighbours. Because this data contains both gene expression and metabolite quantifications, we need to normalize the features beforehand (we did not need to do this above because Spearman correlations are computed on ranks).

We start by standardizing all features.
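For instance, with scikit-learn (a sketch; each feature is z-scored across samples, keeping the features-by-samples layout assumed above):

```python
from sklearn.preprocessing import StandardScaler

# z-score each feature across samples so that genes and metabolites
# are on comparable scales
scaled = pd.DataFrame(
    StandardScaler().fit_transform(data.T).T,
    index=data.index, columns=data.columns)
```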

We will now generate the graphs from the dataframes above.
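A sketch of the graph construction with python-igraph, continuing from long_df and scaled above (k = 10 for the kNN-G is an arbitrary illustrative choice):

```python
import igraph as ig
from sklearn.neighbors import kneighbors_graph

# full network: all significant associations; edge weight = correlation
sig = long_df.loc[long_df['FDR'] < 0.01, ['feature_a', 'feature_b', 'R']]
g_full = ig.Graph.TupleList(sig.itertuples(index=False), weights=True)

# positive association network: only positive significant correlations
g_pos = ig.Graph.TupleList(
    sig.loc[sig['R'] > 0].itertuples(index=False), weights=True)

# kNN-G: connect each feature to its k nearest neighbours (euclidean)
adj = kneighbors_graph(scaled.values, n_neighbors=10)
sources, targets = adj.nonzero()
g_knn = ig.Graph(list(zip(sources.tolist(), targets.tolist())))
g_knn.vs['name'] = list(scaled.index)
g_knn.simplify()  # merge duplicate edges from asymmetric neighbourhoods

# random (Erdos-Renyi) null model matching the full network's dimensions
g_rand = ig.Graph.Erdos_Renyi(n=g_full.vcount(), m=g_full.ecount())
```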

For representation purposes we will plot a small kNN graph - be careful when drawing the others: they have many more edges, and plotting becomes computationally heavy.

In the next table, we can see that while the same number of nodes is found in all networks, the number of edges varies greatly. We also see that each network is fully connected, which is not always the case. If a network is not connected, we can select the k largest connected components and proceed with the analyses using them. The largest connected component is called the giant component.

Questions:

Centrality analysis

We'll look into different centrality measures:

  1. Degree - the number of edges of each node.
  2. Betweenness - how often a node lies on the shortest paths between all pairs of nodes.
  3. Closeness - the inverse of the average distance from a node to all other nodes.
  4. Eigenvector centrality - a node's influence, based on the centrality of its neighbours.
  5. Eccentricity - the distance from a node to the node farthest from it.

Degree, Betweenness, Closeness and Eigenvector centralities may additionally be computed for the positive association network by taking into account each edge's weight. For instance, the weighted degree (strength) of a node is computed by summing the weights of its edges.

Because the number of shortest paths in a network scales with the network size, we normalize Eccentricity and Betweenness with respect to the network size so that they can be compared between the four networks above.
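A sketch of how these metrics can be computed with python-igraph (unweighted; the normalizations follow the description above, and the graph names come from the construction sketch earlier):

```python
import pandas as pd

def centrality_table(g):
    """Centrality metrics for one igraph graph (unweighted sketch)."""
    n = g.vcount()
    return pd.DataFrame({
        'degree': g.degree(),
        # betweenness scaled by the number of node pairs
        'betweenness': [b / ((n - 1) * (n - 2) / 2) for b in g.betweenness()],
        'closeness': g.closeness(),
        'eigenvector': g.eigenvector_centrality(),
        # eccentricity scaled by network size
        'eccentricity': [e / n for e in g.eccentricity()],
    }, index=g.vs['name'] if 'name' in g.vs.attributes() else range(n))

centralities = {name: centrality_table(g)
                for name, g in [('full', g_full), ('pos', g_pos),
                                ('knn', g_knn), ('random', g_rand)]}
```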

Note that many other centrality metrics exist. For instance, PageRank and HITS take edge directionality into account to determine the most central nodes in a network.

Degree distribution
Let's start by comparing the degrees of the random network against the three other networks. From the figures below, there appears to be no relationship between the degrees in the random network and those in any of the three other networks.

Centrality
We will now compare different centrality metrics between the graphs.

When interpreting the results above, it is important to bear in mind that these networks have different network sizes. Recall:

Overall, we see:

Questions:

We will also explore the relationships between the different centrality metrics. Because these have different interpretations, we will compute ranks for each centrality and perform the correlations on the ranks. In the following cell we do this, and then compute correlations among the 5 metrics for the full network. One additional column is added (median_centrality), which is simply the median of the ranks of the 5 centrality metrics.
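A sketch of this cell, using the centrality table of the full network from the earlier sketch (seaborn is assumed for the heatmap):

```python
import seaborn as sns

# rank-transform each centrality, add the median of the five ranks,
# then correlate and plot the rank correlations
ranks = centralities['full'].rank()
ranks['median_centrality'] = ranks.median(axis=1)
sns.heatmap(ranks.corr(method='spearman'), cmap='vlag', center=0)
```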

The plot above shows that most of the centrality metrics are positively correlated in the full network and in the positive association network. However, this is not the case for the kNN network.

Questions

  1. How do you explain the inverse relationship between degree and most other metrics in the KNN network?
  2. Why do you think that the KNN network specifically displays this opposite trend?

The next figure may help in answering the question above. We highlight the most central nodes based on degree (red) and eccentricity (green), in addition to a random subset of 500 nodes. Of these, which do you think displays the shortest path to all other nodes?

Community analysis

Node communities may be identified based on different metrics, including Modularity or Density. We will look at community detection through modularity.

Modularity of a small graph

Recall that the Modularity of a network partitioned into communities is given by $$Q = \frac{1}{2m} \sum_{c}\left(e_c - \frac{K_c^2}{2m}\right)$$

where $e_c$ is twice the number of edges within community $c$ (each within-community edge contributes both of its ends), $K_c$ is the sum of the degrees of its nodes, and $\frac{K_c^2}{2m}$ is the expected value of $e_c$ for a network with $m$ edges. This corresponds to

$$Q = \frac{1}{2m} \sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i,c_j)$$

where $A_{ij}$ is the adjacency between nodes $i$ and $j$, $k_i$ and $k_j$ are their degrees, and $\delta(c_i,c_j)$ is the Kronecker delta, defined as 1 if nodes $i$ and $j$ are in the same community and 0 otherwise. Let's examine the following small network, with communities given by the two colors: red [A,B,C,D] and blue [E,F,G,H].

For this network, we have the following adjacency matrix.

We will compute the modularity of this network given the 2 communities above, herein identified as communities 1 and 2.

As this is a very small network, we can take a brute-force approach and examine all possible membership assignments to find the ones that yield the highest possible modularity.
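A brute-force sketch with python-igraph, where g_small stands for the example graph above (with 8 nodes there are only $2^8 = 256$ assignments to test):

```python
from itertools import product

# score every possible assignment of the 8 nodes to 2 communities
results = sorted(
    ((g_small.modularity(list(membership)), membership)
     for membership in product([0, 1], repeat=g_small.vcount())),
    reverse=True)
print(results[:2])  # the two top hits (the same split, labels swapped)
```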

We can see that the 2 top hits that maximize modularity define the same communities, [A,B,C,D] and [E,F,G,H]. For larger networks we cannot use a brute-force approach; instead, we rely on heuristics such as the two-phase Louvain algorithm, which has since been improved upon by the Leiden algorithm.

Modularity of gene-metabolite networks

Below, we perform the community analysis on the 4 networks. We will perform one additional community analysis by considering the edge weights of the positive association network. Importantly, this method searches for the largest possible communities in our network, which may not always be desired. Alternative models such as the Constant Potts Model allow you to identify smaller communities. If we know that our data contains special feature classes, we can check whether the communities recover those classes by examining them individually, and increase the resolution if needed.
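A sketch with the leidenalg package, assuming the four graphs built earlier (ModularityVertexPartition maximizes modularity; the weighted run uses the 'weight' edge attribute stored during graph construction):

```python
import leidenalg as la

# modularity-maximizing Leiden partitions for each network
partitions = {name: la.find_partition(g, la.ModularityVertexPartition)
              for name, g in [('full', g_full), ('pos', g_pos),
                              ('knn', g_knn), ('random', g_rand)]}

# additional weighted analysis for the positive association network
partitions['pos_weighted'] = la.find_partition(
    g_pos, la.ModularityVertexPartition, weights='weight')

for name, part in partitions.items():
    print(name, 'modularity:', round(part.modularity, 3),
          'communities:', len(part))
```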

Predictably, we can see that the modularity score of any of the networks is substantially larger than that of the random network.

Comparing the different communities by size:

Some of the communities in the pos and full networks are very small, comprising only 2 or 3 elements. Can we really call these communities?

Functional analysis
In order to perform functional enrichment, we will extract each of the communities and perform a hypergeometric test on the genes to understand whether they are particularly enriched in specific biological functions.
We will use enrichr to perform the gene set enrichment analysis. As background we will use the full list of genes that were quantified.

We will look at 3 gene set libraries. Should you have other kinds of data, enrichr allows you to define your own feature sets and perform a similar analysis. The challenge is in identifying comprehensive and well curated gene sets.
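A sketch of how one community could be tested with the gseapy wrapper for Enrichr (community_genes and background are assumed gene-symbol lists; the library names are illustrative, and background support depends on the gseapy version):

```python
import gseapy

# enrichment of one community's genes against the quantified background
enr = gseapy.enrichr(
    gene_list=community_genes,          # genes of one community (assumed)
    background=background,              # all quantified genes (assumed)
    gene_sets=['GO_Biological_Process_2018',
               'KEGG_2019_Human',
               'OMIM_Disease'],
    outdir='Enrichr')                   # barplots are written here

# keep terms passing significance testing
significant = enr.results[enr.results['Adjusted P-value'] < 0.05]
```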

From the output of the cell above you can readily see that some communities, such as c1, display no enriched terms from any of GO, KEGG or OMIM.

Running the command above not only gives you the results after significance testing (Q < 0.05), but also outputs some preliminary barplots of the statistically significant results (found under /Enrichr/). For instance:

Note that some of these communities are very big, which explains the large number of biological processes found above.

We can now check whether the Full, Positively associated, and Positively associated weighted networks show any common terms among their biggest communities. We do not compare with the kNN-G, as its communities are very homogeneous and differ substantially from those of the other networks.

Among the biggest communities we find several biological processes (55) that are simultaneously identified in the same community in all three graphs (Full, Pos assoc, and Pos assoc weighted).

Questions

The following cell shows how to export the communities if you want to use them in other sections.

Conclusion

Here we've performed network analyses based on a metabolite-gene association network. We've explored different centrality measures to characterize the networks, and identified communities of genes in these networks. We have also used gene set enrichment analysis to characterize these communities based on their genes, but it remains to be shown whether similar results would be attained if we considered the metabolites in each community.