Network diffusion based methods in integrating 'omics data¶

Sergiu Netotea, PhD, NBIS, Chalmers

Network fusion
Similarity network fusion, explained in detail
Applications

Classification of graph based integration methods¶

Network-Based Approaches: - Graph Construction, Multi-Modal Networks: Integrating multiple omics datasets into one comprehensive graph that allows for the analysis of cross-layer interactions. - Node/Edge Weighting: Some methods apply weighting strategies to nodes and edges to emphasize biological relevance, which can assist in identifying key components within the network.

Algorithmic Methods:

Network Diffusion/Propagation: Methods that allow information (e.g., signals or perturbations) to propagate through the network to detect influential nodes or subnetworks. Examples: Similarity network fusion, Graph autoencoders
Causal Inference: Algorithms that attempt to infer causality between nodes based on their interactions and the multi-omics data layers. Example: Graphical models based on statistical learning, Recurrent Graph Learning

Machine Learning Approaches:

Integration with Network-Based Approaches: Machine learning models are sometimes used in tandem with network approaches to enhance predictive power and biological insight. These methods are characterized by the existence of a fittness function. Examples: Graph embedding, Graph Neural Networks

Read mode:

Agamah FE, Bayjanov JR, Niehues A, Njoku KF, Skelton M, Mazandu GK, Ederveen THA, Mulder N, Chimusa ER, 't Hoen PAC. Computational approaches for network-based integrative multi-omics analysis. Front Mol Biosci. 2022 Nov 14;9:967205. doi: 10.3389/fmolb.2022.967205. PMID: 36452456; PMCID: PMC9703081.

Network based models in biology¶

Network models are a very complex representation of data:

Power law sophistication: for every n vertices there are up to n(n-1) possible edges
This fits biological complexity, and draws a paralel to the "curse of dimensionality"
I like to see graphs as yet another data representation format, along with tabular data, multi channel signal intensities or spatial microscopy..
Networks too can handle differences in scale, collection bias and noise in each data set
They can also handle the complementary nature of the information provided by different types of data

the potential for feature cleaning and data transformation using a network structure are great
one important tradeoff: losing a weak signal due to a poorly made network model
...one important benefit: recovering a weak signal due to the cummulative power of weak links

Common network (graph) representation¶

Typical dataset:

features: samples, patients, genes, metabolites, clinical descriptors etc
instances: any repeated measurements of the above!

Via network modelling, all instances are turned into feature relationships

Nodes (vertices): features
Links (edges): numerical relationship between nodes (directional, boolean, discrete, continuous)

Via network fusion, the feature relationships (links) are described based on multiple datasets.

NF basics

How can networks be fused?¶

Simple rule: average edge values by summing up adjacency matrices!

Pitfalls:

Most omics networks are sparse: some feature relationships eh .. are prevalent only on certain 'omics levels!
Guilt of association is not captured: features similar to pan-omics features should not the treated equally.

Complex rule: take in consideration the infomation difussivity in each network when fusing them!

in graph computing this is called "message passing".
pitfalls: model parametrization

(Perceived) advantages of network fusion¶

Very useful when:

data is incomplete
dataset quality is not uniform
data from multiple sources is used
networks can be built in complex ways, taking into account additional, annotated information.
easy to transfer knowledge from alternative models

It is a generative model!

example: generate data from similarity matrix
- https://en.wikipedia.org/wiki/Multidimensional_scaling
the generated networks can be used for graph clustering, regression, classification.

SNF: Similarity Network Fusion¶

Wang, Bo et al. “Similarity network fusion for aggregating data types on a genomic scale.” Nature methods vol. 11,3 (2014): 333-7.

https://doi.org/10.1038/nmeth.2810

mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets.
clustering cancer subtypes and predicting survival
uses networks of samples (patient networks) as a basis for integration.
at the time patient-similarity networks have not been used specifically for integrating biological data

Similarity networks¶

At every 'omics level, there are multiple ways to compute a network.
Similarity networks are only investigating the similarity of one feature to another, with regards to a distance measure.
Every similarity distance is different. See https://en.wikipedia.org/wiki/Metric_space TODO: similarity tables

similarity

SNF Method¶

Similarity distance:

contininuous data: euclidean distance, weighted using an exponential kernel function $f(x) = \exp{\left( -\frac{\sum_{i=1}^{n}(x_i-y_j)^2}{2\sigma^2} \right)}$
discrete data: chi squared distance $\sum_{i=1}^{n} \cfrac{(x_i - y_i)^2} {(x_i + y_i)}$
boolean data: agreement based distances (Jaccard, Hamming)

SNF Method - the similarity matrix¶

P measures a given patient's similarity to all other patients
In order to fuse the supplied m simmilarity (affinity) matrices, each must be normalized.
A traditional normalization on an affinity matrix would suffer from numerical instabilities due to the self-similarity along the diagonal; thus, a modified normalization is used:

$$ \mathbf{P}(i,j) = \left\{\begin{array}{rr} \frac{\mathbf{W}_(i,j)} {2 \sum_{k\neq i}^{} \mathbf{W}_(i,k)} ,& j \neq i \\ 1/2 ,& j = i \end{array}\right. $$

SNF Method - the kernel matrix¶

Under the assumption that local similarities are more important than distant ones, a more sparse weight matrix is calculated based on a KNN framework:
S measures a given patient's similarity to the K most similar patients.

$$ \mathbf{S}(i,j) = \left\{\begin{array}{rr} \frac{\mathbf{W}_(i,j)} {\sum_{k\in N_{i}}^{}\mathbf{W}_(i,k)} ,& j \in N_{i} \\ 0 ,& \text{otherwise} \end{array}\right. $$

SNF Method - Inductive network fusion¶

The two weight matrices P and S thus provide information about a given patient's similarity to all other patients and the K most similar patients, respectively.
These m matrices are then iteratively fused. At each iteration, the matrices are made more similar to each other via message passing:

$$ \mathbf{P}^{(v)} = \mathbf{S}^{(v)} \times \frac{\sum_{k\neq v}^{}\mathbf{P}^{(k)}}{m-1} \times (\mathbf{S}^{(v)})^{T}, v = 1, 2, ..., m $$

Obs:

After each iteration, the resultant P matrices are normalized via the equation above.
Fusion stops after t iterations, or when the matrices $\mathbf{P}^{(v)}, v = 1, 2, ..., m$ converge.
The output fused matrix is full rank and can be subjected to clustering and classification.

Different usage of the fused matrix:

clustering: spectral clustering, consensus clustering
network model based subtype prediction: label propagation
network model based feature selection: measuring the change in consistence brought by SNF on a single feature

Post-fusion analysis - Spectral clustering¶

The fitted model is a similarity network (an affinity matrix). Thus to cluster one must use graph algorithms such as spectral clustering. Other popular graph clustering methods are: K-neighbors clustering, MCL clustering

Spectral clustering:

eigenvalues, eigenvectors $A x = \lambda x$
graph laplacian: degree matrix - adjacency matrix
the increasing ordered eigenvalues of the graph laplacian matrix are describing the clusters in their increased network effect

SNF usage examples¶

Paper study:¶

MoGCN, a multi-omics integration model based on graph convolutional network (GCN) - https://github.com/Lifoof/MoGCN - Li X, Ma J, Leng L, Han M, Li M, He F, Zhu Y. MoGCN: A Multi-Omics Integration Method Based on Graph Convolutional Network for Cancer Subtype Analysis. Front Genet. 2022 Feb 2;13:806842. doi: 10.3389/fgene.2022.806842. PMID: 35186034; PMCID: PMC8847688

Cancer subtype classification and analysis. Genomics, transcriptomics and proteomics datasets for 511 breast invasive carcinoma (BRCA) samples were downloaded from the Cancer Genome Atlas (TCGA).
The autoencoder (AE) and the similarity network fusion (SNF) methods were used to reduce dimensionality and construct the patient similarity network (PSN), respectively
Then the vector features and the PSN were input into the GCN for training and testing. Feature extraction and network visualization were used for further biological knowledge discovery and subtype classification.

No description has been provided for this image

Paper study:¶

Wang, C., Lue, W., Kaalia, R. et al. Network-based integration of multi-omics data for clinical outcome prediction in neuroblastoma. Sci Rep 12, 15425 (2022). https://doi.org/10.1038/s41598-022-19019-5

Aim: integrate multi-omics data (like gene expression and DNA methylation) for predicting clinical outcomes in neuroblastoma, a pediatric cancer.
Using Patient Similarity Networks (PSNs) derived from omics features, they create networks where patients are nodes and edges represent their similarity based on omics data. They apply two methods for data fusion: at feature level and at network level
Their results show that network-level fusion generally outperforms feature-level fusion for integrating diverse omics datasets, while feature-level fusion is effective when combining different features within the same omics dataset.
Feature-level fusion: Combines features derived from each omics dataset into a single feature set by concatenating or averaging features like centrality and modularity from PSNs. For each omics dataset m, a Patient Similarity Network (PSN) is constructed. Let x_m represent the feature vector of the m-th omics dataset for a subject. The feature-level fusion is performed as follows:
- Extract centrality and modularity features from each PSN.
- Compute the mean of centrality features and concatenate the modularity features from each omics dataset: $$ x_{\text{fused}} = \frac{1}{M} \sum_{m=1}^{M} x_m $$ , where M is the total number of omics datasets.

The fused feature vector $x_{\text{fused}}$ is used as input to machine learning classifiers for clinical outcome prediction.

Network-level fusion: PSNs from individual omics datasets are combined to form a single multi-omics PSN. The fusion is performed using the Similarity Network Fusion (SNF) algorithm, which combines the similarity matrices (A_m) of individual PSNs: $$ A_{\text{fused}} = \text{SNF}(A_1, A_2, \ldots, A_M) $$ The fused similarity matrix $ A_{\text{fused}} $ represents the multi-omics PSN, which is then used for downstream prediction tasks.

No description has been provided for this image

Paper study:¶

Wang, J., Liao, N., Du, X. et al. A semi-supervised approach for the integration of multi-omics data based on transformer multi-head self-attention mechanism and graph convolutional networks. BMC Genomics 25, 86 (2024). https://doi.org/10.1186/s12864-024-09985-7 Searched 2 sites

Uses a semi-supervised learning framework for disease classification, combining transformer multi-head self-attention mechanisms with graph convolutional networks (GCNs) to extract meaningful relationships between samples. The model integrates labeled and unlabeled omics data, using the attention mechanism to capture dependencies across features and GCNs to capture graph-based relationships.
Omics used: mRNA expression, microRNA expression, and DNA methylation. These data types are integrated to improve the prediction accuracy of disease classifications, such as Alzheimer's disease and breast cancer.
Self-Attention Mechanism: Captures intra- and inter-modality feature dependencies.
Graph Convolutional Networks (GCNs): Extract structural information from the multi-omics graph, enabling better representation of relationships between data points.
Semi-Supervised Learning: Utilizes both labeled and unlabeled data to improve model training, mitigating the limitations of small labeled datasets often found in multi-omics studies.

No description has been provided for this image

Latest trends - single cell integration¶

Using SNF and autoencoders for single cell clustering
- Tran, B., Tran, D., Nguyen, H. et al. scCAN: single-cell clustering using autoencoder and network fusion. Sci Rep 12, 10267 (2022). https://doi.org/10.1038/s41598-022-14218-6

Consensus clustering¶

Monti, S., Tamayo, P., Mesirov, J. et al. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52, 91–118 (2003). https://doi.org/10.1023/A:1023949509487
Implementation: http://www.bioconductor.org/packages/release/bioc/html/CancerSubtypes.html

SNF drawbacks:

the unstable nature of kernel-based clustering
sensitive to small changes in molecular measurements
need for hyper-parametrization

Details:

The data are first partitioned with different values of k (number of clusters).
For each value of k, we construct the pair-wise connectivity matrix.
To identify the number of clusters we add noise to the data and then build the pairwise connectivity for the perturbed data.
We calculate the discrepancy in pairwise connectivity before and after data perturbation. We choose opt_k as the optimal number of clusters for which the pair-wise connectivity is the most stable.

Integrative SNF-CC example:¶

Nguyen, Tin et al. “A novel approach for data integration and disease subtyping.” Genome research vol. 27,12 (2017): 2025-2039.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741060/

Stages:

construct the combined similarity matrix between patients using the connectivity information from individual data types.
partition the patients using the integrated similarity matrix.
further split each discovered group of patients into subgroups if possible.

Weighted similarity network fusion¶

Links:

Xu T, Le TD, Liu L, Wang R, Sun B, et al. (2016) Identifying Cancer Subtypes from miRNA-TF-mRNA Regulatory Networks and Expression Data. PLOS ONE 11(4): e0152792. https://doi.org/10.1371/journal.pone.0152792
Implementation:
- https://rdrr.io/bioc/CancerSubtypes/
- https://rdrr.io/cran/IntClust/

WSNF - Method (1)¶

Compute PageRank based ranking of features. Important features are ranked based on the in-number of TFs.
- Build the regulatory network where the nodes represent the features, i.e. the microRNAs (miRNAs), transcription factors (TFs) and messenger RNAs (mRNAs) and the edges indicate the interactions between the features.
- The interactions are retrieved from various interatomic databases.

No description has been provided for this image

WSNF - Method (2)¶

Integrate feature ranking with feature variation. Gene expression based.
- Use knn based kernels and the expression data of the miRNAs, TFs and mRNAs to calculate the weight of the features, representing the level of importance of the features.
Weighted SNF
- The feature weight is then integrated into a network fusion approach to cluster the samples (patients) and thus to identify cancer subtypes.
- $D(s_i, s_j) = \sqrt{\sum_{m=1}^p{W(f_m)*(f_m^{s_i}-f_m^{s_j})^2}}$

No description has been provided for this image

ANF - Affinity network fusion¶

Tianle Ma, Aidong Zhang, Affinity Network Fusion and Semi-supervised Learning for Cancer Patient Clustering, 2019, https://doi.org/10.1016/j.ymeth.2018.05.020
Software: https://www.bioconductor.org/packages/release/bioc/html/ANF.html
- https://github.com/BeautyOfWeb/ANF
Data: gene expression, miRNA expression and DNA methylation

No description has been provided for this image

Overall, the methodology is similar to SNF. Weak edges derived from a similarity distance are pruned using a kNN based kernel. The same is acomplished using the message passing kernel in SNF.
The matrix fusion process is random-walk based, and it is a bit more clear, although it parametrizes feature importance with a set of omics-specific weights:
- $ W = \sum_{v=1}^n{w_v * W^{(v)}}$, $\sum_{v=1}^n{w_v}=1$

W regarded as a "state transition matrix", and multiplying it r times is a random walk in its generated graph. $W = W^r$
- this step is similar to the inductive message passing in SNF
- r parameter must be small (2, 3), or W becomes a rank one matrix
- a more complex random walk was also explored, that uses some of the knn pruned weaker edges.

Obs. While the SNF fitting process might look long, the patient matrix is usually a small network (most clinical datasets are in the order of tens - hundreds)!

ANF application - Semi-supervised learning on patient affinity networks¶

https://arxiv.org/abs/1805.09673

a good example for the potential of NF to combine supervised and unsupervised learning through good representation learning.
97% accuracy on test set with training less than 1% of data for classifying patients into correct disease types
ANF obtains effective kNN-based nonlinear transformations that reduce noise in multi-omic data

No description has been provided for this image

Subsequent SNF studies:¶

Guo, Yang et al. “Improvement of cancer subtype prediction by incorporating transcriptome expression data and heterogeneous biological networks.” BMC medical genomics vol. 11,Suppl 6 119. 31 Dec. 2018, doi:10.1186/s12920-018-0435-x
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311915/

Main ideas:

ANF did not consider the feature importance and the feature relationships in data integration (different regulatory mechanisms may exist in different cancer subtypes).

Method: CSPRV (Cancer Subtype Prediction using RV2)

Given the expression data of genome elements, we first extract multiple expression features for each regulatory element based on the heterogeneous biological networks. Based on the extracted feature matrices of samples, we use a matrix correlation method, RV2, to predict the similarities between samples in each expression data-view, and then fuse the similarity information in samples from all considering data-views according to different integration weights. Finally, we cluster patient samples into different cancer subtypes based on the predicted integrative similarity network between samples.

SKF - Similarity Kernel Fusion¶

https://en.wikipedia.org/wiki/Kernel_method
multiple papers:
- Jiang, Limin et al. “Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data.” Frontiers in genetics vol. 10 20. 5 Feb. 2019, doi:10.3389/fgene.2019.00020
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370730/
- https://pubmed.ncbi.nlm.nih.gov/30619454/
- https://pubmed.ncbi.nlm.nih.gov/31514111/
Also search for MKL (multiple kernel learning)

The Kernel trick¶

(wikipedia) "kernel functions, enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space."

$P_l^{t+1}=\alpha(S_l \times \frac{\sum_{r \neq l}P_r^t}{n} \times S_l^T) + (1-\alpha) \frac{\sum_{r \neq l}P_r^t}{n}$

No description has been provided for this image

Software, papers, further reading¶

Papers:¶

SNF similarity network fusion: https://www.nature.com/articles/nmeth.2810
- https://media.nature.com/original/nature-assets/nmeth/journal/v11/n3/extref/nmeth.2810-S1.pdf
SNF.CC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741060/
WSNF: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152792
SKF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370730/#:~:text=(2016)%20proposed%20Weighted%20Similarity%20Network,cancer%20types%20to%20demonstrate%20performance.
Speicher and Pfeifer (2015) https://pubmed.ncbi.nlm.nih.gov/26072491/ pointed out that iCluster has high computational complexity and proposed a dimensionality reduction method to integrate multiple similarity kernels. This method is evaluated by using five cancer types.
ANF affinity network fusion: https://arxiv.org/abs/1805.09673
INF integrative network fusion: https://www.frontiersin.org/articles/10.3389/fonc.2020.01065/full
fusion methods review (2020): https://www.sciencedirect.com/science/article/pii/S2001037019304155
icluster: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0035236
- https://pubmed.ncbi.nlm.nih.gov/19759197/
IntClust https://rdrr.io/cran/IntClust/

Software:¶

ANF: https://www.bioconductor.org/packages/release/bioc/html/ANF.html
- https://github.com/BeautyOfWeb/ANF
R: CancerSubtypes https://bioconductor.org/packages/devel/bioc/vignettes/CancerSubtypes/inst/doc/CancerSubtypes-vignette.html
SNF Python: https://github.com/rmarkello/snfpy
SNF IntClust: https://rdrr.io/cran/IntClust/man/SNF.html

In [ ]: