SNF lab¶
Task (45 minutes)¶
Important
- Please don't cry if you are unable to finish every single task in 45 minutes.
- Think of these tasks more like thought challenges, or future challenges for your own efforts.
Data: MOFA's CLL dataset
- Load and prepare data, then compute the affinity matrices and perform SNF.
- Assume that there are two cancer subtypes and cluster them :)
- Plot the fused network and the clusters using networkx, Gephi, igraph, or whatever else you prefer.
Optional:
- Compare clusters to NMF
- Use the other omics tables as well
- Compare with the results from the MOFA analysis https://bioconductor.riken.jp/packages/3.9/bioc/vignettes/MOFA/inst/doc/MOFA_example_CLL.html
Code:
- The lab example uses an SNF implementation in Python:
- For R, feel free to use this package:
import pandas as pd

data_loc = "data/"

# The raw CSVs store features as rows and samples as columns.
df_meth = pd.read_csv(data_loc + "CLL_data_Methylation.csv", index_col=0)
df_mrna = pd.read_csv(data_loc + "CLL_data_mRNA.csv", index_col=0)

# Drop samples (columns) with missing values.
df_mrna = df_mrna.dropna(axis='columns')
df_meth = df_meth.dropna(axis='columns')

# Transpose to samples x features.
df_mrna = df_mrna.T
df_meth = df_meth.T
# Split each signed methylation feature into a non-negative positive
# part and a non-negative negative part (same trick as in the NMF lab).
cols = df_meth.columns.copy()
columns = {}
for c in cols:
    mask = df_meth[c] < 0
    columns[c + '_p'] = df_meth[c].mask(mask)
    columns[c + '_n'] = -df_meth[c].mask(~mask)
df_meth = pd.concat(list(columns.values()), keys=list(columns.keys()), axis=1)
df_meth = df_meth.fillna(0)
# Stack both omics as features x samples; dropping NaN columns keeps
# only the samples present in both tables.
X = pd.concat([df_mrna.T, df_meth.T])
X = X.dropna(axis='columns')
print(X.shape)
(13496, 135)
df_mrna.T.shape[0]
5000
# Split back into per-omics samples x features matrices: the first 5000
# rows of X are mRNA, the remaining 8496 are the split methylation features.
X1 = X.iloc[5000:, :].T  # methylation
X2 = X.iloc[:5000, :].T  # mRNA
After loading the files and applying the required transformations (check the NMF lab), I finally have each dataset as a samples-by-features matrix: X1 is methylation and X2 is mRNA. Now to compute the affinity matrices.
print(X1.values.shape, X2.values.shape)
(135, 8496) (135, 5000)
from snf import compute
affinities = compute.make_affinity([X1.values, X2.values], metric='euclidean', K=20, mu=0.5)
affinities[0].shape
(135, 135)
Applying the SNF method with a K parameter of 20 (check the course slides for what this means). Then I estimate the number of clusters in the fused network with get_n_clusters, which uses the eigengap heuristic and returns the two best candidates; the actual cluster labels come from spectral clustering below.
fused = compute.snf(affinities, K=20)
first, second = compute.get_n_clusters(fused)
first, second
(np.int64(2), np.int64(4))
import numpy as np

# Placeholder labels (not used below; see the commented-out line).
all_labels = np.array([1 for i in range(135)] + [2 for i in range(135)])
from sklearn import cluster

# Spectral clustering on the fused network, then on each individual
# affinity matrix for comparison.
fused_labels = cluster.spectral_clustering(fused, n_clusters=first)
#labels = [all_labels, fused_labels]
labels = [fused_labels]
for arr in affinities:
    labels += [cluster.spectral_clustering(arr, n_clusters=first)]
fused_labels
array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=int32)
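To satisfy the plotting part of the task, here is a minimal networkx/matplotlib sketch, assuming the fused matrix and fused_labels from the cells above; the 90th-percentile edge cutoff is an arbitrary choice to keep the drawing readable:

import networkx as nx
import matplotlib.pyplot as plt

# Work on a copy so the fused matrix itself stays untouched.
A = fused.copy()
np.fill_diagonal(A, 0)           # drop self-loops
A[A < np.percentile(A, 90)] = 0  # keep only the strongest ~10% of edges

G = nx.from_numpy_array(A)
pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
nx.draw_networkx_nodes(G, pos, node_color=fused_labels, cmap='coolwarm', node_size=60)
nx.draw_networkx_edges(G, pos, alpha=0.2)
plt.axis('off')
plt.show()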
You can now compare the estimated clusters from the fused matrix with the clustering done on the individual affinity matrices, as well as compute other indicators such as NMI and silhouette scores.
labels
[array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=int32), array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0], dtype=int32), array([1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0], dtype=int32)]
from snf import metrics
nmi = metrics.nmi(labels)
print(nmi)
[[1.         0.68746257 0.01308389]
 [0.68746257 1.         0.00350462]
 [0.01308389 0.00350462 1.        ]]
import numpy as np

# Remove self-affinities before computing the affinity-based silhouette.
np.fill_diagonal(fused, 0)
sil = metrics.silhouette_score(fused, fused_labels)
print('Silhouette score for the fused matrix is: {:.2f}'.format(sil))
Silhouette score for the fused matrix is: 0.24
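For the first optional task, we can compare the SNF clusters with an NMF-based clustering of the stacked matrix. A rough scikit-learn sketch, assuming X (features x samples) is entirely non-negative after the transformations above; if the mRNA table contains negative values, it would need the same positive/negative split as the methylation data:

from sklearn.decomposition import NMF
from sklearn.metrics import normalized_mutual_info_score

# Two components to match the assumed number of subtypes; each sample
# is assigned to its dominant component.
model = NMF(n_components=2, init='nndsvd', max_iter=500, random_state=0)
W = model.fit_transform(X.values)  # features x components
H = model.components_              # components x samples
nmf_labels = H.argmax(axis=0)

# Agreement with the SNF clusters (invariant to label permutation).
print(normalized_mutual_info_score(fused_labels, nmf_labels))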
Downstream analysis¶
Since we are dealing with network models, we cannot rely on feature weights or factor scores here; instead, we compute network centralities to find the samples driving the signal in one of the clusters.
Further tasks:
- Compute network centralities (degree or betweenness) for the first cluster and estimate which samples are driving the cluster signal; a sketch follows below.
- We cannot read the responsible features off the network directly, but we could perform differential expression between the clusters to extract the relevant features; see the second sketch below.
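A minimal sketch for the centrality task, assuming the fused matrix (diagonal already zeroed above) and fused_labels are still in scope; sample names come from X1.index:

import networkx as nx

# Weighted graph from the fused affinities, restricted to the first cluster.
G = nx.from_numpy_array(fused)
cluster0 = [i for i, lab in enumerate(fused_labels) if lab == 0]
G0 = G.subgraph(cluster0).copy()

# Weighted degree ("strength"): total affinity to the rest of the cluster.
strength = dict(G0.degree(weight='weight'))

# Betweenness expects distances, so invert the affinity weights first.
for u, v, d in G0.edges(data=True):
    d['distance'] = 1.0 / d['weight']
betweenness = nx.betweenness_centrality(G0, weight='distance')

# The highest-centrality samples are the ones driving the cluster signal.
top = sorted(strength, key=strength.get, reverse=True)[:5]
print([X1.index[i] for i in top])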
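And for the differential-expression idea, a rank-based sketch on the mRNA matrix X2 (samples x genes); the Mann-Whitney test with a Benjamini-Hochberg correction is one reasonable choice among many, and statsmodels is assumed to be installed:

from scipy import stats
from statsmodels.stats.multitest import multipletests

# Mann-Whitney U test per gene between the two fused clusters.
mask = fused_labels == 0
pvals = np.array([
    stats.mannwhitneyu(X2.values[mask, j], X2.values[~mask, j]).pvalue
    for j in range(X2.shape[1])
])

# FDR correction; the smallest q-values point at the features most
# associated with the cluster split.
reject, qvals, _, _ = multipletests(pvals, method='fdr_bh')
print(X2.columns[np.argsort(qvals)[:10]].tolist())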