Important
Data: MOFA's CLL dataset
Optional:
Code:
data_loc = "C:\\mydata\\work\\io_course\\data\\"
import pandas as pd
df_meth = pd.read_csv(data_loc + "CLL_data_Methylation.csv", index_col=0)
df_mrna = pd.read_csv(data_loc + "CLL_data_mRNA.csv", index_col=0)
# drop columns that contain NaNs
df_mrna = df_mrna.dropna(axis='columns')
df_meth = df_meth.dropna(axis='columns')
df_mrna = df_mrna.T
df_meth = df_meth.T
cols = df_meth.columns.copy()
columns = {}
for c in cols:
    # split each signed column into a non-negative positive/negative pair
    mask = df_meth[c] < 0
    columns[c + '_p'] = df_meth[c].mask(mask)
    columns[c + '_n'] = -df_meth[c].mask(~mask)
df_meth = pd.concat(list(columns.values()), keys=list(columns.keys()), axis=1)
df_meth = df_meth.fillna(0)
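The loop above splits each signed methylation column into a non-negative pair ('_p' keeps the original non-negative values, '_n' the negated negatives), which keeps the matrix non-negative for NMF-style methods. A minimal toy illustration of the same trick:

```python
import pandas as pd

# Toy column with mixed signs, standing in for one methylation column
s = pd.Series([0.5, -0.3, 0.0, -1.2], name="cpg1")

mask = s < 0
pos = s.mask(mask)    # negatives -> NaN, keeps values >= 0
neg = -s.mask(~mask)  # non-negatives -> NaN, negates what remains

demo = pd.concat([pos, neg], keys=["cpg1_p", "cpg1_n"], axis=1).fillna(0)
print(demo)
```

Every entry of `demo` is non-negative, and the original column is recoverable as `demo["cpg1_p"] - demo["cpg1_n"]`.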
X = pd.concat([df_mrna.T, df_meth.T])
X = X.dropna(axis='columns')
print(X.shape)
(13496, 135)
df_mrna.T.shape[0]
5000
X1 = X.iloc[5000:, :].T  # methylation (rows after the first 5000)
X2 = X.iloc[:5000, :].T  # mRNA (the first 5000 rows)
After loading the files and doing the required transformations (check the NMF lab), I finally have each dataset in its own matrix: X1 is methylation and X2 is mRNA. Now computing the affinity matrices...
print(X1.values.shape, X2.values.shape)
(135, 8496) (135, 5000)
from snf import compute
affinities = compute.make_affinity([X1.values, X2.values], metric='euclidean', K=20, mu=0.5)
affinities[0].shape
(135, 135)
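Under the hood, make_affinity converts each view's distance matrix into a scaled exponential kernel, with K setting the neighbourhood used for the local bandwidth and mu scaling it. A rough NumPy sketch of the idea from the SNF paper — not snfpy's exact implementation:

```python
import numpy as np

def affinity_sketch(X, K=20, mu=0.5):
    """Scaled exponential kernel in the spirit of the SNF paper.

    The bandwidth sigma_ij blends each sample's mean distance to its
    K nearest neighbours; this is a sketch, not snfpy's exact code.
    """
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise euclidean
    knn_mean = np.sort(D, axis=1)[:, 1:K + 1].mean(axis=1)  # skip self (col 0)
    sigma = mu * (knn_mean[:, None] + knn_mean[None, :] + D) / 3
    W = np.exp(-D ** 2 / (2 * sigma ** 2))
    return W

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 5))
W = affinity_sketch(X_toy)
print(W.shape)  # square: samples x samples
```

The result is a symmetric samples-by-samples matrix with values in (0, 1], which is why affinities[0] above is 135 x 135 regardless of how many features each view has.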
Applying the SNF method with K = 20 neighbours (check the course slides for what this parameter means). Then I extract the two best estimates for the number of clusters from the fused matrix; these feed into spectral clustering below.
fused = compute.snf(affinities, K=20)
first, second = compute.get_n_clusters(fused)
first, second
(2, 4)
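get_n_clusters ranks candidate cluster counts by the eigengap heuristic on the fused network's normalized graph Laplacian. A self-contained sketch of that heuristic (eigengap_sketch is a hypothetical helper for illustration, not snfpy's code):

```python
import numpy as np

def eigengap_sketch(W, max_k=10):
    """Suggest cluster counts from the eigengaps of the normalized
    Laplacian L = I - D^{-1/2} W D^{-1/2} (illustrative sketch)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))[:max_k]
    gaps = np.diff(eigvals)
    # skip the first gap (trivial eigenvalue); gap index g maps to k = g + 1
    order = np.argsort(gaps[1:])[::-1] + 2
    return order[:2]

# Two well-separated blobs: the top suggestion should be k = 2
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
W = np.exp(-dist ** 2)
k1, k2 = eigengap_sketch(W)
print(k1, k2)
```

A large gap after the k-th smallest eigenvalue suggests k well-separated components, which is how the (2, 4) pair above should be read: two candidate cluster counts, best first.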
import numpy as np
# placeholder block labels, one block per data type; unused below
all_labels = np.array([1 for i in range(135)] + [2 for i in range(135)])
from sklearn import cluster
fused_labels = cluster.spectral_clustering(fused, n_clusters=first)
#labels = [all_labels, fused_labels]
labels = [fused_labels]
for arr in affinities:
    labels += [cluster.spectral_clustering(arr, n_clusters=first)]
C:\Users\sergi\anaconda3\envs\iocourse\lib\site-packages\sklearn\cluster\_kmeans.py:1382: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1. warnings.warn(
fused_labels
array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1])
You can now compare the clusters estimated from the fused matrix with the clusterings obtained on the individual similarity matrices, and compute additional indicators such as NMI and silhouette scores.
labels
[array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]), array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]), array([0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])]
from snf import metrics
nmi = metrics.nmi(labels)
print(nmi)
[[1.         0.68746257 0.01308389]
 [0.68746257 1.         0.00350462]
 [0.01308389 0.00350462 1.        ]]
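The matrix above is the pairwise NMI between the three label assignments (the fused matrix and the two single-view matrices). The same numbers can be cross-checked with scikit-learn's normalized_mutual_info_score; here on small made-up labels rather than the actual run:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Toy label assignments standing in for the fused / single-view labels
toy_labels = [
    np.array([0, 0, 1, 1, 1, 0]),
    np.array([1, 1, 0, 0, 0, 1]),  # same partition, cluster names flipped
    np.array([0, 1, 0, 1, 0, 1]),
]

n = len(toy_labels)
nmi = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        nmi[i, j] = nmi[j, i] = normalized_mutual_info_score(
            toy_labels[i], toy_labels[j])

print(np.round(nmi, 3))
```

Note that NMI is invariant to relabeling, so the first two assignments score 1.0 even though the cluster names are swapped.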
import numpy as np
np.fill_diagonal(fused, 0)
sil = metrics.silhouette_score(fused, fused_labels)
print('Silhouette score for the fused matrix is: {:.2f}'.format(sil))
Silhouette score for the fused matrix is: 0.24
Since we are dealing with network models, we cannot rely on feature weights and scores here; instead, we compute network centralities to find the samples driving the signal in one of the clusters.
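Eigenvector centrality on the subnetwork induced by one cluster is one concrete way to rank samples by how strongly they drive that cluster's signal. A hypothetical sketch (top_central_samples is an illustrative helper for any symmetric, non-negative affinity matrix such as the fused one):

```python
import numpy as np

def top_central_samples(W, labels, cluster, n_top=5):
    """Rank samples in one cluster by eigenvector centrality of the
    cluster-induced subnetwork (sketch; assumes W is symmetric and
    non-negative with a zeroed diagonal)."""
    idx = np.where(labels == cluster)[0]
    sub = W[np.ix_(idx, idx)]
    eigvals, eigvecs = np.linalg.eigh(sub)
    centrality = np.abs(eigvecs[:, -1])  # leading eigenvector
    order = np.argsort(centrality)[::-1][:n_top]
    return idx[order], centrality[order]

# Toy fused matrix: 6 samples in two clusters of 3
rng = np.random.default_rng(2)
W = rng.uniform(0.1, 0.3, size=(6, 6))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
toy_labels = np.array([0, 0, 0, 1, 1, 1])
ids, scores = top_central_samples(W, toy_labels, cluster=0, n_top=2)
print(ids, scores)
```

On the real data this would be called as `top_central_samples(fused, fused_labels, cluster=0)`; a simple weighted-degree centrality (`sub.sum(axis=1)`) is a cheaper alternative worth comparing against.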
Further task: