NMF lab¶
Task (45 minutes)¶
Important
- This task is optimized for speed, not quality. Normally you would look into more components.
Data: MOFA's CLL dataset
- Load and prepare the data, then perform NMF on the joint representation of the methylation and mRNA datasets.
- Assume that there are two cancer subtypes and cluster them :)
- Find the driving features and verify their functionality.
Code:
- The lab example uses scikit-learn's NMF implementation in Python.
- For R, feel free to use any equivalent NMF package.
Load the datasets:
In [1]:
import pandas as pd

data_loc = "data/"

# Methylation (CpG probes) and mRNA (Ensembl genes) tables of the CLL dataset
df_meth = pd.read_csv(data_loc + "CLL_data_Methylation.csv", index_col=0)
df_mrna = pd.read_csv(data_loc + "CLL_data_mRNA.csv", index_col=0)

# drop NaNs by column, then transpose so that rows are samples
df_mrna = df_mrna.dropna(axis='columns')
df_meth = df_meth.dropna(axis='columns')
df_mrna = df_mrna.T
df_meth = df_meth.T
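A quick, optional sanity check (a minimal sketch): after the transpose, rows should be samples and columns features.
print(df_mrna.shape)  # (samples, genes)
print(df_meth.shape)  # (samples, probes)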
In [2]:
df_meth.head()
Out[2]:
cg10146935 | cg26837773 | cg17801765 | cg13244315 | cg06181703 | cg19626656 | cg15207968 | cg12755103 | cg23651812 | cg14287724 | ... | cg07016730 | cg25152348 | cg08425796 | cg05418105 | cg22249529 | cg07600533 | cg08260245 | cg19112186 | cg10770023 | cg00270625 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H045 | 1.811086 | -5.172572 | 5.411526 | -0.118825 | 5.120384 | 0.145951 | -3.436869 | -3.844246 | 2.075422 | 3.501829 | ... | 3.547843 | 0.060132 | 4.442026 | 2.861301 | 5.246799 | 3.901933 | 5.713831 | 5.703520 | 5.166255 | 4.911655 |
H109 | -3.997508 | 1.594870 | 5.412693 | 1.043871 | 1.279480 | -3.928433 | 2.989245 | 0.393004 | 4.800121 | 3.159201 | ... | 0.887926 | -0.214753 | 4.561187 | 3.919911 | 5.058302 | 2.634941 | 5.107460 | 1.326244 | 0.677912 | 5.281115 |
H024 | -2.844313 | 0.161170 | 0.365706 | -4.219236 | 0.721100 | -3.418859 | -3.250385 | -2.691305 | 0.534854 | -4.629484 | ... | -4.486709 | 0.121749 | -2.841373 | -3.607177 | 0.765651 | 1.516759 | 5.676245 | 5.488636 | 4.221828 | 5.379716 |
H056 | -3.338656 | -2.093433 | 0.373634 | -1.592196 | 4.047059 | 0.226601 | 2.377386 | -2.775075 | 0.419985 | 0.312388 | ... | -4.238214 | 0.137862 | -3.964855 | -2.270940 | -2.631909 | -3.884756 | 5.950338 | 5.354059 | 4.934536 | 5.366823 |
H079 | -0.019362 | 3.748980 | 5.412010 | 1.416418 | 5.237422 | 0.324213 | -0.647632 | -3.098837 | 5.397188 | 3.410770 | ... | 2.758021 | 0.021011 | 0.673296 | 3.455230 | -3.140733 | -4.238106 | 6.040756 | 5.584746 | 5.095111 | 5.338470 |
5 rows × 4248 columns
To ensure non-negativity, split each feature into a positive and a negative part, adding the negated negative values as new features. Just some ugly code, not important. :)
In [3]:
cols = df_meth.columns.copy()
columns = {}
for c in cols:
    mask = df_meth[c] < 0
    # '_p' keeps the positive values, '_n' the negated negative values;
    # everything else becomes NaN and is zero-filled below
    columns[c + '_p'] = df_meth[c].mask(mask)
    columns[c + '_n'] = -df_meth[c].mask(~mask)
df_meth = pd.concat(list(columns.values()), keys=list(columns.keys()), axis=1)
df_meth = df_meth.fillna(0)
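For reference, a more concise equivalent of the loop above (a sketch, assuming the original pre-split df_meth; note it groups all _p columns before all _n columns instead of interleaving them, and df_meth_alt is a name introduced here):
# Vectorized split: positive part, and negated negative part
df_p = df_meth.clip(lower=0).add_suffix('_p')
df_n = (-df_meth).clip(lower=0).add_suffix('_n')
df_meth_alt = pd.concat([df_p, df_n], axis=1)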
In [4]:
df_meth.T.head()
Out[4]:
H045 | H109 | H024 | H056 | H079 | H164 | H059 | H167 | H113 | H049 | ... | H106 | H176 | H136 | H178 | H166 | H174 | H177 | H259 | H175 | H179 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cg10146935_p | 1.811086 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
cg10146935_n | 0.000000 | 3.997508 | 2.844313 | 3.338656 | 0.019362 | 2.485997 | 1.460211 | 4.952291 | 2.980209 | 0.091713 | ... | 4.946979 | 0.838754 | 4.684879 | 5.077259 | 0.625954 | 4.918812 | 4.727493 | 5.193812 | 4.437189 | 5.060459 |
cg26837773_p | 0.000000 | 1.594870 | 0.161170 | 0.000000 | 3.748980 | 0.060530 | 0.000000 | 0.547577 | 2.440098 | 2.940767 | ... | 0.199544 | 0.341156 | 3.496116 | 0.000000 | 0.000000 | 3.214163 | 2.036858 | 0.000000 | 4.043775 | 0.000000 |
cg26837773_n | 5.172572 | 0.000000 | 0.000000 | 2.093433 | 0.000000 | 0.000000 | 3.472232 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 4.821910 | 0.858200 | 0.000000 | 0.000000 | 0.816088 | 0.000000 | 2.345652 |
cg17801765_p | 5.411526 | 5.412693 | 0.365706 | 0.373634 | 5.412010 | 5.268908 | 0.000000 | 5.337081 | 0.749546 | 0.426493 | ... | 0.574758 | 0.707883 | 0.438992 | 4.873615 | 0.753594 | 0.000000 | 0.000000 | 0.547390 | 4.086683 | 0.135581 |
5 rows × 196 columns
Normalize the datasets: scale each sample column by its mean, then divide by its Euclidean (Frobenius) norm. Afterwards we concatenate the data.
In [5]:
# Scale each sample column by its mean, then by its Euclidean norm
df = df_mrna.T
df = df / df.mean()
fro = df.apply(lambda x: (x**2).sum()**.5, axis=0)
df_mrna = df / fro
In [6]:
# Same normalization for the methylation table
df = df_meth.T
df = df / df.mean()
fro = df.apply(lambda x: (x**2).sum()**.5, axis=0)
df_meth = df / fro
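The two cells above repeat the same steps, so they could be factored into a helper (a minimal sketch; frobenius_normalize is a name introduced here):
import numpy as np

def frobenius_normalize(df):
    # Scale each sample column by its mean, then by its Euclidean norm,
    # so that both modalities contribute comparably to the factorization
    df = df / df.mean()
    return df / np.sqrt((df ** 2).sum())

# Equivalent to the two cells above:
# df_mrna = frobenius_normalize(df_mrna.T)
# df_meth = frobenius_normalize(df_meth.T)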
Unfortunately the datasets are not sample-matched, so we have to do it ourselves: concatenating and then dropping NaN columns keeps only the samples present in both modalities.
In [7]:
X = pd.concat([df_mrna, df_meth])
X = X.dropna(axis='columns')
print(X.shape)
(13496, 135)
In [8]:
X.head()
Out[8]:
H045 | H109 | H024 | H056 | H079 | H164 | H059 | H167 | H113 | H049 | ... | H271 | H006 | H084 | H260 | H192 | H070 | H255 | H135 | H247 | H066 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000244734 | 0.009022 | 0.005391 | 0.020382 | 0.026755 | 0.012140 | 0.005026 | 0.010195 | 0.003092 | 0.004567 | 0.005180 | ... | 0.008746 | 0.017387 | 0.021283 | 0.024401 | 0.013766 | 0.006895 | 0.006624 | 0.003180 | 0.017701 | 0.007995 |
ENSG00000158528 | 0.023238 | 0.026321 | 0.004801 | 0.006514 | 0.023816 | 0.022488 | 0.011320 | 0.005297 | 0.005619 | 0.005180 | ... | 0.007797 | 0.007975 | 0.005282 | 0.025212 | 0.006281 | 0.003239 | 0.025178 | 0.003180 | 0.006260 | 0.024524 |
ENSG00000198478 | 0.017657 | 0.005391 | 0.025392 | 0.016334 | 0.009752 | 0.024930 | 0.007419 | 0.008332 | 0.010429 | 0.005966 | ... | 0.004636 | 0.010719 | 0.018570 | 0.020788 | 0.012456 | 0.003239 | 0.005890 | 0.003180 | 0.016220 | 0.023507 |
ENSG00000175445 | 0.025108 | 0.021643 | 0.003135 | 0.003081 | 0.026607 | 0.021274 | 0.023107 | 0.023463 | 0.004567 | 0.005966 | ... | 0.004636 | 0.018250 | 0.018723 | 0.023259 | 0.003126 | 0.009398 | 0.023147 | 0.003180 | 0.011393 | 0.021184 |
ENSG00000174469 | 0.005235 | 0.025055 | 0.003135 | 0.027334 | 0.010923 | 0.021449 | 0.016399 | 0.005297 | 0.004567 | 0.026762 | ... | 0.028580 | 0.020996 | 0.024090 | 0.018633 | 0.010119 | 0.026308 | 0.003097 | 0.026659 | 0.019995 | 0.021093 |
5 rows × 135 columns
Fitting an NMF model¶
- So now we fit a two-component NMF model and check the results. Note that X is features × samples here, so W (13496 × 2) holds the per-feature loadings and H (2 × 135) the per-sample component weights.
In [9]:
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(X)
H = model.components_
In [10]:
print(H.shape, W.shape)
(2, 135) (13496, 2)
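As a consistency check (a minimal sketch), the Frobenius norm of the residual X − WH should match the reconstruction error reported by the fitted model:
import numpy as np

# X is approximated as W @ H; the residual norm equals model.reconstruction_err_
print(np.linalg.norm(X.values - W @ H))
print(model.reconstruction_err_)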
In [11]:
W
Out[11]:
array([[0.12609143, 0.0962925 ],
       [0.04392555, 0.11771427],
       [0.09126849, 0.10719918],
       ...,
       [0.00193168, 0.01028896],
       [0.17476978, 0.07858805],
       [0.00302261, 0.02073489]])
In [12]:
# Feature loadings: one row per gene/probe, one column per component
W_df = pd.DataFrame(W, columns=["latent_1", "latent_2"], index=X.index)
W_df.head()
Out[12]:
latent_1 | latent_2 | |
---|---|---|
ENSG00000244734 | 0.126091 | 0.096292 |
ENSG00000158528 | 0.043926 | 0.117714 |
ENSG00000198478 | 0.091268 | 0.107199 |
ENSG00000175445 | 0.065235 | 0.146637 |
ENSG00000174469 | 0.197567 | 0.098666 |
In [13]:
# Per-sample component weights
H_df = pd.DataFrame(H, columns=X.columns, index=["latent_1", "latent_2"])
H_df
Out[13]:
H045 | H109 | H024 | H056 | H079 | H164 | H059 | H167 | H113 | H049 | ... | H271 | H006 | H084 | H260 | H192 | H070 | H255 | H135 | H247 | H066 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
latent_1 | 0.005286 | 0.001123 | 0.116696 | 0.1156 | 0.008202 | 0.000000 | 0.108295 | 0.004208 | 0.068335 | 0.053253 | ... | 0.112052 | 0.117386 | 0.000702 | 0.005236 | 0.05371 | 0.112368 | 0.012251 | 0.113764 | 0.116019 | 0.000000 |
latent_2 | 0.144156 | 0.149550 | 0.000000 | 0.0000 | 0.140064 | 0.150715 | 0.009437 | 0.145258 | 0.061816 | 0.082893 | ... | 0.004414 | 0.000000 | 0.150970 | 0.142550 | 0.08071 | 0.004421 | 0.134242 | 0.000000 | 0.000000 | 0.152084 |
2 rows × 135 columns
Clustering¶
- Clustering the samples into two subtypes doesn't require any specific clustering algorithm :) each sample simply goes to the component with the larger weight. But for more subtypes you are welcome to measure up against MOFA.
In [14]:
import numpy as np

# Assign each sample to the component with the larger weight
clusters = np.argmax(H, axis=0)
clusters
Out[14]:
array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1])
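To make the labels easier to use downstream, they can be attached to the sample IDs (a small sketch; cluster_s is a name introduced here):
# Label each sample ID with its cluster and inspect the cluster sizes
cluster_s = pd.Series(clusters, index=H_df.columns, name="cluster")
print(cluster_s.value_counts())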
NMF analysis¶
Which samples drive our first component (cluster)?
In [15]:
samples = H_df.loc["latent_1"].sort_values(ascending=False)
samples
Out[15]:
H022    0.117501
H006    0.117386
H110    0.116995
H236    0.116745
H024    0.116696
          ...
H042    0.000000
H077    0.000000
H008    0.000000
H027    0.000000
H066    0.000000
Name: latent_1, Length: 135, dtype: float64
In [16]:
top10 = samples.index.values[0:10]
top10
Out[16]:
array(['H022', 'H006', 'H110', 'H236', 'H024', 'H099', 'H038', 'H247', 'H133', 'H060'], dtype=object)
Similarly, which transcript IDs and methylation probes are driving the signal in the same component?
In [17]:
# First component of W: per-feature loadings, sorted in descending order
features = W_df.iloc[:, 0].sort_values(ascending=False)
features
Out[17]:
ENSG00000092820    0.298223
ENSG00000067082    0.293701
ENSG00000142102    0.285527
ENSG00000177606    0.282739
ENSG00000156738    0.281571
                     ...
cg18766900_p       0.000000
cg06528737_p       0.000000
cg10853431_p       0.000000
cg07240413_p       0.000000
cg04387658_p       0.000000
Name: latent_1, Length: 13496, dtype: float64
Let's get the top ten genes and the top ten probes...
In [18]:
# Ensembl gene IDs start with "EN", methylation probes with "cg"
top_g = [f for f in features.index.values if f[:2] == "EN"]
top_p = [f for f in features.index.values if f[:2] == "cg"]
print(top_g[:10])
print(top_p[:10])
['ENSG00000092820', 'ENSG00000067082', 'ENSG00000142102', 'ENSG00000177606', 'ENSG00000156738', 'ENSG00000077238', 'ENSG00000215301', 'ENSG00000087074', 'ENSG00000112149', 'ENSG00000143297']
['cg04813697_n', 'cg01360627_n', 'cg06401414_n', 'cg03633073_p', 'cg04694619_n', 'cg02650512_n', 'cg05277504_p', 'cg04173586_n', 'cg05656688_n', 'cg10379077_n']
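The task also asks to verify the function of the driving features. One option (a sketch, assuming the optional mygene package is installed, e.g. via pip install mygene) is to map the top Ensembl IDs to gene symbols and names:
import mygene

# Query MyGene.info for symbol and name of the top ten Ensembl gene IDs
mg = mygene.MyGeneInfo()
hits = mg.querymany(top_g[:10], scopes="ensembl.gene",
                    fields="symbol,name", species="human")
for h in hits:
    print(h.get("query"), h.get("symbol"), h.get("name"))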
Does it look simpler than MOFA?¶
Well...
Now, for the finer points, extend this study:
- Fit the model for a range of K values, cluster via k-means, and evaluate the clusterings with the silhouette coefficient (see the sketch after this list).
- Observe convergence by computing the mean squared error.
- Use the other omics tables as well, for... something.
- Compare with the results from the MOFA analysis https://bioconductor.riken.jp/packages/3.9/bioc/vignettes/MOFA/inst/doc/MOFA_example_CLL.html
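A sketch covering the first two extensions (k-means on the per-sample weights in H, silhouette scoring, and the reconstruction error reported by sklearn; the K range is arbitrary):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    model_k = NMF(n_components=k, init='random', random_state=0, max_iter=500)
    W_k = model_k.fit_transform(X)
    H_k = model_k.components_
    # sklearn reports the Frobenius norm of the residual;
    # divide its square by X.size for a mean squared error
    err = model_k.reconstruction_err_
    # Cluster the samples on their k component weights
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H_k.T)
    sil = silhouette_score(H_k.T, labels)
    print(f"K={k}: reconstruction error {err:.4f}, silhouette {sil:.3f}")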
Shameless self-promotion:
- I teach a data science course in advanced Python that covers everything from numerical computing to deep learning. For details, check the SciLifeLab page.