by Sergiu Netotea, PhD, NBIS, Chalmers
Further reading:
Observations:
Matrix factorization (MF): [Credit: Wikipedia]
Latent (hidden) factors:
| V matrix (values) | W matrix (weights, scores) | H matrix (hidden, loadings) |
|---|---|---|
| expression values (genes x samples) | genes x factors (metagenes) | factors (metagenes) x samples |
| protein counts (proteins x samples) | proteins x factors (domains) | factors (domains) x samples |
| multiomics observations (genes, proteins, etc. x samples) | (genes, proteins, etc.) x factors (multiomic features) | factors (multiomic features) x samples |
| multiple datasets (genes x samples x batches) | genes x factors (multi_batch domains) | factors (multi_batch domains) x samples |
| V matrix (values) | W matrix (weights, scores) | H matrix (hidden, loadings) |
|---|---|---|
| recommender systems (items x users) | items x factors (preferences) | factors (preferences) x users |
| collaborative filtering (user x user connections) | users x factors (communities) | factors (communities) x users |
| language processing (word distributions x documents) | words x factors (topics) | factors (topics) x documents |
| facial recognition (faces x labels) | faces x factors (facial features) | factors (facial features) x labels |
| microscopy pictures (pictures x samples) | pictures x factors (image segments) | factors (image segments) x samples |
| spectrometry (spectra x samples) | spectra x factors (component molecules) | factors (component molecules) x samples |
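Every row of the tables above has the same shape: a values matrix V (features x samples) is approximated by the product of W (features x factors) and H (factors x samples). A minimal numpy sketch of these shapes (all sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical sizes: 6 features (e.g. genes), 4 samples, 2 latent factors
n_features, n_samples, n_factors = 6, 4, 2

rng = np.random.default_rng(0)
W = rng.random((n_features, n_factors))  # weights / scores: features x factors
H = rng.random((n_factors, n_samples))   # loadings: factors x samples

# The values matrix is (approximately) the product of the two smaller matrices
V = W @ H
print(V.shape)  # (6, 4)
```

Note that W and H together hold far fewer numbers than V when the number of factors is small, which is what makes the factorization a compression and an interpretation device at the same time.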
Further reading:
import pandas as pd
import numpy as np
m = np.array([[0,1,0,1,2,2],
[0,1,1,1,3,4],
[2,3,1,1,2,2],
[1,1,1,0,1,1],
[0,2,3,4,1,1],
[0,0,0,0,1,0]])
dataset = pd.DataFrame(m, columns=['John', 'Alice', 'Mary', 'Greg', 'Peter', 'Jennifer'])
dataset.index = ['diabetes_gene1', 'diabetes_gene2', 'cancer_protein1', 'unclear', 'melancholy_gene', 'cofee_dependency_gene']
print(dataset)
                       John  Alice  Mary  Greg  Peter  Jennifer
diabetes_gene1            0      1     0     1      2         2
diabetes_gene2            0      1     1     1      3         4
cancer_protein1           2      3     1     1      2         2
unclear                   1      1     1     0      1         1
melancholy_gene           0      2     3     4      1         1
cofee_dependency_gene     0      0     0     0      1         0
Intuitively, the users (samples) are connected to their items (genes) through a hidden scheme that could simplify this table. The elements of such a hidden scheme are called hidden (latent) factors. Here is a possible example:
latent_factors = ['latent1', 'latent2', 'latent3']
from sklearn.decomposition import NMF
nmf = NMF(n_components=3)
V = dataset
nmf.fit(V)
H = pd.DataFrame(np.round(nmf.components_,2), columns=V.columns)
H.index = latent_factors
W = pd.DataFrame(np.round(nmf.transform(V),2), columns=H.index)
W.index = V.index
print("\n\n V - Initial Data matrix (features x samples):")
print(V)
V - Initial Data matrix (features x samples):

                       John  Alice  Mary  Greg  Peter  Jennifer
diabetes_gene1            0      1     0     1      2         2
diabetes_gene2            0      1     1     1      3         4
cancer_protein1           2      3     1     1      2         2
unclear                   1      1     1     0      1         1
melancholy_gene           0      2     3     4      1         1
cofee_dependency_gene     0      0     0     0      1         0
print("\n\n W - factors matrix (features, factors):")
print(W)
W - factors matrix (features, factors):

                       latent1  latent2  latent3
diabetes_gene1            0.17     0.03     0.37
diabetes_gene2            0.30     0.00     0.58
cancer_protein1           0.07     0.47     0.49
unclear                   0.04     0.21     0.16
melancholy_gene           0.00     0.00     2.29
cofee_dependency_gene     0.04     0.00     0.00
print("\n\n H - coefficients matrix (factors, samples):")
print(H)
H - coefficients matrix (factors, samples):

         John  Alice  Mary  Greg  Peter  Jennifer
latent1  0.00   1.95  0.00  0.41   9.68     11.86
latent2  4.32   4.96  1.22  0.00   2.48      2.10
latent3  0.00   0.88  1.30  1.76   0.43      0.44
Can we figure out the hidden factors? We can do this in one of two ways: either we know the real afflictions, or, as in our toy model, we only know the effect of each omics feature. In the latter case, we have to look into W (the weights, or factors, matrix):
latent_factors = ['Diabetes', 'Cancer', 'Melancholy']
H = pd.DataFrame(np.round(nmf.components_,2), columns=V.columns)
W = pd.DataFrame(np.round(nmf.transform(V),2), columns=latent_factors)
H.index = latent_factors
W.index = V.index
print(H)
print(W)
            John  Alice  Mary  Greg  Peter  Jennifer
Diabetes    0.00   1.95  0.00  0.41   9.68     11.86
Cancer      4.32   4.96  1.22  0.00   2.48      2.10
Melancholy  0.00   0.88  1.30  1.76   0.43      0.44

                       Diabetes  Cancer  Melancholy
diabetes_gene1             0.17    0.03        0.37
diabetes_gene2             0.30    0.00        0.58
cancer_protein1            0.07    0.47        0.49
unclear                    0.04    0.21        0.16
melancholy_gene            0.00    0.00        2.29
cofee_dependency_gene      0.04    0.00        0.00
Example findings:
Hypothesis hunting: W x H is an approximation of V, so by reconstructing the dataset from the NMF model we can learn some new things.
reconstructed = pd.DataFrame(np.round(np.dot(W,H),2), columns=V.columns)
reconstructed.index = V.index
print(reconstructed)
                       John  Alice  Mary  Greg  Peter  Jennifer
diabetes_gene1         0.13   0.81  0.52  0.72   1.88      2.24
diabetes_gene2         0.00   1.10  0.75  1.14   3.15      3.81
cancer_protein1        2.03   2.90  1.21  0.89   2.05      2.03
unclear                0.91   1.26  0.46  0.30   0.98      0.99
melancholy_gene        0.00   2.02  2.98  4.03   0.98      1.01
cofee_dependency_gene  0.00   0.08  0.00  0.02   0.39      0.47
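As a sanity check, the quality of the approximation can be quantified with the Frobenius norm of the residual V - W x H. A minimal sketch on the same toy matrix (the `init`, `max_iter`, and `random_state` settings are choices made here for reproducibility, not part of the original example):

```python
import numpy as np
from sklearn.decomposition import NMF

# Same toy matrix as above (genes x samples)
m = np.array([[0, 1, 0, 1, 2, 2],
              [0, 1, 1, 1, 3, 4],
              [2, 3, 1, 1, 2, 2],
              [1, 1, 1, 0, 1, 1],
              [0, 2, 3, 4, 1, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

nmf = NMF(n_components=3, init='nndsvda', max_iter=1000, random_state=0)
W = nmf.fit_transform(m)
H = nmf.components_

# Frobenius norm of the residual; smaller means a closer fit
error = np.linalg.norm(m - W @ H)
print(round(error, 3))
```

The residual is never zero for a rank-3 model of a rank-6 matrix; what matters is that it is small relative to the norm of V itself.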
Reference: For the toy example I drew inspiration from the following Medium article:
NMF has many solvers, several of them very efficient:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
There is also a PyTorch module: https://pypi.org/project/nmf-torch/
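In scikit-learn the solver is selectable: coordinate descent minimizes the Frobenius loss, while multiplicative updates also support other beta-divergences such as Kullback-Leibler. A short sketch (the toy matrix and all parameter values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

# Arbitrary non-negative data: 6 features x 4 samples
V = np.random.default_rng(0).random((6, 4))

# Coordinate descent (the default) minimizes the Frobenius loss
nmf_cd = NMF(n_components=2, solver='cd', random_state=0, max_iter=500)

# Multiplicative updates support other beta-divergences, e.g. KL
nmf_mu = NMF(n_components=2, solver='mu',
             beta_loss='kullback-leibler', random_state=0, max_iter=500)

W_cd = nmf_cd.fit_transform(V)
W_mu = nmf_mu.fit_transform(V)
print(W_cd.shape, W_mu.shape)
```

Note that `beta_loss='kullback-leibler'` requires `solver='mu'`; the KL divergence is often preferred for count-like data such as sequencing reads.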
It was recently used in a method for online learning of integrative omics for single cells:
- Deep architecture: as in a CNN trained with backpropagation, each NMF layer performs a hierarchical decomposition
In comparison to other unsupervised clustering methods including K-means and hierarchical clustering, NMF has higher accuracy in separating similar groups in various datasets. We ranked genes by their importance scores (D-scores) in separating these groups, and discovered that NMF uniquely identifies genes expressed at intermediate levels as top-ranked genes.
Rank estimation of NMF:
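A common heuristic for the rank is to fit the model at several candidate ranks and look for an elbow in the reconstruction error. A sketch using scikit-learn's `reconstruction_err_` attribute on the toy matrix (the solver settings are assumptions made for reproducibility):

```python
import numpy as np
from sklearn.decomposition import NMF

# Same toy matrix as in the example above
V = np.array([[0, 1, 0, 1, 2, 2],
              [0, 1, 1, 1, 3, 4],
              [2, 3, 1, 1, 2, 2],
              [1, 1, 1, 0, 1, 1],
              [0, 2, 3, 4, 1, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

# Fit NMF for a range of ranks and record the reconstruction error;
# the rank where the error curve flattens (the "elbow") is a candidate
errors = {}
for k in range(1, 6):
    model = NMF(n_components=k, init='nndsvda', max_iter=2000, random_state=0)
    model.fit(V)
    errors[k] = model.reconstruction_err_

for k, e in errors.items():
    print(k, round(e, 3))
```

More principled alternatives include cross-validation on held-out matrix entries and the cophenetic correlation across random restarts, but the error-vs-rank curve is the quickest first look.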
Other integrative NMF:
Fast Tensorial Calculus:
Non-negative CP Decomposition (NTF)
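Non-negative CP decomposition generalizes NMF from a matrix to a tensor: a 3-way array (e.g. genes x samples x batches) is approximated by a sum of rank-1 outer products of non-negative factor vectors. As a self-contained illustration, here is a minimal numpy sketch using multiplicative updates (the function name `ntf_cp` and all sizes are made up for this example; dedicated libraries provide optimized solvers):

```python
import numpy as np

def ntf_cp(X, rank, n_iter=500, eps=1e-9):
    """Non-negative CP decomposition of a 3-way tensor via
    multiplicative updates (minimal educational sketch)."""
    rng = np.random.default_rng(0)
    I, J, K = X.shape
    A = rng.random((I, rank)) + 0.1  # keep factors strictly positive
    B = rng.random((J, rank)) + 0.1
    C = rng.random((K, rank)) + 0.1
    for _ in range(n_iter):
        # Numerator: contract X with the other two factors; denominator:
        # Hadamard product of the other factors' Gram matrices
        A *= np.einsum('ijk,jr,kr->ir', X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum('ijk,ir,kr->jr', X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum('ijk,ir,jr->kr', X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

# Toy data: an exact rank-2 non-negative tensor (genes x samples x batches)
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((5, 2)), rng.random((4, 2)), rng.random((3, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = ntf_cp(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(A.shape, B.shape, C.shape)
```

Each column r of (A, B, C) is one rank-1 component, directly analogous to one latent factor (one column of W plus one row of H) in the matrix case.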