Biological and biomedical research has been tremendously benefiting last decade from the technological progress delivering DNA sequence (genomics), gene expression (transcriptomics), protein abundance (proteomics) and many other levels of biological information commonly referred to as OMICs. Despite individual OMICs layers are capable of answering many important biological questions, their combination and consequent synergistic effects from their complementarity promise new insights into behavior of biological systems such as cells, tissues and organisms. Therefore OMICs integration represents the contemporary challenge in Biology and Biomedicine.
from IPython.display import Image
Image('/home/nikolay/Documents/Teaching/IntegrativeOMICs2020/Day2/DeepLearningDataIntegration.jpg', width=2000)
The problem of data integration is not entirely new for Data Science. Imagine we know that a person looks at certain images, reads certain texts and listens to certain music. Image, text and sound are very different types of data, however we can try to combine those types of data in order to build e.g. a better recommender system which achieves a higher accuracy of capturing the interests of the person. As for Biology and Biomedicine, the idea of data integration has only recently arrived here, however it was actively developed with the biological angle resulting in several interesting methodologies such as mixOmics, MOFA, Similarity Network Fusion (SNF), OnPLS/JIVE/DISCO, Bayesian Networks etc.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningDataIntegration/IntegrOmicsMethods.png', width=2000)
One problem which all the listed above integrative OMICs methods face is the curse of dimensionality, i.e. inability to work in high-dimensional space with limited number of statistical observations, which is a typical setup for biological data analysis. This is where Single Cell OMICs technologies are very helpful as they deliver hundreds of thousands and even millions of statistical observations (cells) as we discussed in the previous article, and provide thus truly Big Data ideal for integration. It is very exciting that such multi-OMICs single cell technologies as CITEseq and scNMTseq provide two and three levels of biological information, respectively, from exactly the same cells.
from IPython.display import Image
Path = '/home/nikolay/Documents/Conferences/RIKEN2018'
Image(Path+'/Strategies-for-multi-omics-profiling-of-single-cells-Three-major-types-of-molecules.png',width=2000)
Here we will perform unsupervised integration of single cell transcriptomics (scRNAseq) and proteomics (scProteomics) data from CITEseq, 8 617 cord blood mononuclear cells (CBMC), using Autoencoder which is ideally suited for capturing highly non-linear nature of single cell OMICs data. We covered advantages of using Autoencoders for Single Cell Biology in the previous post, but briefly they are related to the fact that single cell analysis is essentially unsupervised. We start by downloading CITEseq data from here, reading them with Pandas and log-transforming, which is equivalent to a mild normalization. As usually, rows are cells, columns are mRNA or protein features, last column corresponds to cell annotation.
import os
import numpy as np
import pandas as pd
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.layers import concatenate
os.chdir('/home/nikolay/WABI/Misc/SingleCell/CITEseq/')
scRNAseq = pd.read_csv('./session_ml/UMAP_DataIntegration/scRNAseq.txt',sep='\t')
scProteomics = pd.read_csv('./session_ml/UMAP_DataIntegration/scProteomics.csv',sep='\t')
X_scRNAseq = scRNAseq.values[:,0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:,scRNAseq.shape[1]-1]
X_scProteomics = scProteomics.values[:,0:(scProteomics.shape[1]-1)]
Y_scProteomics = scProteomics.values[:,scProteomics.shape[1]-1]
X_scRNAseq = np.log(X_scRNAseq + 1)
X_scProteomics = np.log(X_scProteomics + 1)
print('scRNAseq dimensions: ' + str(X_scRNAseq.shape))
print('scProteomics dimensions: ' + str(X_scProteomics.shape))
Let us display tSNE embeddings of individual scRNAseq and scProteomics layers before performing the integration step.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 20).fit_transform(X_scRNAseq)
model_tsne_scRNAseq = TSNE(learning_rate = 200, n_components = 2, random_state = 123,
perplexity = 90, n_iter = 1000, verbose = 1)
tsne_scRNAseq = model_tsne_scRNAseq.fit_transform(X_reduced)
plt.figure(figsize = (20,15))
plt.scatter(tsne_scRNAseq[:, 0], tsne_scRNAseq[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scRNAseq from CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
from sklearn.manifold import TSNE
model_tsne_scProteomics = TSNE(learning_rate = 200, n_components = 2, random_state = 123,
perplexity = 90, n_iter = 1000, verbose = 1)
tsne_scProteomics = model_tsne_scProteomics.fit_transform(X_scProteomics)
plt.figure(figsize = (20,15))
plt.scatter(tsne_scProteomics[:, 0], tsne_scProteomics[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scProteomics from CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
Now we are going to build an Autoencoder model with 4 hidden layers using Keras functional API. The Autoencoder has two inputs, one for each layer of information, i.e. scRNAseq and scProteomics, and corresponding two outputs which aim to reconstruct the inputs. The two input layers are separately linearly transformed in the first hidden layer (equivalent to PCA dimensionality reduction) before they are concatenated in the second hidden layer. Finally, the merged OMICs are processed through the bottleneck of the Autoencoder, and finally the dimensions are gradually reconstructed to the initial ones according to the “butterfly” symmetry typical for Autoencoders.
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/CITEseq/autoencoder_graph.png', width=800)
In the code for the Autoencoder below, it is important to note that the first hidden layer imposes severe dimensionality reduction on the scRNAseq from 977 to 50 genes, while it leaves the scProteomics almost untouched, i.e. reduces dimensions from 11 to 10. The bottleneck further reduces the total 60 dimensions after concatenation down to 50 latent variables which represent combinations of both mRNA and protein features.
# Input Layer
ncol_scRNAseq = X_scRNAseq.shape[1]
input_dim_scRNAseq = Input(shape = (ncol_scRNAseq, ), name = "scRNAseq")
ncol_scProteomics = X_scProteomics.shape[1]
input_dim_scProteomics = Input(shape = (ncol_scProteomics, ), name = "scProteomics")
# Dimensions of Encoder for each OMIC
encoding_dim_scRNAseq = 50
encoding_dim_scProteomics = 10
# Encoder layer for each OMIC
encoded_scRNAseq = Dense(encoding_dim_scRNAseq, activation = 'elu',
name = "Encoder_scRNAseq")(input_dim_scRNAseq)
encoded_scProteomics = Dense(encoding_dim_scProteomics, activation = 'linear',
name = "Encoder_scProteomics")(input_dim_scProteomics)
# Merging Encoder layers from different OMICs
merge = concatenate([encoded_scRNAseq, encoded_scProteomics])
# Bottleneck compression
bottleneck = Dense(50, kernel_initializer = 'uniform', activation = 'linear',
name = "Bottleneck")(merge)
#Inverse merging
merge_inverse = Dense(encoding_dim_scRNAseq + encoding_dim_scProteomics,
activation = 'elu', name = "Concatenate_Inverse")(bottleneck)
# Decoder layer for each OMIC
decoded_scRNAseq = Dense(ncol_scRNAseq, activation = 'linear',
name = "Decoder_scRNAseq")(merge_inverse)
decoded_scProteomics = Dense(ncol_scProteomics, activation = 'linear',
name = "Decoder_scProteomics")(merge_inverse)
# Combining Encoder and Decoder into an Autoencoder model
autoencoder = Model([input_dim_scRNAseq, input_dim_scProteomics], [decoded_scRNAseq, decoded_scProteomics])
# Compile Autoencoder
autoencoder.compile(optimizer = 'adam',
loss={'Decoder_scRNAseq': 'mean_squared_error',
'Decoder_scProteomics': 'mean_squared_error'})
autoencoder.summary()
A very handy thing here is that we can assign different loss functions to OMICs coming from different statistical distributions, e.g. combining categorical and continuous data we can apply categorical cross entropy and mean squared error, respectively. Another great thing about data integration via Autoencoders is that all OMICs know about each other as the weights for each node / feature are updated through back propagation in the context of each other. Finally, let us train the Autoencoder and feed the bottleneck into tSNE for visualization:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# Autoencoder training
estimator = autoencoder.fit([X_scRNAseq, X_scProteomics],
[X_scRNAseq, X_scProteomics],
epochs = 10, batch_size = 128,
validation_split = 0.2, shuffle = True, verbose = 1)
print("Training Loss: ",estimator.history['loss'][-1])
print("Validation Loss: ",estimator.history['val_loss'][-1])
plt.figure(figsize = (20,15))
plt.plot(estimator.history['loss'])
plt.plot(estimator.history['val_loss'])
plt.title('Model Loss', fontsize = 22)
plt.ylabel('Loss', fontsize = 22)
plt.xlabel('Epoch', fontsize = 22)
plt.legend(['Train','Validation'], loc = 'upper right', fontsize = 22)
plt.show()
# Encoder model
encoder = Model([input_dim_scRNAseq, input_dim_scProteomics], bottleneck)
bottleneck_representation = encoder.predict([X_scRNAseq, X_scProteomics])
# tSNE on Autoencoder bottleneck representation
model_tsne_auto = TSNE(learning_rate = 200, n_components = 2, random_state = 123,
perplexity = 90, n_iter = 1000, verbose = 1)
tsne_auto = model_tsne_auto.fit_transform(bottleneck_representation)
plt.figure(figsize = (20,15))
plt.scatter(tsne_auto[:, 0], tsne_auto[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on Autoencoder: Data Integration, CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
Comparing the tSNE plots obtained using individual OMICs with the tSNE on the bottleneck of the Autoencoder that combines the data, we can immediately see that the integration somewhat averages and reinforces the individual OMICs. For example, the purple cluster would be hard to discover using the scRNAseq data alone as it is not distinct from the blue cell population, however after integration the purple group of cells is easily distinguishable. This is the power of data integration!
While CITEseq includes two single cell levels of information (transcriptomics and proteomics), another fantastic technology, scNMTseq, delivers three OMICs from the same biological cells: 1) transcriptomics (scRNAseq), 2) methylation pattern (scBSseq), and 3) open chromatin regions (scATACseq). The raw data can be downloaded from here.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningDataIntegration/scNMT_Method.png', width=2000)
The architecture of the Autoencoder is analogous to the one used for CITEseq with only one peculiarity: Dropout regularization is used on the input layers. This is due to the fact that we have only ~120 cells sequenced while the dimensionality of the feature space is tens of thousands, so we need to apply regularization to overcome the curse of dimensionality. Note that this was not necessary for CITEseq where we had ~8K cells and ~1K features, so exactly opposite situation. Nevertheless, overall scNMTseq is not an easy case for data integration, I firmly believe though that this is just the beginning of single cell multi-OMICs era and many more cells will arrive soon from this exciting technology, so it is better to be prepared.
import os
import numpy as np
import pandas as pd
from umap import UMAP
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from tensorflow.keras.layers import concatenate
from tensorflow.keras.layers import Input, Dense, Dropout
os.chdir('/home/nikolay/WABI/Misc/SingleCell/scNMTseq/')
################## READ AND TRANSFORM DATA ##################
scRNAseq = pd.read_csv('scRNAseq.txt',sep='\t')
scBSseq = pd.read_csv('scBSseq.txt',sep='\t')
scATACseq = pd.read_csv('scATACseq.txt',sep='\t')
X_scRNAseq = scRNAseq.values[:,0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:,scRNAseq.shape[1]-1]
X_scBSseq = scBSseq.values[:,0:(scBSseq.shape[1]-1)]
Y_scBSseq = scBSseq.values[:,scBSseq.shape[1]-1]
X_scATACseq = scATACseq.values[:,0:(scATACseq.shape[1]-1)]
Y_scATACseq = scATACseq.values[:,scATACseq.shape[1]-1]
X_scRNAseq = np.log(X_scRNAseq + 1)
X_scBSseq = np.log(X_scBSseq + 1)
X_scATACseq = np.log(X_scATACseq + 1)
print('scRNAseq dimensions: ' + str(X_scRNAseq.shape))
print('scBSseq dimensions: ' + str(X_scBSseq.shape))
print('scATACseq dimensions: ' + str(X_scATACseq.shape))
Again, let us display the tSNE embeddings of the individual scRNAseq layers from the scNMT data set before integrating it together with scBSseq and scATACseq with Autoencoder.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 5).fit_transform(X_scRNAseq)
model_tsne_scRNAseq = TSNE(n_components = 2, random_state = 123, perplexity = 10, n_iter = 1000, verbose = 1)
tsne_scRNAseq = model_tsne_scRNAseq.fit_transform(X_reduced)
plt.figure(figsize = (20,15))
plt.scatter(tsne_scRNAseq[:, 0], tsne_scRNAseq[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scRNAseq from scNMT', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
######################## AUTOENCODER ########################
# Input Layer
ncol_scRNAseq = X_scRNAseq.shape[1]
input_dim_scRNAseq = Input(shape = (ncol_scRNAseq, ), name = "scRNAseq")
ncol_scBSseq = X_scBSseq.shape[1]
input_dim_scBSseq = Input(shape = (ncol_scBSseq, ), name = "scBSseq")
ncol_scATACseq = X_scATACseq.shape[1]
input_dim_scATACseq = Input(shape = (ncol_scATACseq, ), name = "scATACseq")
encoding_dim_scRNAseq = 10
encoding_dim_scBSseq = 10
encoding_dim_scATACseq = 10
# Dropout on Input Layer
dropout_scRNAseq = Dropout(0.2, name = "Dropout_scRNAseq")(input_dim_scRNAseq)
dropout_scBSseq = Dropout(0.2, name = "Dropout_scBSseq")(input_dim_scBSseq)
dropout_scATACseq = Dropout(0.2, name = "Dropout_scATACseq")(input_dim_scATACseq)
# Encoder layer for each OMIC
encoded_scRNAseq = Dense(encoding_dim_scRNAseq, activation = 'elu',
name = "Encoder_scRNAseq")(dropout_scRNAseq)
encoded_scBSseq = Dense(encoding_dim_scBSseq, activation = 'elu',
name = "Encoder_scBSseq")(dropout_scBSseq)
encoded_scATACseq = Dense(encoding_dim_scATACseq, activation = 'elu',
name = "Encoder_scATACseq")(dropout_scATACseq)
# Merging Encoder layers from different OMICs
merge = concatenate([encoded_scRNAseq, encoded_scBSseq, encoded_scATACseq])
# Bottleneck compression
bottleneck = Dense(5, kernel_initializer = 'uniform', activation = 'elu',
name = "Bottleneck")(merge)
#Inverse merging
merge_inverse = Dense(encoding_dim_scRNAseq + encoding_dim_scBSseq +
encoding_dim_scATACseq,
activation = 'elu', name = "Concatenate_Inverse")(bottleneck)
# Decoder layer for each OMIC
decoded_scRNAseq = Dense(ncol_scRNAseq, activation = 'elu',
name = "Decoder_scRNAseq")(merge_inverse)
decoded_scBSseq = Dense(ncol_scBSseq, activation = 'elu',
name = "Decoder_scBSseq")(merge_inverse)
decoded_scATACseq = Dense(ncol_scATACseq, activation = 'elu',
name = "Decoder_scATACseq")(merge_inverse)
# Combining Encoder and Decoder into an Autoencoder model
autoencoder = Model([input_dim_scRNAseq, input_dim_scBSseq, input_dim_scATACseq],
[decoded_scRNAseq, decoded_scBSseq, decoded_scATACseq])
# Compile Autoencoder
autoencoder.compile(optimizer = 'adam',
loss={'Decoder_scRNAseq': 'mean_squared_error',
'Decoder_scBSseq': 'binary_crossentropy',
'Decoder_scATACseq': 'binary_crossentropy'})
autoencoder.summary()
# Autoencoder training
estimator = autoencoder.fit([X_scRNAseq, X_scBSseq, X_scATACseq],
[X_scRNAseq, X_scBSseq, X_scATACseq], epochs = 20,
batch_size = 16, validation_split = 0.2,
shuffle = True, verbose = 0)
print("Training Loss: ",estimator.history['loss'][-1])
print("Validation Loss: ",estimator.history['val_loss'][-1])
plt.figure(figsize = (20,15))
plt.plot(estimator.history['loss']); plt.plot(estimator.history['val_loss'])
plt.title('Model Loss', fontsize = 22); plt.ylabel('Loss', fontsize = 22); plt.xlabel('Epoch', fontsize = 22)
plt.legend(['Train','Validation'], loc = 'upper right', fontsize = 22)
# Encoder model
encoder = Model([input_dim_scRNAseq, input_dim_scBSseq, input_dim_scATACseq], bottleneck)
bottleneck_representation = encoder.predict([X_scRNAseq, X_scBSseq, X_scATACseq])
############### UNIFORM MANIFOLD APPROXIMATION AND PROJECTION (UMAP) ###############
model_umap = UMAP(n_neighbors = 10, min_dist = 0.5, n_components = 2)
umap = model_umap.fit_transform(bottleneck_representation)
plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scRNAseq, cmap = 'tab10', s = 50)
plt.title('UMAP on Autoencoder: Data Integration, scNMTseq', fontsize = 22)
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
Here out of curiosity I fed the bottleneck of the Autoencoder that combines the three scNMTseq OMICs into Uniform Manifold Approximation and Projection (UMAP) non-linear dimensionality reduction technique which seems to outperform tSNE in sense of scalability for large amounts of data. We can immediately see that the homogeneous in sense of gene expression brown cluster splits into two clusters when scRNAseq is combined with epigenetics information from the same cells (scBSseq and scATACseq). Therefore it seems that we have captured a new heterogeneity between cells which was hidden when looking only at gene expression scRNAseq data. Can this be a new way of classifying cells across populations by using the whole complexity of their biology? If so, then the question comes: what is a cell population or cell type? I do not know the answer for this question.
Here we have learnt that multiple sources of molecular and clinical information are becoming common in Biology and Biomedicine thanks to the recent technological progress. Therefore data integration is a logical next step which provides a more comprehensive understanding of the biological processes by utilizing the whole complexity of the data. Deep Learning framework is ideally suited for data integration due to its truly “integrative” updating of parameters through back propagation when multiple data types learn information from each other. I showed that data integration can result in discoveries of novel patterns in the data which were not previously seen in the individual data types.