Deep Learning for Data Integration

Biological and biomedical research has been tremendously benefiting last decade from the technological progress delivering DNA sequence (genomics), gene expression (transcriptomics), protein abundance (proteomics) and many other levels of biological information commonly referred to as OMICs. Despite individual OMICs layers are capable of answering many important biological questions, their combination and consequent synergistic effects from their complementarity promise new insights into behavior of biological systems such as cells, tissues and organisms. Therefore OMICs integration represents the contemporary challenge in Biology and Biomedicine.

In [1]:
from IPython.display import Image
Image('/home/nikolay/Documents/Teaching/IntegrativeOMICs2020/Day2/DeepLearningDataIntegration.jpg', width=2000)
Out[1]:

Single Cells Make Big Data

The problem of data integration is not entirely new for Data Science. Imagine we know that a person looks at certain images, reads certain texts and listens to certain music. Image, text and sound are very different types of data, however we can try to combine those types of data in order to build e.g. a better recommender system which achieves a higher accuracy of capturing the interests of the person. As for Biology and Biomedicine, the idea of data integration has only recently arrived here, however it was actively developed with the biological angle resulting in several interesting methodologies such as mixOmics, MOFA, Similarity Network Fusion (SNF), OnPLS/JIVE/DISCO, Bayesian Networks etc.

In [2]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningDataIntegration/IntegrOmicsMethods.png', width=2000)
Out[2]:

One problem which all the listed above integrative OMICs methods face is the curse of dimensionality, i.e. inability to work in high-dimensional space with limited number of statistical observations, which is a typical setup for biological data analysis. This is where Single Cell OMICs technologies are very helpful as they deliver hundreds of thousands and even millions of statistical observations (cells) as we discussed in the previous article, and provide thus truly Big Data ideal for integration. It is very exciting that such multi-OMICs single cell technologies as CITEseq and scNMTseq provide two and three levels of biological information, respectively, from exactly the same cells.

In [3]:
from IPython.display import Image
Path = '/home/nikolay/Documents/Conferences/RIKEN2018'
Image(Path+'/Strategies-for-multi-omics-profiling-of-single-cells-Three-major-types-of-molecules.png',width=2000)
Out[3]:

Integrating CITEseq data with Deep Learning

Here we will perform unsupervised integration of single cell transcriptomics (scRNAseq) and proteomics (scProteomics) data from CITEseq, 8 617 cord blood mononuclear cells (CBMC), using Autoencoder which is ideally suited for capturing highly non-linear nature of single cell OMICs data. We covered advantages of using Autoencoders for Single Cell Biology in the previous post, but briefly they are related to the fact that single cell analysis is essentially unsupervised. We start by downloading CITEseq data from here, reading them with Pandas and log-transforming, which is equivalent to a mild normalization. As usually, rows are cells, columns are mRNA or protein features, last column corresponds to cell annotation.

In [4]:
import os
import numpy as np
import pandas as pd
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.layers import concatenate

os.chdir('/home/nikolay/WABI/Misc/SingleCell/CITEseq/')

scRNAseq = pd.read_csv('./session_ml/UMAP_DataIntegration/scRNAseq.txt',sep='\t')
scProteomics = pd.read_csv('./session_ml/UMAP_DataIntegration/scProteomics.csv',sep='\t')

X_scRNAseq = scRNAseq.values[:,0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:,scRNAseq.shape[1]-1]
X_scProteomics = scProteomics.values[:,0:(scProteomics.shape[1]-1)]
Y_scProteomics = scProteomics.values[:,scProteomics.shape[1]-1]

X_scRNAseq = np.log(X_scRNAseq + 1)
X_scProteomics = np.log(X_scProteomics + 1)
print('scRNAseq dimensions: ' + str(X_scRNAseq.shape))
print('scProteomics dimensions: ' + str(X_scProteomics.shape))
scRNAseq dimensions: (8617, 976)
scProteomics dimensions: (8617, 10)

Let us display tSNE embeddings of individual scRNAseq and scProteomics layers before performing the integration step.

In [20]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

X_reduced = PCA(n_components = 20).fit_transform(X_scRNAseq)

model_tsne_scRNAseq = TSNE(learning_rate = 200, n_components = 2, random_state = 123, 
                       perplexity = 90, n_iter = 1000, verbose = 1)
tsne_scRNAseq = model_tsne_scRNAseq.fit_transform(X_reduced)

plt.figure(figsize = (20,15))
plt.scatter(tsne_scRNAseq[:, 0], tsne_scRNAseq[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scRNAseq from CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 271 nearest neighbors...
[t-SNE] Indexed 8617 samples in 0.022s...
[t-SNE] Computed neighbors for 8617 samples in 1.813s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8617
[t-SNE] Computed conditional probabilities for sample 2000 / 8617
[t-SNE] Computed conditional probabilities for sample 3000 / 8617
[t-SNE] Computed conditional probabilities for sample 4000 / 8617
[t-SNE] Computed conditional probabilities for sample 5000 / 8617
[t-SNE] Computed conditional probabilities for sample 6000 / 8617
[t-SNE] Computed conditional probabilities for sample 7000 / 8617
[t-SNE] Computed conditional probabilities for sample 8000 / 8617
[t-SNE] Computed conditional probabilities for sample 8617 / 8617
[t-SNE] Mean sigma: 0.738020
[t-SNE] KL divergence after 250 iterations with early exaggeration: 67.639267
[t-SNE] KL divergence after 1000 iterations: 1.444937
In [15]:
from sklearn.manifold import TSNE

model_tsne_scProteomics = TSNE(learning_rate = 200, n_components = 2, random_state = 123, 
                       perplexity = 90, n_iter = 1000, verbose = 1)
tsne_scProteomics = model_tsne_scProteomics.fit_transform(X_scProteomics)

plt.figure(figsize = (20,15))
plt.scatter(tsne_scProteomics[:, 0], tsne_scProteomics[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scProteomics from CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 271 nearest neighbors...
[t-SNE] Indexed 8617 samples in 0.017s...
[t-SNE] Computed neighbors for 8617 samples in 1.254s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8617
[t-SNE] Computed conditional probabilities for sample 2000 / 8617
[t-SNE] Computed conditional probabilities for sample 3000 / 8617
[t-SNE] Computed conditional probabilities for sample 4000 / 8617
[t-SNE] Computed conditional probabilities for sample 5000 / 8617
[t-SNE] Computed conditional probabilities for sample 6000 / 8617
[t-SNE] Computed conditional probabilities for sample 7000 / 8617
[t-SNE] Computed conditional probabilities for sample 8000 / 8617
[t-SNE] Computed conditional probabilities for sample 8617 / 8617
[t-SNE] Mean sigma: 0.471751
[t-SNE] KL divergence after 250 iterations with early exaggeration: 65.956917
[t-SNE] KL divergence after 1000 iterations: 1.193578

Now we are going to build an Autoencoder model with 4 hidden layers using Keras functional API. The Autoencoder has two inputs, one for each layer of information, i.e. scRNAseq and scProteomics, and corresponding two outputs which aim to reconstruct the inputs. The two input layers are separately linearly transformed in the first hidden layer (equivalent to PCA dimensionality reduction) before they are concatenated in the second hidden layer. Finally, the merged OMICs are processed through the bottleneck of the Autoencoder, and finally the dimensions are gradually reconstructed to the initial ones according to the “butterfly” symmetry typical for Autoencoders.

In [5]:
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/CITEseq/autoencoder_graph.png', width=800)
Out[5]:

In the code for the Autoencoder below, it is important to note that the first hidden layer imposes severe dimensionality reduction on the scRNAseq from 977 to 50 genes, while it leaves the scProteomics almost untouched, i.e. reduces dimensions from 11 to 10. The bottleneck further reduces the total 60 dimensions after concatenation down to 50 latent variables which represent combinations of both mRNA and protein features.

In [35]:
# Input Layer
ncol_scRNAseq = X_scRNAseq.shape[1]
input_dim_scRNAseq = Input(shape = (ncol_scRNAseq, ), name = "scRNAseq")
ncol_scProteomics = X_scProteomics.shape[1]
input_dim_scProteomics = Input(shape = (ncol_scProteomics, ), name = "scProteomics")

# Dimensions of Encoder for each OMIC
encoding_dim_scRNAseq = 50
encoding_dim_scProteomics = 10

# Encoder layer for each OMIC
encoded_scRNAseq = Dense(encoding_dim_scRNAseq, activation = 'elu', 
                         name = "Encoder_scRNAseq")(input_dim_scRNAseq)
encoded_scProteomics = Dense(encoding_dim_scProteomics, activation = 'linear', 
                             name = "Encoder_scProteomics")(input_dim_scProteomics)

# Merging Encoder layers from different OMICs
merge = concatenate([encoded_scRNAseq, encoded_scProteomics])

# Bottleneck compression
bottleneck = Dense(50, kernel_initializer = 'uniform', activation = 'linear', 
                   name = "Bottleneck")(merge)

#Inverse merging
merge_inverse = Dense(encoding_dim_scRNAseq + encoding_dim_scProteomics, 
                      activation = 'elu', name = "Concatenate_Inverse")(bottleneck)

# Decoder layer for each OMIC
decoded_scRNAseq = Dense(ncol_scRNAseq, activation = 'linear', 
                         name = "Decoder_scRNAseq")(merge_inverse)
decoded_scProteomics = Dense(ncol_scProteomics, activation = 'linear', 
                             name = "Decoder_scProteomics")(merge_inverse)

# Combining Encoder and Decoder into an Autoencoder model
autoencoder = Model([input_dim_scRNAseq, input_dim_scProteomics], [decoded_scRNAseq, decoded_scProteomics])

# Compile Autoencoder
autoencoder.compile(optimizer = 'adam', 
                    loss={'Decoder_scRNAseq': 'mean_squared_error', 
                          'Decoder_scProteomics': 'mean_squared_error'})
autoencoder.summary()
Model: "functional_51"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
scRNAseq (InputLayer)           [(None, 976)]        0                                            
__________________________________________________________________________________________________
scProteomics (InputLayer)       [(None, 10)]         0                                            
__________________________________________________________________________________________________
Encoder_scRNAseq (Dense)        (None, 50)           48850       scRNAseq[0][0]                   
__________________________________________________________________________________________________
Encoder_scProteomics (Dense)    (None, 10)           110         scProteomics[0][0]               
__________________________________________________________________________________________________
concatenate_12 (Concatenate)    (None, 60)           0           Encoder_scRNAseq[0][0]           
                                                                 Encoder_scProteomics[0][0]       
__________________________________________________________________________________________________
Bottleneck (Dense)              (None, 50)           3050        concatenate_12[0][0]             
__________________________________________________________________________________________________
Concatenate_Inverse (Dense)     (None, 60)           3060        Bottleneck[0][0]                 
__________________________________________________________________________________________________
Decoder_scRNAseq (Dense)        (None, 976)          59536       Concatenate_Inverse[0][0]        
__________________________________________________________________________________________________
Decoder_scProteomics (Dense)    (None, 10)           610         Concatenate_Inverse[0][0]        
==================================================================================================
Total params: 115,216
Trainable params: 115,216
Non-trainable params: 0
__________________________________________________________________________________________________

A very handy thing here is that we can assign different loss functions to OMICs coming from different statistical distributions, e.g. combining categorical and continuous data we can apply categorical cross entropy and mean squared error, respectively. Another great thing about data integration via Autoencoders is that all OMICs know about each other as the weights for each node / feature are updated through back propagation in the context of each other. Finally, let us train the Autoencoder and feed the bottleneck into tSNE for visualization:

In [36]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Autoencoder training
estimator = autoencoder.fit([X_scRNAseq, X_scProteomics], 
                            [X_scRNAseq, X_scProteomics], 
                            epochs = 10, batch_size = 128, 
                            validation_split = 0.2, shuffle = True, verbose = 1)
print("Training Loss: ",estimator.history['loss'][-1])
print("Validation Loss: ",estimator.history['val_loss'][-1])

plt.figure(figsize = (20,15))
plt.plot(estimator.history['loss'])
plt.plot(estimator.history['val_loss'])
plt.title('Model Loss', fontsize = 22)
plt.ylabel('Loss', fontsize = 22)
plt.xlabel('Epoch', fontsize = 22)
plt.legend(['Train','Validation'], loc = 'upper right', fontsize = 22)
plt.show()

# Encoder model
encoder = Model([input_dim_scRNAseq, input_dim_scProteomics], bottleneck)
bottleneck_representation = encoder.predict([X_scRNAseq, X_scProteomics])

# tSNE on Autoencoder bottleneck representation
model_tsne_auto = TSNE(learning_rate = 200, n_components = 2, random_state = 123, 
                       perplexity = 90, n_iter = 1000, verbose = 1)
tsne_auto = model_tsne_auto.fit_transform(bottleneck_representation)

plt.figure(figsize = (20,15))
plt.scatter(tsne_auto[:, 0], tsne_auto[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on Autoencoder: Data Integration, CITEseq', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
Epoch 1/10
54/54 [==============================] - 0s 6ms/step - loss: 5.0046 - Decoder_scRNAseq_loss: 0.2470 - Decoder_scProteomics_loss: 4.7576 - val_loss: 0.9255 - val_Decoder_scRNAseq_loss: 0.0851 - val_Decoder_scProteomics_loss: 0.8403
Epoch 2/10
54/54 [==============================] - 0s 4ms/step - loss: 0.7506 - Decoder_scRNAseq_loss: 0.1964 - Decoder_scProteomics_loss: 0.5542 - val_loss: 0.5149 - val_Decoder_scRNAseq_loss: 0.0706 - val_Decoder_scProteomics_loss: 0.4442
Epoch 3/10
54/54 [==============================] - 0s 4ms/step - loss: 0.4644 - Decoder_scRNAseq_loss: 0.1257 - Decoder_scProteomics_loss: 0.3386 - val_loss: 0.3815 - val_Decoder_scRNAseq_loss: 0.0629 - val_Decoder_scProteomics_loss: 0.3186
Epoch 4/10
54/54 [==============================] - 0s 4ms/step - loss: 0.3156 - Decoder_scRNAseq_loss: 0.0697 - Decoder_scProteomics_loss: 0.2459 - val_loss: 0.3129 - val_Decoder_scRNAseq_loss: 0.0584 - val_Decoder_scProteomics_loss: 0.2545
Epoch 5/10
54/54 [==============================] - 0s 4ms/step - loss: 0.2548 - Decoder_scRNAseq_loss: 0.0631 - Decoder_scProteomics_loss: 0.1917 - val_loss: 0.2699 - val_Decoder_scRNAseq_loss: 0.0544 - val_Decoder_scProteomics_loss: 0.2156
Epoch 6/10
54/54 [==============================] - 0s 4ms/step - loss: 0.2187 - Decoder_scRNAseq_loss: 0.0607 - Decoder_scProteomics_loss: 0.1580 - val_loss: 0.2286 - val_Decoder_scRNAseq_loss: 0.0522 - val_Decoder_scProteomics_loss: 0.1765
Epoch 7/10
54/54 [==============================] - 0s 4ms/step - loss: 0.1944 - Decoder_scRNAseq_loss: 0.0592 - Decoder_scProteomics_loss: 0.1352 - val_loss: 0.2083 - val_Decoder_scRNAseq_loss: 0.0503 - val_Decoder_scProteomics_loss: 0.1580
Epoch 8/10
54/54 [==============================] - 0s 4ms/step - loss: 0.1773 - Decoder_scRNAseq_loss: 0.0580 - Decoder_scProteomics_loss: 0.1193 - val_loss: 0.1816 - val_Decoder_scRNAseq_loss: 0.0486 - val_Decoder_scProteomics_loss: 0.1329
Epoch 9/10
54/54 [==============================] - 0s 4ms/step - loss: 0.1644 - Decoder_scRNAseq_loss: 0.0571 - Decoder_scProteomics_loss: 0.1073 - val_loss: 0.1683 - val_Decoder_scRNAseq_loss: 0.0473 - val_Decoder_scProteomics_loss: 0.1210
Epoch 10/10
54/54 [==============================] - 0s 4ms/step - loss: 0.1538 - Decoder_scRNAseq_loss: 0.0563 - Decoder_scProteomics_loss: 0.0975 - val_loss: 0.1557 - val_Decoder_scRNAseq_loss: 0.0463 - val_Decoder_scProteomics_loss: 0.1095
Training Loss:  0.15378396213054657
Validation Loss:  0.15572811663150787
[t-SNE] Computing 271 nearest neighbors...
[t-SNE] Indexed 8617 samples in 0.047s...
[t-SNE] Computed neighbors for 8617 samples in 3.367s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8617
[t-SNE] Computed conditional probabilities for sample 2000 / 8617
[t-SNE] Computed conditional probabilities for sample 3000 / 8617
[t-SNE] Computed conditional probabilities for sample 4000 / 8617
[t-SNE] Computed conditional probabilities for sample 5000 / 8617
[t-SNE] Computed conditional probabilities for sample 6000 / 8617
[t-SNE] Computed conditional probabilities for sample 7000 / 8617
[t-SNE] Computed conditional probabilities for sample 8000 / 8617
[t-SNE] Computed conditional probabilities for sample 8617 / 8617
[t-SNE] Mean sigma: 0.329691
[t-SNE] KL divergence after 250 iterations with early exaggeration: 65.723244
[t-SNE] KL divergence after 1000 iterations: 1.175469

Comparing the tSNE plots obtained using individual OMICs with the tSNE on the bottleneck of the Autoencoder that combines the data, we can immediately see that the integration somewhat averages and reinforces the individual OMICs. For example, the purple cluster would be hard to discover using the scRNAseq data alone as it is not distinct from the blue cell population, however after integration the purple group of cells is easily distinguishable. This is the power of data integration!

Integrating scNMTseq data with Deep Learning

While CITEseq includes two single cell levels of information (transcriptomics and proteomics), another fantastic technology, scNMTseq, delivers three OMICs from the same biological cells: 1) transcriptomics (scRNAseq), 2) methylation pattern (scBSseq), and 3) open chromatin regions (scATACseq). The raw data can be downloaded from here.

In [37]:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningDataIntegration/scNMT_Method.png', width=2000)
Out[37]:

The architecture of the Autoencoder is analogous to the one used for CITEseq with only one peculiarity: Dropout regularization is used on the input layers. This is due to the fact that we have only ~120 cells sequenced while the dimensionality of the feature space is tens of thousands, so we need to apply regularization to overcome the curse of dimensionality. Note that this was not necessary for CITEseq where we had ~8K cells and ~1K features, so exactly opposite situation. Nevertheless, overall scNMTseq is not an easy case for data integration, I firmly believe though that this is just the beginning of single cell multi-OMICs era and many more cells will arrive soon from this exciting technology, so it is better to be prepared.

In [64]:
import os
import numpy as np
import pandas as pd
from umap import UMAP
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from tensorflow.keras.layers import concatenate
from tensorflow.keras.layers import Input, Dense, Dropout

os.chdir('/home/nikolay/WABI/Misc/SingleCell/scNMTseq/')

################## READ AND TRANSFORM DATA ##################
scRNAseq = pd.read_csv('scRNAseq.txt',sep='\t')
scBSseq = pd.read_csv('scBSseq.txt',sep='\t')
scATACseq = pd.read_csv('scATACseq.txt',sep='\t')

X_scRNAseq = scRNAseq.values[:,0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:,scRNAseq.shape[1]-1]
X_scBSseq = scBSseq.values[:,0:(scBSseq.shape[1]-1)]
Y_scBSseq = scBSseq.values[:,scBSseq.shape[1]-1]
X_scATACseq = scATACseq.values[:,0:(scATACseq.shape[1]-1)]
Y_scATACseq = scATACseq.values[:,scATACseq.shape[1]-1]

X_scRNAseq = np.log(X_scRNAseq + 1)
X_scBSseq = np.log(X_scBSseq + 1)
X_scATACseq = np.log(X_scATACseq + 1)

print('scRNAseq dimensions: ' + str(X_scRNAseq.shape))
print('scBSseq dimensions: ' + str(X_scBSseq.shape))
print('scATACseq dimensions: ' + str(X_scATACseq.shape))
scRNAseq dimensions: (113, 12313)
scBSseq dimensions: (113, 8574)
scATACseq dimensions: (113, 11799)

Again, let us display the tSNE embeddings of the individual scRNAseq layers from the scNMT data set before integrating it together with scBSseq and scATACseq with Autoencoder.

In [69]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

X_reduced = PCA(n_components = 5).fit_transform(X_scRNAseq)

model_tsne_scRNAseq = TSNE(n_components = 2, random_state = 123, perplexity = 10, n_iter = 1000, verbose = 1)
tsne_scRNAseq = model_tsne_scRNAseq.fit_transform(X_reduced)

plt.figure(figsize = (20,15))
plt.scatter(tsne_scRNAseq[:, 0], tsne_scRNAseq[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 50)
plt.title('tSNE on scRNAseq from scNMT', fontsize = 22)
plt.xlabel("tSNE1", fontsize = 22)
plt.ylabel("tSNE2", fontsize = 22)
plt.show()
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 113 samples in 0.000s...
[t-SNE] Computed neighbors for 113 samples in 0.002s...
[t-SNE] Computed conditional probabilities for sample 113 / 113
[t-SNE] Mean sigma: 10.616893
[t-SNE] KL divergence after 250 iterations with early exaggeration: 62.282795
[t-SNE] KL divergence after 1000 iterations: 0.346419
In [63]:
######################## AUTOENCODER ########################
# Input Layer
ncol_scRNAseq = X_scRNAseq.shape[1]
input_dim_scRNAseq = Input(shape = (ncol_scRNAseq, ), name = "scRNAseq")
ncol_scBSseq = X_scBSseq.shape[1]
input_dim_scBSseq = Input(shape = (ncol_scBSseq, ), name = "scBSseq")
ncol_scATACseq = X_scATACseq.shape[1]
input_dim_scATACseq = Input(shape = (ncol_scATACseq, ), name = "scATACseq")

encoding_dim_scRNAseq = 10
encoding_dim_scBSseq = 10
encoding_dim_scATACseq = 10

# Dropout on Input Layer
dropout_scRNAseq = Dropout(0.2, name = "Dropout_scRNAseq")(input_dim_scRNAseq)
dropout_scBSseq = Dropout(0.2, name = "Dropout_scBSseq")(input_dim_scBSseq)
dropout_scATACseq = Dropout(0.2, name = "Dropout_scATACseq")(input_dim_scATACseq) 

# Encoder layer for each OMIC
encoded_scRNAseq = Dense(encoding_dim_scRNAseq, activation = 'elu', 
                         name = "Encoder_scRNAseq")(dropout_scRNAseq)
encoded_scBSseq = Dense(encoding_dim_scBSseq, activation = 'elu', 
                        name = "Encoder_scBSseq")(dropout_scBSseq)
encoded_scATACseq = Dense(encoding_dim_scATACseq, activation = 'elu', 
                          name = "Encoder_scATACseq")(dropout_scATACseq)

# Merging Encoder layers from different OMICs
merge = concatenate([encoded_scRNAseq, encoded_scBSseq, encoded_scATACseq])

# Bottleneck compression
bottleneck = Dense(5, kernel_initializer = 'uniform', activation = 'elu', 
                   name = "Bottleneck")(merge)

#Inverse merging
merge_inverse = Dense(encoding_dim_scRNAseq + encoding_dim_scBSseq + 
                      encoding_dim_scATACseq, 
                      activation = 'elu', name = "Concatenate_Inverse")(bottleneck)

# Decoder layer for each OMIC
decoded_scRNAseq = Dense(ncol_scRNAseq, activation = 'elu', 
                         name = "Decoder_scRNAseq")(merge_inverse)
decoded_scBSseq = Dense(ncol_scBSseq, activation = 'elu', 
                        name = "Decoder_scBSseq")(merge_inverse)
decoded_scATACseq = Dense(ncol_scATACseq, activation = 'elu', 
                          name = "Decoder_scATACseq")(merge_inverse)

# Combining Encoder and Decoder into an Autoencoder model
autoencoder = Model([input_dim_scRNAseq, input_dim_scBSseq, input_dim_scATACseq], 
                    [decoded_scRNAseq, decoded_scBSseq, decoded_scATACseq])

# Compile Autoencoder
autoencoder.compile(optimizer = 'adam', 
                    loss={'Decoder_scRNAseq': 'mean_squared_error', 
                          'Decoder_scBSseq': 'binary_crossentropy', 
                          'Decoder_scATACseq': 'binary_crossentropy'})
autoencoder.summary()

# Autoencoder training
estimator = autoencoder.fit([X_scRNAseq, X_scBSseq, X_scATACseq], 
                            [X_scRNAseq, X_scBSseq, X_scATACseq], epochs = 20, 
                            batch_size = 16, validation_split = 0.2, 
                            shuffle = True, verbose = 0)
print("Training Loss: ",estimator.history['loss'][-1])
print("Validation Loss: ",estimator.history['val_loss'][-1])

plt.figure(figsize = (20,15))
plt.plot(estimator.history['loss']); plt.plot(estimator.history['val_loss'])
plt.title('Model Loss', fontsize = 22); plt.ylabel('Loss', fontsize = 22); plt.xlabel('Epoch', fontsize = 22)
plt.legend(['Train','Validation'], loc = 'upper right', fontsize = 22)

# Encoder model
encoder = Model([input_dim_scRNAseq, input_dim_scBSseq, input_dim_scATACseq], bottleneck)
bottleneck_representation = encoder.predict([X_scRNAseq, X_scBSseq, X_scATACseq])

############### UNIFORM MANIFOLD APPROXIMATION AND PROJECTION (UMAP) ###############
model_umap = UMAP(n_neighbors = 10, min_dist = 0.5, n_components = 2)
umap = model_umap.fit_transform(bottleneck_representation)

plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scRNAseq, cmap = 'tab10', s = 50)
plt.title('UMAP on Autoencoder: Data Integration, scNMTseq', fontsize = 22)
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
Model: "functional_125"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
scRNAseq (InputLayer)           [(None, 12313)]      0                                            
__________________________________________________________________________________________________
scBSseq (InputLayer)            [(None, 8574)]       0                                            
__________________________________________________________________________________________________
scATACseq (InputLayer)          [(None, 11799)]      0                                            
__________________________________________________________________________________________________
Dropout_scRNAseq (Dropout)      (None, 12313)        0           scRNAseq[0][0]                   
__________________________________________________________________________________________________
Dropout_scBSseq (Dropout)       (None, 8574)         0           scBSseq[0][0]                    
__________________________________________________________________________________________________
Dropout_scATACseq (Dropout)     (None, 11799)        0           scATACseq[0][0]                  
__________________________________________________________________________________________________
Encoder_scRNAseq (Dense)        (None, 10)           123140      Dropout_scRNAseq[0][0]           
__________________________________________________________________________________________________
Encoder_scBSseq (Dense)         (None, 10)           85750       Dropout_scBSseq[0][0]            
__________________________________________________________________________________________________
Encoder_scATACseq (Dense)       (None, 10)           118000      Dropout_scATACseq[0][0]          
__________________________________________________________________________________________________
concatenate_32 (Concatenate)    (None, 30)           0           Encoder_scRNAseq[0][0]           
                                                                 Encoder_scBSseq[0][0]            
                                                                 Encoder_scATACseq[0][0]          
__________________________________________________________________________________________________
Bottleneck (Dense)              (None, 5)            155         concatenate_32[0][0]             
__________________________________________________________________________________________________
Concatenate_Inverse (Dense)     (None, 30)           180         Bottleneck[0][0]                 
__________________________________________________________________________________________________
Decoder_scRNAseq (Dense)        (None, 12313)        381703      Concatenate_Inverse[0][0]        
__________________________________________________________________________________________________
Decoder_scBSseq (Dense)         (None, 8574)         265794      Concatenate_Inverse[0][0]        
__________________________________________________________________________________________________
Decoder_scATACseq (Dense)       (None, 11799)        365769      Concatenate_Inverse[0][0]        
==================================================================================================
Total params: 1,340,491
Trainable params: 1,340,491
Non-trainable params: 0
__________________________________________________________________________________________________
Training Loss:  -2.740434169769287
Validation Loss:  1.7009251117706299

Here out of curiosity I fed the bottleneck of the Autoencoder that combines the three scNMTseq OMICs into Uniform Manifold Approximation and Projection (UMAP) non-linear dimensionality reduction technique which seems to outperform tSNE in sense of scalability for large amounts of data. We can immediately see that the homogeneous in sense of gene expression brown cluster splits into two clusters when scRNAseq is combined with epigenetics information from the same cells (scBSseq and scATACseq). Therefore it seems that we have captured a new heterogeneity between cells which was hidden when looking only at gene expression scRNAseq data. Can this be a new way of classifying cells across populations by using the whole complexity of their biology? If so, then the question comes: what is a cell population or cell type? I do not know the answer for this question.

Summary

Here we have learnt that multiple sources of molecular and clinical information are becoming common in Biology and Biomedicine thanks to the recent technological progress. Therefore data integration is a logical next step which provides a more comprehensive understanding of the biological processes by utilizing the whole complexity of the data. Deep Learning framework is ideally suited for data integration due to its truly “integrative” updating of parameters through back propagation when multiple data types learn information from each other. I showed that data integration can result in discoveries of novel patterns in the data which were not previously seen in the individual data types.

In [ ]: