UMAP for Data Integration¶

Here we will overlap graphs from single cell RNAseq and Proteomics data sets from the CITE-seq technology, and produce a consensus / integrative OMICs UMAP plot. The idea is to convert the two data types into graphs, i.e. non-parametric space where they forget their technological differences, and overlap the graphs. The resulting graph will be used for constructing the UMAP embeddings.

In [1]:
from IPython.display import Image
Image('assests/UMAP_DataIntegration.png', width=2000)
Out[1]:
No description has been provided for this image

We will start with reading and log-transforming the OMICs data.

In [2]:
import os
import numpy as np
import pandas as pd
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.layers import concatenate

import warnings
warnings.filterwarnings("ignore")

scRNAseq = pd.read_csv('data/scRNAseq.txt', sep='\t')
scProteomics = pd.read_csv('data/scProteomics.txt', sep='\t')

X_scRNAseq = scRNAseq.values[:, 0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:, scRNAseq.shape[1]-1]
X_scProteomics = scProteomics.values[:, 0:(scProteomics.shape[1]-1)]
Y_scProteomics = scProteomics.values[:, scProteomics.shape[1]-1]

X_scRNAseq = np.log(X_scRNAseq + 1)
X_scProteomics = np.log(X_scProteomics + 1)
2024-10-15 09:44:41.921077: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-15 09:44:41.963042: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
In [3]:
X_scRNAseq.shape
Out[3]:
(8617, 976)
In [4]:
X_scProteomics.shape
Out[4]:
(8617, 10)

Let us display the UMAP embeddings of individual OMICs data sets. We will start with the scRNAseq dataset. As an optimal number of nearest neighbors we will choose sqrt(N) = sqrt(8617) = 93.

In [5]:
from umap import UMAP
from sklearn.decomposition import PCA

X_reduced = PCA(n_components = 20).fit_transform(X_scRNAseq)
model = UMAP(n_components = 2, min_dist = 1, n_neighbors = 93, init = X_reduced[:, 0:2], n_epochs = 1000, 
             verbose = 2)
umap = model.fit_transform(X_reduced)

plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 20)
plt.title('UMAP: scRNAseq', fontsize = 25); 
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(init=array([[45.31235894, 0.07477159],
       [46.53271754, 1.71486063],
       [46.8638235 , 0.98846408],
       ...,
       [ 1.11426643, -1.86512337],
       [ 4.54796886, -0.69171758],
       [ 2.61794017, -0.97065852]]), min_dist=1, n_epochs=1000, n_neighbors=93, verbose=2)
Tue Oct 15 09:44:56 2024 Construct fuzzy simplicial set
Tue Oct 15 09:44:56 2024 Finding Nearest Neighbors
Tue Oct 15 09:44:56 2024 Building RP forest with 10 trees
Tue Oct 15 09:45:00 2024 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	Stopping threshold met -- exiting after 3 iterations
Tue Oct 15 09:45:16 2024 Finished Nearest Neighbor Search
Tue Oct 15 09:45:18 2024 Construct embedding
Epochs completed:   0%|            4/1000 [00:01]
	completed  0  /  1000 epochs
Epochs completed:  10%| â–ˆ          101/1000 [00:25]
	completed  100  /  1000 epochs
Epochs completed:  20%| ██         201/1000 [00:51]
	completed  200  /  1000 epochs
Epochs completed:  30%| ███        301/1000 [01:18]
	completed  300  /  1000 epochs
Epochs completed:  40%| ████       401/1000 [01:44]
	completed  400  /  1000 epochs
Epochs completed:  50%| █████      501/1000 [02:11]
	completed  500  /  1000 epochs
Epochs completed:  60%| ██████     601/1000 [02:38]
	completed  600  /  1000 epochs
Epochs completed:  70%| ███████    701/1000 [03:05]
	completed  700  /  1000 epochs
Epochs completed:  80%| ████████   801/1000 [03:32]
	completed  800  /  1000 epochs
Epochs completed:  90%| █████████  901/1000 [03:59]
	completed  900  /  1000 epochs
Epochs completed: 100%| ██████████ 1000/1000 [04:25]
Tue Oct 15 09:49:44 2024 Finished embedding
No description has been provided for this image

Now let us display UMAP embeddings for the scProteomics dataset of the CITE-seq sequencing technology.

In [6]:
X_reduced = PCA(n_components = 10).fit_transform(X_scProteomics)
model = UMAP(n_components = 2, min_dist = 0.8, n_neighbors = 93, init = X_reduced[:, 0:2], n_epochs = 1000, 
             verbose = 2)
umap = model.fit_transform(X_reduced)

plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scProteomics, cmap = 'tab20', s = 20)
plt.title('UMAP: scProteomics', fontsize = 25); 
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
UMAP(init=array([[-1.65366402, -0.2564025 ],
       [-1.7084814 , 1.69115812],
       [-1.03060202, -0.13172222],
       ...,
       [-2.19859839, -5.16795233],
       [-2.10143214, -2.48594223],
       [-2.43134931, -3.33697131]]), min_dist=0.8, n_epochs=1000, n_neighbors=93, verbose=2)
Tue Oct 15 09:49:45 2024 Construct fuzzy simplicial set
Tue Oct 15 09:49:45 2024 Finding Nearest Neighbors
Tue Oct 15 09:49:45 2024 Building RP forest with 10 trees
Tue Oct 15 09:49:45 2024 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	Stopping threshold met -- exiting after 3 iterations
Tue Oct 15 09:49:51 2024 Finished Nearest Neighbor Search
Tue Oct 15 09:49:52 2024 Construct embedding
Epochs completed:   0%|            4/1000 [00:01]
	completed  0  /  1000 epochs
Epochs completed:  10%| â–ˆ          101/1000 [00:26]
	completed  100  /  1000 epochs
Epochs completed:  20%| ██         201/1000 [00:52]
	completed  200  /  1000 epochs
Epochs completed:  30%| ███        301/1000 [01:19]
	completed  300  /  1000 epochs
Epochs completed:  40%| ████       401/1000 [01:46]
	completed  400  /  1000 epochs
Epochs completed:  50%| █████      501/1000 [02:12]
	completed  500  /  1000 epochs
Epochs completed:  60%| ██████     601/1000 [02:39]
	completed  600  /  1000 epochs
Epochs completed:  70%| ███████    701/1000 [03:05]
	completed  700  /  1000 epochs
Epochs completed:  80%| ████████   801/1000 [03:32]
	completed  800  /  1000 epochs
Epochs completed:  90%| █████████  901/1000 [03:58]
	completed  900  /  1000 epochs
Epochs completed: 100%| ██████████ 1000/1000 [04:25]
Tue Oct 15 09:54:17 2024 Finished embedding
No description has been provided for this image

Now we will construct the intersection of the two graphs / simplicial sets, and visualize the resulting embeddings.

In [7]:
import umap

X_reduced_scRNAseq = PCA(n_components = 20).fit_transform(X_scRNAseq)
X_reduced_scProteomics = PCA(n_components = 10).fit_transform(X_scProteomics)

fit1 = umap.UMAP(n_components = 2, min_dist = 1, n_neighbors = 93, n_epochs = 1000, 
                 init = X_reduced_scRNAseq[:, 0:2], verbose = 2).fit(X_reduced_scRNAseq)
fit2 = umap.UMAP(n_components = 2, min_dist = 0.8, n_neighbors = 93, n_epochs = 1000, 
                 init = X_reduced_scProteomics[:, 0:2], verbose = 2).fit(X_reduced_scProteomics)
intersection = umap.umap_. general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight = 0.45)
intersection = umap.umap_.reset_local_connectivity(intersection)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components, 
                                                fit1.learning_rate, fit1._a, fit1._b, 
                                                fit1.repulsion_strength, fit1.negative_sample_rate, 
                                                1000, 'random', np.random, fit1.metric, 
                                                fit1._metric_kwds, False, densmap_kwds=False, output_dens=False)
UMAP(init=array([[45.31235894, 0.07477159],
       [46.53271754, 1.71486063],
       [46.8638235 , 0.98846408],
       ...,
       [ 1.11426643, -1.86512337],
       [ 4.54796886, -0.69171758],
       [ 2.61794017, -0.97065852]]), min_dist=1, n_epochs=1000, n_neighbors=93, verbose=2)
Tue Oct 15 09:54:18 2024 Construct fuzzy simplicial set
Tue Oct 15 09:54:18 2024 Finding Nearest Neighbors
Tue Oct 15 09:54:18 2024 Building RP forest with 10 trees
Tue Oct 15 09:54:18 2024 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	Stopping threshold met -- exiting after 3 iterations
Tue Oct 15 09:54:24 2024 Finished Nearest Neighbor Search
Tue Oct 15 09:54:25 2024 Construct embedding
Epochs completed:   0%|            5/1000 [00:01]
	completed  0  /  1000 epochs
Epochs completed:  10%| â–ˆ          102/1000 [00:16]
	completed  100  /  1000 epochs
Epochs completed:  20%| ██         202/1000 [00:33]
	completed  200  /  1000 epochs
Epochs completed:  30%| ███        302/1000 [00:50]
	completed  300  /  1000 epochs
Epochs completed:  40%| ████       402/1000 [01:06]
	completed  400  /  1000 epochs
Epochs completed:  50%| █████      502/1000 [01:23]
	completed  500  /  1000 epochs
Epochs completed:  60%| ██████     602/1000 [01:40]
	completed  600  /  1000 epochs
Epochs completed:  70%| ███████    702/1000 [01:56]
	completed  700  /  1000 epochs
Epochs completed:  80%| ████████   802/1000 [02:13]
	completed  800  /  1000 epochs
Epochs completed:  90%| █████████  902/1000 [02:29]
	completed  900  /  1000 epochs
Epochs completed: 100%| ██████████ 1000/1000 [02:45]
Tue Oct 15 09:57:11 2024 Finished embedding
UMAP(init=array([[-1.65366402, -0.2564025 ],
       [-1.7084814 , 1.69115812],
       [-1.03060202, -0.13172222],
       ...,
       [-2.19859839, -5.16795233],
       [-2.10143214, -2.48594223],
       [-2.43134931, -3.33697131]]), min_dist=0.8, n_epochs=1000, n_neighbors=93, verbose=2)
Tue Oct 15 09:57:11 2024 Construct fuzzy simplicial set
Tue Oct 15 09:57:11 2024 Finding Nearest Neighbors
Tue Oct 15 09:57:11 2024 Building RP forest with 10 trees
Tue Oct 15 09:57:11 2024 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	Stopping threshold met -- exiting after 2 iterations
Tue Oct 15 09:57:16 2024 Finished Nearest Neighbor Search
Tue Oct 15 09:57:17 2024 Construct embedding
Epochs completed:   0%|            5/1000 [00:02]
	completed  0  /  1000 epochs
Epochs completed:  10%| â–ˆ          102/1000 [00:15]
	completed  100  /  1000 epochs
Epochs completed:  20%| ██         202/1000 [00:30]
	completed  200  /  1000 epochs
Epochs completed:  30%| ███        302/1000 [00:45]
	completed  300  /  1000 epochs
Epochs completed:  40%| ████       402/1000 [01:00]
	completed  400  /  1000 epochs
Epochs completed:  50%| █████      502/1000 [01:14]
	completed  500  /  1000 epochs
Epochs completed:  60%| ██████     602/1000 [01:29]
	completed  600  /  1000 epochs
Epochs completed:  70%| ███████    702/1000 [01:44]
	completed  700  /  1000 epochs
Epochs completed:  80%| ████████   802/1000 [01:58]
	completed  800  /  1000 epochs
Epochs completed:  90%| █████████  902/1000 [02:13]
	completed  900  /  1000 epochs
Epochs completed: 100%| ██████████ 1000/1000 [02:27]
Tue Oct 15 09:59:45 2024 Finished embedding
In [8]:
plt.figure(figsize = (20,15))
plt.scatter(embedding[0][:, 0], embedding[0][:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 20)
plt.title('UMAP: scRNAseq + scProteomics', fontsize = 25); 
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
No description has been provided for this image