1.1. Read the pancreas dataset
1.2. Why do we use an AnnData object?
1.3. Where are the un/spliced counts and annotations for observations and variables stored?
1.4. Describe in form of text the rational for these steps in your markdown report.
Load necessary packages, scanpy and scVelo.
import scanpy as sc
import scvelo as scv
import numpy as np
import matplotlib.pyplot as plt
Read in the pancreas dataset which is availible within the scvelo package.
adata = scv.datasets.pancreas()
The dataset is now stored in an AnnData object. AnnData is a popular format for storing single cell data used by scanpy and scVelo. It allows for a comprenhensive and scalable storage of data matrix and annotation information features and samples on different layers.
Spliced counts
adata.layers['spliced']
Unspliced counts
adata.layers['unspliced']
Cell annotation
adata.obs
Gene annotation
adata.var
2.1. Why do we need to filter out a number of genes?
2.2. What are meaningful gene selection criteria and why?
2.3. What is the rationale of normalizing the counts?
2.4. After gene selection and normalization the data is log-transformed. What is it necessary for?
2.5. Show how the PCA embeddings compare when you use log vs. non-log data.
In the preprocessing step we need to filter cells, genes and normalise.
Some genes don't provide any meaningful biological information. We filter them out to remove noise. Additionally, it cuts down computational time downstream.
Selection criteria for gene filtering (Luecken et al., 2019):
Next we need to normalise the counts because of differences in library depth between cells and sampling effects. Count depths can differ between identical cells due to inherent variance in the multiple steps of sequence capture, reverse transcription, and the sequencing proper. This serves to allow for cells to be intra-comparable. Genes can also be scaled, for example with log transformation.
Log transform is a canonical way to measure expression change. It mitigates mean-variance relationship and reduces data skewness (Brennecke et al., 2013).
adata_log = scv.pp.filter_and_normalize(adata, copy=True)
adata_nolog = scv.pp.filter_and_normalize(adata, copy=True, log=False)
sc.pl.highest_expr_genes(adata_log, n_top=20)
We also remove the Malat1 gene.
malat_list = ['Malat1']
malat_indicator = np.in1d(adata_log.var_names, malat_list)
adata_log = adata_log[:, ~malat_indicator]
sc.pl.highest_expr_genes(adata_log, n_top=20)
sc.tl.pca(adata_log, svd_solver='arpack')
sc.pl.pca(adata_log, color='clusters_coarse', title='WITH log transformation')
sc.tl.pca(adata_nolog, svd_solver='arpack')
sc.pl.pca(adata_nolog, color='clusters_coarse', title='WITHOUT log transformation')
Log transformed data looks better, we can already see cells clustering together within predefined clusters. Without log transformation data is heavily skewed.
3.1. The PCA representation (computed on the log-normalized data) is used to compute a neighbor graph from cell-to-cell distances. What does this graph represent and what can it be used for?
3.2. Now, we use the graph to compute first-order moments, i.e., for each cell, the mean (first-order moments) and variance (second-order moments) of its neighboring cells is computed. This can also be regarded as a knn-imputation of the counts.
3.3. Show how the k parameter (number of neighbors) impacts the imputed counts, e.g., by comparing the imputed gene expression of two genes with different values for k.
We keep the log transformed data and set the .raw
attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression.
adata = adata_log
adata.raw = adata
Now use the PCA representation of data matrix to compute neighborhood graph of cells. It is a knn graph, it represents distance and connectivity between cells. Each of the cells is connected to its K nearest neigboring cells. The default k in scVelo is 30.
Next we calcualte moments: for each cell, the mean (first-order moments) and variance (second-order moments) of its neighboring cells is computed. This can also be regarded as a knn-imputation of the counts.
The imputed counts vary depending on the number of neighbors (k). We inspect the relationship between imputed mean and variance (Ms and Mu) for different values of the k parameter.
adata_nn_10 = adata.copy()
adata_nn_30 = adata.copy()
adata_nn_100 = adata.copy()
adata_nn_300 = adata.copy()
scv.pp.neighbors(adata_nn_10, n_neighbors=10)
scv.pp.moments(adata_nn_10, n_neighbors=10)
scv.pp.neighbors(adata_nn_30, n_neighbors=30)
scv.pp.moments(adata_nn_30, n_neighbors=30)
scv.pp.neighbors(adata_nn_100, n_neighbors=100)
scv.pp.moments(adata_nn_100, n_neighbors=100)
scv.pp.neighbors(adata_nn_300, n_neighbors=300)
scv.pp.moments(adata_nn_300, n_neighbors=300)
scv.settings.set_figure_params(dpi=100)
scv.pl.scatter(adata_nn_10, ['Sulf2'], color='clusters')
scv.pl.scatter(adata_nn_30, ['Sulf2'], color='clusters')
scv.pl.scatter(adata_nn_300, ['Sulf2'], color='clusters')
scv.pl.scatter(adata_nn_30, ['Abcc8'], color='clusters')
scv.pl.scatter(adata_nn_100, ['Abcc8'], color='clusters')
In the end we continue with the default value of 30 neigbors as it allows to see the pattern of induction/repression in the first- and second-order moments without overfitting.
If we use too few neighbors (k=10
), the result is a nosiy blob in the middle, which makes it difficult to inffer meaningful biological information. Too many neighbors (k=300
) produces unnatural results - it completely smoothes out the variance. Even with less neghbors, k=100
, it can produce artifacts like in the case of Abcc8. The default value of k=30
seems to work fine, since already at this point we can see the patterns of induction and repression.
adata = adata_nn_30
4.1. For velocity estimation, per default the stochastic model is applied. It estimates the steady-state ratio for each gene, and computes velocities as the residuals of the observations from this steady-state line.
4.2. Demonstrate for a few genes, how the steady-state line is fit. Can you show mathematically why velocities are given as the residuals from this line?
4.3. Now, apply the dynamical model instead and show how the steady-state lines differ.
4.4. Now use a few gene phase portraits to demonstrate how inferred up- and down-regulation can be read from these portraits.
4.5. What are transient states as opposed to steady states? How many steady states exist?
We compute the velocity. The full dynamical model can be obtained by setting mode='dynamical'
, it is however, very time consuming to run. Therefore, we first test the default stochastic
mode, which allows for good approximation of the full model results.
scv.tl.velocity(adata)
scv.pl.scatter(adata, ['Abcc8', 'Sulf2', 'Top2a'], color='clusters')
Rate equations for a single gene, which describes how the expected number of unspliced mRNA molecules 𝑢, and spliced molecules 𝑠, evolve over time:
Here, 𝛼(𝑡) is the time-dependent rate of transcription, 𝛽(𝑡) is the rate of splicing, 𝛾(𝑡) is the rate of degradation. Under an assumption of constant (time-independent) rates 𝛼(𝑡) = 𝛼, 𝛾(𝑡) = 𝛾, and setting 𝛽(𝑡) = 1 (i.e. measuring all rates in units of the splicing rate), the rate equations simplify to:
$\gamma s(𝑡)$ is the steady state line and $u(𝑡)$ is the observed amount of unspliced counts. Therefore velocity is the residual between $u(t)$ and the steady state line.
To try if we can get a better fit of the steady-state line, we now use the full dynamical model.
scv.tl.recover_dynamics(adata, n_jobs=4)
scv.tl.velocity(adata, mode='dynamical')
We look at the genes which show up-regulation, meaning increased number of cells in the inducation phase.
sc.tl.umap(adata)
sc.pl.umap(adata, color='clusters', title='UMAP embedding')
scv.pl.scatter(adata, ['Cpe', 'Nnat', 'Gng12', 'Pcsk2'], color='clusters', ncols=2)
In a simialr way we can also detect down-regulated genes, with more cells in the repression phase.
sc.pl.umap(adata, color='clusters', title='UMAP embedding')
scv.pl.scatter(adata, ['Adk', 'Sulf2'], color='clusters')
The original model of RNA velocity, the deterministic model, assumes one splicing rate common for all of the genes ($\beta = 1$). This means that the model only fits the two steady states (active and inactive). In contrast, the dynamical model also calculates the transient states, induction and repression (along with the two steady states). Those two additional transient states are determined based on gene-specific transcriptional rates, from which a likelihood-dynamic model is built on a gene-shared latent time. This allows for a more comprehensive and faithful representation of the underlying gene dynamics. It more accuarately models the heterogeneity of celltypes and gene expression patterns in a dataset.
5.1. Compute various dimensionality reduction methods / embeddings (e.g., tsne, umap, diffmap) and show how they differ and what type of topology they capture (local vs. global similarity).
5.2. As you vary the main parameters (e.g., min_dist in umap), how does that impact the embedding?
5.3. Which of these embeddings would you use for visualising trajectories and why?
sc.tl.umap(adata)
sc.pl.umap(adata, color='clusters', title='UMAP embedding')
sc.tl.tsne(adata)
sc.pl.tsne(adata, color='clusters', title='Tsne embedding')
sc.tl.diffmap(adata)
sc.pl.diffmap(adata, color='clusters', title='Diffmap embedding')
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='clusters', title='Force directed embedding')
Embedding calculations offer parameters which we can modify. Depending on the type of embedding the parameters may affect different aspects of the embedding creation. In the case of UMAP, the two parameters, min_dist
and spread
, control how far spread out the cells are from each other.
sc.tl.umap(adata, min_dist=0.2, spread=2)
sc.pl.umap(adata, color='clusters', title='min_dist=0.2, spread=2')
sc.tl.umap(adata, min_dist=1, spread=0.5)
sc.pl.umap(adata, color='clusters', title='min_dist=1, spread=0.5')
Lowering the spread parameter forces the cells more in-line with the global topology, but get rids of some of the finer local structers. For example, the ductal cells loose the circular pattern of cell cycle. We should evaluate the results keeping in mind the underlying biology of the dataset and how well the results represent it.
Tsne captures well the local topology of the dataset, but fails at accurately representing the global structures.
Embeddings like diffmap (tl.diffmap
) or the force-directed graph drawing (tl.draw_graph
) are the best for visualising trajectories, because they focus the most on capturing the global topolgy by sacryficing local differneces between cells.
UMAP is a good middle ground between local and global structure. UMAP parameters can be further tweaked to better represent specific features of the respective datasets.
sc.tl.umap(adata)
sc.pl.umap(adata, color='clusters', title='default')
6.1. The velocity graph summarises all cell-to-cell transitions inferred from velocity. What exactly does the ij-th entry in the graph represent?
6.2. Can you run a number of cell transitions using the graph to simulate a trajectory starting from a particular cell in one of the embeddings?
6.3. How do the trajectories compare, when run from different starting cells (e.g., cell cycle vs. differentiation)?
scv.tl.velocity_graph(adata)
The velocity graph is a graph of cell-to-cell transitions inferred from velocity. For two cells, $i$ and $j$, it represents cosine similarities between velocity vector $v_i$ and gene expression change $x_j - x_i$.
# NB! Due to the randomness of the transition path selection, the trajectories may look different between runs of the code
seed = 120
root_cell_cc = np.flatnonzero(adata.obs['clusters'] == 'Ductal')[1]
root_cell_diff = np.flatnonzero(adata.obs['clusters'] == 'Ngn3 high EP')[0]
x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=root_cell_cc, random_state=seed)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=root_cell_diff, random_state=seed)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
At basal developmental stages cells can display more locally confined trajectories without clear transitions into other celltypes/clusters, indicating cell cycle-related velocity. Cells from more developmentally advanced clusters will usualy exhibit a more clear trajectory towards more mature/terminally differentiated cell types.
7.1. Now the high-dimensional velocity vector field is visualized by projection into the two-dimensional embedding. Show how the velocity graph (transition matrix) is being used to project the data. For instance, you can pick one particular cell, and show its potential transitions, and how the actual arrow is derived from these.
7.2. Vary with the n_neighbors parameter and show how that impacts the projected vector field.
7.3. Why is the vector field getting smoother/averaged with a large n_neighbors parameter?
scv.settings.set_figure_params(dpi=300)
adata = sc.read('afterTuesday.h5ad')
scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120)
We pick a random selected_cell
from a cluster
and visualise the transition probabilities (trans
) for that cell.
This one cell from Ngn3 high cluster have the high probability transitions very focused in the immediate neighbourhood along the differentation trajectory.
cluster = 'Ngn3 high EP'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[0]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
To measure differentiation potential of a given cell we calculate variance on the UMAP coordinates cells with high transition probabilites.
On the other hand, this other cell have a much larger (further) transition. Again, the cell transitions outside its own cluster.
cluster = 'Ngn3 high EP'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[4]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
On the other hand, some of the Ductal cells fall into the cell cycle loop and stay in their own cluster.
cluster = 'Ductal'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[1]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
Delta cells stay in the delta cluster.
cluster = 'Delta'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[0]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
Altough some have a high probabilty going against their cell fate.
cluster = 'Delta'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[1]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
Some cells have a probability of transitioning into multiple states.
cluster = 'Pre-endocrine'
selected_cell = np.flatnonzero(adata.obs['clusters'] == cluster)[42]
trans = scv.utils.get_transition_matrix(adata).todense()
ax = scv.pl.scatter(adata[adata.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
ax = scv.pl.velocity_embedding(adata[adata.obs['clusters'] == cluster][selected_cell], color='black', ax=ax, size=0, arrow_length=30, arrow_size=10)
trans_cells = trans[:,selected_cell] > 0
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
print("Variance: ", np.mean([variance_x, variance_y]))
The code below can be used to calculate the variance for all the cells in the dataset. Because it contains a loop through all the cells, it crushes the AWS instance. Therefore run it only on your own machine.
trans = scv.utils.get_transition_matrix(adata).todense()
variance_array = []
for selected_cell in range(len(adata.obs_names)):
# Keep only cells with positive transition probability
trans_cells = trans[:,selected_cell] > 0.0001
# Remove the selected cell itself
trans_cells[selected_cell] = False
x = np.array(adata[trans_cells].obsm['X_umap'][:,0])
y = np.array(adata[trans_cells].obsm['X_umap'][:,1])
x_center = (x - x.mean())
y_center = (y - y.mean())
variance_x = np.var(x_center)
variance_y = np.var(y_center)
variance_mean = np.mean([variance_x, variance_y])
variance_array.append(variance_mean)
variance_array_nonan = np.nan_to_num(variance_array)
adata.obs['transition_variance'] = variance_array_nonan
sc.pl.umap(adata, color='transition_variance', cmap='YlOrRd')
scv.settings.set_figure_params(dpi=150)
scv.pl.velocity_graph(adata, color='clusters', n_neighbors=3)
scv.pl.velocity_graph(adata, color='clusters', n_neighbors=10)
scv.pl.velocity_graph(adata, color='clusters', n_neighbors=100)
scv.pl.velocity_graph(adata, color='clusters', n_neighbors=500)
The vector field getting smoother/averaged with a large n_neighbors parameter, because it increases the requirement for connectivities based on the neighborhood graph, converging the vectors and therefore smoothing them.
8.1.1. Identification of dynamically relevant genes is motivated in three ways:
8.1.2. We would like to understand, which genes are the ones that drive the vector field representation in the embeeding, i.e., which are most impactful on the projection.
8.1.3. We would like to understand, which gene are transiently expressed, i.e., dynamically activating in the differentiation process.
8.1.4. Finally, we want to know which genes are actual driver of the underlying biological processes.
8.1.5. Using only a small number of selected genes, show how the velocity projection differs from the projection using all genes. Can you quantify that difference?
8.1.6. There are different ways of computing ‘relevant’ genes in scvelo. Show, which of them best represents the entire vector field.
8.1.7. Which genes are supporting / contradicting the embedded trajectories? How does the projection differ, as you subset to these genes or exclude these genes?
8.1.8. Find more ways of identifying important genes, e.g., using PCA on the velocity vector field.
scv.settings.set_figure_params(dpi=150)
adata = sc.read('afterTuesday.h5ad')
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head(n=10)
scv.tl.velocity_graph(adata)
adata.uns['velocity_graph_all'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, title='All')
scv.tl.velocity_graph(adata, gene_subset=['Veph1','Creb5'])
adata.uns['velocity_graph_ductal'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, title='Ductal')
scv.tl.velocity_graph(adata, gene_subset=['Rims3','Col6a6'])
adata.uns['velocity_graph_beta'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, title='Beta')
scv.tl.velocity_graph(adata, gene_subset=['Ncor2','Spsb4'])
adata.uns['velocity_graph_delta'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, title='Delta')
scv.tl.velocity_graph(adata, gene_subset=['Ptprs','Ajap1'])
adata.uns['velocity_graph_ngn3HIGH'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, title='Ngn3 high EP')
adata
df_dict = {}
for uns in adata.uns:
if uns.startswith('velocity_graph_'):
if uns != 'velocity_graph_neg':
df = scv.get_df(adata, keys=[uns])
df_dict[uns] = df
corr_df_dict = {}
for keys, df in df_dict.items():
if keys != 'velocity_graph_all':
corr = df_dict[keys].corrwith(df_dict['velocity_graph_all'])
corr_df_dict[keys] = corr
for key in corr_df_dict:
adata.obs[key] = corr_df_dict[key].to_numpy()
sc.pl.umap(adata, color=corr_df_dict.keys(), ncols=2)
corr_df_merged = pd.concat(corr_df_w_index, axis=1, join='inner')
corr_df_merged.columns = corr_df_merged.columns.droplevel()
corr_df_merged = corr_df_merged.loc[:,~corr_df_merged.columns.duplicated()]
corr_df_merged = corr_df_merged.set_index('clusters', drop=True)
corr_df_merged = corr_df_merged.sort_index(ascending=True)
corr_df_merged
# Order by latent time / clusters
import seaborn as sns
sns.clustermap(corr_df_merged, cmap='gnuplot', row_cluster=False)
plt.show()
# Calculate dynamic genes
scv.tl.rank_dynamical_genes(adata, groupby='clusters')
df = scv.get_df(adata, 'rank_dynamical_genes/names')
# calculate velocity genes
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)
df2 = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
# Extract top 5 dynamic genes per cluster
dyn5Genes = []
for cluster in df:
for gene in df[cluster][0:5].tolist():
dyn5Genes.append(gene)
# Extract top 5 velocity genes per cluster
velo5Genes = []
for cluster in df2:
for gene in df2[cluster][0:5].tolist():
velo5Genes.append(gene)
# Velocity embedding based on top 5 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn5Genes)))
adata.uns['velocity_graph_top5_dynGenes'] = adata.uns['velocity_graph']
# Velocity embedding based on top 5 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo5Genes)))
adata.uns['velocity_graph_top5_veloGenes'] = adata.uns['velocity_graph']
# Extract top 10 dynamic genes per cluster
dyn10Genes = []
for cluster in df:
for gene in df[cluster][0:10].tolist():
dyn10Genes.append(gene)
# Extract top 10 velocity genes per cluster
velo10Genes = []
for cluster in df2:
for gene in df2[cluster][0:10].tolist():
velo10Genes.append(gene)
# Velocity embedding based on top 10 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn10Genes)))
adata.uns['velocity_graph_top10_dynGenes'] = adata.uns['velocity_graph']
# Velocity embedding based on top 10 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo10Genes)))
adata.uns['velocity_graph_top10_veloGenes'] = adata.uns['velocity_graph']
# Extract top 30 dynamic genes per cluster
dyn30Genes = []
for cluster in df:
for gene in df[cluster][0:30].tolist():
dyn30Genes.append(gene)
# Extract top 30 velocity genes per cluster
velo30Genes = []
for cluster in df2:
for gene in df2[cluster][0:30].tolist():
velo30Genes.append(gene)
# Velocity embedding based on top 30 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn30Genes)))
adata.uns['velocity_graph_top30_dynGenes'] = adata.uns['velocity_graph']
# Velocity embedding based on top 30 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo30Genes)))
adata.uns['velocity_graph_top30_veloGenes'] = adata.uns['velocity_graph']
# Create a dictionary to store all gene-based velocity projections
df_dict = {}
for uns in adata.uns:
if uns.startswith('velocity_graph_'):
if uns != 'velocity_graph_neg':
daf = scv.get_df(adata, keys=[uns])
daf.index = adata.obs.index.values
df_dict[uns] = daf
# Compute velocity correlations between gene-based projections and original projection
corr_df_dict = {}
for keys, values in df_dict.items():
if keys != 'velocity_graph_all':
corr = df_dict[keys].corrwith(df_dict['velocity_graph_all'])
corr.index = adata.obs.index.values
corr_df_dict[keys] = corr
# Velocity embedding based on top 5 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn5Genes)))
adata.uns['velocity_graph_top5_dynGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top5_dynGenes'])
# Velocity embedding based on top 10 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn10Genes)))
adata.uns['velocity_graph_top10_dynGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top10_dynGenes'])
# Velocity embedding based on top 30 dynamic genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(dyn30Genes)))
adata.uns['velocity_graph_top30_dynGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top30_dynGenes'])
# Velocity embedding based on top 5 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo5Genes)))
adata.uns['velocity_graph_top5_veloGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top5_veloGenes'])
# Velocity embedding based on top 10 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo10Genes)))
adata.uns['velocity_graph_top10_veloGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top10_veloGenes'])
# Velocity embedding based on top 30 velocity genes per cluster
scv.tl.velocity_graph(adata, gene_subset=list(set(velo30Genes)))
adata.uns['velocity_graph_top30_veloGenes'] = adata.uns['velocity_graph']
scv.pl.velocity_embedding_stream(adata, color=corr_df_dict['velocity_graph_top30_veloGenes'])
# Visualize top velocity markers in Beta and Ductal cells
for cluster in ['Beta', 'Ductal']:
scv.pl.scatter(adata, df2[cluster][:5], ylabel=cluster, frameon=False)
# Visualize top dynamic markers in Beta and Ductal cells
for cluster in ['Beta', 'Ductal']:
scv.pl.scatter(adata, df[cluster][:5], ylabel=cluster, frameon=False)