conda install notebook
pip install -U scvelo
pip install python-igraph louvain
pip install pybind11 hnswlib
import scvelo as scv
import scanpy as sc
import numpy as np
scv.logging.print_version()
scv.settings.verbosity = 3
scv.settings.presenter_view = True
scv.set_figure_params('scvelo')
adata = scv.datasets.pancreas()
adata
adata.layers["spliced"]
adata.layers["unspliced"]
adata.X.A
adata.obs
scv.pl.proportions(adata)
Raw count matrices often include over 20,000 genes. This number can be drastically reduced by filtering out genes that are not expressed in more than a few cells and are thus not informative of the cellular heterogeneity. A guideline to setting this threshold is to use the minimum cell cluster size that is of interest and leaving some leeway for dropout effects.
For example, filtering out genes expressed in fewer than 20 cells may make it difficult to detect cell clusters with fewer than 20 cells. For datasets with high dropout rates, this threshold may also complicate the detection of larger clusters.
The choice of threshold should scale with the number of cells in the dataset and the intended downstream analysis. Luecken and Theis (2019), Current best practices in single-cell RNA-seq analysis: a tutorial
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.pca(adata)
sc.pl.pca(adata, color = "clusters")
adata1 = scv.datasets.pancreas()
scv.pp.filter_genes(adata1, min_shared_counts=20)
scv.pp.normalize_per_cell(adata1)
scv.pp.filter_genes_dispersion(adata1, n_top_genes=2000)
scv.pp.log1p(adata1)
scv.pp.pca(adata1)
sc.pl.pca(adata1, color = "clusters")
scv.plotting.pca(adata1)
The PCA representation (computed on the log-normalized data) is used to compute a neighbor graph from cell-to-cell distances. What does this graph represent and what can it be used for?
Now, we use the graph to compute first-order moments, i.e., for each cell, the mean (first-order moments) and variance (second-order moments) of its neighboring cells is computed. This can also be regarded as a knn-imputation of the counts.
Show how the k parameter (number of neighbors) impacts the imputed counts, e.g., by comparing the imputed gene expression of two genes with different values for k.
The kNN graph helps to connect the neighboring cells and then it can be used to compute clusters.
scv.pp.moments(adata1, n_pcs=30, n_neighbors=20)
adata1.layers['Mu'][:,1]
#Impact of k parameter on the imputed counts by comparing the expression of a gene with 2 different k values
scv.pp.moments(adata1, n_neighbors =500)
gene1_50=adata1.layers['Ms'][:,1]
scv.pp.moments(adata1, n_neighbors = 5)
gene1_20=adata1.layers['Ms'][:,1]
import matplotlib.pyplot as plt
p1=plt.plot(gene1_50, '.', color='blue', alpha=0.5)
p2=plt.plot(gene1_20, 'v', color='black', alpha=0.2)
plt.plot(gene1_50, '.', color='blue', alpha=0.5)
plt.ylim((0,2.5))
p2=plt.plot(gene1_20, 'v', color='black', alpha=0.2)
sc.pl.violin(adata1, ['Abcc8','Sulf2','Top2a'],size=3,layer='spliced',
jitter=.3, groupby = 'clusters', rotation= 45)
#Phase portaits of marker genes and comparison for more different numbers of neighbors using logarithmized counts
k=[5,20,30,50,100,500]
scv.pl.scatter(adata1, basis=["Ins2", "Abcc8", "Sulf2", "Erdr1", "Neurog3"], use_raw=True)
for i in k:
print( i," neighbours")
scv.pp.moments(adata1, n_neighbors=i)
scv.pl.scatter(adata1, basis=['Ins2', "Abcc8", "Sulf2", "Erdr1", "Neurog3"], use_raw=False)
scv.tl.velocity(adata1)
scv.tl.velocity_graph(adata1)
#Phase portaits of marker genes and comparison for more different numbers of neighbors using raw counts
k=[5,20,50,100]
scv.pl.scatter(adata, basis='Abcc8', use_raw=True)
for i in k:
print( i," neighbours")
scv.pp.moments(adata, n_neighbors=i)
scv.pl.scatter(adata, basis=['Abcc8', 'Sulf2','Top2a'], use_raw=False)
scv.pl.scatter(adata1, 'Adk', color=['clusters', 'velocity'])
scv.pl.scatter(adata1, 'Abcc8', color=['clusters', 'velocity'])
Mathematical equations for single gene that describe how the expected number of unspliced mRNA molecules u and the spliced molecules s evolve over time:
$\frac{du}{dt} = a(t)- β(t)u(t)$
$\frac{ds}{dt} = β(t)u(t) - γ(t)s(t) $
If we assume that:
$a(t)=a$ $γ(t)= γ$ $b(t)= 1$
the resulting equations are:
$\frac{du}{dt} = a- u(t)$
$\frac{ds}{dt} = u(t) - γ s(t) $
where u(t) is the observed amount of unspliced counts and γ s(t) is the steady state ratio. Therefore velocity is the residual of u(t) and the steady state line.
scv.pp.moments(adata1, n_neighbors =50)
scv.tl.recover_dynamics(adata1,n_jobs=10)
scv.tl.velocity(adata1, mode="dynamical")
scv.tl.velocity_graph(adata1)
scv.pl.scatter(adata1, basis=['Abcc8','Sulf2','Top2a'],color='clusters', use_raw=False)
Transient states are the ones that we have downregulation or upregulation, while steady states are the ones where the ratio is 1. We have 2 steady states the on and the off.
Compute various dimensionality reduction methods / embeddings (e.g., tsne, umap, diffmap) and show how they differ and what type of topology they capture (local vs. global similarity).
As you vary the main parameters (e.g., min_dist in umap), how does that impact the embedding?
Which of these embeddings would you use for visualising trajectories and why?
scv.pl.velocity_embedding_stream(adata1, basis='umap')
#umap embedding, min_dist=0.3
scv.tl.umap(adata1, min_dist=0.3)
scv.pl.velocity_embedding_stream(adata1, basis='umap')
#umap embedding, min_dist=0.5
scv.tl.umap(adata1, min_dist=0.5)
scv.pl.velocity_embedding_stream(adata1, basis='umap')
#umap embedding, min_dist=1
scv.tl.umap(adata1, min_dist=1)
scv.pl.velocity_embedding_stream(adata1, basis='umap')
#tsne embedding
scv.tl.tsne(adata1)
scv.pl.velocity_embedding_stream(adata1, basis='tsne')
#diffusion map embedding
scv.tl.diffmap(adata1)
scv.pl.velocity_embedding_stream(adata1, basis='diffmap')
Tsne and umap look very similar, while diffusion map differs. In order to decide which graph (umap or tsne) to use we should know more about our data. As for the minimum distance 0.3 seems to give the best represtation.
# function for ummin_dist umap - default 0.5
k=[0.1,0.5,1,2,3]
for i in k:
print( i," min_dist")
sc.tl.umap(adata, min_dist = i)
sc.pl.umap(adata, color= "clusters")
scv.tl.velocity_embedding(adata, basis = "umap")
scv.pl.velocity_embedding_stream(adata, basis= "umap")
x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=1, random_state= 20)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=3, random_state= 20)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
6.1. The velocity graph summarises all cell-to-cell transitions inferred from velocity. What exactly does the ij-th entry in the graph represent?
6.2. Can you run a number of cell transitions using the graph to simulate a trajectory starting from a particular cell in one of the embeddings?
6.3. How do the trajectories compare, when run from different starting cells (e.g., cell cycle vs. differentiation)?
scv.pl.velocity_embedding_stream(adata1, basis='umap')
scv.pl.velocity_embedding_grid(adata1, basis='umap')
scv.pl.velocity_embedding(adata1, basis='umap')
scv.pl.velocity_embedding(adata1, arrow_length=3, arrow_size=2, dpi=120)
The velocity graph matrix shows the probability of a cell transitioning to another cell.
p1=scv.utils.get_transition_matrix(adata1)
print(p1)
#terminal states of cells
scv.tl.terminal_states(adata1)
scv.pl.scatter(adata1, color=['root_cells', 'end_points'])
#Phase portraits of marker genes
scv.pl.velocity(adata1, ['Cpe', 'Gnao1', 'Ins2', 'Adk'], ncols=2)
adata1.uns['velocity_graph'].A
adata1.uns['velocity_graph_neg'].A
adata1.obs
#Velocity pseudotime for different cells
scv.tl.velocity_pseudotime(adata1,root_key='AAACCTGAGGCAATTA')
scv.pl.scatter(adata1, color='velocity_pseudotime', cmap='gnuplot')
scv.tl.velocity_pseudotime(adata1,root_key='TTTGTCAAGTGTGGCA')
scv.pl.scatter(adata1, color='velocity_pseudotime', cmap='gnuplot')
scv.tl.velocity_pseudotime(adata1,root_key='TTTGTCAAGTGACATA')
scv.pl.scatter(adata1, color='velocity_pseudotime', cmap='gnuplot')
#Cell cycle score
scv.tl.score_genes_cell_cycle(adata1)
scv.pl.scatter(adata1, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])
When taking cells in the middle, we end up with trajectories to both directions, whereas when we take cells from extreme positions, we end up with trajectories directed only on one side.
*Cells that are not on G2M phase of S phase based on the clusted might be on G1 phase, Mitosis or even not cycling since we don't have these annotations.
x, y = scv.utils.get_cell_transitions(adata1, basis='umap', starting_cell=1000, n_steps=100, perc=10)
ax = scv.pl.velocity_graph(adata1, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata1, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
x, y = scv.utils.get_cell_transitions(adata1, basis='umap', starting_cell=1000, n_steps=1000, perc=10)
ax = scv.pl.velocity_graph(adata1, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata1, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)
7.1. Now the high-dimensional velocity vector field is visualized by projection into the two-dimensional embedding. Show how the velocity graph (transition matrix) is being used to project the data. For instance, you can pick one particular cell, and show its potential transitions, and how the actual arrow is derived from these.
7.2. Vary with the n_neighbors parameter and show how that impacts the projected vector field.
7.3. Why is the vector field getting smoother/averaged with a large n_neighbors parameter?
startingCell=100
trans = scv.utils.get_transition_matrix(adata1)[startingCell,:]
x, y = scv.utils.get_cell_transitions(adata1, basis='umap', starting_cell=startingCell)
ax = scv.pl.velocity_embedding_stream(adata1, basis='umap',color=trans,perc=99,show=False,dpi=150)
ax = scv.pl.scatter(adata1, x=x[0], y=y[0], s=120, c='ascending', cmap='gnuplot', ax=ax, colorbar= False)
scv.pl.velocity_embedding_stream(adata1, basis='umap',size=30,alpha=1,n_neighbors=10, smooth= True )
8.1.1. Now show how the projected vector field is represented in the embedding as you vary some of the parameters from before (such as min_dist for UMAP).
8.1.2. Is there a specific parameter set, where the arrows and the embedding coordinates do not work together?
8.1.3. Which embedding provides the highest resolution into the vector field.
8.1.4. Using the simulated cellular transitions, can you show which embedding specification represent the high-dimensional vector field best?
8.1.5. Can you even systematically quantify, how well the vector field is represented in the embedding? Possible metrics could, e.g., be correlation of embedded velocities with pseudo time / latent time, 'compactness' of transitions, correlation of transitions in the embedding weighted by their probabilities.
seed=120
root_cell_cc=np.flatnonzero(adata1.obs["clusters"]=="Beta") [1]
x,y= scv.utils.get_cell_transitions(adata1, basis="umap", starting_cell=root_cell_cc, random_state=seed)
ax=scv.pl.velocity_graph(adata1, c="lightgrey", edge_width=0.5, show=False)
ax=scv.pl.scatter(adata1, x=x, y=y, s=120, c="ascending", cmap="gnuplot", ax=ax)
seed=120
root_cell_cc=np.flatnonzero(adata1.obs["clusters"]=="Beta") [1]
x,y= scv.utils.get_cell_transitions(adata1, basis="tsne", starting_cell=100)
ax=scv.pl.velocity_graph(adata1, c="lightgrey",basis= "tsne", edge_width=0.5, show=False)
ax=scv.pl.scatter(adata1, x=x, y=y, s=120, c="ascending", cmap="gnuplot", ax=ax)
#Transition probabilities of ductal cells
cluster = 'Delta'
selected_cell = np.flatnonzero(adata1.obs['clusters'] == cluster)[0]
trans = scv.utils.get_transition_matrix(adata1).todense()
ax = scv.pl.scatter(adata1[adata1.obs['clusters'] == cluster], color=selected_cell, show=False, alpha=0, size=20)
ax = scv.pl.scatter(adata1, color=trans[:,selected_cell], show=False, cmap='Reds', alpha=1, perc=99, ax=ax, size=100)
ax = scv.pl.scatter(adata1, color='clusters', alpha=0.1, ax=ax, size=100, legend_loc='right margin', title='', show=False)
# Quantification for different values of parameters and different embeddings
# UMAP
min_dist = 0.5
spread = 1
seed = 55
scv.tl.umap(adata,min_dist=min_dist,spread=spread,random_state=seed)
adata.obsm['X_umap'][:, 0] = adata.obsm['X_umap'][:, 0] / np.max(np.abs(adata.obsm['X_umap'][:, 0]))
adata.obsm['X_umap'][:, 1] = adata.obsm['X_umap'][:, 1] / np.max(np.abs(adata.obsm['X_umap'][:, 1]))
seed = 55
d = []
for cell in list(range(0,len(adata.obs))):
x = scv.utils.get_cell_transitions(adata,n_steps=100, starting_cell=cell,random_state=seed)
c=[]
for i in range(1,len(x)):
delta_x = [adata.obsm['X_umap'][x[i],0] - adata.obsm['X_umap'][x[i-1],0]]
delta_y = [adata.obsm['X_umap'][x[i],1] - adata.obsm['X_umap'][x[i-1],1]]
c += [np.sqrt(np.square(delta_x)+np.square(delta_y))]
#print(np.mean(c))
d += [np.mean(c)]
name = 'MTL_UMAPmd'+str(min_dist)+'spr'+str(spread)
adata.obs[name] = d
print(np.mean(d)) ,name, scv.pl.scatter(adata,color=name,color_map='gnuplot',title='Mean Transition Length') , scv.pl.scatter(adata,color='clusters',color_map='gnuplot')
#Replot UMAP
min_dist = 0.5
spread = 1
seed = 55
scv.tl.umap(adata,min_dist=min_dist,spread=spread,random_state=seed)
name = 'MTL_UMAPmd'+str(min_dist)+'spr'+str(spread)
print(np.mean(adata.obs[name])) ,name, scv.pl.scatter(adata,color=name,color_map='gnuplot',title='Mean Transition Length',basis='umap') , scv.pl.scatter(adata,color='clusters',color_map='gnuplot',basis='umap')
#Tsne
perplexity = 30
seed = 55
scv.tl.tsne(adata,perplexity=perplexity,n_jobs=12,random_state=seed)
adata.obsm['X_tsne'][:, 0] = adata.obsm['X_tsne'][:, 0] / np.max(np.abs(adata.obsm['X_tsne'][:, 0]))
adata.obsm['X_tsne'][:, 1] = adata.obsm['X_tsne'][:, 1] / np.max(np.abs(adata.obsm['X_tsne'][:, 1]))
d = []
for cell in list(range(0,len(adata.obs))):
x = scv.utils.get_cell_transitions(adata,n_steps=100, starting_cell=cell,random_state=seed)
c=[]
for i in range(1,len(x)):
delta_x = [adata.obsm['X_tsne'][x[i],0] - adata.obsm['X_tsne'][x[i-1],0]]
delta_y = [adata.obsm['X_tsne'][x[i],1] - adata.obsm['X_tsne'][x[i-1],1]]
c += [np.sqrt(np.square(delta_x)+np.square(delta_y))]
#print(np.mean(c))
d += [np.mean(c)]
name = 'MTL_tSNEperp'+str(perplexity)
adata.obs[name] = d
print(np.mean(d)) ,name, scv.pl.scatter(adata,color=name,color_map='gnuplot',title='Mean Transition Length') , scv.pl.scatter(adata,color='clusters',color_map='gnuplot')
#Replot tSNE
perplexity = 30
seed = 55
scv.tl.tsne(adata,perplexity=perplexity,n_jobs=12,random_state=seed)
name = 'MTL_tSNEperp'+str(perplexity)
print(np.mean(adata.obs[name])) ,name, scv.pl.scatter(adata,color=name,color_map='gnuplot',title='Mean Transition Length',basis='tsne') , scv.pl.scatter(adata,color='clusters',color_map='gnuplot',basis='tsne')
#Transition of o
scv.tl.tsne(adata)
scv.tl.umap(adata)
x = scv.utils.get_cell_transitions(adata,n_steps=100, starting_cell='TGAGCCGGTTCGGCAC',random_state=55)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False,basis='tsne')
varX = adata.obsm['X_tsne'][x,0]
varY = adata.obsm['X_tsne'][x,1]
ax = scv.pl.scatter(adata, x=varX, y=varY, s=120, c='ascending', cmap='gnuplot',ax=ax, show=False,basis='tsne')
ax = scv.pl.scatter(adata, x=varX[0], y=varY[0], s=500, c='ascending', color='pink', ax=ax,colorbar=False,basis='tsne',show=False)
varX = adata.obsm['X_umap'][x,0]
varY = adata.obsm['X_umap'][x,1]
bx = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False,basis='umap')
bx = scv.pl.scatter(adata, x=varX, y=varY, s=120, c='ascending', cmap='gnuplot',ax=bx, show=False,basis='umap')
bx = scv.pl.scatter(adata, x=varX[0], y=varY[0], s=500, c='ascending', color='pink', ax=bx,colorbar=False,basis='umap',show=False)
ax , bx