import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
import scipy
import numpy as np
import matplotlib.pyplot as plt
import warnings
="ignore", category=Warning)
warnings.simplefilter(action
# verbosity: errors (0), warnings (1), info (2), hints (3)
= 3
sc.settings.verbosity =100, frameon=False, figsize=(5, 5), facecolor='white', color_map = 'viridis_r') sc.settings.set_figure_params(dpi
Code chunks run Python commands unless it starts with %%bash
, in which case, those chunks run shell commands.
Partly following this PAGA tutorial with some modifications.
1 Loading libraries
2 Preparing data
In order to speed up the computations during the exercises, we will be using a subset of a bone marrow dataset (originally containing about 100K cells). The bone marrow is the source of adult immune cells, and contains virtually all differentiation stages of cell from the immune system which later circulate in the blood to all other organs.
If you have been using the Seurat, Bioconductor or Scanpy toolkits with your own data, you need to reach to the point where can find get:
- A dimensionality reduction where to perform the trajectory (for example: PCA, ICA, MNN, harmony, Diffusion Maps, UMAP)
- The cell clustering information (for example: from Louvain, k-means)
- A KNN/SNN graph (this is useful to inspect and sanity-check your trajectories)
In this case, all the data has been preprocessed with Seurat with standard pipelines. In addition there was some manual filtering done to remove clusters that are disconnected and cells that are hard to cluster, which can be seen in this script
The file trajectory_scanpy_filtered.h5ad was converted from the Seurat object using the SeuratDisk package. For more information on how it was done, have a look at the script: convert_to_h5ad.R in the github repo.
You can download the data with the commands:
import os
import urllib.request
# download pre-computed data if missing or long compute
= True
fetch_data
# url for source and intermediate data
= "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"
path_data
= "data/trajectory"
path_results if not os.path.exists(path_results):
=True)
os.makedirs(path_results, exist_ok
= "data/trajectory/trajectory_seurat_filtered.h5ad"
path_file if not os.path.exists(path_file):
= os.path.join(
file_url "trajectory/trajectory_seurat_filtered.h5ad")
path_data, urllib.request.urlretrieve(file_url, path_file)
3 Reading data
We already have pre-computed and subsetted the dataset (with 6688 cells and 3585 genes) following the analysis steps in this course. We then saved the objects, so you can use common tools to open and start to work with them (either in R or Python).
= sc.read_h5ad("data/trajectory/trajectory_seurat_filtered.h5ad")
adata adata.var
features | |
---|---|
0610040J01Rik | 0610040J01Rik |
1190007I07Rik | 1190007I07Rik |
1500009L16Rik | 1500009L16Rik |
1700012B09Rik | 1700012B09Rik |
1700020L24Rik | 1700020L24Rik |
... | ... |
Sqor | Sqor |
Sting1 | Sting1 |
Tent5a | Tent5a |
Tlcd4 | Tlcd4 |
Znrd2 | Znrd2 |
3585 rows × 1 columns
# check what you have in the X matrix, should be lognormalized counts.
print(adata.X[:10,:10])
(0, 4) 0.11622072805743532
(0, 8) 0.4800893970571722
(1, 8) 0.2478910541698065
(1, 9) 0.17188973970230348
(2, 1) 0.09413397843954842
(2, 7) 0.18016412971724202
(3, 1) 0.08438841021254412
(3, 4) 0.08438841021254412
(3, 7) 0.08438841021254412
(3, 8) 0.3648216463668793
(4, 1) 0.14198147850903975
(4, 8) 0.14198147850903975
(5, 1) 0.17953169693896723
(5, 8) 0.17953169693896723
(5, 9) 0.17953169693896723
(6, 4) 0.2319546390006887
(6, 8) 0.42010741700351195
(7, 1) 0.1775659421407816
(7, 8) 0.39593115482156394
(7, 9) 0.09271901219711086
(8, 1) 0.12089079757716388
(8, 8) 0.22873058755480363
(9, 1) 0.08915380247493314
(9, 4) 0.08915380247493314
(9, 8) 0.38270398718590104
4 Explore the data
There is a umap and clusters provided with the object, first plot some information from the previous analysis onto the umap.
= ['clusters','dataset','batches','Phase'],legend_loc = 'on data', legend_fontsize = 'xx-small', ncols = 2) sc.pl.umap(adata, color
It is crucial that you performing analysis of a dataset understands what is going on, what are the clusters you see in your data and most importantly How are the clusters related to each other?. Well, let’s explore the data a bit. With the help of this table, write down which cluster numbers in your dataset express these key markers.
Marker | Cell Type |
---|---|
Cd34 | HSC progenitor |
Ms4a1 | B cell lineage |
Cd3e | T cell lineage |
Ltf | Granulocyte lineage |
Cst3 | Monocyte lineage |
Mcpt8 | Mast Cell lineage |
Alas2 | RBC lineage |
Siglech | Dendritic cell lineage |
C1qc | Macrophage cell lineage |
Pf4 | Megakaryocyte cell lineage |
= ["Cd34","Alas2","Pf4","Mcpt8","Ltf","Cst3", "Siglech", "C1qc", "Ms4a1", "Cd3e", ]
markers = markers, use_raw = False, ncols = 4) sc.pl.umap(adata, color
5 Rerun analysis in Scanpy
Redo clustering and umap using the basic Scanpy pipeline. Use the provided “X_harmony_Phase” dimensionality reduction as the staring point.
# first, store the old umap with a new name so it is not overwritten
'X_umap_old'] = adata.obsm['X_umap']
adata.obsm[
= 30, n_neighbors = 20, use_rep="X_harmony_Phase")
sc.pp.neighbors(adata, n_pcs =0.4, spread=3) sc.tl.umap(adata, min_dist
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:12)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
#sc.tl.umap(adata, min_dist=0.6, spread=1.5)
= ['clusters'],legend_loc = 'on data', legend_fontsize = 'xx-small', edges = True)
sc.pl.umap(adata, color
= markers, use_raw = False, ncols = 4)
sc.pl.umap(adata, color
# Redo clustering as well
= "leiden_1.0", resolution = 1.0) # default resolution in 1.0
sc.tl.leiden(adata, key_added = "leiden_1.2", resolution = 1.2) # default resolution in 1.0
sc.tl.leiden(adata, key_added = "leiden_1.4", resolution = 1.4) # default resolution in 1.0
sc.tl.leiden(adata, key_added
#sc.tl.louvain(adata, key_added = "leiden_1.0") # default resolution in 1.0
= ['leiden_1.0', 'leiden_1.2', 'leiden_1.4','clusters'],legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2) sc.pl.umap(adata, color
running Leiden clustering
finished: found 16 clusters and added
'leiden_1.0', the cluster labels (adata.obs, categorical) (0:00:03)
running Leiden clustering
finished: found 17 clusters and added
'leiden_1.2', the cluster labels (adata.obs, categorical) (0:00:02)
running Leiden clustering
finished: found 19 clusters and added
'leiden_1.4', the cluster labels (adata.obs, categorical) (0:00:02)
#Rename clusters with really clear markers, the rest are left unlabelled.
= pd.DataFrame(adata.obs['leiden_1.4'].astype('string'))
annot 'leiden_1.4'] == '10'] = '10_megakaryo' #Pf4
annot[annot['leiden_1.4'] == '17'] = '17_macro' #C1qc
annot[annot['leiden_1.4'] == '11'] = '11_eryth' #Alas2
annot[annot['leiden_1.4'] == '18'] = '18_dend' #Siglech
annot[annot['leiden_1.4'] == '13'] = '13_mast' #Mcpt8
annot[annot['leiden_1.4'] == '0'] = '0_mono' #Cts3
annot[annot['leiden_1.4'] == '1'] = '1_gran' #Ltf
annot[annot['leiden_1.4'] == '9'] = '9_gran'
annot[annot['leiden_1.4'] == '14'] = '14_TC' #Cd3e
annot[annot['leiden_1.4'] == '16'] = '16_BC' #Ms4a1
annot[annot['leiden_1.4'] == '8'] = '8_progen' # Cd34
annot[annot['leiden_1.4'] == '4'] = '4_progen'
annot[annot['leiden_1.4'] == '5'] = '5_progen'
annot[annot[
'annot']=annot['leiden_1.4'].astype('category')
adata.obs[
= 'annot',legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2)
sc.pl.umap(adata, color
annot.value_counts()#type(annot)
# astype('category')
leiden_1.4
0_mono 509
1_gran 487
2 479
3 463
4_progen 387
5_progen 384
7 368
6 368
8_progen 367
9_gran 366
10_megakaryo 301
11_eryth 294
12 276
13_mast 159
14_TC 151
15 128
16_BC 124
17_macro 116
18_dend 101
Name: count, dtype: int64
# plot onto the Seurat embedding:
='X_umap_old', color = 'annot',legend_loc = 'on data', legend_fontsize = 'xx-small', ncols =2) sc.pl.embedding(adata, basis
6 Run PAGA
Use the clusters from leiden clustering with leiden_1.4 and run PAGA. First we create the graph and initialize the positions using the umap.
# use the umap to initialize the graph layout.
='X_umap')
sc.tl.draw_graph(adata, init_pos='annot', legend_loc='on data', legend_fontsize = 'xx-small')
sc.pl.draw_graph(adata, color='annot')
sc.tl.paga(adata, groups='annot', edge_width_scale = 0.3) sc.pl.paga(adata, color
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:54)
running PAGA
finished: added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
--> added 'pos', the PAGA positions (adata.uns['paga'])
As you can see, we have edges between many clusters that we know are are unrelated, so we may need to clean up the data a bit more.
7 Filtering graph edges
First, lets explore the graph a bit. So we plot the umap with the graph connections on top.
=True, color = 'annot', legend_loc= 'on data', legend_fontsize= 'xx-small') sc.pl.umap(adata, edges
We have many edges in the graph between unrelated clusters, so lets try with fewer neighbors.
=5, use_rep = 'X_harmony_Phase', n_pcs = 30)
sc.pp.neighbors(adata, n_neighbors=True, color = 'annot', legend_loc= 'on data', legend_fontsize= 'xx-small') sc.pl.umap(adata, edges
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
7.1 Rerun PAGA again on the data
='X_umap')
sc.tl.draw_graph(adata, init_pos='annot', legend_loc='on data', legend_fontsize = 'xx-small') sc.pl.draw_graph(adata, color
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:51)
='annot')
sc.tl.paga(adata, groups='annot', edge_width_scale = 0.3) sc.pl.paga(adata, color
running PAGA
finished: added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
--> added 'pos', the PAGA positions (adata.uns['paga'])
8 Embedding using PAGA-initialization
We can now redraw the graph using another starting position from the paga layout. The following is just as well possible for a UMAP.
='paga') sc.tl.draw_graph(adata, init_pos
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:54)
Now we can see all marker genes also at single-cell resolution in a meaningful layout.
=['annot'], legend_loc='on data', legend_fontsize= 'xx-small') sc.pl.draw_graph(adata, color
Compare the 2 graphs
sc.pl.paga_compare(=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
adata, threshold=12, fontsize=12, frameon=False, edges=True) legend_fontsize
--> added 'pos', the PAGA positions (adata.uns['paga'])
9 Gene changes
We can reconstruct gene changes along PAGA paths for a given set of genes
Choose a root cell for diffusion pseudotime. We have 3 progenitor clusters, but cluster 5 seems the most clear.
'iroot'] = np.flatnonzero(adata.obs['annot'] == '5_progen')[0]
adata.uns[
sc.tl.dpt(adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
finished (0:00:00)
eigenvalues of transition matrix
[1. 0.9989591 0.997628 0.9970365 0.9956704 0.99334306
0.9918951 0.9915921 0.99013233 0.98801893 0.9870309 0.9861044
0.9851118 0.9845008 0.9839531 ]
finished: added
'X_diffmap', diffmap coordinates (adata.obsm)
'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
computing Diffusion Pseudotime using n_dcs=10
finished: added
'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)
Use the full raw data for visualization.
=['annot', 'dpt_pseudotime'], legend_loc='on data', legend_fontsize= 'x-small') sc.pl.draw_graph(adata, color
By looking at the different know lineages and the layout of the graph we define manually some paths to the graph that corresponds to specific lineages.
# Define paths
= [('erythrocytes', ['5_progen', '8_progen', '6', '3', '7', '11_eryth']),
paths 'lympoid', ['5_progen', '12', '16_BC', '14_TC']),
('granulo', ['5_progen', '4_progen', '2', '9_gran', '1_gran']),
('mono', ['5_progen', '4_progen', '0_mono', '18_dend', '17_macro'])
(
]
'distance'] = adata.obs['dpt_pseudotime'] adata.obs[
Then we select some genes that can vary in the lineages and plot onto the paths.
= ['Gata2', 'Gata1', 'Klf1', 'Epor', 'Hba-a2', # erythroid
gene_names 'Elane', 'Cebpe', 'Gfi1', # neutrophil
'Irf8', 'Csf1r', 'Ctsg', # monocyte
'Itga2b','Prss34','Cma1','Procr', # Megakaryo,Basophil,Mast,HPC
'C1qc','Siglech','Ms4a1','Cd3e','Cd34']
= pl.subplots(ncols=4, figsize=(10, 4), gridspec_kw={
_, axs 'wspace': 0.05, 'left': 0.12})
=0.05, right=0.98, top=0.82, bottom=0.2)
pl.subplots_adjust(leftfor ipath, (descr, path) in enumerate(paths):
= sc.pl.paga_path(
_, data
adata, path, gene_names,=False,
show_node_names=axs[ipath],
ax=12,
ytick_fontsize=0.15,
left_margin=50,
n_avg=['distance'],
annotations=True if ipath == 0 else False,
show_yticks=False,
show_colorbar='Greys',
color_map='annot',
groups_key={'distance': 'viridis'},
color_maps_annotations='{} path'.format(descr),
title=True,
return_data=False,
use_raw=False)
show'data/trajectory/paga_path_{}.csv'.format(descr))
data.to_csv('data/trajectory/paga_path.pdf')
pl.savefig( pl.show()
As you can see, we can manipulate the trajectory quite a bit by selecting different number of neighbors, components etc. to fit with our assumptions on the development of these celltypes.
Please explore further how you can tweak the trajectory. For instance, can you create a PAGA trajectory using the orignial umap from Seurat instead? Hint, you first need to compute the neighbors on the umap.
10 Session info
Click here
sc.logging.print_versions()
-----
anndata 0.10.5.post1
scanpy 1.9.6
-----
PIL 10.0.0
anyio NA
asttokens NA
attr 23.1.0
babel 2.12.1
backcall 0.2.0
certifi 2023.11.17
cffi 1.15.1
charset_normalizer 3.1.0
colorama 0.4.6
comm 0.1.3
cycler 0.12.1
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
exceptiongroup 1.2.0
executing 1.2.0
fa2 NA
fastjsonschema NA
fontTools 4.47.2
gmpy2 2.1.2
h5py 3.9.0
idna 3.4
igraph 0.10.8
ipykernel 6.23.1
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
joblib 1.3.2
json5 NA
jsonpointer 2.0
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.6.0
jupyterlab_server 2.22.1
kiwisolver 1.4.5
leidenalg 0.10.1
llvmlite 0.41.1
louvain 0.8.1
markupsafe 2.1.2
matplotlib 3.8.0
matplotlib_inline 0.1.6
mpl_toolkits NA
mpmath 1.3.0
natsort 8.4.0
nbformat 5.8.0
networkx 3.2.1
numba 0.58.1
numpy 1.26.3
opt_einsum v3.3.0
overrides NA
packaging 23.1
pandas 2.2.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.5.1
prometheus_client NA
prompt_toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
pure_eval 0.2.2
pvectorc NA
pycparser 2.21
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.15.1
pynndescent 0.5.11
pyparsing 3.1.1
pyrsistent NA
pythonjsonlogger NA
pytz 2023.3
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
scipy 1.12.0
send2trash NA
session_info 1.0.0
six 1.16.0
sklearn 1.4.0
sniffio 1.3.0
socks 1.7.1
sparse 0.15.1
stack_data 0.6.2
sympy 1.12
texttable 1.7.0
threadpoolctl 3.2.0
torch 2.0.0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
typing_extensions NA
umap 0.5.5
urllib3 2.0.2
wcwidth 0.2.6
websocket 1.5.2
yaml 6.0
zmq 25.0.2
zoneinfo NA
zstandard 0.19.0
-----
IPython 8.13.2
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyterlab 4.0.1
notebook 6.5.4
-----
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
Linux-5.15.0-92-generic-x86_64-with-glibc2.35
-----
Session information updated at 2024-02-06 00:46