import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import warnings
import os
import urllib.request
="ignore", category=Warning)
warnings.simplefilter(action
# verbosity: errors (0), warnings (1), info (2), hints (3)
= 3
sc.settings.verbosity =80) sc.settings.set_figure_params(dpi
Code chunks run Python commands unless they start with %%bash, in which case they run shell commands.
In this tutorial we will continue the analysis of the integrated dataset. We will use the scanpy embedding to perform the clustering using graph community detection algorithms.
Let’s first load all necessary libraries and also the integrated dataset from the previous step.
# download pre-computed data if missing or long compute
fetch_data = True

# url for source and intermediate data
path_data = "https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq"

path_results = "data/covid/results"
if not os.path.exists(path_results):
    os.makedirs(path_results, exist_ok=True)

path_file = "data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad"
if fetch_data and not os.path.exists(path_file):
    urllib.request.urlretrieve(os.path.join(path_data, 'covid/results/scanpy_covid_qc_dr_scanorama.h5ad'), path_file)

adata = sc.read_h5ad(path_file)
adata
AnnData object with n_obs × n_vars = 7222 × 2626
obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'
obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
1 Graph clustering
The procedure of clustering on a graph can be generalized as 3 main steps:
- Build a kNN graph from the data.
- Prune spurious connections from the kNN graph (optional step). This gives an SNN graph.
- Find groups of cells that maximize the connections within the group compared to other groups.
If you recall from the integration lab, we already constructed a kNN graph before running UMAP. Hence we do not need to do it again, and can run the community detection right away. The modularity optimization algorithms implemented in Scanpy are Leiden and Louvain. Let's test both and see how they compare.
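If the neighbor graph had not been computed in the previous step, it could be built directly from the integrated embedding. A minimal sketch, assuming the Scanorama embedding stored in adata.obsm['Scanorama']; the n_neighbors value here is an arbitrary choice, not necessarily the one used in the previous lab:

# not needed here, since the graph already exists in adata.obsp / adata.uns['neighbors']
sc.pp.neighbors(adata, n_neighbors=20, use_rep='Scanorama')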
1.1 Leiden
= "leiden_1.0") # default resolution in 1.0
sc.tl.leiden(adata, key_added = 0.6, key_added = "leiden_0.6")
sc.tl.leiden(adata, resolution = 0.4, key_added = "leiden_0.4")
sc.tl.leiden(adata, resolution = 1.4, key_added = "leiden_1.4") sc.tl.leiden(adata, resolution
running Leiden clustering
finished: found 20 clusters and added
'leiden_1.0', the cluster labels (adata.obs, categorical) (0:00:02)
running Leiden clustering
finished: found 16 clusters and added
'leiden_0.6', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
finished: found 13 clusters and added
'leiden_0.4', the cluster labels (adata.obs, categorical) (0:00:01)
running Leiden clustering
finished: found 23 clusters and added
'leiden_1.4', the cluster labels (adata.obs, categorical) (0:00:02)
Plot the clusters. As you can see, with increased resolution we get higher granularity in the clustering.
sc.pl.umap(adata, color=['leiden_0.4', 'leiden_0.6', 'leiden_1.0', 'leiden_1.4'])
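To quantify the granularity without reading the log output, we can also count the number of clusters per resolution; a small sketch using the keys created above:

for key in ['leiden_0.4', 'leiden_0.6', 'leiden_1.0', 'leiden_1.4']:
    print(key, adata.obs[key].nunique(), 'clusters')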
Once we have done the clustering, the relationships between clusters can be calculated as correlations in PCA space, and we can also visualize some of the marker genes that we used in the dimensionality reduction lab on the clusters.
= "leiden_0.6")
sc.tl.dendrogram(adata, groupby = "leiden_0.6")
sc.pl.dendrogram(adata, groupby
= ["CD3E", "CD4", "CD8A", "GNLY","NKG7", "MS4A1","FCGR3A","CD14","LYZ","CST3","MS4A7","FCGR1A"]
genes ='leiden_0.6', dendrogram=True) sc.pl.dotplot(adata, genes, groupby
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_leiden_0.6']`
1.2 Louvain
= "louvain_1.0") # default resolution in 1.0
sc.tl.louvain(adata, key_added = 0.6, key_added = "louvain_0.6")
sc.tl.louvain(adata, resolution = 0.4, key_added = "louvain_0.4")
sc.tl.louvain(adata, resolution = 1.4, key_added = "louvain_1.4")
sc.tl.louvain(adata, resolution
sc.pl.umap(adata, color=['louvain_0.4', 'louvain_0.6', 'louvain_1.0', 'louvain_1.4'])

sc.tl.dendrogram(adata, groupby="louvain_0.6")
sc.pl.dendrogram(adata, groupby="louvain_0.6")

genes = ["CD3E", "CD4", "CD8A", "GNLY", "NKG7", "MS4A1", "FCGR3A", "CD14", "LYZ", "CST3", "MS4A7", "FCGR1A"]

sc.pl.dotplot(adata, genes, groupby='louvain_0.6', dendrogram=True)
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 15 clusters and added
'louvain_1.0', the cluster labels (adata.obs, categorical) (0:00:00)
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 11 clusters and added
'louvain_0.6', the cluster labels (adata.obs, categorical) (0:00:00)
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 8 clusters and added
'louvain_0.4', the cluster labels (adata.obs, categorical) (0:00:00)
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 20 clusters and added
'louvain_1.4', the cluster labels (adata.obs, categorical) (0:00:00)
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_louvain_0.6']`
2 K-means clustering
K-means is a generic clustering algorithm that has been used in many application areas. In Python it is available, for example, as the KMeans class in scikit-learn. Typically, it is applied to a reduced dimension representation of the expression data (most often PCA, because of the interpretability of the low-dimensional distances). We need to define the number of clusters in advance. Since the results depend on the initialization of the cluster centers, it is typically recommended to run K-means with multiple starting configurations (via the n_init argument).
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
# extract the integrated (Scanorama) embedding to cluster on
X_pca = adata.obsm['Scanorama']

# kmeans with k=5
kmeans = KMeans(n_clusters=5, random_state=0).fit(X_pca)
adata.obs['kmeans5'] = kmeans.labels_.astype(str)

# kmeans with k=10
kmeans = KMeans(n_clusters=10, random_state=0).fit(X_pca)
adata.obs['kmeans10'] = kmeans.labels_.astype(str)

# kmeans with k=15
kmeans = KMeans(n_clusters=15, random_state=0).fit(X_pca)
adata.obs['kmeans15'] = kmeans.labels_.astype(str)

sc.pl.umap(adata, color=['kmeans5', 'kmeans10', 'kmeans15'])
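The adjusted_rand_score imported above can be used to quantify how similar two partitions are, for example K-means versus the graph-based clustering. A small sketch; comparing against leiden_0.6 is just one possible choice:

# ARI of 1.0 means identical partitions, values near 0 mean chance-level agreement
print("ARI kmeans5 vs leiden_0.6:",
      adjusted_rand_score(adata.obs['kmeans5'], adata.obs['leiden_0.6']))
print("ARI kmeans15 vs leiden_0.6:",
      adjusted_rand_score(adata.obs['kmeans15'], adata.obs['leiden_0.6']))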
adata.obsm
AxisArrays with keys: Scanorama, X_pca, X_tsne, X_umap
3 Hierarchical clustering
Hierarchical clustering is another generic form of clustering that can also be applied to scRNA-seq data. Like K-means, it is typically applied to a reduced dimension representation of the data. Hierarchical clustering returns an entire hierarchy of partitionings (a dendrogram) that can be cut at different levels. Hierarchical clustering is done in these steps:
- Define the distances between samples. The most common are Euclidean distance (a.k.a. straight line between two points) or correlation coefficients.
- Define a measure of distances between clusters, called linkage criteria. It can for example be average distances between clusters. Commonly used methods are single, complete, average, median, centroid and ward.
- Define the dendrogram among all samples using a bottom-up or top-down approach. Bottom-up is where samples start in their own cluster and are merged pair-by-pair until only one cluster is left. Top-down is where samples all start in the same cluster, which is repeatedly split in two until each sample has its own cluster. A small bottom-up example is sketched right after this list.
As you might have realized, correlation is usually not offered directly as a distance metric by clustering functions. However, we can compute our own distances and pass them to the clustering as a precomputed distance matrix. We can first compute sample correlations, for example with numpy.corrcoef.

As you already know, correlations range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest, and 0 is somewhat in between. This, however, creates a problem in defining distances, because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest, and a distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):

\[adj = \frac{1 - cor}{2}\]

Once we have transformed the correlations to a 0-1 scale, we can simply use the result as a precomputed distance matrix. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.
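A sketch of this idea in Python, under a few assumptions: metric='precomputed' requires a recent scikit-learn (older versions call the parameter affinity), average linkage is used because ward only works with Euclidean distances, and the subset of 500 cells and n_clusters=5 are arbitrary choices to keep the correlation matrix small.

from sklearn.cluster import AgglomerativeClustering

# use a subset of cells for illustration; the full matrix would be ~7200 x 7200
sub = adata.obsm['Scanorama'][:500]
cor = np.corrcoef(sub)        # cell-cell correlations, range -1 to 1
adj = (1 - cor) / 2           # transform to a 0-1 distance-like scale
clust = AgglomerativeClustering(n_clusters=5, metric='precomputed', linkage='average')
labels = clust.fit_predict(adj)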
The function AgglomerativeClustering has the option of running with the distance metrics "euclidean", "l1", "l2", "manhattan", "cosine", or "precomputed". However, with ward linkage only Euclidean distances work. Here we will try out Euclidean distance and ward linkage on the same embedding used for K-means above.
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
adata.obs['hclust_5'] = cluster.fit_predict(X_pca).astype(str)

cluster = AgglomerativeClustering(n_clusters=10, linkage='ward')
adata.obs['hclust_10'] = cluster.fit_predict(X_pca).astype(str)

cluster = AgglomerativeClustering(n_clusters=15, linkage='ward')
adata.obs['hclust_15'] = cluster.fit_predict(X_pca).astype(str)

sc.pl.umap(adata, color=['hclust_5', 'hclust_10', 'hclust_15'])
Finally, let's save the clustered data for further analysis.

adata.write_h5ad('./data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad')
4 Distribution of clusters
Now, we can select one of our clustering methods and compare the proportion of samples across the clusters.
Select "leiden_0.6" and plot the proportion of samples per cluster as well as the proportion of covid vs ctrl cells.

Plot the proportion of cells from each condition per cluster.
tmp = pd.crosstab(adata.obs['leiden_0.6'], adata.obs['type'], normalize='index')
tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')

tmp = pd.crosstab(adata.obs['leiden_0.6'], adata.obs['sample'], normalize='index')
tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')
<matplotlib.legend.Legend at 0x7f7c62595090>
In this case we have quite good representation of each sample in each cluster, but there are clearly some biases, with more cells from one sample in some clusters and also more covid cells in some of the clusters.
We can also plot it in the other direction, the proportion of each cluster per sample.
tmp = pd.crosstab(adata.obs['sample'], adata.obs['leiden_0.6'], normalize='index')
tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')
<matplotlib.legend.Legend at 0x7f7c9a9a3dc0>
By now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, which should be stored in your data object, and plot them as violin plots per cluster using the clustering method of your choice. For example, plot the number of UMIs, detected genes, and percent mitochondrial reads. Then check carefully whether there is any bias in how your data is separated by quality metrics. Could it be explained biologically, or could there be a technical bias?
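As a starting point, a sketch using QC columns that are already present in adata.obs; the choice of metrics and of leiden_0.6 as the grouping is just one possibility:

# violin plots of QC metrics per cluster
sc.pl.violin(adata, ['total_counts', 'n_genes_by_counts', 'pct_counts_mt'],
             groupby='leiden_0.6', rotation=45)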
5 Session info
sc.logging.print_versions()
-----
anndata 0.10.5.post1
scanpy 1.9.6
-----
PIL 10.0.0
anyio NA
asttokens NA
attr 23.1.0
babel 2.12.1
backcall 0.2.0
certifi 2023.11.17
cffi 1.15.1
charset_normalizer 3.1.0
colorama 0.4.6
comm 0.1.3
cycler 0.12.1
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
exceptiongroup 1.2.0
executing 1.2.0
fastjsonschema NA
gmpy2 2.1.2
h5py 3.9.0
idna 3.4
igraph 0.10.8
ipykernel 6.23.1
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
joblib 1.3.2
json5 NA
jsonpointer 2.0
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.6.0
jupyterlab_server 2.22.1
kiwisolver 1.4.5
leidenalg 0.10.1
llvmlite 0.41.1
louvain 0.8.1
markupsafe 2.1.2
matplotlib 3.8.0
matplotlib_inline 0.1.6
mpl_toolkits NA
mpmath 1.3.0
natsort 8.4.0
nbformat 5.8.0
numba 0.58.1
numpy 1.26.3
opt_einsum v3.3.0
overrides NA
packaging 23.1
pandas 2.2.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.5.1
prometheus_client NA
prompt_toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
pure_eval 0.2.2
pvectorc NA
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.15.1
pyparsing 3.1.1
pyrsistent NA
pythonjsonlogger NA
pytz 2023.3
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
scipy 1.12.0
send2trash NA
session_info 1.0.0
six 1.16.0
sklearn 1.4.0
sniffio 1.3.0
socks 1.7.1
stack_data 0.6.2
sympy 1.12
texttable 1.7.0
threadpoolctl 3.2.0
torch 2.0.0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
typing_extensions NA
urllib3 2.0.2
wcwidth 0.2.6
websocket 1.5.2
yaml 6.0
zmq 25.0.2
zoneinfo NA
zstandard 0.19.0
-----
IPython 8.13.2
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyterlab 4.0.1
notebook 6.5.4
-----
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
Linux-5.15.0-92-generic-x86_64-with-glibc2.35
-----
Session information updated at 2024-02-06 00:39