Clustering

Note

Code chunks run R commands unless otherwise specified.

In this tutorial, we will continue the analysis of the integrated dataset. We will use the integrated PCA or CCA to perform the clustering. First, we will construct a \(k\)-nearest neighbor graph in order to perform a clustering on the graph. We will also show how to perform hierarchical clustering and k-means clustering on the selected space.

Let’s first load all necessary libraries and also the integrated dataset from the previous step.

suppressPackageStartupMessages({
    library(Seurat)
    library(patchwork)
    library(ggplot2)
    library(pheatmap)
    library(clustree)
})

# download pre-computed data if missing or long compute
fetch_data <- TRUE

# url for source and intermediate data
path_data <- "https://nextcloud.dc.scilifelab.se/public.php/webdav"
curl_upass <- "-u zbC5fr2LbEZ9rSE:scRNAseq2025"

path_file <- "data/covid/results/seurat_covid_qc_dr_int.rds"
if (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)
if (fetch_data && !file.exists(path_file)) download.file(url = file.path(path_data, "covid/results_seurat/seurat_covid_qc_dr_int.rds"), destfile = path_file, method = "curl", extra = curl_upass)

alldata <- readRDS(path_file)
print(names(alldata@reductions))

 [1] "pca"               "umap"              "tsne"             
 [4] "UMAP10_on_PCA"     "UMAP_on_ScaleData" "integrated_cca"   
 [7] "umap_cca"          "tsne_cca"          "harmony"          
[10] "umap_harmony"      "scanorama"         "scanoramaC"       
[13] "umap_scanorama"    "umap_scanoramaC"

1 Graph clustering

The procedure of clustering on a Graph can be generalized as 3 main steps:
- Build a kNN graph from the data.
- Prune spurious connections from kNN graph (optional step). This is a SNN graph.
- Find groups of cells that maximizes the connections within the group compared other groups.

1.1 Building kNN / SNN graph

The first step into graph clustering is to construct a k-nn graph, in case you don’t have one. For this, we will use the PCA space. Thus, as done for dimensionality reduction, we will use ony the top N PCA dimensions for this purpose (the same used for computing UMAP / tSNE).

As we can see above, the Seurat function FindNeighbors() already computes both the KNN and SNN graphs, in which we can control the minimal percentage of shared neighbours to be kept. See ?FindNeighbors for additional options.

The main options to consider are:

dims - the number of dimensions from the initial reduction to include when calculating distances between cells.
k.param - the number of neighbors per cell to include in the KNN graph.
prune.SNN - sets the cutoff for Jaccard index when pruning the graph.

# use the CCA integration to create the neighborhood graph.
alldata <- FindNeighbors(alldata, dims = 1:30, k.param = 60, prune.SNN = 1 / 15, reduction =  "integrated_cca")

# check the names for graphs in the object.
names(alldata@graphs)

[1] "RNA_nn"  "RNA_snn"

We can take a look at the kNN and SNN graphs. The kNN graph is a matrix where every connection between cells is represented as \(1\)s. This is called a unweighted graph (default in Seurat). In the SNN graph on the other hand, some cell connections have more importance than others, and the graph scales from \(0\) to a maximum distance (in this case \(1\)). Usually, the smaller the distance, the closer two points are, and stronger is their connection. This is called a weighted graph. Both weighted and unweighted graphs are suitable for clustering, but clustering on unweighted graphs is faster for large datasets (> 100k cells).

pheatmap(alldata@graphs$RNA_nn[1:200, 1:200],
    col = c("white", "black"), border_color = "grey90", main = "KNN graph",
    legend = F, cluster_rows = F, cluster_cols = F, fontsize = 2
)

pheatmap(alldata@graphs$RNA_snn[1:200, 1:200],
    col = colorRampPalette(c("white", "yellow", "red"))(100),
    border_color = "grey90", main = "SNN graph",
    legend = F, cluster_rows = F, cluster_cols = F, fontsize = 2
)

1.2 Clustering on a graph

Once the graph is built, we can now perform graph clustering. The clustering is done respective to a resolution which can be interpreted as how coarse you want your cluster to be. Higher resolution means higher number of clusters.

In Seurat, the function FindClusters() will do a graph-based clustering using “Louvain” algorithim by default (algorithm = 1). To use the leiden algorithm, you need to set it to algorithm = 4. See ?FindClusters for additional options.

By default it will run clustering on the SNN graph we created in the previous step, but you can also specify different graphs for clustering with graph.name.

# Clustering with louvain (algorithm 1) and a few different resolutions
for (res in c(0.1, 0.25, .5, 1, 1.5, 2)) {
    alldata <- FindClusters(alldata, graph.name = "RNA_snn", resolution = res, algorithm = 1)
}

# each time you run clustering, the data is stored in meta data columns:
# seurat_clusters - lastest results only
# RNA_snn_res.XX - for each different resolution you test.

wrap_plots(
    DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.0.1", label=T) + ggtitle("louvain_0.1"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.0.25", label=T) + ggtitle("louvain_0.25"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.0.5", label=T) + ggtitle("louvain_0.5"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.1", label=T) + ggtitle("louvain_1"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.2", label=T) + ggtitle("louvain_2"),
    ncol = 3
)

We can now use the clustree package to visualize how cells are distributed between clusters depending on resolution.

suppressPackageStartupMessages(library(clustree))
clustree(alldata@meta.data, prefix = "RNA_snn_res.")

2 K-means clustering

K-means is a generic clustering algorithm that has been used in many application areas. In R, it can be applied via the kmeans() function. Typically, it is applied to a reduced dimension representation of the expression data (most often PCA, because of the interpretability of the low-dimensional distances). We need to define the number of clusters in advance. Since the results depend on the initialization of the cluster centers, it is typically recommended to run K-means with multiple starting configurations (via the nstart argument).

for (k in c(5, 7, 10, 12, 15, 17, 20)) {
    alldata@meta.data[, paste0("kmeans_", k)] <- kmeans(x = Embeddings(alldata, "integrated_cca"), centers = k, nstart = 100)$cluster
}

wrap_plots(
    DimPlot(alldata, reduction = "umap_cca", group.by = "kmeans_5", label=T) + ggtitle("kmeans_5"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "kmeans_10", label=T) + ggtitle("kmeans_10"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "kmeans_15", label=T) + ggtitle("kmeans_15"),
    ncol = 3
)

clustree(alldata@meta.data, prefix = "kmeans_")

3 Hierarchical clustering

3.1 Defining distance between cells

The base R stats package already contains a function dist that calculates distances between all pairs of samples. Since we want to compute distances between samples, rather than among genes, we need to transpose the data before applying it to the dist function. This can be done by simply adding the transpose function t() to the data. The distance methods available in dist are: ‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’ or ‘minkowski’.

d <- dist(Embeddings(alldata, "integrated_cca"), method = "euclidean")

As you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object. We can first compute sample correlations using the cor function.
As you already know, correlation range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. This, however, creates a problem in defining distances because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest and distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):
\[adj = \frac{1- cor}{2}\]
Once we transformed the correlations to a 0-1 scale, we can simply convert it to a distance object using as.dist() function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.

# Compute sample correlations
sample_cor <- cor(Matrix::t(Embeddings(alldata, "integrated_cca")))

# Transform the scale from correlations
sample_cor <- (1 - sample_cor) / 2

# Convert it to a distance object
d2 <- as.dist(sample_cor)

3.2 Clustering cells

After having calculated the distances between samples, we can now proceed with the hierarchical clustering per-se. We will use the function hclust() for this purpose, in which we can simply run it with the distance objects created above. The methods available are: ‘ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’ or ‘centroid’. It is possible to plot the dendrogram for all cells, but this is very time consuming and we will omit for this tutorial.

# euclidean
h_euclidean <- hclust(d, method = "ward.D2")

# correlation
h_correlation <- hclust(d2, method = "ward.D2")

Once your dendrogram is created, the next step is to define which samples belong to a particular cluster. After identifying the dendrogram, we can now literally cut the tree at a fixed threshold (with cutree) at different levels to define the clusters. We can either define the number of clusters or decide on a height. We can simply try different clustering levels.

# euclidean distance
alldata$hc_euclidean_5 <- cutree(h_euclidean, k = 5)
alldata$hc_euclidean_10 <- cutree(h_euclidean, k = 10)
alldata$hc_euclidean_15 <- cutree(h_euclidean, k = 15)

# correlation distance
alldata$hc_corelation_5 <- cutree(h_correlation, k = 5)
alldata$hc_corelation_10 <- cutree(h_correlation, k = 10)
alldata$hc_corelation_15 <- cutree(h_correlation, k = 15)

wrap_plots(
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_euclidean_5", label=T) + ggtitle("hc_euc_5"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_euclidean_10", label=T) + ggtitle("hc_euc_10"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_euclidean_15", label=T) + ggtitle("hc_euc_15"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_corelation_5", label=T) + ggtitle("hc_cor_5"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_corelation_10", label=T) + ggtitle("hc_cor_10"),
    DimPlot(alldata, reduction = "umap_cca", group.by = "hc_corelation_15", label=T) + ggtitle("hc_cor_15"),
    ncol = 3
) + plot_layout()

Finally, lets save the clustered data for further analysis.

saveRDS(alldata, "data/covid/results/seurat_covid_qc_dr_int_cl.rds")

4 Distribution of clusters

Now, we can select one of our clustering methods and compare the proportion of samples across the clusters.

Select the RNA_snn_res.0.5 and plot proportion of samples per cluster and also proportion covid vs ctrl.

p1 <- ggplot(alldata@meta.data, aes(x = RNA_snn_res.0.5, fill = orig.ident)) +
    geom_bar(position = "fill")
p2 <- ggplot(alldata@meta.data, aes(x = RNA_snn_res.0.5, fill = type)) +
    geom_bar(position = "fill")

p1 + p2

In this case we have quite good representation of each sample in each cluster. But there are clearly some biases with more cells from one sample in some clusters and also more covid cells in some of the clusters.

We can also plot it in the other direction, the proportion of each cluster per sample.

ggplot(alldata@meta.data, aes(x = orig.ident, fill = RNA_snn_res.0.5)) +
    geom_bar(position = "fill")

Discuss

By now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, that should be stored in your data object, and plot it as violin plots per cluster using the clustering method of your choice. For example, plot number of UMIS, detected genes, percent mitochondrial reads. Then, check carefully if there is any bias in how your data is separated by quality metrics. Could it be explained biologically, or could there be a technical bias there?

VlnPlot(alldata, group.by = "RNA_snn_res.0.5", features = c("nFeature_RNA", "percent_mito"))

Some clusters that are clearly defined by higher number of genes and counts. These are either doublets or a larger celltype. And some clusters with low values on these metrics that are either low quality cells or a smaller celltype. You will have to explore these clusters in more detail to judge what you believe them to be.

5 Subclustering of T and NK-cells

It is common that the subtypes of cells within a cluster is not so well separated when you have a heterogeneous dataset. In such a case it could be a good idea to run subclustering of individual celltypes. The main reason for subclustering is that the variable genes and the first principal components in the full analysis are mainly driven by differences between celltypes, while with subclustering we may detect smaller differences between subtypes within celltypes.

So first, lets find out where our T-cell and NK-cell clusters are. We know that T-cells express CD3E, and the main subtypes are CD4 and CD8, while NK-cells express GNLY.

# check with the lowest resolution
p1 = DimPlot(alldata, reduction = "umap_cca", group.by = "RNA_snn_res.0.1", label = T) + ggtitle("louvain_0.1")
p2 = FeaturePlot(alldata, features = "CD3E", reduction = "umap_cca", order = T) 
p3 = FeaturePlot(alldata, features = "CD4", reduction = "umap_cca", order = T) 
p4 = FeaturePlot(alldata, features = "CD8A", reduction = "umap_cca", order = T) 
p5 = FeaturePlot(alldata, features = "GNLY", reduction = "umap_cca", order = T) 


wrap_plots(p1,p2,p3,p4,p5, ncol=3) + plot_layout(guides = "collect")

We can clearly see what clusters are T-cell clusters, so lets subset the data for those cells

tcells = alldata[,alldata$RNA_snn_res.0.1 %in% c(0,3)]

table(tcells$orig.ident)


 covid_1 covid_15 covid_16 covid_17  ctrl_13  ctrl_14  ctrl_19   ctrl_5 
     280      242      130      292      636      521      562      705

Ideally we should rerun all steps of integration with that subset of cells instead of just taking the joint embedding. If you have too few cells per sample in the celltype that you want to cluster it may not be possible. We will start with selecting a new set of genes that better reflecs the variability within this celltype

tcells = FindVariableFeatures(tcells, verbose = FALSE)

# check overlap with the variable genes using all the data
length(intersect(VariableFeatures(alldata), VariableFeatures(tcells)))

[1] 1257

We clearly have a very different geneset now, so hopefully it should better capture the variability within T-cells.

Now we have to run the full pipeline with scaling, pca, integration and clustering on this subset of cells, using the new set of variable genes

# run all the steps from before:
tcells = ScaleData(tcells, vars.to.regress = c("percent_mito", "nFeature_RNA"), assay = "RNA")
tcells = RunPCA(tcells, npcs = 50, verbose = F)

tcells <- IntegrateLayers(object = tcells, 
                           method = CCAIntegration, orig.reduction = "pca", 
                           new.reduction = "integrated_tcells", verbose = FALSE)

tcells <- RunUMAP(tcells, reduction = "integrated_tcells", dims = 1:30, reduction.name = "umap_tcells")

tcells <- FindNeighbors(tcells, reduction = "integrated_tcells", dims = 1:30)
tcells <- FindClusters(tcells, graph.name = "RNA_snn", resolution = 0.5, algorithm = 1, cluster.name = "tcell_0.5")

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 3368
Number of edges: 205312

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8365
Number of communities: 8
Elapsed time: 0 seconds

wrap_plots(
  DimPlot(tcells, reduction = "umap_cca", group.by = "orig.ident")+NoAxes()+ggtitle("Full umap"),
  DimPlot(tcells, reduction = "umap_cca", group.by = "RNA_snn_res.0.5", label = T)+NoAxes()+ggtitle("Full umap, full clust"),
  DimPlot(tcells, reduction = "umap_cca", group.by = "tcell_0.5", label = T)+NoAxes()+ggtitle("Full umap, T-cell clust"),
  DimPlot(tcells, reduction = "umap_tcells", group.by = "orig.ident")+NoAxes()+ggtitle("T-cell umap, T-cell clust"),
  DimPlot(tcells, reduction = "umap_tcells", group.by = "RNA_snn_res.0.5", label=T)+NoAxes()+ggtitle("T-cell umap, full clust"),
  DimPlot(tcells, reduction = "umap_tcells", group.by = "tcell_0.5", label = T)+NoAxes()+ggtitle("T-cell umap"),  
  ncol = 3
) + plot_layout(guides = "collect")

As you can see, we do have some new clusters that did not stand out before (clusters 6,7). But in general the separation looks very similar.

Lets also have a look at some genes in the new umap:

wrap_plots(
  FeaturePlot(tcells, features = "CD3E", reduction = "umap_tcells", order = T), 
  FeaturePlot(tcells, features = "CD4", reduction = "umap_tcells", order = T), 
  FeaturePlot(tcells, features = "CD8A", reduction = "umap_tcells", order = T), 
  FeaturePlot(tcells, features = "GNLY", reduction = "umap_tcells", order = T), 
  ncol = 2
) + plot_layout(guides = "collect")

Discuss

Have a look at the T-cells in the umaps with all cells or only T/NK cells. What are the main differences? Do you think it improved with subclustering? Also, there are some cells in these clusters that fall far away from the rest in the UMAPs, why do you think that is?

6 Session info

Click here

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/conda/envs/seurat/lib/libopenblasp-r0.3.28.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustree_0.5.1     ggraph_2.2.1       pheatmap_1.0.12    ggplot2_3.5.1     
[5] patchwork_1.3.0    Seurat_5.1.0       SeuratObject_5.0.2 sp_2.2-0          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     jsonlite_1.8.9         magrittr_2.0.3        
  [4] ggbeeswarm_0.7.2       spatstat.utils_3.1-2   farver_2.1.2          
  [7] rmarkdown_2.29         vctrs_0.6.5            ROCR_1.0-11           
 [10] memoise_2.0.1          spatstat.explore_3.3-4 htmltools_0.5.8.1     
 [13] sctransform_0.4.1      parallelly_1.42.0      KernSmooth_2.23-26    
 [16] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
 [19] plotly_4.10.4          zoo_1.8-12             cachem_1.1.0          
 [22] igraph_2.0.3           mime_0.12              lifecycle_1.0.4       
 [25] pkgconfig_2.0.3        Matrix_1.6-5           R6_2.6.1              
 [28] fastmap_1.2.0          fitdistrplus_1.2-2     future_1.34.0         
 [31] shiny_1.10.0           digest_0.6.37          colorspace_2.1-1      
 [34] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
 [37] labeling_0.4.3         progressr_0.15.1       spatstat.sparse_3.1-0 
 [40] httr_1.4.7             polyclip_1.10-7        abind_1.4-5           
 [43] compiler_4.3.3         withr_3.0.2            backports_1.5.0       
 [46] viridis_0.6.5          fastDummies_1.7.5      ggforce_0.4.2         
 [49] MASS_7.3-60.0.1        tools_4.3.3            vipor_0.4.7           
 [52] lmtest_0.9-40          beeswarm_0.4.0         httpuv_1.6.15         
 [55] future.apply_1.11.3    goftest_1.2-3          glue_1.8.0            
 [58] nlme_3.1-167           promises_1.3.2         grid_4.3.3            
 [61] checkmate_2.3.2        Rtsne_0.17             cluster_2.1.8         
 [64] reshape2_1.4.4         generics_0.1.3         gtable_0.3.6          
 [67] spatstat.data_3.1-4    tidyr_1.3.1            data.table_1.16.4     
 [70] tidygraph_1.3.0        spatstat.geom_3.3-5    RcppAnnoy_0.0.22      
 [73] ggrepel_0.9.6          RANN_2.6.2             pillar_1.10.1         
 [76] stringr_1.5.1          spam_2.11-1            RcppHNSW_0.6.0        
 [79] later_1.4.1            splines_4.3.3          dplyr_1.1.4           
 [82] tweenr_2.0.3           lattice_0.22-6         survival_3.8-3        
 [85] deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1        
 [88] pbapply_1.7-2          knitr_1.49             gridExtra_2.3         
 [91] scattermore_1.2        xfun_0.50              graphlayouts_1.2.2    
 [94] matrixStats_1.5.0      stringi_1.8.4          lazyeval_0.2.2        
 [97] yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20      
[100] tibble_3.2.1           cli_3.6.4              uwot_0.2.2            
[103] xtable_1.8-4           reticulate_1.40.0      munsell_0.5.1         
[106] Rcpp_1.0.14            globals_0.16.3         spatstat.random_3.3-2 
[109] png_0.1-8              ggrastr_1.0.2          spatstat.univar_3.1-1 
[112] parallel_4.3.3         dotCall64_1.2          listenv_0.9.1         
[115] viridisLite_0.4.2      scales_1.3.0           ggridges_0.5.6        
[118] leiden_0.4.3.1         purrr_1.0.2            rlang_1.1.5           
[121] cowplot_1.1.3