In this tutorial we continue the analysis of the integrated dataset, using the integrated PCA to cluster the cells. We first construct a \(k\)-nearest neighbour (kNN) graph and cluster on that graph, and then show how to perform hierarchical clustering and k-means clustering in PCA space.

Let’s first load all necessary libraries and also the integrated dataset from the previous step.

```
if (!require(clustree)) {
install.packages("clustree", dependencies = FALSE)
}
```

`## Loading required package: clustree`

`## Loading required package: ggraph`

```
suppressPackageStartupMessages({
library(Seurat)
library(cowplot)
library(ggplot2)
library(pheatmap)
library(rafalib)
library(clustree)
})
alldata <- readRDS("data/results/covid_qc_dr_int.rds")
```

Clustering on a graph can be summarised in 3 main steps:

1. Build a kNN graph from the data.
2. Prune spurious connections from the kNN graph (optional step). The result is a shared nearest neighbour (SNN) graph.
3. Find groups of cells that maximize the connections within each group compared with the connections to other groups.

The first step in graph clustering is to construct a kNN graph, if you don’t already have one. For this we will use the PCA space. As for dimensionality reduction, we will use only the top *N* PCA dimensions (the same ones used for computing UMAP / tSNE).
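As a rough illustration of what building a kNN graph on the top *N* PCs involves, here is a minimal base R sketch on simulated data. The toy matrix `expr`, the values `n_pcs = 5` and `k = 4`, and the brute-force distance search are all assumptions made for this example; Seurat relies on a faster, dedicated nearest-neighbour search.

```r
set.seed(1)

# Toy data (hypothetical): 20 "cells" x 50 "genes"
n_cells <- 20
expr <- matrix(rnorm(n_cells * 50), nrow = n_cells)
rownames(expr) <- paste0("cell", seq_len(n_cells))

# PCA on the cells, keeping only the top N components
n_pcs <- 5
pcs <- prcomp(expr, center = TRUE, scale. = FALSE)$x[, seq_len(n_pcs)]

# Pairwise Euclidean distances between cells in PCA space
d <- as.matrix(dist(pcs))

# Binary kNN matrix: row i has a 1 for each of the k nearest cells.
# In this sketch a cell counts as its own nearest neighbour (distance 0).
k <- 4
knn <- matrix(0, n_cells, n_cells, dimnames = dimnames(d))
for (i in seq_len(n_cells)) {
  knn[i, order(d[i, ])[seq_len(k)]] <- 1
}

# Every cell has exactly k neighbours
table(rowSums(knn))
```

Note that the resulting matrix is not necessarily symmetric: cell A can be among the nearest neighbours of cell B without the reverse being true.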

The **Seurat** function `FindNeighbors` computes both the KNN and SNN graphs, and its `prune.SNN` argument controls the minimal fraction of shared neighbours required for a connection to be kept. See `?FindNeighbors` for additional options.

```
# check that CCA is still the active assay
alldata@active.assay

alldata <- FindNeighbors(alldata, dims = 1:30, k.param = 60, prune.SNN = 1/15)
```

`## Computing nearest neighbor graph`

`## Computing SNN`

```
# check the names for graphs in the object.
names(alldata@graphs)
```

```
## [1] "CCA"
## [1] "CCA_nn" "CCA_snn"
```
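The relation between the two graphs, and the effect of a pruning cutoff such as `prune.SNN = 1/15`, can be sketched in base R. This is only an illustration under stated assumptions: the 2D toy coordinates, `n = 40` and `k = 10` are made up, and the edge weight is taken to be the Jaccard index of the two neighbour sets, which is how the `prune.SNN` documentation describes the overlap. It is not Seurat's actual implementation.

```r
set.seed(7)

# Toy data (hypothetical): 40 points in 2D standing in for cells in PCA space
n <- 40
k <- 10
coords <- matrix(rnorm(n * 2), ncol = 2)
d <- as.matrix(dist(coords))

# Binary kNN matrix; each cell is counted among its own k neighbours
knn <- matrix(0, n, n)
for (i in seq_len(n)) {
  knn[i, order(d[i, ])[seq_len(k)]] <- 1
}

# Number of neighbours shared by every pair of cells
shared <- knn %*% t(knn)

# Jaccard index: shared neighbours / size of the union of both neighbour sets.
# Both sets have size k, so the union has 2 * k - shared members.
jaccard <- shared / (2 * k - shared)

# Prune: edges with a Jaccard index at or below the cutoff are set to 0
prune <- 1 / 15
snn <- jaccard
snn[snn <= prune] <- 0

# Compare the number of non-zero entries before and after pruning
c(before = sum(jaccard > 0), after = sum(snn > 0))
```

With `k = 10`, a pair of cells sharing a single neighbour has a Jaccard index of 1/19, which is below 1/15, so such edges are removed, while pairs sharing two or more neighbours are kept.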

We can take a look at the kNN graph. It is a matrix where every connection between cells is represented as a \(1\). This is called an **unweighted** graph (the default in Seurat). Some cell connections can, however, be more important than others; in that case the values in the graph range from \(0\) to a maximum distance. Usually, the smaller the distance, the closer two points are and the stronger their connection. This is called a **weighted** graph. Both weighted and unweighted graphs are suitable for clustering, but clustering on unweighted graphs is faster for large datasets (> 100k cells).
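To make the distinction concrete, the following base R sketch builds both representations for the same set of edges. The six random points and `k = 2` are assumptions for illustration only, and storing the raw Euclidean distance as the edge weight is just one possible weighting.

```r
set.seed(42)

# Toy data (hypothetical): 6 points in 2D
pts <- matrix(rnorm(6 * 2), ncol = 2)
d <- as.matrix(dist(pts))

# Mark the k nearest neighbours of each point (excluding the point itself)
k <- 2
is_nn <- t(apply(d, 1, function(x) rank(x, ties.method = "first") <= k + 1))
diag(is_nn) <- FALSE

# Unweighted graph: every connection is simply a 1
unweighted <- is_nn * 1

# Weighted graph: same connections, but each entry holds the distance.
# Here 0 means "no edge" and smaller positive values mean closer points.
weighted <- is_nn * d

unweighted
round(weighted, 2)
```

Both matrices have non-zero entries in exactly the same positions; only the values stored on the edges differ.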

```
# plot the kNN graph for the first 200 cells: black = connection, white = none
pheatmap(alldata@graphs$CCA_nn[1:200, 1:200], col = c("white", "black"), border_color = "grey90",
    legend = FALSE, cluster_rows = FALSE, cluster_cols = FALSE, fontsize = 2)
```