Paulo Czarnewski

First, let’s load all necessary libraries and the QC-filtered dataset from the previous step.

```
suppressPackageStartupMessages({
library(scater)
library(scran)
library(cowplot)
library(ggplot2)
library(rafalib)
library(umap)
})
<- readRDS("data/results/covid_qc.rds") sce
```

Next, we first need to define which features/genes are important in our dataset to distinguish cell types. For this purpose, we need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.

```
<- computeSumFactors(sce, sizes = c(20, 40, 60, 80))
sce <- logNormCounts(sce)
sce <- modelGeneVar(sce, method = "loess")
var.out = getTopHVGs(var.out, n = 2000)
hvgs
mypar(1, 2)
# plot mean over TOTAL variance Visualizing the fit:
<- metadata(var.out)
fit.var plot(fit.var$mean, fit.var$var, xlab = "Mean of log-expression", ylab = "Variance of log-expression")
curve(fit.var$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
# Select 1000 top variable genes
<- getTopHVGs(var.out, n = 1000)
hvg.out
# highligt those cells in the plot
<- rownames(var.out) %in% hvg.out
cutoff points(fit.var$mean[cutoff], fit.var$var[cutoff], col = "red", pch = 16, cex = 0.6)
# plot mean over BIOLOGICAL variance
plot(var.out$mean, var.out$bio, pch = 16, cex = 0.4, xlab = "Mean log-expression",
ylab = "Variance of log-expression")
lines(c(min(var.out$mean), max(var.out$mean)), c(0, 0), col = "dodgerblue", lwd = 2)
points(var.out$mean[cutoff], var.out$bio[cutoff], col = "red", pch = 16, cex = 0.6)
```