Paulo Czarnewski

First, let’s load all necessary libraries and the QC-filtered dataset from the previous step.

```
suppressPackageStartupMessages({
library(scater)
library(scran)
library(cowplot)
library(ggplot2)
library(rafalib)
library(umap)
})
sce <- readRDS("data/results/covid_qc.rds")
```

Next, we need to identify the features (genes) that best distinguish cell types in our dataset. To do so, we look for genes that are highly variable across cells, since these also tend to give a good separation of the cell clusters.

```
# Normalize: pool-based size factors, then log-transformed normalized counts
sce <- computeSumFactors(sce, sizes = c(20, 40, 60, 80))
sce <- logNormCounts(sce)

# Model the per-gene variance and pick the top variable genes
var.out <- modelGeneVar(sce, method = "loess")
hvgs <- getTopHVGs(var.out, n = 2000)

mypar(1, 2)
# plot mean over TOTAL variance
# Visualizing the fit:
fit.var <- metadata(var.out)
plot(fit.var$mean, fit.var$var, xlab = "Mean of log-expression", ylab = "Variance of log-expression")
curve(fit.var$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

# Select 1000 top variable genes
hvg.out <- getTopHVGs(var.out, n = 1000)

# highlight those genes in the plot
cutoff <- rownames(var.out) %in% hvg.out
points(fit.var$mean[cutoff], fit.var$var[cutoff], col = "red", pch = 16, cex = 0.6)

# plot mean over BIOLOGICAL variance
plot(var.out$mean, var.out$bio, pch = 16, cex = 0.4, xlab = "Mean log-expression",
    ylab = "Variance of log-expression")
lines(c(min(var.out$mean), max(var.out$mean)), c(0, 0), col = "dodgerblue", lwd = 2)
points(var.out$mean[cutoff], var.out$bio[cutoff], col = "red", pch = 16, cex = 0.6)
```
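The idea behind the two plots above can be illustrated on simulated data. The following is a minimal base R sketch, not the scran implementation: it fits a smooth mean-variance trend with `loess` and treats what is left over as the "biological" component. The toy matrix, the 10 spiked-in genes and all variable names are assumptions made for illustration only.

```r
# Toy illustration in base R (not the scran implementation): split each
# gene's variance into a mean-dependent trend and a leftover component.
set.seed(1)
n_genes <- 200
n_cells <- 50

# Simulated log-expression: each gene has its own mean, all share one noise level.
gene_mu <- runif(n_genes, min = 0.5, max = 5)
logexpr <- matrix(rnorm(n_genes * n_cells, mean = rep(gene_mu, n_cells), sd = 0.5),
    nrow = n_genes)

# Hypothetical "biological" signal: extra variability added to the first 10 genes.
logexpr[1:10, ] <- logexpr[1:10, ] + matrix(rnorm(10 * n_cells, sd = 2), nrow = 10)

gene_mean <- rowMeans(logexpr)
gene_var <- apply(logexpr, 1, var)

# Fit a smooth trend of variance against mean, then subtract it.
trend_fit <- loess(gene_var ~ gene_mean)
bio <- gene_var - predict(trend_fit)

# Rank genes by the variance remaining after the trend is removed.
top_genes <- order(bio, decreasing = TRUE)[1:10]
```

In this toy setting, the genes with the largest leftover variance should largely coincide with the spiked-in ones, which mirrors how the red points in the plots sit above the fitted trend.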

Now that the data is prepared, we proceed with PCA. Because genes
differ in expression level, genes with higher expression values will
naturally show higher variation, and that variation will dominate the
PCA. This means that we need to give each gene a similar weight when
performing PCA (see below). The common practice is to center and scale
each gene before performing PCA. This scaling is called Z-score
normalization, and it is very useful for PCA, clustering and plotting
heatmaps.
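As a small worked example of Z-score normalization, the sketch below uses base R `scale()` on a made-up two-gene matrix (genes in rows, cells in columns). The gene names and values are assumptions chosen only to show that, after scaling, every gene has mean 0 and standard deviation 1 regardless of its original expression level.

```r
set.seed(42)
# Toy matrix: gene2 has a much larger mean and spread than gene1.
expr <- rbind(
    gene1 = rnorm(100, mean = 1, sd = 0.2),
    gene2 = rnorm(100, mean = 50, sd = 10)
)

# scale() operates on columns, so transpose to scale genes, then transpose back.
zscores <- t(scale(t(expr), center = TRUE, scale = TRUE))

# After scaling, both genes are on the same footing.
gene_means <- rowMeans(zscores)
gene_sds <- apply(zscores, 1, sd)
```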

Additionally, we can use regression to remove any unwanted sources of
variation from the dataset, such as `cell cycle`, `sequencing depth`
and `percent mitochondria`. This is achieved by fitting a generalized
linear regression with these parameters as covariates in the model. The
residuals of the model are then taken as the “regressed data”. Batch
effect regression can also be done here, although this is perhaps not
the best way to do it.
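The residual idea can be sketched for a single gene with an ordinary linear model. Note that this uses plain `lm()` on simulated values rather than a generalized linear model on real data, and the covariate `depth` and the two-group signal are hypothetical stand-ins for illustration.

```r
set.seed(7)
n_cells <- 200

# Hypothetical unwanted covariate (e.g. sequencing depth) and a signal of interest.
depth <- rnorm(n_cells)
group <- rep(c(0, 1), each = n_cells / 2)
gene <- 2 * depth + 1 * group + rnorm(n_cells, sd = 0.3)

# Regress the gene on the unwanted covariate and keep the residuals.
fit <- lm(gene ~ depth)
regressed <- residuals(fit)

# The residuals are uncorrelated with the covariate that was regressed out.
cor_before <- cor(gene, depth)
cor_after <- cor(regressed, depth)
```

Here the strong correlation with `depth` disappears from the residuals, while the difference between the two groups is not part of the model and so remains in the regressed values.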

By default, variables are scaled within the PCA step, so scaling is not done separately. It could, however, be achieved by running the commands below:

```
# sce@assays$data@listData$scaled.data <-
# apply(exprs(sce)[rownames(hvg.out),,drop=FALSE],2,function(x) scale(x,T,T))
# rownames(sce@assays$data@listData$scaled.data) <- rownames(hvg.out)
```

Performing PCA has many useful applications and interpretations, which depend largely on the data used. In the life sciences, we want to segregate samples based on gene expression patterns in the data.

As said above, we use the `logcounts` and then set `scale` to TRUE in order to scale each gene.

```
# runPCA and specify the variable genes to use for dim reduction with
# subset_row
sce <- runPCA(sce, exprs_values = "logcounts", ncomponents = 50, subset_row = hvg.out,
    scale = TRUE)
```
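To see what scaling inside the PCA step does, here is a minimal sketch using base R `prcomp()` on simulated data rather than `runPCA()` on the real object. The three toy "genes" and their expression levels are assumptions. Without scaling, the most highly expressed gene dominates the first component; with `scale. = TRUE` the result matches a PCA run on data that was Z-scored beforehand.

```r
set.seed(3)
# Toy data with cells in rows and genes in columns (the layout prcomp expects).
cells_by_genes <- cbind(
    low = rnorm(100, mean = 1, sd = 0.5),
    mid = rnorm(100, mean = 5, sd = 2),
    high = rnorm(100, mean = 100, sd = 30)
)

# PCA without scaling, with scaling, and on manually Z-scored data.
pca_unscaled <- prcomp(cells_by_genes, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(cells_by_genes, center = TRUE, scale. = TRUE)
pca_manual <- prcomp(scale(cells_by_genes), center = FALSE, scale. = FALSE)

# Fraction of total variance captured by each principal component.
var_unscaled <- pca_unscaled$sdev^2 / sum(pca_unscaled$sdev^2)
var_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
```

Comparing `var_unscaled[1]` with `var_scaled[1]`, and inspecting `pca_unscaled$rotation[, 1]`, shows how much the unscaled first component is driven by the `high` gene alone.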

We then plot the first principal components.

```
plot_grid(ncol = 3,
    plotReducedDim(sce, dimred = "PCA", colour_by = "sample", ncomponents = 1:2,
        point_size = 0.6),
    plotReducedDim(sce, dimred = "PCA", colour_by = "sample", ncomponents = 3:4,
        point_size = 0.6),
    plotReducedDim(sce, dimred = "PCA", colour_by = "sample", ncomponents = 5:6,
        point_size = 0.6))
```