This is a basic guided analysis using 3 PBMC datasets.


Install software dependencies

The only two pieces of software you need to install yourself are Conda and git. Install each one following its own installation instructions.

All other software used in Sauron is managed through a Conda environment and will be installed automatically. This includes R, RStudio, Python and all packages and libraries needed to run the workflow. The complete list of software and versions can be found in the environment.yml file.


Clone this repository

First, you will need to clone this repo into your project folder. For this tutorial we will create a folder sauron_tutorial_PBMC inside our Downloads folder:

This will create a folder named “sauron” in your project folder, containing all the files required for the analysis.

Alternatively, you can simply create these folders and download the repository manually. Your folder structure should look like this:


Adding metadata to your folder

Add your data and metadata into the data directory, with one dataset matrix per folder (i.e. one plate per folder or one 10X lane per folder). Name each folder after the desired sample name. The sample names should match the names in the first column of your metadata csv file.

We will manually add information to the metadata file, which can be created with any spreadsheet editor and saved as .csv.

Note that each line corresponds to one dataset, and the first column must contain exactly the names of the dataset folders. Only names found in both the metadata and the data folder will be used.
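The matching rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Sauron runs: the function name `usable_samples`, the column names and the sample names are all hypothetical.

```python
import csv
import io

def usable_samples(metadata_csv_text, data_folder_names):
    """Return the sample names found in both the first metadata column and the data folders."""
    reader = csv.reader(io.StringIO(metadata_csv_text))
    next(reader, None)  # skip the header row holding the column names
    metadata_names = [row[0].strip() for row in reader if row]
    folders = set(data_folder_names)
    # Keep the metadata order and drop any name without a matching folder
    return [name for name in metadata_names if name in folders]

# Hypothetical metadata content and folder names, for illustration only
example_metadata = """SampleID,assay,donor
sample_a,10x,donor1
sample_b,10x,donor2
sample_c,10x,donor3
"""
example_folders = ["sample_a", "sample_b", "sample_x"]

matched = usable_samples(example_metadata, example_folders)
print(matched)
```

In this toy case `sample_c` has no folder and `sample_x` has no metadata line, so neither would be used. Running a check like this before loading the data is a quick way to catch spelling mismatches between folder names and the metadata file.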


Your final folder should look like this:



Loading the data

With the data in place, we can now define some analysis parameters and then load the data into Sauron. Here we will define some metadata parameters that we would like to use for plotting later on. These are based on the column names of the metadata file above.

We can now load them using the 00_load_data.R function.

This will create a Seurat object with a slot named rna containing all the counts. The results will be written to the folder analysis/1_qc. The log file produced by this function can be found at log/00_load_data_log.txt. There should now be a raw_seurat_object.rds in your qc folder.



Quality control

Once the data is loaded into the Seurat object, it becomes easy to work with it and compute several quality control (QC) metrics using the function 01_qc_filter.R. By default, this function will compute cell cycle scores, remove non-coding genes (including pseudogenes), and calculate the percentage of several gene families.

The function outputs two kinds of files: those containing ALL/RAW cells and those containing FILTERED cells only. You can therefore choose later which set to proceed with if you do not want to exclude any cells at first. Your folder should look like this:

By default, cells need to be within:

  • 0 - 25 % mitochondrial genes
  • 0 - 50 % RPS genes
  • 0 - 50 % RPL genes
  • 0.5 - 99.5 quantiles of number of UMIs (removing extreme outliers)
  • 0.5 - 99.5 quantiles of number of counts (removing extreme outliers)
  • 90 - 100 % protein coding genes
  • .9 - 1 Gini index
  • .95 - 1 Simpson index
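
To make the thresholds above more concrete, here is a minimal Python sketch of how such range filters could be applied to a table of per-cell metrics. It is not the implementation in 01_qc_filter.R. The metric names (`perc_mito`, `perc_rps`, `perc_rpl`, `perc_protein_coding`, `n_umis`) are hypothetical, the quantile rule is applied to the UMI count only, and the Gini and Simpson criteria are left out for brevity.

```python
def quantile(values, q):
    """Quantile of a list using linear interpolation, with q in [0, 1]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac

# (hypothetical metric name, lower bound, upper bound), bounds taken from the list above
PERCENT_RANGES = [
    ("perc_mito", 0, 25),
    ("perc_rps", 0, 50),
    ("perc_rpl", 0, 50),
    ("perc_protein_coding", 90, 100),
]

def filter_cells(cells):
    """Keep cells inside every percentage range and inside the 0.5 - 99.5 UMI quantiles."""
    umis = [cell["n_umis"] for cell in cells]
    umi_low, umi_high = quantile(umis, 0.005), quantile(umis, 0.995)
    kept = []
    for cell in cells:
        in_ranges = all(low <= cell[name] <= high for name, low, high in PERCENT_RANGES)
        if in_ranges and umi_low <= cell["n_umis"] <= umi_high:
            kept.append(cell)
    return kept

# Toy data: 1000 cells with increasing UMI counts and otherwise good metrics
cells = [
    {"id": i, "perc_mito": 5.0, "perc_rps": 10.0, "perc_rpl": 12.0,
     "perc_protein_coding": 97.0, "n_umis": 1000 + i}
    for i in range(1000)
]
cells[500]["perc_mito"] = 40.0  # one cell with too many mitochondrial reads

kept = filter_cells(cells)
print(len(cells), "cells before filtering,", len(kept), "after")
```

The quantile bounds trim a fixed fraction of cells at each extreme of the UMI distribution, regardless of the absolute values, while the percentage ranges only remove cells that actually fall outside the stated limits.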

Some of the output plots can be found below, before and after filtering:

ALL cells


FILTERED cells only



Dataset integration

Once low-quality cells have been filtered out, we can proceed to integrate the datasets with the function 02_integrate.R. It will adjust for batch effects, if present, and store the result as an integrated space in the corresponding slot. Currently implemented methods are mnn, cca, and combat. MNN is the fastest and will be used here, but it does not generate a corrected gene expression matrix. Therefore, we will use it only to reduce dimensions and perform clustering later on.
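
For readers unfamiliar with MNN (mutual nearest neighbors), the toy Python sketch below shows only the core pairing idea: a cell from one batch and a cell from another batch form a pair when each is among the other's nearest neighbors. This is a simplified illustration with made-up two-dimensional points, not the procedure run by 02_integrate.R, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(queries, references, k):
    """For each query point, return the set of indices of its k closest reference points."""
    result = []
    for q in queries:
        order = sorted(range(len(references)),
                       key=lambda j: squared_distance(q, references[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbors."""
    a_to_b = nearest_neighbors(batch_a, batch_b, k)
    b_to_a = nearest_neighbors(batch_b, batch_a, k)
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]

# Made-up 2D coordinates standing in for cells from two batches
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.5, 0.2), (5.5, 5.1), (20.0, 20.0)]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
print(pairs)
```

Cells without a mutual partner, such as the third point of each batch here, do not form pairs. In MNN-based correction, the mutual pairs serve as anchors for estimating the batch effect between datasets.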

There should now be a folder named 2_clustering containing your Seurat object. This object contains everything from the one used as input, plus the integrated slot and the computed variable genes.

This function also calculates variable genes using both the scran and Seurat methods, for each dataset and for all datasets together. It will output them into the variable_genes folder, which should look like this:

Variable genes folder