

Introduction

This section describes in detail the functions and parameters used in the workflow.


Running with SLURM

You can safely ignore these lines if you are running the pipeline on a single computer.

The first lines provide a convenient way to run the workflow through the SLURM queueing system in a High Performance Computing environment.


Initializing Sauron.v1 conda environment

You can safely ignore these lines.

The following lines allow the pipeline to find the main project folder (i.e. where run_workflow.sh is located).

The first line sets where your data and scripts are. Because of this, the path to your project can simply be referred to as $main. The second line loads the script that initializes the Sauron.v1 conda environment with all packages needed for the analysis. If you don’t already have the environment, it will be created for you.


Global parameters

In these lines you can define global parameters to be used across your analysis. They should come either from the metadata file that you provide for each dataset (see below) or from the pipeline itself (see below).

These names are essentially column names of the metadata stored inside the Seurat object.

By default, the Quality Control step (see below) automatically adds several metrics to the Seurat object, and those can be used for plotting here. Some of them are: nCount_RNA, nFeature_RNA, percent_rps, percent_rpl, percent_ribo, S.Score, G2M.Score.



Load datasets

input_path

The PATH to the FOLDER containing your datasets (as folders). Each folder inside this PATH is a dataset.

dataset_metadata_path

The PATH to the metadata FILE describing each library. The first column should be named SampleID.

assay

The ASSAY slot in the Seurat object to load the data. Default is “RNA”. You can add any name that describes the type of data in that assay: “CITE”, “ATAC”, “PROTEIN”, etc.

output_path

The PATH to the FOLDER to output your Seurat object.



To load datasets, first place them in the data folder: create a folder for each individual dataset, and within each folder, place your dataset (see figure below). This keeps the rawdata folder well organized for the analysis.

Besides the rawdata, you can add a metadata .csv file, which can contain information about each specific dataset. Each line is a dataset. Each column is a metadata variable. All those values are passed to the Seurat object on loading.
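To make the expected layout concrete, here is a minimal sketch that reads such a table with Python's standard csv module. The columns other than SampleID, the sample values and the helper name read_metadata are assumptions made for this illustration; the pipeline does its own parsing during the loading step.

```python
import csv
import io

# Hypothetical metadata content: one row per dataset, one column per variable.
EXAMPLE_METADATA = """SampleID,sampling_day,condition
sample_A,day1,control
sample_B,day2,treated
"""

def read_metadata(handle):
    """Return {SampleID: {column: value}} from an open metadata CSV."""
    reader = csv.DictReader(handle)
    # The documentation asks for the first column to be named SampleID.
    if not reader.fieldnames or reader.fieldnames[0] != "SampleID":
        raise ValueError("the first column must be named SampleID")
    return {
        row["SampleID"]: {k: v for k, v in row.items() if k != "SampleID"}
        for row in reader
    }

metadata = read_metadata(io.StringIO(EXAMPLE_METADATA))
print(metadata["sample_A"])
```

For a real file, replace the io.StringIO object with open(path, newline="").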

The loading function will also automatically take care of finding the union of all genes across datasets and adding zeros where the genes are not expressed.
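The idea of merging on the union of genes can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the pipeline's code: the data structure (a dictionary of gene to per-cell counts), the toy values and the function name merge_on_gene_union are all assumptions.

```python
def merge_on_gene_union(datasets):
    """Align datasets on the union of their genes, filling gaps with zeros.

    datasets: {dataset_name: {gene: [count per cell]}}
    Returns (sorted list of all genes, aligned datasets).
    """
    # Union of every gene seen in any dataset.
    all_genes = sorted(set().union(*(d.keys() for d in datasets.values())))
    merged = {}
    for name, counts in datasets.items():
        # Number of cells in this dataset (0 if the dataset is empty).
        n_cells = len(next(iter(counts.values()), []))
        # Genes missing from this dataset get a row of zeros.
        merged[name] = {g: counts.get(g, [0] * n_cells) for g in all_genes}
    return all_genes, merged

# Toy example: two datasets with partially overlapping genes.
toy = {
    "sample_A": {"Cd3e": [1, 0], "Actb": [5, 7]},
    "sample_B": {"Actb": [3, 2, 4], "Hbb": [9, 0, 1]},
}
genes, merged = merge_on_gene_union(toy)
print(genes)
print(merged["sample_A"]["Hbb"])
```

After merging, every dataset has the same gene rows, so they can be stacked into a single count matrix.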



Run Quality Control

Seurat_object_path

The PATH to the Seurat Object containing your datasets and metadata.

columns_metadata

The COLUMN NAMES of the metadata matrix to load into the objects. They will be treated as factor variables (not continuous).

species_use

Species of the sample, used for cell cycle scoring.

remove_non_coding

Removes all non-coding genes and pseudogenes from the data. Default is ‘True’.

plot_gene_family

Gene families to calculate and add to the metadata, comma separated. You can give them as patterns of genes or gene families. Some examples are: “RPS” (ribosomal), “RPL” (ribosomal), “MT-” (mitochondrial). Upper and lower case are ignored. Default is “Rps,Rpl,mt-,Hb”.

remove_gene_family

Gene families to remove from the count matrix after QC, comma separated. Gene names should start with the pattern. Upper and lower case are ignored. Default is “mt-”.
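As a rough illustration of how such patterns behave, the sketch below matches gene names by case-insensitive prefix, computes the percentage of counts per family for one cell, and then drops a family. Prefix matching for both parameters, the toy counts and the function names family_of and percent_per_family are assumptions of this example, not the pipeline's actual implementation.

```python
def family_of(gene, patterns):
    """Return the first pattern the gene name starts with (case ignored), else None."""
    lowered = gene.lower()
    for pattern in patterns:
        if lowered.startswith(pattern.lower()):
            return pattern
    return None

def percent_per_family(cell_counts, patterns):
    """Percentage of a cell's total counts that falls in each gene family."""
    total = sum(cell_counts.values())
    result = {pattern: 0.0 for pattern in patterns}
    if total == 0:
        return result
    for gene, count in cell_counts.items():
        family = family_of(gene, patterns)
        if family is not None:
            result[family] += 100.0 * count / total
    return result

# Toy cell with 100 counts in total (hypothetical values).
cell = {"RPS3": 10, "Rpl5": 10, "MT-CO1": 20, "Actb": 60}

plot_gene_family = "Rps,Rpl,mt-,Hb".split(",")
percentages = percent_per_family(cell, plot_gene_family)

remove_gene_family = "mt-".split(",")
kept = {g: c for g, c in cell.items() if family_of(g, remove_gene_family) is None}
print(percentages, kept)
```

Note that with plain prefix matching, a short pattern such as “Hb” would also catch unrelated genes whose names start with the same letters (for example Hbegf), so it is worth checking which genes a pattern selects.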

min_gene_count

Minimum number of cells needed to consider a gene as expressed. Default is 5.

min_gene_per_cell

Minimum number of genes in a cell needed to consider a cell as good quality. Default is 200.
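The two thresholds can be illustrated with a small sketch. It assumes that a gene counts as detected in a cell when its count is above zero, and that genes are filtered before cells; both points, as well as the function name filter_counts and the toy matrix, are assumptions of this example rather than a description of the pipeline's internals.

```python
def filter_counts(matrix, min_gene_count=5, min_gene_per_cell=200):
    """Filter a {cell: {gene: count}} matrix with the two QC thresholds."""
    # In how many cells is each gene detected (count > 0)?
    cells_per_gene = {}
    for counts in matrix.values():
        for gene, count in counts.items():
            if count > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    # Keep genes detected in at least min_gene_count cells.
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_gene_count}

    filtered = {}
    for cell, counts in matrix.items():
        detected = {g: c for g, c in counts.items() if c > 0 and g in kept_genes}
        # Keep cells with at least min_gene_per_cell detected genes.
        if len(detected) >= min_gene_per_cell:
            filtered[cell] = detected
    return filtered

# Toy matrix with small thresholds so the effect is visible.
toy = {
    "c1": {"g1": 1, "g2": 2, "g3": 0},
    "c2": {"g1": 3, "g2": 1},
    "c3": {"g1": 1},
}
result = filter_counts(toy, min_gene_count=2, min_gene_per_cell=2)
print(sorted(result))
```

Here g3 is never detected and is dropped, and c3 is removed because it has only one detected gene.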

assay

The ASSAY slot in the Seurat object to load the data. Default is “RNA”. You can add any name that describes the type of data in that assay: “CITE”, “ATAC”, “PROTEIN”, etc.

output_path

The PATH to the FOLDER to output your Seurat object.




Integrating datasets

Seurat_object_path

The PATH to the Seurat Object containing your datasets and metadata.

columns_metadata

The COLUMN NAMES of the metadata matrix to load into the objects. They will be treated as factor variables (not continuous).

regress

Variables of the metadata to be regressed out using a generalized linear model (GLM).

var_genes

Whether to use the ‘Seurat’ or the ‘Scran’ method for variable gene identification. An additional value can be placed after a comma to define the level of dispersion wanted for variable gene selection: ‘Seurat,2’ will use a threshold of 2 for gene dispersions. Default is ‘Seurat,1.5’. For Scran, the user should input the level of biological variance, e.g. ‘Scran,0.2’. An additional blocking parameter (a column from the metadata) can be supplied to the ‘Scran’ method to block variation coming from uninteresting factors, which can be passed as ‘Scran,0.2,Batch’.
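The comma-separated format can be read as method, threshold and optional blocking column. The sketch below shows one way to split such a value; the function name parse_var_genes and the choice to return None for omitted fields are assumptions of this example, and the pipeline may handle missing values differently.

```python
def parse_var_genes(value="Seurat,1.5"):
    """Split a var_genes value into method, threshold and blocking column."""
    parts = [part.strip() for part in value.split(",")]
    method = parts[0]
    if method.lower() not in ("seurat", "scran"):
        raise ValueError("method must be 'Seurat' or 'Scran'")
    # Second field: dispersion (Seurat) or biological variance (Scran) threshold.
    threshold = float(parts[1]) if len(parts) > 1 and parts[1] else None
    # Third field: metadata column used for blocking, described for Scran only.
    block = parts[2] if len(parts) > 2 and parts[2] else None
    if block is not None and method.lower() != "scran":
        raise ValueError("a blocking column is only described for the Scran method")
    return {"method": method, "threshold": threshold, "block": block}

default_setting = parse_var_genes()
scran_setting = parse_var_genes("Scran,0.2,Batch")
print(default_setting, scran_setting)
```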

integration_method

Integration method to be used. ‘CCA’, ‘MNN’, ‘Scale’ and ‘Combat’ are available at the moment. The batches (column names in the metadata matrix) to be removed should be provided as comma-separated arguments, e.g. ‘Combat,sampling_day’. For MNN, an additional integer parameter is supplied as the number of nearest neighbours (k).

cluster_use

The cluster to be used for analysis, passed as “METADATA_COLUMN_NAME,factor1,factor2”. This is useful if you want to subset your data and perform this step on only part of the dataset (a cell cluster, or a particular metadata parameter). Example: “SNN.0.4,1,2,3,4” will get clusters 1-4 from the metadata column “SNN.0.4”.
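To show what this selection amounts to, here is a minimal sketch that parses the value and keeps only the matching cells. The per-cell metadata dictionary, the cell names and the function names parse_cluster_use and subset_cells are hypothetical and only serve the illustration.

```python
def parse_cluster_use(value):
    """Split 'COLUMN,level1,level2,...' into the column name and a set of levels."""
    parts = [part.strip() for part in value.split(",")]
    return parts[0], set(parts[1:])

def subset_cells(cell_metadata, cluster_use):
    """Return the cells whose value in the chosen column is one of the levels."""
    column, levels = parse_cluster_use(cluster_use)
    # Levels are compared as text, so cluster 2 and "2" are treated alike.
    return [
        cell for cell, meta in cell_metadata.items()
        if column in meta and str(meta[column]) in levels
    ]

# Toy metadata: cluster assignments in the column "SNN.0.4".
cell_metadata = {
    "c1": {"SNN.0.4": "1"},
    "c2": {"SNN.0.4": "5"},
    "c3": {"SNN.0.4": 2},
}
selected = subset_cells(cell_metadata, "SNN.0.4,1,2,3,4")
print(selected)
```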

assay

The ASSAY slot in the Seurat object to load the data. Default is “RNA”. You can add any name that describes the type of data in that assay: “CITE”, “ATAC”, “PROTEIN”, etc.

output_path

The PATH to the FOLDER to output your Seurat object.




VDJ analysis