ELIXIR EXCELERATE
Created by: Jules Gilet
Three sequencing datasets and their corresponding cellranger qc summaries are downloaded from the 10x genomics website. They have been generated from healthy human peripheral blood mononuclear cells, with the 3’ v2 library chemistry (discontinued chemistry), the 3’ v3 chemistry (with an improved transcript capture rate within droplets) and a dataset combining transcript and protein quantification with barcoded antibodies (CITE-seq). CITE-seq downstream analyses with not be covered in this lab. The analysis of the different samples is meant to illustrate divergences that can occur across technical batches.
Except otherwise indicated, all the command are executed directly in the shell (bash).
Report summaries and quality control metrics generated by the cellranger pipeline are first downloaded from the 10x website.
mkdir cellranger_qc
cd cellrager_qc
wget -nH -nd http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v2/pbmc_1k_v2_web_summary.html \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_web_summary.html \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_web_summary.html
wget -nH -nd http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v2/pbmc_1k_v2_metrics_summary.csv \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_metrics_summary.csv \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_metrics_summary.csv
The reports are merged in R for an easier comparison.
# R code
v2 <- read.csv("pbmc_1k_v2_metrics_summary.csv")
v3 <- read.csv("pbmc_1k_v3_metrics_summary.csv")
p3 <- read.csv("pbmc_1k_protein_v3_metrics_summary.csv")
res <- data.frame(v2=t(v2), v3=t(v3), p3=t(p3[1:20]))
write.csv(res, 'qc_summaries.csv')
res
v2 v3 p3
Estimated.Number.of.Cells 996 1,222 713
Mean.Reads.per.Cell 77,230 54,502 63,617
Median.Genes.per.Cell 1,078 1,919 1,930
Number.of.Reads 76,921,082 66,601,887 45,359,427
Valid.Barcodes 96.6% 97.4% 97.1%
Sequencing.Saturation 90.1% 70.8% 73.3%
Q30.Bases.in.Barcode 94.0% 94.1% 95.0%
Q30.Bases.in.RNA.Read 88.3% 90.2% 90.3%
Q30.Bases.in.Sample.Index 90.1% 91.1% 90.8%
Q30.Bases.in.UMI 92.8% 92.7% 93.3%
Reads.Mapped.to.Genome 92.7% 95.4% 94.9%
Reads.Mapped.Confidently.to.Genome 90.1% 92.4% 91.6%
Reads.Mapped.Confidently.to.Intergenic.Regions 3.6% 4.8% 5.5%
Reads.Mapped.Confidently.to.Intronic.Regions 22.3% 31.1% 31.8%
Reads.Mapped.Confidently.to.Exonic.Regions 64.1% 56.5% 54.4%
Reads.Mapped.Confidently.to.Transcriptome 60.8% 53.7% 51.3%
Reads.Mapped.Antisense.to.Gene 1.0% 1.0% 1.1%
Fraction.Reads.in.Cells 96.1% 94.9% 87.6%
Total.Genes.Detected 16,063 18,391 17,467
Median.UMI.Counts.per.Cell 3,358 6,628 6,416
About 1 k single cells have been captured per sample, with a similar sequencing depth per cell (~ 50 k RPC to 75 k RPC). However, the samples processed with the V3 chemistry allows the detection of more genes as compared to the v2 (almost 2 times), even with less reads per cell.
Reads mapped to intergenic regions, corresponding to genomic DNA contamination, should typically count for less than 5 %. Reads mapping to intronic regions correspond to detection of unspliced pre-mRNA undergoing maturation, or the consequence of active mechanisms of intron retention. Transcripts containing intronic regions can be used to infer transcriptional dynamics with algorithms such as velocyto (that will be covered on the last day of the workshop).
Typically, 50-60 % of the reads are expected to map to the transcriptome in 10x experiments.
The fastq files provided by 10x genomics are downloaded for quality control and transcript quantification.
cd ..
wget -nH -nd http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v2/pbmc_1k_v2_fastqs.tar \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar \
http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_fastqs.tar
Extraction of the fastq files from the archives.
for FILE in *.tar
do tar -xf ${FILE}
done
rm *.tar
Note that the fastqs containing the 10x index sequences (sample_I1_SX_L00X.fastq.gz) are optional after demultiplexing and sample attribution. Due to the costs, in most of the cases the library indexes are not used to distinguish samples (one sample is processed and sequenced at a time, except on large volume sequencers as Illumina’s NovaSeq).
zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_I1_001.fastq.gz | head -n 24
@A00228:279:HFWFVDMXX:1:1101:5954:1000 1:N:0:NACTCGGA
NACTCGGA
+
#FFFFFFF
@A00228:279:HFWFVDMXX:1:1101:6768:1000 1:N:0:NACTCGGA
NACTCGGA
+
#FFFFFFF
@A00228:279:HFWFVDMXX:1:1101:18430:1000 1:N:0:NACTCGGA
NACTCGGA
+
#FFFFFFF
@A00228:279:HFWFVDMXX:1:1101:24035:1000 1:N:0:NACTCGGA
NACTCGGA
+
#FFFFFFF
@A00228:279:HFWFVDMXX:1:1101:13612:1016 1:N:0:CACTCGGA
CACTCGGA
+
FFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:13720:1016 1:N:0:CACTCGGA
CACTCGGA
+
FFFFFFFF
zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_I1_001.fastq.gz | tail -n 24
@A00228:279:HFWFVDMXX:1:2488:32750:37043 1:N:0:ATGCTCCG
ATGCTCCG
+
FF:FF:FF
@A00228:279:HFWFVDMXX:1:2488:5873:37059 1:N:0:ATGCTCCG
ATGCTCCG
+
FFFFFFFF
@A00228:279:HFWFVDMXX:1:2488:8657:37059 1:N:0:ATGCTCCG
ATGCTCCG
+
FFFFFFFF
@A00228:279:HFWFVDMXX:1:2488:11080:37059 1:N:0:ATGCTCCG
ATGCTCCG
+
F:FFFFFF
@A00228:279:HFWFVDMXX:1:2488:14118:37059 1:N:0:ATGCTCCG
ATGCTCCG
+
F:FFFFFF
@A00228:279:HFWFVDMXX:1:2488:16703:37059 1:N:0:ATGCTCCG
ATGCTCCG
+
FFFFFFFF
For clarity and easier manipulation, the fastq files are merged.
mkdir fastqs
# concatenation of the v2.1k fastqs
cat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L00*_R1_001.fastq.gz > fastqs/pbmc_1k_v2_R1.fastq.gz
cat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L00*_R2_001.fastq.gz > fastqs/pbmc_1k_v2_R2.fastq.gz
# concatenation of the v3.1k fastqs
cat pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L00*_R1_001.fastq.gz > fastqs/pbmc_1k_v3_R1.fastq.gz
cat pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L00*_R2_001.fastq.gz > fastqs/pbmc_1k_v3_R2.fastq.gz
# concatenation of the p3.1k fastqs
cat pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_gex_fastqs/pbmc_1k_protein_v3_gex_S1_L00*_R1_001.fastq.gz > fastqs/pbmc_1k_p3_R1.fastq.gz
cat pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_gex_fastqs/pbmc_1k_protein_v3_gex_S1_L00*_R2_001.fastq.gz > fastqs/pbmc_1k_p3_R2.fastq.gz
The number of reads after the merge of the fastq files are consistent with the cellranger summaries.
NLINES=$(zcat fastqs/pbmc_1k_v2_R1.fastq.gz | wc -l)
expr $NLINES / 4
76921082
76.9 M reads for sample pbmc_1k_v2, that is consistent with the cellranger websummary generated by cellranger. We have similar results for the two other samples.
NLINES=$(zcat fastqs/pbmc_1k_v3_R1.fastq.gz | wc -l)
expr $NLINES / 4
66601887
NLINES=$(zcat fastqs/pbmc_1k_p3_R1.fastq.gz | wc -l)
expr $NLINES / 4
45359427
Some cleaning
rm -r pbmc_1k_v2_fastqs pbmc_1k_v3_fastqs pbmc_1k_protein_v3_fastqs
We use the Ikura pipeline for further QC, quality-based trimming and transcript quantification.
mkdir ikura
cd ikura
ikura -1 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_v2_R1.fastq.gz -2 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_v2_R2.fastq.gz -n 1500 -o pbmc_1k_v2 -i /storage1/annotations/ikura/human/genecode_GRCh38p12/ikuraIndex_gencodeGRCh38p12_release29
ikura -1 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_v3_R1.fastq.gz -2 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_v3_R2.fastq.gz -n 1500 -o pbmc_1k_v3 -i /storage1/annotations/ikura/human/genecode_GRCh38p12/ikuraIndex_gencodeGRCh38p12_release29
ikura -1 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_p3_R1.fastq.gz -2 /storage1/courses/ELIXIR/workOnCourse/fastqs/pbmc_1k_p3_R2.fastq.gz -n 1500 -o pbmc_1k_p3 -i /storage1/annotations/ikura/human/genecode_GRCh38p12/ikuraIndex_gencodeGRCh38p12_release29
Note that R1 from the v2 sample provided by 10x is longer than necessary (28 nt). R1 has a 16 nt cell barcode and a 10 nt UMI barcode, according to the corresponding 10x technical note. The two last nucleotides in R1 are composed of T in more that 98 %, indicating that remaining nucleotides likely come from the poly(dT) tail. Ikura will auto-detect these fastq files as coming from the v3 chemistry and will still execute normally without any impact on the quality of the process.
The whole process takes 47 to 74 minutes per sample (8 cores, 4 GB of RAM).
#v2
[...]
total execution time (min): 74
# v3
[...]
total execution time (min): 67
#p3
[...]
total execution time (min): 47
Based on stringent quality filters, typically 8-15 % of the sequenced reads can contains ambiguous nucleotides (low base calling score, PHRED score < 30) that can be safely discarded from the analyses. The fraction of low quality reads can also be estimated from the cellranger web summaries with the ‘Q30.Bases.in.Barcode’ and ‘Q30.Bases.in.RNA.Read’ stats. In case of any technical issue during the sequencing itself, the fraction of poor quality reads can dramatically go up.
The percent of reads aligned to the reference transcriptome with salmon engine is very similar to that of STAR (the mapping engine used by cellranger). Compare the ‘Reads.Mapped.Confidently.to.Transcriptome’ for the v2 sample. Typically 50-60 % of the reads are expected to align on the (spliced) transcriptome, and 15-25 % to intronic regions.
The call calling (identification of droplets containing true cells) is based on the density of reads per droplet.
On the density histogram, droplets with high UMI content correspond to true cells, while droplets with poor UMI content (left side of the density plot) likely correspond to the capture of cellular debris or ambient mRNA, and should be discarded.
The number of true cells identified with Ikura can be found with:
grep "Selected" ikura/pbmc_1k_v2/cell_calling/filteringPlot_pbmc_1k_v2_cell_thresholds.tsv
## 1015 Selected
grep "Selected" ikura/pbmc_1k_v3/cell_calling/filteringPlot_pbmc_1k_v3_cell_thresholds.tsv
## 1179 Selected
grep "Selected" ikura/pbmc_1k_p3/cell_calling/filteringPlot_pbmc_1k_p3_cell_thresholds.tsv
## 552 Selected
The fraction of duplicated reads can be used to estimate if the sequencing depth is appropriate. In a given experiment, if most (> 99 %) of the UMIs are sequenced twice, it indicates that almost all of the existing transcripts are read: a deeper sequencing is useless as the same UMIs will be read again and again. By contrast, if most of the UMIs are read only once, the probability of remaining unsequenced transcripts in the library is high. Depending of the application, a complementary sequencing can be recommended to detect lowly expressed genes (eg. cytokines). The sequencing saturation (as defined by 10x) or the level of duplicated reads is espected to be at last 60-80 % in most applications.
As a comparison, the level of duplicated reads is lower in the p3 sample: a complementary sequencing can allow the detection of additional lowly expressed genes. However, the number of detected genes per cell is twice higher than the v2 chemistry and enough for a proper classification of the cells.
Note that in case of important variations in the sequencing depth between samples (eg. a delta of 25 % or more in the level of duplicated reads), the difference in term of gene sensitivity can lead to very strong batch effects (some genes can only be detected in the context of a deep sequencing). Subsampling of the data generally perform very well in the resolving of such batch effects.
The following section is a demonstration of the compatibility of ikura output with the second part of the QC lab.
# R code
# The fastest way to load the data is
mat <- readRDS('pbmc_1k_v3/outs/ikuraEmat.rds')
# or alternatively
mat2 <- Read10X('pbmc_1k_v3/outs')
dim(mat)
[1] 19870 1179
dim(mat2)
[1] 19870 1179
# seurat object can then be created with
seu <- CreateSeuratObject(mat, min.cells=5, min.features=500)
Validation of the integration of the data with the second part of the qc course.
# R code
v3.1k <- Read10X("pbmc_1k_v3/outs")
v2.1k <- Read10X("pbmc_1k_v2/outs")
p3.1k <- Read10X("pbmc_1k_p3/outs")
sdata.v2.1k <- CreateSeuratObject(v2.1k, project = "v2.1k")
sdata.v3.1k <- CreateSeuratObject(v3.1k, project = "v3.1k")
sdata.p3.1k <- CreateSeuratObject(p3.1k, project = "p3.1k")
alldata <- merge(sdata.v2.1k, c(sdata.v3.1k,sdata.p3.1k), add.cell.ids=c("v2.1k","v3.1k","p3.1k"))
chemistry <- rep("v3",ncol(alldata))
chemistry[Idents(alldata) == "v2.1k"] <- "v2"
alldata <- AddMetaData(alldata, chemistry, col.name = "Chemistry")
alldata
## An object of class Seurat
## 19870 features across 2746 samples within 1 assay
## Active assay: RNA (19870 features)
table(Idents(alldata))
## p3.1k v2.1k v3.1k
## 552 1015 1179
mt.genes <- rownames(alldata)[grep("^MT-",rownames(alldata))]
C<-GetAssayData(object = alldata, slot = "counts")
percent.mito <- Matrix::colSums(C[mt.genes,])/Matrix::colSums(C)*100
alldata <- AddMetaData(alldata, percent.mito, col.name = "percent.mito")
rb.genes <- rownames(alldata)[grep("^RP[SL]",rownames(alldata))]
percent.ribo <- Matrix::colSums(C[rb.genes,])/Matrix::colSums(C)*100
alldata <- AddMetaData(alldata, percent.ribo, col.name = "percent.ribo")
selected <- WhichCells(alldata, expression = percent.mito < 25)
length(selected)
## [1] 2656
data.filt <- subset(alldata, cells = selected)
high.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA > 4100)
high.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA > 2000 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(high.det.v2,high.det.v3)))
# check number of cells
ncol(data.filt)
## [1] 2570
low.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA < 1000 & orig.ident != "v2.1k")
low.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA < 500 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(low.det.v2,low.det.v3)))
# check number of cells
ncol(data.filt)
## [1] 2493
We can compare the number of cell before and after quality filtering.
table(Idents(alldata))
# ikura - prior filtering
## p3.1k v2.1k v3.1k
## 552 1015 1179
table(Idents(data.filt))
# ikura - after filtering
## p3.1k v2.1k v3.1k
## 506 922 1065
After cell calling, Ikura generate an expression matrix with less cells as compared to cellranger. However, after appropriate quality filters in downstream analysis, the number of cells of good quality are equivalent. Thus Ikura is more stringeant and permforms better in the robust identification of cell that will be easier to identify and classify.
As a point of comparison, here is the number of cells prior and after filtering of expression matrices generated by cellranger (downloaded from 10x website).
# R code
# cellranger - prior filtering
## p3.1k v2.1k v3.1k
## 713 996 1222
# cellranger - after filtering
## p3.1k v2.1k v3.1k
## 526 933 1072
The batch effect due to both higher ribosomal content and differences between the v2/v3 chemistries is still visible, all other comments are valid. Basic analysis for data visualization.
# R code, basic Seurat processing for data visualization
seu <- data.filt
seu <- NormalizeData(seu)
seu <- ScaleData(seu, features=rownames(seu))
seu <- RunPCA(seu, features=rownames(seu))
seu <- FindNeighbors(seu, dim)
seu <- FindNeighbors(seu, dims=1:10)
seu <- FindClusters(seu, resolution=0.5)
seu <- RunTSNE(seu, reduction='pca', dims=1:10)
seu <- RunUMAP(seu, reduction='pca', dims=1:10)
DimPlot(seu, reduction='tsne', group.by='orig.ident')
VlnPlot(seu, 'percent.ribo', group.by='orig.ident')