Conesa, Ana, et al. "A survey of best practices for RNA-seq data analysis." Genome biology 17.1 (2016): 13
RnaSeqSampleSize (Power analysis), Scotty (Power analysis with cost)
Busby, Michele A., et al. "Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression." Bioinformatics 29.5 (2013): 656-657
Marioni, John C., et al. "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays." Genome research (2008)
Hart, S. N., Therneau, T. M., Zhang, Y., Poland, G. A., & Kocher, J. P. (2013). Calculating sample size estimates for RNA sequencing data. Journal of computational biology, 20(12), 970-978.
Schurch, Nicholas J., et al. "How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?" RNA (2016)
Zhao, Shilin, et al. "RnaSeqSampleSize: real data based sample size estimation for RNA sequencing." BMC bioinformatics 19.1 (2018): 191
Romero, Irene Gallego, et al. "RNA-seq: impact of RNA degradation on transcript quantification." BMC biology 12.1 (2014): 42
Kim, Young-Kook, et al. "Short structured RNAs with low GC content are selectively lost during extraction from a small number of cells." Molecular cell 46.6 (2012): 893-895
Zhao, Shanrong, et al. "Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap." BMC genomics 16.1 (2015): 675
Levin, Joshua Z., et al. "Comprehensive comparative analysis of strand-specific RNA sequencing methods." Nature methods 7.9 (2010): 709
Chhangawala, Sagar, et al. "The impact of read length on quantification of differentially expressed genes and splice junction detection." Genome biology 16.1 (2015): 131
Corley, Susan M., et al. "Differentially expressed genes from RNA-Seq and functional enrichment results are affected by the choice of single-end versus paired-end reads and stranded versus non-stranded protocols." BMC genomics 18.1 (2017): 399
Liu, Yuwen, Jie Zhou, and Kevin P. White. "RNA-seq differential expression studies: more sequence or more replication?." Bioinformatics 30.3 (2013): 301-304
Comparison of PE and SE for RNA-Seq, SciLifeLab
Trinity, SOAPdenovo-Trans, Oases, rnaSPAdes
Hsieh, Ping-Han, et al. "Effect of de novo transcriptome assembly on transcript quantification." bioRxiv (2018): 380998
Wang, Sufang, and Michael Gribskov. "Comprehensive evaluation of de novo transcriptome assembly programs and their effects on differential gene expression analysis." Bioinformatics 33.3 (2017): 327-333
https://sequencing.qcfail.com/
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per sequence GC content
Sequence duplication level
Adapter content
STAR, HISAT2, GSNAP, Novoalign (commercial)
Baruzzo, Giacomo, et al. "Simulation-based comprehensive benchmarking of RNA-seq aligners." Nature methods 14.2 (2017): 135
Program | Time_Min | Memory_GB |
---|---|---|
HISATx1 | 22.7 | 4.3 |
HISATx2 | 47.7 | 4.3 |
HISAT | 26.7 | 4.3 |
STAR | 25 | 28 |
STARx2 | 50.5 | 28 |
GSNAP | 291.9 | 20.2 |
TopHat2 | 1170 | 4.3 |
Baruzzo, Giacomo, et al. "Simulation-based comprehensive benchmarking of RNA-seq aligners." Nature methods 14.2 (2017): 135
@ST-E00274:179:HHYMLALXX:8:1101:1641:1309 1:N:0:NGATGT
NCATCGTGGTATTTGCACATCTTTTCTTATCAAATAAAAAGTTTAACCTACTCAGTTATGCGCATACGTTTTTTGATGGCATTTCCATAAACCGATTTTTTTTTTATGCACGTACCCAAAACGTGCAGAAAAATACGCTGCTAGAAATGTA
+
#AAAFAFA<-AFFJJJAFA-FFJJJJFFFAJJJJ-<FFJJJ-A-F-7--FA7F7-----FFFJFA<FFFFJ<AJ--FF-A<A-<JJ-7-7-<FF-FFFJAFFAA--A--7FJ-7----77-A--7F7)---7F-A----7)7-----7<<-
@instrument:runid:flowcellid:lane:tile:xpos:ypos read:isfiltered:controlnumber:sampleid
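As a minimal sketch, the header layout above can be split into named fields in Python. The field names are taken from the layout line; the function names `parse_fastq_header` and `phred33` are illustrative, and the quality decoding assumes the Phred+33 encoding used by current Illumina instruments.

```python
# Field names follow the header layout shown above.
FIELDS = (
    "instrument", "runid", "flowcellid", "lane", "tile", "xpos", "ypos",
    "read", "isfiltered", "controlnumber", "sampleid",
)


def parse_fastq_header(header):
    """Split a FASTQ header line into a dict of named fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    # The two halves of the header are separated by a single space.
    left, right = header[1:].split(" ", 1)
    values = left.split(":") + right.split(":")
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of header fields")
    return dict(zip(FIELDS, values))


def phred33(quality):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality]


header = "@ST-E00274:179:HHYMLALXX:8:1101:1641:1309 1:N:0:NGATGT"
fields = parse_fastq_header(header)
print(fields["flowcellid"], fields["lane"], fields["sampleid"])

# First ten quality characters of the example record.
print(phred33("#AAAFAFA<-"))
```

The leading `#` decodes to a Phred score of 2, which matches the uncalled `N` at the first base of the read.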
>1 dna:chromosome chromosome:GRCz10:1:1:58871917:1 REF
GATCTTAAACATTTATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTCCCCTC
CAAATTAAATTTAGCCAGAGGCGCACAACATACGACCTCTAAAAAAGGTGCTGTAACATG
#!genome-build GRCz10
#!genebuild-last-updated 2016-11
4 ensembl_havana gene 6732 52059 . - . gene_id "ENSDARG00000104632"; gene_version "2"; gene_name "rerg"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTDARG00000044080"; havana_gene_version "1";
seq source feature start end score strand frame attribute
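A short sketch of how the nine GTF columns might be parsed in Python. The column names come from the layout line above; `parse_gtf_line` is an illustrative name, and the example line is re-assembled here with tab separators (an assumption, since the slide text shows spaces) and a shortened attribute list.

```python
import re

# Column names follow the layout line above.
GTF_COLUMNS = (
    "seq", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
)


def parse_gtf_line(line):
    """Parse one tab-delimited GTF feature line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are written as: key "value"; key "value"; ...
    record["attribute"] = dict(
        re.findall(r'(\S+) "([^"]*)";', record["attribute"])
    )
    return record


line = "\t".join([
    "4", "ensembl_havana", "gene", "6732", "52059", ".", "-", ".",
    'gene_id "ENSDARG00000104632"; gene_version "2"; gene_name "rerg"; '
    'gene_biotype "protein_coding";',
])
gene = parse_gtf_line(line)

# GTF coordinates are 1-based and inclusive on both ends.
length = gene["end"] - gene["start"] + 1
print(gene["attribute"]["gene_name"], gene["strand"], length)
```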
ST-E00274:188:H3JWNCCXY:4:1102:32431:49900 163 1 1 60 8S139M4S = 385 535 TATTTAGAGATCTTAAACATCCATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTTCCCTCCAAATTAAATTTAGCCAGAGGCGCACAACATACGACCTCTAAAAAAGGTGCTGGAACATGTACCTATATGCAGCACCACCATC AAAFAFFAFFFFJ7FFFFJ<JAFA7F-<AJ7JJ<FFFJ--<FAJF<7<7FAFJ-<AFA<-JJJ-AF-AJ-FF<F--A<FF<-7777-7JA-77A---F-7AAFF-FJA--77FJ<--77)))7<JJA<J77<-------<7--))7)))7- NM:i:4 MD:Z:12T0T40C58T25 AS:i:119 XS:i:102 XA:Z:17,-53287490,4S33M4D114M,11; MQ:i:60 MC:Z:151M RG:Z:ST-E00274_188_H3JWNCCXY_4
query flag ref pos mapq cigar mrnm mpos tlen seq qual opt
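The flag and CIGAR fields are the least readable parts of a SAM record. The sketch below decodes both for the example alignment above. Function names are illustrative; the sequence and quality fields are replaced with `*` and only two optional tags are kept for brevity, and the line is split on whitespace because the slide text does not preserve the original tabs.

```python
import re

# Mandatory column names follow the layout line above; "opt" holds the rest.
SAM_COLUMNS = (
    "query", "flag", "ref", "pos", "mapq", "cigar",
    "mrnm", "mpos", "tlen", "seq", "qual",
)

# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse", 0x20: "mate_reverse",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into named mandatory fields plus tags."""
    values = line.split()
    record = dict(zip(SAM_COLUMNS, values[:11]))
    for key in ("flag", "pos", "mapq", "mpos", "tlen"):
        record[key] = int(record[key])
    record["opt"] = values[11:]
    return record


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_ops(cigar):
    """Turn a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


line = ("ST-E00274:188:H3JWNCCXY:4:1102:32431:49900 163 1 1 60 "
        "8S139M4S = 385 535 * * NM:i:4 AS:i:119")
aln = parse_sam_line(line)
ops = cigar_ops(aln["cigar"])

# M, I, S, = and X consume bases of the read; S marks soft-clipped bases.
read_length = sum(n for n, op in ops if op in "MIS=X")
soft_clipped = sum(n for n, op in ops if op == "S")

print(decode_flag(aln["flag"]))
print(read_length, soft_clipped)
```

Flag 163 is 1 + 2 + 32 + 128, so the read is paired, properly paired, has its mate on the reverse strand, and is the second read in the pair.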
Never store alignment files in raw SAM format. Always compress them!
Format | Size_GB |
---|---|
SAM | 7.4 |
BAM | 1.9 |
CRAM lossless Q | 1.4 |
CRAM 8 bins Q | 0.8 |
CRAM no Q | 0.26 |
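To put the table in relative terms, the following sketch expresses each format as a percentage of the raw SAM size. The sizes are copied from the table; the helper name `percent_of_sam` is illustrative.

```python
# Sizes in GB, copied from the table above.
sizes_gb = {
    "SAM": 7.4,
    "BAM": 1.9,
    "CRAM lossless Q": 1.4,
    "CRAM 8 bins Q": 0.8,
    "CRAM no Q": 0.26,
}


def percent_of_sam(sizes):
    """Express every format's size as a percentage of the SAM size."""
    sam = sizes["SAM"]
    return {fmt: round(100 * size / sam, 1) for fmt, size in sizes.items()}


relative = percent_of_sam(sizes_gb)
for fmt, pct in relative.items():
    print(f"{fmt:<16} {pct:5.1f}% of SAM")
```

For this file, BAM is roughly a quarter of the SAM size, and CRAM without quality scores is under 4%.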
SAM file format
tview
samtools tview alignment.bam genome.fasta
STAR (final log file), samtools > stats, bamtools > stats, QoRTs, RSeQC, Qualimap
MultiQC can be used to summarise and plot STAR log files.
QoRTs was run on all samples and summarised using MultiQC.
Soft clipping
Gene body coverage
Insert size
Saturation curve
PCR duplicates
Multi-mapping
Fu, Yu, et al. "Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers." BMC genomics 19.1 (2018): 531
Parekh, Swati, et al. "The impact of amplification on differential expression analyses by RNA-seq." Scientific reports 6 (2016): 25533
Klepikova, Anna V., et al. "Effect of method of deduplication on estimation of differential gene expression using RNA-seq." PeerJ 5 (2017): e3091
Kallisto, Salmon
tximport()
> gene-counts
RSEM, Kallisto, Salmon, Cufflinks2
Soneson, Charlotte, et al. "Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences." F1000Research 4 (2015)
Zhang, Chi, et al. "Evaluation and comparison of computational tools for RNA-seq isoform quantification." BMC genomics 18.1 (2017): 583
ENSG00000000003 140 242 188 143 287 344 438 280 253
ENSG00000000005 0 0 0 0 0 0 0 0 0
ENSG00000000419 69 98 77 55 52 94 116 79 69
ENSG00000000457 56 75 104 79 157 205 183 178 153
ENSG00000000460 33 27 23 19 27 42 69 44 40
ENSG00000000938 7 38 13 17 35 76 53 37 24
ENSG00000000971 545 878 694 636 647 216 492 798 323
ENSG00000001036 79 154 74 80 128 167 220 147 72
Teng, Mingxiang, et al. "A benchmark for RNA-seq quantification pipelines." Genome biology 17.1 (2016): 74
Dillies, Marie-Agnes, et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." Briefings in bioinformatics 14.6 (2013): 671-683
Evans, Ciaran, Johanna Hardin, and Daniel M. Stoebel. "Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions." Briefings in bioinformatics (2017)
Wagner, Gunter P., Koryu Kin, and Vincent J. Lynch. "Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples." Theory in biosciences 131.4 (2012): 281-285
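The inconsistency of RPKM between samples can be illustrated with a toy calculation. This is a sketch on hypothetical data: the gene lengths and counts below are invented for illustration and do not come from the course material. TPM sums to one million in every sample, while the RPKM total depends on the sample's composition.

```python
import math


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads for one sample."""
    total_millions = sum(counts) / 1e6
    return [c / (l / 1e3) / total_millions for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Length-normalise first, then scale so the sample sums to one million.
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


# Hypothetical gene lengths (bp) and read counts for two samples.
lengths = [1000, 2000, 4000]
sample_a = [100, 200, 400]
sample_b = [300, 200, 100]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name,
          "RPKM total:", round(sum(rpkm(counts, lengths))),
          "TPM total:", round(sum(tpm(counts, lengths))))
```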
Liu, Qian, and Marianthi Markatou. "Evaluation of methods in removing batch effects on RNA-seq data." Infectious Diseases and Translational Medicine 2.1 (2016): 3-9
Manimaran, Solaiappan, et al. "BatchQC: interactive software for evaluating sample and batch effects in genomic data." Bioinformatics 32.24 (2016): 3836-3838
~age+condition
estimateSizeFactors()
estimateDispersions()
nbinomWaldTest()
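By default, `estimateSizeFactors()` uses the median-of-ratios method. The sketch below is a plain-Python illustration of that idea, not DESeq2's own code: `size_factors` is an illustrative name, and the input is the first three samples of the eight-gene count table shown earlier, far too few genes for real use.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each a list of per-sample counts.
    """
    n_samples = len(counts[0])
    # Genes with a zero count in any sample have no usable geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# First three samples of the count table shown earlier.
counts = [
    [140, 242, 188],
    [0, 0, 0],
    [69, 98, 77],
    [56, 75, 104],
    [33, 27, 23],
    [7, 38, 13],
    [545, 878, 694],
    [79, 154, 74],
]
print([round(f, 3) for f in size_factors(counts)])
```

Dividing each sample's counts by its size factor puts the samples on a common scale before dispersions are estimated.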
Seyednasrollah, Fatemeh, et al. "Comparison of software packages for detecting differential expression in RNA-seq studies." Briefings in bioinformatics 16.1 (2013): 59-70
results()
log2 fold change (MLE): type type2 vs control
Wald test p-value: type type2 vs control
DataFrame with 1 row and 6 columns
                        baseMean     log2FoldChange             lfcSE
                       <numeric>          <numeric>         <numeric>
ENSG00000000003 242.307796723287 -0.932926089608546 0.114285150312285
                             stat               pvalue                 padj
                        <numeric>            <numeric>            <numeric>
ENSG00000000003 -8.16314356729037 3.26416150242775e-16 1.36240609998527e-14
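The `stat` and `pvalue` columns in this output can be reproduced by hand. The Wald statistic is the log2 fold change divided by its standard error, and the p-value is the two-sided tail probability of a standard normal. The sketch below uses the numbers printed above; `wald_test` is an illustrative name.

```python
import math

# Values copied from the results() output above.
log2_fold_change = -0.932926089608546
lfc_se = 0.114285150312285


def wald_test(estimate, standard_error):
    """Return the Wald statistic and its two-sided normal p-value."""
    z = estimate / standard_error
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


stat, pvalue = wald_test(log2_fold_change, lfc_se)
print(stat, pvalue)
```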
summary()
out of 17889 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4526, 25%
LFC < 0 (down)     : 5062, 28%
outliers [1]       : 25, 0.14%
low counts [2]     : 0, 0%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
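The adjusted p-values used for the 0.1 cutoff are, by default, Benjamini-Hochberg corrected. A minimal sketch of that correction follows. The function name and the four p-values are hypothetical, and DESeq2 additionally applies independent filtering before adjusting, which is not shown here.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
print([round(p, 4) for p in padj])
print(sum(p < 0.1 for p in padj), "genes below the 0.1 cutoff")
```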
plotMA()
plotCounts()
DAVID, clusterProfiler, ClueGO, ErmineJ, pathview
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
OS: Ubuntu 18.04.5 LTS
Built on : 25-May-2021 at 13:35:25
2021 • SciLifeLab • NBIS
Main exercise
Bonus exercises
Data: /sw/courses/ngsintro/rnaseq/
Work: /proj/gXXXXXXX/nobackup/<user>/rnaseq/
/sw/courses/ngsintro/rnaseq/
rnaseq/
+-- bonus/
|   +-- assembly/
|   +-- exon/
|   +-- funannot/
|   +-- plots/
+-- documents/
+-- main/
    +-- 1_raw/
    +-- 2_fastqc/
    +-- 3_mapping/
    +-- 4_qualimap/
    +-- 5_dge/
    +-- 6_multiqc/
    +-- reference/
    |   +-- mouse_chr19_hisat2/
    +-- scripts/
/proj/gXXXX/nobackup/<user>/
[user]/rnaseq/
+-- 1_raw/
+-- 2_fastqc/
+-- 3_mapping/
+-- 4_qualimap/
+-- 5_dge/
+-- 6_multiqc/
+-- reference/
|   +-- mouse_chr19_hisat2/
+-- scripts/
+-- funannot/
+-- plots/