This workflow runs transcript-level quantification using Kallisto followed by differential expression analysis using Sleuth.

1 Kallisto

Kallisto is an “alignment-free” RNA-seq quantification method that runs very fast with a small memory footprint, so it can be run on most laptops. It is a command-line program that can be downloaded as binary executables for Linux or Mac, or as source code. For a first introduction to the program, read here, and for the published article, see here. There is also a preprint here.

Kallisto is geared towards quantification on the transcript (isoform) level rather than the gene level (although the latter can also be obtained by post-processing Kallisto output). However, read assignment to transcript isoforms cannot in general be done unambiguously, so there is an intrinsic “quantification noise” or variability in this process. Kallisto can therefore be run either in a single pass (which is very fast) or in “bootstrap” mode (which takes longer, but can be parallelized over several processors) in order to get uncertainty estimates for the expression levels - a kind of error bar for the quantification process. Running with bootstraps is mandatory if you want to perform differential expression analysis of isoforms with Sleuth (see below).

Kallisto is primarily meant for quantification against an existing set of transcript sequences (a FASTA file), that is, it does not perform transcript assembly and it cannot quantify the expression of novel transcripts that are not in the transcript index you provide to it. That said, you can of course use contigs from an assembly produced with some other program in your Kallisto index. It would also be possible to use the software for, e.g., metagenomic or metatranscriptomic quantification.

1.1 Running Kallisto

To load the Kallisto module on UPPMAX, execute

sh
module load bioinfo-tools
module load kallisto/0.43.1

Now we will download and merge the human cDNA and ncRNA FASTA files from Ensembl in order to build a Kallisto transcript index. Note that we can concatenate the two gzipped files without unpacking them first. We use both the protein-coding transcripts and the non-coding ones to be able to capture more of the transcriptome.

sh
wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.rna.fa.gz

Now we can build the transcriptome index.

sh
kallisto index -i hsGRCh38_kallisto Homo_sapiens.GRCh38.rna.fa.gz

It should take less than 10 minutes.
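
To check that the index was built properly, you can (depending on your Kallisto version) print a short summary of it with kallisto inspect:

sh
# Prints summary information (k-mer length, number of targets, etc.) about the index
kallisto inspect hsGRCh38_kallisto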

Next, copy the FASTQ files from this UPPMAX path: /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/diffExp/FASTQ/. When that is done, it's time to quantify the FASTQ files against our Kallisto index with bootstrapping (for later use in Sleuth). You could do that one sample at a time, with a command like

sh
kallisto quant -i hsGRCh38_kallisto -t 4 -b 100 7_111116_AD0341ACXX_137_1_index1_1.fastq.gz 7_111116_AD0341ACXX_137_1_index1_2.fastq.gz -o sample1

or you can run all samples in a bash loop:

sh
for i in {1..12};
  do
    kallisto quant -i hsGRCh38_kallisto -t 4 -b 100 7_111116_AD0341ACXX_137_${i}_index${i}_1.fastq.gz 7_111116_AD0341ACXX_137_${i}_index${i}_2.fastq.gz -o sample${i};
  done

In this example we use -t 4 so that Kallisto can use up to four threads in the bootstrapping. You may want to adjust this value to the machine you are working on. If you just wanted to run Kallisto without bootstraps and get expression values for a pair of FASTQ files, you would run something like

sh
kallisto quant -i hsGRCh38_kallisto sample1_reads1.fastq sample1_reads2.fastq -o sample1

Running Kallisto on all the 12 samples with 100 bootstraps may take an hour or so, depending on your machine and settings.
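
If you are working on UPPMAX, you may prefer to submit the quantification as a SLURM batch job instead of running it interactively. Below is a minimal batch script sketch; the project account (snicXXXX-XX-XX) is a placeholder you need to replace with your own, and the core and time requests are rough guesses:

sh
#!/bin/bash
#SBATCH -A snicXXXX-XX-XX   # placeholder project account - replace with your own
#SBATCH -p core
#SBATCH -n 4
#SBATCH -t 02:00:00
#SBATCH -J kallisto_quant

module load bioinfo-tools
module load kallisto/0.43.1

# Quantify all 12 samples with 100 bootstraps, using 4 threads
for i in {1..12}; do
  kallisto quant -i hsGRCh38_kallisto -t 4 -b 100 \
    7_111116_AD0341ACXX_137_${i}_index${i}_1.fastq.gz \
    7_111116_AD0341ACXX_137_${i}_index${i}_2.fastq.gz \
    -o sample${i}
done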

2 Sleuth

Sleuth is a companion package for Kallisto which is used for differential expression analysis of transcript quantifications from Kallisto. While you could use other differential expression packages such as Limma or DESeq2 to analyse your Kallisto output, Sleuth also takes into consideration the inherent variability in transcript quantification as explained above. Sleuth also allows the modeling of covariates such as batch, individual, tissue type etc. in the same way as DESeq2/edgeR/Limma, which is useful for experimental designs with multiple varying factors.

Sleuth was designed to work on output from Kallisto (rather than count tables, like DESeq2, or BAM files, like CuffDiff2), so we need to run Kallisto first. (Note that the outputs from other RNA-seq quantifiers like Salmon or Sailfish can also be used with Sleuth via the wasabi package.)
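
As a sketch of how that conversion would look (the directory names below are hypothetical, and prepare_fish_for_sleuth() is wasabi's converter function at the time of writing):

R
# Convert Salmon/Sailfish output directories into Kallisto-style HDF5 files that Sleuth can read
library(wasabi)
salmon_dirs <- file.path("salmon_results", paste0("sample", 1:12))  # hypothetical paths
prepare_fish_for_sleuth(salmon_dirs)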

Unlike Kallisto, Sleuth is an R package. At the end of a Sleuth analysis, it is possible to view a dynamic graphical presentation of the results where you can explore the differentially expressed transcripts in an intuitive way.

It is still early days for Sleuth and it has not been extensively benchmarked against other packages yet. Let’s try it on the same A431 data as in the DESeq2 lab!

2.1 Running Sleuth

Here we give an example workflow for a DE analysis in Sleuth based on the A431 data that we are using for all the DE analysis labs. Start by copying the results archive from the UPPMAX path /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/diffExp/kallisto_results.tar.gz. Download and extract the whole folder and make a note of where it is. We use the path ./data/kallisto_results/ relative to this document.

The Sleuth analysis is done entirely in R, so start your R environment and begin by installing the dependencies. This only needs to be done the first time, of course.

R
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages("devtools")
library(devtools)
devtools::install_github("pachterlab/sleuth")
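
Note that biocLite() has been replaced by BiocManager in Bioconductor 3.8 and later; if biocLite() is not available in your R version, the equivalent installation would be something like:

R
# BiocManager is the installer for Bioconductor 3.8+ (R >= 3.5)
install.packages("BiocManager")
BiocManager::install("rhdf5")
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")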

We start by specifying the paths to the Kallisto output directories. The directory structure looks like this:

kallisto_results/
  +-- sample1/
  |   +-- abundance.h5
  |   +-- abundance.tsv
  |   +-- run_info.json
  +-- sample2/
  +-- sample3/
  ...

We use the script below to generate the paths and labels.

R
base_dir <- "./data/kallisto_results"
samples <- paste0("sample", 1:12)
kal_dirs <- sapply(samples, function(id) file.path(base_dir, id))
kal_dirs
                           sample1                            sample2
 "./data/kallisto_results/sample1"  "./data/kallisto_results/sample2"
                           sample3                            sample4
 "./data/kallisto_results/sample3"  "./data/kallisto_results/sample4"
                           sample5                            sample6
 "./data/kallisto_results/sample5"  "./data/kallisto_results/sample6"
                           sample7                            sample8
 "./data/kallisto_results/sample7"  "./data/kallisto_results/sample8"
                           sample9                           sample10
 "./data/kallisto_results/sample9" "./data/kallisto_results/sample10"
                          sample11                           sample12
"./data/kallisto_results/sample11" "./data/kallisto_results/sample12"

Now it’s time to fill in metadata about the samples. We can use a similar assignment as in the DESeq2 exercise.

R
s2c <- data.frame(path = kal_dirs,
                  sample = samples,
                  timepoint = rep(c("ctrl", "t2h", "t6h", "t24h"), each = 3),
                  stringsAsFactors = FALSE)

Again, if there were other experimental factors involved, these could have been modelled here as well.
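
Before moving on, it can be worth checking that the table looks as expected and that all paths actually exist (a small sanity check, not part of the original workflow):

R
# One row per sample, with the corresponding Kallisto output directory and time point
print(s2c)
stopifnot(all(dir.exists(s2c$path)))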

Now we read in the transcript-to-gene mapping file, which was downloaded previously (a sketch of how such a file can be generated with biomaRt follows below).

R
t2g <- read.delim("./data/human_transcripts.txt", header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                     ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
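
If you do not have such a mapping file at hand, one way to generate it is with the biomaRt package (a sketch; the attribute names assume the Ensembl human gene dataset):

R
# Download a transcript-to-gene mapping from Ensembl and save it as a tab-separated file
library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mapping <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                "external_gene_name"), mart = mart)
write.table(mapping, "./data/human_transcripts.txt", sep = "\t",
            quote = FALSE, row.names = FALSE)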

The next command will read the Kallisto output files, connect them with metadata, and set up a linear model for analyzing the expression data.

R
library("sleuth")
so <- sleuth_prep(s2c, ~timepoint, target_mapping = t2g)
It appears that you are running Sleuth from within Rstudio.
Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
reading in kallisto results
dropping unused factor levels
............
normalizing est_counts
47352 targets passed the filter
normalizing tpm
merging in metadata
summarizing bootstraps
............

Next, we fit the linear model and test for one of the model coefficients. In this case we test the 24h time point versus the control.

R
so <- sleuth_fit(so)
so <- sleuth_wt(so, which_beta = "timepointt24h")
fitting measurement error models
shrinkage estimation
1 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: ENST00000465860
computing variance of betas
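
Besides the interactive view below, you can also extract the test results as a plain table with sleuth_results(); a minimal sketch (the qval column contains the adjusted p-values):

R
# Extract the Wald test results for the 24h-vs-control coefficient as a data frame
res <- sleuth_results(so, test = "timepointt24h")
head(res[order(res$qval), ])  # top transcripts ranked by adjusted p-value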

Now we should be able to visualize the results:

R
sleuth_live(so)

There are lots of things to look at here - explore according to your interests! Some things you might try are e.g. the PCA and sample heatmap options in the map menu, the test table in the analyses menu (which contains a ranked list of the differentially expressed genes), or the gene view in the same menu.

If you want to delve further into time series analysis with Sleuth (after all, we have just compared two time points here, whereas we have four in all), you might want to read this excellent blog post by Valentine Svensson. Note that Sleuth is still under development, so some of the commands may be a bit different.

3 Session info

Output
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] fgsea_1.8.0                 Rcpp_1.0.2                 
##  [3] enrichR_2.1                 pvclust_2.0-0              
##  [5] rafalib_1.0.0               pheatmap_1.0.12            
##  [7] edgeR_3.24.3                limma_3.38.3               
##  [9] DESeq2_1.22.2               SummarizedExperiment_1.12.0
## [11] DelayedArray_0.8.0          BiocParallel_1.16.6        
## [13] matrixStats_0.54.0          Biobase_2.42.0             
## [15] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [17] IRanges_2.16.0              S4Vectors_0.20.1           
## [19] BiocGenerics_0.28.0         ggplot2_3.2.1              
## [21] dplyr_0.8.3                 leaflet_2.0.2              
## [23] captioner_2.2.3             bookdown_0.12              
## [25] knitr_1.24                  biomaRt_2.38.0             
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1       ggsignif_0.6.0         rjson_0.2.20          
##  [4] htmlTable_1.13.1       XVector_0.22.0         base64enc_0.1-3       
##  [7] rstudioapi_0.10        ggpubr_0.2.3           bit64_0.9-7           
## [10] AnnotationDbi_1.44.0   splines_3.5.2          geneplotter_1.60.0    
## [13] zeallot_0.1.0          Formula_1.2-3          jsonlite_1.6          
## [16] annotate_1.60.1        cluster_2.1.0          shiny_1.3.2           
## [19] compiler_3.5.2         httr_1.4.1             backports_1.1.4       
## [22] assertthat_0.2.1       Matrix_1.2-17          lazyeval_0.2.2        
## [25] later_0.8.0            acepack_1.4.1          htmltools_0.3.6       
## [28] prettyunits_1.0.2      tools_3.5.2            gtable_0.3.0          
## [31] glue_1.3.1             GenomeInfoDbData_1.2.0 fastmatch_1.1-0       
## [34] vctrs_0.2.0            nlme_3.1-141           crosstalk_1.0.0       
## [37] xfun_0.8               stringr_1.4.0          mime_0.7              
## [40] XML_3.98-1.20          zlibbioc_1.28.0        scales_1.0.0          
## [43] hms_0.5.0              promises_1.0.1         RColorBrewer_1.1-2    
## [46] yaml_2.2.0             curl_4.0               memoise_1.1.0         
## [49] gridExtra_2.3          rpart_4.1-15           latticeExtra_0.6-28   
## [52] stringi_1.4.3          RSQLite_2.1.2          genefilter_1.64.0     
## [55] checkmate_1.9.4        rlang_0.4.0            pkgconfig_2.0.2       
## [58] bitops_1.0-6           evaluate_0.14          lattice_0.20-38       
## [61] purrr_0.3.2            htmlwidgets_1.3        labeling_0.3          
## [64] cowplot_1.0.0          bit_1.1-14             tidyselect_0.2.5      
## [67] magrittr_1.5           R6_2.4.0               Hmisc_4.2-0           
## [70] DBI_1.0.0              pillar_1.4.2           foreign_0.8-72        
## [73] withr_2.1.2            mgcv_1.8-28            survival_2.44-1.1     
## [76] RCurl_1.95-4.12        nnet_7.3-12            tibble_2.1.3          
## [79] crayon_1.3.4           rmarkdown_1.14         progress_1.2.2        
## [82] locfit_1.5-9.1         grid_3.5.2             data.table_1.12.2     
## [85] blob_1.2.0             digest_0.6.20          xtable_1.8-4          
## [88] tidyr_0.8.3            httpuv_1.5.1           munsell_0.5.0
