1 Kallisto

Kallisto is an “alignment-free” RNA-Seq quantification method that runs very fast with a small memory footprint, so that it can be run on most laptops. It is a command-line program that can be downloaded as binary executables for Linux or Mac, or in source code format. For a first insight into the program, read here and for the published article, see here.

Kallisto is geared towards quantification on the transcript (isoform) level, rather than the gene-level (although the latter can also be done by post-processing Kallisto output.) However, read assignment to transcript isoforms cannot (in general) be done unambiguously, so there is an intrinsic “quantification noise” or variability in this process. Kallisto can thus be run either in a single step (which is very fast) or in “bootstrap” mode (which takes longer, but can be done on several processors in parallel) in order to get uncertainty estimates for the expression levels - a kind of error bars for the quantification process. Running with bootstraps is mandatory if you want to perform differential expression analysis of isoforms with Sleuth (see below).

Kallisto is primarily meant for quantification of an existing set of FASTA sequences, that is, it does not perform transcript assembly and it cannot quantify the expression of novel transcripts that are not in the transcript index that you provide to it. With that said, you can of course use contigs from an assembly that you have produced in some other program in your Kallisto index. It would also be possible to use the software for eg: metagenomics or metatranscriptomics quantification.

1.1 Build index

Download the cDNA reference and create a index file to run on (This has already been done!). But if you want to do it yourself you can uncomment the code and run it yourselves.

sh
#mkdir references/ref_kallisto
   
#cd references/ref_kallisto

#wget ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz



#kallisto index \
#  --index kallisto_mm10_transcriptome_index.idx \
#  Mus_musculus.GRCm38.cdna.all.fa.gz


# cd ../..

It takes less than 10 minutes.

1.2 Pseudoalign reads

Now you can pseudoalign all your reads to the Kalisto index using Kalisto quantification

sh


  mkdir kallisto_quantification

 for sample_name in $(ls fastq/*.clean.fq.gz.1M_reads.fq.gz); do

    file=${sample_name#fastq/}
    sample=${file%.clean.fq.gz.1M_reads.fq.gz}
   
   echo $sample

     
    mkdir kallisto_quantification/$sample

    kallisto quant \
      --index references/ref_kallisto/kallisto_mm10_transcriptome_index.idx \
      --output-dir kallisto_quantification/$sample \
      --threads 3 \
      --single \
      --fragment-length 200 \
      --sd 20 \
     $sample_name

  done

2 Quantification

2.1 Get gene quantifications

Since Kalisto uses transcripts estimates and many programs, like DEseq and EdgeR, works better with gene counts there are programs to get gene counts from transctipt estimates in our case we will use txImport. TxImport can import estimates from a lot of different sources. You can read more about it in the txImport vignette

For the program to work it needs a two column file that assigns each transcript to a gene. This can found using ensembl database or other public databases. In our case the relationship between the transcripts and gene is found in the sequencenames in the cDNA file that we downloaded. By using awk we can extract the transcript gene dependency and write it to a file.

sh
# Example of a sequence name in file 
#  >ENSMUST00000177564.1 cdna chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]


# Extract all transcriptnames (1st) and genenames (4th) from  sequence names and write to a file.   
gzcat references/ref_kallisto/Mus_musculus.GRCm38.cdna.all.fa.gz| \
grep '>' |
awk '{FS= " "}BEGIN{ print "TXNAME,GENEID"};{print substr($1,2) "," substr($4,6)};' \
>references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv 

2.2 TxImport

The next steps are carried out in R so make sure you start R with the correct packages.

R
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq
library(tximport) # importing kalisto transcript counts to geneLevels
library(readr) # Fast readr of files.
library(rhdf5) # read/convert kalisto output files.  
If tximport and rhdf5 are not installed you can install them now.
R
#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")

#BiocManager::install("tximport")
#BiocManager::install("rhdf5")
Load the metadata for the samples.
R
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")

# download metadata

download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)

mr

2.2.1 Convert to gene counts

Read in the Kalisto transcript abundances and and convert the TxImport.

TxImport is created to

R
setwd("~/RNAseq")
files <- paste("kallisto_quantification" ,  
               list.files(path = "kallisto_quantification",pattern = "abundance.tsv", recursive = TRUE),
               sep = "/")
names(files) <- mr$SampleName


tx2gene <- read_csv("references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv")


txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)

2.2.2 Convert to DEseq2 object for analysis

R
mr = mr %>% mutate(Day = as.factor(Day))
dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, mr, ~Day)


test = DESeq(dds)

and now you can go on with the DE similair as the one you did with the feature count table.

2.2.3 Convert to a edgeR object for analysis

R
cts <- txi.kallisto.tsv$counts
normMat <- txi.kallisto.tsv$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)

# y is now ready for estimate dispersion functions see edgeR User's Guide

3 Session info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scales_1.1.1                fgsea_1.16.0               
##  [3] enrichR_2.1                 rafalib_1.0.0              
##  [5] pvclust_2.2-0               pheatmap_1.0.12            
##  [7] biomaRt_2.46.0              edgeR_3.32.0               
##  [9] limma_3.46.0                DESeq2_1.30.0              
## [11] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [13] MatrixGenerics_1.2.0        matrixStats_0.57.0         
## [15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.1        
## [17] IRanges_2.24.0              S4Vectors_0.28.0           
## [19] BiocGenerics_0.36.0         ggplot2_3.3.2              
## [21] formattable_0.2.0.1         kableExtra_1.3.1           
## [23] dplyr_1.0.2                 lubridate_1.7.9.2          
## [25] leaflet_2.0.3               yaml_2.2.1                 
## [27] fontawesome_0.1.0           captioner_2.2.3            
## [29] bookdown_0.21               knitr_1.30                 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149           bitops_1.0-6           bit64_4.0.5           
##  [4] webshot_0.5.2          RColorBrewer_1.1-2     progress_1.2.2        
##  [7] httr_1.4.2             tools_4.0.3            R6_2.5.0              
## [10] DBI_1.1.0              mgcv_1.8-33            colorspace_2.0-0      
## [13] withr_2.3.0            gridExtra_2.3          prettyunits_1.1.1     
## [16] tidyselect_1.1.0       curl_4.3               bit_4.0.4             
## [19] compiler_4.0.3         rvest_0.3.6            xml2_1.3.2            
## [22] DelayedArray_0.16.0    labeling_0.4.2         genefilter_1.72.0     
## [25] askpass_1.1            rappdirs_0.3.1         stringr_1.4.0         
## [28] digest_0.6.27          rmarkdown_2.5          XVector_0.30.0        
## [31] pkgconfig_2.0.3        htmltools_0.5.0        dbplyr_2.0.0          
## [34] htmlwidgets_1.5.2      rlang_0.4.9            rstudioapi_0.13       
## [37] RSQLite_2.2.1          generics_0.1.0         farver_2.0.3          
## [40] jsonlite_1.7.1         crosstalk_1.1.0.1      BiocParallel_1.24.1   
## [43] RCurl_1.98-1.2         magrittr_2.0.1         GenomeInfoDbData_1.2.4
## [46] Matrix_1.2-18          Rcpp_1.0.5             munsell_0.5.0         
## [49] lifecycle_0.2.0        stringi_1.5.3          zlibbioc_1.36.0       
## [52] BiocFileCache_1.14.0   grid_4.0.3             blob_1.2.1            
## [55] crayon_1.3.4           lattice_0.20-41        splines_4.0.3         
## [58] annotate_1.68.0        hms_0.5.3              locfit_1.5-9.4        
## [61] pillar_1.4.7           rjson_0.2.20           geneplotter_1.68.0    
## [64] fastmatch_1.1-0        XML_3.99-0.5           glue_1.4.2            
## [67] evaluate_0.14          data.table_1.13.2      vctrs_0.3.5           
## [70] openssl_1.4.3          gtable_0.3.0           purrr_0.3.4           
## [73] tidyr_1.1.2            assertthat_0.2.1       xfun_0.19             
## [76] xtable_1.8-4           survival_3.2-7         viridisLite_0.3.0     
## [79] tibble_3.0.4           AnnotationDbi_1.52.0   memoise_1.1.0         
## [82] ellipsis_0.3.1