Kallisto is an “alignment-free” RNA-Seq quantification method that runs very fast with a small memory footprint, so that it can be run on most laptops. It is a command-line program that can be downloaded as binary executables for Linux or Mac, or in source code format. For a first insight into the program, read here and for the published article, see here.
Kallisto is geared towards quantification on the transcript (isoform) level, rather than the gene-level (although the latter can also be done by post-processing Kallisto output.) However, read assignment to transcript isoforms cannot (in general) be done unambiguously, so there is an intrinsic “quantification noise” or variability in this process. Kallisto can thus be run either in a single step (which is very fast) or in “bootstrap” mode (which takes longer, but can be done on several processors in parallel) in order to get uncertainty estimates for the expression levels - a kind of error bars for the quantification process. Running with bootstraps is mandatory if you want to perform differential expression analysis of isoforms with Sleuth (see below).
Kallisto is primarily meant for quantification of an existing set of FASTA sequences, that is, it does not perform transcript assembly and it cannot quantify the expression of novel transcripts that are not in the transcript index that you provide to it. With that said, you can of course use contigs from an assembly that you have produced in some other program in your Kallisto index. It would also be possible to use the software for eg: metagenomics or metatranscriptomics quantification.
Download the cDNA reference and create a index file to run on (This has already been done!). But if you want to do it yourself you can uncomment the code and run it yourselves.
#cd ~/RNAseq/
#mkdir rawData/references/ref_kallisto
#cd rawData/references/ref_kallisto
#wget ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
#kallisto index \
# --index kallisto_mm10_transcriptome_index.idx \
# Mus_musculus.GRCm38.cdna.all.fa.gz
#cd ~/RNAseq/
It takes less than 10 minutes.
Now you can pseudoalign all your reads to the Kallisto index using Kallisto quantification
cd ~/RNAseq/
for sample_name in $(ls rawData/fastqFiles/*.clean.fq.gz.1M_reads.fq.gz); do
file=${sample_name#rawData/fastqFiles/}
sample=${file%.clean.fq.gz.1M_reads.fq.gz}
echo $sample
mkdir results/05-kallisto/$sample
kallisto quant \
--index rawData/references/ref_kallisto/kallisto_mm10_transcriptome_index.idx \
--output-dir results/05-kallisto/$sample \
--threads 3 \
--single \
--fragment-length 200 \
--sd 20 \
$sample_name
done
Since Kallisto uses transcripts estimates and many programs, like DEseq and EdgeR, works better with gene counts there are programs to get gene counts from transcript estimates in our case we will use txImport. TxImport can import estimates from a lot of different sources. You can read more about it in the txImport vignette
For the program to work, it needs a two-column-file that assigns each transcript to a gene. This can found using ensembl database or other public databases. In our case the relationship between the transcripts and gene is found in the sequencenames in the cDNA file that we downloaded. By using awk we can extract the transcript gene dependency and write it to a file.
# Example of a sequence name in file
# >ENSMUST00000177564.1 cdna chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]
# Extract all transcriptnames (1st) and genenames (4th) from sequence names and write to a file. If "gzcat" does not work, you can try replacing it by "gunzip -c"
gzcat rawData/references/ref_kallisto/Mus_musculus.GRCm38.cdna.all.fa.gz| \
grep '>' |
awk '{FS= " "}BEGIN{ print "TXNAME,GENEID"};{print substr($1,2) "," substr($4,6)};' \
>rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv
The next steps are carried out in R so make sure you start R in the conda environment.
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq
library(tximport) # importing kalisto transcript counts to geneLevels
library(readr) # Fast readr of files.
library(rhdf5) # read/convert kalisto output files.
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("tximport")
#BiocManager::install("rhdf5")
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
# download metadata
download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)
mr
Read in the Kallisto transcript abundances and convert the TxImport.
TxImport is created to
setwd("~/RNAseq")
files <- paste("results/05-kallisto" ,
list.files(path = "results/05-kallisto",
pattern = "abundance.tsv",
recursive = TRUE),
sep = "/")
names(files) <- mr$SampleName
tx2gene <- read_csv("rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv")
txi.kallisto.tsv <- tximport(files,
type = "kallisto",
tx2gene = tx2gene,
ignoreAfterBar = TRUE)
mr = mr %>% mutate(Day = as.factor(Day))
dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, mr, ~Day)
test = DESeq(dds)
and now you can go on with the differential gene expression in a similar way as the one you did with the feature count table.
cts <- txi.kallisto.tsv$counts
normMat <- txi.kallisto.tsv$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
# y is now ready for estimate dispersion functions see edgeR User's Guide
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] fgsea_1.20.0 enrichR_3.1
## [3] rafalib_1.0.0 biomaRt_2.50.3
## [5] pvclust_2.2-0 pheatmap_1.0.12
## [7] edgeR_3.36.0 limma_3.50.3
## [9] DESeq2_1.34.0 SummarizedExperiment_1.24.0
## [11] Biobase_2.54.0 MatrixGenerics_1.6.0
## [13] matrixStats_0.63.0 GenomicRanges_1.46.1
## [15] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [17] S4Vectors_0.32.4 BiocGenerics_0.40.0
## [19] ggplot2_3.4.1 formattable_0.2.1
## [21] kableExtra_1.3.4 dplyr_1.1.0
## [23] lubridate_1.9.2 yaml_2.3.7
## [25] fontawesome_0.5.0.9000 captioner_2.2.3
## [27] bookdown_0.33 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.1-0 rjson_0.2.21 ellipsis_0.3.2
## [4] XVector_0.34.0 rstudioapi_0.14 farver_2.1.1
## [7] bit64_4.0.5 AnnotationDbi_1.56.2 fansi_1.0.4
## [10] xml2_1.3.3 splines_4.1.3 cachem_1.0.7
## [13] geneplotter_1.72.0 jsonlite_1.8.4 annotate_1.72.0
## [16] dbplyr_2.3.1 png_0.1-8 compiler_4.1.3
## [19] httr_1.4.5 Matrix_1.4-0 fastmap_1.1.1
## [22] cli_3.6.0 htmltools_0.5.4 prettyunits_1.1.1
## [25] tools_4.1.3 gtable_0.3.1 glue_1.6.2
## [28] GenomeInfoDbData_1.2.7 rappdirs_0.3.3 fastmatch_1.1-3
## [31] Rcpp_1.0.10 jquerylib_0.1.4 vctrs_0.5.2
## [34] Biostrings_2.62.0 svglite_2.1.1 nlme_3.1-155
## [37] xfun_0.37 stringr_1.5.0 rvest_1.0.3
## [40] timechange_0.2.0 lifecycle_1.0.3 XML_3.99-0.13
## [43] zlibbioc_1.40.0 scales_1.2.1 hms_1.1.2
## [46] parallel_4.1.3 RColorBrewer_1.1-3 curl_5.0.0
## [49] gridExtra_2.3 memoise_2.0.1 sass_0.4.5
## [52] stringi_1.7.12 RSQLite_2.3.0 highr_0.10
## [55] genefilter_1.76.0 filelock_1.0.2 BiocParallel_1.28.3
## [58] rlang_1.0.6 pkgconfig_2.0.3 systemfonts_1.0.4
## [61] bitops_1.0-7 evaluate_0.20 lattice_0.20-45
## [64] purrr_1.0.1 htmlwidgets_1.6.1 labeling_0.4.2
## [67] bit_4.0.5 tidyselect_1.2.0 magrittr_2.0.3
## [70] R6_2.5.1 generics_0.1.3 DelayedArray_0.20.0
## [73] DBI_1.1.3 pillar_1.8.1 withr_2.5.0
## [76] mgcv_1.8-39 survival_3.2-13 KEGGREST_1.34.0
## [79] RCurl_1.98-1.10 tibble_3.2.0 crayon_1.5.2
## [82] utf8_1.2.3 BiocFileCache_2.2.1 rmarkdown_2.20
## [85] progress_1.2.2 locfit_1.5-9.7 grid_4.1.3
## [88] data.table_1.14.8 blob_1.2.3 digest_0.6.31
## [91] webshot_0.5.4 xtable_1.8-4 tidyr_1.3.0
## [94] munsell_0.5.0 viridisLite_0.4.1 bslib_0.4.2