De-novo Transcriptome Assembly

Overview

Exercises: 20 min

Questions

How to De-novo assemble my RNAseq?
What should I look in my assembly to go forward

Objectives

run Trinity

Prerequisites

For this exercise you need to be logged in to Uppmax.

Setup the folder structure:

source ~/git/GAAS/profiles/activate_rackham_env
export data=/proj/g2019006/nobackup/$USER/data
export RNAseq_assembly_path=/proj/g2019006/nobackup/$USER/RNAseq_assembly

Trinity

Trinity assemblies can be used as complementary evidence, particularly when trying to polish a gene build with Pasa. Before you start, check how big the raw read data is that you wish to assemble to avoid unreasonably long run times.

cd $RNAseq_assembly_path

mkdir trinity

cd trinity

module load trinity/2.4.0
module load samtools/1.9

Trinity --seqType fq --max_memory 32G --left $data/raw_computes/ERR305399_1.fastq.gz --right $data/raw_computes/ERR305399_2.fastq.gz --CPU 5 --output trinity --SS_lib_type RF

Trinity takes a long time to run (several hours), you can stop the program when you start it and have a look at the results, look in $data/RNAseq/trinity the output is Trinity.fasta

Tips: Using Trinity in genome annotation

Some advantages :

Really easy to run.
Many manual an tutorial available on internet eg : here.
Output can be annotated directly with Trinotate.
Widely use in the community.
One can concatenate several read files/libraries or run them separately.

Some disadvantage :

Can take several hours to run.
Can take a lot of resources to run and create a lot of data.
Input data are often really big files : As a “rule” in our team, if read files are >10Gb compressed, one should consider normalizing the data. If you have such big file, it is recommended to normalize them prior to assembly to avoid very long run times.
Results tend to be noisy with many shorts transcripts, might need to be filtered.