For this exercise you need to be logged in to Uppmax.
Setup the folder structure:
source ~/git/GAAS/profiles/activate_rackham_env
export data=/proj/g2019006/nobackup/$USER/data
export RNAseq_assembly_path=/proj/g2019006/nobackup/$USER/RNAseq_assembly
mkdir -p $RNAseq_assembly_path
This exercise is meant to get you acquainted with the type of data you would normally encounter in an annotation project. We will for all exercises use data for the fruit fly, Drosophila melanogaster, as that is one of the currently best annotated organisms and there is plenty of high quality data available.
You can create the folders where you want but I would suggest a folder organisation, if you do not follow this organisation remember to put the correct path to your data
cd $RNAseq_assembly_path
RNA-seq data is in general very useful in annotation projects as the data usually comes from the actual organism you are studying and thus avoids the danger of introducing errors caused by differences in gene structure between your study organism and other species.
Important remarks to remember before starting working with RNA-seq:
Check if RNAseq are paired or not. Last generation of sequenced short reads (since 2013) are almost all paired. Anyway, it is important to check that information, which will be useful for the tools used in the next steps.
Check if RNAseq are stranded. Indeed this information will be useful for the tools used in the next steps. (In general way we recommend to use stranded RNAseq to avoid transcript fusion during the transcript assembly process. That gives more reliable results. )
Left / L / forward / 1 are identical meaning. It is the same for Right / R /Reverse / 2
To check the technology used to sequences the RNAseq and get some extra information we have to use fastqc tool.
module load FastQC/0.11.5
mkdir fastqc_reports
fastqc $data/raw_computes/ERR305399_1.fastq.gz -o fastqc_reports/
Copy the html file resulting of fastqc in your local machine
scp __YOURLOGIN__@rackham.uppmax.uu.se:/proj/g2019006/nobackup/__YOURLOGIN__/RNAseq_assembly/fastqc_reports/YOURFILE .
what kind of results do you have?
Checking the fastq quality score format :
fastq_guessMyFormat.pl -i $data/raw_computes/ERR305399_1.fastq.gz
In the normal mode, it differentiates between Sanger/Illumina1.8+ and Solexa/Illumina1.3+/Illumina1.5+. In the advanced mode, it will try to pinpoint exactly which scoring system is used.
More test can be made and should be made on RNA-seq data before doing the assembly, we do not have time to do all of them during this course. Have a look here
The information regarding library type can be very useful for reads to be assembled into a transcriptome or mapped to a reference assembly. This is because the library type can help to discern where in the transcriptome shorter ambiguous reads belong by using the read’s relative orientation and from which strand it was sequenced. Unfortunately, this information regarding the library type used is not included in sequencing output files and may be lost before the assembly of the data.
Here a resume of the different library types:
In order to guess de-novo the library type we will use for this excercise: GUESSmyLT.
First link the data and load the necessary modules:
cd $RNAseq_assembly_path
ln -s $data/raw_computes/ERR305399_1.fastq.gz
ln -s $data/raw_computes/ERR305399_2.fastq.gz
ln -s $data/genome/genome.fa
module load trinity/2.8.2
module load BUSCO/3.0.2b
source $BUSCO_SETUP
module load bowtie2/2.3.4.3
module load samtools/1.2
module load python3/3.6.0
module load snakemake/5.4.5
Then install GUESSmyLT:
git clone https://github.com/NBISweden/GUESSmyLT.git
cd GUESSmyLT
python3 setup.py install --user
Check that the tool is working fine:
~/.local/bin/GUESSmyLT -h
Let’s know launch it on our data:
cd $RNAseq_assembly_path
~/.local/bin/GUESSmyLT --reads ERR305399_1.fastq.gz ERR305399_2.fastq.gz --reference genome.fa --threads 10 --subsample 100000 --output result
first strand
) or in other term RF unstranded (according to mRNA
).