For this exercise you need to be logged in to Uppmax.
Set up the folder structure:
source ~/git/GAAS/profiles/activate_rackham_env
export data=/proj/g2019006/nobackup/$USER/data
export RNAseq_assembly_path=/proj/g2019006/nobackup/$USER/RNAseq_assembly
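Before going further, it can help to confirm that both variables point to directories that actually exist. The following is only a minimal sketch; `check_dir` is a hypothetical helper name and is not part of GAAS or the course material.

```shell
# Hypothetical helper: report whether a variable holds the path of an existing directory.
check_dir() {
    # $1: label to print, $2: path to test
    if [ -n "$2" ] && [ -d "$2" ]; then
        echo "OK      $1=$2"
    else
        echo "MISSING $1=$2"
        return 1
    fi
}

# "${var:-}" keeps the check usable even if the variable was never exported.
check_dir data "${data:-}" || true
check_dir RNAseq_assembly_path "${RNAseq_assembly_path:-}" || true
```

A MISSING line usually means the export was not run in the current shell or the folder has not been created yet.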
There are several ways to assess the quality of your assembly; some of them are covered here.
We will run BUSCO to check the quality of the assembly. BUSCO provides quantitative measures for assessing the completeness of genome assemblies, gene sets, and transcriptomes (the last is what we are going to do here). The genes that make up the BUSCO set for each major lineage are selected from orthologous groups whose genes are present as single-copy orthologs in at least 90% of the species in the chosen branch of the tree of life.
cd $RNAseq_assembly_path
mkdir assembly_assessment
cd assembly_assessment
module load BUSCO/3.0.2b
source $BUSCO_SETUP
run_BUSCO.py -i $data/RNAseq/trinity/Trinity.fasta -o busco_trinity -l $BUSCO_LINEAGE_SETS/arthropoda_odb9 -m tran -c 5
BUSCO takes about 30 minutes to run, so rather than waiting you can check the results already available in $data/RNAseq/busco_trinity
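To read the scores without opening the files by hand, something like the following can be used. It is a sketch that assumes the results folder contains BUSCO's short_summary*.txt files and that each one holds the usual one-line score starting with C: (complete), followed by S (single-copy), D (duplicated), F (fragmented) and M (missing). `busco_scores` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<summary file><TAB><score line>" for every
# BUSCO short summary found under the directory given as $1.
busco_scores() {
    find "$1" -name 'short_summary*.txt' | sort | while IFS= read -r f; do
        # The score line looks like: C:..%[S:..%,D:..%],F:..%,M:..%,n:...
        printf '%s\t%s\n' "$f" "$(grep -m1 -o 'C:[0-9.]*%.*' "$f")"
    done
}

# Only run on the course folder if it is actually there.
if [ -d "${data:-}/RNAseq/busco_trinity" ]; then
    busco_scores "$data/RNAseq/busco_trinity"
fi
```

The same function can be pointed at the current directory once your own run has finished.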
You first need to extract the transcript sequences from the StringTie GTF transcript file:
ln -s $data/genome/genome.fa
gff3_sp_extract_sequences.pl --cdna -g $RNAseq_assembly_path/guided_assembly/stringtie/transcripts.gtf -f genome.fa -o $RNAseq_assembly_path/guided_assembly/stringtie/transcripts_stringtie.fa
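A quick sanity check at this point is to compare the number of sequences written to the FASTA file with the number of transcripts in the GTF. The sketch below assumes the GTF marks each transcript with a `transcript` feature in its third column, as StringTie output normally does; both function names are hypothetical.

```shell
# Hypothetical helper: count FASTA records (lines starting with ">").
count_fasta_records() {
    # grep -c still prints 0 when nothing matches; "|| true" hides its non-zero exit code.
    grep -c '^>' "$1" || true
}

# Hypothetical helper: count GTF lines whose feature type (column 3) is "transcript".
count_gtf_transcripts() {
    awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

gtf="${RNAseq_assembly_path:-}/guided_assembly/stringtie/transcripts.gtf"
fa="${RNAseq_assembly_path:-}/guided_assembly/stringtie/transcripts_stringtie.fa"
if [ -f "$gtf" ] && [ -f "$fa" ]; then
    echo "GTF transcripts: $(count_gtf_transcripts "$gtf")"
    echo "FASTA records:   $(count_fasta_records "$fa")"
fi
```

If the two numbers differ a lot, it is worth looking at the extraction step again before running BUSCO.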
Then you can run BUSCO again:
run_BUSCO.py -i $RNAseq_assembly_path/guided_assembly/stringtie/transcripts_stringtie.fa -o busco_stringtie -l $BUSCO_LINEAGE_SETS/arthropoda_odb9 -m tran -c 5
Compare the two BUSCO results. What do you think happened with StringTie?
Now you are ready to use the results of your de novo assembly and genome-guided assembly for the genome annotation.
For the de novo assembly you can use the Trinity.fasta file you obtained. For the genome-guided assembly you can use the StringTie output transcripts.gtf, but you will often need to convert it to a GFF file first. If you have not done so yet, run:
gxf_to_gff3.pl -g $RNAseq_assembly_path/guided_assembly/stringtie/transcripts.gtf -o $RNAseq_assembly_path/guided_assembly/stringtie/transcript_stringtie.gff3
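To check that the conversion produced something usable, you can look at the header and at the feature types in the new file. This is a minimal sketch: it relies only on the GFF3 convention that the file starts with a `##gff-version 3` line and that the feature type sits in the third tab-separated column. The function names are hypothetical.

```shell
# Hypothetical helper: print "ok" if the first line is a GFF3 version header.
gff3_header_status() {
    case "$(head -n 1 "$1")" in
        '##gff-version 3'*) echo "ok" ;;
        *) echo "missing header" ;;
    esac
}

# Hypothetical helper: count features per type (column 3), skipping comment lines.
gff_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

gff="${RNAseq_assembly_path:-}/guided_assembly/stringtie/transcript_stringtie.gff3"
if [ -f "$gff" ]; then
    echo "header: $(gff3_header_status "$gff")"
    gff_feature_counts "$gff"
fi
```

An empty feature list or a missing header suggests the conversion did not run as expected.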
You are now ready to use the genome-guided assembly for your annotation.