Making an evidence based annotation with MAKER

Overview

Exercises: 45 min

Questions

how to create a structural annotation based on evidence only?

Objectives

Understand maker parameters files
run maker

Prerequisites

Setup the folder structure:

export data=/home/data/data_annotation/
export maker_evidence_path=~/annotation/structural_annotation/maker_evidence
cd
mkdir -p $maker_evidence_path
cd $maker_evidence_path

Overview

The first run of Maker will be done without ab-initio predictions. What are your expectations for the resulting gene build? In essence, we are attempting a purely evidence-based annotation, where the best protein- and EST-alignments are chosen to build the most likely gene models. The purpose of an evidence-based annotation is simple. Basically, you may try to annotate an organism where no usable ab-initio model is available. The evidence-based annotation can then be used to create a set of genes on which a new model could be trained on (using e.g. Snap or Augustus). Selection of genes for training can be based on the annotation edit distance (AED score), which says something about how great the distance between a gene model and the evidence alignments is. A score of 0.0 would essentially say that the final model is in perfect agreement with the evidence.

Let’s do this step-by-step:

Prepare the input data

Link the raw computes you want to use into your folder. The files you will need are:

the gff file of the pre-computed repeats (coordinates of repeatmasked regions)

ln -s $data/raw_computes/repeatmasker.genome.gff
ln -s $data/raw_computes/repeatrunner.genome.gff

In addition, you will also need the genome sequence.

ln -s $data/genome/genome.fa

Then you will also need EST and protein fasta file:

ln -s $data/evidence/est.genome.fa
ln -s $data/evidence/proteins.genome.fa

To finish you could use a transcriptome assembly. We will use a guided-assembly made with Stringtie:

ln -s $data/RNAseq/stringtie/transcript_stringtie.gff3 stringtie2genome.gff

/!\ Always check that the gff files you provides as protein or EST contains match/match_part (gff alignment type ) feature types rather than genes/transcripts (gff annotation type) otherwise MAKER will not use the contained data properly. Here we have to fix the stringtie gff file. can use the gtf righ away

conda activate agat
agat_sp_alignment_output_style.pl --gff stringtie2genome.gff -o stringtie2genome.ok.gff

You should now have 2 repeat files, 1 EST file, 1 protein file, 1 transcript file, and the genome sequence in the working directory.

Set the MAKER parameters

Let’s start by creating the three MAKER config files:

conda deactivate
conda activate maker
maker -CTL

What files did it create?

You can leave the two files controlling external software behaviors and the one controlling evm parameters untouched (I will encourage you to have a look at them and to look at this website to have a full description of those files). However, you need to provide the proper parameters in the file called maker_opts.ctl. Indeed, in that file, we tell MAKER what are the files to use, and the options to apply.

To edit the maker_opts.ctl file you can use the nano text editor:

nano maker_opts.ctl

In the maker_opts.ctl you will set:

name of the genome sequence (genome=)
name of the ‘EST’ file in fasta format (est=)
name of the ‘Transcript’ file in gff format (est_gff=)
name of the ‘Protein’ set file(s) (protein=)
name of the repeatmasker and repeatrunner files (rm_gff=)

You can list multiple files in one field by separating their names by a comma ‘,’.

This time, we do not specify a reference species to be used by augustus, which will disable ab-initio gene finding. Instead we set:

protein2genome=1
est2genome=1

This will enable gene building directly from the evidence alignments.

Be sure to have deactivated the parameters model_org= # and repeat_protein= # to avoid the heavy work of repeatmasker. Before running MAKER check you have modified the maker_opts.ctl file properly.

Click here to see the expected maker_opts.ctl.

#-----Genome (these are always required)
genome=genome.fa #genome sequence (fasta file or fasta embeded in GFF3 file)  
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

...

#-----EST Evidence (for best results provide a file for at least one)  
est=est.genome.fa #set of ESTs or assembled mRNA-seq in fasta format  
altest= #EST/cDNA sequence file in fasta format from an alternate organism  
est_gff=stringtie2genome.ok.gff #aligned ESTs or mRNA-seq from an external GFF3 file  
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

...

#-----Protein Homology Evidence (for best results provide a file for at least one)  
protein=proteins.genome.fa #protein sequence file in fasta format (i.e. from mutiple oransisms)  
protein_gff= #aligned protein homology evidence from an external GFF3 file

...

#-----Repeat Masking (leave values blank to skip repeat masking)<br/>
model_org= #select a model organism for RepBase masking in RepeatMasker  
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker   
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner  
rm_gff=repeatmasker.genome.gff,repeatrunner.genome.gff #pre-identified repeat elements from an external GFF3 file  
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no  
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

...

#-----Gene Prediction  
snaphmm= #SNAP HMM file  
gmhmm= #GeneMark HMM file  
augustus_species= #Augustus gene prediction species model  
fgenesh_par_file= #FGENESH parameter file  
pred_gff= #ab-initio predictions from an external GFF3 file  
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)  
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no  
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no  
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no  
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs  
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

...

Run Maker

If your maker_opts.ctl is configured correctly, you should be able to run maker:

mpirun -n 8 maker --ignore_nfs_tmp

This will start Maker on 8 cores, if everything is configured correctly. This will take a little while and process a lot of output to the screen. Luckily, much of the heavy work - such as repeat masking - are already done, so the total running time is quite manageable, even on a small number of cores. (it should take around 5-10 minutes to run)

Once the run is finished, check that everything went properly. If problems are detected, launch MAKER again.

conda deactivate
conda activate gaas
gaas_maker_check_progress.sh

Inspect the output (optional)

Here you can find details about the MAKER output.

Compile the output

Once Maker is finished, compile the annotation:

gaas_maker_merge_outputs_from_datastore.pl --output maker_evidence

We have specified a name for the output directory since we will be creating more than one annotation and need to be able to tell them apart.

This should create a maker_evidence folder containing all computed data including maker_annotation.gff which is the maker annotation file and maker_annotation.proteins.fasta which is the protein fasta file of this annotation. Those two files are the most important outputs from this analysis.

=> You could sym-link the maker_annotation.gff and maker_annotation.proteins.fasta files to another folder called e.g. dmel_results, so everything is in the same place in the end. Just make sure to call the links with specific names, since any maker output will be called similarly.

Inspect the gene models

To get some statistics of your annotation you can read the maker_annotation_stat.txt file from the maker_evidence folder or launch the agat_sp_statistics.pl (you used it before) that work on any gff file :

Click to see the solution .

conda deactivate
conda activate agat
agat_sp_statistics.pl --gff maker_evidence/maker_annotation.gff

How many genes do you get? what kind of statistics do you see?

We could now also visualise the annotation in the Webapollo genome browser (like we did for the augustus exercises).