Making an ab initio evidence-driven annotation with MAKER

Overview

Exercises: 1 h

Questions

How do you create a structural annotation using both evidence and ab initio predictions?

Objectives

Run MAKER with Augustus
Understand the parameter files and the output

Prerequisites

Set up the folder structure:

export data=/home/data/data_annotation/
export maker_evidence_path=~/annotation/structural_annotation/maker_evidence
export maker_abinitio_path=~/annotation/structural_annotation/maker_abinitio
cd
mkdir -p $maker_abinitio_path
cd $maker_abinitio_path

Introduction

The recommended way of running MAKER is in combination with one or more ab initio profile models. MAKER natively supports input from several tools, including Augustus, SNAP and GeneMark. The choice of tool depends on the organism you are annotating: for example, GeneMark-ES is mostly recommended for fungi, whereas Augustus and SNAP are more general purpose.

The biggest problem with ab-initio models is the process of training them. It is usually recommended to have somewhere around 500-1000 curated gene models for this purpose. Naturally, this is a bit of a contradiction for a not-yet annotated genome.

However, if one or more good ab-initio profiles are available, they can potentially greatly enhance the quality of an annotation by filling in the blanks left by missing evidence. Interestingly, Maker even works with ab-initio profiles from somewhat distantly related species since it can create so-called hints from the evidence alignments, which the gene predictor can take into account to fine-tune the predictions.

Usually, when no closely related ab initio profile exists for the species under investigation, we use the first (evidence-based) round of annotation to create one. We first select the best gene models from this annotation, which are then used to train the ab initio tools of our choice.

In order to compare the performance of Maker with and without ab-initio predictions in a real-world scenario, we have first run a gene build without ab-initio predictions. Now, we run a similar analysis but enable ab-initio predictions through augustus.

Prepare the input data

First, as usual, we need the genome:

ln -s $data/genome/genome.fa
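Note that ln -s creates a link even when its target does not exist, so it can be worth confirming that the link resolves before going further. The helper below is only a sketch; check_inputs is a hypothetical name introduced here, not part of MAKER.

```shell
# Hypothetical helper: report any input file that is missing or is a broken symlink.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ -e "$f" ]; then          # -e follows symlinks, so a broken link fails here
            echo "ok: $f"
        else
            echo "missing or broken link: $f"
            rc=1
        fi
    done
    return $rc
}

# Example call for this exercise (prints one line per file).
check_inputs genome.fa || echo "fix the links before running MAKER"
```

The same function accepts several file names at once, so it can be reused for any other input you link into the working directory.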

We will use the same lines of evidence as for the evidence-based annotation, but we do not need to recompute anything. The time-consuming task of computing and aligning them was already performed during the evidence-based annotation, so we simply reuse the results previously produced by MAKER.

To simplify the reuse of all the heterogeneous data generated by the previous MAKER run, we only need to link the maker_mix.gff file, which contains:

- repeatmasker.genome.gff
- repeatrunner.genome.gff
- stringtie2genome.gff
- est2genome.gff
- protein2genome.gff
- maker_annotation.gff
- ...

ln -s $maker_evidence_path/maker_evidence/maker_mix.gff
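To see which lines of evidence the linked file actually carries, you can count its features per source (the second GFF3 column). This is a minimal sketch; count_sources is a hypothetical helper name, and the source labels it prints are simply whatever the previous MAKER run wrote.

```shell
# Hypothetical helper: count GFF3 feature lines per source (column 2).
count_sources() {
    awk -F'\t' '
        /^##FASTA/ { exit }             # stop at an embedded FASTA section, if any
        /^#/ || NF < 9 { next }         # skip comments and incomplete lines
        { n[$2]++ }
        END { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$1" | sort
}

# Only run on the real file when it is present in the current directory.
if [ -f maker_mix.gff ]; then
    count_sources maker_mix.gff
fi
```

A source that you expected but that is absent from the output suggests that the corresponding evidence was not part of the previous run.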

Set the MAKER parameters

Let’s start by creating the MAKER config files:

conda activate maker
maker -CTL

You can leave the two files controlling external software behaviors and the one controlling EVM parameters untouched (I encourage you to have a look at them, and at this website for a full description of those files). However, you need to provide the proper parameters in the file called maker_opts.ctl. That file tells MAKER which input files to use and which options to apply.

To edit the maker_opts.ctl file you can use the nano text editor:

nano maker_opts.ctl
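If you prefer not to edit the file by hand, the same changes can be scripted. The sketch below assumes the usual key=value #comment layout of MAKER control files; set_ctl_opt is a hypothetical helper, not a MAKER command, and it does not handle values containing | or &.

```shell
# Hypothetical helper: set KEY=VALUE in a MAKER control file, keeping the trailing comment.
set_ctl_opt() {
    key=$1; value=$2; file=$3
    # Replace everything between "key=" and the first "#" (or the end of the line).
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Point MAKER at the two inputs linked earlier in this exercise.
if [ -f maker_opts.ctl ]; then
    set_ctl_opt genome genome.fa maker_opts.ctl
    set_ctl_opt maker_gff maker_mix.gff maker_opts.ctl
fi
```

Because the pattern is anchored at the start of the line, setting genome does not touch est2genome or protein2genome.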

This time, we do specify a reference species to be used by augustus, which will enable ab-initio gene finding.

To make sure Augustus can find your config files, run:

export AUGUSTUS_CONFIG_PATH=~/annotation/structural_annotation/augustus_training/config

With these settings, MAKER will run Augustus to predict gene loci, but will inform these predictions with the protein, transcript and EST alignments.
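Augustus looks for a species profile in a species/ subfolder of AUGUSTUS_CONFIG_PATH, so a quick check that the profile exists can save a failed run. This is a sketch under that assumption; check_augustus_species is a hypothetical helper name.

```shell
# Hypothetical helper: confirm that an Augustus species profile exists.
# Assumes the standard layout: $AUGUSTUS_CONFIG_PATH/species/<name>/<name>_parameters.cfg
check_augustus_species() {
    species=$1
    cfg="${AUGUSTUS_CONFIG_PATH:-}/species/$species/${species}_parameters.cfg"
    if [ -f "$cfg" ]; then
        echo "found: $cfg"
    else
        echo "not found: $cfg"
        return 1
    fi
}

# The name must match augustus_species in maker_opts.ctl (dmel_ plus your username here).
check_augustus_species "dmel_${USER:-unknown}" || echo "check AUGUSTUS_CONFIG_PATH and the species name"
```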

Before running MAKER, check that you have modified the maker_opts.ctl file properly.

The expected maker_opts.ctl looks like this:

#-----Genome (these are always required)  
genome=genome.fa #genome sequence (fasta file or fasta embeded in GFF3 file)  
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

...

#-----Re-annotation Using MAKER Derived GFF3
maker_gff=maker_mix.gff #MAKER derived GFF3 file
est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=1 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

...

#-----Repeat Masking (leave values blank to skip repeat masking)  
model_org= #select a model organism for RepBase masking in RepeatMasker  
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker   
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner  
rm_gff= #pre-identified repeat elements from an external GFF3 file  
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no  
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

...

#-----Gene Prediction  
snaphmm= #SNAP HMM file  
gmhmm= #GeneMark HMM file  
augustus_species=dmel_<$USER> #Augustus gene prediction species model  
fgenesh_par_file= #FGENESH parameter file  
pred_gff= #ab-initio predictions from an external GFF3 file  
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)  
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no  
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no  
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no  
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs  
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

To better understand the different parameters you can have a look here

Make sure the parameters model_org= and repeat_protein= are left empty, to avoid the heavy work of RepeatMasker.
Replace <$USER> with your username.
For your information: evidence-based prediction and ab initio prediction cannot work together! If you enable evidence-based prediction (protein2genome=1 and/or est2genome=1) together with ab initio prediction, no warning will be displayed, but only the ab initio prediction will run.
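Since MAKER gives no warning in that situation, a small check of the control file before launching can help. The sketch below reads the three relevant options and flags the combination described above; get_ctl_opt and check_prediction_mode are hypothetical helper names.

```shell
# Hypothetical helper: print the value of one option from a MAKER control file.
get_ctl_opt() {
    # Take the text after "key=", drop the trailing comment and trailing whitespace.
    sed -n "s/^$1=\([^#]*\).*/\1/p" "$2" | sed 's/[[:space:]]*$//'
}

# Hypothetical check: warn when evidence-based and ab initio prediction are both enabled.
check_prediction_mode() {
    ctl=$1
    species=$(get_ctl_opt augustus_species "$ctl")
    est=$(get_ctl_opt est2genome "$ctl")
    prot=$(get_ctl_opt protein2genome "$ctl")
    if [ -n "$species" ] && { [ "$est" = "1" ] || [ "$prot" = "1" ]; }; then
        echo "warning: augustus_species is set together with est2genome/protein2genome"
        return 1
    fi
    echo "ok: augustus_species=$species est2genome=$est protein2genome=$prot"
}

if [ -f maker_opts.ctl ]; then
    check_prediction_mode maker_opts.ctl || true
fi
```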

Run Maker with ab-initio predictions

With everything configured, run Maker as you did for the previous analysis:

mpirun -n 8 maker --ignore_nfs_tmp

Expect this run to take a little longer than before (roughly 10 to 20 minutes), since we have added another step to our analysis.

Once the run is finished, check that everything went properly. If problems are detected, launch MAKER again.

conda deactivate
conda activate gaas
gaas_maker_check_progress.sh
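As a complementary check, MAKER records the status of each sequence in a tab-separated master datastore index log (sequence name, output folder, status). The sketch below summarises the last recorded status per sequence; summarise_datastore is a hypothetical helper, and the log path assumes MAKER's default naming derived from genome.fa.

```shell
# Hypothetical helper: count sequences by their most recent status in the index log.
summarise_datastore() {
    awk -F'\t' '
        NF >= 3 { last[$1] = $3 }       # later lines overwrite earlier ones
        END {
            for (s in last) n[last[s]]++
            for (st in n) printf "%s\t%d\n", st, n[st]
        }
    ' "$1" | sort
}

# Assumed default location of the log for a genome file named genome.fa.
log=genome.maker.output/genome_master_datastore_index.log
if [ -f "$log" ]; then
    summarise_datastore "$log"
fi
```

Any sequence whose last status is not FINISHED is a candidate reason for launching MAKER again.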

Compile the output

When Maker has finished, compile the output:

gaas_maker_merge_outputs_from_datastore.pl --output maker_abinitio

And again, it is probably best to link the resulting output (maker.gff) to a result folder (the same as defined in the previous exercise e.g. dmel_results), under a descriptive name.

Inspect the gene models

To get some statistics on your annotation, you can read the maker_annotation_stat.txt file in the maker_abinitio folder, or launch this script, which works on any GFF file:

conda deactivate
conda activate agat
agat_sp_statistics.pl --gff maker_abinitio/maker_annotation.gff

How many genes do you get? Does the number differ from the maker_evidence results?
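For a quick side-by-side count without the full statistics report, you can count the gene features directly in each GFF file. This is a rough sketch; count_genes is a hypothetical helper, and the location of the evidence-based maker_annotation.gff is an assumption based on the folder used earlier in this exercise.

```shell
# Hypothetical helper: count features of type "gene" (column 3) in a GFF3 file.
count_genes() {
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Compare the ab initio run with the evidence-based run (second path is assumed).
for gff in maker_abinitio/maker_annotation.gff \
           "${maker_evidence_path:-.}/maker_evidence/maker_annotation.gff"; do
    if [ -f "$gff" ]; then
        printf '%s\t%s\n' "$gff" "$(count_genes "$gff")"
    fi
done
```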

We could now also visualise the annotation in the WebApollo genome browser (as we did for the Augustus exercises).