Setup the folder structure:
export data=/home/data/data_annotation/
export maker_evidence_path=~/annotation/structural_annotation/maker_evidence
cd
mkdir -p $maker_evidence_path
cd $maker_evidence_path
The first run of Maker will be done without ab-initio predictions. What are your expectations for the resulting gene build? In essence, we are attempting a purely evidence-based annotation, where the best protein- and EST-alignments are chosen to build the most likely gene models. The purpose of an evidence-based annotation is simple. Basically, you may try to annotate an organism where no usable ab-initio model is available. The evidence-based annotation can then be used to create a set of genes on which a new model could be trained on (using e.g. Snap or Augustus). Selection of genes for training can be based on the annotation edit distance (AED score), which says something about how great the distance between a gene model and the evidence alignments is. A score of 0.0 would essentially say that the final model is in perfect agreement with the evidence.
Let’s do this step-by-step:
Link the raw computes you want to use into your folder. The files you will need are:
ln -s $data/raw_computes/repeatmasker.genome.gff
ln -s $data/raw_computes/repeatrunner.genome.gff
In addition, you will also need the genome sequence.
ln -s $data/genome/genome.fa
Then you will also need EST and protein fasta file:
ln -s $data/evidence/est.genome.fa
ln -s $data/evidence/proteins.genome.fa
To finish you could use a transcriptome assembly. We will use a guided-assembly made with Stringtie:
ln -s $data/RNAseq/stringtie/transcript_stringtie.gff3 stringtie2genome.gff
/!\ Always check that the gff files you provides as protein or EST contains match/match_part (gff alignment type ) feature types rather than genes/transcripts (gff annotation type) otherwise MAKER will not use the contained data properly. Here we have to fix the stringtie gff file. can use the gtf righ away
conda activate agat
agat_sp_alignment_output_style.pl --gff stringtie2genome.gff -o stringtie2genome.ok.gff
You should now have 2 repeat files, 1 EST file, 1 protein file, 1 transcript file, and the genome sequence in the working directory.
conda deactivate
conda activate maker
maker -CTL
What files did it create?
You can leave the two files controlling external software behaviors and the one controlling evm parameters untouched (I will encourage you to have a look at them and to look at this website to have a full description of those files). However, you need to provide the proper parameters in the file called maker_opts.ctl. Indeed, in that file, we tell MAKER what are the files to use, and the options to apply.
To edit the maker_opts.ctl file you can use the nano text editor:
nano maker_opts.ctl
In the maker_opts.ctl you will set:
You can list multiple files in one field by separating their names by a comma ‘,’.
This time, we do not specify a reference species to be used by augustus, which will disable ab-initio gene finding. Instead we set:
protein2genome=1
est2genome=1
This will enable gene building directly from the evidence alignments.
Be sure to have deactivated the parameters model_org= # and repeat_protein= # to avoid the heavy work of repeatmasker.
Before running MAKER check you have modified the maker_opts.ctl file properly.
#-----Genome (these are always required)
genome=genome.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
...
#-----EST Evidence (for best results provide a file for at least one)
est=est.genome.fa #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff=stringtie2genome.ok.gff #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
...
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=proteins.genome.fa #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
...
#-----Repeat Masking (leave values blank to skip repeat masking)<br/>
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=repeatmasker.genome.gff,repeatrunner.genome.gff #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
...
#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
...
If your maker_opts.ctl is configured correctly, you should be able to run maker:
mpirun -n 8 maker --ignore_nfs_tmp
This will start Maker on 8 cores, if everything is configured correctly. This will take a little while and process a lot of output to the screen. Luckily, much of the heavy work - such as repeat masking - are already done, so the total running time is quite manageable, even on a small number of cores. (it should take around 5-10 minutes to run)
Once the run is finished, check that everything went properly. If problems are detected, launch MAKER again.
conda deactivate
conda activate gaas
gaas_maker_check_progress.sh
Here you can find details about the MAKER output.
Once Maker is finished, compile the annotation:
gaas_maker_merge_outputs_from_datastore.pl --output maker_evidence
We have specified a name for the output directory since we will be creating more than one annotation and need to be able to tell them apart.
This should create a maker_evidence folder containing all computed data including maker_annotation.gff which is the maker annotation file and maker_annotation.proteins.fasta which is the protein fasta file of this annotation. Those two files are the most important outputs from this analysis.
=> You could sym-link the maker_annotation.gff and maker_annotation.proteins.fasta files to another folder called e.g. dmel_results, so everything is in the same place in the end. Just make sure to call the links with specific names, since any maker output will be called similarly.
To get some statistics of your annotation you can read the maker_annotation_stat.txt file from the maker_evidence folder or launch the agat_sp_statistics.pl (you used it before) that work on any gff file :
conda deactivate conda activate agat agat_sp_statistics.pl --gff maker_evidence/maker_annotation.gff
How many genes do you get? what kind of statistics do you see?
We could now also visualise the annotation in the Webapollo genome browser (like we did for the augustus exercises).