Setup the folder structure:
export data=/home/data/data_annotation/
export augustus_training_path=~/annotation/structural_annotation/augustus_training
export maker_evidence_path=~/annotation/structural_annotation/maker_evidence
cd
mkdir -p $augustus_training_path
cd $augustus_training_path
From the maker run evidence based, we can train our ab-initio predictors and then use them for the second run of annotation. You will need a set of genomic sequences with gene structures (sequence coordinates of starts and ends of exons and genes) and the most important part is to select the right set of genes. In many cases, or as a first step towards modelling complete genes, it is sufficient to have only the coding parts of the gene structure (CDS). We will only train augustus today as it is one the best ab-initio predictor and one of the hardest to train. Maker also support SNAP (Works good, easy to train, not as good as others ab-initio especially on longer intron genomes), GeneMark (Self training, no hints, buggy, not good for fragmented genomes or long introns), FGENESH (Works great, costs money even for training) and now EVM.
You will need to symlink the evidence-based annotation (the gff annotation file from the first run of maker) and the genome fasta sequence.
ln -s $maker_evidence_path/maker_evidence/maker_annotation.gff maker_evidence.gff
ln -s $data/genome/genome.fa
mkdir filter
mkdir protein
mkdir nonredundant
mkdir blast_recursive
mkdir gff2genbank
conda deactivate
conda activate agat
agat_sp_separate_by_record_type.pl -g maker_evidence.gff -o maker_results_noAbinitio_clean
ln -s maker_results_noAbinitio_clean/mrna.gff
agat_sp_filter_feature_by_attribute_value.pl --gff mrna.gff --value 0.3 -a _AED -t ">" -o filter/codingGeneFeatures.filter.gff
agat_sp_keep_longest_isoform.pl -f filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff
agat_sp_filter_incomplete_gene_coding_models.pl --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff
agat_sp_filter_by_locus_distance.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff
agat_sp_extract_sequences.pl -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fa -f genome.fa -p -cfs -cis -ct 1 --g filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff
conda deactivate
conda activate blast
makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fa -dbtype prot -out protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins
blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fa -db protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive
conda deactivate
conda activate agat
agat_sp_filter_by_mrnaBlastValue.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff
conda deactivate
conda activate bioinfo
/opt/anaconda3/envs/bioinfo/scripts/gff2gbSmallDNA.pl nonredundant/codingGeneFeatures.nr.gff $data/genome/genome.fa 500 gff2genbank/codingGeneFeatures.nr.gbk
/opt/anaconda3/envs/bioinfo/scripts/randomSplit.pl gff2genbank/codingGeneFeatures.nr.gbk 100
What happened? how can you solve it? what might be the consequences of it?
/opt/anaconda3/envs/bioinfo/scripts/randomSplit.pl gff2genbank/codingGeneFeatures.nr.gbk 20
Now that you have created a set of gene to train augustus, let’s train it!
You need to copy the config folder of augustus in your folder (to be able to write in it)
cp -r /opt/anaconda3/envs/bioinfo/config/ .
export AUGUSTUS_CONFIG_PATH=~/annotation/structural_annotation/augustus_training/config
Replace <$USER> by your username.
/opt/anaconda3/envs/bioinfo/scripts/new_species.pl --species=dmel_$USER
etraining --species=dmel_$USER gff2genbank/codingGeneFeatures.nr.gbk.train
augustus --species=dmel_$USER gff2genbank/codingGeneFeatures.nr.gbk.test | tee run.log