For this exercise you need to be logged in to the VM.
Setup the folder structure:
cd ~/annotation/ export data=/home/data/byod/Annotation/data export functional_annotation_path=~/annotation/functional_annotation export structural_annotation_path=~/annotation/structural_annotation mkdir -p $functional_annotation_path
You need to have the right to write in the folder blastdb/ for the next exercises so you will copy the folder:
cp -r /home/data/byod/Annotation/data/blastdb . chmod +w blastdb/uniprot_dmel/
Functional annotation is the process during which we try to put names to faces - what do genes that we have annotated and curated? Basically all existing approaches accomplish this by means of similarity. If a translation product has strong similarity to a protein that has previously been assigned a function, the function in this newly annotated transcript is probably the same. Of course, this thinking is a bit problematic (where do other functional annotations come from…?) and the method will break down the more distant a newly annotated genome is to existing reference data. A complementary strategy is to scan for more limited similarity - specifically to look for the motifs of functionally characterized protein domains. It doesn’t directly tell you what the protein is doing exactly, but it can provide some first indication.
In this exercise we will use an approach that combines the search for full-sequence similarity by means of ‘Blast’ against large public databases with more targeted characterization of functional elements through the InterproScan pipeline. Interproscan is a meta-search engine that can compare protein queries against numerous databases. The output from Blast and Interproscan can then be used to add some information to our annotation.
Since we do not wish to spend too much time on this, we will again limit our analysis to chromosome 4. It is also probably best to choose the analysis with ab-initio predictions enabled (unless you found the other build to be more convincing). Maker produces a protein fasta file (called “annotations.proteins.fa”) together with the annotation and this file should be located in your maker directory.
Move in the proper folder:
Now you can link the annotation you have done during the structural annotation or use the one provided below. The command will looks like:
ln -s ~/annotation/structural_annotation/assessment/complement/maker_abinitio_cplt_by_evidence.gff maker_final.gff
ln -s $data/annotation/maker_with_abinitio.gff maker_final.gff
ln -s $data/genome/genome.fa
In order to do the functional annotation, you need to retrieve the proteins from the structural annotation.
gff3_sp_extract_sequences.pl -g maker_final.gff -f genome.fa -p -o maker_final_prot.fa
Interproscan combines a number of searches for conserved motifs and curated data sets of protein clusters etc. This step may take fairly long time. It is recommended to paralellize it for huge amount of data by doing analysis of chunks of tens or hundreds proteins.
InterproScan can be run through a website or from the command line on a linux server. Here we are interested in the command line approach. Interproscan allows to look up pathways, families, domains, sites, repeats, structural domains and other sequence features.
Launch Interproscan with the option -h if you want have a look about all the parameters.
- The ‘-app’ option allows defining the database used. Here we will use the PfamA,ProDom and SuperFamily databases.
- Interproscan uses an internal database that related entries in public databases to established GO terms. By running the ‘-goterms’ option, we can add this information to our data set.
- If you enable the InterPro lookup (‘-iprlookup’), you can also get the InterPro identifier corresponding to each motif retrieved: for example, the same motif is known as PF01623 in Pfam and as IPR002568 in InterPro.
- The option ‘-pa’ provides mappings from matches to pathway information (MetaCyc,KEGG,Reactome).
interproscan.sh -i maker_final_prot.fa -t p -dp -pa -appl Pfam,ProDom-2006.1,SuperFamily-1.75 --goterms --iprlookup
This analyse will fail.
Why? What is the error message displaying?
If you did not have a look at the maker_final.faa, please have look and find a solution to make interproscan run.
gff3_sp_extract_sequences.pl -g maker_final.gff -f genome.fa --cfs --cis -p -o maker_final_prot.fa
sed -e 's/*//g' maker_final_prot.fa > maker_final_prot_fixed.fa
Rerun the previous interproscan command.
The analysis should take 2-3 secs per protein request - depending on how many sequences you have submitted, you can make a fairly deducted guess regarding the running time.
You will obtain 3 result files with the following extension ‘.gff3’, ‘.tsv’ and ‘.xml’. Explanation of these output are available »here«.
Next, you could write scripts of your own to merge interproscan output into your annotation. Incidentally, Maker comes with utility scripts that can take InterProscan output and add it to a Maker annotation file (you need to load maker).
- ipr_update_gff: adds searchable tags to the gene and mRNA features in the GFF3 files.
- iprscan2gff3: adds physical viewable features for domains that can be displayed in JBrowse, Gbrowse, and Web Apollo.
We also created a script that can do the merging between the structural annotation and the interpro results :
gff3_sp_manage_functional_annotation.pl --gff maker_final.gff -i maker_final_prot.fa.tsv -o maker_final.interpro
Where a match is found, the new file will now include features called Dbxref and/or Ontology_term in the gene and transcript feature field (9th column). The improved annotation is the gff file inside the maker_final.interpro folder.
Blast searches provide an indication about potential homology to known proteins. A ‘full’ Blast analysis can run for several days and consume several GB of Ram. Consequently, for a huge amount of data it is recommended to parallelize this step doing analysis of chunks of tens or hundreds proteins. This approach can be used to give a name to the genes and a function to the transcripts.
To run Blast on your data, use the Ncbi Blast+ package against a Drosophila-specific database (included in the folder we have provided for you, under ~/annotation/blastdb/uniprot_dmel/uniprot_dmel.fa) - of course, any other NCBI database would also work:
/etc/ncbi-blast-2.9.0+-src/c++/ReleaseMT/bin/blastp -db ~/annotation/blastdb/uniprot_dmel/uniprot_dmel.fa -query maker_final_prot.fa -outfmt 6 -out blast.out -num_threads 4
Against the Drosophila-specific database, the blast search takes about 2 secs per protein request - depending on how many sequences you have submitted, you can make a fairly deducted guess regarding the running time.
Now you should be able to use the following script:
gff3_sp_manage_functional_annotation.pl -f maker_final.interpro/maker_final.gff -b blast.out --db ~/annotation/blastdb/uniprot_dmel/uniprot_dmel.fa -o maker_final.interpro.blast
That will add the name attribute to the “gene” feature and the description attribute (corresponding to the product information) to the “mRNA” feature into you annotation file. The improved annotation is the gff file inside the maker_final.interpro.blast folder.
How many genes do not have any names?
grep -P "\tgene" maker_final.interpro.blast/maker_final.gff | grep -v "Name" | wc -l
Choose one gene in your gff file with annotation with at least one GO terms (they look like GO:XXXX) and some domains annotated and have a look a them in the Gene Ontology webpage and Interproscan webpage.
The purpose is to modify the ID value by something more convenient (i.e FLYG00000001 instead of maker-4-exonerate_protein2genome-gene-8.41).
gff3_sp_manage_functional_annotation.pl -f maker_final.interpro.blast/maker_final.gff --ID FLY -o maker_final.interpro.blast.ID
The improved annotation is the gff file inside the maker_final.interpro.blast.ID folder.
For a nice display of a gff file within Webapollo some modification might be needed. As example the attribute product is not displayed in Webapollo, whereas renaming it description will work out.
gff3_sp_webApollo_compliant.pl -gff maker_final.interpro.blast.ID/maker_final.gff -o final_annotation.gff
Transfer the final_annotation.gff file to your computer using scp in a new terminal:
scp -P <port-number> <username>@<host>:~/annotation/functional_annotation/final_annotation.gff .
Load the file in into the genome portal called drosophila_melanogaster_chr4 in the Webapollo genome browser available at the address http://annotation-prod.scilifelab.se:8080/NBIS_course/. Here find the WebApollo instruction
Wonderfull ! isn’t it ?
Because of Makers’ compatibility with GMOD standards, a functional annotation created in one or both of this way can be loaded into e.g. WebApollo and will save annotators a lot of work when e.g. adding meta data to transcript models.