Copy from web.archive.org.
GFF (General Feature Format) Specifications Document
2000-9-29 The default version for GFF files is now Version 2. This document has been changed to show version 2 as default, with version one alternatives shown where appropriate. The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially semicolon-separated .ace format) for any additional material on the line, following the mandatory fields. Version 2 also allows '.' as a score, for features for which there is no score. Dumping in version 2 format is implemented in ACEDB.
Essentially all current approaches to feature finding in higher organisms use a variety of recognition methods that give scores to likely signals (starts, splice sites, stops, motifs, etc.) or to extended regions (exons, introns, protein domains etc.), and then combine these to give complete gene, RNA transcript or protein structures. Normally the combination step is done in the same program as the feature detection, often using dynamic programming methods. To enable these processes to be decoupled, a format called GFF ('Gene-Finding Format' or 'General Feature Format') was proposed as a protocol for the transfer of feature information. It is now possible to take features from an outside source and add them in to an existing program, or in the extreme to write a dynamic programming system which only took external features.
GFF allows people to develop features and have them tested without having to maintain a complete feature-finding system. Equally, it would help those developing and applying integrated gene-finding programs to test new feature detectors developed by others, or even by themselves.
We want the GFF format to be easy to parse and process by a variety of programs in different languages. For example, it would be useful if Unix tools like grep and sort, and simple perl and awk scripts, could easily extract information from the file. For these reasons, for the primary format, we propose a record-based structure, where each feature is described on a single line, and line order is not relevant.
We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence. Systems such as Acedb, Genotator etc. that have much richer data representation semantics have been designed for that purpose. The disadvantages of using their formats (or other richer formats such as ASN.1) for data exchange are that (1) they require more complexity in parsing/processing, and (2) there is little hope of achieving consensus on how to capture all information. GFF is intentionally aiming for a low common denominator.
With the changes taking place to version 2 of the format, we also allow for feature sets to be defined over RNA and Protein sequences, as well as genomic DNA. This is used for example by the EMBOSS project to provide standard format output for all features as an option. In this case the <strand> and <frame> fields should be set to '.'. To assist this transition in specification, a new #Type Meta-Comment has been added.
Here are some example records:
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
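Because each feature sits on one line, records like those above can be read with very little code. The following is a minimal sketch, not a reference parser. The field names follow the usual GFF column order (seqname, source, feature, start, end, score, strand, frame, then optional attributes). The names `GffRecord` and `parse_gff_line` are hypothetical, and the fallback to whitespace splitting is an assumption made so that the space-separated examples in this copy also parse.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is '.'
    strand: str
    frame: str
    attributes: str         # raw text after the eight mandatory fields


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one feature line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    # Assumption: prefer tab-separated fields, otherwise split on whitespace.
    parts = line.split("\t", 8) if "\t" in line else line.split(None, 8)
    if len(parts) < 8:
        raise ValueError("expected at least 8 fields, got %d" % len(parts))
    seqname, source, feature, start, end, score, strand, frame = parts[:8]
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        parts[8] if len(parts) > 8 else "",
    )
```

For example, `parse_gff_line("SEQ1 EMBL atg 103 105 . + 0")` yields a record whose `score` is `None`, since version 2 allows '.' where there is no score.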
We would like to enforce a standard nomenclature for
common GFF features. This does not forbid the use of other features;
it means only that if a feature is clearly described in the standard
list, the standard label should be used. For this standard table
we propose to fall back on the international public standards for genomic
database feature annotation, specifically the
DDBJ/EMBL/GenBank feature table documentation.
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"
The semantics of tags in attribute field tag-value pairs have
intentionally not been formalized. Two useful guidelines are to use
DDBJ/EMBL/GenBank feature 'qualifiers' (see the
DDBJ/EMBL/GenBank feature table documentation), or the features that
ACEDB generates when it dumps GFF.
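Although the meaning of tags is left open, the semicolon-separated tag-value layout itself is easy to split. The sketch below is one possible reading, with stated assumptions: the function name `parse_attributes` is hypothetical, values in double quotes are treated as single tokens that may contain semicolons or spaces, and escaped quotes inside values are not handled.

```python
import re


def parse_attributes(text):
    """Split 'tag v1 v2 ; tag v1' into a list of (tag, [values]) pairs."""
    # Step 1: cut the text at semicolons that are not inside double quotes.
    chunks, buf, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and not in_quotes:
            chunks.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    chunks.append("".join(buf))

    # Step 2: tokenise each chunk; a quoted string counts as one token.
    pairs = []
    for chunk in chunks:
        tokens = re.findall(r'"[^"]*"|[^\s"]+', chunk)
        if not tokens:
            continue  # empty chunk, e.g. after a trailing ';'
        values = [t[1:-1] if t.startswith('"') else t for t in tokens[1:]]
        pairs.append((tokens[0], values))
    return pairs
```

Applied to the BLASTX example above, this gives a `Target` tag with values `HBA_HUMAN`, `11`, `55` and an `E_value` tag with the value `0.0003`. A list of pairs is used rather than a dictionary so that a tag appearing more than once on a line is preserved.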
Version 1 note: In version 1 the attribute field was called the
group field, with the following specification:
An optional string-valued field that can be used as a name to
group together a set of records. Typical uses might be to group the
introns and exons in one gene prediction (or experimentally verified
gene structure), or to group multiple regions of match to another
sequence, such as an EST or a protein.
Version 1 note: In version 1 each string had to be under 256 characters long, and the whole line had to be under 32k long. This was to make things easier for guaranteed conforming parsers, but seemed unnecessary given modern languages.
Current proposed ## lines are:
##gff-version 2
##source-version <source> <version text>
##date <date>
##Type <type> [<seqname>]
##DNA <seqname>
##acggctcggattggcgctggatgatagatcagacgac
##...
##end-DNA
##RNA <seqname>
##acggcucggauuggcgcuggaugauagaucagacgac
##...
##end-RNA
##Protein <seqname>
##MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
##...
##end-Protein
##sequence-region <seqname> <start> <end>
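The ## lines listed above can be collected in a single pass over a file. The following sketch is illustrative only: the function name `read_meta` and the shape of the returned dictionary are hypothetical, only a few of the ## line types are interpreted, and embedded sequence blocks are assumed to run from the opening ##DNA, ##RNA or ##Protein line to the matching ##end- line with one '##'-prefixed chunk per line.

```python
def read_meta(lines):
    """Collect selected '##' meta-comments from an iterable of lines."""
    meta = {"version": None, "regions": [], "sequences": {}}
    open_block = None  # (kind, seqname, chunks) while inside a sequence block
    for raw in lines:
        line = raw.strip()
        if not line.startswith("##"):
            continue  # feature lines and ordinary '#' comments are skipped
        body = line[2:].strip()
        if open_block is not None:
            kind, name, chunks = open_block
            if body == "end-" + kind:
                meta["sequences"][name] = "".join(chunks)
                open_block = None
            else:
                chunks.append(body)
            continue
        fields = body.split()
        if not fields:
            continue
        key, args = fields[0], fields[1:]
        if key == "gff-version" and args:
            meta["version"] = int(args[0])
        elif key == "sequence-region" and len(args) == 3:
            meta["regions"].append((args[0], int(args[1]), int(args[2])))
        elif key in ("DNA", "RNA", "Protein") and args:
            open_block = (key, args[0], [])
    return meta
```

A caller could, for instance, check `read_meta(open(path))["version"]` before deciding whether to expect version 1 group fields or version 2 tag-value attributes (here `path` stands for whatever file is being read).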
In the example given above the feature "splice5" indicates that there is a candidate 5' splice site between positions 172 and 173. The "sp5-20" feature is a prediction based on a window of 20 bp for the same splice site. To use either of these, you must know the position within the feature of the predicted splice site. This only needs to be given once, possibly in comments at the head of the file, or in a separate document.
Another example is the scoring scheme; we ourselves would like the score to be a log-odds likelihood score in bits to a defined null model, but that is not required, because different methods take different approaches. Avoiding a prespecified feature set also leaves open the possibility for GFF to be used for new feature types, such as CpG islands, hypersensitive sites, promoter/enhancer elements, etc.
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 54 ; E_value 0.0003
The proposed tag-value structure for gapped alignments is
Align <seq_start> <target_start> [<length>] ;
to define each ungapped block in the alignment, with multiple Align tags to give a full gapped alignment. The <length> field is optional because in its absence a block is presumed to extend until it reaches the next specified block, or the end of the complete similarity. This corresponds to the standard case with alignments that they don't have simultaneous gaps on both strands. For example, for the above HBA_HUMAN similarity, the Align information could be
Align 101 11 ; Align 179 36 ;
which leaves the DNA triplet from 176 to 178 aligned to a gap in the protein sequence.
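The rule for omitted lengths can be made concrete with a short sketch that expands Align pairs into explicit ungapped blocks. Several things here are assumptions rather than part of the proposal: the function name `infer_blocks` is hypothetical, only forward-strand coordinates are handled, explicit <length> values are ignored, and a DNA-to-protein alignment is assumed to consume 3 sequence positions per target position (the `ratio` parameter).

```python
def infer_blocks(seq_range, target_range, aligns, ratio=3):
    """Expand Align (seq_start, target_start) pairs into ungapped blocks.

    seq_range and target_range are the inclusive (start, end) coordinates of
    the whole similarity. Each block runs until it would reach the next block
    in either sequence, or the end of the similarity for the last block.
    Returns a list of (seq_start, seq_end, target_start, target_end).
    """
    blocks = []
    for idx, (s, t) in enumerate(aligns):
        if idx + 1 < len(aligns):
            next_s, next_t = aligns[idx + 1]
        else:
            next_s, next_t = seq_range[1] + 1, target_range[1] + 1
        # Length in target units: limited by whichever sequence runs out first.
        n = min((next_s - s) // ratio, next_t - t)
        blocks.append((s, s + n * ratio - 1, t, t + n - 1))
    return blocks
```

For the HBA_HUMAN example, `infer_blocks((101, 235), (11, 54), [(101, 11), (179, 36)])` gives the blocks 101-175 against 11-35 and 179-235 against 36-54, so positions 176 to 178 fall in the gap, in agreement with the text.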
Let us assume that the score of an exon can be decomposed into three parts: the score of the 5' splice site, the score of the 3' splice site, and the sum of the scores of all the codons in between. In such a case it can be much more efficient to use the GFF format to report separate scores for the splice site sensors and for the individual codons in all three (or six, including reverse strand) frames, and let the program that interprets this file assemble the exon scores. The exon scores can be calculated efficiently by first creating three arrays, each of which contains in its [i]th position a value A[i] that is the partial sum of the codon scores in a particular frame for the entire sequence from position 1 up to position i. Then for any positions i < j, the sum of the scores of all codons from i to j can be obtained as A[j] - A[i]. Using these arrays, along with the candidate splice site scores, a very large number of scores for overlapping exons are implicitly defined in a data structure that takes only linear space with respect to the number of positions in the sequence, and such that the score for each exon can be retrieved in constant time.
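The partial-sum idea can be sketched as follows. This is an illustration under stated assumptions, not a prescribed recipe: the function names are hypothetical, each codon is identified by its 1-based start position p, it is assigned to frame (p - 1) % 3, and its score is credited at its last base p + 2. With that convention A[j] - A[i] sums the codons whose last base lies in positions i+1 to j, so the caller passes the position just before the first codon as i.

```python
from itertools import accumulate


def frame_prefix_sums(seq_len, codon_scores):
    """Return three partial-sum arrays, one per frame.

    codon_scores maps a codon's 1-based start position to its score.
    Array f holds, at index i, the total score of frame-f codons that end
    at or before position i (index 0 is always 0).
    """
    per_pos = [[0.0] * (seq_len + 1) for _ in range(3)]
    for p, score in codon_scores.items():
        if p + 2 <= seq_len:  # ignore codons running off the sequence
            per_pos[(p - 1) % 3][p + 2] += score
    return [list(accumulate(row)) for row in per_pos]


def codon_sum(sums, frame, i, j):
    """Sum of frame codons ending in positions i+1..j, in constant time."""
    return sums[frame][j] - sums[frame][i]


def exon_score(sums, frame, i, j, splice5_score, splice3_score):
    """Exon score as the two splice site scores plus the codon sum between."""
    return splice5_score + codon_sum(sums, frame, i, j) + splice3_score
```

The arrays take space linear in the sequence length, and each call to `exon_score` is a constant number of lookups, which is the property the paragraph above relies on.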
When the GFF format is used to transmit scores that can be summed for efficient retrieval as in the case of the codon scores above, we ask that the provider of the scores indicate that these scores are summable in this manner, and provide a recipe for calculating the scores that are to be derived from these summable scores, such as the exon scores described above. We place no limit on the complexity of this recipe, nor do we provide a standard protocol for such assembly, other than providing examples. It behooves the sensor score provider to keep the recipe simple enough that others can easily implement it.
There is a mailing list to which you can send comments, enquiries, complaints etc. about GFF. If you want to be added to the mailing list, please send mail to Majordomo@sanger.ac.uk with the following command in the body of your email message:
subscribe gff-list
000929 rd: make version 2 default and propose Align tag-value syntax
0003022 rbsk: small clarification to #comment rules
991711 rbsk: (overdue changes as per September '99 gff-list commentaries)
990816 rbsk: standard list of features and group tags (first attempt at clarification)
990317 rbsk:
990226 rbsk: incorporated amendments to the version 2 specification as follows:
981216 rd: introduced version 2 changes.
980909 ihh: fixed some small things and put this page on the Sanger GFF site.
971113 rd: added section on mailing list.
971113 rd: added extra "source" field as discussed at Newton Institute meeting 971029. There are two main reasons. First, to help prevent name space clashes -- each program would have their own source designation. Second, to help reuse feature names, so one could have "exon" for exon predictions from each prediction program.
971108 rd: added ## line proposals - moved them into main text 971113.
971028 rd: I added the section about name space.
971028 rd: I considered switching from start-end notation to start-length notation, on the suggestion of Anders Krogh. This seems nicer in many cases, but is a debatable point. I then switched back!
971028 rd: We also now allow extra text after <group> without a comment character, because this immediately proved useful.
971028 rd: I changed the comment initiator to '#' from '//' because a single symbol is easier for simple parsers.
GFF Protocol Specification initially proposed by: Richard Durbin and David Haussler
with amendments proposed by: Lincoln Stein, Suzanna Lewis, Anders Krogh and others.
last modified 04-Dec-2000, 01:01 PM | webmaster@sanger.ac.uk