GFF (General Feature Format) Specifications Document
Essentially all current approaches to feature finding in higher organisms use a variety of recognition methods that give scores to likely signals (starts, splice sites, stops, motifs, etc.) or to extended regions (exons, introns, protein domains, etc.), and then combine these to give complete gene, RNA transcript or protein structures. Normally the combination step is done in the same program as the feature detection, often using dynamic programming methods. To enable these processes to be decoupled, a format called GFF ('Gene-Finding Format') was proposed as a protocol for the transfer of feature information. It is now possible to take features from an outside source and add them to an existing program or, in the extreme, to write a dynamic programming system that takes only external features.
17/11/99 Advisory: 'Gene Feature Finding' Version 2 format has now been conceptually
generalized to be the 'General Feature Format' to accommodate RNA and Protein feature files,
while retaining the same acronym, overall syntax and semantics, with the
notable exception of ignoring the <strand> and <frame>
fields (setting them to '.') in RNA and protein features. To assist this transition
in specification, a new #Type Meta-Comment is now proposed.
GFF allows people to develop features and have them tested without having to maintain a complete feature-finding system. Equally, it would help those developing and applying integrated gene-finding programs to test new feature detectors developed by others, or even by themselves.
We want the GFF format to be easy to parse and process by a variety of programs in different languages. For example, it would be useful if Unix tools like grep and sort, and simple perl and awk scripts, could easily extract information from the file. For these reasons, we propose a record-based structure for the primary format, where each feature is described on a single line and line order is not relevant.
We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence. Systems such as Acedb, Genotator etc., which have much richer data representation semantics, have been designed for that purpose. The disadvantages of using their formats for data exchange (or other richer formats such as ASN.1) are that (1) they require more complexity in parsing/processing, and (2) there is little hope of achieving consensus on how to capture all information. GFF intentionally aims for a low common denominator.
Here are some example records:
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
ALERT 98/12/16: Following discussions with Lincoln Stein and others, we propose the Version 2 format of GFF, as specifically described in this document. The Version 2 specification has not yet been frozen and is presented as a "work-in-progress" at this time, open to user feedback on the proposed changes (plus other suggestions for improvement). The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially .ace format) for any additional material on the line, following the mandatory fields. We also now allow '.' as a score, for features for which there is no score. Dumping in version 2 format is implemented in ACEDB. Changes in the remainder of this document are described and marked as (Version 2 changes).
(Version 2 change: Standard Table of Features -
we would like to enforce a standard nomenclature for
common GFF features. This does not forbid the use of other features,
rather, just that if the feature is obviously described in the standard
list, that the standard label should be used. For this standard table
we propose to fall back on the international public standards for genomic
database feature annotation, specifically, the
DDBJ/EMBL/GenBank feature table).
Standard Table of Attribute Tag Identifiers
The semantics of tags in attribute field tag-value pairs has not yet been
completely formalized; however, a useful constraint is that they be
equivalent, where appropriate, to DDBJ/EMBL/GenBank feature 'qualifiers'
of given features (see
EMBL feature descriptions).
In addition to these, ACEDB typically dumps GFF with specific tag-value pairs for given feature types. These tag-value pairs may be considered 'standard' GFF tag-values with respect to ACEDB databases. (rbsk: These will be summarized in a table here in the near future)
Version 2 change: In version 2, the optional [group] field is renamed to [attribute] (09/99) and must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc.) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but ACEDB is not required). Examples of these would be:
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"
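The tag-value syntax described above can be read with a small tokenizer. The following Python sketch is illustrative only: the function name `parse_attributes` is hypothetical, values are kept as strings rather than converted to numbers, and the handling of escapes other than '\n' and '\t' is an assumption, since this document does not list them exhaustively. It is tolerant rather than validating (for example, it does not report unterminated quotes).

```python
import re

# One token is a double-quoted string (backslash escapes allowed),
# a semicolon separator, or a run of other non-whitespace characters.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(;)|([^\s;"]+)')
_TAG = re.compile(r'[A-Za-z][A-Za-z0-9_]*')
# Assumption: only these C-style escapes are decoded; others keep the
# escaped character as-is.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def _unescape(text):
    """Decode backslash escapes inside a quoted free-text value."""
    return re.sub(r"\\(.)",
                  lambda m: _ESCAPES.get(m.group(1), m.group(1)), text)


def parse_attributes(field):
    """Map each tag to the list of values that follow it."""
    result = {}
    values = None  # value list of the tag currently being read
    for match in _TOKEN.finditer(field):
        quoted, semicolon, bare = match.groups()
        if semicolon:
            values = None  # the next token must be a new tag
        elif values is None:
            if bare is None or not _TAG.fullmatch(bare):
                raise ValueError("expected a tag, got %r" % match.group(0))
            values = result.setdefault(bare, [])
        elif quoted is not None:
            values.append(_unescape(quoted))
        else:
            values.append(bare)
    return result
```

For the first example record, `parse_attributes('Target "HBA_HUMAN" 11 55 ; E_value 0.0003')` yields a `Target` entry with three values and an `E_value` entry with one.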
All of the above described fields should be separated by TAB characters ('\t'). Version 2 note: previous Version 2 permission to use arbitrary whitespace as field delimiters is now revoked! (99/02/26)
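As an illustration of the tab-delimited layout, the following Python sketch splits one record into its fields. The field names and their order (seqname, source, feature, start, end, score, strand, frame, attribute) are inferred from the example records and from the field names mentioned in this document; the names `GffRecord` and `parse_record` are hypothetical and not part of the specification.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the field is '.'
    strand: str             # '+', '-' or '.'
    frame: str              # '0', '1', '2' or '.'
    attribute: str          # raw text of the optional ninth field


def parse_record(line: str) -> GffRecord:
    """Split one feature line on TAB characters only."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    # Any fields after the ninth are treated as trailing comments and ignored.
    attribute = fields[8] if len(fields) > 8 else ""
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, attribute)
```

Splitting on tabs alone, rather than on arbitrary whitespace, keeps the spaces inside the [attribute] field intact.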
We also permit extra information to be given on the line following the attribute field without a '#' character (Version 2 change: this extra information must be delimited by the '#' comment delimiter OR by another tab field delimiter character, following any and all [attribute] field tag-value pairs).
This allows extra method-specific information to be transferred with the line. However, we discourage overuse of this feature: better to find a way to do it with more true feature lines, and perhaps groups.
(Version 2 change: we gave in and defined a structured way of passing
additional information, as described above under [attribute]. But the
sentiment of this paragraph still applies - don't overuse the
tag-value syntax. The use of tag-value pairs (with whitespace) renders problematic the parsing of
Version 1 style comments (following the attribute field, without a '#' character), so in Version 2,
such [attribute] trailing comments must either start with the "#" as noted above,
or with at least one additional tab character. Moreover, '#' characters embedded
within quoted text string values of [attribute] tag-values
should not be parsed as the beginning of a comment.)
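One way to honour the rule about '#' characters inside quoted values is to scan the attribute text while tracking whether the scanner is inside double quotes. The Python sketch below is a minimal illustration; `split_comment` is a hypothetical helper name, and the tags in the usage example are made up.

```python
def split_comment(text):
    """Return (content, comment); comment is '' when there is none.

    A '#' starts a comment only when it lies outside a double-quoted value.
    """
    in_quotes = False
    escaped = False
    for i, ch in enumerate(text):
        if escaped:
            escaped = False          # character after a backslash is literal
        elif ch == "\\" and in_quotes:
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            return text[:i].rstrip(), text[i + 1:].strip()
    return text, ""
```

For example, `split_comment('Note "see #3" ; Id 7 # trailing remark')` keeps the quoted '#' in the content and returns `'trailing remark'` as the comment.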
Current proposed ## lines are:
In the example given above the feature "splice5" indicates that there
is a candidate 5' splice site between positions 172 and 173. The
"sp5-20" feature is a prediction based on a window of 20 bp for the
same splice site. To use either of these, you must know the position
within the feature of the predicted splice site. This only needs to
be given once, possibly in comments at the head of the file, or in a
separate document.
Another example is the scoring scheme; we ourselves would like the
score to be a log-odds likelihood score in bits to a defined null
model, but that is not required, because different methods take
different approaches.
Avoiding a prespecified feature set also leaves open the possibility
for GFF to be used for new feature types, such as CpG islands,
hypersensitive sites, promoter/enhancer elements, etc.
Version 2 change: In version 2 this has been formalised using
the tag Target which expects to be followed by the name of the target,
followed (optionally) by start and end point in the target as
integers, as in
Let us assume that the score of an exon can be decomposed into three
parts: the score of the 5' splice site, the score of the 3' splice
site, and the sum of the scores of all the codons in between. In such
a case it can be much more efficient to use the GFF format to report
separate scores for the splice site sensors and for the individual
codons in all three (or six, including reverse strand) frames, and let
the program that interprets this file assemble the exon scores. The
exon scores can be calculated efficiently by first creating three
arrays, each of which contains in its [i]th position a value A[i] that
is the partial sum of the codon scores in a particular frame for the
entire sequence from position 1 up to position i. Then for any
positions i < j, the sum of the scores of all codons from i to j can
be obtained as A[j] - A[i]. Using these arrays, along with the
candidate splice site scores, a very large number of scores for
overlapping exons are implicitly defined in a data structure that
takes only linear space with respect to the number of positions in the
sequence, and such that the score for each exon can be retrieved in
constant time.
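The partial-sum idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the function names are hypothetical, one array is built per frame, and the boundary convention is made explicit by setting A[0] = 0, so that A[j] - A[i] covers positions i+1 through j.

```python
from itertools import accumulate


def cumulative_scores(codon_scores):
    """Build A for one frame.

    codon_scores[k] is the score attached to sequence position k + 1
    (0.0 where that frame contributes nothing). A[0] is 0 and A[i] is
    the sum of the scores for positions 1..i.
    """
    return [0.0] + list(accumulate(codon_scores))


def range_score(A, i, j):
    """Sum of the scores for positions i+1..j, in constant time."""
    return A[j] - A[i]


def exon_score(A, splice5_score, splice3_score, i, j):
    """Exon score as the two splice site scores plus the codon sum."""
    return splice5_score + range_score(A, i, j) + splice3_score


# One frame of a 9-position sequence, with a codon score every third base.
A = cumulative_scores([0.0, 0.0, 1.5, 0.0, 0.0, -0.5, 0.0, 0.0, 2.0])
print(range_score(A, 3, 9))             # 1.5
print(exon_score(A, 0.25, 0.5, 3, 9))   # 2.25
```

Each array takes linear space, and any candidate exon's score is then two array lookups plus the two splice site scores.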
When the GFF format is used to transmit scores that can be summed for
efficient retrieval as in the case of the codon scores above, we ask
that the provider of the scores indicate that these scores are
summable in this manner, and provide a recipe for calculating the
scores that are to be derived from these summable scores, such as the
exon scores described above. We place no limit on the complexity of
this recipe, nor do we provide a standard protocol for such assembly,
other than providing examples. It behooves the sensor score provider
to keep the recipe simple enough that others can easily implement it.
There is a mailing list
to which you can send comments, enquiries, complaints etc. about GFF.
If you want to be added to the mailing list, please send
mail to Majordomo@sanger.ac.uk with the
following command in the body of your email message:
## comment lines for meta information
There is a set of standardised (i.e. parsable) ## line types that can
be used optionally at the top of a gff file. The philosophy is a
little like the special set of %% lines at the top of postscript
files, used for example to give the BoundingBox for EPS files.
Please feel free to propose new ## lines.
The ## line proposal came out of some discussions including Anders
Krogh, David Haussler, people at the Newton Institute on 1997-10-29
and some email from Suzanna Lewis. Of course, naive programs can
ignore all of these...
##gff-version 1
##source-version {source} {version text}
##date {date}
##Type <type> [<name>]
The type of host sequence described by the features. Standard types
are 'DNA', 'Protein' and 'RNA'. The optional <name> allows multiple
##Type definitions describing multiple GFF sets in one file, each
of which has a distinct type. If the name is not provided,
then all the features in the file are of the given type. Thus, with this
meta-comment, a single file could contain DNA, RNA and Protein features,
for example, representing a single genomic locus or 'gene', alongside type-specific
features of its transcribed mRNA and translated protein sequences.
If no ##Type meta-comment is provided for a given GFF file, then the type
is assumed to be DNA.
##DNA {seqname}
##acggctcggattggcgctggatgatagatcagacgac
##...
##end-DNA
##sequence-region {seqname} {start} {end}
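A reader for the ## lines listed above might look like the following Python sketch. The function name `read_meta` and the shape of the returned dictionary are assumptions made for illustration; directives that the sketch does not interpret (such as ##source-version and ##date) are collected unparsed, and ordinary feature lines are skipped.

```python
def read_meta(lines):
    """Collect the '##' meta-comments found in an iterable of lines."""
    meta = {"types": {}, "regions": {}, "dna": {}, "other": []}
    dna_name = None  # set while inside a ##DNA ... ##end-DNA block
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("##"):
            continue  # feature lines and plain comments are not handled here
        body = line[2:].strip()
        if dna_name is not None:
            if body == "end-DNA":
                dna_name = None
            else:
                meta["dna"][dna_name] += body
            continue
        parts = body.split()
        if not parts:
            continue
        key, args = parts[0], parts[1:]
        if key == "gff-version" and args:
            meta["gff-version"] = args[0]
        elif key == "Type" and args:
            # A missing name (stored as None) applies to the whole file.
            meta["types"][args[1] if len(args) > 1 else None] = args[0]
        elif key == "DNA" and args:
            dna_name = args[0]
            meta["dna"][dna_name] = ""
        elif key == "sequence-region" and len(args) == 3:
            meta["regions"][args[0]] = (int(args[1]), int(args[2]))
        else:
            meta["other"].append((key, args))
    return meta
```

A caller could then fall back to `meta["types"].get(None, "DNA")` to apply the default stated above when no ##Type line is present.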
File Naming
We propose that the format is called "GFF", with conventional file
name ending ".gff".
Semantics
We have intentionally avoided overspecifying the semantics of the
format. For example, we have not restricted the items expressible in
GFF to a specified set of feature types (splice sites, exons etc.)
with defined semantics. Therefore, in order for the information in a
gff file to be useful to somebody else, the person producing the
features must describe the meaning of the features.
Ways to use GFF
Here are a few suggestions on how the GFF format might be used.
Complex Examples
Similarities to Other Sequences
A major source of information about a sequence comes from similarities
to other sequences. For example, BLAST hits to protein sequences help
identify potential coding regions. We can represent these as a set of
"homology gene features", grouping hits to the same target as follows:
seq1 BLASTX similarity 101 136 87.1 + 0 HBA_HUMAN
seq1 BLASTX similarity 107 133 72.4 + 0 HBB_HUMAN
seq1 BLASTX similarity 290 343 67.1 + 0 HBA_HUMAN
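The grouping in the records above can be recovered by collecting lines that share the same ninth field. The Python sketch below assumes tab-delimited input and the Version 1 style, in which the ninth field holds just the group name; `group_hits` is a hypothetical helper name.

```python
from collections import defaultdict


def group_hits(lines):
    """Map each group name to a list of (start, end, score) hits."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # no group field on this line
        start, end, score = int(fields[3]), int(fields[4]), float(fields[5])
        groups[fields[8]].append((start, end, score))
    return dict(groups)
```

Applied to the three records above, this gives two hits under HBA_HUMAN and one under HBB_HUMAN.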
If further information is needed about where in the target protein
each match occurs, it can be given after the protein name, e.g.
as the start coordinate in the target.
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
We need to finalise a tag model for gapped alignments...
Cumulative Score Arrays
One issue that comes up with a record-based format such as the GFF
format is how to cope with large numbers of overlapping segments. For
example, in a long sequence, if one tries to include a separate record
giving the score of every candidate exon, where a candidate exon is
defined as a segment of the sequence that begins and ends at candidate
splice sites and consists of an open reading frame in between, then
one can have an infeasibly large number of records. The problem is
that there can be a huge number of highly overlapping exon
candidates.
Mailing list
subscribe gff-list
Edit History
0302200 - rbsk: small clarification to #comment rules
991711 rbsk: (overdue changes as per September '99 gff-list commentaries)
GFF Protocol Specification initially proposed by: Richard Durbin and David Haussler
with amendments proposed by: Lincoln Stein, Suzanna Lewis, Anders Krogh and others.
The GFF specification is now maintained at the Sanger Centre by Richard Bruskiewich