gff/gtf

Overview

Exercises: 30 min
Questions
  • How to recognize the format?
  • How to fix it?
  • What information could I extract from those files?
Objectives
  • Understand the GFF and GTF formats

Prerequisites

For this exercise you need to be logged in to Uppmax.

Set up the folder structure:

source ~/git/GAAS/profiles/activate_rackham_env
export data=/proj/g2019006/nobackup/$USER/data
export work_with_gff=/proj/g2019006/nobackup/$USER/work_with_gff
mkdir -p $work_with_gff

Recognize the format

The GFF and GTF formats are very similar and can be difficult to tell apart. For a complete overview of the formats, have a look at the cheat sheet section.

cd $work_with_gff
cp $data/annotation/augustus.xxx .
less augustus.xxx

:question:

  1. Is it a GFF or a GTF file?
  2. Do you see any problem in the 3rd column?
  3. Which version of the format is it?
  4. Do you see any problem in the 9th column?
:key: Click to see the solution .
  1. This is a GTF file. In the last column, tag and value are separated by a space (it would be an '=' in GFF format). Another clue is the semicolon after the last attribute, which does not exist in the GFF format.
  2. gene and transcript are features allowed only in GTF2.5, while the intron feature exists only in GTF1. The tss feature does not officially exist in any version.
  3. Tricky question: it looks like GTF2.5, but it is actually a flavor specific to Augustus.
  4. The gene and transcript features have wrong attributes: the tag is missing and they contain only the value. An attribute is supposed to look like: tag value
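
The tag/value clue from the first answer can also be checked from the command line. The sketch below is only an illustration: `guess_format` is a hypothetical helper (not part of GAAS), the two sample files are made up, and it assumes tab-separated columns. It looks at the 9th column of the first feature line and reports whether it starts with `tag=value` or `tag value`.

```shell
# Build a one-line sample of each flavor (made-up data, not the course file).
tmpdir=$(mktemp -d)
printf '4\tAUGUSTUS\tCDS\t386\t500\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n' > "$tmpdir/sample.gtf"
printf '4\tAUGUSTUS\tCDS\t386\t500\t.\t+\t0\tID=cds1;Parent=g1.t1\n' > "$tmpdir/sample.gff3"

# guess_format: hypothetical helper that inspects the 9th column
# of the first non-comment line with 9 tab-separated fields.
guess_format() {
  awk -F'\t' '
    /^#/ || NF < 9   { next }                       # skip comments and short lines
    $9 ~ /^[^ =;]+=/ { print "GFF3-like"; exit }    # tag=value
    $9 ~ /^[^ =;]+ / { print "GTF-like";  exit }    # tag value
                     { print "unknown";   exit }    # e.g. a bare value without tag
  ' "$1"
}

gtf_guess=$(guess_format "$tmpdir/sample.gtf")
gff_guess=$(guess_format "$tmpdir/sample.gff3")
echo "sample.gtf:  $gtf_guess"
echo "sample.gff3: $gff_guess"
```

A line whose 9th column holds only a bare value, like the gene line of the Augustus file, would be reported as `unknown` by this sketch, which is itself a hint that the attributes need fixing.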

Now edit the file to fix the 9th column:

  chmod +w augustus.xxx
  nano augustus.xxx
:key: Click to see the solution .
The first two lines must look like this:
4 AUGUSTUS gene 386 13142 0.01 + . gene_id g1;
4 AUGUSTUS transcript 386 13142 0.01 + . transcript_id g1.t1;
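
For a file with many genes, editing by hand quickly becomes tedious. As a rough sketch of how the same fix could be scripted, the awk command below adds the missing tag to gene and transcript lines. It runs on a made-up two-line sample and assumes that columns are tab-separated and that a 9th column without any space is a bare value; check both assumptions against your own file before using anything like it.

```shell
# Made-up sample mimicking the two broken lines (bare value in column 9).
tmpdir=$(mktemp -d)
printf '4\tAUGUSTUS\tgene\t386\t13142\t0.01\t+\t.\tg1\n'          >  "$tmpdir/broken.gtf"
printf '4\tAUGUSTUS\ttranscript\t386\t13142\t0.01\t+\t.\tg1.t1\n' >> "$tmpdir/broken.gtf"

# Prefix the bare value with its tag and terminate it with a semicolon.
# Lines whose 9th column already contains a space are left untouched.
awk 'BEGIN { FS = OFS = "\t" }
  $3 == "gene"       && $9 !~ / / { $9 = "gene_id " $9 ";" }
  $3 == "transcript" && $9 !~ / / { $9 = "transcript_id " $9 ";" }
  { print }
' "$tmpdir/broken.gtf" > "$tmpdir/fixed.gtf"

cat "$tmpdir/fixed.gtf"
```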

Now your file has at least a correct structure!
Let’s convert it to GFF3 format:

  gxf_to_gff3.pl --gff augustus.xxx -o augustus.gff3 

The script gxf_to_gff3.pl can be your friend when dealing with GFF/GTF files. It can handle any kind of GFF/GTF format (even mixed formats) as well as errors, and it produces a standardized GFF3 file.
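
After any conversion, a quick structural check can be reassuring. The sketch below is a generic sanity check, not a feature of gxf_to_gff3.pl: on a made-up sample it verifies that the first line is the `##gff-version 3` header and counts feature lines that do not have exactly nine tab-separated columns. The sample deliberately contains one truncated line.

```shell
# Made-up GFF3 sample with one deliberately truncated feature line.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"
{
  printf '##gff-version 3\n'
  printf '4\tAUGUSTUS\tgene\t386\t13142\t0.01\t+\t.\tID=g1\n'
  printf '4\tAUGUSTUS\tmRNA\t386\t13142\t0.01\t+\t.\tID=g1.t1;Parent=g1\n'
  printf '4\tAUGUSTUS\texon\t386\t500\n'
} > "$toy"

# 1 if the first line is the GFF3 version header, 0 otherwise.
has_header=$(head -n 1 "$toy" | grep -c '^##gff-version 3')

# Number of non-comment lines that do not have exactly 9 columns.
bad_lines=$(awk -F'\t' '!/^#/ && NF != 9 { n++ } END { print n + 0 }' "$toy")

echo "header present: $has_header"
echo "lines without 9 columns: $bad_lines"
```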

Extract information from a GFF file

The GFF format was designed to be easy to parse and process with a variety of programs and languages (e.g. Unix tools such as grep and sort, Perl, awk, etc.). For this reason, each feature is described on a single line.
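
Because each feature sits on its own line, standard Unix tools can be chained directly. The sketch below uses a made-up three-feature file (the names `toy`, `geneA`, `geneB` are illustrative only): it drops comment lines, keeps the gene features, sorts them by start coordinate (column 4) and prints the attributes of the first one.

```shell
# Made-up GFF3 sample; features are intentionally not in coordinate order.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"
{
  printf '##gff-version 3\n'
  printf '1\ttoy\tgene\t1000\t2000\t.\t+\t.\tID=geneB\n'
  printf '1\ttoy\tmRNA\t1000\t2000\t.\t+\t.\tID=mrnaB;Parent=geneB\n'
  printf '1\ttoy\tgene\t100\t500\t.\t-\t.\tID=geneA\n'
} > "$toy"

tab=$(printf '\t')   # literal tab character used as the sort field separator

# grep removes comments, awk keeps genes, sort orders by start, cut keeps column 9.
first_gene=$(grep -v '^#' "$toy" \
  | awk -F'\t' '$3 == "gene"' \
  | sort -t "$tab" -k4,4n \
  | cut -f9 \
  | head -n 1)

echo "$first_gene"
```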

Download the human GFF annotation (release 96) from Ensembl:

 wget ftp://ftp.ensembl.org/pub/release-96/gff3/homo_sapiens/Homo_sapiens.GRCh38.96.chr.gff3.gz

:question: What is the size of this file?

Now uncompress it:

 gunzip Homo_sapiens.GRCh38.96.chr.gff3.gz 

:question: What is the size of the uncompressed file?

The GFF/GTF format compresses well, because it is plain text in which many values are repeated from line to line.
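
To see the effect without downloading anything, the sketch below generates a made-up GFF3 file with 1000 very similar exon lines, compresses a copy with gzip and compares the byte counts. The exact ratio depends on the content, so treat the number as an illustration only. For the real file, `ls -lh` before and after `gunzip` answers the two questions above.

```shell
# Generate 1000 made-up, highly repetitive exon lines.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"
awk 'BEGIN {
  for (i = 1; i <= 1000; i++)
    printf "1\ttoy\texon\t%d\t%d\t.\t+\t.\tID=exon%d;Parent=mrna1\n", i * 100, i * 100 + 50, i
}' > "$toy"

# Compress a copy and keep the original for comparison.
gzip -c "$toy" > "$toy.gz"

# Sizes in bytes; the arithmetic expansion strips any padding added by wc.
raw_bytes=$(( $(wc -c < "$toy") ))
gz_bytes=$(( $(wc -c < "$toy.gz") ))

echo "uncompressed: $raw_bytes bytes"
echo "compressed:   $gz_bytes bytes"
awk -v r="$raw_bytes" -v g="$gz_bytes" 'BEGIN { printf "ratio: %.1fx\n", r / g }'
```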

Let’s now compute some statistics on this file.

:question:

  1. How many lines are there?
  2. How many genes are there?
  3. How many mRNAs are there?
  4. How many genes are there on chromosome 1?
  5. How many types of features (3rd column) are there?
:key: Click to see the solution .
  1. wc -l Homo_sapiens.GRCh38.96.chr.gff3
  2. awk -F'\t' '$3 == "gene"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
  3. awk -F'\t' '$3 == "mRNA"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
  4. awk -F'\t' '$1 == "1" && $3 == "gene"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
  5. awk -F'\t' '!/^#/ {print $3}' Homo_sapiens.GRCh38.96.chr.gff3 | sort -u | wc -l
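
The individual counts above can also be obtained in one pass. The sketch below runs on a made-up five-feature file rather than the Ensembl annotation: it tallies every feature type with `sort | uniq -c`, then counts the distinct types and the genes on sequence 1. Replacing `$toy` with Homo_sapiens.GRCh38.96.chr.gff3 should give the corresponding numbers for the real file.

```shell
# Made-up GFF3 sample: 2 genes, 1 mRNA, 2 exons on sequences 1 and 2.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"
{
  printf '##gff-version 3\n'
  printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf '1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1\n'
  printf '1\ttoy\texon\t100\t400\t.\t+\t.\tParent=m1\n'
  printf '1\ttoy\texon\t600\t900\t.\t+\t.\tParent=m1\n'
  printf '2\ttoy\tgene\t50\t700\t.\t-\t.\tID=g2\n'
} > "$toy"

# Tally of features per type, most frequent first.
grep -v '^#' "$toy" | cut -f3 | sort | uniq -c | sort -k1,1nr

# Number of distinct feature types (awk prints the line count without padding).
n_types=$(grep -v '^#' "$toy" | cut -f3 | sort -u | awk 'END { print NR }')

# Number of gene features on sequence "1".
n_genes_chr1=$(awk -F'\t' '$1 == "1" && $3 == "gene" { n++ } END { print n + 0 }' "$toy")

echo "feature types: $n_types"
echo "genes on 1:    $n_genes_chr1"
```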