For this exercise you need to be logged in to Uppmax.
Set up the folder structure:
source ~/git/GAAS/profiles/activate_rackham_env
export data=/proj/g2019006/nobackup/$USER/data
export work_with_gff=/proj/g2019006/nobackup/$USER/work_with_gff
mkdir -p $work_with_gff
The GFF and GTF formats are closely related and can be difficult to tell apart. For a complete overview of the formats, have a look at the cheat sheet section.
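One rough way to tell the two apart is to look at how the 9th column is written: GFF3 uses `tag=value` pairs, while GTF uses `tag "value";` pairs. The sketch below applies that heuristic to two made-up one-line files. The function name `guess_format` and the sample files are hypothetical, and real files can mix conventions, so treat the answer as a hint only.

```shell
# Hypothetical helper: guess the flavour of an annotation file from the
# 9th column of its first feature line (a heuristic, not a validator).
guess_format() {
  awk -F'\t' '
    /^#/ || NF < 9     { next }                  # skip comments and short lines
    $9 ~ /^[^ ;=]+=/   { print "GFF3"; exit }    # tag=value
    $9 ~ /^[^ ;=]+ +"/ { print "GTF"; exit }     # tag "value";
                       { print "unknown"; exit }
  ' "$1"
}

# Two made-up, tab-separated one-line examples in a scratch directory.
tmpdir=$(mktemp -d)
printf '4\tAUGUSTUS\tgene\t386\t13142\t0.01\t+\t.\tID=g1\n' > "$tmpdir/sample.gff3"
printf '4\tAUGUSTUS\tgene\t386\t13142\t0.01\t+\t.\tgene_id "g1";\n' > "$tmpdir/sample.gtf"

guess_format "$tmpdir/sample.gff3"
guess_format "$tmpdir/sample.gtf"
```

If the function prints `unknown`, the attribute column follows neither convention, which is itself a useful clue that the file needs fixing.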
cd $work_with_gff
cp $data/annotation/augustus.xxx .
less augustus.xxx
tag value
Now edit the file to fix the 9th column:
chmod +w augustus.xxx
nano augustus.xxx
4 AUGUSTUS gene 386 13142 0.01 + . gene_id g1;
4 AUGUSTUS transcript 386 13142 0.01 + . transcript_id g1.t1;
Now your file has at least a correct structure!
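A quick way to double-check the structure is to count the tab-separated columns on each feature line. The sketch below assumes the columns are separated by tabs and runs on a small made-up file (`check.gtf` is hypothetical), reporting every line that does not have exactly 9 columns.

```shell
tmpdir=$(mktemp -d)
check="$tmpdir/check.gtf"

# Made-up data: the second line is deliberately broken (only 8 columns).
printf '4\tAUGUSTUS\tgene\t386\t13142\t0.01\t+\t.\tgene_id "g1";\n' > "$check"
printf '4\tAUGUSTUS\ttranscript\t386\t13142\t0.01\t+\t.\n' >> "$check"

# Report line number and column count for every non-comment line
# that does not have exactly 9 tab-separated columns.
bad_lines=$(awk -F'\t' '!/^#/ && NF != 9 { print NR ": " NF " columns" }' "$check")
echo "$bad_lines"
```

An empty result means every feature line has the expected number of columns; it says nothing about whether the values in those columns are valid.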
Let’s convert it to GFF3 format:
gxf_to_gff3.pl --gff augustus.xxx -o augustus.gff3
The script gxf_to_gff3.pl can be your friend when dealing with GFF/GTF files. It can deal with any kind of GFF/GTF format (even mixed formats) and with errors in the file, and it lets you create a standardized GFF3 file.
The GFF format was designed to be easy to parse and process with a variety of programs in different languages (e.g. Unix tools such as grep and sort, Perl, awk, etc.). For this reason, each feature is described on a single line.
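Because each feature sits on one line, ordinary line-oriented tools are enough for many tasks. As a small illustration on made-up data (the file `toy.gff3` and its contents are hypothetical), the sketch below selects the exon lines and sorts them by sequence name and start coordinate.

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"

# Made-up, tab-separated features, written out of coordinate order.
{
  printf '##gff-version 3\n'
  printf '1\ttoy\texon\t500\t600\t.\t+\t.\tID=e2;Parent=t1\n'
  printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf '1\ttoy\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1\n'
} > "$toy"

# Keep exon lines only, then sort by column 1 (text) and column 4 (numeric).
tab=$(printf '\t')
sorted_exons=$(awk -F'\t' '$3 == "exon"' "$toy" | sort -t "$tab" -k1,1 -k4,4n)

# Show just the start coordinate and the attributes of each exon.
echo "$sorted_exons" | cut -f 4,9
```

Setting the field separator to a tab explicitly (`-F'\t'`, `-t "$tab"`) keeps the columns intact even if a value contains spaces.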
Download the human GFF annotation (release 96) from Ensembl:
wget ftp://ftp.ensembl.org/pub/release-96/gff3/homo_sapiens/Homo_sapiens.GRCh38.96.chr.gff3.gz
What is the size of this file?
Now uncompress it:
gunzip Homo_sapiens.GRCh38.96.chr.gff3.gz
What is the size of the uncompressed file?
As you can see, GFF/GTF files compress very well.
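To put a number on this, you can compare the byte counts before and after compression. The sketch below does so on a small synthetic file (hypothetical data, 2000 repetitive feature lines) rather than on the Ensembl download, so the ratio it prints will differ from what you observe on the real annotation.

```shell
tmpdir=$(mktemp -d)
f="$tmpdir/synthetic.gff3"

# Build a repetitive, GFF-like file with 2000 made-up feature lines.
awk 'BEGIN {
  for (i = 1; i <= 2000; i++)
    printf "1\ttoy\tgene\t%d\t%d\t.\t+\t.\tID=gene%d\n", i*100, i*100+50, i
}' > "$f"

raw=$(wc -c < "$f")           # size in bytes before compression
gzip -c "$f" > "$f.gz"        # -c writes to stdout, so the original is kept
packed=$(wc -c < "$f.gz")     # size in bytes after compression

# An integer ratio is enough for a rough impression.
echo "uncompressed: $raw bytes, compressed: $packed bytes, ratio ~$((raw / packed))x"
```

For a human-readable view of a single file, `ls -lh` or `du -h` on the file before and after `gunzip` gives the same information.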
Let’s now compute some statistics on this file.
wc -l Homo_sapiens.GRCh38.96.chr.gff3
awk -F'\t' '$3=="gene"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
awk -F'\t' '$3=="mRNA"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
awk -F'\t' '$3=="gene" && $1=="1"' Homo_sapiens.GRCh38.96.chr.gff3 | wc -l
awk -F'\t' '$0 !~ /^#/ {print $3}' Homo_sapiens.GRCh38.96.chr.gff3 | sort -u
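The same idea extends to counting how often each feature type occurs, by replacing `sort -u` with `sort | uniq -c`. The sketch below shows this on a made-up file (`toy.gff3` is hypothetical); on the Ensembl annotation you would point the same pipeline at `Homo_sapiens.GRCh38.96.chr.gff3`.

```shell
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff3"

# Made-up, tab-separated features: one gene, one mRNA, two exons.
{
  printf '##gff-version 3\n'
  printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf '1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n'
  printf '1\ttoy\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1\n'
  printf '1\ttoy\texon\t500\t600\t.\t+\t.\tID=e2;Parent=t1\n'
} > "$toy"

# Column 3 holds the feature type: skip comment lines, tally each type,
# then list the most frequent types first.
type_counts=$(awk -F'\t' '!/^#/ { print $3 }' "$toy" | sort | uniq -c | sort -k1,1nr)
echo "$type_counts"
```

`uniq -c` only merges adjacent identical lines, which is why the first `sort` is needed before it.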