Background: For many diseases with known causative mutations, screening methods have been developed to detect whether people have a high risk of becoming sick, even before the onset of the actual disease.
Over the last few years, the cost of full genome sequencing has gone down so that, in some cases, it might be cheaper to collect the complete genome sequence of patients with a high risk of carrying variants associated with the disease, rather than using targeted screening procedures.
Cystic fibrosis is a complex disease, where patients often manifest the following symptoms: problems with lung functions, diabetes and infertility. From a genetic point of view, there are several mutations associated with this disease. In particular, the CFTR gene (short for Cystic Fibrosis Transmembrane Conductance Regulator) encodes an ion channel protein acting in epithelial cells, and carries several non-synonymous genetic variants, with alterations leading to premature stop codons, that are known to cause the disease.
Goal: In this assignment, you have access to the human reference genome as well as the genome annotation. In addition, you have full genome sequence data from five individuals from a family at risk of carrying mutations related to the disease.
Your task is to write a Python program that will extract the correct transcript from the CFTR gene, translate the gene sequence to its corresponding amino-acid sequence and based on the reference amino-acid sequence determine whether any of the five given individuals is affected.
Fetch the appropriate files
The main task is divided in several steps. The first step is to fetch
the reference sequence file (in fasta
format) and the appropriate reference annotation file (in GTF
format) from the Ensembl database.
The CFTR gene is located on chromosome 7
. After downloading the files, read up on how the files are structured.
Human reference DNA for chromosome 7 (fasta):
- Homo_sapiens.GRCh38.dna_sm.chromosome.7.fa.gz
Human GTF annotation file:
- Homo_sapiens.GRCh38.93.gtf.gz
Many of the tasks involves outputting long sequences. To make sure they are correct, use the utils.check_answers
package:
from utils import check_answers
Warmup
-
What is the length of chr7 on the reference sequence??
Tip
Open the reference fasta file and read it line by line.
Ignore the first line and, in a loop, get the length of each line (from which you remove the trailing newline character).
Sum up all the lengths you found.
Answer
Chromosome 7 has 159.345.973 base pairs.
-
How many genes are annotated in the GTF file?
Tip
You need to understand the structure of a GTF-formatted file.
The GTF format uses several tab-delimited fields, for which we give you a short a description.
Alternatively, you can search online.
Then, only count entries of type gene
Answer
There are 58.395 genes annotated in the GTF file
Architect a method
All the following tasks are now related to the CTFR gene.
In the annotation file (the GTF file), that gene has the
id ENSG00000001626
on chromosome 7
.
-
How many transcripts can this gene generate?
Tip
Again, think about the structure of the GTF file
Answer
This gene can produce 11 different transcripts -
What is the longest transcript in nucleotides?
Tip
Use start and stop positions for each transcript of the gene
Answer
The transcript with id ENST00000003084 is the longest among 11 other transcripts, and spans 188.703 bases
-
Fetch the DNA sequence for that transcript
Tip
Similarly to step 1 from the Warmup, open the reference file.
Ignore the first line and, in a loop, append each line to a list.
Remember to strip the trailing newline character.
Outside the loop, use the
join
function to concatenate the lines from the list.Avoid concatenation inside the loop, as it is slow and wasting memory
Use the start and stop positions extracted from the transcript, but think about where the index starts from
Answer
Write your results to file and compare with
check_answers.ex3(resultsFile)
The entire sequence can be found here
-
Fetch all the exons for that transcript, spliced together to one sequence
Tip
First you need to save the start and stop positions of all exons of that transcript.
Answer
Write your results to file and compare with
check_answers.ex4(resultsFile)
The correct sequence can be found here
-
What are the position and sequence of the
start_codon
andstop_codon
from that transcript?Note
Check that the
start_codon
isATG
, and that thestop_codon
corresponds to a proper stop codonMake your program print a warning message in case the transcript does not begin with a start-codon and end with a stop-codon
Answer
Position of start codon is 117.480.095
Position of stop codon is 117.667.106
-
Translate the above sequence into amino-acids, using an implementation of the translation table from
utils.rna
package.Tip
Use start codon position to start translating from
Answer
Write your results to file and compare with
check_answers.ex6(resultsFile)
The correct sequence can be found here
Find the patients at risk
We are reaching the goal for this assignment!
A mutation in the transcript ENST00000003084 causes a premature stop codon to be introduced into the aminoacid sequence. This creates a truncated protein, causing cystic fibrosis. Using the python program you have designed above, compare the reference sequence to the sequences of the following 5 patients (patient-1, patient-2, patient-3, patient-4, and patient-5) to determine which one is carrying a mutation in the CFTR gene, causing a truncated protein.
There might be several.
Extra task
-
Use BioPython for (some of) the above tasks
Procedure
Start by parsing a fasta file with BioPython.
and the translation step using the built-in translation tables.
Solution
Here are some possible solutions to the assignment. There are of course many correct solutions, we only present one of the alternatives.