Background: For many diseases with known causative mutations, screening methods have been developed to detect whether people have a high risk of becoming sick, even before the onset of the actual disease.

Over the last few years, the cost of full genome sequencing has gone down so that, in some cases, it might be cheaper to collect the complete genome sequence of patients with a high risk of carrying variants associated with the disease, rather than using targeted screening procedures.

Cystic fibrosis is a complex disease whose patients often show the following symptoms: impaired lung function, diabetes, and infertility. From a genetic point of view, several mutations are associated with this disease. In particular, the CFTR gene (short for Cystic Fibrosis Transmembrane Conductance Regulator) encodes an ion channel protein acting in epithelial cells, and carries several non-synonymous genetic variants, with alterations leading to premature stop codons, that are known to cause the disease.

Goal: In this assignment, you have access to the human reference genome as well as the genome annotation. In addition, you have full genome sequence data from five individuals from a family at risk of carrying mutations related to the disease.

Your task is to write a Python program that extracts the CFTR gene, translates the gene sequence into its corresponding amino-acid sequence and, based on the inferred amino-acid sequence, determines whether any of the five given individuals is affected.

Fetch the appropriate files

The main task is divided into several steps. The first step is to fetch the sequence file (in FASTA format) and the appropriate annotation file (in GTF format) from the Ensembl database.

The CFTR gene is on chromosome 7.

Warmup

What is the length of the chosen DNA sequence?

Tip

Open the DNA file with the `with` statement and read it line by line.

Ignore the first line and, in a loop, get the length of each line (from which you remove the trailing newline character).

Sum up all the lengths you found.
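The steps in this tip can be sketched as follows. This is only a minimal sketch: the function name `fasta_sequence_length` is made up for illustration, it assumes the file holds a single FASTA record (one header line followed by sequence lines), and the demo uses a tiny in-memory stand-in rather than the real chromosome file.

```python
import io


def fasta_sequence_length(handle):
    """Return the number of bases in a single-record FASTA file."""
    next(handle)  # ignore the first line (the '>' header)
    total = 0
    for line in handle:
        # Remove the trailing newline before measuring the line.
        total += len(line.rstrip("\n"))
    return total


# Tiny in-memory stand-in for the real chromosome file (toy data).
demo = io.StringIO(">7 toy record\nACGTAC\nGGT\n")
demo_length = fasta_sequence_length(demo)
print(demo_length)
```

With the real data, the same function can be called inside `with open(path) as handle:`, where `path` is wherever you saved the FASTA file.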
How many genes are annotated in the GTF file?

Note

You need to understand the structure of a GTF-formatted file.

The GTF format uses several tab-delimited fields, for which we give you a short description.

Alternatively, you can search online.
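As a rough illustration of that structure, a GTF data line has nine tab-delimited columns: seqname, source, feature, start, end, score, strand, frame and attribute. The sketch below splits a line into those columns and counts the lines whose feature is `gene`. The helper names `parse_gtf_line` and `count_genes` are hypothetical, and the toy annotation uses made-up identifiers and coordinates.

```python
import io

# The nine tab-delimited GTF columns, in order.
GTF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")


def parse_gtf_line(line):
    """Split one GTF data line into a dict keyed by column name."""
    record = dict(zip(GTF_FIELDS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    return record


def count_genes(handle):
    """Count the lines whose feature column is exactly 'gene'."""
    count = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        if parse_gtf_line(line)["feature"] == "gene":
            count += 1
    return count


# Toy annotation with made-up identifiers and coordinates.
toy_gtf = io.StringIO(
    "#!toy header comment\n"
    '7\ttoy\tgene\t10\t90\t.\t+\t.\tgene_id "GENE_A";\n'
    '7\ttoy\ttranscript\t10\t90\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n'
    '7\ttoy\tgene\t200\t300\t.\t-\t.\tgene_id "GENE_B";\n'
)
toy_gene_count = count_genes(toy_gtf)
print(toy_gene_count)
```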
What fraction of the chromosome is annotated as genes?
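One subtlety for this question is that annotated genes can overlap, so simply summing gene lengths may count some positions twice. A possible approach, sketched here under the assumption that you have already collected the (start, end) pairs of the gene lines, is to sort the intervals and count each position only once. The name `covered_length` and the toy numbers are illustrative only.

```python
def covered_length(intervals):
    """Count positions covered by 1-based, inclusive (start, end) intervals.

    Overlapping or nested intervals are only counted once.
    """
    covered = 0
    furthest = 0  # last position that has already been counted
    for start, end in sorted(intervals):
        if end <= furthest:
            continue  # entirely inside a region counted earlier
        # Count only the part that lies beyond what is already covered.
        covered += end - max(start, furthest + 1) + 1
        furthest = end
    return covered


# Toy example: two overlapping genes and one separate gene
# on a made-up chromosome of 100 bases.
toy_genes = [(1, 10), (5, 15), (20, 25)]
toy_chromosome_length = 100
toy_fraction = covered_length(toy_genes) / toy_chromosome_length
print(toy_fraction)
```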

Architect a method

All of the following tasks relate to the CFTR gene.

In the annotation file (from the Ensembl database), that gene has the id ENSG00000001626 on chromosome 7.

How many transcripts can this gene generate?

Answer
11
What is the longest transcript in nucleotides?

Answer

The transcript with id ENST00000003084 is 6132 bp long and is the longest of the gene's 11 transcripts.

Check its Ensembl data

Notice that the last column in the GTF on the line defining that transcript should contain protein_coding.
Fetch the DNA sequence for that gene

Tip

Similarly to step 1 from the Warmup, open the file with the `with` statement.

Ignore the first line and, in a loop, append each line to a list.

Remember to strip the trailing newline character.

Outside the loop, use the join function to concatenate the lines from the list.

Avoid concatenating strings inside the loop, as it is slow and wastes memory.
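The tip above might translate into something like the following sketch. The helper names `read_fasta_sequence` and `extract_region` are hypothetical, and the slice assumes the 1-based, inclusive start and end coordinates used by GTF files.

```python
import io


def read_fasta_sequence(handle):
    """Return the sequence of a single-record FASTA file as one string."""
    next(handle)  # ignore the header line
    chunks = []
    for line in handle:
        chunks.append(line.rstrip("\n"))  # strip the trailing newline
    # Join once, outside the loop, instead of concatenating repeatedly.
    return "".join(chunks)


def extract_region(sequence, start, end):
    """Cut out a region given 1-based, inclusive coordinates."""
    return sequence[start - 1:end]


# Toy data standing in for the chromosome file.
toy_fasta = io.StringIO(">7 toy record\nACGT\nTTGA\n")
toy_sequence = read_fasta_sequence(toy_fasta)
toy_region = extract_region(toy_sequence, 3, 5)
print(toy_region)
```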
Fetch all the exons for that transcript (splicing)

Answer

You can write your answer to a file and compare it to the given result (also available online).
What are the positions of the start_codon and stop_codon in that transcript?
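One way to approach the splicing step is sketched below: collect the exon lines of a transcript from the annotation, sort them by start position, and concatenate the corresponding slices of the chromosome. This assumes a transcript on the forward strand and a `transcript_id "..."` entry in the attribute column. The function names and the toy identifiers are made up for illustration.

```python
import io


def exon_intervals(gtf_lines, transcript_id):
    """Return the sorted (start, end) pairs of a transcript's exons."""
    wanted = f'transcript_id "{transcript_id}"'
    intervals = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "exon" and wanted in fields[8]:
            intervals.append((int(fields[3]), int(fields[4])))
    return sorted(intervals)


def splice(chromosome, intervals):
    """Concatenate exon slices (1-based, inclusive coordinates)."""
    return "".join(chromosome[start - 1:end] for start, end in intervals)


# Toy chromosome and annotation with made-up identifiers.
toy_chromosome = "AACCGTTTGGAA"
toy_gtf = io.StringIO(
    '7\ttoy\texon\t9\t10\t.\t+\t.\tgene_id "G1"; transcript_id "TX_DEMO";\n'
    '7\ttoy\texon\t3\t5\t.\t+\t.\tgene_id "G1"; transcript_id "TX_DEMO";\n'
    '7\ttoy\texon\t1\t2\t.\t+\t.\tgene_id "G1"; transcript_id "TX_OTHER";\n'
)
toy_exons = exon_intervals(toy_gtf, "TX_DEMO")
toy_spliced = splice(toy_chromosome, toy_exons)
print(toy_spliced)
```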

Tip

Check that the start_codon is ATG, and that the stop_codon corresponds to a proper stop codon.

Make your program throw a warning if the transcript you are currently translating does not begin with a start codon and end with a stop codon.
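A minimal sketch of such a check, using the standard `warnings` module, could look like this. The function name `check_coding_sequence` is hypothetical, and the sequence is assumed to be an uppercase DNA string covering the region from the start codon to the stop codon.

```python
import warnings

# The three stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(cds):
    """Warn if cds does not start with ATG or end with a stop codon.

    Returns True when both checks pass, False otherwise.
    """
    ok = True
    if not cds.startswith("ATG"):
        warnings.warn("sequence does not begin with the start codon ATG")
        ok = False
    if cds[-3:] not in STOP_CODONS:
        warnings.warn("sequence does not end with a stop codon")
        ok = False
    return ok


# Toy sequences: the first passes both checks, the second lacks ATG.
toy_ok = check_coding_sequence("ATGAAATAA")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    toy_bad = check_coding_sequence("CCGAAATAA")
```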
Translate into amino acids, using an implementation of the translation table from the utils.rna package.

Tip

You can write your results to different files and check the difference with the given result or online here or here.
```
diff filename-1 filename-2
```
will output nothing when the files are identical.
Moreover, have a look at the range function, which can take an extra step parameter.
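To illustrate how the step parameter of `range` helps here, the sketch below walks through a coding sequence three bases at a time. It does not use the `utils.rna` package: the `CODON_TABLE` dictionary is a stand-in built from the standard genetic code, and the `translate` function is a hypothetical helper that assumes an uppercase sequence containing only A, C, G and T.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    # range(start, stop, step): visit positions 0, 3, 6, ... so that each
    # iteration reads one complete codon.
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)


toy_protein = translate("ATGTTTGGATAG")
print(toy_protein)
```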
Use BioPython for (some of) the above tasks

Procedure

Start by parsing a fasta file with BioPython.

Have a look at the transcription step,

and the translation step using the built-in translation tables.
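The three steps of this procedure can be sketched with BioPython as follows, assuming the package is installed. The demo parses a tiny in-memory FASTA record (toy data) instead of the real file, then uses the `transcribe` and `translate` methods of the `Seq` object. Passing `to_stop=True` ends the protein at the first stop codon.

```python
import io

from Bio import SeqIO

# Toy FASTA record standing in for the real sequence file.
toy_fasta = io.StringIO(">demo toy record\nATGGCC\nATTTAA\n")

# Parsing: SeqIO.parse yields one SeqRecord per FASTA entry.
record = next(SeqIO.parse(toy_fasta, "fasta"))
coding_dna = record.seq

# Transcription: DNA to messenger RNA (T becomes U).
messenger_rna = coding_dna.transcribe()

# Translation with the default (standard) built-in table.
protein = messenger_rna.translate(to_stop=True)

print(record.id, messenger_rna, protein)
```

With the real data, replace the in-memory handle with the path to your FASTA file.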

Find the patients at risk

We are reaching the goal for this assignment!

Using the Python program you designed above, find which of the following five patients (patient-1, patient-2, patient-3, patient-4, and patient-5) carries a mutation in the CFTR gene that can cause cystic fibrosis.

There might be several.
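Since the disease-causing alterations described in the background lead to premature stop codons, one simple criterion is to look for an in-frame stop codon before the end of a patient's coding sequence. The sketch below illustrates only that idea: the function name `premature_stop_position`, the patient labels and the sequences are toy assumptions, not the assignment data. Other non-synonymous changes would require comparing each patient's protein with the reference protein.

```python
# The three stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stop_position(cds):
    """Return the 0-based codon index of the first in-frame stop codon
    found before the final codon, or None if there is none."""
    last_codon_start = len(cds) - 3
    for i in range(0, last_codon_start, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


# Toy coding sequences with made-up labels.
toy_patients = {
    "patient-A": "ATGGCCATTTAA",  # stop codon only at the very end
    "patient-B": "ATGGCCTGATAA",  # TGA appears one codon too early
}
toy_at_risk = [
    name for name, cds in toy_patients.items()
    if premature_stop_position(cds) is not None
]
print(toy_at_risk)
```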

Extra tasks

What if the sequence was on the reverse strand?
You need to implement that as well!
So… no! Use the BioPython module, which does that job!
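For a gene on the reverse strand, the coding sequence is the reverse complement of what is read from the forward-strand FASTA file. A small sketch with BioPython's `Seq.reverse_complement` method, assuming the package is installed and using a toy sequence:

```python
from Bio.Seq import Seq

# Toy forward-strand sequence; the gene is imagined to lie on the
# reverse strand, so it must be reverse-complemented before translation.
forward_strand = Seq("TTAAATGGCCAT")

coding_dna = forward_strand.reverse_complement()
protein = coding_dna.translate(to_stop=True)

print(coding_dna, protein)
```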

About your main assignment

Introduction to Python HT17
