Introduction to Python

with Application to Bioinformatics

- Day 4


  • Keyword arguments
  • Loops with break and continue
  • "Code structure": comments and documentation
  • Importing modules: using libraries
  • Pandas - explore your data!


  • In what ways does the type of an object matter? Explain the output of:
row = 'sofa|2000|buy|Uppsala'
fields = row.split('|')
price = fields[1]
if price == 2000:
    print('The price is a number!')
if price == '2000':
    print('The price is a string!')
The price is a string!
print(sorted([ 2000,   30,   100 ]))
print(sorted(['2000', '30', '100']))
# Hint: is `'30' > '2000'`?
[30, 100, 2000]
['100', '2000', '30']
  • How can you convert an object to a different type?
    • Convert to number: '2000' and '0.5' and '1e9'
    • Convert to boolean: 1, 0, '1', '0', '', {}
  • We have seen these container types: lists, sets, dictionaries. What is their difference and when should you use which?
  • What is a function? Write a function that counts the number of occurrences of 'C' in the argument string.

In what ways does the type of an object matter?

row = 'sofa|2000|buy|Uppsala'
fields = row.split('|')
price = int(fields[1])
if price == 2000:
    print('The price is a number!')
if price == '2000':
    print('The price is a string!')
The price is a number!
print(sorted([ 2000,   30,   100 ]))
print(sorted(['2000', '30', '100']))
# Hint: is `'30' > '2000'`?
[30, 100, 2000]
['100', '2000', '30']

In what ways does the type of an object matter?

  • Each type stores a specific kind of information

    • int for integers,
    • float for floating point values (decimals),
    • str for strings,
    • list for lists,
    • dict for dictionaries.
  • Each type supports different operations, functions and methods.

  • Each type supports different operations, functions and methods
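30 > 2000          # False: numbers are compared by value
'30' > '2000'      # True: strings are compared character by character
30 > int('2000')   # False: the string is converted to a number first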
  • Each type supports different operations, functions and methods
[1, 2, 3].lower()
AttributeError                            Traceback (most recent call last)
<ipython-input-10-4e1a84c0439c> in <module>
----> 1 [1, 2, 3].lower()

AttributeError: 'list' object has no attribute 'lower'
  • Convert to number: '2000' and '0.5' and '1e9'
ValueError                                Traceback (most recent call last)
<ipython-input-12-6d0b04c882d1> in <module>
----> 1 int('0.5')

ValueError: invalid literal for int() with base 10: '0.5'
ValueError                                Traceback (most recent call last)
<ipython-input-13-cb568d180cc9> in <module>
----> 1 int('1e9')

ValueError: invalid literal for int() with base 10: '1e9'
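The tracebacks above show that int() only accepts strings that look like whole numbers. A minimal sketch of conversions that do work for the three example strings; the variable names are only illustrative:

```python
# int() parses whole numbers only; float() also accepts decimals and exponents
as_int = int('2000')
as_float = float('0.5')
big = float('1e9')

# To get an integer out of '1e9', go through float first
big_int = int(float('1e9'))

print(as_int, as_float, big, big_int)
```

Going through float() first is fine for values like these, but very large numbers can lose precision on the way.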
  • Convert to boolean: 1, 0, '1', '0', '', {}
  • Python and the truth: true and false values
values = [1, 0, '', '0', '1', [], [0]]
for x in values:
    if x:
        print(repr(x), 'is true!')
    else:
        print(repr(x), 'is false!')
1 is true!
0 is false!
'' is false!
'0' is true!
'1' is true!
[] is false!
[0] is true!
  • Converting between strings and lists
list('hello')
['h', 'e', 'l', 'l', 'o']
str(['h', 'e', 'l', 'l', 'o'])
"['h', 'e', 'l', 'l', 'o']"
'_'.join(['h', 'e', 'l', 'l', 'o'])
'h_e_l_l_o'
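A short sketch of the round trip between strings and lists, reusing the row string from the review questions. Note that str() on a list gives its printed form, not the joined text, so join() is the usual way back:

```python
letters = list('hello')        # string -> list of single characters
word = ''.join(letters)        # list of strings -> string, no separator

row = 'sofa|2000|buy|Uppsala'
fields = row.split('|')        # string -> list, cut at every '|'
row_again = '|'.join(fields)   # and back to the original string

print(letters, word)
print(fields, row_again == row)
```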

Container types, when should you use which?

  • lists: when order is important
  • dictionaries: to keep track of the relation between keys and values
  • sets: to check for membership. No order, no duplicates.
genre_list = ["comedy", "drama", "drama", "sci-fi"]
['comedy', 'drama', 'drama', 'sci-fi']
genres = set(genre_list)
'drama' in genres
genre_counts = {"comedy": 1, "drama": 2, "sci-fi": 1}
{'comedy': 1, 'drama': 2, 'sci-fi': 1}
movie = {"rating": 10.0, "title": "Toy Story"}
{'rating': 10.0, 'title': 'Toy Story'}
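As a sketch of how the three container types work together, the genre_counts dictionary above could be built from genre_list like this (one possible approach, not the only one):

```python
genre_list = ["comedy", "drama", "drama", "sci-fi"]

# A set drops the duplicates and answers membership questions
genres = set(genre_list)
print('drama' in genres, len(genres))

# A dictionary maps each genre (key) to how often it occurs (value)
genre_counts = {}
for genre in genre_list:
    genre_counts[genre] = genre_counts.get(genre, 0) + 1
print(genre_counts)
```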

What is a function?

  • A named piece of code that performs a specific task
  • A relation (mapping) between inputs (arguments) and output (return value)

  • Write a function that counts the number of occurrences of 'C' in the argument string.

  • Function for counting the number of occurrences of 'C'
def cytosine_count(nucleotides):
    count = 0
    for x in nucleotides:
        if x == 'c' or x == 'C':
            count += 1
    return count

count1 = cytosine_count('CATATTAC')
count2 = cytosine_count('tagtag')
print(count1, count2)
2 0
  • Functions that return are easier to repurpose than those that print their result
cytosine_count('catattac') + cytosine_count('tactactac')
def print_cytosine_count(nucleotides):
    count = 0
    for x in nucleotides:
        if x == 'c' or x == 'C':
            count += 1
    print(count)

print_cytosine_count('catattac') + print_cytosine_count('tactactac')
TypeError                                 Traceback (most recent call last)
<ipython-input-33-5bbd47c30b94> in <module>
      6     print(count)
----> 8 print_cytosine_count('catattac') + print_cytosine_count('tactactac')

TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
  • Objects and references to objects
list_A = ['red', 'green']
list_B = ['red', 'green']
list_B.append('blue')
print(list_A, list_B)
['red', 'green'] ['red', 'green', 'blue']
list_A = ['red', 'green']
list_B = list_A            # another name for the SAME list: aliasing
list_B.append('blue')
print(list_A, list_B)
['red', 'green', 'blue'] ['red', 'green', 'blue']
list_A = ['red', 'green']
list_B = list_A
list_A = []
print(list_A, list_B)
[] ['red', 'green']
  • Objects and references to objects, cont.
list_A = ['red', 'green']
lists = {'A': list_A, 'B': list_A}
print(lists)
list_A.append('blue')
print(lists)
{'A': ['red', 'green'], 'B': ['red', 'green']}
{'A': ['red', 'green', 'blue'], 'B': ['red', 'green', 'blue']}
list_A = ['red', 'green']
lists = {'A': list_A, 'B': list_A}
print(lists)
lists['B'] = lists['B'] + ['yellow']   # + builds a NEW list
print(lists)
{'A': ['red', 'green'], 'B': ['red', 'green']}
{'A': ['red', 'green'], 'B': ['red', 'green', 'yellow']}
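When a second name should not follow changes made through the first, make a copy. A minimal sketch; the names alias and clone are only illustrative:

```python
list_A = ['red', 'green']
alias = list_A          # a second name for the same list object
clone = list(list_A)    # a new list holding the same elements

list_A.append('blue')

print(alias is list_A, alias)   # the alias sees the change
print(clone is list_A, clone)   # the copy does not
```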

Scope: global variables and local function variables

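movies = ['Toy story', 'Home alone']
def some_thriller_movies():
    return ['Fargo', 'The Usual Suspects']

movies = some_thriller_movies()
print(movies)
['Fargo', 'The Usual Suspects']
def change_to_drama(movies):
    movies = ['Forrest Gump', 'Titanic']  # rebinds the local name only

change_to_drama(movies)
print(movies)
['Fargo', 'The Usual Suspects']
def change_to_scifi(movies):
    movies.clear()  # empties the caller's list in place
    movies += ['Terminator II', 'The Matrix']  # extends that same list

change_to_scifi(movies)
print(movies)
['Terminator II', 'The Matrix']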
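Changing an argument in place can surprise the caller, so it is often clearer to return a new list instead. A sketch of that style; the function name with_drama is made up for this example:

```python
def with_drama(movies):
    # Build and return a new list; the caller's list is left untouched
    return movies + ['Forrest Gump', 'Titanic']

movies = ['Fargo', 'The Usual Suspects']
longer = with_drama(movies)

print(movies)
print(longer)
```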

Keyword arguments

  • A way to pass an argument to a function by explicitly naming its parameter, for clarity
sorted(list('file'), reverse=True)
['l', 'i', 'f', 'e']
attribute = 'gene_id "unknown gene"'
attribute.split(sep=' ', maxsplit=1)
['gene_id', '"unknown gene"']
# print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
print('x=', end='')

Keyword arguments

  • The order of keyword arguments does not matter
open(file, mode='r', encoding=None) # some arguments omitted
  • These mean the same:
open('files/recipes.txt', 'w', encoding='utf-8')

open('files/recipes.txt', mode='w', encoding='utf-8')

open('files/recipes.txt', encoding='utf-8', mode='w')

Defining functions taking keyword arguments

  • Just define them as usual:
def format_sentence(subject, value, end):
    return 'The ' + subject + ' is ' + value + end

print(format_sentence('lecture', 'ongoing', '.'))

print(format_sentence('lecture', 'ongoing', end='.'))

print(format_sentence(subject='lecture', value='ongoing', end='...'))
The lecture is ongoing.
The lecture is ongoing.
The lecture is ongoing...
print(format_sentence(subject='lecture', 'ongoing', '.'))
  File "<ipython-input-47-8916632389ec>", line 1
    print(format_sentence(subject='lecture', 'ongoing', '.'))
SyntaxError: positional argument follows keyword argument
  • Positional arguments come first, keyword arguments after!

Defining functions with default arguments

def format_sentence(subject, value, end='.'):
    return 'The ' + subject + ' is ' + value + end

print(format_sentence('lecture', 'ongoing'))

print(format_sentence('lecture', 'ongoing', '...'))
The lecture is ongoing.
The lecture is ongoing...

Defining functions with optional arguments

  • Convention: use the object None
def format_sentence(subject, value, end='.', second_value=None):
    if second_value is None:
        return 'The ' + subject + ' is ' + value + end
    else:
        return 'The ' + subject + ' is ' + value + ' and ' + second_value + end

print(format_sentence('lecture', 'ongoing'))

print(format_sentence('lecture', 'ongoing',
                      second_value='self-referential', end='!'))
The lecture is ongoing.
The lecture is ongoing and self-referential!

Small detour: Python's value for missing values: None

  • Default value for optional arguments
  • Implicit return value of functions without a return
  • Something to initialize a variable that has no value yet
  • An argument to a function indicating that the default value should be used
None == False, None == 0
(False, False)

Comparing None

  • To differentiate None from the other false values such as 0, False and '', use is None:
counts = {'drama': 2, 'romance': 0}

counts.get('romance'), counts.get('thriller')
(0, None)
counts.get('romance') is None    # False: the key exists, its value is 0
counts.get('thriller') is None   # True: the key is missing
  • Python and the truth, take two
values = [None, 1, 0, '', '0', '1', [], [0]]
for x in values:
    if x is None:
        print(repr(x), 'is None')
    if not x:
        print(repr(x), 'is false')
    if x:
        print(repr(x), 'is true')
None is None
None is false
1 is true
0 is false
'' is false
'0' is true
'1' is true
[] is false
[0] is true
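A sketch of why the order of the tests matters when a value can be either missing or zero. The messages list is only there to collect the results:

```python
counts = {'drama': 2, 'romance': 0}

messages = []
for genre in ['drama', 'romance', 'thriller']:
    n = counts.get(genre)      # None if the key is missing
    if n is None:              # test for None first...
        messages.append(genre + ': never counted')
    elif not n:                # ...then for the other false values
        messages.append(genre + ': counted, but zero')
    else:
        messages.append(genre + ': ' + str(n))

print(messages)
```

Testing only with if not n: would put 'romance' and 'thriller' in the same bucket.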

Controlling loops - break

for x in lines_in_a_big_file:
    if x.startswith('>'):  # this is the only line I want!
        do_something(x)  # do_something is a placeholder for your own code

...waste of time!

for x in lines_in_a_big_file:
    if x.startswith('>'):  # this is the only line I want!
        do_something(x)
        break  # break the loop
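A small runnable sketch of break. The list stands in for the lines of a big file, and its FASTA-like content is made up:

```python
# Stand-in for the lines of a big file (made-up content)
lines_in_a_big_file = ['# comment', '>seq1 human', 'ACGT', 'TTGA', '>seq2 mouse']

first_header = None
lines_read = 0
for x in lines_in_a_big_file:
    lines_read += 1
    if x.startswith('>'):
        first_header = x
        break   # stop here; the remaining lines are never visited

print(first_header, lines_read)
```

Only two of the five lines are read before the loop stops.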



Controlling loops - continue

for x in lines_in_a_big_file:
    if x.startswith('>'):  # irrelevant line
        # just skip this! don't do anything
    do_something(x)  # do_something is a placeholder for your own code
for x in lines_in_a_big_file:
    if x.startswith('>'):  # irrelevant line
        continue  # go on to the next iteration
    do_something(x)
for x in lines_in_a_big_file:
    if not x.startswith('>'):  # not irrelevant!
        do_something(x)
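A small runnable sketch of continue, again with a made-up list standing in for the file:

```python
lines = ['>seq1', 'ACGT', 'TTGA', '>seq2', 'CCCC']

sequence_lines = []
for x in lines:
    if x.startswith('>'):
        continue                 # header line: go on to the next iteration
    sequence_lines.append(x)     # only reached for non-header lines

print(sequence_lines)
```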



Another control statement: pass - the placeholder

def a_function():
    # I have not implemented this just yet
  File "<ipython-input-56-a7f30ec71867>", line 2
    # I have not implemented this just yet
SyntaxError: unexpected EOF while parsing
def a_function():
    # I have not implemented this just yet
    pass

Exercise 1

  • Notebook Day_4_Exercise_1 (~30 minutes)

A short note on code structure

  • functions
  • modules (files)
  • documentation

Why functions?

  • Cleaner code
  • Better defined tasks in code
  • Re-usability
  • Better structure

Why modules?

  • Cleaner code
  • Better defined tasks in code
  • Re-usability
  • Better structure
  • Collect all related functions in one file
  • Import a module to use its functions
  • Only need to understand what the functions do, not how

Example: sys

import sys



import pprint

Python standard modules

Check out the module index

How to find the right module?

How to understand it?

How to find the right module?

  • look at the module index
  • search PyPI
  • ask your colleagues
  • search the web!

How to understand it?

import math
help(math.acosh)
Help on built-in function acosh in module math:

acosh(x, /)
    Return the inverse hyperbolic cosine of x.

# install packages using: pip
# Dimitris' protip: install packages using conda
help(math.sqrt)
Help on built-in function sqrt in module math:

sqrt(x, /)
    Return the square root of x.

import math

import math as m
from pprint import pprint
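The three import styles differ only in which names become available. A sketch using math.sqrt:

```python
import math              # use as math.sqrt
import math as m         # same module under a shorter name
from math import sqrt    # only the function, no module prefix needed

print(math.sqrt(16), m.sqrt(16), sqrt(16))
print(m is math)         # both names refer to the same module object
```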

Documentation and commenting your code

Remember help()?

Works because somebody else has documented their code!

def process_file(filename, chrom, pos):
    """
    Read a vcf file, search for lines matching
    chromosome chrom and position pos.

    Print the genotypes of the matching lines.
    """
    for line in open(filename):
        if not line.startswith('#'):
            col = line.split('\t')
            if col[0] == chrom and col[1] == pos:
                print(col[9:])  # assumed: genotype columns start at index 9 in a vcf line

help(process_file)
Help on function process_file in module __main__:

process_file(filename, chrom, pos)
    Read a vcf file, search for lines matching
    chromosome chrom and position pos.
    Print the genotypes of the matching lines.

Your code may have two types of users:

  • library users
  • maintainers (maybe yourself!)

Write documentation for both of them!

  • library users (docstrings):
    What does this function do?
  • maintainers (comments):
    # implementation details


  • At the beginning of the file

     """This module provides functions for..."""
  • For every function

    def make_list(x):
        """Returns a random list of length x."""


  • Wherever the code is hard to understand
my_list[5] += other_list[3]  # explain why you do this!
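A sketch that puts both kinds of documentation in one place. The body of make_list is an assumption, since the slide only shows its docstring:

```python
import random

def make_list(x):
    """Return a list of x random numbers between 0 and 1."""
    # Implementation detail for maintainers:
    # random.random() draws a float from the interval [0.0, 1.0)
    return [random.random() for _ in range(x)]

# The docstring is what help() shows to library users
help(make_list)
print(make_list.__doc__)
print(len(make_list(3)))
```

The comment inside the function never shows up in help(); it is only visible to someone reading the source.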


title = 'Toy Story'
rating = 10
print('The result is: ' + title + ' with rating: ' + str(rating))
The result is: Toy Story with rating: 10
# f-strings (since python 3.6)
print(f'The result is: {title} with rating: {rating}')
The result is: Toy Story with rating: 10
# format method
print('The result is: {} with rating: {}'.format(title, rating))
The result is: Toy Story with rating: 10
# the old way: %-formatting (common in python 2, still works in python 3)
print('The result is: %s with rating: %s' % (title, rating))
The result is: Toy Story with rating: 10

Exercise 2

pick_movie(year=1996, rating_min=8.5)
The Bandit
pick_movie(rating_max=8.0, genre="Mystery")
Twelve Monkeys
  • Notebook Day_4_Exercise_2


  • Pandas: a library for working with tabular data
  • Data analysis:
    • filter
    • transform
    • aggregate
    • plot
  • Main hero: the DataFrame type:


Creating a small DataFrame

import pandas as pd
df = pd.DataFrame({
    'age': [1,2,3,4],
    'circumference': [2,3,5,10],
    'height': [30, 35, 40, 50]
})
df
   age  circumference  height
0    1              2      30
1    2              3      35
2    3              5      40
3    4             10      50

Pandas can import data from many formats

  • pd.read_table: tab separated values .tsv
  • pd.read_csv: comma separated values .csv
  • pd.read_excel: Excel spreadsheets .xlsx

  • To write a data frame df to a file: df.to_csv() (pass sep='\t' for tab separated values), df.to_excel()
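In pandas the writer methods start with to_, for example to_csv and to_excel. A sketch of a round trip through tab separated text, using an in-memory buffer instead of a real file so that nothing is written to disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'age': [1, 2, 3, 4],
                   'circumference': [2, 3, 5, 10],
                   'height': [30, 35, 40, 50]})

# Write tab separated text; index=False leaves out the row index
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)

# Read it back from the start of the buffer
buffer.seek(0)
df_again = pd.read_csv(buffer, sep='\t')

print(buffer.getvalue())
print(df_again.equals(df))
```

With a real file, the buffer would be replaced by a file name such as 'my_trees.tsv' (a made-up name).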


Orange tree data

!cat ../downloads/Orange_1.tsv
age	circumference	height
1	2	30
2	3	35
3	5	40
4	10	50
df = pd.read_table('../downloads/Orange_1.tsv')
df
   age  circumference  height
0    1              2      30
1    2              3      35
2    3              5      40
3    4             10      50
  • One implicit index (0, 1, 2, 3)
  • Columns: age, circumference, height
  • Rows: one per data point, identified by their index

Selecting columns from a dataframe



df.columns
Index(['age', 'circumference', 'height'], dtype='object')
df[['height', 'age']]
   height  age
0      30    1
1      35    2
2      40    3
3      50    4
df['height']
0    30
1    35
2    40
3    50
Name: height, dtype: int64

Calculating aggregated summary statistics


df[['age', 'circumference']].describe()
            age  circumference
count  4.000000       4.000000
mean   2.500000       5.000000
std    1.290994       3.559026
min    1.000000       2.000000
25%    1.750000       2.750000
50%    2.500000       4.000000
75%    3.250000       6.250000
max    4.000000      10.000000
import math
df['radius'] = df['circumference'] / 2.0 / math.pi
   age  circumference  height    radius
0    1              2      30  0.318310
1    2              3      35  0.477465
2    3              5      40  0.795775
3    4             10      50  1.591549

Selecting rows from a dataframe by index



df.loc[1:2]  # rows with index labels 1 through 2 (both included)
   age  circumference  height    radius
1    2              3      35  0.477465
2    3              5      40  0.795775

Slightly bigger data frame of orange trees

!head -n 6 ../downloads/Orange.tsv
Tree	age	circumference
1	118	30
1	484	58
1	664	87
1	1004	115
1	1231	120
df = pd.read_table('../downloads/Orange.tsv') # , index_col=0)
df.iloc[0:5]  # can also use .head()
   Tree   age  circumference
0     1   118             30
1     1   484             58
2     1   664             87
3     1  1004            115
4     1  1231            120
df['Tree'].unique()
array([1, 2, 3])
type(pd.DataFrame({"genre": ['Thriller', 'Drama'], "rating": [10, 9]}).rating.iloc[0])
#young = df[df.age < 200]
df[df.age < 1000]
    Tree  age  circumference
0      1  118             30
1      1  484             58
2      1  664             87
7      2  118             33
8      2  484             69
9      2  664            111
14     3  118             30
15     3  484             51
16     3  664             75

Finding the maximum and then filter by it

df.loc[ df.age < 200 ]
    Tree  age  circumference
0      1  118             30
7      2  118             33
14     3  118             30
max_c = df.circumference.max()
df[df.circumference == max_c]
    Tree   age  circumference
12     2  1372            203
13     2  1582            203
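A sketch of the same idea on a handful of rows copied from the tables above, so that it runs without the data file. Comparing against the maximum keeps every row that reaches it, while idxmax gives only the first one:

```python
import pandas as pd

# A few rows copied from the Orange tree tables shown above
df = pd.DataFrame({'Tree': [1, 1, 2, 2, 3],
                   'age': [118, 1231, 1372, 1582, 664],
                   'circumference': [30, 120, 203, 203, 75]})

# All rows that reach the maximum circumference
max_c = df['circumference'].max()
widest = df[df['circumference'] == max_c]
print(widest[['Tree', 'age']])

# idxmax returns the index label of the first row holding the maximum
first = df.loc[df['circumference'].idxmax()]
print(first['Tree'], first['age'])
```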


small_df = pd.read_table('../downloads/Orange_1.tsv')
small_df.plot(x='age', y='height')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f43b912e0>


What if no plot shows up?

# in jupyter notebooks:
%pylab inline


# in a script:
import matplotlib.pyplot as plt
plt.show()  # call this after creating the plot

Plotting - many trees

  • Plot a bar chart
df[['circumference', 'age']].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4348c820>
df[['circumference', 'age']].plot(kind='bar', figsize=(12, 8), fontsize=16)
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f433e1ee0>


df.plot(kind="scatter", x="column_name", y="other_column_name")
df.plot(kind='scatter', x='age', y='circumference',
        figsize=(12, 8), fontsize=14)
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f43419a90>

Line plot

dataframe.plot(kind="line", x=..., y=...)
tree1 = df[df['Tree'] == 1]
tree1.plot(x='age', y='circumference',
           fontsize=14, figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f43295a00>

Line plot of all trees

  • Let's plot all the trees
    dataframe.plot(kind="line", x="..", y="...")
df.plot(kind='line', x='age', y='circumference',
        figsize=(12, 8), fontsize=14)
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f431f2c40>


import matplotlib.pyplot as plt
fig, ax = plt.subplots()

tree_names = df['Tree'].unique()
print('tree_names:', tree_names)

for tree_name in tree_names:
    sub_df = df[df['Tree'] == tree_name]
    sub_df.plot(x='age', y='circumference', kind='line',
                ax=ax, fontsize=14, figsize=(8,6))
tree_names: [1 2 3]

Exercise 5

  • Read the Orange_1.tsv

    • Print the height column
    • Print the data for the tree at age 2
    • Find the maximum circumference
    • What tree reached that circumference, and how old was it at that time?
  • Use Pandas to read IMDB

    • Explore it by making graphs
  • Extra exercises:

    • Read the pandas documentation :)
    • Look at seaborn for a more feature-rich plotting lib