Introduction to Python with Application to Bioinformatics

- Day 3

Review Day 2

  • Give an example of a tuple
  • What is the difference between a tuple and a list?
  • How would you approach a complicated coding task?
  • What is the difference in syntax between a function and a method?
  • Calculate the average of the list [1,2,3.5,5,6.2] to one decimal
  • Take the list ['i','know','python'] as input and output the string 'I KNOW PYTHON'

Tuples

Give an example of a tuple:

In [3]:
myTuple = (1,2,3,'a','b',[4,5,6])
myTuple
Out[3]:
(1, 2, 3, 'a', 'b', [4, 5, 6])

What is the difference between a tuple and a list?
A tuple is immutable while a list is mutable
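
To make the difference concrete, here is a minimal sketch (the variable names are only for illustration): changing an element works for a list, but raises a TypeError for a tuple.

```python
my_list = [1, 2, 3]
my_list[0] = 99              # lists are mutable: this works
print(my_list)

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99         # tuples are immutable: this raises TypeError
except TypeError as err:
    error_message = str(err)
    print('Cannot change a tuple:', error_message)
```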

How to structure code

  • Decide on what output you want
  • What input files do you have?
  • How is the input structured, can you iterate over it?
  • Where is the information you need located?
  • Do you need to save a lot of information while iterating?
    • Lists are good for ordered data
    • Sets are good for non-duplicate single entry information
    • Dictionaries are good for a lot of structured information
  • When you have collected the data needed, decide on how to process it
  • Are you writing your results to a file?

Always start with writing pseudocode!

Functions and methods

What is the difference in syntax between a function and a method?
functionName()     <object>.methodName()
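
As a small illustration of the two call styles (the variable names are made up for this sketch): a function is called by name with the object as an argument, while a method is called on the object with a dot.

```python
words = ['i', 'know', 'python']

number_of_words = len(words)     # function: functionName(object)
sentence = ' '.join(words)       # method:   <object>.methodName(argument)
shouted = sentence.upper()       # method:   <object>.methodName()

print(number_of_words)
print(shouted)
```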

Calculate the average of the list [1,2,3.5,5,6.2] to one decimal

In [4]:
myList = [1,2,3.5,5,6.2]
round(sum(myList)/len(myList),1)
Out[4]:
3.5

Take the list ['i','know','python'] as input and output the string 'I KNOW PYTHON'

In [5]:
' '.join(['i','know','python']).upper()
Out[5]:
'I KNOW PYTHON'

Day 3

  • Sets
  • Dictionaries
  • Functions
  • sys.argv

IMDb

Find the number of genres

Answer

Watch out for the upper/lower cases!

The correct answer is 22

In [1]:
fh     = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genres = []

for line in fh:
    if not line.startswith('#'):
        cols  = line.strip().split('|')
        genre = cols[5].strip()
        glist = genre.split(',')
        for entry in glist:
            if entry.lower() not in genres:   # only add genre if not already in list
                genres.append(entry.lower())   
fh.close()
print(genres)
print(len(genres))
['drama', 'war', 'adventure', 'comedy', 'family', 'animation', 'biography', 'history', 'action', 'crime', 'mystery', 'thriller', 'fantasy', 'romance', 'sci-fi', 'western', 'musical', 'music', 'historical', 'sport', 'film-noir', 'horror']
22

New data type: set

  • A set is an unordered collection of unique objects; the objects themselves must be immutable (hashable)

Syntax:
For empty set:
setName = set()

For populated sets:
setName = {1,2,3,4,5}

Common operations on sets

set.add(a)
len(set)
a in set

In [7]:
x = set()
x.add(100)
x.add(25)
x.add(3)
x.add('3.0')
#for i in x:
#    print(type(i))
type(x) 
##mySet = {2,5,1,3}
#mySet.add(5)
#mySet.add(4)
#print(mySet)
Out[7]:
set
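
The three common operations can be tried together in a short sketch, based on the commented-out lines in the cell above. Adding a value that is already present leaves the set unchanged.

```python
my_set = {2, 5, 1, 3}
my_set.add(5)            # 5 is already in the set, so nothing changes
my_set.add(4)            # 4 is new, so the set grows by one

print(len(my_set))       # number of unique entries
print(3 in my_set)       # membership test
print(sorted(my_set))    # sets are unordered, sort them for a stable display
```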

Find the number of genres

Modify your code to use sets

In [9]:
fh     = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genres = set()

for line in fh:
    if not line.startswith('#'):
        cols  = line.strip().split('|')
        genre = cols[5].strip()
        glist = genre.split(',')      
        for entry in glist:
            genres.add(entry.lower())    # set only adds entry if not already in
fh.close()
print(len(genres))
sorted(list(genres))
22
Out[9]:
['action',
 'adventure',
 'animation',
 'biography',
 'comedy',
 'crime',
 'drama',
 'family',
 'fantasy',
 'film-noir',
 'historical',
 'history',
 'horror',
 'music',
 'musical',
 'mystery',
 'romance',
 'sci-fi',
 'sport',
 'thriller',
 'war',
 'western']

IMDb

How to find the number of movies per genre?

... Hm, starting to be difficult now...

New data type: dictionary

  • A dictionary is a mapping of unique keys to values
  • Dictionaries are mutable

Syntax:
a = {} (create empty dictionary)
d = {'key1':1, 'key2':2, 'key3':3}

In [10]:
myDict = {'drama': 4,
          'thriller': 2,
          'romance': 5}
myDict
Out[10]:
{'drama': 4, 'romance': 5, 'thriller': 2}

Operations on Dictionaries

In [11]:
myDict = {'drama': 4,
          'thriller': 2,
          'romance': 5}
len(myDict)
myDict['drama']
myDict['horror'] = 2
#myDict
#del myDict['horror']
#myDict
'drama' in myDict
myDict.keys()
myDict.items()
myDict.values()
Out[11]:
dict_values([4, 2, 5, 2])
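
A notebook cell only displays the value of its last line, so the cell above shows just the result of myDict.values(). A sketch that prints every operation separately could look like this:

```python
my_dict = {'drama': 4,
           'thriller': 2,
           'romance': 5}

print(len(my_dict))              # number of key/value pairs
print(my_dict['drama'])          # look up the value for a key
my_dict['horror'] = 2            # add a new key with a value
print('drama' in my_dict)        # check whether a key exists
print(list(my_dict.keys()))      # all keys
print(list(my_dict.values()))    # all values
print(list(my_dict.items()))     # all (key, value) pairs
del my_dict['horror']            # remove a key and its value
print(len(my_dict))
```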

Exercise

In [ ]:
myDict = {'drama': 182, 
          'war': 30, 
          'adventure': 55, 
          'comedy': 46, 
          'family': 24, 
          'animation': 17, 
          'biography': 25}
  • How many entries are there in this dictionary?
  • How do you find out how many movies are in the genre 'comedy'?
  • You're not interested in biographies, delete this entry
  • You are however interested in fantasy, add that we have 29 movies of the genre fantasy to the dictionary
  • What genres are listed in this dictionary?
  • You remembered another comedy movie, increase the number of comedies by one
In [ ]:
 

Find the number of movies per genre

Hint! If the genre is not already in the dictionary, you have to add it first

Answer

In [ ]:
fh        = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genreDict = {}     # create empty dictionary

for line in fh:
    if not line.startswith('#'):
        cols  = line.strip().split('|')
        genre = cols[5].strip()
        glist = genre.split(',')
        for entry in glist:
            if entry.lower() not in genreDict: # if genre is not yet in dictionary, start its count at 1
                genreDict[entry.lower()] = 1
            else:
                genreDict[entry.lower()] += 1   # if genre is in dictionary, increase count by 1
fh.close()
print(genreDict)
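
The same counting pattern can be written more compactly with dict.get(), which returns a default value when the key is missing. The sketch below is one possible variant, not the course's official answer. The sample lines are invented for illustration and only mimic the '|'-separated layout with the genres in column 5.

```python
# Invented sample lines, NOT real entries from 250.imdb
sample_lines = [
    '# Votes | Rating | Year | Runtime | URL | Genres | Title',
    '100|8.5|1957|5280|url1|Drama,War|Movie A',
    '200|8.2|1925|4320|url2|Adventure,Comedy,Drama|Movie B',
    '300|8.0|2000|6000|url3|drama|Movie C',
]

def count_genres(lines):
    """Return a dictionary mapping each genre (lower case) to its movie count."""
    counts = {}
    for line in lines:
        if line.startswith('#'):
            continue                                   # skip comment lines
        cols = line.strip().split('|')
        for entry in cols[5].strip().split(','):
            key = entry.strip().lower()
            counts[key] = counts.get(key, 0) + 1       # 0 if the genre is new
    return counts

print(count_genres(sample_lines))
```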

What is the average length of the movies (hours and minutes) in each genre?

Answer

Tip!
Here you have to loop twice

In [ ]:
fh        = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genreDict = {}

for line in fh:
    if not line.startswith('#'):
        cols    = line.strip().split('|')
        genre   = cols[5].strip()
        glist   = genre.split(',')
        runtime = cols[3]      # length of movie in seconds
        for entry in glist:
            if not entry.lower() in genreDict:
                genreDict[entry.lower()] = [int(runtime)]   # add a list with the runtime
            else:
                genreDict[entry.lower()].append(int(runtime))   # append runtime to existing list
fh.close()
                
for genre in genreDict:      # loop over the genres in the dictionary
    average = sum(genreDict[genre])/len(genreDict[genre])  # calculate average length per genre
    hours   = int(average/3600)                            # whole hours in the average
    minutes = (average - (3600*hours))/60                  # remaining seconds converted to minutes
    print('The average length for movies in genre '+genre\
          +' is '+str(hours)+'h'+str(round(minutes))+'min')

NEW TOPIC: Functions

A lot of ugly formatting for calculating hours and minutes from seconds...

In [12]:
def FormatSec(genre):   # input a genre; its list of runtimes (in seconds) is looked up in genreDict
    average   = sum(genreDict[genre])/len(genreDict[genre])
    hours     = int(average/3600)
    minutes   = (average - (3600*hours))/60   
    return str(hours)+'h'+str(round(minutes))+'min'


fh        = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genreDict = {}

for line in fh:
    if not line.startswith('#'):
        cols    = line.strip().split('|')
        genre   = cols[5].strip()
        glist   = genre.split(',')
        runtime = cols[3]      # length of movie in seconds
        for entry in glist:
            if not entry.lower() in genreDict:
                genreDict[entry.lower()] = [int(runtime)]   # add a list with the runtime
            else:
                genreDict[entry.lower()].append(int(runtime))   # append runtime to existing list
fh.close()
                
for genre in genreDict:
    print('The average length for movies in genre '+genre\
          +' is '+FormatSec(genre))
The average length for movies in genre drama is 2h14min
The average length for movies in genre war is 2h30min
The average length for movies in genre adventure is 2h13min
The average length for movies in genre comedy is 1h53min
The average length for movies in genre family is 1h44min
The average length for movies in genre animation is 1h40min
The average length for movies in genre biography is 2h30min
The average length for movies in genre history is 2h47min
The average length for movies in genre action is 2h18min
The average length for movies in genre crime is 2h11min
The average length for movies in genre mystery is 2h3min
The average length for movies in genre thriller is 2h11min
The average length for movies in genre fantasy is 2h2min
The average length for movies in genre romance is 2h2min
The average length for movies in genre sci-fi is 2h6min
The average length for movies in genre western is 2h11min
The average length for movies in genre musical is 1h57min
The average length for movies in genre music is 2h24min
The average length for movies in genre historical is 2h38min
The average length for movies in genre sport is 2h17min
The average length for movies in genre film-noir is 1h43min
The average length for movies in genre horror is 1h59min
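
FormatSec above reads the global genreDict, so it only works while that dictionary exists. A possible variant, sketched here with the hypothetical name format_average, takes the list of runtimes as a parameter instead. This makes the function usable with any list of seconds.

```python
def format_average(runtimes):
    """Return the average of a list of runtimes (in seconds) as e.g. '2h15min'."""
    average = sum(runtimes)/len(runtimes)
    hours   = int(average/3600)                # whole hours
    minutes = (average - (3600*hours))/60      # remaining seconds as minutes
    return str(hours)+'h'+str(round(minutes))+'min'

# Made-up runtimes for illustration, not taken from 250.imdb
example_runtimes = [7200, 9000]
print(format_average(example_runtimes))
```

In the loop over genreDict, the call would then become format_average(genreDict[genre]).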

Function structure

In [13]:
def addFive(number):    
    final = number + 5
    return final

addFive(4)
Out[13]:
9
In [14]:
from datetime import datetime

def whatTimeIsIt():
    time = 'The time is: ' + str(datetime.now().time())
    return time

whatTimeIsIt()
Out[14]:
'The time is: 19:16:35.696575'
In [15]:
def addFive(number):
    final = number + 5
    return final

addFive(4)
#final

final = addFive(4)
final
Out[15]:
9

Scope

  • Variables within functions
  • Global variables
In [16]:
def someFunction():
#    s = 'a string'
    print(s)
    
s = 'another string'
someFunction()
print(s)
another string
another string
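
The example above shows a function reading a global variable. The opposite direction does not work: a variable created inside a function is local and cannot be reached from outside. A minimal sketch with made-up names:

```python
def make_greeting():
    greeting = 'hello'      # local variable, exists only inside the function
    return greeting

result = make_greeting()    # the returned value can be stored under a new name
print(result)

try:
    print(greeting)         # the local name is not defined out here
    local_leaked = True
except NameError:
    local_leaked = False
    print('greeting is not defined outside the function')
```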

Why use functions?

  • Cleaner code
  • Better defined tasks in code
  • Re-usability
  • Better structure

Importing functions

  • Collect all your functions in another file
  • Keeps main code cleaner
  • Easy to use across different code

Example:

  1. Create a file called myFunctions.py, located in the same folder as your script
  2. Put a function called formatSec() in the file
  3. Start writing your code in a separate file and import the function
In [17]:
from myFunctions import formatSec

seconds = 32154

formatSec(seconds)
Out[17]:
'8h56min'
In [18]:
from myFunctions import  formatSec, toSec

seconds = 21154
print(formatSec(seconds))

days    = 0
hours   = 21
minutes = 56
seconds = 45

print(toSec(days, hours, minutes, seconds))
5h53min
79005s

myFunctions.py

Drawing
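
The contents of the file are only available as an image. The sketch below is an assumption about what myFunctions.py could contain. The function bodies are reconstructed from the calls and outputs shown above and follow the same hours/minutes logic as FormatSec.

```python
# myFunctions.py (hypothetical contents, reconstructed from the calls above)

def formatSec(seconds):
    """Format a number of seconds as a string such as '8h56min'."""
    hours   = int(seconds/3600)                # whole hours
    minutes = (seconds - (3600*hours))/60      # remaining seconds as minutes
    return str(hours)+'h'+str(round(minutes))+'min'

def toSec(days, hours, minutes, seconds):
    """Convert days, hours, minutes and seconds to total seconds, e.g. '79005s'."""
    total = days*86400 + hours*3600 + minutes*60 + seconds
    return str(total)+'s'
```

Because the minutes are rounded, a value just below a full hour can be displayed as 60min. For the inputs used above this does not occur.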

Summary

  • A function is a block of organized, reusable code that is used to perform a single, related action
  • Variables within a function are local variables
  • Functions can be organized in separate files and imported to the main code

→ Notebook Day_3_Exercise_1 (~30 minutes)

NEW TOPIC AGAIN: sys.argv

  • Avoid hardcoding the filename in the code
  • Easier to re-use code for different input files
  • Uses command-line arguments
  • Input is a list of strings:
    • Position 0: the program name
    • Position 1: the first argument

The `sys.argv` list

Python script called print_argv.py:

Drawing

Running the script with command line arguments as input:

Drawing
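
The script itself is only shown as an image. A minimal sketch of what print_argv.py might look like is given below. The helper name describe_args is an assumption made for this example.

```python
import sys

def describe_args(argv):
    """Return one line of text per command-line argument."""
    lines = []
    for position, value in enumerate(argv):
        lines.append('Position ' + str(position) + ': ' + value)
    return lines

# sys.argv is a list of strings; position 0 is the program name
for line in describe_args(sys.argv):
    print(line)
```

Run from a terminal as, for example, python print_argv.py input.txt output.txt, this would list the program name at position 0, followed by the two arguments at positions 1 and 2.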

Instead of:

Drawing

do:

Drawing

Run with:

Drawing
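
Since the three code images are not available, the idea can be sketched as follows. The script name count_movies.py and the function count_movies are hypothetical. The point is only that the filename comes from sys.argv[1] instead of being written into the code.

```python
import sys

def count_movies(filename):
    """Count the lines in the file that are not comment lines starting with '#'."""
    count = 0
    with open(filename, 'r', encoding='utf-8') as fh:
        for line in fh:
            if not line.startswith('#'):
                count += 1
    return count

# Instead of hardcoding: count_movies('../downloads/250.imdb')
# take the filename from the command line.
if __name__ == '__main__':
    if len(sys.argv) == 2:
        print(count_movies(sys.argv[1]))
    else:
        print('Usage: python count_movies.py <input file>')
```

It could then be run with python count_movies.py ../downloads/250.imdb, and the same script works unchanged for any other input file.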

IMDb

Re-structure and write the output to a new file as below

Drawing

Note:

  • Use a text editor, not notebooks for this
  • Use functions as much as possible
  • Use sys.argv for input/output

Answer - Example

Drawing

Run with: Drawing