Dictionaries and Error Handling

Draft Version 590 (Thu Dec 1 09:18:34 2005)

Problem: you want to count how often different people send messages to a mailing list
- Input: a file with one email address per line
  - Each address may appear many times
- Output: a table of addresses, sorted by frequency of appearance

Solution: keep track of counts in a list of pairs

Each entry in the list is [address, countSoFar]
Each time an address is read in, search the list, and either:
- Increment the count of an existing entry, or
- Create a new entry with a count of 1
Once everything has been read, sort the entries
- Note: you can pass your own comparison function to list.sort

import sys

def countFind(counts, address):
    for x in counts:
        if x[0] == address:
            return x
    return None

def countFreq(lines):
    counts = []
    for line in lines:
        line = line.strip()
        existing = countFind(counts, line)
        if existing:
            existing[1] += 1
        else:
            counts.append([line, 1])
    return counts

def compareByFrequency(a, b):
    if a[1] < b[1]:
        return 1
    if a[1] == b[1]:
        return 0
    return -1

if __name__ == '__main__':
    instream = open(sys.argv[1], 'r')
    lines = instream.readlines()
    instream.close()

    result = countFreq(lines)
    result.sort(compareByFrequency)

    for address, frequency in result:
        print frequency, ":", address

This is clumsy and inefficient
- Clumsy: writing a lot of code to do something that ought to be simple
- Inefficient: repeatedly searching the entire list
There's an easier way…
Comment on this slide

String Formatting

But before we start, let's tidy up a few loose ends
Python's % operator is used to format strings
- Left side is a string specifying how things are to be formatted
- Right side is the values to be formatted
  - One value on its own…
  - …or a tuple if there are many values
'here %s go' % 'we' creates a new string "here we go"
- The "%s" in the left string means “insert a string here”
- So the string on the right is inserted
- Since strings are immutable, this creates a new string (does not modify the format string in place)
'left %d right %d' % (-1, 1) creates "left -1 right 1"
- "%d" stands for “decimal integer”
- Format specifiers are matched with values left to right
- 'do %s %d times' % ('this', 99) creates "do this 99 times"
'[%4d]' % 13 creates "[ 13]"
- The integer between the "%" and "d" specifies a width
- '[%-4d]' % 13 creates "[13 ]", because negative widths mean “left justify”
'[%6.2f %%]' % 37.2 creates "[ 37.20 %]"
- "%6.2f" means “a floating point number, six characters wide, with two after the decimal point”
- "%%" is translated into a single "%" character
  - Just as "\\" is how you represent a single "\" in a string
'[%6.2e %%]' % 37.2 creates "[3.72e+001 %]"
- "%e" is scientific (exponential) notation
- Python will use more characters than it's been told to if it has to
Comment on this slide

Dictionaries

A sequence (such as a string or list) associates one value with each integer from 0 to N-1
A dictionary associates one value with each of its keys
- Also called maps, hashes, and associative arrays
- Keys don't have to be integers
  - In fact, they are usually strings
Often visualized as a two-column table
- Each key in the left column must be unique
- Values can occur several times
- Example planet names as keys, and the lengths of their years as values
  - Mercury 87.97
    Venus 224.70
    ⋮ ⋮
    Pluto 90550.0
Comment on this slide

The Mechanics

Create a dictionary by putting key/value pairs inside {}

{'Newton':1642, 'Darwin':1809}
Empty dictionary written {}
Get a value from an existing dictionary using [] (just like everything else in Python)

birthday = {
    'Newton' : 1642,
    'Darwin' : 1809
}
print "Darwin's birthday:", birthday['Darwin']
print "Newton's birthday:", birthday['Newton']

Darwin's birthday: 1809
Newton's birthday: 1642

Assigning to a dictionary key:

Creates a new entry if the key is not already in dictionary
Overwrites the previous value if the key is already present

birthday = {}
birthday['Darwin'] = 1809
birthday['Newton'] = 1942  # oops
birthday['Newton'] = 1642
print birthday

{'Darwin': 1809, 'Newton': 1642}

Can only access keys that are present

Just as you can't index elements of a list that aren't there

birthday = {
    'Newton' : 1642,
    'Darwin' : 1809
}
print birthday['Turing']

Traceback (most recent call last):
  File "key_error.py", line 5, in ?
    print birthday['Turing']
KeyError: 'Turing'

Remove an entry using del d[k]

Can only remove entries that are actually present

birthday = {
    'Newton' : 1642,
    'Darwin' : 1809,
    'Turing' : 1912
}

print 'Before deleting Turing:', birthday
del birthday['Turing']
print 'After deleting Turing:', birthday

Before deleting Turing: {'Turing': 1912, 'Newton': 1642, 'Darwin': 1809}
After deleting Turing: {'Newton': 1642, 'Darwin': 1809}

for k in d loops over the dictionary's keys (rather than its values)
- Note: this is different from lists, where for loops over the values, rather than the “keys” (indices)
- ```
birthday = {
    'Newton' : 1642,
    'Darwin' : 1809,
    'Turing' : 1912
}

for name in birthday:
  print name, birthday[name]
```
```
Turing 1912
Newton 1642
Darwin 1809
```

Test whether a key k is in a dictionary d using k in d

Once again, inconsistent with behavior of lists, but useful

birthday = {
    'Newton' : 1642,
    'Darwin' : 1809
}

for name in ['Newton', 'Turing']:
    if name in birthday:
        print name, birthday[name]
    else:
        print 'Who is', name, '?'

Newton 1642
Who is Turing ?

Comment on this slide

Dictionary Methods

Yes, dictionaries are objects too…

Method	Purpose	Example	Result
`clear`	Empty the dictionary.	`d.clear()`	Returns `None`, but `d` is now empty.
`get`	Return the value associated with a key, or a default value if the key is not present.	`d.get('x', 99)`	Returns `d['x']` if `"x"` is in `d`, or 99 if it is not.
`keys`	Return the dictionary's keys as a list. Entries are guaranteed to be unique.	`birthday.keys()`	`['Turing', 'Newton', 'Darwin']`
`items`	Return a list of (key, value) pairs.	`birthday.items()`	`[('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)]`
`values`	Return the dictionary's values as a list. Entries may or may not be unique.	`birthday.values()`	`[1912, 1642, 1809]`
`update`	Copy keys and values from one dictionary into another.	See the example below.

Table 10.1: Dictionary Methods

Example:

birthday = {
    'Newton' : 1642,
    'Darwin' : 1809,
    'Turing' : 1912
}

print 'keys:', birthday.keys()
print 'values:', birthday.values()
print 'items:', birthday.items()
print 'get:', birthday.get('Curie', 1867)

temp = {
    'Curie'    : 1867,
    'Hopper'   : 1906,
    'Franklin' : 1920
}
birthday.update(temp)
print 'after update:', birthday

birthday.clear()
print 'after clear:', birthday

keys: ['Turing', 'Newton', 'Darwin']
values: [1912, 1642, 1809]
items: [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)]
get: 1867
after update: {'Curie': 1867, 'Darwin': 1809, 'Franklin': 1920, 'Turing': 1912, 'Newton': 1642, 'Hopper': 1906}
after clear: {}

Comment on this slide

Counting Frequency

Revisit the problem of counting how often things appear in a file

Use the things (email addresses) as keys in a dictionary
The value associated with each key is the number of times it has been seen so far

# Data to count.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']

# Build a dictionary of frequencies.
freq = {}
for name in names:

    # Already seen, so increment count by one.
    if name in freq:
        freq[name] = freq[name] + 1

    # Never seen before, so add to dictionary.
    else:
        freq[name] = 1

# Display.
print freq

{'Be': 3, 'Mg': 3, 'Ca': 1}

Figure 10.1: Counting Frequency

Note: dictionaries do not store keys in order
- No way to predict where a dictionary will insert a new value
- Paradoxically, this is part of what makes dictionaries so fast

So, to print frequency in alphabetic order by address:

Get the list of keys
Sort that list
Loop over it

# Build the frequency dictionary as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
    if name in freq:
        freq[name] = freq[name] + 1
    else:
        freq[name] = 1

# Print in alphabetical order by key.
keys = freq.keys()
keys.sort()
for k in keys:
    print k, freq[k]

Be 3
Ca 1
Mg 3

Can simplify this code using dict.get

Get either the count associated with the key, or 0, then add one to it

# Use 'get' to simplify the counting loop.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
    freq[name] = freq.get(name, 0) + 1

# Print.
keys = freq.keys()
keys.sort()
for k in keys:
    print k, freq[k]

Be 3
Ca 1
Mg 3

But how to print in order of frequency?

Need to invert the dictionary (i.e., swap keys and values)
Problem: there might be collisions, since values aren't guaranteed to be unique
- What is the inverse of {'a':1, 'b':1, 'c':1}?
Solution: store a list of values instead of just a single value

# Count values as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
    freq[name] = freq.get(name, 0) + 1

# Invert.
inverse = {}
for (key, value) in freq.items():
    seen = inverse.get(value, [])
    seen.append(key)
    inverse[value] = seen

# Print.
keys = inverse.keys()
keys.sort()
for k in keys:
    print k, inverse[k]

1 ['Ca']
3 ['Be', 'Mg']

Figure 10.2: Tracing Execution of Dictionary Inversion

Comment on this slide

Formatting Strings with Dictionaries

Complex string formatting can be hard to understand
- Especially if one value needs to be used several times
- E.g., producing an email form letter, and need to put the person's name in three places
Instead of a tuple, "%" can take a dictionary as its right argument
- Use "%(varname)s" or "%(varname)4d" inside format string to identify what's to be substituted
- '%(word)s, t%(word)s, and everyw%(word)s' % {'word' : 'here'} creates "here, there, and everywhere"
Comment on this slide

Catching Errors

Like other modern languages, Python uses exceptions to handle errors
- Separating normal (expected) operation from exceptional cases makes code easier to read
Structured like if/else
- Introduce a block of statements with try
- Then introduce error handling code with except
- When something goes wrong in the try block, Python raises an exception
- Python finds the matching except block, and executes whatever instructions it contains
- ```
print 'before try/except'

try:
    print '..first line of try block'
    print 1/0
    print '..last line of try block'
except:
    print '..handling error!'

print 'after try/except'
```
```
before try/except
..first line of try block
..handling error!
after try/except
```
- Figure 10.3: Flow of Control in Try/Except

You can include code to execute when there isn't an error using else

for val in (0.0, 1.0):
    print "val is", val
    try:
        x = 1.0/val
    except:
        print '..handling error!'
    else:
        print '...in the else block'

val is 0.0
..handling error!
val is 1.0
...in the else block

Comment on this slide

Exception Objects

When an error occurs, Python actually creates an object to store information about it
- Typically an error message

Can choose to handle specific errors by specifying an exception type in the except statement

E.g., handle division by zero, but not out-of-bounds list index
Syntax is except ExceptionType, variable
- ExceptionType is what you want to catch
- If something of that type is raised, the exception object is assigned to variable

values = [-1.0, 0.0, 1.0]
for i in range(4):    # note: top index will be out of bounds
    try:
        x = 1.0 / values[i]
    except ZeroDivisionError, e:
        print 'divide by zero:', e

divide by zero: float division

Traceback (most recent call last):
  File "except_with_type.py", line 4, in ?
    x = 1.0 / values[i]
IndexError: list index out of range

You can associate any number of except blocks with a single try
- They're tested in order—whichever matches first, wins
- If a “naked” except appears, it must be the last one (since it catches everything)
  - Better to use except Exception, e so that you have the exception object

Exceptions are organized in a hierarchy

E.g., ZeroDivisionError, OverflowError, and FloatingPointError are all specialized versions of ArithmeticError
And IndexError (list/string index out of bounds) and KeyError (non-existent dictionary key) are specializations of LookupError

for i in range(2):
    try:
        if i == 0:
            vals = ['a', 'b']
            x = vals[99]
        else:
            vals = {20 : 'a', 30 : 'b'}
            x = vals[40]
    except KeyError, e:
        print 'loop %d, handling key errors:' % i, e
    except LookupError, e:
        print 'loop %d, handling generic lookup errors:' % i, e

loop 0, handling generic lookup errors: list index out of range
loop 1, handling key errors: 40

We'll see in a later lecture how this hierarchy is implemented
- Hint: it has something to do with objects…

Common exception types

Name	Purpose
`Exception`	Root of exception hierarchy.
`ArithmeticError`	Illegal arithmetic operation
`IndexError`	Bad index to sequence (out of bounds or illegal type)
`KeyError`	Nonexistent index to dictionary
`TypeError`	Illegal type (e.g., trying to add integer and string)
`ValueError`	Illegal value (e.g., `math.sqrt(-1)`)
`IOError`	Unable to create or open file, read data, etc.

Table 10.2: Common Exception Types

Comment on this slide

Functions and Exceptions

So if a function raises an exception, who gets to handle it?

Each time Python enters a try/except block, it pushes the except handlers on a stack
- Just like the function call stack
When an exception is raised, Python look in this stack, and jumps to the first matching handler

def invert(vals, index):
    try:
        vals[index] = 1/vals[index]
    except ArithmeticError, e:
        print 'inner exception handler:', e

def each(func, vals, indices):
    try:
        for i in indices:
            func(vals, i)
    except IndexError, e:
        print 'outer exception handler:', e

# Note: top index will be out of bounds
allValues = [-1, 0, 1]
allIndices = [0, 1, 2, 3]
each(invert, allValues, allIndices)

calling func for index 0
calling func for index 1
inner exception handler: integer division or modulo by zero
calling func for index 2
calling func for index 3
outer exception handler: list index out of range

Figure 10.4: Stacking Exception Handlers

Comment on this slide

Raising Exceptions

Use raise to trigger exception processing

Obviously, have to specify the type of exception you're raising
E.g., raise Exception('this is an error message')
- Please make your error messages more informative…

for i in range(4):
    try:
        # Raise exception if value is odd.
        if (i % 2) == 1:
            raise ValueError('index is odd')
        else:
            print 'not raising exception for %d' % i
    except ValueError, e:
        print 'caught exception for %d' % i, e

not raising exception for 0
caught exception for 1 index is odd
not raising exception for 2
caught exception for 3 index is odd

Raise exceptions to report errors in functions
- Rather than returning None or -1 or something like that
- list.find (which returns -1 if something can't be found) is unusual in this respect
  - But it's more convenient than list.index, which throws an exception

General rule: throw low, catch high

I.e., throw lots of very specific exceptions, but only catch them in a few places, high in your code, where you can take corrective action
Reason #1: every application handles errors differently
- If someone is using your library in a GUI, you don't want to be printing to stderr
Reason #2: error handling code gets in the way of understanding the algorithm, so disentangle the two

import sys

# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
    infile = open(filename, 'r')
    lines = infile.readlines()
    infile.close()
    for line in lines:
        if line.find(word) >= 0:
            return line.rstrip()
    errMsg = '%s does not contain %s' % (filename, word)
    raise ValueError(errMsg)

if __name__ == '__main__':
    try:
        word = sys.argv[1]
        for filename in sys.argv[2:]:
            line = findFirstMatchingLine(filename, word)
            print '%s / %s => %s' % (filename, word, line)
    except ValueError, e:
        print >> sys.stderr, e
    except IOError, e:
        print >> sys.stderr, e
    except Exception, e:
        print >> sys.stderr, e

import sys

# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
    infile = open(filename, 'r')
    lines = infile.readlines()
    infile.close()
    for line in lines:
        if line.find(word) >= 0:
            return line.rstrip()
    errMsg = '%s does not contain %s' % (filename, word)
    raise ValueError(errMsg)

if __name__ == '__main__':
    try:
        word = sys.argv[1]
        for filename in sys.argv[2:]:
            line = findFirstMatchingLine(filename, word)
            print '%s / %s => %s' % (filename, word, line)
    except ValueError, e:
        print >> sys.stderr, e
    except IOError, e:
        print >> sys.stderr, e
    except Exception, e:
        print >> sys.stderr, e

Comment on this slide

Assertions

An assertion is a claim about the state of a program that must be true if the program is correct
- E.g., “this list is not empty”, or “this value is greater than that one”
Three kinds of assertions:
- A pre-condition is something that must be true in order for something else to work
  - E.g., “argument is non-negative” is a pre-condition for math.sqrt
- A post-condition is something that is guaranteed to be true after something else happens
  - E.g., “result is either -1 or a legal index to the string” is a post-condition of string.find
- Anything else is simply called a state assertion
  - E.g., “this loop always executes at least once”

Embed assertions in programs using the assert statement

First argument is a Boolean condition
Second (optional) is an error message
If the condition isn't satisfied, Python raises an AssertionError exception

def findRange(values):
    assert (values != None) and (len(values) > 0), 'Illegal input'
    left = min(values)
    right = max(values)
    assert left < right, 'Empty range'
    return (left, right)

Good programmers practice defensive programming by putting lots of assertions in their code
- “Executable documentation”
  - Describes how the program is supposed to behave, in a way that the computer can check
  - Can't ever be out of step with the actual code, since it is code
- Fail early, fail often
  - The less time that elapses between when an error occurs, and when its effects show up, the easier it is to fix
- Bugs grow up to be assertions
  - If you made an error once, there's a good chance you'll make it again
  - Adding an assertion helps you find and fix it faster the second time
Assertions should only be used to check the program's correctness, not the correctness of its inputs
- I.e., don't use assert to check that your program managed to open a file
- It isn't the program's fault if it doesn't have read permission…
Complex assertions are more likely to contain bugs than to catch them
- If it won't fit comfortably on one line, it probably shouldn't be an assertion
Comment on this slide

Running Other Programs

One program can run another by spawning a new process
- Make and the shell both do this
- So can your programs
os.popen runs a command, and lets you either:
- Write to its standard input, or
- Read from its standard output
- Figure 10.5: Running Another Program

Example: read the output of the ls command into your program

import os

instream = os.popen('ls', 'r')
lines = instream.readlines()
instream.close()
for line in lines[:4]:
    line = line.rstrip()
    print '%4d: %s' % (len(line), line)

  13: addresses.txt
  13: assertions.py
  14: create_dict.py
  18: create_dict.py.out

Example: send the output of your program to gzip for compression:

import os

# Create compressed output and report its size.
outstream = os.popen('gzip -c > temp.txt.gz', 'w')
for i in range(1000):
    for word in 'this is a test'.split():
        print >> outstream, word
outstream.close()
print 'output is %d bytes' % os.stat('temp.txt.gz').st_size

# See how big it would have been.
instream = os.popen('gunzip -c temp.txt.gz', 'r')
total = 0
for line in instream.readlines():
    total += len(line)
instream.close()
print 'output would have been %d bytes' % total

output is 79 bytes
output would have been 15000 bytes

You can use os.popen2 to connect to the other program's input and output simultaneously

import os

childIn, childOut = os.popen2('sort')
for word in 'this is a test'.split():
    print >> childIn, word
childIn.close()
for line in childOut.readlines():
    print line.rstrip()
childOut.close()

a
is
test
this

But you must be very careful about deadlock
- If the child is trying to write to the parent while the parent is trying to write to the child, both are blocked
  - The operating system can only hold on to a limited amount of input or output data at a time, so increasing the buffer size only delays the problem
  - Figure 10.6: Deadlock
- Things are even trickier if you use os.popen3, which gives you the child process's stdin, stdout, and stderr
  - If the parent is trying to read from the child's stdout while it's trying to write an error message to stderr, both programs will be blocked
Solutions are outside the scope of this course
- Use os.popen (with temporary files if necessary) until you're comfortable enough to move on
Comment on this slide

Exercises

Exercise 10.1:

Suppose you wanted to sort entries with the same frequency alphabetically. What changes would you have to make to compareByFrequency?

Mercury	87.97
Venus	224.70
⋮	⋮
Pluto	90550.0