A Mini-Project

Draft Version 398 (Thu Dec 1 09:18:46 2005)

We have been talking since the first lecture about the importance of tools, and about building your own to automate repetitive tasks
This lecture takes a look at some of the tools used to manage the raw material for this course
- Show technologies like XML and regular expressions in action
- Show how to design something simple, and grow it over time
Starting point: lecture slides are written in a simple XML format
- Root of each lecture is a <lec/> element with title and id attributes
- May contain one or more <topic/> elements
  - Must have title attribute
  - May optionally have a summary attribute (used to construct the syllabus)
- Topics contain one or more <slide/> elements
  - These contain <b1/> (for “bullet level 1”), which contain <b2/>, and so on
  - May also contain tables, images, and code fragments
Our task is to validate these files to make sure that:
- They contain only printable characters
- Tabs haven't been used for indentation
- The ID in the root <lec/> element matches the filename
- All the external files the lecture references (such as images and sample code) exist
Solution: write a command-line utility that:
- Takes a list of filenames, along with
- Some command-line flags specifying which checks to run (or omit), and
- Reports any problems
You probably won't ever need to do this for lecture slides…
- …but at some point, you probably will want to check the integrity of data files, experimental results, etc.
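
To make the file format described above concrete, here is a minimal sketch that parses a made-up lecture document and pulls out the lecture ID and topic titles. The sample text and the name describe_lecture are invented for illustration; only the element and attribute names come from the description above. The sketch is written for Python 3, whereas the listings in these notes use Python 2 syntax.

```python
import xml.dom.minidom

# Hypothetical lecture contents: the element and attribute names follow
# the format described in the notes, but the text itself is invented.
SAMPLE = '''<lec title="A Mini-Project" id="miniproject">
  <topic title="Checking for Tabs" summary="Finding tabs in files">
    <slide>
      <b1>Open each file in turn
        <b2>Check each line for tabs</b2>
      </b1>
    </slide>
  </topic>
</lec>'''

def describe_lecture(text):
    '''Return (lecture id, list of topic titles) for a lecture document.'''
    doc = xml.dom.minidom.parseString(text)
    root = doc.documentElement
    topics = [t.getAttribute('title')
              for t in root.getElementsByTagName('topic')]
    return root.getAttribute('id'), topics

lecture_id, topic_titles = describe_lecture(SAMPLE)
print(lecture_id, topic_titles)
```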

Checking for Tabs

Start with the simplest task: checking for tabs in files

Open each file in turn
Check each line for tabs
- Note: could also read entire contents of file into a string, and search for tabs in that
If any found, print message

#!/usr/bin/env python

'''Check for tabs in one or more files.'''

import sys

def checkTabs(filename):
    '''Look for tabs.'''
    infile = open(filename, 'r')
    for line in infile.readlines():
        if line.find('\t') >= 0:
            print '%s contains tabs' % filename
            break
    infile.close()

if __name__ == '__main__':
    for filename in sys.argv[1:]:
        checkTabs(filename)
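
The note above mentions an alternative: read the whole file into one string and search that. A minimal sketch of that variant follows, assuming Python 3 and files small enough to fit in memory. The function names contains_tabs and check_tabs_whole_file are illustrative and not part of the course tools.

```python
def contains_tabs(text):
    '''Report whether a string contains at least one tab character.'''
    return '\t' in text

def check_tabs_whole_file(filename):
    '''Read the entire file into memory and search it for tabs.'''
    with open(filename, 'r') as infile:
        text = infile.read()
    if contains_tabs(text):
        print('%s contains tabs' % filename)
        return True
    return False
```

The line-by-line version can stop reading as soon as it finds a tab. This one always reads the whole file, but the search itself is a single expression.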

Great—except it only works on files
- Can't pipe things to it, because it doesn't know how to read standard input

Implement the standard Unix convention: if no filenames provided, read from standard input

Hm…sys.stdin is an already-open file, not a filename
Change signature of checkTabs to take both the filename, and an open file
- Move the open and close to the main body

#!/usr/bin/env python

'''Check for tabs in one or more files, or on standard input.'''

import sys

def checkTabs(filename, infile):
    '''Look for tabs.'''
    for line in infile.readlines():
        if line.find('\t') >= 0:
            print '%s contains tabs' % filename
            break

if __name__ == '__main__':
    if len(sys.argv) == 1:
        checkTabs('<stdin>', sys.stdin)
    else:
        for filename in sys.argv[1:]:
            infile = open(filename, 'r')
            checkTabs(filename, infile)
            infile.close()
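
One side effect of passing an open file rather than a filename is that the function can be exercised without touching the disk, since any object that yields lines will do. The sketch below assumes Python 3 and uses an in-memory buffer. The name check_tabs and the return value are additions for illustration, as the version above only prints.

```python
import io

def check_tabs(filename, infile):
    '''Return True (and print a message) if any line of infile has a tab.'''
    for line in infile:
        if '\t' in line:
            print('%s contains tabs' % filename)
            return True
    return False

# An in-memory buffer stands in for an open file or for sys.stdin.
has_tabs = check_tabs('<memory>', io.StringIO('first line\n\tindented with a tab\n'))
no_tabs = check_tabs('<memory>', io.StringIO('no tabs here\n'))
```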

Great—except it doesn't report errors like missing or unreadable files
- Printing a stack trace doesn't count

Fix by wrapping the code in an exception handler

Only catch the kinds of exceptions we think are reasonable to expect
- In this case, IOError
Don't want the error handling to mask errors that we didn't anticipate

#!/usr/bin/env python

'''Check for tabs in one or more files, or on standard input, and
report errors.'''

import sys

def checkTabs(filename, infile):
    '''Look for tabs.'''
    for line in infile.readlines():
        if line.find('\t') >= 0:
            print '%s contains tabs' % filename
            break

if __name__ == '__main__':
    try:
        if len(sys.argv) == 1:
            checkTabs('<stdin>', sys.stdin)
        else:
            for filename in sys.argv[1:]:
                infile = open(filename, 'r')
                checkTabs(filename, infile)
                infile.close()
    except IOError, e:
        print >> sys.stderr, e

Note: could equally well put the exception handler:
- Inside the else (since we don't think I/O errors can happen while reading standard input)
- Inside the for (so that if an error occurs while reading one file, the program continues on to the next)
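
The second placement, with the handler inside the loop, might look like the sketch below. It assumes Python 3, where IOError is an alias for OSError and the handler is spelled "except IOError as e". The names check_tabs and check_files, and the list of failed filenames that is returned, are illustrative additions.

```python
import sys

def check_tabs(filename, infile):
    '''Print a message if any line of infile contains a tab.'''
    for line in infile:
        if '\t' in line:
            print('%s contains tabs' % filename)
            break

def check_files(filenames):
    '''Check every file, reporting I/O errors without stopping.

    Returns the names of the files that could not be read.
    '''
    failed = []
    for filename in filenames:
        try:
            with open(filename, 'r') as infile:
                check_tabs(filename, infile)
        except IOError as e:
            # Report the problem, then carry on with the next file.
            print(e, file=sys.stderr)
            failed.append(filename)
    return failed
```

With the handler outside the loop, the first unreadable file ends the run. Here each failure is reported and the remaining files are still checked.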


Running Tools

Now, how to run the validation tool?
- python check_tabs.py file1 file2 file… will work…
- …but typing in a bunch of filenames every time would be annoying
  - Which means that we wouldn't do it as often as we should
- And we know we're going to have other validation tools to run as well
Put the command in a Makefile
- If it's worth doing again, it's worth automating
Directory structure of the course:
- A root directory
  - The Makefile goes here
- lec for lecture notes (in .swc files)
- util for utility programs (like the validation tools)
- img for images
  - Images for lec/xyz.swc go in the img/xyz directory
- src for source code
  - src/xyz holds sample files for the XYZ lecture
Makefile runs tab checker on all .swc files
```
# Re-make everything used in the Software Carpentry course.

all :
	@echo 'options: clean validate'

clean :
	@rm -f *~ */*~ */*/*~

validate :
	@python util/check_tabs.py lec/*.swc
```
- Note: also added a clean target that gets rid of editor backup files ending in ~
- And a default target called all, that lists the things the Makefile can do
- Remember, the "@" in front of the commands means, “Don't echo the command before running it”

Checking for Printable Characters

Next on the list: make sure that files only contain printable characters
- I.e., whitespace, alphanumeric, and punctuation
Some editors insert “smart” (curly) quotes, automatically convert "---" into "—", etc.
- But other editors can't display these
- So disallow them, and require authors to use XML escape sequences for anything special
Solution: make sure every character in the file is in string.printable
- Contains letters, digits, punctuation, and whitespace (including tab and newline, which is why tabs need their own check)
- Easier to use this than to write our own regular expression
Should we add this function to the existing validation program, or create a second program?
- The former lets the checking functions share the code that opens files, handles errors, etc.
- The latter makes it easier to re-use the pieces separately
Solution: put them in the same program, and provide command-line flags to disable certain tests
- Might be philosophically purer to have the flags turn tests on…
- …but the normal case is going to be to run them all
Parse command-line options using the getopt module
- First argument is the list of command-line arguments to parse
  - Not including sys.argv[0], which is the name of the program
- Second is a string telling it what flags to look for, and whether they take arguments
  - "a:bcd:" means “-a and -d have an argument, -b and -c don't”
```
    doPrintable = True
    doTabs = True
    settings, filenames = getopt.getopt(sys.argv[1:], 'pt')
    for (opt, arg) in settings:
        if opt == '-p':
            doPrintable = False
        elif opt == '-t':
            doTabs = False
```
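
To see what getopt hands back, the sketch below feeds it a hand-written argument list using the "a:bcd:" option string from the description above. The argument values and the helper name parse are made up for the demonstration, and the code is written for Python 3.

```python
import getopt

def parse(args):
    '''Split args into (flag, value) pairs and leftover filenames.'''
    return getopt.getopt(args, 'a:bcd:')

# -a and -d consume the following word; -b takes no argument.
settings, filenames = parse(
    ['-a', 'first', '-b', '-d', 'second', 'lec/one.swc', 'lec/two.swc'])
print(settings)
print(filenames)
```

Flags that take no argument are paired with an empty string, and parsing stops at the first word that is not a flag, so everything after it is returned as the list of filenames.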

Have to make another change to checkTabs

It and checkPrintable need to process the same data
So neither can read that data in from the file
- Unless we want to open and read the file twice
Solution: separate input, processing, and output
- Main body reads data and calls validation functions

    try:
        if not filenames:
            lines = sys.stdin.readlines()
            if doTabs:
                checkTabs('<stdin>', lines)
            if doPrintable:
                checkPrintable('<stdin>', lines)
        else:
            for filename in filenames:
                infile = open(filename, 'r')
                lines = infile.readlines()
                infile.close()
                if doTabs:
                    checkTabs(filename, lines)
                if doPrintable:
                    checkPrintable(filename, lines)
    except IOError, e:
        print >> sys.stderr, e

Change to checkTabs is easy

def checkTabs(filename, lines):
    '''Look for tabs.'''
    for line in lines:
        if line.find('\t') >= 0:
            print '%s contains tabs' % filename
            break

And checkPrintable is simple as well

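def checkPrintable(filename, lines):
    '''Look for non-printable characters.'''
    for line in lines:
        for c in line:
            if c not in string.printable:
                print '%s contains non-printable characters' % filename
                print line
                # Leaves only the inner loop, so each offending line
                # is reported once before moving on to the next line.
                break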
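
When a file fails this check, it helps to know where the offending characters are. The sketch below is a variation that returns line and column numbers instead of printing. The name find_non_printable and the sample lines are invented, and the code assumes Python 3, where strings are Unicode and a curly quote is a single character.

```python
import string

def find_non_printable(lines):
    '''Return (line number, column, character) for each bad character.'''
    problems = []
    for lineno, line in enumerate(lines, 1):
        for col, c in enumerate(line, 1):
            if c not in string.printable:
                problems.append((lineno, col, c))
    return problems

# The second line uses curly quotes, written here as Unicode escapes.
sample = ['plain "straight" quotes\n', 'curly \u201cquotes\u201d\n']
problems = find_non_printable(sample)
print(problems)
```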


Checking Glossary Entries

Course has a glossary that defines new or unusual terms

Entries in lectures are formatted as <d ref="immutable">Immutable</d>
- <d/> for “definition”
- ref attribute is the ID of the term's entry in the glossary
- Contained text is displayed in-line

Glossary is structured like this:

<glossary>
  <glosssec title="A">
    <glossitem id="absolute_path" term="absolute path">...definition...</glossitem>
    <glossitem id="abstract_data_types" term="abstract data types">...</glossitem>
    <glossitem id="access_control" term="access control">...</glossitem>
    ...
    <glossitem id="automatic_variables" term="automatic variables (in Make)">...</glossitem>
  </glosssec>
  ...
</glossary>

Goal: make sure every term is defined once, and only once
- Read in the glossary
- Record the item IDs
- Read a set of files, marking items as they're seen
  - If an item has already been marked, report the duplication
- At the end, look for items that haven't been marked off

Hm…what happens if we're only checking one file?

Want to be able to suppress the check for all items being marked off
So need two flags:
- One to specify the name of the glossary file (so we don't try to read it as a lecture file)
- One to turn off the check for all items being marked off

    glossary = None
    doGlossaryComplete = True
    doPrintable = True
    doTabs = True
    settings, filenames = getopt.getopt(sys.argv[1:], 'G:gpt')
    for (opt, arg) in settings:
        if opt == '-G':
            glossary = arg
        elif opt == '-g':
            doGlossaryComplete = False
        elif opt == '-p':
            doPrintable = False
        elif opt == '-t':
            doTabs = False

Note: if no glossary file specified, don't check glossary items at all

Now have enough logic that it's worth reorganizing the main processing loop

If no filenames specified, set the list of filenames to ['<stdin>']
Write a function readFile to open and read a file
- If the filename is "<stdin>", it reads from sys.stdin
- Have the function return both a list of lines, and the XML DOM tree
- We need the first to check for tabs and printable characters, and the second to look for glossary items

Modified processing code includes checks for what to do with the glossary

    try:
        if glossary:
            glossary = readGlossaryFile(glossary)
        if not filenames:
            filenames = ['<stdin>']
        for filename in filenames:
            lines, doc = readFile(filename)
            if doTabs:
                checkTabs(filename, lines)
            if doPrintable:
                checkPrintable(filename, lines)
            if glossary:
                checkGlossary(filename, doc, glossary)
        if glossary and doGlossaryComplete:
            checkGlossaryComplete(glossary)
    except IOError, e:
        print >> sys.stderr, e
    except xml.parsers.expat.ExpatError, e:
        print >> sys.stderr, e

Then write readFile

def readFile(filename):
    if filename == '<stdin>':
        data = sys.stdin.read()
    else:
        infile = open(filename, 'r')
        data = infile.read()
        infile.close()
    infile = cStringIO.StringIO(data)
    lines = infile.readlines()
    doc = xml.dom.minidom.parseString(data)
    return lines, doc
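
The cStringIO module used above does not exist in Python 3. A rough Python 3 equivalent of the splitting and parsing step is sketched below. The name split_and_parse and the sample document are illustrative. Note that str.splitlines breaks on a few more characters than readlines does (form feed, for example), so the two are not strictly identical.

```python
import xml.dom.minidom

def split_and_parse(data):
    '''Return (lines with their line endings kept, DOM tree) for XML text.'''
    lines = data.splitlines(True)   # True keeps the trailing newlines
    doc = xml.dom.minidom.parseString(data)
    return lines, doc

lines, doc = split_and_parse(
    '<lec id="demo" title="Demo">\n  <topic title="T"/>\n</lec>\n')
print(len(lines), doc.documentElement.tagName)
```

Reading the data once and deriving both views from the same string means the line-based checks and the DOM-based checks always see the same content.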

Three functions left to write

readGlossaryFile builds a dictionary whose keys are the IDs of the items defined in the glossary

Values are None for now—see why in a moment

def readGlossaryFile(filename):
    if filename is None:
        return None
    infile = open(filename, 'r')
    doc = xml.dom.minidom.parse(infile)
    infile.close()
    terms = doc.getElementsByTagName('glossitem')
    result = {}
    for term in terms:
        t = str(term.getAttribute('id'))
        result[t] = None
    return result

checkGlossary processes uses of glossary terms in a single lecture file

Records, in the glossary dictionary, the name of the file in which each term appears
If some other filename is already there, reports duplicate definition

def checkGlossary(filename, doc, glossary):
    defns = doc.getElementsByTagName('d')
    for defn in defns:
        d = str(defn.getAttribute('ref'))
        if d not in glossary:
            print 'term %s in %s missing from glossary' % (d, filename)
        elif glossary[d] is not None:
            print 'term %s defined in %s and %s' % (d, filename, glossary[d])
        else:
            glossary[d] = filename

checkGlossaryComplete looks for glossary entries without associated filenames

I.e., terms that are in the glossary, but aren't highlighted anywhere in the lectures

def checkGlossaryComplete(glossary):
    unused = []
    for g in glossary:
        if glossary[g] is None:
            unused.append(g)
    if unused:
        unused.sort()
        print 'unused terms'
        for u in unused:
            print '\t%s' % u
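
The three glossary functions can be tried end to end on in-memory strings. The sketch below follows the same mark-off logic but returns its findings instead of printing them, so the result is easy to inspect. The sample glossary and lecture text and the names glossary_ids, mark_uses, and unused_ids are invented for illustration, and the code is written for Python 3.

```python
import xml.dom.minidom

# Made-up inputs that follow the formats shown in the notes.
GLOSSARY = '''<glossary>
  <glosssec title="A">
    <glossitem id="absolute_path" term="absolute path">...</glossitem>
    <glossitem id="access_control" term="access control">...</glossitem>
  </glosssec>
</glossary>'''

LECTURE = ('<lec id="demo" title="Demo">'
           '<d ref="absolute_path">Absolute path</d>'
           '<d ref="immutable">Immutable</d>'
           '</lec>')

def glossary_ids(text):
    '''Map each glossary item ID to None, meaning "not seen yet".'''
    doc = xml.dom.minidom.parseString(text)
    return {item.getAttribute('id'): None
            for item in doc.getElementsByTagName('glossitem')}

def mark_uses(filename, text, glossary):
    '''Mark the IDs used in one lecture; return a list of problems.'''
    problems = []
    doc = xml.dom.minidom.parseString(text)
    for defn in doc.getElementsByTagName('d'):
        ref = defn.getAttribute('ref')
        if ref not in glossary:
            problems.append('term %s in %s missing from glossary'
                            % (ref, filename))
        elif glossary[ref] is not None:
            problems.append('term %s defined in %s and %s'
                            % (ref, filename, glossary[ref]))
        else:
            glossary[ref] = filename
    return problems

def unused_ids(glossary):
    '''Return the sorted IDs that no lecture has marked off.'''
    return sorted(g for g in glossary if glossary[g] is None)

glossary = glossary_ids(GLOSSARY)
problems = mark_uses('demo.swc', LECTURE, glossary)
unused = unused_ids(glossary)
print(problems, unused)
```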


Checking Cross-References

Last task (for now): check external files
- Every code fragment and image that's referenced must exist
- Every code file and image must be referenced
Obvious opportunities for abstraction
- Get the set of files in a directory
- Get the set of files referenced in the lecture file
- Report differences in both directions
Follow the pattern used for checking the glossary
- -I dir specifies the root directory for images
  - If none provided, don't check images
- -C dir specifies the root directory for code fragments
  - Ditto
- Use the value of the lecture's id attribute to determine which particular subdirectory to search
  - No point having a “don't bother to check completeness” option, since each lecture's files are stored separately

Add options to getopt.getopt string, and four more lines to the main processing loop

            if codeRootDir:
                checkFiles(filename, doc, codeRootDir, 'code', 'src')
            if imageRootDir:
                checkFiles(filename, doc, imageRootDir, 'img', 'src')
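
The notes do not show the extended option string, so the sketch below is one guess at it: 'C:G:I:gpt', with -C, -G, and -I each taking an argument. The helper parse_options and the dictionary it returns are illustrative only, since the original keeps these settings in separate variables, and the code is written for Python 3.

```python
import getopt

def parse_options(args):
    '''Parse the validator's flags; return (options dict, filenames).'''
    options = {
        'codeRootDir': None,        # -C dir: root directory for code
        'glossary': None,           # -G file: glossary file
        'imageRootDir': None,       # -I dir: root directory for images
        'doGlossaryComplete': True, # -g turns this check off
        'doPrintable': True,        # -p turns this check off
        'doTabs': True,             # -t turns this check off
    }
    # Assumed option string; a colon marks a flag that takes an argument.
    settings, filenames = getopt.getopt(args, 'C:G:I:gpt')
    for opt, arg in settings:
        if opt == '-C':
            options['codeRootDir'] = arg
        elif opt == '-G':
            options['glossary'] = arg
        elif opt == '-I':
            options['imageRootDir'] = arg
        elif opt == '-g':
            options['doGlossaryComplete'] = False
        elif opt == '-p':
            options['doPrintable'] = False
        elif opt == '-t':
            options['doTabs'] = False
    return options, filenames

options, filenames = parse_options(
    ['-C', 'src', '-I', 'img', '-t', 'lec/demo.swc'])
print(options, filenames)
```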

Write checkFiles

Construct the path to the directory
Find out what it contains
- Subtract things like the .svn directory
- Note how we've left room for other things to be excluded?
Find out what's used in the document
- Using a Set automatically handles multiple references to a single source file
Rely on set subtraction to find differences

def checkFiles(filename, doc, rootDir, eltName, attrName):

    # What should we ignore?
    Excludes = ['.svn']

    # Find out where we're supposed to look.
    docId = str(doc.documentElement.getAttribute('id'))
    dir = os.path.join(rootDir, docId)
    if not os.path.isdir(dir):
        print >> sys.stderr, 'Missing directory: %s' % dir
        return

    # Find out what's there that we care about.
    actual = Set(os.listdir(dir))
    for e in Excludes:
        actual.discard(e)

    # Find what's used in the document.
    elts = doc.getElementsByTagName(eltName)
    referenced = Set()
    for e in elts:
        if e.hasAttribute(attrName):
            referenced.add(str(e.getAttribute(attrName)))

    # Show differences (if any).
    showDiff(filename, dir, 'not found', referenced - actual)
    showDiff(filename, dir, 'unused', actual - referenced)

Then write the helper function showDiff

def showDiff(filename, dir, title, values):
    if len(values):
        print '%s (for file %s and directory %s):' % (title, filename, dir)
        for v in values:
            print '\t%s' % v
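
The set arithmetic at the heart of checkFiles can be tried on its own. The sketch below uses Python 3's built-in set type in place of the older Set class and builds a throwaway directory so it can run anywhere. The helper name compare and the file names are made up for the demonstration.

```python
import os
import tempfile

def compare(directory, referenced, excludes=('.svn',)):
    '''Return (missing, unused) file names as sorted lists.'''
    actual = set(os.listdir(directory)) - set(excludes)
    referenced = set(referenced)    # duplicate references collapse here
    return sorted(referenced - actual), sorted(actual - referenced)

with tempfile.TemporaryDirectory() as tmp:
    # Two files and a version-control directory that should be ignored.
    for name in ('check_tabs.py', 'old_draft.py'):
        open(os.path.join(tmp, name), 'w').close()
    os.mkdir(os.path.join(tmp, '.svn'))
    missing, unused = compare(
        tmp, ['check_tabs.py', 'check_tabs.py', 'validate.py'])

print(missing, unused)
```

Subtracting in one direction gives the referenced files that are missing from the directory, and subtracting in the other gives the files that nothing references.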


Summary

It took an hour to write and debug this code
- Started with something simple
- Tested everything as I wrote it
  - Found and fixed 31 errors in the lecture notes
Now type make clean and make validate before doing a commit
- Ensures that what goes into the repository has at least some chance of being sensible
- Ensures that what other people add to the course will conform to style and usage rules
Course materials include several other tools
- Re-run sample programs and check that the output stored in the lectures is still correct
- Translate lecture notes, glossary, and bibliography into HTML
- Extract summary values from each <topic/> element to create HTML version of course syllabus
- Etc.
[Clark 2004] discusses other things you can do to automate routine project maintenance tasks
- Prevents your project materials from rusting
  - Which makes those materials easier to share
  - And gives you higher confidence that they're working correctly
- Gives you more time to concentrate on things that actually require human attention

Exercises

Exercise 19.1:

What does getopt do when it encounters an argument it doesn't recognize? Write a short program that demonstrates this behavior and can be run on its own, without the user passing in any command-line arguments.