Regular Expressions

Draft Version 560 (Thu Dec 1 09:18:36 2005)

How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to get right
Solution: use regular expressions (REs) instead
- Represent patterns as strings
- Just like the "*" in the shell's *.txt
Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do
- Can't use superscripts and subscripts

A Simple Example

Load the re module, then use re.search(pattern, text)
- pattern is a regular expression that describes what you're looking for
- text is the string you're searching in

So what goes in pattern?

Letters, digits, and whitespace characters match against themselves

Punctuation characters have special meaning

Pattern	Matches	Doesn't Match	Explanation
`⌈a*⌋`	`""`, `"a"`, `"aa"`, …	`"A"`, `"b"`	`⌈*⌋` means “zero or more”; matching is case sensitive
`⌈b+⌋`	`"b"`, `"bb"`, …	`""`	`⌈+⌋` means “one or more”
`⌈ab?c⌋`	`"ac"`, `"abc"`	`"a"`, `"abbc"`	`⌈?⌋` means “optional” (zero or one)
`⌈[abc]⌋`	`"a"`, `"b"`, or `"c"`	`"ab"`, `"d"`	`⌈[…]⌋` means “one character from a set”
`⌈[a-c]⌋`	`"a"`, `"b"`, or `"c"`	`"ab"`, `"d"`	Character ranges can be abbreviated
`⌈[abc]*⌋`	`""`, `"ac"`, `"baabcab"`, …	`"d"`, `"abd"`	Operators can be combined: zero or more choices from `"a"`, `"b"`, or `"c"`

Table 17.1: Basic Regular Expression Operators

Figure 17.1: Matching

re.search looks for a match anywhere in the text
- Doesn't have to match the entire target string
- ```
import re

pattern = 'a[bc]*'
for text in ['b', 'ab', 'accb', 'mad']:
    if re.search(pattern, text):
        print '"%s" matches "%s"' % (pattern, text)
    else:
        print '"%s" does not match "%s"' % (pattern, text)
```
```
"a[bc]*" does not match "b"
"a[bc]*" matches "ab"
"a[bc]*" matches "accb"
"a[bc]*" matches "mad"
```
- ⌈a[bc]*⌋ matches an "a", followed by zero or more of either "b" or "c"
  - Doesn't match "b" because there's no leading "a"
  - Matches "ab" and "accb"
- Why does it match "mad"?
  - re.search looks for a match anywhere in text
  - Skips the "m", then ⌈a⌋ matches "a", and ⌈[bc]*⌋ matches the empty string

Anchoring

If re.search looks anywhere in the line, how to find blank lines?
- Don't consider "x \n" or " x\n" blank
Constrain what the RE can match using anchors
- ⌈^⌋ matches the beginning of the string
- ⌈$⌋ matches the end
- Neither consumes any characters
- Figure 17.2: Anchoring Matches

Examples

Pattern	Text	Result
`⌈b+⌋`	`"abbc"`	Matches
`⌈^b+⌋`	`"abbc"`	Fails (string doesn't start with `b`)
`⌈c$⌋`	`"abbc"`	Matches (string ends with `c`)
`⌈^a*$⌋`	`"aabaa"`	Fails (something other than `"a"` between start and end of string)

Table 17.2: Anchoring Regular Expressions
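
The rows of Table 17.2 can be checked directly with re.search. This is only a sketch: the names `cases` and `anchoring_results` are invented here for illustration and are not part of the lecture's own code.

```python
import re

# (pattern, text, expected result) rows copied from Table 17.2
cases = [
    ('b+',   'abbc',  True),
    ('^b+',  'abbc',  False),
    ('c$',   'abbc',  True),
    ('^a*$', 'aabaa', False),
]

def anchoring_results(rows):
    """Return whether re.search finds each pattern in its text."""
    return [bool(re.search(pattern, text)) for (pattern, text, expected) in rows]

results = anchoring_results(cases)   # [True, False, True, False]
```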

Can now count blank lines in a file

import sys, re

# Nothing but space, tab, carriage return, newline from start to end
pattern = '^[ \t\r\n]*$'

# Count matches in one file/stream.
def count(filename, instream):
    num_blank = 0
    for line in instream:
        if re.search(pattern, line):
            num_blank += 1
    print '%s %d' % (filename, num_blank)

# No filenames on the command line? Read standard input.
if len(sys.argv) == 1:
    count('<stdin>', sys.stdin)
else:
    for filename in sys.argv[1:]:
        instream = open(filename, 'r')
        count(filename, instream)
        instream.close()

Note: always behave like a polite command line filter
- If no arguments given, read from standard input
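
The pattern is easier to trust once it has been tried on a few lines whose status is known in advance. The sketch below assumes a helper called `count_blank` (a hypothetical name) that returns the count instead of printing it, so that it can be checked without reading a file.

```python
import re

# Same pattern as the script above: only whitespace between start and end.
pattern = '^[ \t\r\n]*$'

def count_blank(lines):
    """Count the lines that contain nothing but whitespace."""
    return sum(1 for line in lines if re.search(pattern, line))

# Three of these six lines are blank: '\n', ' \t \n', and '\r\n'.
sample = ['first\n', '\n', ' \t \n', 'x \n', ' x\n', '\r\n']
blank_total = count_blank(sample)   # 3
```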


Escape Sequences

How to match against a literal "^" or "*"?
- Escape from the character's special meaning by putting "\" in front of it
- ⌈\$⌋ matches a literal "$", and ⌈\\⌋ matches a literal "\"
Must write these in Python as "\\$" and "\\\\"
- Two layers of compilation:
  - Python turns double backslashes into a single backslash character
  - Regular expression library then compiles single backslash plus something into special operation
  - Figure 17.3: Compiling Regular Expressions
- This can get very confusing, very quickly
  - "\t" is a tab character, which matches a tab character
  - But "\\t" is the two-character sequence ⌈\t⌋, which also matches a tab character
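
One way to see the two layers is to look at each string before the regular expression library gets it. The sketch below also uses Python's raw string notation (an `r` prefix, which tells Python to leave backslashes alone); that notation is not used elsewhere in these notes and appears here only for comparison.

```python
import re

# Layer 1: Python turns each literal into a string of characters.
dollar = '\\$'        # two characters: backslash, dollar sign
backslash = '\\\\'    # two characters: backslash, backslash
lengths = (len(dollar), len(backslash))                   # (2, 2)

# A raw string leaves backslashes alone, so it spells the same string.
same_as_raw = (dollar == r'\$') and (backslash == r'\\')  # True

# Layer 2: the RE library compiles backslash-plus-character.
# Unescaped, $ is the end-of-string anchor, so it succeeds even here.
anchor_hit = bool(re.search('$', 'cost: 5'))              # True
# Escaped, it only matches a real dollar sign.
literal_miss = bool(re.search(dollar, 'cost: 5'))         # False
literal_hit = bool(re.search(dollar, 'cost: $5'))         # True

# Either layer can produce the tab: both spellings match a tab character.
tab_by_python = bool(re.search('\t', 'a\tb'))             # True
tab_by_re = bool(re.search('\\t', 'a\tb'))                # True
```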
"\" is also used in shorthand notation for common character sets

Sequence	Equivalent	Explanation
`⌈\d⌋`	`⌈[0-9]⌋`	Digits
`⌈\w⌋`	`⌈[a-zA-Z0-9_]⌋`	Word characters (i.e., those allowed in variable names)
`⌈\s⌋`	`⌈[ \t\r\n\f\v]⌋`	Whitespace

Table 17.3: Regular Expression Escape Sequences

A few other useful special cases:
- The notation ⌈[^abc]⌋ means “anything except the characters in this set”
- ⌈.⌋ means “any character except the end of line”
  - Equivalent to ⌈[^\n]⌋
- ⌈\b⌋ anchors the match to a break between word and non-word characters
  - Like ⌈^⌋ and ⌈$⌋, doesn't consume any actual characters
  - Figure 17.4: Word/Non-Word Breaks
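
A short sketch of these special cases follows. The test strings are made up for illustration; note that the word-break anchor has to be written as '\\b' in a Python string.

```python
import re

# [^abc]: any one character that is not "a", "b", or "c"
outside_set = bool(re.search('[^abc]', 'abcd'))    # True: finds the "d"
only_set = bool(re.search('[^abc]', 'cab'))        # False: nothing else there

# . : any one character except a newline
dot_letter = bool(re.search('a.c', 'abc'))         # True
dot_newline = bool(re.search('a.c', 'a\nc'))       # False

# \b must be written '\\b' in a Python string, because '\b' is a backspace
whole_word = bool(re.search('\\bcat\\b', 'the cat sat'))    # True
inside_word = bool(re.search('\\bcat\\b', 'concatenate'))   # False
```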

Extracting Matches

Problem: to check for duplicate line numbers in an assembly language file
- Line numbers are the first thing on the line, terminated by a colon
```
        load    D       1
10 :    sub     A       B
        jlt     A       20
```

First try: search each line with a regular expression

If it succeeds, extract the line number

import sys, re

# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*\\d+\\s*:'

seen = {}
for line in sys.stdin:
    if re.search(numbered, line):
        num = line.split()[0]
        if num in seen:
            print num
        else:
            seen[num] = True

But what if there is no space between the line number and the colon?
- '2 :'.split() gives ['2', ':'], but '2:'.split() gives ['2:']
- Don't want to search the string to find the longest leading sequence of digits
- Need a better way to extract the text that matched sub-patterns

Result of re.search is actually a match object that records what matched, and where

mo.group() returns the whole string that matched the RE
mo.start() and mo.end() are the indices of the match's location

import re

text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
    mo = re.search(pattern, text)
    print '%s / %s => "%s" (%d, %d)' % \
          (pattern, text, mo.group(), mo.start(), mo.end())

b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)

Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched the Nth pair of parentheses (counting opening parentheses from the left) is group N
- mo.group(3) is the text that matched the third subexpression, and mo.start(3) is where it started
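
As a small illustration of group numbering, the sketch below uses a made-up pattern with two groups on a simplified line of assembly. The variable names are assumptions chosen for this example.

```python
import re

# Group 1: the digits before the colon. Group 2: the first word after it.
mo = re.search('(\\d+)\\s*:\\s*(\\w+)', '10 : sub A B')

whole = mo.group(0)                      # '10 : sub' (the entire match)
number = mo.group(1)                     # '10'
opcode = mo.group(2)                     # 'sub'
number_span = (mo.start(1), mo.end(1))   # (0, 2)
opcode_start = mo.start(2)               # 5
```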

Extracting line numbers is now easy:

import sys, re

# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*(\\d+)\\s*:'

seen = {}
for line in sys.stdin:
    mo = re.search(numbered, line)
    if mo:
        num = mo.group(1)
        if num in seen:
            print num
        else:
            seen[num] = True

So is reversing two columns of numbers:

# optional spaces, number, required spaces, number, optional spaces
def reverse(instream, outstream):
    cols = '^\\s*(\\d+)\\s+(\\d+)\\s*$'
    for line in instream:
        mo = re.search(cols, line)

        # If match, reverse numbers
        if mo:
            a, b = mo.group(1), mo.group(2)
            print >> outstream, '%s\t%s' % (b, a)

        # If no match, echo line (without adding extra newline at end)
        else:
            print >> outstream, line,

Let's not forget to test it:

if __name__ == '__main__':

    fixture = '''\
# Leading comment followed by blank line

10 20
 30\t40\t
50
60 70 80
\t90 100
'''

    expected = '''\
# Leading comment followed by blank line

20\t10
40\t30
50
60 70 80
100\t90
'''

    from cStringIO import StringIO
    instream = StringIO(fixture)
    outstream = StringIO()
    reverse(instream, outstream)
    assert outstream.getvalue() == expected


Compiling

The regular expression library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs
- Figure 17.5: Regular Expressions as Finite State Machines
You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use re.compile(pattern) to get the compiled RE
- Its methods have the same names and behavior as the functions in the re module
- E.g., matcher.search(text) searches text for matches to the RE that was compiled to create matcher

Example: find all Title Case words in a document

def findAll(instream, outstream):
    matcher = re.compile('\\b([A-Z][a-z]*)\\b(.*)')
    for line in instream:
        mo = matcher.search(line)
        while mo:
            print >> outstream, mo.group(1)
            mo = matcher.search(mo.group(2))

Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2
Tests are straightforward

if __name__ == '__main__':

    fixture = '''\
This has several "Title Case" words
on Each Line (Some in parentheses).
'''

    expected = '''\
This
Title
Case
Each
Line
Some
'''

    from cStringIO import StringIO
    instream = StringIO(fixture)
    outstream = StringIO()
    findAll(instream, outstream)
    assert outstream.getvalue() == expected

    print 'INPUT'
    print fixture
    print 'OUTPUT'
    print expected

Compiled REs have many other useful methods

Including one that finds all matches, so you don't have to write the loop
All are also available as top-level functions in the re module

Method	Purpose	Example	Result
`split`	Split a string on a pattern.	`re.split('\\s*,\\s*', 'a, b ,c , d')`	`['a', 'b', 'c', 'd']`
`findall`	Find all matches for a pattern.	`re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.')`	`['Some', 'Title', 'Case']`
`sub`	Replace matches with new text.	`re.sub('\\d+', 'NUM', 'If 123 is 456')`	`"If NUM is NUM"`

Table 17.4: Regular Expression Object Methods
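
With findall, the hand-written loop in findAll is no longer needed. The sketch below is one possible rewrite, not the lecture's own code: the function name `title_case_words` is invented, and it returns a list rather than printing to a stream so that the result is easy to check.

```python
import re

# Same Title Case pattern as before, but no second group is needed
# because findall moves along the line by itself.
matcher = re.compile('\\b[A-Z][a-z]*\\b')

def title_case_words(lines):
    """Return every Title Case word found in a sequence of lines."""
    words = []
    for line in lines:
        words.extend(matcher.findall(line))
    return words

fixture = ['This has several "Title Case" words\n',
           'on Each Line (Some in parentheses).\n']
found = title_case_words(fixture)
# ['This', 'Title', 'Case', 'Each', 'Line', 'Some']
```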


Using REs in Other Languages

Like sine and sorting, regular expressions are independent of language
- RE libraries exist for C/C++, Java, Perl, Ruby, MATLAB, …
- Syntax varies slightly, but the ideas are the same

Example: Java's java.util.regex package contains two classes:

Pattern: a compiled regular expression
Matcher: the result of a match

Typical usage:

public static String matchMiddle(String data) {
    String result = null;
    Pattern p = Pattern.compile("a(b|c)d");
    Matcher m = p.matcher(data);
    if (m.matches()) {
        result = m.group(1);
    }
    return result;
}
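
For comparison, here is a rough Python counterpart of the Java method. Java's matches() succeeds only if the whole input matches the pattern, so this sketch anchors the RE at both ends. The name `match_middle` is an assumption, and the two are not exactly equivalent: Python's ⌈$⌋ also matches just before a trailing newline, which matches() does not allow.

```python
import re

def match_middle(data):
    """Return the middle letter of 'abd' or 'acd', or None otherwise."""
    # Anchored at both ends to imitate Java's whole-input matches().
    mo = re.search('^a(b|c)d$', data)
    if mo:
        return mo.group(1)
    return None

middle_b = match_middle('abd')     # 'b'
middle_c = match_middle('acd')     # 'c'
no_match = match_middle('xabd')    # None: extra leading character
```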

REs are actually built into languages like Perl and Ruby

Just as dictionaries are built into Python, and arrays into MATLAB

Typical Perl usage:

open MAIL, 'mail.txt';
while (<MAIL>) {
    if (($name, $value) = /^([^:]+): ?(.+)$/) {
        print "Message header $name is $value\n";
    }
}


But Wait, There's More

We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
Use ⌈|⌋ for either/or
- ⌈ab|cd⌋ matches either "ab" or "cd"
- ⌈a(b|c)d⌋ matches either "abd" or "acd"
Use ⌈pat{N}⌋ to match exactly N occurrences of a pattern
- More generally, ⌈pat{M,N}⌋ matches between M and N occurrences
- ⌈\d{2,3}⌋ matches "19" or "207", but not "3" or "4567"
  - Well, OK, it will match the first three characters of "4567"
  - But ⌈^\d{2,3}$⌋ won't
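
The sketch below tries these operators on a few made-up strings. One detail worth noting: ⌈|⌋ binds more loosely than anything else, so the alternatives are wrapped in parentheses before anchors are added.

```python
import re

def hits(pattern, texts):
    """Return, for each text, whether re.search finds the pattern in it."""
    return [bool(re.search(pattern, text)) for text in texts]

# Either/or. Without the parentheses, '^ab|cd$' would mean
# "starts with ab" or "ends with cd".
either = hits('^(ab|cd)$', ['ab', 'cd', 'ad'])         # [True, True, False]
middle = hits('^a(b|c)d$', ['abd', 'acd', 'abcd'])     # [True, True, False]

numbers = ['19', '207', '3', '4567']

# Unanchored, \d{2,3} finds two or three digits anywhere in the text.
unanchored = hits('\\d{2,3}', numbers)     # [True, True, False, True]

# Anchored at both ends, the whole string must be two or three digits.
anchored = hits('^\\d{2,3}$', numbers)     # [True, True, False, False]
```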
Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
- That's why they call it computer science: it's experimental
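
As an illustration of that workflow, the sketch below builds an RE for telephone numbers in three steps, testing after each one. The number format (seven digits with a dash, optionally preceded by an area code in parentheses) and the helper name `accepts` are assumptions made for this example, not a general-purpose telephone number matcher.

```python
import re

def accepts(pattern, text):
    """Return True if pattern is found somewhere in text."""
    return bool(re.search(pattern, text))

# Step 1: three digits, a dash, four digits.
step1 = '\\d{3}-\\d{4}'
ok1 = accepts(step1, '555-1234')          # True
loose1 = accepts(step1, '5555-12345')     # True: too permissive

# Step 2: anchor both ends so nothing extra is allowed.
step2 = '^\\d{3}-\\d{4}$'
ok2 = accepts(step2, '555-1234')          # True
strict2 = accepts(step2, '5555-12345')    # False

# Step 3: allow an optional area code in (escaped) parentheses.
step3 = '^(\\(\\d{3}\\) )?\\d{3}-\\d{4}$'
with_area = accepts(step3, '(604) 555-1234')    # True
without_area = accepts(step3, '555-1234')       # True
bad_area = accepts(step3, '(604 555-1234')      # False
```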
For a broader tutorial, see [Wilson 2005]
- For Python-specific material, see Andrew Kuchling's Python Regular Expression HOWTO
- And if you're going to be doing serious work, check out [Good 2005] or [Friedl 2002]

Exercises

Exercise 17.1:

By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE ⌈X(.*)X(.*)⌋ to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".