Basic XML and XHTML

Draft Version 574 (Thu Dec 1 09:18:53 2005)

XML is quickly becoming the standard way to store data
- Web pages
- Spreadsheets
- Images
- Astronomical observations
Bewildering variety of tools for dealing with it
- And more appearing every day
Even simple things can be tricky to do right
- Generally agreed that standards are more complex than they should have been…
- …but no agreement on which bits should have been left out
Examine:
- The rules for creating legal XML
- The XML-compliant dialect of HTML (for web pages)
- The standard Python library for manipulating XML
  - Note: there are lots of others, each with its strengths
- Reading:
  - [Castro 2002] if all you care about is HTML
  - [Castro 2000] if you want to know more about XML
  - [Harold 2004] if you want to become an expert

History

1969-1986: Standard Generalized Markup Language (SGML)
- Developed by Charles Goldfarb and others at IBM
- A way of adding information to medical and legal documents so that computers could process them
```
<person role="litigant">
  <given-name>Charles</given-name>
  <surname>Babbage</surname>
</person>
```
- Allows computers to reliably find cases in which Charles Babbage sued someone
- Large (500-page spec) and complex
1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
- Much (much) simpler than SGML
- Anyone could write it, so everyone did
Problem: HTML had a small, fixed set of tags
- Everyone wanted to add new ones
- Solution: create a standard way to define a set of tags, and the relationships between them
1998: first version of XML is standardized
- A set of rules for defining markup languages
- Much more complex than HTML…
- …but still much simpler than SGML
New version of HTML called XHTML was also defined
- Like HTML, but obeys all XML rules
- Still a lot of non-XML compliant HTML out there, though

Formatting Rules

For our purposes, an XML document contains elements and text
- Full spec allows for external entity references, processing instructions, and other fun
Elements are shown using tags
- Must be enclosed in angle brackets "<>"
- Full form: <tagname>…</tagname>
- Short form (if the element doesn't contain anything): <tagname/>
Elements must be properly nested
- If Y starts inside X, Y must end before X ends
- So <X>…<Y>…</Y></X> is legal…
- …but <X>…<Y>…</X></Y> is not
Every document must have a single root element
- I.e., a single element must enclose everything else
- So the following is not a legal document
```
<first>
This document is illegal
</first>
<second>
because it does not have a unique root element.
</second>
```
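The nesting and single-root rules can be checked mechanically by handing text to a parser and seeing whether it complains. The sketch below assumes Python's standard xml.dom.minidom parser (covered later in this lecture); the helper name is_well_formed is made up for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

nested = '<X>text<Y>more</Y></X>'        # Y ends before X ends
crossed = '<X>text<Y>more</X></Y>'       # X ends while Y is still open
two_roots = '<first>one</first><second>two</second>'

for sample in (nested, crossed, two_roots):
    print('%-45s %s' % (sample, is_well_formed(sample)))
```

Only the first sample should be accepted; the other two break the nesting rule and the single-root rule respectively.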
Text is normal printable text
- Must use escape sequences to represent special characters such as "<", ">", and "&"
- In XML, written &name;

Sequence	Character
`&lt;`	`<`
`&gt;`	`>`
`&quot;`	`"`
`&amp;`	`&`

Table 18.1: Common XML Escape Sequences
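Escaping by hand is error-prone, so it is usually left to a library. A small sketch using the helpers in Python's standard xml.sax.saxutils module; the sample strings are invented for illustration.

```python
from xml.sax.saxutils import escape, unescape, quoteattr

raw = 'if x < y && y > 0'
escaped = escape(raw)              # replaces &, <, and > with escape sequences
print(escaped)
print(unescape(escaped) == raw)    # the round trip recovers the original text

# quoteattr escapes a value and wraps it in quotes, ready to go inside a tag.
print('<note text=%s/>' % quoteattr('5 < 6'))
```

Note that the ampersand must be escaped as well, since it is what introduces every escape sequence.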
Specific dialects of XML may or may not restrict which elements can appear inside which others, and where text can appear
- XHTML is very liberal
- MathML (Mathematical Markup Language) is stricter

XHTML

Most common use of XML is still XHTML (the XML version of hypertext)

Basic tags:

`html`	Root element of entire HTML document.
`body`	Body of page (i.e., visible content).
`h1`	Top-level heading. Use `h2`, `h3`, etc. for second- and third-level headings.
`p`	Paragraph.
`em`	Emphasized text; browser or editor will usually display it in italics.
`address`	Address of document author (also usually displayed in italics).

Table 18.2: Common XHTML Tags

Note: XHTML includes both semantics (“What does this mean?”) and display (“How should this be drawn?”)
- h1 (level-1 heading) is semantic, i (italics) is display
- Now generally considered a bad thing
  - Documents should only contain semantics
  - Display of that semantics should be specified separately…
  - …so that different browsers (or devices) can do it differently

Sample document:

<html>
<body>
<h1>Software Carpentry</h1>

<p>This course will introduce <em>essential software development skills</em>,
and show where and how they should be applied.</p>

<address>Greg Wilson (gvwilson@third-bit.com)</address>
</body>
</html>

Figure 18.1: Simple Page Rendered by Firefox

Figure 18.2: Simple Page Rendered by Internet Explorer


Attributes

Elements may also have attributes
- Each attribute is a name/value pair that provides extra information about the element
- Enclosed in the opening tag
 - <h1 align="center">A Centered Heading</h1>
 - <p class="disclaimer">This planet provided as-is.</p>
- Each name may appear at most once
 - Like keys in a dictionary
 - <p align="left" align="right">…</p> is illegal
- Values must be quoted
 - Old-style HTML often allowed things like <h1 align=center>, but modern parsers will reject it
 - Must use escape sequences for angle brackets, quotes, etc. inside values
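The attribute rules can be seen in action with a parser. This is a sketch only, assuming Python's standard xml.dom.minidom (covered later in this lecture); the sample elements are invented for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Escape sequences inside a quoted value are decoded by the parser.
doc = xml.dom.minidom.parseString(
    '<h1 align="center" title="A &lt; B">Heading</h1>')
heading = doc.documentElement
print(heading.getAttribute('align'))
print(heading.getAttribute('title'))

# A repeated name and an unquoted value should both be rejected.
rejected = []
for bad in ('<h1 align="left" align="right">Heading</h1>',
            '<h1 align=center>Heading</h1>'):
    try:
        xml.dom.minidom.parseString(bad)
    except ExpatError:
        rejected.append(bad)
print(len(rejected))
```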
Strictly speaking, attributes are redundant
- Can always re-write XML using only elements
 - Usually more typing…
 - …but that only matters if you're creating the XML by hand…
 - …and an increasing amount is created by machines, for machines

With Attributes	Without Attributes
<a b="c"> <d e="f"/> </a>	<a> <a-b>c</a-b> <d><d-e>f</d-e></d> </a>

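The rewrite from attributes to elements is mechanical enough to automate. The sketch below is one possible way to do it with Python's standard xml.dom.minidom (covered later in this lecture); the function name and the tag-dash-attribute naming scheme are assumptions chosen to match the example.

```python
import xml.dom.minidom

def attributes_to_elements(doc, node):
    """Replace every attribute at or below node with a child element."""
    if node.nodeType != node.ELEMENT_NODE:
        return
    # Handle existing children first, so new children are not revisited.
    for child in list(node.childNodes):
        attributes_to_elements(doc, child)
    # Copy the (name, value) pairs before changing the element.
    pairs = sorted(node.attributes.items())
    for name, value in reversed(pairs):
        child = doc.createElement(node.tagName + '-' + name)
        child.appendChild(doc.createTextNode(value))
        # Inserting at the front in reverse order keeps the names sorted.
        node.insertBefore(child, node.firstChild)
        node.removeAttribute(name)

doc = xml.dom.minidom.parseString('<a b="c"><d e="f"/></a>')
attributes_to_elements(doc, doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

Going the other way is not always possible, since element content can have structure that an attribute value cannot.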
You should use attributes when:
- Each value can occur at most once for any element.
- The order of the values doesn't matter.
- Those values have no internal structure.
  - If you have to do any significant work on an attribute's value to figure out what it means, you should use an element instead.

More XHTML Tags

Well-written HTML pages have a head element as well as a body
- Contains metadata about the page

Element	Example	Purpose
`title`	`<title/>`	Page title (for display in title bar of browser, bookmarks, etc.)
`meta`	`<meta/>`	Information about the document (typically, to help with search)

Table 18.3: More XHTML Tags

Well-written pages also use comments (just like code)

Introduce with <!-- and end with -->

<html>
<head>
  <title>Comments Page</title>
  <meta name="author" content="aturing"/>
</head>
<body>

<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>

<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>

</body>
</html>
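Comments are not thrown away when a page is parsed. A short sketch, assuming Python's standard xml.dom.minidom (covered later in this lecture), that pulls the comments out of a cut-down copy of the page above:

```python
import xml.dom.minidom

src = '''<body>
<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>
<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>
</body>'''

doc = xml.dom.minidom.parseString(src)

# Comment nodes keep their text in the data member, like text nodes.
comments = [node.data.strip()
            for node in doc.documentElement.childNodes
            if node.nodeType == node.COMMENT_NODE]
for text in comments:
    print(text)
```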

Many other tags can be used (and abused) in HTML pages

Use ul for an unordered (bulleted) list, and ol for an ordered (numbered) one
- Each list item is wrapped in li
Use table for tables
- Each row is wrapped in tr (for “table row”)
- Within each row, column items are wrapped in td (for “table data”)
- Note: tables are often used to force multi-column layout, as well as for tabular data

<html>
<head>
  <title>Lists and Tables</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<table cellpadding="3" border="1">
  <tr>
    <td align="center"><em>Unordered List</em></td>
    <td align="center"><em>Ordered List</em></td>
  </tr>
  <tr>
    <td align="left" valign="top">
      <ul>
        <li>Hydrogen</li>
        <li>Lithium</li>
        <li>Sodium</li>
        <li>Potassium</li>
        <li>Rubidium</li>
        <li>Cesium</li>
        <li>Francium</li>
      </ul>
    </td>
    <td align="left" valign="top">
      <ol>
        <li>Helium</li>
        <li>Neon</li>
        <li>Argon</li>
        <li>Krypton</li>
        <li>Xenon</li>
        <li>Radon</li>
      </ol>
    </td>
  </tr>
</table>

</body>
</html>

Figure 18.3: Lists and Tables

Note how Subversion keywords have been put in meta elements in the document head
- Automatically updated each time the document is committed to version control


Connecting to Other Data

How to put an image in a page?

XML documents can only contain text, so you can't store an image or audio clip directly in a page
- Unless you encode it as text
Usual solution is to store a reference to the external file using the img tag
- The src attribute specifies where to find the image file

<html>
<head>
  <title>Images</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<h1>Our Logo</h1>

<img src="../../img/swc_logo.jpg" alt="[Software Carpentry Logo]"/>

</body>
</html>

Figure 18.4: Images in Pages

Important to always use the alt attribute to specify alternative text
- Screen readers for people with visual handicaps use this instead of the image
- And it's good documentation for search engines
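Because pages are structured data, a missing alt attribute can be found automatically. A minimal sketch, assuming Python's standard xml.dom.minidom (covered later in this lecture); the file names and the helper images_without_alt are hypothetical.

```python
import xml.dom.minidom

src = '''<body>
<img src="logo.jpg" alt="[Software Carpentry Logo]"/>
<img src="banner.jpg"/>
<img src="spacer.gif" alt=""/>
</body>'''

def images_without_alt(doc):
    """Return the src of every img element that has no alt attribute."""
    return [img.getAttribute('src')
            for img in doc.getElementsByTagName('img')
            if not img.hasAttribute('alt')]

doc = xml.dom.minidom.parseString(src)
missing = images_without_alt(doc)
print(missing)
```

An empty alt value is treated as present here; whether that is acceptable for a given image is a judgment the check cannot make.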

Often, the “other data” you want to connect to is other HTML pages

This is what makes it hypertext…
Use the a element to create a link
- The text inside the element is displayed and (usually) underlined for clicking
- The href attribute specifies what the link is pointing at
- Both local filenames and URLs are supported

<html>
<head>
  <title>Links</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<h1>A Few of My Favorite Places</h1>

<ul>
  <li><a href="http://www.google.com">Google</a></li>
  <li><a href="http://www.python.org">Python</a></li>
  <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
  <li>Examples in this lecture:
    <ul>
      <li><a href="comments.html">Comments</a></li>
      <li><a href="image.html">Images</a></li>
      <li><a href="list_table.html">Lists and Tables</a></li>
    </ul>
  </li>
</ul>

</body>
</html>

Figure 18.5: Links in Pages
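The same structure makes it easy to pull every link out of a page. The sketch below assumes Python's standard xml.dom.minidom (covered later in this lecture) and treats any href containing "://" as a URL, which is a simplification rather than a full URL check.

```python
import xml.dom.minidom

src = '''<ul>
  <li><a href="http://www.python.org">Python</a></li>
  <li><a href="comments.html">Comments</a></li>
  <li><a href="image.html">Images</a></li>
</ul>'''

def split_links(doc):
    """Return (remote, local) lists of the href values of all a elements."""
    remote, local = [], []
    for link in doc.getElementsByTagName('a'):
        target = link.getAttribute('href')
        if '://' in target:
            remote.append(target)
        else:
            local.append(target)
    return remote, local

remote, local = split_links(xml.dom.minidom.parseString(src))
print(remote)
print(local)
```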


Accessibility

The web is not a particularly friendly place if you're visually disabled
- Screen readers have a hard time dealing with web pages that use graphics instead of text for buttons…
Top 10 Accessible Web Authoring Practices describes what you should do to make your pages more accessible
- All of these things help search engines and other automatic tools as well

The Document Object Model

The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
- Elements, attributes, and text all represented as objects
- Strengths:
  - Much easier to manipulate trees than strings
  - Same basic model in many different languages (which lowers the learning cost)
- Weaknesses:
  - Needs a lot of memory for large documents
  - Its generic model doesn't take advantage of the more advanced features of some languages
Most popular alternative is SAX (the Simple API for XML)
- Turns an XML document into a stream of events
  - “Element, element, text, element, text…”
- Easy to do very simple things…
- …but anything complex requires the programmer to reimplement a subset of DOM
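To make the "stream of events" idea concrete, here is a small sketch using Python's standard xml.sax module. The class name EventLogger is invented for illustration; it simply records each event in the order the parser reports it.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each parsing event instead of building a tree."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        # The parser may deliver one run of text in several pieces.
        self.events.append(('text', content))

logger = EventLogger()
xml.sax.parseString(b'<planet><period>87.97</period></planet>', logger)
for kind, value in logger.events:
    print(kind + ' ' + value)
```

Nothing here remembers which element is currently open; a program that needs that has to keep its own stack, which is the first step toward rebuilding part of DOM.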
Python comes with a simple implementation of DOM called minidom
- Fast, sturdy, and well documented…
- …if you understand all the terminology, and know more or less what you're looking for

The Basics

Every DOM tree has a single root representing the document as a whole
- Doesn't correspond to anything that's actually in the document
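This document node has a single child element, which is the root element of the document
This element, and other element nodes, may have three types of nodes below them:
- Other elements
- Text nodes
- Attribute nodes (minidom stores these in the element's attributes member, not in childNodes)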
Every node keeps track of what its parent is
- Allows programs to search up the tree, as well as down

Example:

XML

DOM Tree

<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>

Figure 18.6: A DOM Tree

Note: it's common to forget that text and attributes are stored in nodes of their own
- Some other Python implementations of DOM don't bother
- Make simple things simpler…
- …but only a little bit
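The relationships described above can be checked directly. A brief sketch using Python's xml.dom.minidom on the example document from Figure 18.6:

```python
import xml.dom.minidom

src = '''<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>'''

doc = xml.dom.minidom.parseString(src)
root = doc.documentElement

# The document node sits above the root element.
print(root.parentNode is doc)

# Text lives in a child node of its own, not in the element itself.
second = doc.getElementsByTagName('second')[0]
print(second.firstChild.nodeType == second.TEXT_NODE)
print(second.firstChild.data)

# Attributes are nodes too, looked up by name on the element.
attr = second.getAttributeNode('attr')
print(attr.nodeType == attr.ATTRIBUTE_NODE)
print(attr.value)

# Every node knows its parent, so code can walk upward as well as down.
print(second.firstChild.parentNode.parentNode.tagName)
```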

Creating a Tree

Usual way to create a DOM tree is to parse a file

If this is the file:

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>

Parse and print like this:

import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>

The toxml method can be called on the document, or on any element node
- Note that we specify "utf-8" as the character encoding
- DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters

Can also create a tree by parsing a string

Works just like parsing a file

import xml.dom.minidom

src = '''<planet name="Venus">
  <period units="days">224.7</period>
</planet>'''

doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
  <period units="days">224.7</period>
</planet>

Finally, can build a tree by hand
```
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()

doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')

period = doc.createElement('period')
root.appendChild(period)

text = doc.createTextNode('686.98')
period.appendChild(text)

print doc.toxml('utf-8')
```
```
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
```
- xml.dom.minidom is really just a wrapper around other platform-specific XML libraries
 - Have to reach inside it and get the underlying implementation object to create the document node
 - That node then knows how to create other elements in the document
 - The library documentation explains what the first and third arguments to createDocument are (the namespace URI and the document type)
 - Middle one tells createDocument what type of element the document's root node should be
- Set attributes of element nodes using setAttribute(attributeName, newValue)
 - Remember, all attribute values are strings
 - If you want to store an integer or a Boolean, you have to convert it yourself
- Add new nodes to existing ones by:
 - Asking the document to create the node
 - Appending it to a node that's already part of the tree
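Since attribute values are always strings, numbers and Booleans have to be converted in both directions. A minimal sketch; the attribute names moons and inhabited are made up for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<planet name="Mars"/>')
planet = doc.documentElement

# Convert to strings on the way in...
planet.setAttribute('moons', str(2))
planet.setAttribute('inhabited', str(False))

# ...and back to the intended types on the way out.
moons = int(planet.getAttribute('moons'))
# bool('False') would be True, so compare against the stored text instead.
inhabited = (planet.getAttribute('inhabited') == 'True')

print(moons + 1)
print(inhabited)
```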
Notice that the output of the preceding example wasn't nicely indented
- We didn't tell DOM to create text nodes containing newlines and blanks
- Most machine-generated XML doesn't
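When people need to read the output, minidom can add the indentation itself. A short sketch using its toprettyxml method; the two-space indent is an arbitrary choice.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<planet name="Mars"><period>686.98</period></planet>')

compact = doc.documentElement.toxml()
pretty = doc.documentElement.toprettyxml(indent='  ')

print(compact)
print(pretty)
```

The indented form contains extra whitespace, which shows up as extra text nodes if it is parsed again, so it is better suited to display than to storage.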

Walking a Tree

Often want to visit each node in the tree
- E.g., print an outline of the document showing element nesting

Simplest way is to write a recursive function

import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''

def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)

doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)

solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)

Node's type is stored in a member variable called nodeType
- ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
If a node is an element, its children are stored in a list called childNodes
- A read-only structure
- See how to add, delete, and move children in a moment
If a node is a text node, the text is in the member data
- The single-character text nodes are the newlines separating elements
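Those whitespace-only text nodes are often just noise. One possible variation on the walk, sketched here with a hypothetical function called outline, skips them and shows the remaining text instead of its length.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
</solarsystem>
'''

def outline(node, depth=0, lines=None):
    """Collect an indented outline, ignoring whitespace-only text nodes."""
    if lines is None:
        lines = []
    if node.nodeType == node.TEXT_NODE:
        if node.data.strip():
            lines.append(' ' * depth + 'TEXT ' + node.data)
    elif node.nodeType == node.ELEMENT_NODE:
        lines.append(' ' * depth + node.tagName)
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(src)
result = outline(doc.documentElement)
print('\n'.join(result))
```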

Traversing a tree like this is just one of many recurring patterns in object-oriented programming
- We'll discuss them briefly in Backward, Forward, and Sideways

The Visitor pattern is used to separate traversal of a data structure from operations on its elements

One class traverses a particular kind of structure the same way each time
User then defines the operation
- Derive a class, pass a function as an argument, etc.
- The fact that this can be done in several different ways is what makes it a pattern

Step 1: define generic behavior

class Visitor(object):

    def __init__(self):
        pass

    def visit(self, node):
        # When given the document, skip to the root.
        if node.nodeType == node.DOCUMENT_NODE:
            self.visit(node.documentElement)
            return

        # Handle other types of nodes.
        self.before(node)
        self.at(node)
        if node.nodeType == node.ELEMENT_NODE:
            for child in node.childNodes:
                self.visit(child)
        self.after(node)

    def doNothing(self, node):
        pass

    before = doNothing
    at = doNothing
    after = doNothing

Users call Visitor.visit with the root node of the tree they want to traverse
Override one or more of the three do-nothing methods to perform actions before, at, or after nodes

Step 2: derive a class that does something useful (like count how many nodes are in the tree)

class Counter(Visitor):

    def __init__(self):
        Visitor.__init__(self)
        self.count = 0

    def at(self, node):
        if node.nodeType == node.ELEMENT_NODE:
            self.count += 1

Initialize count to zero before traversing
Increment the count at each element node

Step 3: test

if __name__ == '__main__':
    src = '<a><b>c</b><d>e</d><f>g<h/>i</f></a>'
    tree = xml.dom.minidom.parseString(src)
    c = Counter()
    c.visit(tree)
    assert c.count == 5
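Deriving a class is only one way to supply the operation. The sketch below shows the other option mentioned earlier, passing a function as an argument; the names visit and record_tag are invented for illustration and are not part of minidom.

```python
import xml.dom.minidom

def visit(node, action):
    """Call action(node) on node and then on everything below it."""
    # When given the document, skip to the root.
    if node.nodeType == node.DOCUMENT_NODE:
        visit(node.documentElement, action)
        return
    action(node)
    if node.nodeType == node.ELEMENT_NODE:
        for child in node.childNodes:
            visit(child, action)

tags = []

def record_tag(node):
    # Only element nodes have a tag name worth recording.
    if node.nodeType == node.ELEMENT_NODE:
        tags.append(node.tagName)

tree = xml.dom.minidom.parseString('<a><b>c</b><d>e</d><f>g<h/>i</f></a>')
visit(tree, record_tag)
print(tags)
```

The traversal code is the same in both versions; only the way the operation is plugged in differs.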


Modifying the Tree

Modifying trees in place is a little bit tricky
- Helps to draw lots of pictures
Example: want to emphasize the first word of each paragraph
- Get the text node below the paragraph
- Take off the first word
- Insert a new em element whose only child is a text node containing that word
- Figure 18.7: Modifying the DOM Tree
Ah, but it's not that simple
- What if the first child of the paragraph already has some markup around it?
 - E.g., what if the paragraph starts with a link?
- Could just wrap the first child with em
 - But if (for example) the link contains several words, this will look wrong
- We'll ignore this problem for now

First part of solution: find all the paragraphs using getElementsByTagName, and iterate over them

You'll use this method a lot…

import re
import xml.dom.minidom

def emphasize(doc):
    paragraphs = doc.getElementsByTagName('p')
    for para in paragraphs:
        first = para.firstChild
        if first.nodeType == first.TEXT_NODE:
            emphasizeText(doc, para, first)

Second part: break the paragraph text into pieces, and handle each piece in turn

Create a new node for each piece
Push it onto the front of the paragraph's child list
Once they've all been handled, get rid of the original text node

def emphasizeText(doc, para, textNode):

    # Look for optional spaces, a word, and the rest of the paragraph.
    m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return

    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)

    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)

    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)

    # Get rid of the original text.
    para.removeChild(textNode)

Third part: test it

Yes, it really is part of the program

if __name__ == '__main__':

    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''

    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>


Summary

There's a lot of hype in hypertext
- Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
Strengths:
- One set of rules for people to learn
- One parser can handle all of their data
  - At least, the low-level syntactic bits—still need to figure out what all those tags mean
Weaknesses:
- Raw XML is hard to read
  - Particularly if it has been generated by a machine
- A lot of data isn't actually trees
  - When storing a 2D matrix or a table, you have to organize data by row or by column…
  - …either of which makes the other hard to access
- There are a lot of complications and subtleties
  - Most applications ignore most of them
  - Which means that they fail (usually badly) when confronted with something outside the subset they understand
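The point about matrices can be illustrated with a small row-by-row layout. This is a sketch under assumed element names (matrix, row, cell), not a standard format.

```python
import xml.dom.minidom

src = '''<matrix>
  <row><cell>1</cell><cell>2</cell><cell>3</cell></row>
  <row><cell>4</cell><cell>5</cell><cell>6</cell></row>
</matrix>'''

doc = xml.dom.minidom.parseString(src)
rows = doc.getElementsByTagName('row')

def cell_values(row):
    """Return the numbers stored in one row element."""
    return [int(cell.firstChild.data)
            for cell in row.getElementsByTagName('cell')]

# A row is a single subtree, so reading it is one step.
first_row = cell_values(rows[0])

# A column is spread across every row, so each row must be visited.
second_column = [cell_values(row)[1] for row in rows]

print(first_row)
print(second_column)
```

Storing the data by column instead would simply swap which of the two operations is the awkward one.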
Like Inglish speling, it's here to stay