2 Technical background

2.1 Text Encoding and XML

Strictly speaking, one doesn't need a markup language to share or even analyze digitized text. Computer programs can be, and have been, written to take plain text as input and to analyze or transform it in various ways. Suppose, for example, you have the entire text of James Joyce's Ulysses stored in an ASCII computer file:

Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air. He held the bowl aloft and intoned:

—Introibo ad altare Dei.

[etc.]

You can do a reasonably good job of breaking the text into its component paragraphs, sentences, and words based on line breaks, punctuation, and spacing. So you can produce word counts and concordances, calculate the average length of sentences, find interesting patterns of word collocation, or even generate a Joyce pastiche by stringing together random sentences from the text. But with only the bare text itself as data, there are many more things you can't easily do: search for references to people, places, or titles; distinguish reliably between primary text and quoted text, or direct and indirect speech; search for passages in a particular language, such as Latin; find text that is in verse rather than prose; indicate that text is in italics or boldface. For these tasks it is necessary to add something to the bare text, to mark it up: hence the need for a ‘markup language’. Using a markup language you can identify "Buck Mulligan" as a personal name, and perhaps associate it with a standard identifier. You can indicate that the second paragraph represents a quotation, spoken by Buck Mulligan, in the Latin language, quoted from the Tridentine Mass, and typographically rendered in italics in the source text.
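
As a rough illustration of what the bare text does allow, the following Python sketch counts words and estimates the average sentence length for the opening lines quoted above. The splitting rules are simplifying assumptions made for this example (a word is a run of letters; a sentence ends at a period, exclamation mark, question mark, or colon), not a serious tokenizer.

```python
import re
from collections import Counter

# Opening of the sample text quoted above.
TEXT = (
    "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of "
    "lather on which a mirror and a razor lay crossed. A yellow dressinggown, "
    "ungirdled, was sustained gently behind him on the mild morning air. "
    "He held the bowl aloft and intoned:"
)

def words(text):
    """Return lowercase word tokens (assumption: a word is a run of letters)."""
    return re.findall(r"[a-z]+", text.lower())

def sentences(text):
    """Split on sentence-final punctuation (assumption: . ! ? : end a sentence)."""
    return [s.strip() for s in re.split(r"[.!?:]", text) if s.strip()]

word_counts = Counter(words(TEXT))
sentence_list = sentences(TEXT)
average_length = len(words(TEXT)) / len(sentence_list)

print(word_counts.most_common(3))
print(len(sentence_list), "sentences; average length", round(average_length, 1), "words")
```

Nothing in this approach can tell the program that "Buck Mulligan" is a person or that the following line is Latin; that information has to be supplied by markup.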

Various markup languages have been developed in the past to accomplish this kind of identification, but the Text Encoding Initiative and the wider computing world have settled on a single standard: XML, the Extensible Markup Language.

2.1.1 What Is XML?

XML is popularly known as an ‘angle-bracket language’. If you have ever looked at or edited HTML source code, you have worked with a near cousin of XML—or with pure XML, if your code was in the HTML version known as XHTML (Extensible HyperText Markup Language).

The most basic fact about XML is that XML is not a single markup standard, but rather a specification for creating markup languages, each with its own vocabulary and rules. Moreover, XML can be used to encode nearly any sort of material, from poems
<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza>
  <verse>Mary had a little lamb,</verse>
  <verse>its fleece was white as snow;</verse>
  <verse>and everywhere that Mary went</verse>
  <verse>the lamb was sure to go.</verse>
 </stanza>
</NurseryRhyme>
to regular ‘data’ of the sort that could be stored in a formal database:
<place>
 <name>London</name>
 <latitude hemisphere="N">51.507778</latitude>
 <longitude hemisphere="W">0.128056</longitude>
</place>
Nearly anything can be encoded and expressed in XML, within the constraints of its syntax. There are XML languages for musical notation [ref http://xml.coverpages.org/xmlMusic.html], for mathematical equations (MathML), for representing vector graphics (SVG), for library catalog records (MARCXML) . . . [ref to Wikipedia http://en.wikipedia.org/wiki/List_of_XML_markup_languages?]
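
To suggest how data-like XML of this kind is consumed by software, here is a minimal Python sketch that reads the place record above using the standard library's xml.etree.ElementTree module. The helper signed_degrees and its convention of treating S and W as negative values are assumptions introduced for the example; they are not part of any XML vocabulary described here.

```python
import xml.etree.ElementTree as ET

PLACE_XML = """
<place>
 <name>London</name>
 <latitude hemisphere="N">51.507778</latitude>
 <longitude hemisphere="W">0.128056</longitude>
</place>
"""

def signed_degrees(element):
    """Convert an element's text to a float, negative for S or W (assumed convention)."""
    value = float(element.text)
    return -value if element.get("hemisphere") in ("S", "W") else value

place = ET.fromstring(PLACE_XML)
name = place.findtext("name")
latitude = signed_degrees(place.find("latitude"))
longitude = signed_degrees(place.find("longitude"))

print(name, latitude, longitude)
```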

2.1.2 XML Syntax in a Nutshell

The basic grammar of XML is simple enough that it can be expressed in four brief rules. (They are oversimplifications, strictly, but the exceptions are not important enough to matter at this stage.)

  1. An XML file must have a single outermost root element that contains everything else.
    <document> . . . contents . . .
    </document>
  2. An XML element must always have a start tag and an end tag. Both start and end tags are denoted by angle brackets surrounding the name of the element; the end tag must have a solidus (/) preceding the name:
    <tag> … </tag>
    The two main things that an element can contain are text and other elements.
    <document>
     <para>This is a paragraph with text only.</para>
     <para>This is <emph>another</emph> paragraph with a child element.</para>
    </document>
  3. An XML element may be qualified by attributes, contained within the start tag. The value of each attribute is contained within quotation marks (single or double):
    <document type="legal" xml:id="DOC-2008-07-11-0003"></document>
    Within running text, attributes are referred to by prefixing an at-sign (@) to their name: @type, @xml:id.
  4. XML elements must be nested; they cannot overlap. This syntax is illegal:
    <line><clause>April is the cruelest month</clause>, <clause>breeding</line> <line>Lilacs out of the dead land</clause>, <clause>mixing</line> <line>Memory and desire</clause>, <clause>stirring</line> <line>Dull roots with spring rain</clause>.</line>
    (The inability to capture "overlapping hierarchies" in this way is a fundamental limitation of XML.)

Any XML document that correctly follows these syntax rules is said to be ‘well formed’. So long as XML is well formed, it can be parsed, edited, transformed, or otherwise processed by software tools. (Conversely, an ill-formed XML document will usually generate one or more error messages when opened by such tools, prompting the user to correct its syntax.)
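
A well-formedness check of this kind can be sketched in a few lines of Python, using the parser in the standard library's xml.etree.ElementTree module. The function name is_well_formed and the shortened sample strings (including the <poem> wrapper) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = "<document><para>This is <emph>another</emph> paragraph.</para></document>"
# Overlapping elements: <clause> is still open when </line> arrives.
OVERLAPPING = "<poem><line><clause>breeding</line> <line>Lilacs</clause></line></poem>"

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(WELL_FORMED))
print(is_well_formed(OVERLAPPING))
```

The parser rejects the second string for exactly the reason given in rule 4: the end tag it meets does not match the element that is currently open.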

Well-formedness by itself is not usually enough to make XML particularly useful to humans, however. For example, the following XML document is well-formed:

<NurseryRhyme>
 <verse>
  <stanza>Mary had a little lamb,</stanza>
   its fleece was white as snow;
 <stanza>and everywhere that Mary went</stanza>
  <stanza>the lamb was sure to go.</stanza>
 </verse>
</NurseryRhyme>

This attempts to encode a nursery rhyme, but it is lacking a title and switches the <stanza> and <verse> elements—we want the former to contain the latter, not vice-versa. And it has a verse line that lacks a tag entirely.

2.1.3 Rules for Structuring an XML Vocabulary

Fortunately, there are several ways of specifying rules for XML that allow one to create a structured XML vocabulary—a markup language that uses a defined set of elements and attributes that must be arranged in specified ways. When an XML document is compared against such a rule set and meets all of its requirements, it is said to be valid XML (in addition to being well formed). Most software that processes XML can check for both well-formedness and validity.

There are two basic technologies for writing XML rule sets: Document Type Definitions (DTDs) and XML schemas. DTDs were the earliest (in fact they go back all the way to the 1970s, when they were used with GML, the Generalized Markup Language, which is more or less the grandparent of XML); in loose usage, people sometimes use the term DTD to mean any rule set associated with an XML language: ‘Have you figured out what DTD to use for encoding your recipes?’. XML schemas were developed in conjunction with XML itself, in the 1990s. Schemas are more powerful and more complex; their use in connection with TEI will be discussed below in [REFERENCE]. Here we will provide a simple example of how a DTD could be used to define the rules of NRML, our Nursery Rhyme Markup Language.

The rules for NRML are simple: an XML document encodes a single nursery rhyme in a root element called <NurseryRhyme>, which must have an identifier in an ID attribute. Under the root element there must be a title, contained in a <title> element, and one or more <stanza> elements; each <stanza> contains one or more verses in a <verse> element. A <verse> may contain only text.

Here is the DTD which specifies those rules:

<!ELEMENT NurseryRhyme (title, stanza+)>
<!ATTLIST NurseryRhyme ID CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (verse+)>
<!ELEMENT verse (#PCDATA)>

With the explanation that "CDATA" means "character data" and "#PCDATA" means "parsed character data" (text that may contain special constructs handled by the XML parser), the syntax of this DTD should be fairly intuitive. The content model of an element is given in parentheses. Required elements are separated by a comma and must appear in the order given. A plus sign means "one or more".

Once a DTD or schema has been written for a particular XML markup vocabulary, it can be applied to an XML file by an XML validator to determine whether the file is valid or not. (The way this is done for TEI files will be discussed below [REF].)
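
To make concrete what a validator checks, the sketch below re-implements the NRML rules by hand in Python. This is not how validation is normally done: Python's standard library does not validate against DTDs, so a real project would hand the DTD to a validating parser. The function nrml_errors and its messages are hypothetical, and the check is deliberately partial.

```python
import xml.etree.ElementTree as ET

def nrml_errors(xml_text):
    """Return a list of NRML rule violations (empty list means the rules are met)."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well formed
    errors = []
    if root.tag != "NurseryRhyme":
        errors.append("root element must be NurseryRhyme")
    if "ID" not in root.attrib:
        errors.append("NurseryRhyme requires an ID attribute")
    children = [child.tag for child in root]
    # Content model (title, stanza+): one title first, then one or more stanzas.
    if (len(children) < 2 or children[0] != "title"
            or any(tag != "stanza" for tag in children[1:])):
        errors.append("NurseryRhyme must contain (title, stanza+)")
    for stanza in root.iter("stanza"):
        verse_tags = [child.tag for child in stanza]
        # Content model (verse+): only verse children, and no loose text.
        has_loose_text = bool((stanza.text or "").strip()) or any(
            (child.tail or "").strip() for child in stanza)
        if not verse_tags or any(t != "verse" for t in verse_tags) or has_loose_text:
            errors.append("stanza must contain (verse+)")
    for verse in root.iter("verse"):
        if len(verse) > 0:  # (#PCDATA): text only, no child elements
            errors.append("verse may contain only text")
    return errors

GOOD = """<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza><verse>Mary had a little lamb,</verse></stanza>
</NurseryRhyme>"""

BAD = """<NurseryRhyme>
 <verse>
  <stanza>Mary had a little lamb,</stanza>
   its fleece was white as snow;
 </verse>
</NurseryRhyme>"""

print(nrml_errors(GOOD))
print(nrml_errors(BAD))
```

The second sample is well formed, so it parses without complaint, but it breaks every structural rule of NRML and is therefore not valid.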

2.1.4 XML Semantics

If you have followed the discussion of XML well-formedness and validity, it may occur to you that nothing in the rules for NRML that we have described prevents someone from tagging "Mary Had a Little Lamb" like this:
<stanza>
 <verse>Mary had a little</verse>
 <verse>lamb, its</verse>
 <verse>fleece was white as</verse>
 <verse>snow;</verse>
<!-- etc. -->
<!-- Incidentally, this structure and the previous are 'XML comments': they are not part of the formal document content and can be placed anywhere within an XML file. -->
</stanza>
This markup is both well-formed and valid; no automated parser will ever complain about it. But a human reader is justified in calling it "bad markup", because the encoder has not correctly identified the boundaries of metrical verses. From a human point of view, the most important rule governing the <verse> element is that it should encode a single line of verse. The semantics of <verse> cannot be expressed in XML or an XML technology; they depend on the conventions of poetics. If one is writing a guide to the NRML language, it is not enough to assume the basic syntax of XML and to present the rule set (DTD or schema) governing its structure: the appropriate use of the <verse> tag must be explained in terms of the rules for recognizing poetic verse.

The same thing is true on a much larger scale with the TEI Guidelines. The TEI vocabulary extends to hundreds of elements, and the rules governing which elements may appear where are quite complex. But the largest part of the Guidelines consists of explanation, illustrated by examples, of which TEI elements should be used to encode various features of texts and their associated metadata. Often one needs to be an expert in a particular subject—bibliography, linguistics, manuscript editing—in order to use TEI encoding in a way that is meaningful as well as technically valid.

2.1.5 Namespaces: Avoiding XML Vocabulary Collisions

As the use of XML expanded and many different XML markup languages emerged, people realized that it would often be useful to permit one XML vocabulary to incorporate another one. For example, there is a formal XML language, MathML, that can be used to encode mathematical equations. If one were using TEI to encode the correspondence of a mathematician, rather than trying to extend the TEI tagset to include equations it would be much simpler to use MathML directly within one's TEI document.

The first problem that arises when attempting to integrate differing XML languages is differentiating among the vocabularies, and in particular, avoiding name collisions. For example, suppose that we are encoding in TEI XML a scholarly work that has footnotes, which are tagged using <note>. And suppose that one footnote contains the opening bars of the song "Twinkle Twinkle Little Star". Let's say there is a widely used simple XML language called TuneML for encoding musical passages that also uses a <note> element, and we want to use it in our footnote. We might end up with something like this:
<note type="footnote" n="1">In the margin, the author has inscribed the opening
bars of <title>Twinkle, Twinkle, Little Star</title>:
<tune clef="G">
  <note>C</note>
  <note>C</note>
  <note>G</note>
  <note>G</note>
 </tune>
[etc.].
</note>
We have an obvious problem. It's impossible to distinguish between the elements from TEI and from TuneML, and the <note> element is therefore used for two entirely different purposes.

The solution adopted to resolve this problem was XML Namespaces, a way of identifying the elements of a single XML vocabulary so that they occupy their own unique ‘name space’. Namespaces are optional in XML; an XML document does not have to use them. None of the preceding examples of XML do, so they would be said to be in "no namespace". Namespaces give great flexibility and power to XML, but unfortunately they are one of the more confusing parts of XML for beginners.

A namespace is distinguished within an XML document using a unique identifier. And here is where the first confusion arises: namespace values share the same legal syntax as the Uniform Resource Identifiers (URI) used in the World Wide Web, and by convention they begin with http://. But a namespace identifier is not the name of a Web page, for all that it looks like one. For example, the namespace identifier of the TEI language is http://www.tei-c.org/ns/1.0. If you go to the Web address http://www.tei-c.org/ns/1.0, you will in fact find a short paragraph about the TEI namespace, but that is an optional courtesy on the part of the TEI. On the other hand, one of the namespaces used in Microsoft's XML format for its Office software is http://schemas.microsoft.com/office/word/2006/wordml, but there is nothing at the Web page http://schemas.microsoft.com/office/word/2006/wordml.

A namespace must be declared in an XML document to take effect. It applies to the node it is declared in, and to all nodes below that unless they are declared to be in a different namespace. Thus, every TEI document begins
<TEI xmlns="http://www.tei-c.org/ns/1.0"> ... content of TEI file
</TEI>
where xmlns= can be read as ‘the XML Namespace is...’.

Let's return to our hypothetical footnote above about Twinkle, Twinkle, Little Star. Suppose we know that the namespace of the TuneML language is http://tuneml.org/schema/2009. We could then rewrite our XML footnote like so:
<note type="footnote" n="1">In the margin,
the author has inscribed the opening
bars of <title>Twinkle, Twinkle, Little Star</title>:
<tune xmlns="http://tuneml.org/schema/2009" clef="G">
 <note>C</note>
 <note>C</note>
 <note>G</note>
 <note>G</note>
</tune>
[etc.].</note>
Now we genuinely have two different <note> elements. The first one belongs to the TEI namespace, which it inherits from the enclosing TEI document. The others, describing musical notes, inherit the TuneML namespace that is declared on their parent element <tune>.
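
The effect of the declaration can be observed with a namespace-aware parser. In the Python sketch below (standard library only), each element name is reported in the form {namespace}local-name, so the two kinds of <note> can no longer be confused. Two assumptions are made so that the fragment stands alone: the TEI namespace is declared directly on the footnote rather than on a root <TEI> element, and the TuneML namespace is the hypothetical one used in the text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TUNE_NS = "http://tuneml.org/schema/2009"  # hypothetical TuneML namespace from the text

FOOTNOTE = f"""
<note xmlns="{TEI_NS}" type="footnote" n="1">In the margin, the author has
inscribed the opening bars of <title>Twinkle, Twinkle, Little Star</title>:
<tune xmlns="{TUNE_NS}" clef="G">
 <note>C</note>
 <note>C</note>
 <note>G</note>
 <note>G</note>
</tune>
</note>
"""

footnote = ET.fromstring(FOOTNOTE)

# iter() walks the element itself and all of its descendants, matching on
# the expanded name "{namespace}local-name".
tei_notes = list(footnote.iter(f"{{{TEI_NS}}}note"))
musical_notes = list(footnote.iter(f"{{{TUNE_NS}}}note"))

print(footnote.tag)
print(len(tei_notes), len(musical_notes))
print([n.text for n in musical_notes])
```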

2.1.5.1 Namespace Prefixes

If you have followed the preceding description of namespaces and their identifiers, it may occur to you that they are useful to computers but not very friendly to human readers and editors of XML documents. If you are looking at XML nodes deep within a file using multiple namespaces, how can you easily figure out which namespace is in control of the current node? And isn't it cumbersome to attach long namespace identifiers every time you refer to a namespace in a file?

The solution adopted in the XML world is namespace prefixes. When declaring a namespace, you may optionally associate it with a prefix. Then for the current XML node and any nodes below it, adding the prefix to an element name is the same as declaring its full identifier. The simplest strategy is to declare all namespaces with prefixes at the top of the document (i.e., on the root element):

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tn="http://tuneml.org/schema/2009">

<!-- lots of content, then: -->
<note type="footnote" n="1">In the margin,
the author has inscribed the opening
bars of <title>Twinkle, Twinkle, Little Star</title>:
<tn:tune clef="G">
 <tn:note>C</tn:note>
 <tn:note>C</tn:note>
 <tn:note>G</tn:note>
 <tn:note>G</tn:note>
</tn:tune>
[etc.].</note></TEI>

Compare this example with the preceding one. Note that the TEI namespace is declared on the root element <TEI>, so it does not have to be repeated in descendant nodes; the footnote <note> inherits the TEI namespace. Also in the root element, the TuneML namespace is declared and associated with the prefix tn. Thus, within the file when elements from the TuneML language are used, they can simply be named with their prefix: <tn:tune>, <tn:note>. When using prefixes in this way, you must prefix every element you use, as there is no namespace inheritance—that is why every TuneML note is written as <tn:note>. But most people find XML documents with multiple namespaces more comprehensible when prefixes are used.
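
A short Python sketch (standard library only) may help confirm that a prefix is merely shorthand: the default-namespace form and the prefixed form of the same tune produce identical expanded names, whereas an unprefixed child of a prefixed element ends up in no namespace. The helper expanded_names and the shortened tunes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TUNE_NS = "http://tuneml.org/schema/2009"  # hypothetical namespace from the text

# The same tune, once with a default namespace and once with the tn prefix.
WITH_DEFAULT = f'<tune xmlns="{TUNE_NS}" clef="G"><note>C</note><note>G</note></tune>'
WITH_PREFIX = (
    f'<tn:tune xmlns:tn="{TUNE_NS}" clef="G">'
    "<tn:note>C</tn:note><tn:note>G</tn:note></tn:tune>"
)

def expanded_names(xml_text):
    """List every element name in "{namespace}local-name" form, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]

default_names = expanded_names(WITH_DEFAULT)
prefixed_names = expanded_names(WITH_PREFIX)

print(default_names == prefixed_names)

# A prefixed parent does NOT pass its namespace on to unprefixed children.
MIXED = f'<tn:tune xmlns:tn="{TUNE_NS}"><note>C</note></tn:tune>'
mixed_names = expanded_names(MIXED)
print(mixed_names)
```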

2.1.6 Further Reading

to come

2.2 XML on the Web

The ‘angle-bracket language’ that people are most familiar with is HTML, the Hypertext Markup Language that has powered the World Wide Web (WWW) since 1991. Its earliest format (still used today) was based not on XML but on the ancestor of XML, the Standard Generalized Markup Language or SGML. But since 2000 it has also existed as a pure XML language known as XHTML, which obeys all of the rules of syntax outlined above. For the purposes of this section, we are going to ignore the older format and use HTML to mean the XML form of the language.

2.2.1 Web Servers, Web Clients

The WWW is based on a client-server model. That is, a Web server on the Internet hosts content and delivers it over the network to clients: individual computers, smartphones, or other network appliances. Oversimplifying somewhat (but not much), every Web page view is the successful result of the delivery and interpretation of an HTML file with a structure like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>A Web Page</title>
 </head>
 <body>
  <h1>A Sample Web Page</h1>
  <p>This is a <em>minimal</em> HTML Web page.</p>
 </body>
</html>

Chances are you're familiar with the syntax. Information about the document (metadata) goes inside the <head>, while the <body> contains the document text (plus references to graphics, multimedia, programs, etc., that display within the page). Heading levels are designated by <h1>, <h2>, etc., paragraphs by <p>, emphasized text by <em>, and so on.

The Web began its meteoric rise once people started writing client software that could interpret and display the HTML language in a standard way: the Web browser, or just ‘browser’ for short. From Mosaic through Netscape to MS Internet Explorer, Firefox, Safari, and all the other contemporary ones, a browser has had the function of interpreting HTML code (and bits of programming languages and other data embedded with it) and displaying it to a user in a standard way. So, by convention, the <title> of an HTML file is shown at the very top of the browser window; an <h1> head is usually in boldface type much larger than the regular text font; text inside an <i> tag is of course rendered in italics, and so on. The existence of a fairly limited set of uniform tags meant that Web designers could predict with some reliability how their HTML documents would appear at the recipients' end.

Figure 1. Typical browser view of sample HTML

2.2.2 Styling the Web

A piece of technology that has given the Web much of its presentational flexibility is Cascading Style Sheets, or CSS. As CSS is an indispensable tool for most projects using the TEI to present material on the Web, it is worth a brief detour here.

CSS is a style sheet language that can be used to precisely define the appearance of XML documents (not just HTML ones, note!) on screen, in print, or even for audio or Braille devices. It allows Web designers to go well beyond the default renderings given to HTML elements by Web browsers. Consider the sample HTML file given above. Typically, a browser will render the <h1> in boldface font at around 150–200% of the body font size, and will render the <em> as italics. But suppose I want my headings to be at 130%, in small capitals, colored green; and I want my emphasized text to be bold and red, not italicized. I can accomplish this by adding a <style> element to my HTML <head>, containing the relevant CSS instructions:

<style type="text/css">
 h1 { font-size:130%; font-variant:small-caps; color:green; }
 em { font-weight:bold; color:red; font-style:normal; }
</style>
Figure 2. Typical browser view of sample HTML with CSS styling

2.2.3 Limitations of HTML

In its origins, HTML was designed for the sharing of technical documents and other information generated by researchers. It has been expanded greatly since the beginning, but its legacy explains two key features of HTML: (1) a relatively impoverished semantic vocabulary for describing text structures, and (2) an emphasis on appearance over structure. Together with CSS, it has an extraordinarily rich array of mechanisms for sizing, coloring, and positioning text and graphics, but it has no native way of distinguishing between, for example, prose and poetry. Most of its structural or semantic tags describe features typical of technical documentation or memo-like prose (<table>, <ol> [ordered list], <dl> [definition list]), or exist to enable hyperlinking, form submission, etc. (<a>, <form>/<input>). It is a superb vehicle for presenting text, graphics, and multimedia, but an inadequate one for representing the underlying structure and meaning of the universe of human textual production.

2.2.4 So... Why not Just Put XML on the Web?

Let's return to our invented Nursery Rhyme Markup Language. Suppose I have used NRML to mark up hundreds of nursery rhymes; can I just copy my files to the Web? Let's recall what an NRML file looks like; let's call it MaryLamb.xml:
<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza>
  <verse>
   <name>Mary</name> had a little lamb,</verse>
  <verse>its fleece was white as snow;</verse>
  <verse>and everywhere that <name>Mary</name> went</verse>
  <verse>the lamb was sure to go.</verse>
 </stanza>
</NurseryRhyme>
(We've added a new <name> element, for reasons that will become apparent below.)

The answer is ‘yes and no’. I can certainly copy MaryLamb.xml to a directory on a Web server, and invite people to look at it. But what they will see will be something like this:

Figure 3. MaryLamb.xml viewed in Firefox

This is a ‘raw XML’ view of the file. Web browsers know how to display HTML because it has a known set of tags and styling commands. But without some hints, they cannot make any assumptions about how an unknown XML language should be displayed.

2.2.5 CSS to the Rescue?

CSS can define the appearance of any XML elements, not just ones from the HTML vocabulary. So it is entirely possible to take a newly invented XML language like our NRML and use CSS to display it on the Web. We do this by putting our CSS instructions into a separate file, and then referencing the CSS file from our XML file using an XML processing instruction. Let's say we create a file NRML.css with the following contents:

NurseryRhyme { display:block; margin:1cm; font-size: 14pt; font-family: "Bookman Old Style"}
title { display:block; font-size:larger; font-weight:bold; margin-bottom:1em;}
stanza { display:block; margin-bottom:1em;}
verse { display:block; line-height: 1.3;}
name { display:inline; font-variant:small-caps; }

The CSS display instruction tells the browser whether each element is to be displayed as a block (separated from surrounding elements, like a paragraph) or inline (part of the surrounding flow of text). We have defined margins plus a font size and style for the whole <NurseryRhyme> element, and added various spacing and typographic styling to the other elements.

With our CSS file done, we add the following code to the top of MaryLamb.xml:

<?xml-stylesheet href="NRML.css" type="text/css"?>

The result, viewed in a modern Web browser, looks like this:

Figure 4. MaryLamb.xml plus CSS, viewed in Firefox

So is this all we need to do to publish XML documents, whether in TEI or any other XML language, on the Web? The answer is again ‘yes and no’. Yes, this is an effective simple way of displaying formatted XML; but no, it is not a flexible enough solution to handle several important needs when publishing on the Web:

  • CSS instructions merely tell a Web browser how to display an XML file; CSS cannot add substantive content to the file. HTML files, for example, often include important metadata via <meta> tags in the <head>, which convey information about the file's character encoding, language, authorship, copyright status, and more. CSS cannot add <meta> tags.
  • The version of CSS supported by most Web browsers can only tell a browser how to display XML elements in the order they are encountered. It cannot reorder, transform, or apply special logic to the underlying data.

For example, suppose we want to publish our XML version of Mary Had a Little Lamb with the following enhancements:

  1. each verse line is preceded by its line number in brackets
  2. every second line is indented
  3. the ID value of the document (the @ID attribute of <NurseryRhyme>) is given in a footnote line following the verse
  4. an HTML <meta> tag is added to provide the keyword "nursery rhyme" to be picked up by Web indexers like Google

All this can be done, but not by the CSS language. Instead, we must use a more powerful general-purpose programming language that can operate on XML data. Fortunately, there are well-established tools for this purpose, and they are usually indispensable for any project that has a body of TEI-encoded texts to share on the Web. The following section looks at one of the most commonly used transformation tools.

2.3 Transforming XML

XML was published as a formal Recommendation by the World Wide Web Consortium (W3C) in 1998. As the language was being developed, it was recognized that for XML to be useful there would have to be programming tools available to query, transform, and render XML in various ways, whether for Web publishing or for other purposes. Concurrently, therefore, a working group developed specifications for an extensible stylesheet language: a kind of programming language that could be applied to XML documents to extract data, transform one type of XML file into another, generate text output, or produce a print-ready document in an entirely different typesetting language such as LaTeX or PostScript. The result, formalized during 1999–2001, was XSL, the eXtensible Stylesheet Language. XSL is the umbrella term for three specific languages that do the actual work of transformation:

XPath
The XML Path language, or XPath, provides a way to identify and retrieve specific parts of an XML document. For example, using our nursery rhyme example, the <verse> element reading ‘its fleece was white as snow’ can be extracted with the following XPath instruction: //stanza[1]/verse[2]. Or we can return every verse containing the word Mary with the XPath //verse[contains(., "Mary")].
XSLT
XSLT (which stands for ‘Extensible Stylesheet Language Transformations’, though the full name is rarely used) is the workhorse of XML programming languages. It is designed to take as input one or more XML documents, and to enable a wide variety of operations on them in order to transform their contents in virtually any way. XSLT is commonly used to transform content from one XML language to another—for example, from TEI to HTML—or to add or subtract features within a particular XML language. For example, we could use XSLT to add internal line numbers to our NRML verse elements:
<verse n="2">its fleece was white as snow;</verse>
XSLT can also transform XML to plain text. XSLT contains many of the structures of other general-purpose programming languages: variables, flow control, if-then logic, and (via XPath, which it incorporates) many functions that operate on strings and numeric values.
XSL-FO
XSL-FO stands for ‘XSL Formatting Objects’. It is a stylesheet language designed specifically to apply document formatting to XML files primarily for paged output, such as book publication. It is commonly used to convert an XML file into PDF that can be displayed or printed.
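
The two XPath expressions given above can be tried out with a short Python sketch. It relies on the limited XPath subset supported by the standard library's xml.etree.ElementTree module, which is an assumption of convenience: paths there must be relative (hence the leading dot), and functions such as contains() are not available, so that test is written in Python instead. A full XPath processor would accept both expressions exactly as printed.

```python
import xml.etree.ElementTree as ET

RHYME = ET.fromstring("""
<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza>
  <verse>Mary had a little lamb,</verse>
  <verse>its fleece was white as snow;</verse>
  <verse>and everywhere that Mary went</verse>
  <verse>the lamb was sure to go.</verse>
 </stanza>
</NurseryRhyme>
""")

# Counterpart of //stanza[1]/verse[2].
second_verse = RHYME.find(".//stanza[1]/verse[2]")

# Counterpart of //verse[contains(., "Mary")], with the string test done in Python.
mary_verses = [v.text for v in RHYME.iter("verse") if "Mary" in "".join(v.itertext())]

print(second_verse.text)
print(mary_verses)
```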

The most common application of XSL in the TEI world is probably the use of XSLT to transform TEI documents for display on the Web. In the remainder of this section, we will give a realistic example of how XSLT can be used with our nursery rhyme to achieve the exact Web publication format we want.

2.4 XSLT: A Practical Example

For reference, here is our nursery rhyme XML:

<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza>
  <verse>Mary had a little lamb,</verse>
  <verse>its fleece was white as snow;</verse>
  <verse>and everywhere that Mary went</verse>
  <verse>the lamb was sure to go.</verse>
 </stanza>
</NurseryRhyme>

We want to publish it on the Web with the formatting we used in the CSS output example above, but we also want to (1) number verse lines, (2) indent every other line, (3) show the document ID in a footnote line, and (4) add a <meta> tag with the keyword "nursery rhyme". To accomplish this, we are going to write a short XSLT program that takes MaryLamb.xml as input, and produces as output a single HTML file that contains all the content and style instructions needed. The result will look like this:

Figure 5. MaryLamb.xml as transformed by XSLT, viewed in Firefox

The XSLT used to produce our new HTML file follows. Viewing XSLT for the first time can be intimidating, as it is a relatively verbose language. Here are a few preliminary comments and things to notice in the code:

  • XSLT is itself written as an XML document. All of its basic instructions take the form of XML elements with the namespace prefix xsl. For example, where another programming language might create a variable called ‘lineNo’ with an assignment like let $lineNo := number(line), in XSLT you would create the variable with an XML element:
    <xsl:variable name="lineNo">
     <xsl:number/>
    </xsl:variable>
  • The root element of every XSLT program is an <xsl:stylesheet> element. It declares namespaces and sets other options. An <xsl:output> element can be used to specify output in XML, HTML, XHTML (as below), or text.
  • The first workhorse of XSLT is the template, expressed in a <xsl:template> element. For transforming XML, templates are typically created for all or most distinct elements in the input document. They provide the rules for transforming each kind of element into something else. The real strength of XSLT is that it applies templates recursively, meaning that it can descend into your XML document and transform all its nested elements without your having to write any special program logic.
  • The other workhorse of XSLT is the fact that you can create XML output simply by inserting an XML element in your program. If you look through the <xsl:template> sections below, you will see that in each case it is followed by a construct in HTML. For example, the first template matches the NRML root element <NurseryRhyme>, and it immediately constructs an HTML document template using <html>, <head>, <body>, and so on.

Look over the script, and by comparing it with the input nursery rhyme and your knowledge of HTML elements, try to get a general sense of what it is doing. You'll find the template for <verse> is the most complicated. After the script we'll present a bit of explication of what is going on in the templates.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 <!-- NRML-to-HTML.xsl: transform Nursery Rhyme Markup Language into HTML -->
 <xsl:output method="xhtml"/>
 <xsl:template match="NurseryRhyme">
  <html>
   <head>
    <title>
     <xsl:value-of select="title"/>
    </title>
    <style type="text/css">
     body { margin:1cm; font-size: 14pt; font-family: "Bookman Old Style"}
     h1 { font-size:larger; font-weight:bold; margin-bottom:1em;}
     div.id { color: blue; font-family: sans-serif; font-size: 80%; margin-top: 2em;}
     div.stanza { margin-bottom:1em; line-height: 1.3}
     p.verse { margin: 0;}
     p.verseI { margin: 0; text-indent: 1.5em;}
     span.name { font-variant: small-caps; }
    </style>
    <meta name="keywords" content="nursery rhyme"/>
   </head>
   <body>
    <h1><xsl:apply-templates select="title"/></h1>
    <xsl:apply-templates select="stanza"/>
    <div class="id">[Document ID: <b><xsl:value-of select="@ID"/></b>]</div>
   </body>
  </html>
 </xsl:template>
 <xsl:template match="stanza">
  <div class="stanza">
   <xsl:apply-templates/>
  </div>
 </xsl:template>
 <xsl:template match="verse">
  <xsl:variable name="lineNo">
   <xsl:number level="any"/>
  </xsl:variable>
  <p>
   <xsl:attribute name="class">
    <xsl:choose>
     <xsl:when test="$lineNo mod 2 eq 0">verseI</xsl:when>
     <xsl:otherwise>verse</xsl:otherwise>
    </xsl:choose>
   </xsl:attribute>
   <xsl:value-of select="concat('[', $lineNo, '] ')"/>
   <xsl:apply-templates/>
  </p>
 </xsl:template>
 <xsl:template match="name">
  <span class="name"><xsl:apply-templates/></span>
 </xsl:template>
</xsl:stylesheet>

You should be able to see that for nearly every element in NRML, there is an <xsl:template> section in our XSLT program. For example, when the program encounters an NRML <name>, this is what it does:
<xsl:template match="name">
 <span class="name">
  <xsl:apply-templates/>
 </span>
</xsl:template>
It outputs an HTML <span> element with class set to "name" (so that CSS will format it appropriately). Then it applies templates using the <xsl:apply-templates> element. <xsl:apply-templates> is the heart of XSLT recursion. Essentially, it says this:
  • examine the contents of the current element (here, <name>)
  • if the element has children (child nodes) for which there are template rules elsewhere in the XSLT file, apply those rules
  • if we encounter children for which there are no template rules, apply the default template: output the textual content.
Thus when our XSLT script encounters
<name>Mary</name>
, it will output
<span class="name">Mary</span>
following the template rule for "name".
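
For readers more at home in a conventional programming language, the recursive behaviour of <xsl:apply-templates> can be imitated in Python. The sketch below is only an analogy, not an XSLT processor: the rule table TEMPLATES, keyed by element name, and the functions apply_templates and name_template are hypothetical, and the default rule simply copies text through, as the built-in XSLT rule does.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def apply_templates(element, templates):
    """Process an element's text and children in document order, as a string."""
    output = escape(element.text or "")
    for child in element:
        rule = templates.get(child.tag, apply_templates)  # default rule: text only
        output += rule(child, templates)
        output += escape(child.tail or "")
    return output

def name_template(element, templates):
    """Counterpart of the template rule that matches "name"."""
    return '<span class="name">' + apply_templates(element, templates) + "</span>"

TEMPLATES = {"name": name_template}  # hypothetical rule table, keyed by element name

verse = ET.fromstring("<verse>and everywhere that <name>Mary</name> went</verse>")
result = apply_templates(verse, TEMPLATES)
print(result)
```

Each rule calls apply_templates on its own contents, so nested elements are handled at any depth without an explicit loop over the whole document.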

The template for <verse> provides an example of the power of XSLT. It sets a variable lineNo, which is simply the ordinal number of the current <verse> within the poem. It does this by calling on a built-in XSLT operation, the <xsl:number> element, which allows for highly flexible numbering. It then uses the $lineNo variable ('$' is the sign for ‘variable’) within a numeric expression that tests whether the current verse is even or odd ($lineNo mod 2 eq 0) in order to assign the appropriate class attribute to the HTML <p> that controls whether or not the line is indented. $lineNo is also used to add the line number within [ ] preceding each verse.
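
The numbering and odd/even logic of the <verse> template can be restated in Python for comparison. This is a sketch of the same idea rather than a translation of the stylesheet: the function render_verses is hypothetical, enumerate stands in for <xsl:number level="any"/>, and the output is a list of strings rather than a serialized HTML document.

```python
import xml.etree.ElementTree as ET

RHYME = ET.fromstring("""
<NurseryRhyme ID="mary_lamb">
 <title>Mary Had a Little Lamb</title>
 <stanza>
  <verse>Mary had a little lamb,</verse>
  <verse>its fleece was white as snow;</verse>
  <verse>and everywhere that Mary went</verse>
  <verse>the lamb was sure to go.</verse>
 </stanza>
</NurseryRhyme>
""")

def render_verses(root):
    """Number every verse in document order and indent the even-numbered ones."""
    paragraphs = []
    # enumerate(..., start=1) counts verses across the whole document.
    for line_no, verse in enumerate(root.iter("verse"), start=1):
        css_class = "verseI" if line_no % 2 == 0 else "verse"
        text = "".join(verse.itertext())  # sample text needs no escaping
        paragraphs.append(f'<p class="{css_class}">[{line_no}] {text}</p>')
    return paragraphs

html_lines = render_verses(RHYME)
print("\n".join(html_lines))
```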

XSLT is a powerful and complex language, and like any full-featured programming language it requires considerable learning time to reach productivity. But once learned, XSLT can be an extraordinarily productive tool for manipulating any text encoded as XML. For instance, it took the author of this section about half an hour to write the 50-line XSLT program above. Once written, it can transform any number of nursery rhymes encoded in NRML. Compare that against the labor it would take to convert, say, 1000 NRML files to HTML by hand!

Date: 2013-03-21