
1. In which we are introduced

By the end of this course you will know:
  1. what we mean by text encoding (and more specifically, the Text Encoding Initiative)
  2. how to use a simple editor to mark up documents in TEI XML
  3. how to develop a TEI-conformant schema, tailored to specific project needs
  4. (quite) a lot about some key parts of the TEI Guidelines:
    • metadata and the TEI Header
    • names and named entities
    • linking and alignment
    • transcription of documents and audio
Additionally, we aim to provide
  • short workshops on a variety of TEI-related topics
  • consultation sessions
  • guest lectures

A splendid time is guaranteed for all.

2. Course Materials

  • All course materials including:
    • All slides from lectures (in TEI XML and PDF)
    • All exercises (in TEI XML, HTML, and PDF)
    • All materials for the exercises
    are available on the TEI @ Oxford website.
  • The URL is: http://tei.oucs.ox.ac.uk/Oxford/2010-07-oxford/index.xml
  • All these materials are licensed under a Creative Commons Attribution license, which means they are freely available for re-use (though do let us know!)
  • And they're on the USB key in your bag

3. After the workshop...

  • After the workshop, if you have questions, mail the TEI-L mailing list rather than writing to us privately. This is better because:
    • we'll still try to answer as well as we would privately
    • you get answers not only from us, but TEI experts around the world
    • questions from those of all levels of ability stop the list becoming too technical
    • everyone benefits from having the answers be public — and you benefit by reading (and sometimes answering!) others' problems

4. 1987 was a long time ago...

The Text Encoding Initiative was born into a very different world
  • the world wide web did not exist
  • the tunnel beneath the English Channel was still being built
  • a state called the Soviet Union had just launched a space station called Mir
  • serious computing was done on mainframes
  • mobile phones did not exist

5. ...but also a familiar one

  • Corpus linguistics and ‘artificial intelligence’ had created a demand for large scale lexical resources in academia and beyond
  • Advances in text processing were beginning to affect lexicography and document management systems (e.g. TeX, Scribe, troff...)
  • The Internet existed and theories about how to use it ‘hypertextually’ abounded
  • Books, articles, and even courses in something called "Computing in the Humanities" were becoming commonplace

6. Birth of the Text Encoding Initiative

  • Spring 1987: European workshops on standardisation of historical data (J.P. Genet, M. Thaller)
  • Autumn 1987: In the US, the NEH funds an exploratory international workshop on the feasibility of defining "text encoding guidelines"
Figure 1. Vassar College, Poughkeepsie

7. Today's question:

  • So the TEI is very old!
  • It comes from a time before the Web, before the DVD, the mobile phone, cable tv, or Microsoft Word
  • Not much in computing survives 5 years, never mind 20
  • Why is it still here, and how has it survived?
  • What relevance can it possibly have today?

8. Is the TEI still relevant?

  • With XML everyone can create their own markup system and still share data!
  • In the Semantic Web, XML systems will all understand each other's data!
  • RDF can describe every kind of markup; SPARQL can search it!

Well .... maybe ....

9. Are these images of the same thing?

10. Are these images of the same thing?

11. A text is not a document

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "document" is something that exists in the world, which we can digitize.

A "text" is an abstraction, created by or for a community of readers, which we can encode.

12. Encoding of texts

  • A text is more than a sequence of encoded glyphs or lexical tokens
    • It has a structure and a communicative function
    • It also has multiple possible readings
  • Encoding, or markup, is a way of making these things explicit

Only that which is explicit can be reliably processed

13. The virtuous circle of encoding

14. Some alphabet soup

SGML Standard Generalized Markup Language
HTML Hypertext Markup Language
W3C World Wide Web Consortium
XML eXtensible Markup Language
DTD Document Type Definition (or Declaration)
CSS Cascading Style Sheets
XPath XML Path Language
XSLT eXtensible Stylesheet Language - Transformations
XQuery XML Query Language
RELAX NG Regular Language for XML (Next Generation)

Oh, and then there's also TEI, the Text Encoding Initiative

15. XML in three slides (1)

An XML document may contain:-
  • elements, possibly bearing attributes
  • processing instructions
  • comments
  • entity references
  • CDATA marked sections (IGNORE and INCLUDE sections may appear only within a DTD)
An XML document must be well-formed and may be valid
<?xml version="1.0" ?>
<element attribute="value"> content </element>
<!-- comment -->

16. XML in three slides (2)

  • An XML document represents a (kind of) tree with a single root and many descendant nodes
  • A node can be
    • a subtree
    • a single element (possibly bearing some attributes)
    • a string of character data
    • empty
  • Each element has a name or generic identifier
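
As a rough illustration of the tree described above, here is a small invented fragment. The element names (recipe, title, ingredients, ingredient, note) are assumptions made up for this sketch; they are not TEI elements.

```xml
<recipe serves="2">                    <!-- the single root, bearing one attribute -->
  <title>Toast</title>                 <!-- a single element holding character data -->
  <ingredients>                        <!-- a subtree: an element with element children -->
    <ingredient>bread</ingredient>
    <ingredient>butter</ingredient>
  </ingredients>
  <note/>                              <!-- an empty element -->
</recipe>
```

The generic identifier of the root is "recipe"; every other node is one of its descendants.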

17. XML in three slides (3)

  • An XML document is encoded as a linear string of Unicode characters
  • It begins with a special processing instruction
  • Each element occurrence within it is marked by a start-tag and an end-tag (or, if it is empty, by a single empty-element tag).
  • Attribute name/value pairs are supplied on the start-tag and may be given in any order
  • The characters < and & are Magic and must always be "escaped" if you want to use them as themselves
  • Comments are delimited by <!-- and -->
  • CDATA sections are delimited by <![CDATA[ and ]]>
  • Entity references are delimited by & and ;

Note: a document can be valid in addition to being well-formed. This means that it obeys the rules of a specified schema.
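
The syntax rules above can be seen together in one short sketch. The content and attribute values are invented for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- a comment; it may not contain two adjacent hyphens -->
<p n="1" type="example">
  Fish &amp; chips for &lt; 5 pounds.
  <![CDATA[Inside a CDATA section, < and & stand for themselves.]]>
</p>
```

Here &amp; and &lt; are entity references standing for the two Magic characters. Writing the start-tag as <p type="example" n="1"> would mean exactly the same thing, since attribute order is not significant.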

18. Schemas and namespaces

  • Informally, a namespace is a way of identifying the provenance of a bunch of elements; a schema does the same, but it also specifies some rules about how those elements should be used.
  • a schema allows you to
    • ensure that your documents use only predefined elements, attributes, and entities
    • enforce structural rules such as ‘every chapter must begin with a heading’ or ‘recipes must include an ingredient list’
  • a namespace is just a URI; a schema is a formal specification written in a formal language
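
To make the distinction concrete, here is a minimal sketch of a schema for the rule "every chapter must begin with a heading", written in the XML syntax of RELAX NG. The element names chapter, head, and p are assumptions chosen for this example; a real TEI schema is much larger and is normally generated rather than written by hand.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="chapter">
      <!-- exactly one heading, and it must come first -->
      <element name="head"><text/></element>
      <!-- followed by one or more paragraphs -->
      <oneOrMore>
        <element name="p"><text/></element>
      </oneOrMore>
    </element>
  </start>
</grammar>
```

Note that the xmlns attribute on the root is simply a URI saying that these elements belong to RELAX NG, whereas the schema as a whole states rules that a validator can check.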

19. Schema languages

  • XML DTD language
  • W3C Schema
  • RELAX NG

All have different tool kits, different syntaxes, and different methods of doing things, particularly for content validation.

20. The TEI is not (just) a schema

The TEI architecture provides:
  • definitions and names for several hundred useful textual distinctions
  • a set of modules that can be used to generate schemas making those distinctions
  • usage rules (varying in their formality) for those elements
  • a customization mechanism for selecting, modifying, and combining these definitions
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats

The TEI thus constitutes a simple consensus-based way of organizing and structuring textual (and other) resources.
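
As a hedged sketch of what selecting modules can look like, the fragment below is the kind of schema specification found in a TEI customization file (an ODD). The identifier "ipp" is a hypothetical name invented here, and the choice of modules is only an assumption about what a project like the one discussed later might need.

```xml
<schemaSpec ident="ipp" start="TEI">
  <!-- infrastructure shared by all modules: classes, datatypes, macros -->
  <moduleRef key="tei"/>
  <!-- the TEI header -->
  <moduleRef key="header"/>
  <!-- elements common to most documents: paragraphs, lists, highlighting -->
  <moduleRef key="core"/>
  <!-- default text structure: text, body, div -->
  <moduleRef key="textstructure"/>
  <!-- figures and tables -->
  <moduleRef key="figures"/>
</schemaSpec>
```

From a specification of this kind, the standard stylesheets can produce a schema and its accompanying documentation.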

21. Relevance of the TEI

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

22. A framework like the TEI makes good business sense

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

23. The Imaginary Punch Project

  • Punch is a famous English humorous journal, published weekly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
  • The IPP plans to make available fully marked up texts of the journal, in conjunction with page images...
    • for social historians
    • for librarians
    • for linguists
  • How will the TEI help? More specifically, which parts of the TEI will the project use?

24. What's in a text?

25. Looking at Punch, what do we need to mark up?

  • issue information and page number for reference purposes
  • "chunks" or divisions of text, which may contain a picture, a poem, some prose, some drama, or a combination
  • within the chunks, we can identify formal units such as
    • a picture, a caption
    • stanzas, lines
    • paragraphs
    • speeches and stage-directions
  • and more...

26. Macrostructure

All the issues of Punch for one year make up a volume. If we consider the volume as a single <text>, we could treat each issue as a <div> within it. Or (better) we could use the <group> element:
<text xml:id="v147">
 <front><!-- introductory materials for volume 147 here --></front>
 <group>
  <text xml:id="I1914-07-01">
<!-- first issue (1 July) -->
  </text>
  <text xml:id="I1914-07-15">
<!-- second issue (15 July) -->
  </text>
<!-- etc... -->
 </group>
 <back><!-- volume index, appendix etc. --></back>
</text>

27. TEI tags for the high level structure

Treating each issue as a single <text> element, each identifiable chunk within it can be a <div> element of a particular type (e.g. cartoon, verse, prose).

For example, page 1 has two divisions:
<pb n="1"/>
<div type="cartoon">
<!-- ... -->
</div>
<div type="poem">
<!-- ... -->
</div>
Page 2 also has two, of different types:
<pb n="2"/>
<div type="prose">
 <head>The enchanted castle</head>
<!-- ... -->
</div>
<div type="snippet">
<!-- ... -->
</div>

28. Why divisions rather than pages?

Because a division can start on one page (page 5 for example) and finish on another (page 6)

We use an empty element <pb> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="cartoon">
<!-- ... -->
</div>
<div type="review">
 <head>Egypt in Venice</head>
<!-- ... -->
 <pb n="6"/>
<!-- ... -->
</div>
<div type="cartoon">
<!-- ... -->
</div>
<div type="verse">
<!-- ... -->
</div>
<div type="snippets">
<!-- ... -->
</div>

The sequence in which divisions appear is rather arbitrary (see page 6 for example)

29. Divisions can contain divisions...

<div type="snippets">
 <div type="snippet">
<!-- ... -->
 </div>
 <div type="snippet">
<!--Men for the Antarctic... Canadians-->
 </div>
</div>
  • TEI also provides division elements with names that indicate their degree of nesting (<div1>, <div2> etc.) which some people prefer.
  • Divisions must always tessellate: once "down" a level, you cannot pop "up" again within the same division.
  • The heading or headings of a division are part of the division, not separate from it.
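
The tessellation rule is easier to see in a sketch. The type value "section" and the paragraph text are invented for illustration; only the heading is taken from the page 2 example.

```xml
<div type="prose">
  <head>The enchanted castle</head>   <!-- the heading sits inside its division -->
  <p>An opening paragraph...</p>
  <div type="section">
    <p>Text of a first subdivision...</p>
  </div>
  <div type="section">
    <p>Text of a second subdivision...</p>
  </div>
  <!-- a paragraph here, after the nested divisions, would not be allowed -->
</div>
```

Once the outer division has gone "down" into its first subdivision, everything that follows at that level must also be a subdivision.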

30. Global attributes

Some features (potentially) apply to everything:
  • identity
  • language
  • rendition
TEI provides global attributes for these:
  • xml:id provides a unique identifier for any element;
  • n provides a name or number for any element
  • xml:lang specifies the language of any element, using an ISO standard code
  • rend and rendition provide ways of specifying the visual appearance (rendition) of any element
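
A minimal sketch of the global attributes in use follows. All the attribute values, including the identifier, are assumptions invented for this example.

```xml
<div xml:id="I1914-07-01-d2" n="2" type="verse" xml:lang="en">
  <head rend="italic">...</head>
  <lg n="1">
    <l n="1">...</l>
    <!-- a single line in another language overrides the inherited value -->
    <l n="2" xml:lang="fr">...</l>
  </lg>
</div>
```

The value of xml:id must be unique within the document, whereas n values such as "1" and "2" can be repeated freely.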

31. What are divisions made of?

(apart from other smaller divisions)

  • <head> (heading)
  • <p> (paragraph)
  • <sp> (speech, contains any of the foregoing, also <stage> and <speaker>)
  • <list> (contains <head>, <label>, <item>)
  • <table>, (contains <row> containing <cell>) ...
  • <l> (verse line) optionally grouped into <lg> (line group) stanzas
  • <figure> (contains <graphic>, <figDesc>, <head>...)
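
By way of illustration, a division mixing several of these components might be sketched as follows. The type value, the speaker names, and the stage directions are invented, and the elided content is left as "...".

```xml
<div type="drama">
  <head>...</head>
  <stage>Enter two travellers.</stage>
  <sp>
    <speaker>First traveller</speaker>
    <p>...</p>
  </sp>
  <sp>
    <speaker>Second traveller</speaker>
    <l>...</l>
    <stage>Exit.</stage>
  </sp>
  <list>
    <head>...</head>
    <label>1</label> <item>...</item>
  </list>
</div>
```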

32. Below the paragraph...

Within the elements already introduced, TEI offers plenty of scope for mark-up of smaller components. For example:
  • boundaries, such as page, column, or line breaks
  • highlighting, emphasis and quotation
  • editorial changes such as correction, normalization etc.
  • names, numbers, dates, addresses...
  • links and cross-references
  • notes, annotation, indexing
  • graphics
  • bibliographic citations
  • words and other analyses
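
A few of these phrase-level components are shown together in the sketch below. The sentence is invented for illustration and does not come from Punch; the target of the cross-reference assumes an issue identified as in the macrostructure example.

```xml
<p>On <date when="1914-07-01">1 July 1914</date>
  <name type="person">Mr. Punch</name> described the
  <lb/>performance as <hi rend="italic">quite</hi>
  <q>the thing</q>.<note>An invented example.</note>
  See also <ref target="#I1914-07-15">the next issue</ref>.</p>
```

Here <lb/> marks a line break in the source, <hi> records highlighting, and <ref> supplies a cross-reference to another part of the same document.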

33. A simple dialogue example ...

34. ... encoded

<div type="cartoon">
  <head>When the ships come home</head>
  <figure>
   <figDesc>A man in Turkish dress lounges on a sofa,
     smoking a cigarette and consulting a book
     labelled <title>Naval ledger</title>. Another man, in
     traditional Greek costume, stands beside him,
     also reading a notebook, labelled <title>Engagements</title>.</figDesc>
  </figure>

  <p>Isn't it time we started fighting again?</p>
  <p>Yes, I daresay. How soon could you begin?</p>
  <p>Oh, in a few weeks.</p>
  <p>No good for me. Shan't be ready till
     the autumn.</p>
</div>

35. An editorial intervention

Consider: ‘Excuse me sir, but would you like to buy a nice little dawg?’ on page 6.

We can:
  • use <orig> to show that "dawg" is what it says, even though this is a nonstandard spelling
  • use <reg> to show that "dog" is an editorially-supplied regularisation of what it says
  • or provide both within a <choice> element to say that either is a valid encoding:
...a nice little <choice><orig>dawg</orig><reg>dog</reg></choice>?

More (much more) of this kind of thing later...

36. Macrostructure

As well as the transcribed text, we have to combine metadata about each volume, and images of its pages. These are the three parts of a canonical TEI resource:
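<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader><!-- required; provides metadata --></teiHeader>
 <facsimile><!-- the document represented in image form --></facsimile>
 <text><!-- the text transcribed and marked up --></text>
</TEI>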

37. What kinds of metadata?

For IPP and for any other comparable project, we will need a place for such information as
  • identification of the resource itself ("what is this thing?")
  • statements of responsibility ("who did what when?")
  • indication of source ("what was this derived from?")
  • publication statement ("how is this item distributed and by whom?")
  • declaration of encoding practice ("what do the codes we added mean?")

The TEI Header supports all these, and more...

38. Simple TEI Header for IPP

<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Punch, or the London Charivari, Vol. 147, July 1, 1914</title>
  </titleStmt>
  <publicationStmt>
   <idno type="gutenberg">24357</idno>
   <availability><p>This text is freely available for re-use
         under US and UK law; consult your local
         legal restrictions if elsewhere.</p></availability>
  </publicationStmt>
  <sourceDesc><p>This text is a TEI version of a Project Gutenberg
       text originally located at <ptr target="..."/>.
       As per their license agreement we have removed all
       references to the PG trademark.</p></sourceDesc>
 </fileDesc>
 <revisionDesc><change when="2008-07-26T23:49:55.968+01:00"/></revisionDesc>
</teiHeader>

39. Macrostructure

If many such documents are grouped together to form a corpus (rather than a collection), it may be useful to factor out the metadata they have in common:
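<teiCorpus>
 <teiHeader><!-- shared metadata --></teiHeader>
 <TEI><teiHeader><!-- specific metadata --></teiHeader>
<!-- ... -->
 </TEI>
 <TEI><teiHeader><!-- specific metadata --></teiHeader>
<!-- ... -->
 </TEI>
</teiCorpus>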

TEI@Oxford. Date: July 2010
Copyright University of Oxford