IT Services, University of Oxford

1. TEI Infrastructure

  • The TEI encoding scheme consists of a number of modules
  • These declare XML elements and their attributes
  • An element's declaration assigns it to one (or more) model classes
  • Another part declares its possible content and attributes with reference to these classes
  • This indirection allows strength and flexibility
  • It makes it easy to add new elements, or to exclude existing ones, by referencing existing classes
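
A minimal sketch of this indirection, written in the TEI's own ODD documentation language. The element name <joke> and its namespace are hypothetical, chosen only for illustration; model.pLike, att.global and macro.paraContent are existing TEI classes and macros. Under these assumptions, joining model.pLike should be enough to make the new element available wherever a paragraph is allowed, without touching any other declaration. Older TEI releases express the content model in RELAX NG (<rng:ref name="macro.paraContent"/>) rather than with <macroRef>.
<elementSpec ident="joke" ns="http://example.org/ns/ipp" mode="add">
 <classes>
<!-- class membership decides where the element may appear -->
  <memberOf key="model.pLike"/>
<!-- and which attributes it inherits -->
  <memberOf key="att.global"/>
 </classes>
 <content>
<!-- its own content is also defined by reference -->
  <macroRef key="macro.paraContent"/>
 </content>
</elementSpec>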

2. What is a module?

  • A convenient way of grouping together a number of element declarations
  • These are usually on a related topic or specific application
  • Most chapters focus on elements drawn from a single module, which that chapter then defines
  • A TEI Schema is created by selecting modules and adding or removing elements from them as needed
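
As a rough illustration, module selection might be written as an ODD <schemaSpec> like the one below. The identifier "ipp" and the particular choice of modules and of the deleted element are assumptions made for this sketch, not a recommended customization.
<schemaSpec ident="ipp" start="TEI">
<!-- modules needed by almost any TEI schema -->
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
<!-- extra modules for pictures, tables, and page images -->
 <moduleRef key="figures"/>
 <moduleRef key="transcr"/>
<!-- remove an element we do not expect to need -->
 <elementSpec ident="formula" module="figures" mode="delete"/>
</schemaSpec>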

3. Modules

Module name Chapter
analysis Simple Analytic Mechanisms
certainty Certainty and Responsibility
core Elements Available in All TEI Documents
corpus Language Corpora
dictionaries Dictionaries
drama Performance Texts
figures Tables, Formulae, and Graphics
gaiji Representation of Non-standard Characters and Glyphs
header The TEI Header
iso-fs Feature Structures
linking Linking, Segmentation, and Alignment
msdescription Manuscript Description
namesdates Names, Dates, People, and Places
nets Graphs, Networks, and Trees
spoken Transcriptions of Speech
tagdocs Documentation Elements
tei The TEI Infrastructure
textcrit Critical Apparatus
textstructure Default Text Structure
transcr Representation of Primary Sources
verse Verse

4. The Imaginary Punch Project

  • Punch is a famous English humorous journal, published regularly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
  • The IPP plans to make available fully marked up texts of the journal, in conjunction with page images...
    • for social historians
    • for librarians
    • for linguists
  • How will the TEI help? And which parts of the TEI will we use?

5. Punch example page 1

6. Punch example page 2

7. Looking at Punch, what do we need to mark up?

  • issue information and page number for reference purposes
  • "chunks" or divisions of text, which may contain a picture, a poem, some prose, some drama, or a combination
  • within the chunks, we can identify formal units such as
    • a picture, a caption
    • stanzas, lines
    • paragraphs
    • speeches and stage-directions
  • and more...

8. TEI tags for the high level structure

We will treat each issue as a single <text> element, and each identifiable chunk within it as a <div> element of a particular type (e.g. cartoon, verse, prose).

For example, page 1 has two divisions:
<pb n="1"/>
<div type="cartoon">
 <p>....</p>
</div>
<div type="poem">
 <head>Progress</head>
 <lg>
  <l>....</l>
 </lg>
</div>
Page 2 also has two, of different types:
<pb n="2"/>
<div type="prose">
 <head>The enchanted castle</head>
 <p>....</p>
</div>
<div type="snippet">
 <head>Correspondence</head>
 <p>....</p>
</div>

9. Why divisions rather than pages?

Because a division can start on one page (page 5, for example) and finish on another (page 6).

We use an empty element <pb> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="cartoon">
 <p>...</p>
</div>
<div type="review">
 <head>Egypt in Venice</head>
 <p>...</p>
 <pb n="6"/>
 <p>...</p>
</div>
<div type="cartoon">
 <p>...</p>
</div>
<div type="poem">
 <head>Enigma</head>
 <lg>
  <l>...</l>
 </lg>
</div>
<div type="snippets">
 <p>...</p>
</div>

10. Divisions can contain divisions...

<div type="snippets">
 <div type="snippet">
  <p>Curiously....Chancellor</p>
 </div>
 <div type="snippet">
  <p>Men for the Antarctic... Canadians</p>
 </div>
</div>
  • TEI also provides division elements with names that indicate their degree of nesting (<div1>, <div2> etc.) which some people prefer
  • Divisions must always tessellate: once "down" a level, you cannot pop "up" again within the same division.

11. Floating text

As mentioned above, <div>s must tessellate over the entire text:
<div1>
 <p> ... </p>
 <div2>
  <p> ... </p>
 </div2>
 <div2>
  <p> ... </p>
 </div2>
</div1>
is valid BUT
<div1>
 <p> ... </p>
 <div2>
  <p> ... </p>
 </div2>
 <p> ... </p>
</div1>
is not valid.

A special <floatingText> element is available for "interruptions".
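
As a hedged sketch, the invalid example might be rescued by wrapping the interruption in <floatingText> rather than <div2>. This assumes the interrupting material really is a self-contained text embedded in the division (a letter or a poem quoted in full, say) and not just a subsection:
<div1>
 <p> ... </p>
 <floatingText>
  <body>
<!-- the embedded text has its own body -->
   <p> ... </p>
  </body>
 </floatingText>
 <p> ... </p>
</div1>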

12. What are divisions made of?

(apart from other smaller divisions)

  • <head> (heading)
  • <p> (paragraph)
  • <sp> (speech, contains any of the foregoing, also <stage> and <speaker>)
  • <list> (contains <head>, <label>, <item>)
  • <table> (contains <row> containing <cell>) ...
  • <l> (verse line) optionally grouped into <lg> (line group) stanzas
  • <figure> (contains <graphic>, <figDesc>, <head>...)
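
Lists are the one item here that the later slides do not illustrate. A minimal sketch, reusing some of the glosses above as its content; type="gloss" is one common way of marking a list of label and item pairs, and the heading text is made up for the example:
<list type="gloss">
 <head>Some components of a division</head>
 <label>head</label>
 <item>heading</item>
 <label>p</label>
 <item>paragraph</item>
 <label>l</label>
 <item>verse line</item>
</list>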

13. For example....

Page 3 contains a figure and a dialogue...
<div type="cartoon">
 <figure>
  <head>When the ships come home</head>
  <figDesc>A man in Turkish dress lounges on a sofa,
     smoking a cigarette and consulting a book labelled
     "Naval ledger". Another man, in traditional Greek
     costume, stands beside him, also reading a
     notebook.</figDesc>
  <graphic url="Punch/XML/Graphics/003.png"/>
 </figure>
 <sp>
  <speaker>Greece.</speaker>
  <p> Isn't it time we started fighting again?</p>
 </sp>
 <sp>
  <speaker>Turkey.</speaker>
  <p> Yes, I daresay. How soon could you begin?</p>
 </sp>
 <sp>
  <speaker>Greece.</speaker>
  <p> Oh, in a few weeks.</p>
 </sp>
 <sp>
  <speaker>Turkey.</speaker>
  <p> No good for me. Shan't be ready till the autumn.</p>
 </sp>
</div>

14. Punch example page

15. For example...

The militants' tariff (on Page 15) contains headings, paragraphs, and a table...
<div type="prose">
 <head>THE MILITANTS' TARIFF.</head>
 <head rend="right">Etna Lodge, W.</head>
 <p>Mrs. Bangham Smasher, having entered into partnership with the
   Misses Burnham Blazer, as General Agents of Destruction, begs to
   inform the public that the firm will be prepared to execute
   commissions of all kinds, at the shortest notice, on the very moderate
   terms given below : -- </p>
 <table>
  <row role="label">
   <cell/>
   <cell>£</cell>
   <cell>s.</cell>
   <cell>d.</cell>
  </row>
  <row>
   <cell>For breaking windows, per window ...</cell>
   <cell>0</cell>
   <cell>7</cell>
   <cell>6</cell>
  </row>
  <row>
   <cell>For howling, kicking, or biting during service in church,
       per howl, kick, or bite ...</cell>
   <cell>0</cell>
   <cell>10</cell>
   <cell>6</cell>
  </row>
<!-- ... -->
 </table>
</div>

16. Global attributes

Some features (potentially) apply to everything:
  • identity
  • language
  • rendition
TEI provides global attributes for these:
  • xml:id provides a unique identifier for any element;
  • n provides a name or number for any element
  • xml:lang specifies the language of any element, using an ISO standard code
  • rend and rendition provide ways of specifying the visual appearance (rendition) of any element

17. For example...

Egypt in Venice (on Page 5) begins with two headings, one in French....
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
 <head>Egypt in Venice.</head>
 <head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
 <p>Those who know the kind of attractions that the Russian ballet
   offers in so many of its themes ....</p>
</div>
Each stanza of the poem on page 10 has a last line which is significantly indented:
<lg>
 <l>There were eight pretty walkers who went up a hill;</l>
 <l>They were Jessamine, Joseph and Japhet and Jill,</l>
 <l>And Allie and Sally and Tumbledown Bill,</l>
 <l rend="indent">And Farnaby Fullerton Rigby.</l>
</lg>
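The rendition attribute works differently from rend: rather than holding a free-text description, it points to a <rendition> element declared once in the header. A sketch of the same indented line encoded that way, assuming CSS is used for the description; the identifier "indent" and the amount of indentation are illustrative guesses, not measurements from the page:
<!-- in the TEI header, inside <encodingDesc> -->
<tagsDecl>
 <rendition xml:id="indent" scheme="css">margin-left: 4em;</rendition>
</tagsDecl>
<!-- in the text -->
<l rendition="#indent">And Farnaby Fullerton Rigby.</l>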

18. Punch example page 3

19. Macrostructure 1

All the issues of Punch for one year make up a volume. We could regard the volume as a single <text>, and each issue as a <div> within it. Or we could use the <group> element:
<text xml:id="v147">
 <front>
<!-- introductory materials for volume 147 here -->
 </front>
 <group>
  <text xml:id="I1914-07-01">
   <body>
<!-- first issue (1 July) -->
   </body>
  </text>
  <text xml:id="I1914-07-15">
   <body>
<!-- second issue (15 July) -->
   </body>
  </text>
<!-- etc... -->
 </group>
 <back>
<!-- volume index, appendix etc. -->
 </back>
</text>
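For comparison, the first option (one <text> per volume, one <div> per issue) might look roughly like this. The value type="issue" is an assumption of this sketch; the chunks within each issue would then become nested <div> elements:
<text xml:id="v147">
 <front>
<!-- introductory materials for volume 147 here -->
 </front>
 <body>
  <div type="issue" xml:id="I1914-07-01">
<!-- first issue (1 July): cartoons, poems, prose as nested divs -->
  </div>
  <div type="issue" xml:id="I1914-07-15">
<!-- second issue (15 July) -->
  </div>
<!-- etc... -->
 </body>
 <back>
<!-- volume index, appendix etc. -->
 </back>
</text>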

20. Macrostructure 2

As well as the texts, we have detailed metadata about each volume, and images of its pages. These are the three parts of a canonical TEI document:
<TEI>
 <teiHeader>
<!-- required; provides metadata -->
 </teiHeader>
 <facsimile>
<!-- the text, represented in image form -->
 </facsimile>
 <text>
<!-- the text, transcribed and marked up -->
 </text>
</TEI>
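A sketch of how the image part and the transcription might be tied together, assuming one <surface> per page and the facs attribute on <pb>. The identifiers and file names are hypothetical, modelled on the graphic path used in the page 3 example:
<facsimile>
 <surface xml:id="page005">
  <graphic url="Punch/XML/Graphics/005.png"/>
 </surface>
 <surface xml:id="page006">
  <graphic url="Punch/XML/Graphics/006.png"/>
 </surface>
</facsimile>
<!-- in the transcription, each page break points at its image -->
<pb n="5" facs="#page005"/>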

21. Macrostructure 3

If many such documents are grouped together to form a corpus (rather than a collection), it may be useful to factor out the metadata they have in common:
<teiCorpus>
 <teiHeader>
<!-- shared metadata -->
 </teiHeader>
 <TEI>
  <teiHeader>
<!-- specific metadata -->
  </teiHeader>
  <text>
<!-- ... -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
<!-- specific metadata -->
  </teiHeader>
  <text>
<!-- ... -->
  </text>
 </TEI>
</teiCorpus>

22. What kinds of metadata?

For IPP and for any other comparable project, we will need a place for such information as
  • identification of the resource itself ("what is this thing?")
  • statements of responsibility ("who did what when?")
  • indication of source ("what was this derived from?")
  • publication statement ("how is this item distributed and by whom?")
  • declaration of encoding practice ("what do the codes we added mean?")

The TEI Header supports all these, and more.

23. Below the paragraph...

Within the elements already introduced, TEI offers plenty of scope for mark-up of smaller components. For example:
  • boundaries, such as page, column, or line breaks
  • highlighting, emphasis and quotation
  • editorial changes such as correction, normalization etc.
  • names, numbers, dates, addresses...
  • links and cross-references
  • notes, annotation, indexing
  • graphics
  • bibliographic citations
  • words and other analyses

24. Highlighting

By highlighting we mean any combination of typographic features (font, size, hue, etc.) which distinguishes the highlighted text from its surroundings. This may be for many reasons...
  • to mark foreign, archaic, technical usages
  • for emphasis when spoken
  • to show something is not part of the text (e.g. cross references, titles, headings)
  • or is attributed to some other agency inside or outside the text (e.g. direct speech, quotation)

TEI provides both a generic <hi> tag and a large number of specific ones...

25. A few highlighting examples

  • <hi> (highlighted: reason unknown or unimportant)
    <p>[The rest of this communication is omitted owing to considerations of space.—<hi rend="sc">Ed</hi>.]</p>
  • <emph> (emphasized)
    <said>'E won't bite yer <emph>if you buy 'im</emph> guv'ner.</said>
  • <title> and <foreign>:
    <p>
     <foreign xml:lang="fr">À propos</foreign> of Oxford, it is a question
     whether that extremely amusing book <title>Verdant Green</title> is
     still much read by freshers.
    </p>
  • <distinct> (linguistically marked)
    But then I remind myself that the Russian
    ballet is nothing if not <distinct>bizarre</distinct>

26. Quotation

Quotation marks can similarly be used to set off text for many reasons:
  • <q> (used if the reason is unknown or unimportant)
  • <said> (speech or thought)
  • <quote> (attributed to an external source)
  • <mentioned> and <soCalled> (nuances of narrative status)
<p>
 <said who="#Celia">I know a lovely tin of potted grouse,</said> said Celia,
and she went off to cut some sandwiches.
</p>
<head>How to utilise the art of <soCalled>suggestion</soCalled>
</head>
<head>The Doctor, six down at the turn, <soCalled>suggests</soCalled> to his
opponent that they are playing croquet, and wins by two and one.</head>
Note that these elements can nest within one another:
<p>The poet returned to his work. <said>
  <quote>In tooth and claw,</quote>
 </said> he muttered to himself, <said>
  <quote>In tooth and claw.</quote>
 </said>
</p>

27. Editorial intervention

As a simple example, consider: ‘Excuse me sir, but would you like to buy a nice little dawg?’ on page 6.

We can:
  • use <orig> to show that "dawg" is what it says, even though this is a nonstandard spelling
  • use <reg> to show that "dog" is an editorially-supplied regularisation of what it says
  • or provide both within a <choice> element to say either is a valid encoding:
...a nice little <choice>
 <orig>dawg</orig>
 <reg>dog</reg>
</choice>?
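The same <choice> pattern serves for corrections, pairing <sic> (the apparent error as printed) with <corr> (the editor's correction). The misspelling below is invented purely for illustration and does not come from Punch:
...pleased to <choice>
 <sic>recieve</sic>
 <corr>receive</corr>
</choice> your letter...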

28. Names of persons, places, things...

  • <name> (a name in the text, contains a proper noun or noun phrase)
  • <rs> (a general-purpose name or referencing string)
  • <title> (any form of title)

The type attribute is useful for categorizing these.

<p>The scene opens at a party given by <name type="person">Potiphar</name> in <name type="place">Venice</name>. </p>

Both <name> and <rs> also have key, ref, and nymRef attributes for linking to reference information.
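
A minimal sketch of the ref attribute in use, assuming the project keeps a list of places somewhere in the document (in the header or in back matter, for instance). The identifier "Venice" and the location of the list are assumptions of this example:
<p>The scene opens at a party given by <name type="person">Potiphar</name>
 in <name type="place" ref="#Venice">Venice</name>. </p>
<!-- elsewhere in the same document -->
<listPlace>
 <place xml:id="Venice">
  <placeName>Venice</placeName>
 </place>
</listPlace>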

29. Dates

  • <date> contains a date and time in any format
  • For processing it is convenient to add a normalized version, using the when attribute
  • Uncertain dates and times, and ranges, can be indicated by other attributes: notBefore, notAfter, from, and to
<p>House of Commons, <date when="1914-06-22"> Monday, June 22, 1914</date>.</p>
<p>
 <date notAfter="1914-06-01" notBefore="1914-03-01">Sunday,
   a month ago,</date> was hot.
</p>
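A range with known end points can be sketched with from and to. The sentence is adapted from the description of Punch earlier in these slides and is not itself a quotation from the journal:
<p>Punch was published regularly <date from="1841" to="1992">between 1841 and 1992</date>.</p>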

30. Cross references

  • A cross reference is a link from one point in a text (the source) to another (the target).
  • TEI provides generic elements <ptr> and <ref> for this purpose. If the linking text can be automatically generated use <ptr>; otherwise use <ref>.
  • The source is the location of the <ptr> or <ref>; the target is specified by the target attribute, in the form of a URI reference.
See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
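For either link to resolve, some element in the same document must carry the matching identifier. A sketch of what the target might look like, assuming it is a division; the type value is illustrative:
<div type="prose" xml:id="Section12">
<!-- "#Section12" in the target attribute points here -->
 <head>...</head>
 <p>...</p>
</div>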

31. Bibliographic Citations

TEI provides special elements for bibliographic citations or references:
  • <bibl> (loosely structured)
  • <biblStruct> (standard bibliographic structure)
  • <listBibl> (encloses a bibliography)

These are typically used in preparing bibliographies, or in footnotes. But even in Punch, there are examples.
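
A rough sketch contrasting the two styles inside a <listBibl>. The titles are taken from elsewhere in these slides; the decision to describe a Punch issue with <biblStruct>, and the exact choice of child elements, are assumptions of the example:
<listBibl>
<!-- loosely structured: whatever parts are known, in any order -->
 <bibl>
  <title>Verdant Green</title>
 </bibl>
<!-- fixed structure: <monogr> with an <imprint> is required -->
 <biblStruct>
  <monogr>
   <title level="j">Punch, or the London Charivari</title>
   <imprint>
    <biblScope>Vol. 147</biblScope>
    <date when="1914-07-01">July 1, 1914</date>
   </imprint>
  </monogr>
 </biblStruct>
</listBibl>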

32. Simple <bibl> Example

In Punch, bibliographic citations are usually associated with a quotation from another paper.

The <cit> element groups the two:
<cit>
 <quote>It was the time when Henry III. was batting with Simon de Montfort
   and his Barons.</quote>
 <bibl>
  <title>Straits Times.</title>
 </bibl>
</cit>

33. Embedded notes

Notes, whether appearing in the original source, or added by an editor, can be marked using the <note> element.

We might use this to add biographical details to the Punch transcriptions:
<p>By-the-by, it is denied that Sir <name rend="sc">Joseph Beecham</name>
 <note>Sir Joseph Beecham, 1st Baronet (8 June 1848 - 23 October 1916)...</note> was
in any way responsible for the Government's "Pills for Earthquakes,"
by which it was hoped to avert the Irish crisis.</p>

<note> has attributes place and resp to indicate where it is located on the page, and who is responsible for adding it.
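
A sketch of both attributes on an editorial note. The value "bottom" and the pointer "#ed" are illustrative assumptions; the pointer is presumed to lead to an element elsewhere (in the header, for example) that identifies the editor:
<name rend="sc">Joseph Beecham</name>
<note place="bottom" resp="#ed">Sir Joseph Beecham, 1st Baronet (8 June 1848 - 23 October 1916)...</note>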

34. Linked notes

Since we have several references to the same person, it might be better to put the notes elsewhere and point to them from the names:
<div type="notes">
 <note xml:id="BEECHJO">Sir Joseph Beecham, 1st Baronet (8 June 1848 -
   23 October 1916) the eldest son of Thomas Beecham (1820-1907) played a
   large part in the growth and expansion of his father's medicinal pill
   business which he joined in 1866....</note>
<!-- other notes -->
</div>
<div type="snippets">
 <p>... Both Earl <name rend="sc">Beauchamp</name> and <name>Sir <ref target="#BEECHJO">Joseph Beecham</ref>
  </name> appear in the recent
   Honours List.</p>
 <p>By-the-by, it is denied that Sir <name rend="sc" ref="#BEECHJO">Joseph Beecham</name> was in any way responsible...</p>
</div>

There is also a specialised <person> element we can use for this.
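
A hedged sketch of the same information held in a <person> element, which needs the namesdates module. The dates are those given in the note; reusing the identifier BEECHJO assumes this entry replaces the <note> rather than sitting alongside it, since an identifier must be unique within a document:
<listPerson>
 <person xml:id="BEECHJO">
  <persName>Sir Joseph Beecham</persName>
  <birth when="1848-06-08">8 June 1848</birth>
  <death when="1916-10-23">23 October 1916</death>
  <note>Eldest son of Thomas Beecham (1820-1907)...</note>
 </person>
</listPerson>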

35. TEI Header Structure

The TEI header has four main components:
  • <fileDesc> (file description) contains a full bibliographic description of an electronic file.
  • <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
  • <revisionDesc> (revision description) summarizes the revision history for a file.
  • <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting (just about everything not covered in the other header elements).

Only <fileDesc> is required; the others are optional.
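
A skeleton showing where the optional components sit relative to <fileDesc>. The wording of the normalization statement and the choice of languages are assumptions based on the examples in these slides, not an actual IPP policy:
<teiHeader>
 <fileDesc>
<!-- required: title, publication, and source statements -->
 </fileDesc>
 <encodingDesc>
  <editorialDecl>
   <normalization>
    <p>Nonstandard spellings are kept in orig, with a regularised
       form supplied in reg.</p>
   </normalization>
  </editorialDecl>
 </encodingDesc>
 <profileDesc>
  <langUsage>
   <language ident="en">English</language>
   <language ident="fr">French</language>
  </langUsage>
 </profileDesc>
 <revisionDesc>
<!-- change log, most recent change first -->
 </revisionDesc>
</teiHeader>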

36. Simple TEI Header for IPP

<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Punch, or the London Charivari, Vol. 147, July 1, 1914</title>
  </titleStmt>
  <publicationStmt>
   <idno type="gutenberg">24357</idno>
   <availability>
    <p>This text is freely available for re-use under US and UK law,
         consult your local legal restrictions if elsewhere.</p>
   </availability>
  </publicationStmt>
  <sourceDesc>
   <p>This text is a TEI version of a Project Gutenberg text originally
       located at <ptr target="http://www.gutenberg.org/dirs/2/4/3/5/24357/"/>.
       As per their license agreement we have removed all references to the PG
       trademark.</p>
  </sourceDesc>
 </fileDesc>
 <revisionDesc>
  <change when="2008-07-26T23:49:55.968+01:00"/>
 </revisionDesc>
</teiHeader>


Lou Burnard and other TEI@Oxford authors. Date: February 2009
Copyright University of Oxford