
1. An Introduction to the TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts chiefly in the humanities, social sciences and linguistics.

1.1. Why the TEI?

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active user community, run in the style of an open-source project

1.2. Relevance

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

1.3. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource.

1.4. Conformance issues

A document is TEI Conformant if and only if it:
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
or if it can be transformed automatically, using TEI-defined procedures, into such a document (in which case it is considered TEI-conformable).
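
Two of the criteria above can be approximated with a few lines of code. The following is a minimal sketch, not a conformance checker: it tests only well-formedness and whether the root element is in the TEI namespace. The function name basic_tei_checks is invented for this illustration; validation against a TEI schema and against the Abstract Model needs a proper validator and is not attempted here.

```python
import xml.etree.ElementTree as ET

# The namespace URI used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

def basic_tei_checks(xml_text):
    """Return (is_well_formed, root_is_in_tei_namespace) for an XML string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return (False, False)
    # ElementTree writes namespaced names as "{namespace-uri}localname".
    tei_roots = {"{%s}TEI" % TEI_NS, "{%s}teiCorpus" % TEI_NS}
    return (True, root.tag in tei_roots)

good = ('<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/>'
        '<text><body><p>Hello</p></body></text></TEI>')
no_namespace = "<TEI><teiHeader/><text><body><p>Hello</p></body></text></TEI>"
broken = "<TEI><teiHeader></TEI>"

print(basic_tei_checks(good))          # well-formed, TEI namespace on the root
print(basic_tei_checks(no_namespace))  # well-formed, but no namespace
print(basic_tei_checks(broken))        # not well-formed
```

Passing both checks is necessary but far from sufficient for conformance.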

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’.

2. Default Text Structure

All TEI documents share the same basic structure. This section describes its main variations as briefly as possible.

2.1. Structure of a TEI Document

There are two basic structures of a TEI Document:
  • <TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a <teiCorpus> element.
  • <teiCorpus> contains the whole of a TEI-encoded corpus, comprising a single corpus header and one or more <TEI> elements, each containing a single text header and a text.

2.2. TEI basic structures: <TEI> content

<TEI>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <facsimile>
<!-- optional, new in TEI P5 -->
 </facsimile>
 <text>
<!-- required if no facsimile -->
 </text>
</TEI>
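
The rule in the comments above (a header is required, and a text is required unless a facsimile is present) can be sketched as a small check. This is an illustration only: check_tei_children is a made-up helper, and the sample documents omit the TEI namespace, as the slides do.

```python
import xml.etree.ElementTree as ET

def check_tei_children(xml_text):
    """List structural problems among the direct children of a <TEI> element."""
    root = ET.fromstring(xml_text)
    names = [child.tag for child in root]
    problems = []
    if "teiHeader" not in names:
        problems.append("missing teiHeader")
    # A <text> is required only when there is no <facsimile>.
    if "text" not in names and "facsimile" not in names:
        problems.append("needs text or facsimile")
    return problems

print(check_tei_children("<TEI><teiHeader/><text/></TEI>"))
print(check_tei_children("<TEI><teiHeader/><facsimile/></TEI>"))
print(check_tei_children("<TEI><teiHeader/></TEI>"))
```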

2.3. TEI basic structures: <teiCorpus>

<teiCorpus>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <TEI>
<!-- required -->
 </TEI>
</teiCorpus>

2.4. <text>

What is a text?
  • A text may be unitary or composite
    • unitary: forming an organic whole
    • composite: consisting of several components which are in some important sense independent of each other
  • a unitary text contains
    • optional front matter
    • <body> (required)
    • optional back matter

2.5. Composite texts

A composite text contains
  • optional front matter
  • <group> (required)
  • optional back matter

A corpus is a collection of text and header pairs. It has its own header.

<group> elements may self-nest.

2.6. TEI text structure (1)

<text>
 <front>
<!-- optional -->
 </front>
 <body>
<!-- required -->
 </body>
 <back>
<!-- optional -->
 </back>
</text>

2.7. TEI text structure (2)

<text>
 <front>
<!-- ... -->
 </front>
 <group>
  <text>
   <body>
    <p>...</p>
   </body>
  </text>
 </group>
 <back>
<!-- ... -->
 </back>
</text>

2.8. Another Grouped Text Example

<TEI>
 <teiHeader>
<!-- header information for the whole collection -->
 </teiHeader>
 <text>
<!-- optional front matter -->
  <group>
   <text>
<!-- optional front matter -->
    <body>
<!-- First Body -->
    </body>
   </text>
   <text>
<!-- optional front matter -->
    <body>
<!-- Second Body -->
    </body>
   </text>
  </group>
 </text>
</TEI>
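
The difference between unitary and composite texts is easy to test for in software. The sketch below assumes un-namespaced element names as in the slides; classify_text and count_bodies are hypothetical helpers, written to follow <group> elements that nest inside one another.

```python
import xml.etree.ElementTree as ET

def classify_text(text_el):
    """Label a <text> as 'composite' (has a <group>) or 'unitary' (has a <body>)."""
    if text_el.find("group") is not None:
        return "composite"
    if text_el.find("body") is not None:
        return "unitary"
    return "unknown"

def count_bodies(el):
    """Count the unitary texts at or below a <text> or <group>."""
    if el.tag == "text" and classify_text(el) == "unitary":
        return 1
    total = 0
    for child in el:
        if child.tag in ("text", "group"):
            total += count_bodies(child)
    return total

collection = ET.fromstring(
    "<TEI><teiHeader/><text><group>"
    "<text><body><p>First</p></body></text>"
    "<text><body><p>Second</p></body></text>"
    "</group></text></TEI>")
top = collection.find("text")

# A group that contains another group, to exercise the self-nesting case.
nested = ET.fromstring(
    "<text><group><text><body/></text>"
    "<group><text><body/></text><text><body/></text></group>"
    "</group></text>")

print(classify_text(top), count_bodies(top))
print(classify_text(nested), count_bodies(nested))
```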

2.9. The Imaginary Punch Project (IPP)

  • In some of my talks I'm going to use the journal Punch as an example. It is a famous English humorous journal, published regularly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
  • The IPP plans to make available fully marked up texts of the journal, in conjunction with page images...
    • for social historians
    • for librarians
    • for linguists
  • How will the TEI help? And which parts of the TEI will we use?

2.10. Punch example page 1

2.11. Punch example page 2

2.12. Punch example page 3

2.13. Looking at Punch, what do we need to mark up?

  • issue information and page number for reference purposes
  • "chunks" or divisions of text, which may contain a picture, a poem, some prose, some drama, or a combination
  • within the chunks, we can identify formal units such as
    • a picture, a caption
    • stanzas, lines
    • paragraphs
    • speeches and stage-directions
  • and more...

2.14. Macrostructure

All the issues of Punch for one year make up a volume. We could regard the volume as a single <text>, and each issue as a <div> within it. Or we could use the <group> element:
<text xml:id="v147">
 <front>
<!-- introductory materials for volume 147 here -->
 </front>
 <group>
  <text xml:id="I1914-07-01">
   <body>
<!-- first issue (1 July) -->
   </body>
  </text>
  <text xml:id="I1914-07-15">
   <body>
<!-- second issue (15 July) -->
   </body>
  </text>
<!-- etc... -->
 </group>
 <back>
<!-- volume index, appendix etc. -->
 </back>
</text>
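
One practical benefit of giving each issue an xml:id is that software can list and address the issues directly. A minimal sketch, assuming the un-namespaced markup shown above (issue_ids is an invented name):

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace, so ElementTree
# reports xml:id under the following attribute name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def issue_ids(volume):
    """Return the xml:id of each <text> directly inside the volume's <group>."""
    group = volume.find("group")
    if group is None:
        return []
    return [text.get(XML_ID) for text in group.findall("text")]

volume = ET.fromstring("""
<text xml:id="v147">
 <front/>
 <group>
  <text xml:id="I1914-07-01"><body/></text>
  <text xml:id="I1914-07-15"><body/></text>
 </group>
 <back/>
</text>
""")

print(volume.get(XML_ID), issue_ids(volume))
```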

2.15. TEI tags for the high level structure

We will treat each issue as a single <text> element, and each identifiable chunk within it as a <div> element of a particular type (e.g. cartoon, verse, prose).

For example, page 1 has two divisions:
<pb n="1"/>
<div type="cartoon">
 <p>....</p>
</div>
<div type="poem">
 <head>Progress</head>
 <lg>
  <l>....</l>
 </lg>
</div>

2.16. More high level structure

Page 2 also has two, of different types:
<pb n="2"/>
<div type="prose">
 <head>The enchanted castle</head>
 <p>....</p>
</div>
<div type="snippet">
 <head>Correspondence</head>
 <p>....</p>
</div>

2.17. Why divisions rather than pages?

Because a division can start on one page (page 5, for example) and finish on another (page 6).

We use an empty element <pb> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="cartoon">
 <p>...</p>
</div>
<div type="review">
 <head>Egypt in Venice</head>
 <p>...</p>
 <pb n="6"/>
 <p>...</p>
</div>
<div type="cartoon">
 <p>...</p>
</div>
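
Because <pb> is a milestone rather than a container, the pages a division occupies have to be worked out by reading the markup in document order. The following sketch does this for top-level divisions only; the wrapping <body> and the function name pages_per_division are assumptions made for the example, and a <pb> placed as the very first thing inside a division is not treated specially.

```python
import xml.etree.ElementTree as ET

def pages_per_division(body):
    """Pair each top-level <div> with the page numbers it appears on."""
    current = None  # number of the most recent <pb> seen so far
    result = []
    for child in body:
        if child.tag == "pb":
            current = child.get("n")
        elif child.tag == "div":
            pages = [current] if current is not None else []
            # A page break inside the division means it runs on to a new page.
            for pb in child.iter("pb"):
                current = pb.get("n")
                pages.append(current)
            result.append((child.get("type"), pages))
    return result

body = ET.fromstring("""
<body>
 <pb n="5"/>
 <div type="cartoon"><p>...</p></div>
 <div type="review">
  <head>Egypt in Venice</head>
  <p>...</p>
  <pb n="6"/>
  <p>...</p>
 </div>
 <div type="cartoon"><p>...</p></div>
</body>
""")

for div_type, pages in pages_per_division(body):
    print(div_type, pages)
```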

2.18. Divisions can contain divisions...

<div type="snippets">
 <div type="snippet">
  <p>Curiously....Chancellor</p>
 </div>
 <div type="snippet">
  <p>Men for the Antarctic... Canadians</p>
 </div>
</div>
  • TEI also provides division elements with names that indicate their degree of nesting (<div1>, <div2> etc.) which some people prefer
  • Divisions must always tessellate: once "down" a level, you cannot pop "up" again within the same division.

2.19. Divisions may have heads and trailers

<div>
 <head>Chapter 1</head>
 <p>
<!-- content of the div -->
 </p>
 <trailer>...</trailer>
</div>

2.20. Floating Text (1)

<div>s must tessellate over the entire text:
<div1>
 <div2>
<!-- content -->
 </div2>
 <div2>
<!-- content -->
 </div2>
</div1>
is valid, while
<div1>
<!-- content -->
 <div2>
<!-- content -->
 </div2>
<!-- content -->
</div1>
is not valid.
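
A schema validator is the proper tool for enforcing this rule, but the idea can be sketched directly. The check below is a deliberate simplification: it flags only element content that follows a nested division (the "popping up" case), and the set of elements allowed after a nested division is an assumption chosen for the illustration, not the real TEI content model. The <p> elements stand in for the "content" comments above.

```python
import re
import xml.etree.ElementTree as ET

# Assumption for this sketch: only these may follow a nested division.
ALLOWED_AFTER_SUBDIV = {"pb", "trailer"}

def is_division(el):
    """True for <div> and the numbered forms <div1>, <div2>, ..."""
    return re.fullmatch(r"div\d*", el.tag) is not None

def tessellates(div):
    """False if other content follows a nested division, at any depth."""
    seen_subdivision = False
    for child in div:
        if is_division(child):
            seen_subdivision = True
            if not tessellates(child):
                return False
        elif seen_subdivision and child.tag not in ALLOWED_AFTER_SUBDIV:
            return False
    return True

valid = ET.fromstring(
    "<div1><div2><p>a</p></div2><div2><p>b</p></div2></div1>")
invalid = ET.fromstring(
    "<div1><p>a</p><div2><p>b</p></div2><p>c</p></div1>")

print(tessellates(valid), tessellates(invalid))
```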

2.21. Floating Text Example

<p>She was thus ruminating, when a Gentleman enter'd the Room, the Door being a jar... calling for a Candle, she beg'd a thousand Pardons, engaged him to sit down, and let her know, what had so long conceal'd him from her Correspondence. </p>
<pb n="5"/>
<floatingText>
 <body>
  <head>The Story of <hi>Captain Manly</hi>
  </head>
  <p>
<!-- Captain Manly's story here -->
  </p>
 </body>
</floatingText>
<pb n="37"/>
<p>The Gentleman having finish'd his Story ...
<!-- more -->
</p>

2.22. Virtual divisions

Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose <divGen> element:
<front>
 <divGen type="toc"/>
 <div>
  <head>Preface</head>
  <p>...</p>
 </div>
</front>
The <divGen> element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing material.
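
What a processor does when it meets <divGen type="toc"/> is up to the application. One plausible approach, sketched here under the assumption that each division's first <head> is its title, is to gather the headings in document order (table_of_contents is an invented name, and the sample omits the TEI namespace):

```python
import xml.etree.ElementTree as ET

def table_of_contents(text_el):
    """Collect the first <head> of every <div>, in document order."""
    entries = []
    for div in text_el.iter("div"):
        head = div.find("head")
        if head is not None:
            # Join any nested text and normalise whitespace.
            entries.append(" ".join("".join(head.itertext()).split()))
    return entries

sample = ET.fromstring("""
<text>
 <front>
  <divGen type="toc"/>
  <div><head>Preface</head><p>...</p></div>
 </front>
 <body>
  <div><head>Chapter 1</head><p>...</p></div>
  <div><head>Chapter 2</head><p>...</p></div>
 </body>
</text>
""")

print(table_of_contents(sample))
```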

2.23. Back to Punch

Page 3 contains a figure and a dialogue...
<div type="cartoon">
 <figure>
  <head>When the ships come home</head>
  <figDesc>A man in Turkish dress lounges on a sofa,
     smoking a cigarette and consulting a book labelled
     "Naval ledger". Another man, in traditional Greek
     costume, stands beside him, also reading a
     notebook.</figDesc>
  <graphic url="materials/Punch/Pages/XML/Graphics/003.png"/>
 </figure>
 <sp>
  <speaker>Greece.</speaker>
  <p> Isn't it time we started fighting again?</p>
 </sp>
 <sp>
  <speaker>Turkey.</speaker>
  <p> Yes, I daresay. How soon could you begin?</p>
 </sp>
 <sp>
  <speaker>Greece.</speaker>
  <p> Oh, in a few weeks.</p>
 </sp>
 <sp>
  <speaker>Turkey.</speaker>
  <p> No good for me. Shan't be ready till the autumn.</p>
 </sp>
</div>
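
Marking up each speech with <sp> and <speaker> makes the dialogue easy to extract. A minimal sketch, using a shortened, un-namespaced copy of the markup above; the helper name dialogue and the decision to strip the full stop after the speaker's name are choices made for this example.

```python
import xml.etree.ElementTree as ET

def dialogue(div):
    """Return (speaker, speech) pairs for every <sp> in the division."""
    lines = []
    for sp in div.iter("sp"):
        speaker = sp.findtext("speaker", default="").strip().rstrip(".")
        paragraphs = ["".join(p.itertext()) for p in sp.findall("p")]
        speech = " ".join(" ".join(paragraphs).split())
        lines.append((speaker, speech))
    return lines

cartoon = ET.fromstring("""
<div type="cartoon">
 <sp>
  <speaker>Greece.</speaker>
  <p> Isn't it time we started fighting again?</p>
 </sp>
 <sp>
  <speaker>Turkey.</speaker>
  <p> Yes, I daresay. How soon could you begin?</p>
 </sp>
</div>
""")

for speaker, speech in dialogue(cartoon):
    print(speaker + ": " + speech)
```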

2.24. Punch example page 3

2.25. For example...

The militants' tariff (on Page 15) contains headings, paragraphs, and a table...
<div type="prose">
 <head>THE MILITANTS' TARIFF.</head>
 <head rend="right">Etna Lodge, W.</head>
 <p>Mrs. Bangham Smasher, having entered into partnership with the
   Misses Burnham Blazer, as General Agents of Destruction, begs to
   inform the public that the firm will be prepared to execute
   commissions of all kinds, at the shortest notice, on the very moderate
   terms given below : -- </p>
 <table>
  <row role="label">
   <cell/>
   <cell>£</cell>
   <cell>s.</cell>
   <cell>d.</cell>
  </row>
  <row>
   <cell>For breaking windows, per window ...</cell>
   <cell>0</cell>
   <cell>7</cell>
   <cell>6</cell>
  </row>
  <row>
   <cell>For howling, kicking, or biting during service in church,
       per howl, kick, or bite ...</cell>
   <cell>0</cell>
   <cell>10</cell>
   <cell>6</cell>
  </row>
<!-- ... -->
 </table>
</div>
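
Once the tariff is encoded as rows and cells, the prices can be computed with. The sketch below converts each price to pence, using the pre-decimal rates of 12 pence to the shilling and 20 shillings to the pound. It assumes every data row has exactly four cells (description, pounds, shillings, pence), as in the two rows shown; tariff_in_pence is an invented name.

```python
import xml.etree.ElementTree as ET

def tariff_in_pence(table):
    """Map each item description to its price in pre-decimal pence."""
    prices = {}
    for row in table.findall("row"):
        if row.get("role") == "label":
            continue  # skip the column headings
        cells = ["".join(c.itertext()) for c in row.findall("cell")]
        label, pounds, shillings, pence = cells
        total = int(pounds) * 240 + int(shillings) * 12 + int(pence)
        prices[" ".join(label.split())] = total
    return prices

table = ET.fromstring("""
<table>
 <row role="label">
  <cell/><cell>£</cell><cell>s.</cell><cell>d.</cell>
 </row>
 <row>
  <cell>For breaking windows, per window ...</cell>
  <cell>0</cell><cell>7</cell><cell>6</cell>
 </row>
 <row>
  <cell>For howling, kicking, or biting during service in church,
      per howl, kick, or bite ...</cell>
  <cell>0</cell><cell>10</cell><cell>6</cell>
 </row>
</table>
""")

prices = tariff_in_pence(table)
for item, pence in prices.items():
    print(pence, item)
```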

2.26. Punch example page 15

2.27. Special Features...

Egypt in Venice (on Page 5) begins with two headings, one in French....
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
 <head>Egypt in Venice.</head>
 <head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
 <p>Those who know the kind of attractions that the Russian ballet
   offers in so many of its themes ....</p>
</div>
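
An xml:lang value applies to an element and everything inside it, unless a descendant overrides it. That inheritance can be sketched with a recursive walk that passes the current language down the tree (effective_languages is a hypothetical helper; the sample repeats the division with un-namespaced element names):

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)

div = ET.fromstring("""
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
 <head>Egypt in Venice.</head>
 <head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
 <p>Those who know the kind of attractions ....</p>
</div>
""")

for tag, lang in effective_languages(div):
    print(tag, lang)
```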
Each stanza of the poem on page 10 has a last line which is significantly indented:
<lg>
 <l>There were eight pretty walkers who went up a hill;</l>
 <l>They were Jessamine, Joseph and Japhet and Jill,</l>
 <l>And Allie and Sally and Tumbledown Bill,</l>
 <l rend="indent">And Farnaby Fullerton Rigby.</l>
</lg>
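
The rend attribute records how the source looks; what to do with it is left to the processing software. As one possible treatment, the sketch below turns a stanza into plain text and indents any line marked rend="indent". The four-space indent and the name render_stanza are arbitrary choices for the illustration.

```python
import xml.etree.ElementTree as ET

def render_stanza(lg, indent="    "):
    """Render an <lg> as plain text, indenting lines with rend="indent"."""
    out = []
    for line in lg.findall("l"):
        text = " ".join("".join(line.itertext()).split())
        prefix = indent if line.get("rend") == "indent" else ""
        out.append(prefix + text)
    return "\n".join(out)

stanza = ET.fromstring("""
<lg>
 <l>There were eight pretty walkers who went up a hill;</l>
 <l>They were Jessamine, Joseph and Japhet and Jill,</l>
 <l>And Allie and Sally and Tumbledown Bill,</l>
 <l rend="indent">And Farnaby Fullerton Rigby.</l>
</lg>
""")

print(render_stanza(stanza))
```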

2.28. Punch example page 5

2.29. Punch example page 10



James Cummings. Date: July 2009
Copyright University of Oxford