1. An Introduction to the TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts chiefly in the humanities, social sciences and linguistics.

1.1. Why the TEI?

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active user community, run along open-source lines

1.2. Relevance

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

1.3. The virtuous circle of encoding

1.4. The scope of intelligent markup

Even within the original scope of the TEI we have
  • basic structural and functional components
  • diplomatic transcription, images, annotation
  • links, correspondence, alignment
  • data-like objects such as dates, times, places, persons, events (named entity recognition)
  • meta-textual annotations (correction, deletion, etc)
  • linguistic analysis at all levels
  • contextual metadata of all kinds
  • ... and so on and so forth

Is it possible to delimit encyclopaedically all possible kinds of markup?

1.5. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

1.6. Conformance issues

A document is TEI Conformant if and only if it:
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
or if it can be transformed automatically using some TEI-defined procedures into such a document (it is then considered TEI-conformable).
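
As a rough illustration of the first and fourth conditions, a very small document in the TEI namespace might look like the sketch below. The title and all text content are placeholders invented for this example, and whether a file like this actually validates depends on the schema generated from your own ODD.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>A placeholder title</title>
   </titleStmt>
   <publicationStmt>
    <p>Unpublished example.</p>
   </publicationStmt>
   <sourceDesc>
    <p>Born digital; there is no source document.</p>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <text>
  <body>
   <p>Some text.</p>
  </body>
 </text>
</TEI>
```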

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’

2. Default Text Structure

All TEI documents are structured in a particular manner. This section attempts to describe the different variations on this as briefly as possible.

2.1. Structure of a TEI Document

There are two basic structures of a TEI Document:
  • <TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
  • <teiCorpus> contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
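
The corpus arrangement described in the second bullet can be sketched as follows. The comments stand in for real content; this is an outline of the nesting only, not a complete or valid document.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
<!-- corpus header: describes the collection as a whole -->
 </teiHeader>
 <TEI>
  <teiHeader>
<!-- header for the first text -->
  </teiHeader>
  <text>
<!-- the first text -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
<!-- header for the second text -->
  </teiHeader>
  <text>
<!-- the second text -->
  </text>
 </TEI>
</teiCorpus>
```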

2.2. TEI basic structures (1)

<TEI>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <text>
<!-- required -->
 </text>
</TEI>

2.3. TEI basic structures (2)

<TEI>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <facsimile>
<!-- optional, new in TEI P5 -->
 </facsimile>
 <text>
<!-- required if no facsimile -->
 </text>
</TEI>

2.4. <text>

What is a text?
  • A text may be unitary or composite
    • unitary: forming an organic whole
    • composite: consisting of several components which are in some important sense independent of each other
  • a unitary text contains
    • optional front matter
    • <body> (required)
    • optional back matter

2.5. Composite texts

A composite text contains
  • optional front matter
  • <group> (required)
  • optional back matter

A corpus is a collection of text and header pairs. It has its own header.

<group> tags may self-nest.
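
A minimal sketch of self-nesting groups, assuming a collection in which the second member is itself a collection of two texts. The comments are placeholders for real content.

```xml
<text>
 <group>
  <text>
   <body>
<!-- an independent text -->
   </body>
  </text>
  <group>
   <text>
    <body>
<!-- first text of the inner collection -->
    </body>
   </text>
   <text>
    <body>
<!-- second text of the inner collection -->
    </body>
   </text>
  </group>
 </group>
</text>
```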

2.6. TEI text structure (1)

<text>
 <front>
<!-- optional -->
 </front>
 <body>
<!-- required -->
 </body>
 <back>
<!-- optional -->
 </back>
</text>

2.7. TEI text structure (2)

<!-- ... -->
<!-- ... -->

2.8. Another Grouped Text Example

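<TEI>
 <teiHeader>
<!-- header information for the whole collection -->
 </teiHeader>
 <text>
  <front>
<!-- optional front matter -->
  </front>
  <group>
   <text>
    <front>
<!-- optional front matter -->
    </front>
    <body>
<!-- First Body -->
    </body>
   </text>
   <text>
    <front>
<!-- optional front matter -->
    </front>
    <body>
<!-- Second Body -->
    </body>
   </text>
  </group>
 </text>
</TEI>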

2.9. The Imaginary Punch Project

  • Punch is a famous English humorous journal, published regularly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
  • The IPP plans to make available fully marked up texts of the journal, in conjunction with page images...
    • for social historians
    • for librarians
    • for linguists
  • How will the TEI help? And which parts of the TEI will we use?
  • Although we won't have time to accomplish this in this workshop, we've provided the Punch text and images in your materials directory.

2.10. Punch example page 1

2.11. Punch example page 2

2.12. Punch example page 3

2.13. Looking at Punch, what do we need to mark up?

  • issue information and page number for reference purposes
  • "chunks" or divisions of text, which may contain a picture, a poem, some prose, some drama, or a combination
  • within the chunks, we can identify formal units such as
    • a picture, a caption
    • stanzas, lines
    • paragraphs
    • speeches and stage-directions
  • and more...

2.14. Macrostructure

All the issues of Punch for one year make up a volume. We could regard the volume as a single <text>, and each issue as a <div> within it. Or we could use the <group> element:
<text xml:id="v147">
 <front>
<!-- introductory materials for volume 147 here -->
 </front>
 <group>
  <text xml:id="I1914-07-01">
<!-- first issue (1 July) -->
  </text>
  <text xml:id="I1914-07-15">
<!-- second issue (15 July) -->
  </text>
<!-- etc... -->
 </group>
 <back>
<!-- volume index, appendix etc. -->
 </back>
</text>

2.15. TEI tags for the high level structure

We will treat each issue as a single <text> element, and each identifiable chunk within it as a <div> element of a particular type (e.g. cartoon, verse, prose)

For example, page 1 has two divisions,
<pb n="1"/>
<div type="cartoon">
<!-- ... -->
</div>
<div type="poem">
<!-- ... -->
</div>

2.16. More high level structure

page 2 also has two, of different types:
<pb n="2"/>
<div type="prose">
 <head>The enchanted castle</head>
<!-- ... -->
</div>
<div type="snippet">
<!-- ... -->
</div>

2.17. Why divisions rather than pages?

Because a division can start on one page (page 5 for example) and finish on another (page 6)

We use an empty element <pb> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="cartoon">
<!-- ... -->
</div>
<div type="review">
 <head>Egypt in Venice</head>
<!-- ... -->
 <pb n="6"/>
<!-- ... -->
</div>
<div type="cartoon">
<!-- ... -->
</div>

2.18. Divisions can contain divisions...

<div type="snippets">
 <div type="snippet">
<!-- ... -->
 </div>
 <div type="snippet">
  <p>Men for the Antarctic... Canadians</p>
 </div>
</div>
  • TEI also provides division elements with names that indicate their degree of nesting (<div1>, <div2> etc.) which some people prefer
  • Divisions must always tessellate: once "down" a level, you cannot pop "up" again within the same division.

2.19. More about divisions

  • generic, hierarchic subdivisions, each incomplete
  • the type attribute is used to label a particular level e.g. as 'part' or 'chapter'
  • the n attribute gives a particular division a name or number
  • the xml:id attribute gives a particular division a unique identifier
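
The three attributes can be combined on one division, as in the sketch below. The attribute values and the heading text are invented for illustration.

```xml
<div type="part" n="1" xml:id="part1">
 <head>Part One</head>
 <div type="chapter" n="1" xml:id="part1-ch1">
  <p>...</p>
 </div>
 <div type="chapter" n="2" xml:id="part1-ch2">
  <p>...</p>
 </div>
</div>
```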

2.20. Divisions may have heads and trailers

<div>
 <head>Chapter 1</head>
<!-- content of the div -->
 <trailer>
<!-- closing text of the div -->
 </trailer>
</div>

2.21. Numbered and unnumbered divisions

The level can be made explicit by using 'numbered' divs (div1, div2). Opinions vary:

<div1> vs. <div n="1">
  • numbered: the number indicates the depth of this particular division within the hierarchy, the largest such division being ‘div1’, any subdivision within it being ‘div2’, etc.
  • unnumbered: nest recursively to indicate their hierarchic depth. (And computers can count very well!)
The two styles must not be combined within a single <front>, <body>, or <back> element.
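
The same two-level hierarchy in each style might be sketched like this. The type values are illustrative only.

```xml
<!-- numbered style: the element name carries the depth -->
<body>
 <div1 type="part">
  <div2 type="chapter">
   <p>...</p>
  </div2>
 </div1>
</body>

<!-- unnumbered style: the depth is implied by the nesting -->
<body>
 <div type="part">
  <div type="chapter">
   <p>...</p>
  </div>
 </div>
</body>
```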

N.B. Divisions always tessellate

2.22. Groups vs Floating Texts

The <group> element should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes.

<floatingText> contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.

2.23. Floating Text (1)

<div>s must tessellate over the entire text:
<div1>
 <div2>
<!-- content -->
 </div2>
 <div2>
<!-- content -->
 </div2>
</div1>
is valid, while
<div1>
 <p><!-- content --></p>
 <div2>
<!-- content -->
 </div2>
 <p><!-- content --></p>
</div1>
is not valid.

2.24. Floating Text (2)

In the second case, div2 is a 'floating' text and its content must be encoded using the <floatingText> element.

The <floatingText> element is a member of the model.divPart class, and can thus appear within any division level element in the same way as a paragraph.

2.25. Floating Text Example

<p>She was thus ruminating, when a Gentleman enter'd the Room, the Door being a jar... calling for a Candle, she beg'd a thousand Pardons, engaged him to sit down, and let her know, what had so long conceal'd him from her Correspondence. </p>
<pb n="5"/>
<floatingText>
 <body>
  <head>The Story of <hi>Captain Manly</hi></head>
<!-- Captain Manly's story here -->
 </body>
</floatingText>
<pb n="37"/>
<p>The Gentleman having finish'd his Story ...
<!-- more -->
</p>

2.26. Virtual divisions

Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose <divGen> element:
 <divGen type="toc"/>
(intended primarily for use in document production or manipulation, rather than in transcription of pre-existing material)
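
For instance, a generated table of contents is often placed in the front matter. The sketch below assumes a single chapter and invented heading text; what a processor actually produces at the <divGen> is up to the stylesheet.

```xml
<text>
 <front>
  <divGen type="toc"/>
 </front>
 <body>
  <div type="chapter" n="1">
   <head>Chapter 1</head>
   <p>...</p>
  </div>
 </body>
</text>
```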

3. Back to Punch

Page 3 contains a figure and a dialogue...
<div type="cartoon">
 <figure>
  <head>When the ships come home</head>
  <figDesc>A man in Turkish dress lounges on a sofa,
     smoking a cigarette and consulting a book labelled
     "Naval ledger". Another man, in traditional Greek
     costume, stands beside him, also reading a ...</figDesc>
  <graphic url="materials/Punch/XML/Graphics/003.png"/>
 </figure>
  <p> Isn't it time we started fighting again?</p>
  <p> Yes, I daresay. How soon could you begin?</p>
  <p> Oh, in a few weeks.</p>
  <p> No good for me. Shan't be ready till the autumn.</p>
</div>

4. Punch example page 3

5. For example...

The militants' tariff (on Page 15) contains headings, paragraphs, and a table...
<div type="prose">
 <head>THE MILITANTS' TARIFF.</head>
 <head rend="right">Etna Lodge, W.</head>
 <p>Mrs. Bangham Smasher, having entered into partnership with the
   Misses Burnham Blazer, as General Agents of Destruction, begs to
   inform the public that the firm will be prepared to execute
   commissions of all kinds, at the shortest notice, on the very moderate
   terms given below : -- </p>
 <table>
  <row role="label">
   <cell>For breaking windows, per window ...</cell>
   <cell>For howling, kicking, or biting during service in church,
       per howl, kick, or bite ...</cell>
  </row>
<!-- ... -->
 </table>
</div>

6. Punch example page 15

7. For example...

Egypt in Venice (on Page 5) begins with two headings, one in French....
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
 <head>Egypt in Venice.</head>
 <head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
 <p>Those who know the kind of attractions that the Russian ballet
   offers in so many of its themes ....</p>
<!-- ... -->
</div>
Each stanza of the poem on page 10 has a last line which is significantly indented:
<lg type="stanza">
 <l>There were eight pretty walkers who went up a hill;</l>
 <l>They were Jessamine, Joseph and Japhet and Jill,</l>
 <l>And Allie and Sally and Tumbledown Bill,</l>
 <l rend="indent">And Farnaby Fullerton Rigby.</l>
</lg>

8. Punch example page 5

9. Punch example page 10

10. Elements Available in All TEI Documents

The so-called 'Core' module groups together elements which may appear in any kind of text and the tags used to mark them in all TEI documents. This includes:
  • paragraphs
  • highlighting, emphasis and quotation
  • simple editorial changes
  • basic names, numbers, dates, addresses
  • simple links and cross-references
  • lists, notes, annotation, indexing
  • graphics
  • reference systems, bibliographic citations
  • simple verse and drama

10.1. Paragraphs

<p> (paragraph) marks paragraphs in prose
  • Fundamental unit for prose texts
  • <p> can contain all the phrase-level elements in the core
  • <p> can appear directly inside <body> or inside <div> (divisions)
<p>It was a cottage, the cottage of a
dream. And by a cottage I mean, not
four plain rooms and a kitchen, but one
surprising room opening into another;
rooms all on different levels and of
different shapes, with delightful places
to bump your head on; open fireplaces;
a large square hall, oak-beamed, where
your guests can hang about after breakfast,
while deciding whether to play
golf or sit in the garden. Yet all so
cunningly disposed that from outside
it looks only a cottage or, at most, two
cottages persuaded into one.</p>

10.2. Highlighting

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. It is typically applied to words and phrases which are:
  • distinct in some way (e.g. foreign, archaic, technical)
  • emphatic or stressed when spoken
  • not really part of the text (e.g. cross references, titles, headings)
  • a distinct narrative stream (e.g. an internal monologue, commentary)
  • attributed to some other agency inside or outside the text (e.g. direct speech, quotation)
  • set apart in another way (e.g. proverbial phrases, words mentioned but not used)

10.3. Highlighting Examples

  • <hi> (general purpose highlighting)
    <p>[The rest of this communication is
    omitted owing to considerations of
    space.—<hi rend="sc">Ed</hi>.]</p>
  • <distinct> (linguistically distinct)
    But then I remind myself
    that the Russian ballet is nothing if not
  • Other similar elements include: <emph>, <mentioned>, <soCalled>, <term> and <gloss>
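
A short sketch of how the other elements might be used. The sentences are invented for this example and do not come from Punch.

```xml
<p>She was <emph>quite</emph> sure that <mentioned>tessellate</mentioned>
is spelled with a double l.</p>
<p>The <soCalled>experts</soCalled> disagreed.</p>
<p>A <term>pointer</term> is <gloss>an element that refers to
another location</gloss>.</p>
```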

10.4. Quotation

Quotation marks can be used to set off text for many reasons, so the TEI has the following elements:
  • <q> (separated from the surrounding text with quotation marks)
  • <said> (speech or thought)
  • <quote> (passage attributed to an external source)
  • <cit> (groups a quotation and citation)
 <said who="#Celia">I know a lovely tin of potted
   grouse,</said> said Celia, and she went off
to cut some sandwiches. By twelve
o'clock we were getting out of the

10.5. Simple Editorial Changes: <choice> and Friends

  • <choice> (groups alternative editorial encodings)
  • Errors:
    • <sic> (apparent error)
    • <corr> (corrected error)
  • Regularization:
    • <orig> (original form)
    • <reg> (regularized form)
  • Abbreviation:
    • <abbr> (abbreviated form)
    • <expan> (expanded form)

10.6. Choice Example

I profess not to know how women's <choice>
<!-- ... -->
</choice> are wooed and won. To me they have
always been <choice>
<!-- ... -->
</choice> of riddle and <choice>
<!-- ... -->
</choice>
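
The three pairings (error, regularization, abbreviation) each sit inside their own <choice>. The sketch below uses an invented sentence and invented variant forms, purely to show the shape of the markup.

```xml
<p>We <choice>
  <sic>recieved</sic>
  <corr>received</corr>
 </choice> your letter, written in the <choice>
  <orig>olde</orig>
  <reg>old</reg>
 </choice> style, from <choice>
  <abbr>Dr</abbr>
  <expan>Doctor</expan>
 </choice> Smith.</p>
```

A processor can then choose which member of each pair to display, for example the corrected, regularized and expanded forms for a reading text.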

10.7. Additions, Deletions, and Omissions

  • <add> (addition to the text, e.g. marginal gloss)
  • <del> (phrase marked as deleted in the text)
  • <gap> (indicates point where material is omitted)
  • <unclear> (contains text unable to be transcribed clearly)

10.8. Example of <add>, <del>, <gap>, and <unclear>

<add place="left">The Cause</add> The immediate
cause, however, of the prevalence of supernatural

<add place="supra">stories</add>
in these parts, was doubtless owing to the
<unclear reason="blood splatter">vicinity</unclear>
of Sleepy Hollow.
<gap reason="illegible">
 <desc>The rest of this paragraph is covered
   in dried blood.</desc>
</gap>
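
A deletion paired with the addition that replaces it could be sketched as follows. The sentence and the rend value are assumptions made for this illustration.

```xml
<p>The immediate <del rend="strikethrough">reason</del>
<add place="above">cause</add> of the delay was the fog.</p>
```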

10.9. Basic Names

  • <name> (a name in the text, contains a proper noun or noun phrase)
  • <rs> (a general-purpose name or referencing string )

The type attribute is useful for categorizing these, and they both also have key, ref, and nymRef attributes.
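
A sketch of these attributes in use. The key value SHAW1 is a hypothetical identifier in some external database, not one defined anywhere in these materials.

```xml
<p><name type="person" key="SHAW1">George Bernard Shaw</name> lived at
<name type="place" ref="http://en.wikipedia.org/wiki/Ayot_St_Lawrence">Ayot
St Lawrence</name>; <rs type="person" key="SHAW1">the playwright</rs>
died there in 1950.</p>
```

Giving the <name> and the <rs> the same key lets a processor treat both as references to the same person.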

10.10. Basic Names Example

<p>The scene opens at a party given by <name>...</name>
in <name ref="http://en.wikipedia.org/wiki/Venice" type="place">Venice</name>. </p>
<p>It is when the natural end of the story is reached, and <name xml:id="SIMON">Simon</name> has come into his own and has just been
wedded to his proper affinity, that the structure seems to me to fall
with a crash. I might perhaps, though not without reluctance, have
pardoned an impertinent railway accident which leaves <rs corresp="#SIMON">the young man</rs> apparently crippled for life.</p>

10.11. Addresses

  • <email> (an electronic mail address)
  • <address> (a postal address)
  • <addrLine> (a non-specific address line)
  • <street> (a full street address)
  • <postCode> (a postal (or zip) code)
  • <postBox> (a postal box number)
  • <name> can also be used
  • and the 'namesdates' module extends this with more geographic names

10.12. Basic Address Example

<address>
 <name>George Bernard Shaw</name>
 <addrLine>Shaw's Corner</addrLine>
 <settlement>Ayot St Lawrence</settlement>
 <postCode>HE 1 XXX</postCode>
</address>

10.13. Basic Numbers and Measures

  • <num> (marks a number of any sort)
  • <measure> (marks a quantity or commodity)
  • <measureGrp> (groups specifications relating to a single object)
  • While <num> has simple type and value attributes, <measure> has type, quantity, unit and commodity attributes
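
The attributes listed above might be combined as in this sketch. The sentence, quantities and commodities are invented for illustration.

```xml
<p>The <num type="ordinal" value="3">third</num> cart carried
<measureGrp>
 <measure type="weight" quantity="50" unit="kg" commodity="coal">fifty
 kilos of coal</measure> and
 <measure type="volume" quantity="20" unit="l" commodity="water">twenty
 litres of water</measure>
</measureGrp>.</p>
```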

10.14. Number and Measure examples

<l>They went off at a pace I am bound to deplore,</l>
<l>For they did <num value="20">twenty</num> yards in a minute or more</l>
<l>And a yard or <num value="2">two</num> over, a capital score</l>
<l>For Farnaby Fullerton Rigby.</l>
<p>If neither of these values is available, a value of <num>20,35</num>
for ash content can be assumed initially and checked, after the
sampling has been carried out, using one of the methods described in
ISO 13909-7.</p>
It is on these days that we travel to our Castle of Stopes; as the
crow flies, <measure quantity="24140" unit="m">fifteen miles</measure>
away. Indeed, that is the way we get to it, for it is a castle in the

10.15. Dates

  • <date> (contains a date in any format and includes a when attribute for a regularised form and a calendar attribute to specify what calendar system)
  • <time> (contains a time in any format and includes a when attribute for a regularised form)
<p>At <time when="09:30:00">9.30 o'clock</time>,
as the fog lifted somewhat, the rescuing steamer
Lyonnesse had sighted the Gothland, fast on the rocks, with a bad
list to starboard, and apparently partly filled with pater.</p>
<p>House of Commons, <date when="1914-06-22">Monday, June 22, 1914</date>.</p>
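
The when attribute also accepts less precise values, such as a year and month without a day. A small sketch with invented sentences:

```xml
<p>The volume opens with the issues for <date when="1914-07">July 1914</date>.</p>
<p>The ship sailed on <date when="1914-07-01">the first of July</date>
at <time when="14:00:00">two in the afternoon</time>.</p>
```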

10.16. Simple Linking

  • <ptr> (defines a pointer to another location)
  • <ref> (defines a reference to another location, with optional linking text)
  • Both elements have:
    • target attribute taking a URI reference
    • cRef attribute for canonical referencing schemes
  • If the linking text can be generated automatically, <ptr> can be used wherever a <ref> might be.

10.17. Simple Linking Example

See <ref target="#Section12">section 12 on page 34</ref>.

See <ptr target="#Section12"/>.

10.18. Lists

  • <list> (a sequence of items forming a list)
  • <item> (one component of a list)
  • <label> (label associated with an item)
  • <headLabel> (heading for column of labels)
  • <headItem> (heading for column of items)
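
A labelled list with column headings might be sketched as follows. The type value gloss and the choice of entries are assumptions for this example.

```xml
<list type="gloss">
 <headLabel>Element</headLabel>
 <headItem>Purpose</headItem>
 <label>ptr</label>
 <item>a pointer to another location</item>
 <label>ref</label>
 <item>a reference to another location, with optional linking text</item>
</list>
```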

10.19. Simple List Example

The previous slide contained only:
<list>
    <item><gi>list</gi> (a sequence of items forming a list)</item>
    <item><gi>item</gi> (one component of a list)</item>
    <item><gi>label</gi> (label associated with an item)</item>
    <item><gi>headLabel</gi> (heading for column of labels)</item>
    <item><gi>headItem</gi> (heading for column of items)</item>
</list>

10.20. Notes

  • <note> (contains a note or annotation)
  • Notes can be those existing in the text, or provided by the editor of the electronic text
  • A place attribute can be used to indicate the physical location of the note
  • Although a note should usually be encoded where its identifier or mark first appears, notes can also be kept separately and point back to their location with a target attribute
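
A sketch of a note kept apart from the text it annotates. The identifier p12, the type values and the note text are invented; where such notes are gathered (here a division of type notes) is an encoder's decision.

```xml
<p xml:id="p12">It is not only misfortune that makes strange bedfellows.</p>
<!-- ... -->
<div type="notes">
 <note type="editorial" target="#p12">A proverbial phrase.</note>
</div>
```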

10.21. Note Example

<p>It is not only misfortune that makes strange bedfellows. <note place="foot">By-the-by, it is denied that Sir <name>Joseph Beecham</name> was in any way responsible for the Government's <title>Pills for Earthquakes</title>, by which it was hoped to avert the Irish crisis.</note></p>

10.22. Indexing

  • If converting an existing index, use nested lists. For auto-generated indexes:
  • <index> (marks an index entry) with optional indexName attribute
  • The <term> element is used to mark a term inside an <index> element
  • The <index> element can self-nest for hierarchical index entries
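
A hierarchical entry, with a sub-entry nested inside the main one, could be sketched like this. The indexName value subjects and the sentence are assumptions for this example.

```xml
<p>The sludge is then separated by sedimentation<index indexName="subjects">
  <term>wastewater treatment</term>
  <index>
   <term>sedimentation</term>
  </index>
 </index> and returned to the process.</p>
```

A stylesheet could render this as a sub-entry "sedimentation" under "wastewater treatment" in the generated subject index.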

10.23. Indexing Example

<p>… activated sludge treatment<index>
  <term>activated sludge</term>
 </index> process for the biological treatment of wastewater in which a mixture of wastewater and <hi>activated sludge</hi> is agitated and aerated. The <hi>activated sludge</hi> is subsequently separated from the <hi>treated wastewater</hi> by sedimentation<index>
  <term>sedimentation</term>
 </index>, and is removed or returned to the process as required.</p>

10.24. Graphics

  • <graphic> (indicates the location of an inline graphic, illustration, or figure)
  • <binaryObject> (encoded binary data embedding a graphic or other object)
  • The figure module provides <figure> and <figDesc> for more complex graphics
 <figure>
  <graphic url="images/014.png"/>
  <head>Garden City Washing-day.</head>
  <p>Our sensitive artist insists on a harmonious colour-scheme.</p>
  <figDesc>A bearded man sits in a deckchair and wags his finger at a woman hanging up washing</figDesc>
 </figure>

James Cummings. Date: April 2009
Copyright University of Oxford