1. Summary

  • The big three modules used by almost every TEI schema:
    • Text Structure
    • The Header
    • The Core
  • plus, if we have time, drama and verse

2. Structure of a TEI document

What is a text?

  • A text may be unitary or composite
  • a unitary text contains
    • optional front matter
    • optional back matter
    • a body
  • in a composite text, the body is replaced by a group of texts (or nested groups)
  • A corpus is a collection of text and header pairs. It has its own header.

(and you can also have a nested text, within or outside a quotation)

3. TEI basic structure
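
A minimal sketch of the basic structure, assuming a unitary text in the TEI P5 namespace. The title and prose below are invented placeholders, not taken from any real document:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>An illustrative title</title>
   </titleStmt>
   <publicationStmt>
    <p>Unpublished demonstration file.</p>
   </publicationStmt>
   <sourceDesc>
    <p>No source: written as an example.</p>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <text>
  <front>
<!-- optional front matter -->
  </front>
  <body>
   <p>The body of the text.</p>
  </body>
  <back>
<!-- optional back matter -->
  </back>
 </text>
</TEI>
```

In a composite text, the <body> would give way to a <group> containing further <text> elements.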

4. A text usually has divisions

  • generic, hierarchic subdivisions, each incomplete
  • the type attribute is used to label a particular level e.g. as ‘part’ or ‘chapter’
  • the n attribute gives a particular division a name or number
  • the xml:id attribute gives a particular division a unique identifier
  • associated <head> and <trailer> elements (from the divtop class) may also be supplied
  • A <divGen> element can be used for ‘generated’ divisions
  • the part and org attributes from att.divLike are available to indicate structural anomalies
  • the level can be made explicit by using ‘numbered’ divs. Opinions vary.

5. For example...

<!-- titlepage, etc here -->
  <div1 type="book" n="I" xml:id="JA0100">
   <head>Book I.</head>
   <div2 type="chapter" n="1" xml:id="JA0101">
    <head>Of writing lives in general...</head>
<!-- remainder of chapter 1 here -->
   </div2>
   <div2 n="2" xml:id="JA0102">
<!-- chapter 2 here -->
   </div2>
<!-- remainder of book 1 here -->
  </div1>
  <div1 type="book" n="II" xml:id="JA0200">
<!-- book 2 here -->
  </div1>
<!-- remaining books here -->

6. Tessellation is mandatory

At any level, you can have a sequence of division contents followed by the next smaller division:
 <div>
  <p> ...</p>
  <div> ... </div>
 </div>
But you cannot pop up a level once you are in division-country:
 <div rend="slide">
  <div> ... </div>
  <p> ...</p> <!-- this is illegal !!! -->
 </div>

7. Div decoration

  • model.divWrapper elements include <argument>, <byline>, <dateline>, <salute>, <signed>.
  • groups of them can be wrapped in <opener> or <closer>
  • some members of model.titlepagePart (<docAuthor>, <docDate>, <docImprint> etc.) are also available.

Extensions for letters have been proposed, e.g. by DALF.
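
A short sketch of div decoration in use, assuming a division that holds a letter. The date, names, and wording are invented for illustration:

```xml
<div type="letter">
 <opener>
  <dateline>Oxford, 12 September 2006</dateline>
  <salute>Dear colleague,</salute>
 </opener>
 <p>Thank you for your note about the header.</p>
 <closer>
  <salute>Yours sincerely,</salute>
  <signed>A. N. Encoder</signed>
 </closer>
</div>
```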

8. Document structure: another example

A multi-component electronic edition like this

might be encoded like this:
<!-- intro, preambles etc -->
<!-- outline -->
<!-- edited text -->
<!-- translated text -->
<!-- aligned text and trans-->
<!-- page images -->
<!-- transcription principles-->
<!-- transcription of ms1 -->
<!-- transcription of ms2 -->
<!-- appendixes, indexes etc. -->
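
One way such an edition could be arranged, assuming each component is treated as a separate <text> inside a <group>. The grouping chosen here is an assumption made for illustration, not necessarily the encoding originally shown, and the comments stand in for real content:

```xml
<text>
 <front>
<!-- intro, preambles etc -->
 </front>
 <group>
  <text>
   <body>
<!-- edited text -->
   </body>
  </text>
  <text>
   <body>
<!-- translated text -->
   </body>
  </text>
  <group>
   <text>
    <body>
<!-- transcription of ms1 -->
    </body>
   </text>
   <text>
    <body>
<!-- transcription of ms2 -->
    </body>
   </text>
  </group>
 </group>
 <back>
<!-- appendixes, indexes etc. -->
 </back>
</text>
```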

9. The TEI Header

The TEI header was designed with two goals in mind:
  • needs of bibliographers and librarians trying to document ‘electronic books’
  • needs of text analysts trying to document ‘coding practices’ within digital resources

The result is that discussion of the header tends to be pulled in two opposite directions...

10. The Librarian's Header

  • Conforms to standard bibliographic model, using similar terminology
  • Organized as a single source of information for bibliographic description of a digital resource, with established mappings to other such records (e.g. MARC)
  • Emerging code of best practice in its use, endorsed by major digital collections
  • Pressure for greater and more exact constraints to improve precision of description: preference for structured data over loose prose

11. Everyman's Header

  • Gives a polite nod to common bibliographic practice, but has a far wider scope
  • Supports a (potentially) huge range of very miscellaneous information, organized in fairly ad hoc ways
  • Many different codes of practice in different user communities
  • Unpredictable combinations of narrowly encoded documentation systems and loose prose descriptions

12. TEI Header structure

The TEI header has four main components:
  1. <fileDesc>: describes the TEI document itself and its sources
  2. <encodingDesc>: describes the relationship between the source and the encoded version of it
  3. <profileDesc>: err, well, just about everything else
  4. <revisionDesc>: provides a change history
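
As a rough outline, the four components appear in this order inside the header. Only <fileDesc> is mandatory; the comments are placeholders:

```xml
<teiHeader>
 <fileDesc>
<!-- titleStmt, publicationStmt, sourceDesc ... -->
 </fileDesc>
 <encodingDesc>
<!-- how the source was turned into this encoding -->
 </encodingDesc>
 <profileDesc>
<!-- non-bibliographic description -->
 </profileDesc>
 <revisionDesc>
<!-- change history -->
 </revisionDesc>
</teiHeader>
```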

13. File Description

  • has some mandatory parts:
    • <titleStmt>: provides a title for the resource and any associated statements of responsibility
    • <sourceDesc>: documents the sources from which the encoded text derives (if any)
    • <publicationStmt>: documents how the encoded text is published or distributed
  • and some optional ones:
    • <editionStmt>: yes, electronic texts have editions too
    • <seriesStmt>: and they also fit into "series".
    • <extent>: how many floppy disks, CDs, gigabits?
    • <notesStmt>: nuff said

NB A "file" may actually correspond with several operating system files.

14. A sample file description
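
A hypothetical file description, sketched to show the mandatory parts together with two of the optional ones. Every name, title, and identifier below is invented:

```xml
<fileDesc>
 <titleStmt>
  <title>A sample text: an electronic edition</title>
  <author>Anne Author</author>
  <respStmt>
   <resp>encoded by</resp>
   <name>E. N. Coder</name>
  </respStmt>
 </titleStmt>
 <editionStmt>
  <edition>Second draft</edition>
 </editionStmt>
 <extent>about 40 kilobytes</extent>
 <publicationStmt>
  <publisher>An Imaginary Text Archive</publisher>
  <pubPlace>Oxford</pubPlace>
  <idno type="local">EX-0001</idno>
  <availability>
   <p>Freely available for teaching purposes.</p>
  </availability>
 </publicationStmt>
 <sourceDesc>
  <p>Transcribed from a printed pamphlet.</p>
 </sourceDesc>
</fileDesc>
```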

15. The File Description

  • <titleStmt>: contains a mandatory <title> [MARC 245] which identifies the electronic file (not its source!)
  • optionally followed by additional titles, and by ‘statements of responsibility’, as appropriate, using <author>, <editor>, <sponsor>, <funder>, <principal> [MARC 536] or the generic <respStmt>
  • <publicationStmt>: may contain
    • plain text (e.g. to say the text is unpublished)
    • one or more <publisher>, <distributor>, <authority>
    • each followed by <pubPlace>, <address>, <availability>, <idno>

16. The Source Description

Most electronic texts were not ‘born digital’: their source/s need specification in traditional bibliographic style

  • <bibl>, <biblStruct>
  • (for texts which were born digital): <biblFull> may contain a nested <fileDesc>
  • <listBibl> a list of the foregoing
  • prose description
  • more specialized elements are available for spoken texts (<recordingStmt> etc.) and for manuscripts (<msDescription>)

17. For example
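
A minimal sketch of a source description for a text transcribed from print, using a loosely structured <bibl>. The bibliographic details are invented:

```xml
<sourceDesc>
 <bibl>
  <author>Anne Author</author>
  <title>A sample text</title>
  <pubPlace>London</pubPlace>
  <publisher>Imaginary Press</publisher>
  <date>1850</date>
 </bibl>
</sourceDesc>
```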

18. Association between header and text

  • By default everything asserted by a header is true of the text to which it is prefixed
  • This can be over-ridden
    • as when a text header over-rides or amplifies a corpus-header setting
    • when model.declarable elements are selected by means of the decls attribute (available on all model.declaring elements)
    • using special purpose selection/definition elements e.g. <catRef> and <taxonomy> (see below)
  • Most components of the encoding description are declarable.
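
A sketch of selection by means of the decls attribute, assuming a header that declares two alternative sets of editorial principles. The identifiers ED-MOD and ED-ORIG are hypothetical, and the two parts of the fragment would sit in the header and the text respectively:

```xml
<!-- in the header -->
<editorialDecl xml:id="ED-MOD">
 <p>Spelling has been modernized.</p>
</editorialDecl>
<editorialDecl xml:id="ED-ORIG">
 <p>Original spelling has been retained.</p>
</editorialDecl>
<!-- in the text -->
<div decls="#ED-ORIG">
 <p>This division follows the second set of principles.</p>
</div>
```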

19. Encoding Description

<encodingDesc> groups notes about the procedures used when the text was encoded, either summarized in prose or within specific elements such as
  • <projectDesc>: goals of the project
  • <samplingDecl>: sampling principles
  • <editorialDecl>: editorial principles, e.g. <correction>, <normalization>, <quotation>, <hyphenation>, <segmentation>, <interpretation>
  • <classDecl>: classification system/s used
  • <tagUsage>: specifics about usage of particular elements

The <encodingDesc> can replace the user manual, or facilitate semi-automatic document management, given agreed codes of practice.

20. An example
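
A hypothetical encoding description, sketched to show prose statements inside some of the specific elements listed earlier. The project and its practices are invented:

```xml
<encodingDesc>
 <projectDesc>
  <p>Texts collected for use in an introductory TEI workshop.</p>
 </projectDesc>
 <samplingDecl>
  <p>Each sample contains the first 2000 words of its source.</p>
 </samplingDecl>
 <editorialDecl>
  <correction>
   <p>Obvious printing errors have been silently corrected.</p>
  </correction>
  <hyphenation>
   <p>End-of-line hyphenation has been removed.</p>
  </hyphenation>
 </editorialDecl>
</encodingDesc>
```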

21. Profile Description

An extensible rag-bag of descriptions, categorized only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:
  • <creation>: information about the origination of the intellectual content of the text, e.g. time and place
  • <langUsage>: information about languages, registers, writing systems etc. used in the text
  • <textDesc> and <textClass>: classifications applied to the text by means of a list of specified criteria or by means of a collection of pointers, respectively
  • <particDesc> and <settingDesc>: information about the ‘participants’ in the text and their settings, either real or depicted
  • <handList>: information about the hands identified in a manuscript

22. Classification Methods

  • <textClass> provides a classification (by domain, medium, topic...) for the whole of a text
  • expressed in one or more of the following ways:
    • direct reference to a locally defined category (using <catRef>)
    • reference to an externally defined category (using <classCode>)
    • documented by <keywords>
<textClass>
 <catRef target="#X123"/>
 <classCode scheme="DD12">001.9</classCode>
 <keywords>
  <term>End of the World</term>
  <term>Day of Judgment</term>
 </keywords>
</textClass>

The target of the <catRef> is provided by one of the <taxonomy> elements within the <classDecl>, usually within a corpus header

<taxonomy>
  <category xml:id="X1">
   <catDesc>Homiletic writing</catDesc>
   <category xml:id="X123">
    <catDesc>Day of Judgment</catDesc>
   </category>
<!-- ... -->
  </category>
<!-- ... -->
</taxonomy>

The xml:id value space is global

23. Detailed characterization of a text

<textDesc> provides a description of a text in terms of its ‘situational parameters’:

<textDesc n="novel">
 <channel mode="w">print; part issues</channel>
 <constitution type="single"/>
 <derivation type="original"/>
 <domain type="art"/>
 <factuality type="fiction"/>
 <interaction type="none"/>
 <preparedness type="prepared"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>

These subelements constitute the class model.textDescPart: redefine that to roll your own.

24. Links between speakers, their setting, and their speech

In the header:

<person xml:id="PS0M6">
 <occupation>sales assistant</occupation>
</person>
<!-- .... -->
<setting xml:id="KDFSE002" who="#PS0M6">
 <placeName>Lancashire: Morecambe </placeName>
 <locale>at home</locale>
 <activity spont="H"> watching television </activity>
</setting>

In the text:

<!-- .... -->
<u who="#PS0M6">
 <s n="311">Show your daddy.</s>
</u>
<u who="#PS0M8">
 <s n="312">Daddy.</s>
</u>

25. Language and character set usage

The <langUsage> element is provided to document usage of languages in the text. Languages are identified by their ISO codes:

<langUsage>
 <language ident="en">English</language>
 <language ident="bg-cy">Bulgarian in Cyrillic characters</language>
 <language ident="bg">Romanized Bulgarian</language>
</langUsage>

26. Revision Description

A list of <change> elements, each with attributes giving a date and the person or agency responsible, indicating significant stages in the evolution of a document.

Most recent first.

 <change date="2006-08-09" resp="#LB">handedits following newhrdgen.xsl</change>
 <change date="2003-08-14" resp="#XaraIndexTools">Indexed</change>
 <change date="2003-08-12" resp="#OUCS">Revised taxonomy definitions</change>
 <change date="2000-10-11" resp="#OUCS">Final manual corrections for BNC-W</change>
 <change date="2000-10-18" resp="#OUCS">Further manual corrections for BNC-W</change>
 <change date="2000-01-08" resp="#OUCS">Manually changed catdescriptions etc. for BNC-W</change>
 <change date="1994-11-30" resp="#OUCS">First release for BNC-1</change>

27. The Core Module

This contains nearly 100 different elements ‘likely to appear in almost any kind of text’:
  • paragraphs and lists
  • elements indicated by highlighting and quotation (emphasis, titles, quotations, foreign words, terms, glosses, rhetorical moves...)
  • editorial changes (addition, deletion, correction...)
  • names, numbers, measures, dates, abbreviations...
  • links and cross references
  • annotation and indexing
  • graphics
  • bibliographies and bibliographic reference
  • referencing systems and milestones
  • verse and drama

Fortunately, you already know about most of them.

28. Visually salient elements

  • What constitutes best practice?
    <p>The fact that a word is in
    <hi rend="fr">fraktur</hi> does
    not <hi rend="it">necessarily</hi> make it a <hi rend="it">neologism</hi>.</p>
    <p>The fact that a word is in
    <foreign xml:lang="de">fraktur</foreign> does
    not <emph>necessarily</emph> make it a <term>neologism</term>.</p>
  • Be consistent and (preferably) economical:
    <hi rend="fr">
     <foreign xml:lang="de">fraktur</foreign>
    </hi>
    <foreign xml:lang="de" rend="fr">fraktur</foreign>
    <foreign xml:lang="de">
     <hi rend="fr">fraktur</hi>
    </foreign>
    <term xml:lang="de">
     <hi rend="fr">fraktur</hi>
    </term>
    <term xml:lang="de" rend="fr">fraktur</term>
  • Semantics is sacred:
    <p>Do not use <gi>emph</gi> unless you <emph>really</emph> mean it.</p>
  • Do not assume the mapping between <gi> and rendition is trivial.
  • Avoid markup voodoo

29. A digression on Markup Voodoo

  • Markup Voodoo is the belief that anything that can be marked up, should be: "it will be useful one day".
  • The core module in particular provides ample temptation for hair-splitters (<soCalled> vs <q> vs <quote> vs <distinct> ...)
  • It is better to have a clear policy at the outset stating
    • which tags you will always use and in what circumstances
    • where elements can be used in different ways, which way you prefer
  • It is unlikely that such a policy will be the same at the start of the project as it is at the end: be prepared for change!
  • ODD is your friend...

30. Semantically salient elements

  • Dates and times, quantities and measures, names of persons etc.
  • The core elements allow you to do two things:
    • identify a bit of text which you think refers to something in the real world (e.g. a time, a number, a person)
    • optionally associate that reference with
      • a normalised value using a standard representation (using e.g. data.temporal)
      • a link to a canonical description of the item referenced (e.g. using key)
      • the latter can also simply be used for normalization
  • For example:
    <q>My dear <persName key="BENM1">Mr.
       Bennet</persName>,</q> said <rs type="person" key="BENM2">
    his lady</rs> to him one day,
    <q>have you heard that
    <rs type="place" key="NETP1">Netherfield Park</rs>
    was let <date notBefore="1720" notAfter="1802-10-02">last
    <!-- ... --></date></q>
  • Handle with care...

31. Linking, indexing, annotation

Big changes with P5 (not all complete):
  • <ptr> and <ref> are now able to point anywhere, not simply within a document
  • <index> has now been revised extensively
  • <note> is to become global
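
A brief sketch of the two pointing elements, assuming one target inside the current document (the JA0101 identifier from the earlier division example) and one outside it:

```xml
<p>See <ref target="#JA0101">the first chapter</ref>,
 or the Guidelines themselves
 <ptr target="http://www.tei-c.org/"/>.</p>
```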

More detail tomorrow

32. Figures and graphics

The <figure> element marks where there is some graphic content on a page. It can contain:
  • a <graphic> element, which points to the actual resource by means of a URL
  • some embedded SVG (in the SVG namespace)
  • a <binaryObject> element, containing the actual graphic in an appropriate notation
  • text, marked up using e.g. <head>, <p>, <quote> etc.
  • a placeholder <figDesc> element
  • a nested <figure>, for building up complex figures

It does not make provision for specifically graphic-related metadata: that goes in the header.
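
A minimal sketch of a figure, assuming an external image file. The file name, heading, and description are hypothetical:

```xml
<figure>
 <graphic url="images/frontispiece.png"/>
 <head>Frontispiece</head>
 <figDesc>An engraved portrait of the author, seated at a desk.</figDesc>
</figure>
```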

33. The Verse Module

Adds some extras for verse texts:
  • <caesura> and <rhyme>
  • additional attributes for metrical and rhyme scheme analysis
<lg rhyme="ABCCBBA">
 <l>The sunlight on the <rhyme label="A">garden</rhyme></l>
 <l><rhyme label="A">Harden</rhyme>s and grows <rhyme label="B">cold</rhyme>,</l>
 <l>We cannot cage the <rhyme label="C">minute</rhyme></l>
 <l>Wi<rhyme label="C">thin it</rhyme>s nets of <rhyme label="B">gold</rhyme></l>
 <l>When all is <rhyme label="B">told</rhyme></l>
 <l>We cannot beg for <rhyme label="A">pardon</rhyme>.</l>
</lg>

34. The Drama Module

Intended for radio and movie scripts as well as conventional theatrical material

  • Adds a range of specialised front matter elements e.g. <performance>, <prologue>, <set>, <castList> and its children.
  • Adds specialised forms of stage direction e.g. <view>, <camera>, <caption>, <sound>:
<camera>Zoom in to overlay showing some stock film of hansom cabs
galloping past.</camera>
<caption>London, 1895.</caption>
<caption>The residence of Mr Oscar Wilde.</caption>
<sound>classy music starts.</sound>
<view>Mix through to Wilde's drawing
room. A crowd of suitably dressed folk are engaged in typically
brilliant conversation, laughing affectedly and drinking
<!-- ... --></view>
<sp who="#tj">
 <speaker>Prince of Wales</speaker>
 <p>My congratulations, Wilde. Your latest play is a great success.</p>
</sp>

Lou Burnard. Date: September 2006
Copyright University of Oxford