

1. Default Text Structure and Header

Two of the major modules used in all TEI documents (the other is Core):
  • Text Structure
  • TEI Header

2. Structure of a TEI Document

There are two basic structures of a TEI Document:
  • <TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
  • <teiCorpus> contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.

3. TEI basic structure (1)

<!-- required -->
<!-- required -->
<!-- required -->
<!-- new in TEI P5 1.0 -->
<!-- required -->
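
A minimal sketch of the kind of skeleton these comments annotate. It assumes TEI P5 rules: a <TEI> element needs a <teiHeader> with at least a <fileDesc>, and a <text> with a <body>. It also assumes that the "new in TEI P5 1.0" comment refers to the optional <facsimile> element. All text content is placeholder.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>                          <!-- required -->
    <fileDesc>                         <!-- required -->
      <!-- titleStmt, publicationStmt and sourceDesc go here -->
    </fileDesc>
  </teiHeader>
  <!-- an optional <facsimile> may appear here (assumption: this is the
       element the "new in TEI P5 1.0" comment refers to) -->
  <text>                               <!-- required -->
    <!-- optional <front> matter -->
    <body>                             <!-- required in a unitary text -->
      <p>Placeholder paragraph.</p>
    </body>
    <!-- optional <back> matter -->
  </text>
</TEI>
```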

4. <text>

What is a text?
  • A text may be unitary or composite
    • unitary: forming an organic whole
    • composite: consisting of several components which are in some important sense independent of each other
  • a unitary text contains
    • optional front matter
    • <body> (required)
    • optional back matter

5. Composite texts

A composite text contains
  • optional front matter
  • <group> (required)
  • optional back matter

A corpus is a collection of text and header pairs. It has its own header.

<group> tags may self-nest.

6. TEI basic structure - 2

<!-- required -->
<!-- required -->
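
A sketch of the composite counterpart, under the same assumptions as before: the <body> is replaced by a <group>, and each member of the group is a <text> in its own right. The header content is omitted and the paragraph text is placeholder.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- required header content omitted from this sketch -->
  </teiHeader>
  <text>
    <!-- optional <front> matter for the whole collection -->
    <group>                            <!-- required in a composite text -->
      <text>
        <body><p>First member text (placeholder).</p></body>
      </text>
      <text>
        <body><p>Second member text (placeholder).</p></body>
      </text>
    </group>
    <!-- optional <back> matter for the whole collection -->
  </text>
</TEI>
```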

7. A text usually has divisions

  • generic, hierarchic subdivisions, each incomplete
  • the @type attribute is used to label a particular level, e.g. as 'part' or 'chapter'
  • the @n attribute gives a particular division a name or number
  • the @xml:id attribute gives a particular division a unique identifier

8. Divisions may have heads and trailers

 <head>Chapter 1</head>
<!-- content of the div -->
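
A small sketch of a complete division using the attributes from the previous slide together with a head and a trailer. The attribute values, the identifier `ch01` and the trailer wording are illustrative assumptions.

```xml
<div type="chapter" n="1" xml:id="ch01">
  <head>Chapter 1</head>
  <!-- content of the div -->
  <p>Placeholder paragraph.</p>
  <trailer>End of Chapter 1</trailer>
</div>
```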

9. Partial and composite divisions

Particularly when dealing with unusually large or unusually small texts, encoders may find it convenient to present as textual divisions sequences of text which are incomplete with respect to the original text, or which are in fact an ad hoc agglomeration of tiny texts.

The @org, @sample and @part attributes from att.divLike are available to indicate such structural anomalies:
  • @org = how the content of the div is organized (composite or uniform)
  • @sample = indicates whether this division is a sample of the original source and if so, from which part
  • @part = whether or not the division is fragmented by some other structural element (for example, a speech divided between two or more verse stanzas)
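
A hedged sketch of a sampled division. It assumes the TEI P5 values `initial` for @sample (the sample is taken from the start of the division) and `N` for @part (the division is not fragmented). The heading and paragraph are placeholders.

```xml
<div type="chapter" n="2" sample="initial" part="N">
  <head>Placeholder heading</head>
  <p>Opening paragraphs of the chapter, transcribed in full.</p>
  <!-- the rest of the chapter was not transcribed -->
  <gap reason="sampling"/>
</div>
```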

10. Example


 <head>The Legend of Sleepy Hollow</head>
 <p>THE PRECEDING Tale is given, almost in the precise words in which I heard it
   related at a Corporation meeting of the ancient city of Manhattoes, at which were
   present many of its sagest and most illustrious burghers. The narrator was a
   pleasant, shabby, gentlemanly old fellow, in pepper-and-salt clothes, with a
   sadly humorous face; and one whom I strongly suspected of being poor, -- he made such
   efforts to be entertaining. When his story was concluded, there was much laughter
   and approbation, particularly from two or three deputy aldermen, who had been asleep
   a greater part of the time. There was, however, one tall, dry-looking old gentleman,
   with beetling eyebrows, who maintained a grave and rather severe face throughout:
   now and then folding his arms, inclining his head, and looking down upon the floor,
   as if turning a doubt over in his mind.</p>
 <gap reason="sampling"/>

11. Numbered and unnumbered divs

The level can be made explicit by using 'numbered' divs (div1, div2). Opinions vary:

<div1> vs. <div n="1">
  • numbered: the number indicates the depth of this particular division within the hierarchy, the largest such division being ‘div1’, any subdivision within it being ‘div2’, etc.
  • unnumbered: nest recursively to indicate their hierarchic depth.
The two styles must not be combined within a single <front>, <body>, or <back> element.
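
The two styles side by side, as alternative fragments (a sketch with placeholder content; the @type and @n values are assumptions):

```xml
<!-- Style 1: numbered divisions; the element name carries the depth -->
<body>
  <div1 type="part" n="1">
    <div2 type="chapter" n="1">
      <p>Placeholder text.</p>
    </div2>
  </div1>
</body>

<!-- Style 2: unnumbered divisions; nesting alone carries the depth -->
<body>
  <div type="part" n="1">
    <div type="chapter" n="1">
      <p>Placeholder text.</p>
    </div>
  </div>
</body>
```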

N.B. Divisions always tessellate

12. Classes for divisions

The TEI architecture defines five classes, all of which are populated by this module:
  • model.divTop groups elements appearing at the beginning of a text division.
  • model.divTopPart groups elements which can occur only at the beginning of a text division.
  • model.divBottom groups elements appearing at the end of a text division.
  • model.divBottomPart groups elements which can occur only at the end of a text division.
  • model.divWrapper groups elements which can appear at either top or bottom of a textual division.

13. model.divWrapper

<argument> contains a formal list or prose description of the topics addressed by a subdivision of a text.
<byline> contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
<dateline> contains a brief description of the place, date, time, etc. of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer.
<docAuthor> (document author) contains the name of the author of the document, as given on the title page (often but not always contained in a byline).
<docDate> (document date) contains the date of a document, as given (usually) on a title page.
<epigraph> contains a quotation, anonymous or attributed, appearing at the start of a section or chapter, or on a title page.

14. model.divTopPart

<head> (heading) contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc.
<salute> (salutation) contains a salutation or greeting prefixed to a foreword, dedicatory epistle, or other division of a text, or the salutation in the closing of a letter, preface, etc.
<opener> groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.

model.divTop = model.divTopPart + model.divWrapper

15. model.divBottomPart

<closer> groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter.
<signed> (signature) contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.
<trailer> contains a closing title or footer appearing at the end of a division of a text.
<postscript> contains a postscript, e.g. to a letter.

model.divBottom = model.divBottomPart + model.divWrapper
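
A sketch of how these classes play out in a letter encoded as a division: <opener> sits at the top, <closer> and <postscript> at the bottom. All names, dates and wording are invented placeholders.

```xml
<div type="letter">
  <opener>
    <dateline>Placeholder place, placeholder date</dateline>
    <salute>Dear Sir,</salute>
  </opener>
  <p>Body of the letter (placeholder).</p>
  <closer>
    <salute>Your obedient servant,</salute>
    <signed>A. Correspondent</signed>
  </closer>
  <postscript>
    <p>A placeholder postscript.</p>
  </postscript>
</div>
```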

16. Grouped and Floating Texts

The <group> element should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes.

<floatingText> contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.

17. Grouped texts

Examples of composite texts which should be represented using the <group> element include anthologies and other collections. The presence of common front matter referring to the whole collection, possibly in addition to front matter relating to each individual text, is a good indication that a given text might usefully be encoded in this way; this structure may be found useful in other circumstances too.

<!-- header information for the whole collection -->
    <titlePart> The Works of Washington Irving. New Edition, Revised. Vol. II. The
         Sketch-Book </titlePart>
   <docImprint>New York: G. P. Putnam, 1861</docImprint>
<!-- any other front matter specific to this collection -->
     <head rend="italic">The Works of Washington Irving</head>
     <docTitle> The Voyage </docTitle>
     <byline>By Washington Irving.</byline>
     <p>To an American visiting Europe, the long voyage he has to make is an
           excellent preparative. ... </p>
<!-- remainder of The Voyage here -->
     <head rend="italic">The Works of Washington Irving</head>
     <byline>By Washington Irving.</byline>
<!-- text of Roscoe here -->

A text which is a member of a group may itself contain groups. This is quite common in collections of verse, but may happen in any kind of text.
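
A sketch of such nesting, assuming a hypothetical collection in which one member text is itself a small group of poems. All content is placeholder.

```xml
<text>
  <!-- front matter for the whole collection -->
  <group>
    <text>
      <!-- front matter for this member, e.g. one poet's section -->
      <group>
        <text>
          <body><lg><l>First poem (placeholder line)</l></lg></body>
        </text>
        <text>
          <body><lg><l>Second poem (placeholder line)</l></lg></body>
        </text>
      </group>
    </text>
    <text>
      <body><p>A prose piece in the same collection (placeholder).</p></body>
    </text>
  </group>
</text>
```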

18. Floating texts

As mentioned above, <div>s must tessellate over the entire text:
 <div1>
  <div2> <!-- content --> </div2>
  <div2> <!-- content --> </div2>
 </div1>
is valid, while
 <div1>
  <!-- content -->
  <div2> <!-- content --> </div2>
  <!-- content -->
 </div1>
is not valid.

In the second case, div2 is a 'floating' text and its content must be encoded using the <floatingText> element.

The <floatingText> element is a member of the model.divPart class, and can thus appear within any division level element in the same way as a paragraph.

19. Floating text Example

[18th century text The Lining to the Patch-Work Screen, by Jane Barker (1726)]
<p>Galecia one Evening setting alone in her Chamber by a clear Fire, and a clean Hearth
... reflected on the Providence of our All-wise and Gracious Creator.... </p>
<p>She was thus ruminating, when a Gentleman enter'd the Room, the Door being a jar...
calling for a Candle, she beg'd a thousand Pardons, engaged him to sit down, and let
her know, what had so long conceal'd him from her Correspondence. </p>
<pb n="5"/>
  <head>The Story of <hi>Captain Manly</hi></head>
  <p>Dear Galecia, said he, though you partly know the loose, or rather lewd Life that
     I led in my Youth; yet I can't forbear relating part of it to you by way of
<!-- Captain Manly's story here --> I had lost and spent all I had
     in the World; in which I verified the Old Proverb, That a Rolling Stone never
     gathers Moss, </p>
<pb n="37"/>
<p>The Gentleman having finish'd his Story, Galecia waited on him to the Stairs-head;
and at her return, casting her Eyes on the Table, she saw lying there an old dirty
rumpled Book, and found in it the following story: </p>
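
A structural sketch of how the embedded story might be wrapped, assuming the enclosing elements are <floatingText> and its own <body>. The prose is abbreviated from the passage above.

```xml
<body>
  <p>She was thus ruminating, when a Gentleman enter'd the Room ...</p>
  <pb n="5"/>
  <floatingText>
    <body>
      <head>The Story of <hi>Captain Manly</hi></head>
      <p>Dear Galecia, said he, ...</p>
      <!-- Captain Manly's story here -->
    </body>
  </floatingText>
  <pb n="37"/>
  <p>The Gentleman having finish'd his Story, ...</p>
</body>
```

The surrounding narrative resumes after the closing </floatingText> tag, so the outer paragraphs remain siblings within a single division.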

20. Virtual divisions

Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose <divGen> element:
<!-- <titlePage>...</titlePage> -->
 <divGen type="toc"/>
(intended primarily for use in document production or manipulation, rather than in transcription of pre-existing material)

21. The TEI Header

The TEI header was designed with two goals in mind:
  • needs of bibliographers and librarians trying to document ‘electronic books’
  • needs of text analysts trying to document ‘coding practices’ within digital resources
The result is that discussion of the header tends to be pulled in two directions...

22. The Librarian’s Header

  • Conforms to standard bibliographic model, using similar terminology
  • Organized as a single source of information for bibliographic description of a digital resource, with established mappings to other such records (e.g. MARC)
  • Emerging code of best practice in its use, endorsed by major digital collections
  • Pressure for greater and more exact constraints to improve precision of description: preference for structured data over loose prose

23. Everyman’s Header

  • Gives a polite nod to common bibliographic practice, but has a far wider scope
  • Supports a (potentially) huge range of very miscellaneous information, organized in fairly ad hoc ways
  • Many different codes of practice in different user communities
  • Unpredictable combinations of narrowly encoded documentation systems and loose prose descriptions

24. TEI Header Structure

The TEI header has four main components:
  • <fileDesc> (file description) contains a full bibliographic description of an electronic file.
  • <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
  • <revisionDesc> (revision description) summarizes the revision history for a file.
  • <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting (in short, just about everything not covered by the other header elements).

Only <fileDesc> is required; the others are optional.

25. Example Header: Minimal

<!-- ... -->
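
A sketch of what a minimal header could look like, assuming only the mandatory <fileDesc> with its three mandatory parts, each filled with prose. The title and statements are placeholders.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Placeholder title of the electronic file</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished (placeholder statement).</p>
    </publicationStmt>
    <sourceDesc>
      <p>Placeholder description of the source.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```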

26. Example Header: TEI corpus

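 <teiCorpus>
  <teiHeader type="corpus">
<!-- corpus-level metadata here -->
  </teiHeader>
  <TEI>
   <teiHeader type="text">
<!-- metadata specific to this text here -->
   </teiHeader>
   <text>
<!-- ... -->
   </text>
  </TEI>
  <TEI>
   <teiHeader type="text">
<!-- metadata specific to this text here -->
   </teiHeader>
   <text>
<!-- ... -->
   </text>
  </TEI>
 </teiCorpus>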

27. Types of content in the TEI header

  • free prose
    • prose description: series of paragraphs
    • phrase: character data, interspersed with phrase-level elements, but not paragraphs
  • grouping elements: specialized elements recording some structured information (Elements whose names end with the suffix Stmt, e.g. <editionStmt>, <titleStmt>, usually enclose these)
  • declarations: Elements whose names end with the suffix Decl (e.g. subjectDecl, refsDecl) enclose information about specific encoding practices applied in the electronic text; often these practices are described in coded form, but they may be described in prose as well. Typically, such information takes the form of a series of declarations, identifying a code with some more complex structure or description.
  • descriptions: Elements whose names end with the suffix Desc (e.g. <settingDesc>, <projectDesc>) contain a prose description, possibly, but not necessarily, organized under some specific headings by suggested sub-elements.

28. File Description

  • has some mandatory parts:
    • <titleStmt>: provides a title for the resource and any associated statements of responsibility
    • <sourceDesc>: documents the sources from which the encoded text derives (if any)
    • <publicationStmt>: documents how the encoded text is published or distributed
  • and some optional ones:
    • <editionStmt>: yes, electronic texts have editions too
    • <seriesStmt>: and they also fit into "series".
    • <extent>: how many floppy disks, CDs, gigabits?
    • <notesStmt>: nuff said

NB A "file" may actually correspond with several operating system files.

29. The File Description

  • <titleStmt>: contains a mandatory <title> which identifies the electronic file (not its source!)
  • optionally followed by additional titles, and by ‘statements of responsibility’, as appropriate, using <author>, <editor>, <sponsor>, <funder>, <principal> or the generic <respStmt>
  • <publicationStmt>: may contain
    • plain text (e.g. to say the text is unpublished)
    • one or more <publisher>, <distributor>, <authority>, each followed by <pubPlace>, <address>, <availability>, <idno>
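
A fuller sketch of a file description combining these parts. The encoder name, publisher, place and availability text are invented placeholders. The source citation reuses the Washington Irving edition cited elsewhere in these slides.

```xml
<fileDesc>
  <titleStmt>
    <!-- the title identifies the electronic file, not its source -->
    <title>The Legend of Sleepy Hollow: an electronic edition</title>
    <author>Washington Irving</author>
    <respStmt>
      <resp>Encoded by</resp>
      <name>A. N. Encoder</name>
    </respStmt>
  </titleStmt>
  <publicationStmt>
    <publisher>Placeholder publisher</publisher>
    <pubPlace>Placeholder place</pubPlace>
    <availability>
      <p>Placeholder availability statement.</p>
    </availability>
  </publicationStmt>
  <sourceDesc>
    <bibl>"The Legend of Sleepy Hollow", published in The Works of
      Washington Irving (New York, Putnam, 1861)</bibl>
  </sourceDesc>
</fileDesc>
```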

30. The Source Description

Most electronic texts were not ‘born digital’: their source/s need specification in traditional bibliographic style
  • <bibl>, <biblStruct>
  • (for texts which were born digital): <biblFull> may contain a nested <fileDesc>
  • <listBibl> a list of the foregoing
  • prose description
  • more specialized elements are available for spoken texts (<recordingStmt> etc.) and for manuscripts (<msDescription>)

31. For Example

 <bibl> "The Legend of Sleepy Hollow", published in The Works of Washington Irving (New
   York, Putnam, 1861) </bibl>

32. Association between header and text

By default, everything asserted by a header is true of the text to which it is prefixed. This can be overridden:
  • when a text header overrides or amplifies a corpus-header setting
  • when declarable elements (class att.declarable) are selected by means of the @decls attribute (available on all elements of class att.declaring)
  • using special purpose selection/definition elements e.g. <catRef> and <taxonomy> (see below)
Most components of the encoding description are declarable.
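
A sketch of selection by @decls, assuming two alternative editorial declarations, one of them marked as the default with the @default attribute. The identifiers and policy statements are hypothetical.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- required <fileDesc> omitted from this sketch -->
    <encodingDesc>
      <editorialDecl xml:id="ed-regular" default="true">
        <p>Spelling regularized (placeholder policy).</p>
      </editorialDecl>
      <editorialDecl xml:id="ed-diplomatic">
        <p>Original spelling retained (placeholder policy).</p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <p>Governed by the default declaration.</p>
      </div>
      <div decls="#ed-diplomatic">
        <p>Governed by the explicitly selected declaration.</p>
      </div>
    </body>
  </text>
</TEI>
```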

33. Encoding Description

<encodingDesc> groups notes about the procedures used when the text was encoded, either summarized in prose or within specific elements such as
  • <projectDesc>: goals of the project
  • <samplingDecl>: sampling principles
  • <editorialDecl>: editorial principles, e.g. <correction>, <normalization>, <quotation>, <hyphenation>, <segmentation>, <interpretation>
  • <classDecl>: classification system/s used
  • <tagUsage>: specifics about usage of particular elements
The <encodingDesc> can replace the user manual, or facilitate semi-automatic document management, given agreed codes of practice.
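
A sketch of an encoding description that uses a few of these elements with prose content only. Every statement in it is a placeholder, not a description of any real project.

```xml
<encodingDesc>
  <projectDesc>
    <p>Placeholder statement of the project's goals.</p>
  </projectDesc>
  <samplingDecl>
    <p>Placeholder: only the opening of each chapter was transcribed.</p>
  </samplingDecl>
  <editorialDecl>
    <correction>
      <p>Placeholder: obvious printer's errors were corrected silently.</p>
    </correction>
    <hyphenation>
      <p>Placeholder: end-of-line hyphens were removed.</p>
    </hyphenation>
  </editorialDecl>
</encodingDesc>
```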

34. Profile Description

An extensible rag-bag of descriptions, categorized only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:
  • <creation>: information about the origination of the intellectual content of the text, e.g. time and place
  • <langUsage>: information about languages, registers, writing systems etc used in the text
  • <textDesc> and <textClass>: classifications applied to the text by means of a list of specified criteria or by means of a collection of pointers, respectively
  • <particDesc> and <settingDesc>: information about the ‘participants’ in the text, either real or depicted, and about its setting
  • <handList>: information about the hands identified in a manuscript

35. Classification Methods

<textClass> provides a classification (by domain, medium, topic...) for the whole of a text expressed in one or more of the following ways:
  • direct reference to a locally defined category (using <catRef>)
  • reference to an externally defined category (using <classCode>)
  • documented by <keywords>

36. Example

 <textClass>
  <catRef target="#X123"/>
  <classCode scheme="DD12">001.9</classCode>
  <keywords>
   <term>End of the World</term>
   <term>Day of Judgment</term>
  </keywords>
 </textClass>
<!-- The target of the <catRef> is provided by one of the <taxonomy> elements within the <classDecl>, usually within a corpus header -->
 <taxonomy>
  <category xml:id="X1">
   <catDesc>Homiletic writing</catDesc>
   <category xml:id="X123">
    <catDesc>Day of Judgment</catDesc>
   </category>
<!-- ... -->
  </category>
<!-- ... -->
 </taxonomy>
<!-- The xml:id value space is global -->

37. Detailed characterization of a text

<textDesc> provides a description of a text in terms of its ‘Situational parameters’

<textDesc n="novel">
 <channel mode="w">print; part issues</channel>
 <constitution type="single"/>
 <derivation type="original"/>
 <domain type="art"/>
 <factuality type="fiction"/>
 <interaction type="none"/>
 <preparedness type="prepared"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>
<!-- These subelements constitute the class model.textDescPart: redefine that to roll your own. -->

38. Language and character set usage

The <langUsage> element is provided to document usage of languages in the text. Languages are identified by their ISO codes:
 <langUsage>
  <language ident="en">English</language>
  <language ident="bg-cy">Bulgarian in Cyrillic characters</language>
  <language ident="bg">Romanized Bulgarian</language>
 </langUsage>

39. Revision Description

A list of <change> elements, each with attributes giving a date and the person responsible, marking significant stages in the evolution of a document. Most recent first.

40. Example

 <change date="2006-08-09" resp="#LB">handedits following newhrdgen.xsl</change>
 <change date="08/14/03" resp="#XaraIndexTools">Indexed</change>
 <change date="2003-08-12" resp="#OUCS">Revised taxonomy definitions</change>
 <change date="2000-10-11" resp="#OUCS">Final manual corrections for BNC-W</change>
 <change date="2000-10-18" resp="#OUCS">Further manual corrections for BNC-W</change>
 <change date="2000-01-08" resp="#OUCS">Manually changed catdescriptions etc.
   for BNC-W</change>
 <change date="1994-11-30" resp="#OUCS">First release for BNC-1</change>

41. Next...?

Next James will tell us about the Core TEI Elements and Non-Standard Characters.

Dot Porter. Date: 2007-10-31
Copyright University of Oxford