
1. Background and expectations

I am assuming that you:
  • know what the Text Encoding Initiative is
  • have an idea what richly digitized text looks like
  • realize that the TEI is an ongoing issue, not a frozen standard
  • care about access to resources by computers and not just humans…
I also apologize for my misleading name which contradicts my purely English background.

2. The problem

The Text Encoding Initiative was designed from the start as a dynamic model which could provide both
  • a firmly-anchored model for well-understood structural components of digital texts, and
  • a framework in which scholars could freely record their work in an open-ended and non-prescriptive way.

Do we end up with interchange, interoperability, or just a private record? Only relatively recently have we started to see enough openly available TEI texts, and enough tools, for this to really matter.

3. Many expressions of the same semantics

 <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of <placeName>Knebworth</placeName>
 <rs type="name">Edward George Bulwer-Lytton, Baron Lytton of Knebworth</rs>
 <name type="person">Edward George Bulwer-Lytton, Baron Lytton of Knebworth</name>
 <persName>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</persName>

4. continued…

   Baron Lytton of Knebworth </persName>
 <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of <placeName>Knebworth</placeName>

(Note the difference in XML whitespace between the last two.)

<persName ref="#EBL">
 <roleName>Baron Lytton of <placeName>Knebworth</placeName>

5. continued …

We can't even agree on the difference between
<head>CHAPTER I.</head>
<head>CHARLOTTE BROOKS.</head>

6. A typical artefact

7. True but useful?



  <graphic url="images/1918-02-12-page-2-zone-1.jpg"/>

   <hi rend="red_background">A</hi> Little <hi rend="green_background">H</hi>ealth, A <hi rend="green_background">L</hi>ittle <hi rend="red_background">W</hi>ealth, A Little ~</line>
   <hi rend="blue_background">H</hi>ouse, <hi rend="blue_background">A</hi>nd <hi rend="red_background">F</hi>reedom,~ And at The <hi rend="red_background">E</hi>nd, </line>
   <hi rend="red_background">I</hi>’d, Like A <hi rend="blue_background">F</hi>riend, But little <zone

    <del rend="stroked" hand="#Wilfred_Owen">
     <hi rend="green_background">B</hi>ut Little</del>
    <add place="below" hand="#Wilfred_Owen">And Every</add>
   </zone> Cause To <hi rend="blue_background">N</hi>eed Him.</line>

8. Many ways to solve a problem — linking text and facsimile

The Best Practice in Libraries guidelines say you can choose between:
  • Use the @facs attribute on a <pb> element to point to the corresponding page image using a URI.
  • Use the <facsimile> element to define a set of images that corresponds to the text, in conjunction with the @facs attribute on a <pb> element to point to the corresponding page image using a URI.
  • Use the @xml:id attribute on each <pb> element and a METS document to provide correspondence between <pb> elements and one or more facsimile page images (e.g., master, web derivatives, etc.).
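
A consumer of such texts has to cope with whichever option the encoder picked. The following is a minimal sketch, in Python with only the standard library, of resolving the first two options to a page-to-image table. The sample document, its file names and the helper `page_images` are hypothetical and not part of the guidelines quoted above; the METS option is left out.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: page 1 points straight at an image (option 1),
# page 2 points at a <surface> inside <facsimile> (option 2).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="s2"><graphic url="images/page-2.jpg"/></surface>
  </facsimile>
  <text><body>
    <pb n="1" facs="images/page-1.jpg"/>
    <pb n="2" facs="#s2"/>
  </body></text>
</TEI>"""


def page_images(root):
    """Map each <pb> @n to an image URL, whichever linking style was used."""
    # Index every <surface> that has an xml:id and a <graphic> child.
    surfaces = {}
    for surface in root.iter(TEI + "surface"):
        graphic = surface.find(TEI + "graphic")
        if surface.get(XML_ID) and graphic is not None:
            surfaces[surface.get(XML_ID)] = graphic.get("url")

    result = {}
    for pb in root.iter(TEI + "pb"):
        facs = pb.get("facs")
        if facs is None:
            continue  # e.g. the METS option, which is out of scope here
        if facs.startswith("#"):
            # Local pointer: look the surface up in the index.
            result[pb.get("n")] = surfaces.get(facs[1:])
        else:
            # Direct URI to the page image.
            result[pb.get("n")] = facs
    return result


images = page_images(ET.fromstring(SAMPLE))
print(images)
```

Even this small case needs a branch per encoding choice, which is the point of the slide: the choices are all legitimate, but every tool has to know about all of them.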


9. To punctuate or not

<p>She said, <said>Nobody uses the term <soCalled>electronic text</soCalled>
<p>She said, <said>“Nobody uses the term <soCalled>‘electronic text’</soCalled>
<p>She said, <said rend="quotes: '“' '”'">Nobody uses the term <soCalled rend="quotes: '‘' '’'">electronic text</soCalled> anymore</said>!</p>


10. A more subtle example, a complex signature

(thanks to Martin Mueller for the example)

Note that TEI says a <signed> element ‘contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.’

11. One encoding

   <cell>Duncan Campbell,</cell>
   <cell rows="2">Witnesses</cell>
   <cell>John Thom,</cell>

12. Another encoding

   <signed>Duncan Campbell</signed>,</cell>
  <cell rows="2">Witnesses</cell>
   <signed>John Thom</signed>,</cell>

Is <signed> structural or decorative?

13. Yet another encoding

<ab type="signed">Argyle</ab>
<ab type="signed">
   <cell>Duncan Campbell,</cell>
   <cell cols="2">Witnesses</cell>
   <cell>John Thom,</cell>

14. and still more encoding

 <list type="gloss" rend="braced-right">
     <name>Duncan Campbell</name>,</item>
     <name>Iohn Thom</name>,</item>

15. Choices, absurd choices

16. What do we find in attribute values?

ECCO values for @rend on <hi>
above, below, blackletterType, margDblQuotes, margQuotes, small, sub, sup
ECCO values for @rend on <gap>
_____, a, alphabet, blank〉, book〉, different, duplicate〉, from, in, inserted, left, letters〉, line, lines〉, line〉, math, missing〉, non-Latin, page, pages, page〉, paragraph〉, span
ECCO values for @type on <lg>
Psalm, address, air, airandduet, airandrecitative, answer, anthem, antistrophe, ballad, canto, catch, chorus, chorusandair, closer, duet, duetwithchorus, elegy, epigram, epigraph, epilogue, epistle, epitaph, epithalamicair, etc
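
Lists like these are easy to produce for any corpus. As a rough sketch (not the method used to compile the ECCO figures above), the following Python counts the values of one attribute on one element; the sample fragment is invented and un-namespaced for brevity, and `attribute_values` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment standing in for a corpus file.
SAMPLE = """<text>
  <p><hi rend="sup">a</hi> <hi rend="small">b</hi> <hi rend="sup">c</hi></p>
  <gap rend="page"/>
  <lg type="ballad"><l>...</l></lg>
</text>"""


def attribute_values(root, element, attribute):
    """Count the distinct values of one attribute on one element."""
    return Counter(
        el.get(attribute)
        for el in root.iter(element)
        if el.get(attribute) is not None
    )


root = ET.fromstring(SAMPLE)
rend_on_hi = attribute_values(root, "hi", "rend")
type_on_lg = attribute_values(root, "lg", "type")
print(rend_on_hi.most_common())
print(type_on_lg.most_common())
```

Running such a tally before writing any processing code shows quickly how open-ended the attribute values in a collection really are.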

17. However, there is also the good side

We have
  • An interchange format (XML)
  • An encoding (UTF-8)
  • A very rich descriptive vocabulary
  • A lot of agreement on semantics
  • A powerful environment for describing variations (ODD)

18. Taking a step back, some basic questions:

  • Why? Why do we care about our texts being interoperable?
  • What? Which parts, exactly, of our texts do we want to share?
  • How? What methods can we use to carry out our strategy?
and of course a follow-on question, which is
  • What next? What do we suggest that the community around the TEI does next? The answers may range from ‘nothing’ to ‘abandon the TEI and XML entirely’

19. Why?

If we don't want to share, why did we do the work in the first place? There does not seem much doubt that people want to combine their work with others to produce a bigger ‘answer’ than they can find on their own. But what exactly do they want to share and combine? I suggest there are four possible models of sharing and interoperability:
  • Bibliography
  • Extraction
  • Mapping
  • The words
Of course, these are not mutually exclusive, so there is also
  • Containing

20. Bibliography

We release enough information about our text to let others find it in the traditional way, eg author, title, date etc; and then they can simply read it. The markup is decoration aimed at describing the original presentation to let us produce facsimiles. The text markup must be detailed enough to act as typesetting instruction.

21. Extraction

We leave clues in our text which let someone comb through it and produce assertions about the real world. To take an extreme example, we may read the novel Jane Eyre, and establish that yes, Jane did marry Rochester. It is a fact about that fictional world. This model treats the text as a database for mining. We only need the data markup, and the structural markup to provide context.
<http://ota.ox.ac.uk/persname/mcturk> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/NET/crm-owl#E82_Actor_Appellation> .
<http://ota.ox.ac.uk/persname/mcturk> <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "McTurk" .
<http://ota.ox.ac.uk/persname/orrin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/NET/crm-owl#E82_Actor_Appellation> .
<http://ota.ox.ac.uk/persname/orrin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "Orrin" .
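
Triples of this shape could be mined from <persName> elements along the following lines. This is only a sketch under stated assumptions: the sample sentence is invented, the rule for building the identifier (lower-cased name appended to the base URI) is a guess made for illustration, and no escaping of unusual characters is attempted.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CRM = "http://purl.org/NET/crm-owl#"
BASE = "http://ota.ox.ac.uk/persname/"

# Invented, un-namespaced sample text.
SAMPLE = "<p><persName>McTurk</persName> met <persName>Orrin</persName>.</p>"


def persname_triples(root):
    """Return N-Triples lines typing and labelling each distinct <persName>."""
    lines = []
    seen = set()
    for el in root.iter("persName"):
        # Flatten nested markup and normalize whitespace.
        name = " ".join("".join(el.itertext()).split())
        # Assumption: the identifier is the lower-cased name with spaces
        # replaced by underscores; a real project would use curated IDs
        # and would escape quotes and non-ASCII characters properly.
        subject = BASE + name.lower().replace(" ", "_")
        if not name or subject in seen:
            continue
        seen.add(subject)
        lines.append(f"<{subject}> <{RDF}type> <{CRM}E82_Actor_Appellation> .")
        lines.append(f'<{subject}> <{RDF}value> "{name}" .')
    return lines


triples = persname_triples(ET.fromstring(SAMPLE))
print("\n".join(triples))
```

Note that the extraction only works because the encoder chose <persName>; had the same names been marked up as <name type="person"> or <rs type="name">, this code would find nothing.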

22. Mapping

We put markup in our text so that it can all be taken out and directly compared to another text. Every component of the text has a strict meaning. The markup is not describing the presentation but presenting the analysis. A linguistic or critical analysis might be typical of this approach. Everything must be structural, and have semantics.

23. The Words

There is quite a widespread view that in fact interoperability really deals with just the words. Strip out all that dirty markup describing presentation, and you have a stream of Unicode text which you can use anywhere for anything. Interpretation and analysis are dynamic and real time. Effectively, this is what Google Books offers. Scale is what matters. Markup must be precise enough not to be ambiguous about eg spaces between words, and line-breaking.
the tragedy of hoffman enter hoffman hence clouds of melancholy ile be no longer subiect to your sismes but thou deare soule whose nerues and artires in dead resoundings summon vp reuenge and thou shalt hate be but appeas'd sweete hearse the dead remembrance of my liuing father strikes ope a cur-taine where ap-peares a body.
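
The sample shows why line-breaking matters: "cur-taine" and "ap-peares" are single words split across lines. A minimal sketch of word extraction that respects this is given below. It assumes the encoder used TEI's <lb break="no"/> to mark a break inside a word; the sample fragment and the helper `plain_text` are illustrative, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented fragment: two words are split across line breaks.
SAMPLE = """<sp>
  <l>strikes ope a cur<lb break="no"/>
  taine where ap<lb break="no"/>
  peares a body.</l>
</sp>"""


def plain_text(el, out):
    """Append the text of el to the list out, honouring <lb break="no"/>."""
    if el.text:
        out.append(el.text)
    for child in el:
        if child.tag == "lb":
            if child.get("break") == "no":
                # The word continues over the line break: remove the
                # whitespace on both sides so the halves are joined.
                out[:] = ["".join(out).rstrip()]
                out.append((child.tail or "").lstrip())
            else:
                # An ordinary line break separates words.
                out.append(" " + (child.tail or ""))
        else:
            plain_text(child, out)
            if child.tail:
                out.append(child.tail)
    return out


def word_stream(xml_string):
    """Return the lower-cased, whitespace-normalized words of a fragment."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(plain_text(root, [])).lower().split())


stream = word_stream(SAMPLE)
print(stream)
```

Without the `break` attribute the code has no way of knowing whether to join the halves, which is exactly the precision the slide asks of the markup.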

24. Containing

We envisage the text as having two (or more) layers, the presentational layer, and the interpretative layer, and we do not mix the two. There are two distinct documents. Here we enter, of course, the long-standing issue of overlapping markup. A purely technical question, in some ways, but some approaches (stand-off markup, and non-XML solutions) make the layer distinctions clearer than others. The markup is very concerned to define enough granularity in the text to allow it to be referred to.

25. What?

Are we
  • only wanting the metadata header to make a MARC record
  • interpreting the existing markup to render it as a web page
  • targeting a specific subset of the markup to answer a question
  • …or do we just not care what the markup is

26. How?

If we want just words, or just metadata, we're pretty much OK.

But very often we have to interpret markup, and do something with it.

Experience in trying to maintain a generic family of TEI stylesheets suggests that this is not at all easy.

27. What shall we do? Abandon the whole nonsensical farrago of meaningful markup?

  • Use a word-processor to replicate format using entirely visual decoration
  • Leave it up to the natural language people to extract meaning
We know very well what is happening in an ancient Greek manuscript, without benefit of all the whitespace and typographic techniques we now take for granted.

28. What shall we do? Separate presentation and interpretation?

  • adopt the web model of HTML for making the text look like something, in a reasonably reusable way
  • decorate it with semantic clues which let us perform extraction of embedded assertions
This is what most of the world is up to.

29. Using HTML5

<div class="tei_body">

   <h1 itemprop="head">
    <span class="headingNumber">1. </span>
    <span itemprop="head" class="head">
     <span class="capitalize">STRANGE MEETING</span>
  <div itemprop="lg" class="stanza">
   <div class="l">It seemed
       that out of battle I escaped</div>
   <div class="l">Down some
       profound dull tunnel , long since
   <div class="l">Through
       granites which titanic wars had groined .

30. What shall we do? Go back to the dark ages?

  • extract interchange data
  • manage it in RDF or a conventional database
  • keep a loose bibliographical link back to the original text
  • reproduce old texts in eBook format

31. What shall we do? Keep the faith?

Believe that our complex markup does allow for
  • repeatable analysis
  • better archival recording of decisions
  • more objective descriptions
The TEI is better than HTML5 because
  • the vocabulary is richer
  • the structure is more complex
  • the specialist domains are much wider
  • the schema is more powerful
Unfortunately it has no default rendering model.

32. If we keep the faith, how can we resolve the TEI confusion?

  • Render down a TEI text into a simpler normalized form, much more tightly bound to a set of TEI elements which are explicitly used to record aspects we want to share.
  • Using this so-called 90/10 solution, each encoding project would then have to define a mapping between its own interesting use of TEI elements, and the Ur-elements.
  • The mapped form is a generated output, not an archival form.
  • The original is interchangeable, the simplified form is interoperable.
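
As a rough illustration of what such a project mapping might look like in practice, the sketch below rewrites two of the name encodings shown earlier into a single shared element. The mapping table, the choice of <persName> as the Ur-element and the function `normalize` are all assumptions made for this example, not an agreed TEI mechanism; namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical project mapping:
# (element, attribute, value) -> shared "Ur-element".
MAPPING = {
    ("name", "type", "person"): "persName",
    ("rs", "type", "name"): "persName",
}

SAMPLE = (
    '<p><name type="person">Edward George Bulwer-Lytton</name> and '
    '<rs type="name">Baron Lytton of Knebworth</rs></p>'
)


def normalize(root, mapping):
    """Rewrite project-specific elements to shared ones, in place."""
    for el in root.iter():
        for (tag, attr, value), target in mapping.items():
            if el.tag == tag and el.get(attr) == value:
                el.tag = target
                # The typing attribute is now implied by the element name.
                del el.attrib[attr]
                break
    return root


normalized = ET.tostring(
    normalize(ET.fromstring(SAMPLE), MAPPING), encoding="unicode"
)
print(normalized)
```

The output is generated, not archival: the project keeps its own encoding and regenerates the simplified form whenever the mapping changes.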

33. Implications of a TEI 90/10 model

We will have to find a way of rephrasing guidance like: ‘Alternatively, the <note> element may encode the text of the note at the point it occurs on the page or at another point convenient when converting from a born-digital source document, such as at the end of the containing <div> (or <div1>) or in a special <div> (or <div1>) element within <back>. The point of reference should be encoded using a <ref> or <ptr> element ’ …


34. Where do we want to get to?

We want a computer to be able to process a TEI text and easily recognize:
  1. The distinction between metadata and text: <teiHeader> and <text>
  2. Structural components which provide context for both formatting and extraction: <front>, <body>, <div>, <list>, <note> etc
  3. Inline strict and loose semantic markup: <hi>, <name>, <foreign>
  4. Typologies and links: @type, @ref, etc
  5. Editorial interpretation: <interp>, <s>, <lemma> etc
  6. Data: <name>, <date>, <placeName> etc
preferably distinguishing those which map to other vocabularies.

35. Detail: where to store mapping?

It is easy enough to look at the TEI <person> element and say it corresponds to what the CRM calls E21_Person, which it defines as follows:
This class comprises real persons who live or are assumed to have lived. Legendary figures that may have existed, such as Ulysses and King Arthur, fall into this class if the documentation refers to them as historical figures. In cases where doubt exists as to whether several persons are in fact identical, multiple instances can be created and linked to indicate their relationship. The CRM does not propose a specific form to support reasoning about possible identity. Examples: - Tut-Ankh-Amun - Nelson Mandela

The <equiv> element allows us to provide a specification in ODD which points from a TEI element to an external identifier, and says how to get there.

36. Example of <equiv>

<elementSpec ident="geo" mode="change">

  • @name names the underlying concept of which the parent is a representation
  • @uri references the underlying concept of which the parent is a representation by means of some external identifier
  • @filter references an external script which contains a method to transform instances
  • @mimeType gives the MIME media type of filter script
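
A processor could read these equivalences straight out of an ODD. The sketch below shows one way to do it in Python; the ODD fragment is invented for illustration (its @name, @filter and @mimeType values are guesses, only the CRM URI comes from this talk), and `equivalences` is a hypothetical helper. It relies on the attributes TEI defines for <equiv>: @name, @uri, @filter and @mimeType.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented ODD fragment; attribute values are illustrative only.
ODD = """<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="demo">
  <elementSpec ident="person" mode="change">
    <equiv name="E21"
           uri="http://purl.org/NET/crm-owl#E21_Person"
           filter="crm.xsl"
           mimeType="application/xslt+xml"/>
  </elementSpec>
  <elementSpec ident="hi" mode="change"/>
</schemaSpec>"""


def equivalences(odd_root):
    """Map each element name to the details of its <equiv>, if it has one."""
    table = {}
    for spec in odd_root.iter(TEI + "elementSpec"):
        eq = spec.find(TEI + "equiv")
        if eq is None:
            continue  # no declared equivalence for this element
        table[spec.get("ident")] = {
            "name": eq.get("name"),
            "uri": eq.get("uri"),
            "filter": eq.get("filter"),
            "mimeType": eq.get("mimeType"),
        }
    return table


table = equivalences(ET.fromstring(ODD))
print(table)
```

Keeping the mapping in the ODD means the same customization file documents the schema and tells a converter which elements have an external meaning and where the conversion script lives.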

37. What is in crm.xsl?

Named XSL templates which do creation of RDF XML:
<xsl:template name="E47">
   <value xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    <xsl:value-of select="."/>
<xsl:template name="E69">
       <xsl:call-template name="calc-date-value"/>

38. Input

<person xml:id="ArnMag01" sex="1" role="scholar">
 <persName xml:lang="is">Árni Magnússon</persName>
 <persName xml:lang="la">Arnas Magnæus</persName>
 <persName xml:lang="da">Arne Magnusson</persName>
 <birth when="1663-11-13">13 November 1663</birth>
 <death when="1730-01-07">7 January 1730</death>
  <date from="1663" to="1680">1663-1680</date>
   <settlement type="farm">Hvammur</settlement>
   <region type="county">Dalasýsla</region>
   <region type="compass">Western</region>
   <country key="IS">Iceland</country>
  <date from="1680" to="1683">1680-1683</date>
   <settlement type="institution">Skálholt</settlement>
   <region type="county">Árnessýsla</region>
   <region type="compass">Southern</region>
   <country key="IS">Iceland</country>

39. Result

<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

 <E21_Person xmlns="http://purl.org/NET/crm-owl#"

  <P131_is_identified_by xmlns="http://purl.org/NET/crm-owl#"

   <E82_Actor_Appellation xmlns="http://purl.org/NET/crm-owl#"

    <value xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
Árni Magnússon</value></E82_Actor_Appellation></P131_is_identified_by>
  <P98i_was_born xmlns="http://purl.org/NET/crm-owl#"

   <E67_Birth xmlns="http://purl.org/NET/crm-owl#"

    <P4_has_time-span xmlns="http://purl.org/NET/crm-owl#"

     <E52_Time-Span xmlns="http://purl.org/NET/crm-owl#"

      <P82_at_some_time_within xmlns="http://purl.org/NET/crm-owl#"

       <E61_Time_Primitive xmlns="http://purl.org/NET/crm-owl#"

        <value xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

40. Result (continued)

<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

 <E21_Person xmlns="http://purl.org/NET/crm-owl#"

  <P74_has_current_or_former_residence xmlns="http://purl.org/NET/crm-owl#"

   <E53_Place xmlns="http://purl.org/NET/crm-owl#"

    <P2_has_type xmlns="http://purl.org/NET/crm-owl#"

    <P87_is_identified_by xmlns="http://purl.org/NET/crm-owl#"

     <E48_Place_Name xmlns="http://purl.org/NET/crm-owl#"

      <value xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    <P89_falls_within xmlns="http://purl.org/NET/crm-owl#"
 <E53_Place xmlns="http://purl.org/NET/crm-owl#"

  <P2_has_type xmlns="http://purl.org/NET/crm-owl#"

  <P87_is_identified_by xmlns="http://purl.org/NET/crm-owl#"

   <E48_Place_Name xmlns="http://purl.org/NET/crm-owl#"

    <value xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

41. The ODD is the key

Our TEI metaschema gives us a location to place the rules about:
  • whether an element is to be regarded as significant in the 90/10 mapping process
  • whether the element is to be transformed for the mapping (eg <name type="person"> to <persName>)
  • which data category the element is in
  • whether its attributes are significant beyond local typology
Different schemas can be used to maintain several rule sets.

42. Different markup types

43. Conclusions

  • The TEI does not need defending or justifying. It is a mature, multi-tasking, markup scheme
  • The TEI has enough extra power to justify its use as a layer above HTML5 or RDF
  • It is also commonly non-interoperable
  • TEI's ODD gives you the way to express the relationship between public and private concerns
  • We need a clean agreement on how to distill interoperable TEI from interchangeable TEI

Sebastian Rahtz, Head of Information and Support Group, Oxford University Computing Services. Date: Mapping the Landscape of eResearch, Berlin, February 22nd 2012
Copyright University of Oxford