
1. The TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. The TEI Guidelines have become an accepted standard for digital text especially where there are concerns about long-term preservation, interchange, or interoperability. http://www.tei-c.org/

1.1. 1988 was a long time ago...

The TEI arose out of some meetings in 1987, and came into being in 1988. Around that time:
  • The first computer virus – Brain – appears, in the USA
  • Construction of the channel tunnel between England and France begins
  • The Soviet Union launches space station Mir
  • Olof Palme assassinated
  • Records of the year: Raising Hell (Run DMC) and Graceland (Paul Simon)

1.2. The world before the web

  • In your lab...
    • an IBM (or ICL, or Siemens, or Burroughs, or Univac, or...) mainframe
    • a Vax, or a Sun workstation
  • On your desk...
    • A PC running DOS 3.0 or a Mac running System 6? Maybe? If you were lucky?
    • WordPerfect or Winword?
    • Maybe a CD drive?
    • Maybe a network connection? If you worked in 'Science'?

1.3. ...but we did use computers then

  • Corpus linguistics
  • Databases on CD ROM
  • Large-scale lexical datasets (e.g. TLF, TLG, LASLA...)
  • Digital lexicography (e.g. OED)
  • Document management systems (e.g. TeX, Scribe, troff...)
  • Text archives (Oxford Text Archive had been around since 1976!)
  • Hypertext theory was all the rage

but no world wide web and not many desktop PCs...

1.4. Origins of the Text Encoding Initiative

  • Spring 1987: European workshops on standardisation of historical data (J.P. Genet, M Thaller)
  • Autumn 1987: NEH funds an exploratory international workshop on the feasibility of defining "text encoding guidelines"
Culprits:
Figure 1. Vassar College, Poughkeepsie

1.5. Today's question: Still useful?

  • So the TEI is very old!
  • It comes from a time before the Web, before the DVD, the mobile phone, cable tv, or Microsoft Word
  • Not much in computing survives 5 years, never mind more than 20
  • What relevance can it possibly have today?
  • Why is it still here, and how has it survived?

1.6. TEI organizational structures (1991)

1.7. TEI then...

  • Organized as a research project
  • Appointed editors at the centre, answerable to a steering committee, taking input from working committees

1.8. TEI organizational structures (today)

1.9. ... TEI now

  • Organized as a community effort
  • Elected Technical Council at the centre, answerable to electors, taking input from the community
  • Maintenance/Feature releases every 6 months

1.10. The TEI aims to be independent of schema language

The TEI encoding scheme is a framework providing:
  • definitions and names for several hundred useful textual distinctions
  • a set of modules that can be used to generate schemas making those distinctions
  • a customization mechanism for modifying and combining those definitions with new ones using the same conceptual model
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats

1.11. Relevance of the TEI

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

1.12. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

1.13. Being a good digital citizen

  • XML implies Unicode; but the TEI also provides markup for non-Unicode characters and glyphs
  • TEI schemas can be generated for
    • Traditional XML DTD language
    • ISO RELAX NG language
    • W3C Schema Language
  • TEI content models use (an interoperable subset of) RELAX NG syntax
  • TEI datatypes are defined in terms of W3C datatypes
  • All linking and pointing uses W3C standards
  • Additional constraints may be expressed in ISO Schematron or similar
  • Hooks are provided for mapping to other ontological frameworks
  • Namespaces are fully supported to help mix vocabularies

1.14. If TEI is so marvellous, why isn't everyone using it?

There are two kinds of reason why standards fail...
  • the theory is not yet ripe
  • "not invented here": the community of users is too diverse

1.15. Coping with partially-baked ideas

In a TEI ODD, you can
  • constrain the domain of a value list more tightly
  • enforce schematron rules about e.g. co-dependency
  • remove (non-mandatory) child elements
  • add new elements in your own namespace

You can develop and test your theory while remaining TEI conformable

1.16. Not Invented Here?

  • TEI P5 has extensive I18N features for translation of
    • schema objects
    • schema documentation
  • Cf ROMA at http://www.tei-c.org/Roma/
  • TEI is hospitable to other namespaces
    • so you can use SVG for graphics, MathML for math, Word Table markup if you like
  • ODD also includes an <equiv> element for mapping to external ontologies

1.17. Digital data, digital text...

Digital texts are only metaphorically books

... but this metaphor is so pervasive it affects our capacity to profit from them.

1.18. What's that noise in the digital library?

  • A digital edition should represent the intentions and meaning of a text, not simply its appearance
  • Otherwise, there can be no analysis beyond the documentary level, no "conversation between books"

1.19. TEI Chapters (1)

In addition to Front Matter and Back Matter, the TEI Guidelines contain chapters on:
  • 1. The TEI Infrastructure
  • 2. The TEI Header
  • 3. Elements Available in All TEI Documents
  • 4. Default Text Structure
  • 5. Representation of Non-standard Characters and Glyphs
  • 6. Verse
  • 7. Performance Texts
  • 8. Transcriptions of Speech
  • 9. Dictionaries
  • 10. Manuscript Description
  • 11. Representation of Primary Sources
  • 12. Critical Apparatus
...

1.20. TEI Chapters (2)

...
  • 13. Names, Dates, People, and Places
  • 14. Tables, Formulæ, and Graphics
  • 15. Language Corpora
  • 16. Linking, Segmentation, and Alignment
  • 17. Simple Analytic Mechanisms
  • 18. Feature Structures
  • 19. Graphs, Networks, and Trees
  • 20. Non-hierarchical Structures
  • 21. Certainty and Responsibility
  • 22. Documentation Elements
  • 23. Using the TEI

2. Markup

In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts. When we talk about text encoding, what do we mean by a text? What is in a text and what assumptions do we make in reading them?

2.1. What's in a text?

2.2. What's in a text (2)?

BL MS Cotton Vitellius A xv, fol. 129r

2.3. What's in a text (3)?

2.4. Are these images of the same thing?

2.5. Are these images of the same thing?

2.6. A text is not a document

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "document" is something that exists in the world, which we can digitize.

A "text" is an abstraction, created by or for a community of readers (which we can encode).

2.7. Encoding of texts

  • A text is more than a sequence of encoded glyphs or lexical tokens
    • It has a structure and a communicative function
    • It also has multiple possible readings
  • Encoding, or markup, is a way of making these things explicit

Only that which is explicit can be reliably processed

2.8. What's the point of markup?

  • To make explicit (to a machine) what is implicit (to a person)
  • To add value by supplying multiple (sometimes competing) annotations
  • To facilitate re-use of the same material
    • in different formats
    • in different contexts
    • by different users

2.9. Styles of markup

  • In the beginning there was procedural markup
    RED INK ON; print balance; RED INK OFF
  • which being generalised became descriptive markup
    <balance type='overdrawn'>some numbers</balance>
  • also known as encoding or annotation

descriptive markup allows for easier re-use of data

2.10. Some more definitions

  • Markup makes explicit the distinctions we want to make when processing a string of bytes
  • Markup is a way of naming and characterizing the parts of a text in a formalized way
  • It's (usually) more useful to mark up what we think things are than what they look like

2.11. Separation of form and content

  • Presentational markup cares more about fonts and layout than meaning
  • Descriptive markup says what things are, and leaves the rendition of them for a separate step
  • Separating the form of something from its content makes its re-use more flexible
  • It also allows easy changes of presentation across a large number of documents
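
A minimal sketch of this separation, in Python: one descriptively marked-up value is rendered in two different ways without touching the source. The amount -42.50, the function names, and the choice of red text are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: says what the value is, not how it should look.
# The amount is a hypothetical stand-in for "some numbers".
SOURCE = "<balance type='overdrawn'>-42.50</balance>"


def render_html(xml_text: str) -> str:
    """Rendition 1: for the web, an overdrawn balance is shown in red."""
    elem = ET.fromstring(xml_text)
    colour = "red" if elem.get("type") == "overdrawn" else "black"
    return f'<span style="color:{colour}">{elem.text}</span>'


def render_plain(xml_text: str) -> str:
    """Rendition 2: for plain text, the same fact is stated in words."""
    elem = ET.fromstring(xml_text)
    suffix = " (OVERDRAWN)" if elem.get("type") == "overdrawn" else ""
    return f"{elem.text}{suffix}"


print(render_html(SOURCE))
print(render_plain(SOURCE))
```

Changing the presentation of every overdrawn balance means editing one rendering function, not every document.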

2.12. Markup as a scholarly activity

  • The application of markup to a document can be an intellectual activity
  • In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor
  • There is (almost) no such thing as neutral markup -- all of it involves interpretation
  • Markup can assist in answering research questions, and deciding what markup is needed to enable such questions to be answered can be a research activity in itself
  • Good textual encoding is never as easy or quick as people would believe
  • Detailed document analysis is needed before encoding for the resulting markup to be useful

2.13. What does markup capture?

Compare
<hi rend="dropcap">H</hi>WÆT WE GARDE <lb/>na in
gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum
meodo-setl
<add>a</add>
<lb/>of<damage>
 <desc>blot</desc>
</damage>teah ...
and
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>
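
One way to see the difference is to ask each encoding a question. The Python sketch below counts what each one makes explicit. The wrapper elements <ab> and <text> are added here only so that each fragment parses as a single document; they are an assumption, not part of the samples.

```python
import xml.etree.ElementTree as ET

# First encoding: records the manuscript page (line breaks, additions, damage).
DOCUMENTARY = """<ab><hi rend="dropcap">H</hi>WÆT WE GARDE <lb/>na in
gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum
meodo-setl
<add>a</add>
<lb/>of<damage>
 <desc>blot</desc>
</damage>teah ...</ab>"""

# Second encoding: records the poem (line groups and metrical lines).
VERSE = """<text>
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>
</text>"""

page = ET.fromstring(DOCUMENTARY)
poem = ET.fromstring(VERSE)

# Questions the documentary encoding can answer
line_breaks = sum(1 for _ in page.iter("lb"))
additions = [a.text for a in page.iter("add")]

# Questions the verse encoding can answer
lines_per_group = [len(lg.findall("l")) for lg in poem.findall("lg")]
first_line = poem.find("lg/l").text

print(line_breaks, additions, lines_per_group, first_line)
```

Neither encoding can answer the other's questions: the first knows nothing about verse lines, the second nothing about scribal additions.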

2.14. Schemas and namespaces

  • A namespace is one way of specifying the meaning of the markup introduced in a document: like a dictionary
  • A more powerful way is to use a schema: a kind of grammar for your markup
  • A namespace tells you who defined or claims this element, but not much more about how to use it
  • A schema tells you how you are supposed to use an element, and thus allows you to validate your documents against a set of rules
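
A small Python sketch of the namespace point: the same local name, title, is claimed by two different vocabularies, and the namespace tells them apart. The sample document is hypothetical and is not claimed to be valid TEI; the two namespace URIs are the published ones for TEI P5 and SVG.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical fragment mixing two vocabularies
DOC = f"""<p xmlns="{TEI_NS}">See the diagram:
  <svg xmlns="{SVG_NS}"><title>A circle</title></svg>
  <title>Beowulf</title>
</p>"""

root = ET.fromstring(DOC)

# ElementTree reports each element name as "{namespace}local-name",
# so we can record which namespace claims each local name.
owners = {}
for elem in root.iter():
    namespace, local_name = elem.tag[1:].split("}")
    owners.setdefault(local_name, []).append(namespace)

for local_name, namespaces in owners.items():
    print(local_name, namespaces)
```

The namespace says who defined each title, but nothing about where either may appear; that is the job of a schema.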

2.15. What can a schema do for you?

  • ensure that your documents use only predefined elements, attributes, and entities
  • enforce structural rules such as ‘every chapter must begin with a heading’ or ‘recipes must include an ingredient list’
  • make sure that the same thing is always called by the same name

Schema languages vary in the amount of validation they support. In TEI you create the schema by describing how you want to customize the TEI.
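
To make the idea concrete, here is a hand-written Python check of two such rules. It is only a stand-in for what a real schema (DTD, RELAX NG, or W3C Schema) does declaratively; the element names book, chapter, head, and p are a hypothetical vocabulary, not the TEI's.

```python
import xml.etree.ElementTree as ET

# Hypothetical closed vocabulary: only these element names are allowed
ALLOWED = {"book", "chapter", "head", "p"}


def validate(xml_text: str) -> list[str]:
    """Return a list of rule violations; an empty list means 'valid'."""
    root = ET.fromstring(xml_text)
    problems = []

    # Rule 1: use only predefined elements
    for elem in root.iter():
        if elem.tag not in ALLOWED:
            problems.append(f"unknown element <{elem.tag}>")

    # Rule 2: every chapter must begin with a heading
    for number, chapter in enumerate(root.iter("chapter"), start=1):
        children = list(chapter)
        if not children or children[0].tag != "head":
            problems.append(f"chapter {number} does not begin with a heading")

    return problems


GOOD = "<book><chapter><head>One</head><p>Text</p></chapter></book>"
BAD = "<book><chapter><p>Text</p><heading>Late</heading></chapter></book>"

print(validate(GOOD))
print(validate(BAD))
```

A schema language expresses the same rules as a grammar, so that any validating editor or parser can enforce them without custom code.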

2.16. A useful mental exercise

Imagine you are going to mark up several thousand pages of complex material....
  • Which features are you going to mark up?
  • Why are you choosing to mark up this feature?
  • How reliably and consistently can you do this?

Now, imagine your budget has been halved. Repeat the exercise!

3. TEI@Oxford

Digital Humanities in Oxford

  • The Bodleian Library digitize texts, make manuscript catalogues, and host a Google team scanning large parts of the library holdings
  • The Humanities Faculty has a large number of research projects which use digital texts and resources in history, archaeology, linguistics, and literature
  • The Oxford eResearch Centre hosts a number of research projects in the humanities, e.g. in classical art, in machine vision, and in development of a Virtual Research Environment for the Humanities
  • Oxford is active in European infrastructural initiatives such as CLARIN, DARIAH, InterEdition, and of course the TEI
  • At the Computing Services (OUCS) we provide support and expertise in areas including text encoding and analysis, corpus linguistics, digital archives and open source licensing

3.1. TEI@Oxford

There are a few projects which are representative of the work we do, focussing on tools we have developed for them:

  • Text/database hybrid: William Godwin's Diary
  • Text comparison: The Holinshed project
  • Critical Editions: The Wandering Jew's Chronicle
  • Document Conversion: TEI ISO and OxGarage

3.2. Digital Diaries

  • William Godwin:
    • 1756-1836, philosopher, writer, political activist,
    • husband of Mary Wollstonecraft, father of Mary Godwin (aka Wollstonecraft Shelley).
  • Inter-Departmental:
    • Politics, Statistics, Computing Services, Bodleian Library, etc.
  • Objectives:
    • research to identify people mentioned in the 48 years of diary;
    • provide a searchable cross-referenced electronic edition alongside digital images of the diary
  • Started in October 2007, it coincided with initial release of TEI P5 and immediately benefited from new features. The final website is about to launch!

3.3. Diary

3.4. Diary + XML

3.5. The Godwin ODD

  • A customised TEI ODD was created for the Godwin project; this included:
    • removal of many TEI modules and elements
    • addition of new syntactic sugar elements
    • providing closed attribute value lists
  • ODD provides:
    • schemas (DTD, RelaxNG, W3C Schema)
    • project specific documentation
    • option for internationalisation
    • useful assistance in chosen editor
  • The results are canonicalised back to 'pure' TEI.

3.6. <person>, <persName>, and people

  • One objective was to link c. 64000 instances of <persName> elements (with around 10000 distinct values) to the right <person> elements, stored in separate files in an XML database
  • Several researchers collaborate in the process
  • Problems of ambiguity and multiple @ref values...
    <persName ref="#BR01 #BR02 #BR03">The
    Browns</persName>
  • Person records follow a strictly limited template
  • All files are stored in an SVN-managed project website
  • A similar system is used for identifying places and cited titles
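
A sketch, in Python, of how a multi-valued @ref might be resolved against person records. The PEOPLE table, its labels, and the surrounding sentence are hypothetical; in the project the records live in separate files in an XML database.

```python
import xml.etree.ElementTree as ET

# Hypothetical person records keyed by identifier
PEOPLE = {
    "BR01": "Brown (first person)",
    "BR02": "Brown (second person)",
    "BR03": "Brown (third person)",
}

# The persName is from the slide; the wrapping sentence is invented
ENTRY = '<p>Dine with <persName ref="#BR01 #BR02 #BR03">The Browns</persName></p>'


def resolve_refs(pers_name, people):
    """Split a whitespace-separated @ref and look up each identifier.

    Returns (found_records, missing_ids) so that dangling pointers can be
    reported to the editors rather than silently dropped.
    """
    ids = [ref.lstrip("#") for ref in pers_name.get("ref", "").split()]
    found = [people[i] for i in ids if i in people]
    missing = [i for i in ids if i not in people]
    return found, missing


pers_name = ET.fromstring(ENTRY).find("persName")
found, missing = resolve_refs(pers_name, PEOPLE)
print(pers_name.text, "->", found, "missing:", missing)
```

Reporting the missing identifiers separately is a design choice that suits proofreading: a typo in @ref shows up as a list entry instead of a lost link.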

3.7. <person> and <persName>

3.8. Transforming Godwin

  • Unlike with many editions, the researchers here are more interested in social relationships, frequencies of contact, and statistics
  • Lists of element/attribute combinations produced to assist proofreading and editorial standards
    • lists by frequency
    • lists by distinct-value
    • lists by year
  • XQuery in eXist is used for front-facing site
  • For project site: XSLT2-based grouping with xsl:for-each-group to create statistical lists
  • Transformations to CSV for people in Statistics Department interested in networked relations of contacts in meetings
  • Final site launching soon at: http://godwindiary.bodleian.ox.ac.uk/
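
The project itself does this grouping in XSLT 2 and XQuery. As a rough illustration of the same idea, the Python sketch below counts person mentions per year and writes them out as CSV. The <diary> and <entry> elements, the @when values, and the column names are assumptions made for the example, not the project's actual markup.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical diary fragment, not real Godwin data
DIARY = """<diary>
 <entry when="1796-01-01"><persName ref="#BR01">Brown</persName> calls</entry>
 <entry when="1796-01-02"><persName ref="#BR01 #BR02">The Browns</persName> dine</entry>
 <entry when="1797-03-05"><persName ref="#BR02">Brown</persName> at tea</entry>
</diary>"""


def contacts_by_year(xml_text):
    """Count mentions grouped by (year, person identifier)."""
    counts = Counter()
    for entry in ET.fromstring(xml_text).iter("entry"):
        year = entry.get("when")[:4]
        for pers in entry.iter("persName"):
            for ref in pers.get("ref", "").split():
                counts[(year, ref.lstrip("#"))] += 1
    return counts


def to_csv(counts):
    """Serialise the counts as CSV text, sorted by year and then person."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["year", "person", "mentions"])
    for (year, person), n in sorted(counts.items()):
        writer.writerow([year, person, n])
    return out.getvalue()


print(to_csv(contacts_by_year(DIARY)))
```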

3.9. Holinshed's Chronicles and the TEI Comparator

  • The project aims to produce a print, old-spelling, annotated critical edition of Holinshed's Chronicles of England, Scotland, and Ireland
  • Two original source editions exist: 1577 and 1587 (revised and significantly expanded)
  • An electronic full-text copy of 1587 exists in EEBO-TCP
  • EEBO-TCP was commissioned to create an electronic full-text version of 1577 edition
  • We helped to fuzzy-match paragraphs from one to another, using stand-off linking, to assist them in making their comparisons

3.10. These are not modern books!

3.11. Getting clean texts

The EEBO project provided the two texts in SGML markup.

We converted them to XML that is fully compliant with version 5 of the Text Encoding Initiative Guidelines.

The result includes some notion of typeface change, the marginal notes and figures, and indications of illegibility in the source:
<p
  cid="sid-ce7f3bd0-f920-4476-9977-43d499fb9c6b">
But to
procéede, when the ſayde <hi>Albion</hi> had gouerned here
in this .... to be re|membred.<note place="marg">Leſtrigo.</note> It happened in tyme of <hi>Lucus</hi>
king of the Celtes, that Leſtrigo and his iſſue ...
conceyued this opinion, that if they had once gotten foote
into any Re|gion whatſoeuer, it woulde not be long ere they
did by ſome meanes or other,<note place="marg">
  <hi>Ianige<gap extent="+1" unit="letters">
    <desc>illegible</desc>
   </gap>
  </hi> the po<gap extent="+1" unit="letters">
   <desc>illegible</desc>
  </gap>|ty of <gap extent="1" unit="word">
   <desc>illegible</desc>
  </gap> lying in Italy.</note> not onelye eſtabliſhe their
ſeates, but </p>

3.12. Deciding what to compare

Luckily, Holinshed is relatively simple in structure. We can take paragraphs as the main unit of comparison, leaving aside things such as headings:
<div1 type="dedication">
 <pb n="2"/>
 <head>TO THE RIGHT Honorable and his ſingular good Lorde,
   Sir VVilliam Cecill, Baron of Burghleygh, Knight of
 <hi>the most noble order of the Garter, Lord high
     Treaſou|rer</hi> of England, Maiſter of the Courtes of
   Wardes and Lyueries, and one of the Queenes Maieſties
   priuie Counſell.</head>
 <p
   cid="sid-c3e471d1-cdb9-46fb-9433-5e9bb56201b5">

  <hi>_COnſidering with my ſelfe,</hi> right Honorable and
   my ſin|gular good Lorde, how ready (no doubt) many wil be
   to ac|cuſe me of vayne preſumptio~, for enterpriſing to
   deale in this ſo weighty a worke, and ſo farre aboue my
   reache to ac|compliſh: [.....]</p>
</div1>
although this leaves the problem of one paragraph in the first edition being split into many in the second.

3.13. Establishing a notation for storing results of comparison

We had two choices
  1. Maintain one master text with embedded links to the other text
  2. Keep the two texts separate, and store all the linkages in a third text, referring to uniquely-identified points in the two main texts
<linkGrp type="TEI-Comparator">
 <link
   xml:id="sid-426e7975-b861-4368-b0fe-321612c97171"
   targets="1577.xml#sid-06ed72ca-cfb3-4b8d-999c-6350c43065ae 1587.xml#sid-23f5aec2-e996-4e00-ad5d-f67199c93170"
   type="match"/>

 <link
   xml:id="sid-01cab7e2-035a-46c6-b8d5-4fe6bf89ad33"
   targets="1577.xml#sid-381bfa62-436f-4d98-95da-e0471ccc9218 1587.xml#sid-e52eab7e-9fd4-4672-8d02-875e292a2bf2"
   type="match"/>

 <link
   xml:id="sid-3afffe4d-c974-49ec-b776-5e876aaf1dc5"
   targets="1577.xml#sid-9739786a-798f-453b-8b5b-fac15d2be815 1587.xml#sid-9972ad3d-5b66-46d1-a3f2-afd8f1ce243b"
   type="match"/>

</linkGrp>
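
A sketch of how a program might read such a stand-off link file, in Python. Only the first link from the sample is used; the function name and the shape of the returned records are illustrative assumptions, not the Comparator's own code.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

LINKS = """<linkGrp type="TEI-Comparator">
 <link
   xml:id="sid-426e7975-b861-4368-b0fe-321612c97171"
   targets="1577.xml#sid-06ed72ca-cfb3-4b8d-999c-6350c43065ae 1587.xml#sid-23f5aec2-e996-4e00-ad5d-f67199c93170"
   type="match"/>
</linkGrp>"""


def read_links(xml_text):
    """Turn each <link> into a record of (document, fragment id) pointers."""
    records = []
    for link in ET.fromstring(xml_text).iter("link"):
        pointers = []
        for target in link.get("targets").split():
            document, fragment = target.split("#", 1)
            pointers.append((document, fragment))
        records.append(
            {"id": link.get(XML_ID), "type": link.get("type"), "targets": pointers}
        )
    return records


for record in read_links(LINKS):
    print(record["type"], record["targets"])
```

Because the links live in a third file, neither edition has to be modified when a match is made or unmade.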

3.14. Working out a system to make comparisons

First, we simplify the input to remove variability between texts:
  • replace long s with short s
  • replace vv with w
  • replace ~ with n
  • replace non-letters with spaces
  • remove vowels except for u
  • transform u into v
  • remove double t, p, s, r, n, m
  • replace sc with s
Then we take one text, walk over each identified unit, find all the words in it (using whitespace as the separator), and look for units anywhere in the other text which share a reasonable number of the same words.

Remember that we have to search the whole of the other text each time, so the processing is quite expensive!
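
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Comparator's actual code: lower-casing the input, reading "remove double" as "collapse to a single letter", and the 0.5 threshold standing in for "a reasonable number" are all guesses.

```python
import re


def normalise(text: str) -> str:
    """Apply the simplification steps in the order listed above."""
    text = text.lower()                        # assumption: case is ignored
    text = text.replace("ſ", "s")              # long s -> short s
    text = text.replace("vv", "w")             # vv -> w
    text = text.replace("~", "n")              # ~ -> n
    text = re.sub(r"[\W\d_]", " ", text)       # non-letters -> spaces
    text = re.sub(r"[aeio]", "", text)         # remove vowels except u
    text = text.replace("u", "v")              # u -> v
    text = re.sub(r"([tpsrnm])\1+", r"\1", text)  # collapse doubled consonants
    text = text.replace("sc", "s")             # sc -> s
    return text


def words(text: str) -> set[str]:
    """Whitespace-separated normalised word forms of one unit."""
    return set(normalise(text).split())


def best_matches(unit: str, candidates: dict[str, str], threshold: float = 0.5):
    """Return (candidate id, score) pairs, best first.

    The score is the share of the unit's word forms found in the candidate.
    Every candidate is examined for every unit, which is why the full
    comparison is expensive.
    """
    target = words(unit)
    scored = []
    for candidate_id, text in candidates.items():
        score = len(target & words(text)) / len(target) if target else 0.0
        if score >= threshold:
            scored.append((candidate_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical units: the second spelling of "a" is modernised for the example
unit_1577 = "when the ſayde Albion had gouerned here"
units_1587 = {
    "a": "When the said Albion had governed here",
    "b": "It happened in tyme of Lucus",
}
print(best_matches(unit_1577, units_1587))
```

Here "gouerned" and "governed" both reduce to "gvrnd", so the two spellings count as the same word, while "ſayde" and "said" still differ ("syd" against "sd"): the normalisation narrows the gap between editions but does not close it.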

3.15. Providing an interface to allow links to be made and unmade

The application lets you
  • navigate the structure of either edition
  • for a unit, ask to see matches from the other edition
  • confirm and annotate one of those matches
  • join two units by manual search
  • see page image

One of the more problematic issues is the navigation by structure rather than the more familiar (but not formally recorded) reference by year.

3.16. Showing the results nicely for scholars

The Comparator shows one fully formatted text beside overlapping passages in another text

3.17. The web static display

shows one fully formatted text, with the ability to toggle between versions

3.18. Questions

  • Is this on the desktop or the web?
    • It is a web application, to avoid any platform or installation issues
  • Does it work with other texts?
    • Yes, it should, though we have not done much testing. Each new set of texts would need a new configuration explaining how the text is organized
  • Is the software available?
    • Yes, from http://tei-comparator.sourceforge.net/

3.19. Wandering Jew's Chronicle

3.20. Wandering Jew's Chronicle

  • 15 Early-Modern Witnesses
  • Need for on-the-fly collation
  • User-centred image comparison/overlay between woodcuts
  • Witness metadata, transcription, scholarly apparatus

3.21. TEI tools 1

A major objective in making large collections of TEI texts is to analyse them:
  • searching for words and phrases as objects of interest in their own right (cf Google)
  • identifying patterns of linguistic usage
  • ‘intelligent search’ paying attention to associated markup and metadata
  • but not requiring complex markup

3.22. TEI tools 2

People need the ability to display TEI-encoded texts as well as search or analyse them

  • Hence we are interested in converting between TEI XML and other formats, including
    • HTML (relatively easy, using XSLT)
    • PDF (converting to XSL FO or LaTeX and then running a formatter)
    • OpenOffice (plug in filter using XSL)
    • Word 2007 (external XSL scripts and Java library)
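
The conversions listed above are done with XSLT. Purely to show why the HTML case is relatively easy, here is a rough Python stand-in that renames a few elements. The tag mapping, the fallback to span, and the dropping of attributes are illustrative choices, not the behaviour of the TEI stylesheets; the sample sentence uses modernised spelling and no namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI element names to HTML element names
TAG_MAP = {"p": "p", "hi": "em", "head": "h2", "note": "small"}


def tei_to_html(xml_text: str) -> str:
    """Rename each element via TAG_MAP; unknown elements become <span>."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, "span")
        elem.attrib.clear()  # drop TEI attributes such as @place
    return ET.tostring(root, encoding="unicode", method="html")


SAMPLE = '<p>when the sayde <hi>Albion</hi> had gouerned<note place="marg">Lestrigo.</note></p>'
print(tei_to_html(SAMPLE))
```

Going in the other direction is harder, because presentational formats do not record what an italic phrase or a marginal block actually is.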

3.23. TEI ISO

  • Project with ISO, OUCS, TEI, Brigham Young University, and the Max Planck Digital Library
  • International Organization for Standardization (ISO): production of standards documents
  • TEI ODD for documenting a schema to store these standards as TEI XML
  • A suite of XSLT to allow lossless conversion to and from various word-processing systems (e.g. MS Word)
  • This ability to round-trip from presentational markup is now also available to other users

3.25. Vesta, a desktop processor

A Java application which can
  • process TEI ODD schema definitions (like the Roma web application)
  • convert TEI XML files to — and from — supported formats
  • support multiple profiles

3.26. Where can I find out more?



James Cummings. Date: 2010-10
Copyright University of Oxford