

1. Introduction to the Workshop

James Cummings works for the Oxford Text Archive, has a Ph.D. in Medieval Studies, and is a member of the TEI Technical Council.

Dot Porter is the Program Coordinator at the Collaboratory for Research in Computing for Humanities at the University of Kentucky and is a member of the TEI Technical Council.

All slides and materials for this workshop are freely available for re-use at: http://tei.oucs.ox.ac.uk/Oxford/2007-10-31-MMWorkshop/

1.1. Aims

The aims of this workshop are:
  • to provide a general understanding of TEI P5, highlighting major changes from TEI P4
  • to survey some of the more popular/important chapters in the TEI P5 Guidelines
  • to provide exercises in editing some TEI P5 XML and customising the TEI with Roma
  • to suggest some of the possible ways of editing, publishing and accessing TEI XML
  • to give students a chance to ask their own questions relating to their own work
We will try to accommodate both those who are new to TEI XML and those who are already familiar with TEI P4, but we do assume some basic knowledge. If in doubt, ask questions!

1.2. Timetable (1)

The timetable for the workshop is as follows:
Time Topic (Speaker)
09:00-09:30 Introduction and Basic TEI P5 Overview and Infrastructure (JC & DP)
09:30-10:00 Overall P4 to P5 Changes and Conversion to P5 (JC)
10:00-10:30 Default Text Structure and Header (DP)
10:30-11:00 Core TEI Elements, Non-Standard Characters (JC)
11:00-11:15 Coffee
11:15-11:30 Editing Options for TEI Users (DP)
11:30-12:15 Exercise: Editing Some TEI P5 XML (DP leads)
12:15-13:15 Lunch

1.3. Timetable (2)

The timetable continues after lunch:
13:15-13:45 Names, Dates, People and Places (JC)
13:45-14:15 Representation of Primary Sources (DP)
14:15-14:45 Using and Customizing the TEI: Documentation Elements, Roma and ODD (JC)
14:45-15:15 Exercise: Using Roma to Customize the TEI (JC leads)
15:15-15:30 Coffee
15:30-16:00 Manuscript Description (DP)
16:00-16:15 Extremely Brief Summary of Some Other Important Chapters (JC)
16:15-16:30 Publishing/Accessing TEI Documents (JC)
16:30-17:30 Conclusion and Group Discussion followed by Questions? (JC & DP)

2. Text, Markup, XML and the TEI

In order to introduce you to Text Encoding Initiative P5 XML, we need to make sure you have at least some basic understanding of texts, markup and XML.

2.1. What's in a text?

2.2. What's in a text?

2.3. The ontology of text

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "text" is an abstraction, created by or for a community of readers. Markup encodes and makes concrete such abstractions.

2.4. Encoding of texts

  • Texts are more than sequences of encoded glyphs
    • They have structure and content
    • They also have multiple readings
  • Encoding, or markup, is a way of making these things explicit
  • Only that which is explicit can be reliably processed

2.5. Styles of markup

  • In the beginning there was procedural markup
    RED INK ON; print balance; RED INK OFF
  • which, being generalised, became descriptive markup
    <balance type='overdrawn'>some numbers</balance>
  • also known as encoding or annotation

Descriptive markup allows for easier re-use of data

2.6. Some more definitions

  • Markup makes explicit the distinctions we want to make when processing a string of bytes
  • Markup is a way of naming and characterizing the parts of a text in a formalized way
  • It's (usually) more useful to mark up what we think things are than what they look like
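
The last point can be illustrated with a small fragment. This is only a sketch: the sentence is invented for illustration, and the element names (<hi>, <title>, <foreign>) are taken from the TEI core module. Both paragraphs would look the same on the page, but only the second records why the words are in italics.

```xml
<!-- Marking up what it looks like: two italic phrases, indistinguishable to a machine -->
<p>She was reading <hi rend="italic">Emma</hi> with a certain
 <hi rend="italic">joie de vivre</hi>.</p>

<!-- Marking up what we think it is: a title and a foreign phrase -->
<p>She was reading <title>Emma</title> with a certain
 <foreign xml:lang="fr">joie de vivre</foreign>.</p>
```

With the second encoding a stylesheet can still render both phrases in italics, but an index of titles or a search for French phrases also becomes possible.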

2.7. What's the point of markup?

  • To make explicit (to a machine) what is implicit (to a person)
  • To add value by supplying multiple annotations
  • To facilitate re-use of the same material
    • in different formats
    • in different contexts
    • by different users

2.8. Some alphabet soup

SGML Standard Generalized Markup Language
HTML Hypertext Markup Language
W3C World Wide Web Consortium
XML eXtensible Markup Language
DTD Document Type Definition (or Declaration)
CSS Cascading Style Sheet
XPath XML Path Language
XSLT eXtensible Stylesheet Language - Transformations
XQuery XML Query Language
RELAX NG Regular Language for XML, Next Generation

Oh, and then there's also TEI, the Text Encoding Initiative

2.9. XML: what it is and why you should care

  • XML is structured data represented as strings of text
  • XML looks like HTML, except that:-
    • XML is extensible
    • XML must be well-formed
    • XML can be validated
  • XML is application-, platform-, and vendor- independent
  • XML empowers the content provider and facilitates data integration

2.10. XML terminology

An XML document may contain:-
  • elements, possibly bearing attributes
  • processing instructions
  • comments
  • entity references
  • marked sections (CDATA in the document itself; IGNORE and INCLUDE only inside a DTD)

An XML document must be well-formed and may be valid

2.11. The rules of the XML Game

  • An XML document represents a (kind of) tree
  • It has a single root and many nodes
  • Each node can be
    • a subtree
    • a single element (possibly bearing some attributes)
    • a string of character data
  • Each element has a type or generic identifier
  • Attribute names are predefined for a given element; values can also be constrained

2.12. Representing an XML tree

  • An XML document is encoded as a linear string of characters
  • It usually begins with an XML declaration, which looks like a special processing instruction
  • Element occurrences are marked by start- and end-tags
  • The characters < and & are Magic and must always be "escaped" if you want to use them as themselves
  • Comments are delimited by <!-- and -->
  • CDATA sections are delimited by <![CDATA[ and ]]>
  • Attribute name/value pairs are supplied on the start-tag and may be given in any order
  • Entity references are delimited by & and ;
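
As a rough illustration of these conventions, the following small document uses each of them once. The element and attribute names are invented for the example and carry no special meaning.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment: the processor ignores it -->
<note place="margin" type="gloss">
 <!-- &lt; and &amp; are entity references standing for the literal characters < and & -->
 <p>If a &lt; b &amp; b &lt; c, then a &lt; c.</p>
 <!-- Inside a CDATA section nothing is treated as markup, so no escaping is needed -->
 <p><![CDATA[If a < b & b < c, then a < c.]]></p>
</note>
```

Writing the start-tag as <note type="gloss" place="margin"> would mean exactly the same thing, since attribute order is not significant.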

2.13. XML syntax: the small print

What does it mean to be well-formed?

  1. there is a single root node containing the whole of an XML document
  2. each subtree is properly nested within the root node
  3. names are always case sensitive
  4. start-tags and end-tags are always mandatory (except that a combined start-and-end tag may be used for empty nodes)
  5. attribute values are always quoted
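
A minimal sketch of a document that obeys all five rules, with each rule pointed out in a comment. The element names are made up for the example; well-formedness does not depend on any schema.

```xml
<?xml version="1.0"?>
<!-- Rule 1: a single root element contains everything else -->
<poem>
 <!-- Rule 2: elements nest properly; each one closes before its parent does -->
 <line n="1">Shall I compare <emph>thee</emph> to a summer's day?</line>
 <!-- Rule 4: an empty element may use the combined start-and-end tag -->
 <gap/>
 <!-- Rule 5: attribute values are quoted, with either kind of quote -->
 <line n='3'>Rough winds do shake the darling buds of May,</line>
 <!-- Rule 3: names are case sensitive, so <Line> could not be closed by </line> -->
</poem>
```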

2.14. Parts of an XML document

<?xml version="1.0" ?>
<greeting xmlns="http://www.greetings.org">
 <hello type="sarcastic">hello world!</hello>
</greeting>
  • The XML declaration
  • Namespace declarations
  • The root element of the document itself
  • Other elements and content
  • Attribute and value

2.15. Test your XML knowledge

  • Which are correct?
    • <seg>some text</seg>
    • <seg><foo>some</foo> <bar>text</bar></seg>
    • <seg><foo>some <bar></foo> text</bar></seg>
    • <seg type="text">some text</seg>
    • <seg type='text'>some text</seg>
    • <seg type=text>some text</seg>
    • <seg type ="text">some text</seg>
    • <seg type="text">some text<seg/>
    • <seg type="text">some text<gap/></seg>
    • <seg type="text">some text< /seg>
    • <seg type="text">some text</Seg>

2.16. What can the TEI do for you?

The TEI provides a framework for the definition of multiple schemas

  • it defines and names several hundred useful textual distinctions
  • it provides a set of modules that can be used to define schemas making those distinctions
  • it provides a customization mechanism for modifying and combining those definitions with new ones using the same conceptual model

2.17. Goals of the TEI

  • better interchange and integration of scholarly data
  • support for all texts, in all languages, from all periods
  • guidance for the perplexed: what to encode — hence, a user-driven codification of existing best practice
  • assistance for the specialist: how to encode — hence, a loose framework into which unpredictable extensions can be fitted

These apparently incompatible goals result in a highly flexible, modular environment

2.18. TEI Deliverables

  • A set of recommendations for text encoding, covering both generic text structures and some highly specific areas based on (but not limited by) existing practice
  • A very large collection of element definitions with associated declarations for various schema languages
  • a modular system for creating personalized schemas or DTDs from the foregoing

for the full picture see http://www.tei-c.org/TEI/Guidelines/

2.19. Legacy of the TEI

  • a way of looking at what ‘text’ really is
  • a codification of current scholarly practice
  • (crucially) a set of shared assumptions and priorities about the digital agenda:
    • focus on content and function (rather than presentation)
    • identify generic solutions (rather than application-specific ones)

3. Infrastructure

  • The TEI encoding scheme consists of a number of modules
  • These declare XML elements and their attributes
  • An element's declaration assigns it to one (or more) model classes
  • Another part declares its possible content and attributes with reference to these classes
  • This indirection allows strength and flexibility
  • It makes it easy to add/exclude new elements by referencing existing classes

3.1. What is a module?

  • A convenient way of grouping together a number of element declarations
  • These are usually on a related topic or specific application
  • Most chapters focus on elements drawn from a single module, which that chapter then defines
  • A TEI Schema is created by selecting modules and adding/removing elements from them as needed

3.2. Modules

Module name Chapter
analysis Simple Analytic Mechanisms
certainty Certainty and Responsibility
core Elements Available in All TEI Documents
corpus Language Corpora
dictionaries Dictionaries
drama Performance Texts
figures Tables, Formulae, and Graphics
gaiji Representation of Non-standard Characters and Glyphs
header The TEI Header
iso-fs Feature Structures
linking Linking, Segmentation, and Alignment
msdescription Manuscript Description
namesdates Names, Dates, People, and Places
nets Graphs, Networks, and Trees
spoken Transcriptions of Speech
tagdocs Documentation Elements
tei The TEI Infrastructure
textcrit Critical Apparatus
textstructure Default Text Structure
transcr Representation of Primary Sources
verse Verse

3.3. Defining a TEI Schema

  • A schema helps you know a document is valid in addition to being well-formed
  • A TEI schema is a combination of TEI modules, optionally including customizations of the elements/attributes/classes that they contain
  • This schema is defined in an application-independent manner with a TEI ODD (one document does it all) file which allows for:
    • creation of schemas such as DTD, RELAX NG or W3C Schema
    • internationalized documentation which reflects your customization of the TEI
    • documentation of how your schema differs from tei_all that is suitable for long-term preservation
  • (But we will discuss this in more detail later today!)

3.4. A Simple Customization

A TEI ODD file can contain discursive prose, but needs a <schemaSpec> element to define the schema it documents

<schemaSpec ident="TEI-minimal" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
</schemaSpec>
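
For orientation, here is a sketch of the kind of document a schema built from just these four modules is meant to describe. The title and paragraph texts are placeholders invented for the example; the element names and the namespace are those of TEI P5.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <!-- Elements from the header module -->
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>A minimal TEI document</title>
   </titleStmt>
   <publicationStmt>
    <p>Unpublished workshop example.</p>
   </publicationStmt>
   <sourceDesc>
    <p>Born digital; there is no earlier source.</p>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <!-- Elements from the textstructure and core modules -->
 <text>
  <body>
   <p>Hello, TEI.</p>
  </body>
 </text>
</TEI>
```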

3.5. Even more customisation

<schemaSpec ident="SleepyHollow" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <moduleRef key="namesdates"/>
 <moduleRef key="transcr"/>
<!-- We don't need these drama elements: -->
 <elementSpec ident="sp" mode="delete" module="core"/>
 <elementSpec ident="speaker" mode="delete" module="core"/>
 <elementSpec ident="stage" mode="delete" module="core"/>
</schemaSpec>
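
Since an ODD file is itself a TEI document, a customization like this normally sits inside an ordinary header-plus-text wrapper alongside the prose that explains it. The skeleton below is a sketch of that arrangement; the title and prose wording are placeholders, and only two of the module references are repeated to keep it short.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt><title>Sleepy Hollow customization</title></titleStmt>
   <publicationStmt><p>Workshop example, not published.</p></publicationStmt>
   <sourceDesc><p>Written from scratch.</p></sourceDesc>
  </fileDesc>
 </teiHeader>
 <text>
  <body>
   <!-- Discursive prose: the human-readable half of the documentation -->
   <p>This project does not encode drama, so the speech elements are removed.</p>
   <!-- The schema specification: the machine-readable half -->
   <schemaSpec ident="SleepyHollow" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="core"/>
    <elementSpec ident="sp" mode="delete" module="core"/>
   </schemaSpec>
  </body>
 </text>
</TEI>
```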

3.6. The TEI Class System

  • The TEI distinguishes over 500 elements.
  • Having these organised into classes aids comprehension, modularity, and modification.
  • Attribute class: the members share common attributes
  • Model class: they can appear in the same locations (and often are structurally or semantically related)
  • Classes may contain other classes
  • Elements inherit the properties from any classes of which they are members

3.7. Attribute Classes

  • Attribute classes are given (usually adjectival) names beginning with att.
  • Members of the att.naming class get a @key attribute from the class, rather than each defining it individually
  • If another element needs a @key then the easiest way to provide it is to add that element to the att.naming class
  • Classes can be grouped together into superclasses
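
In an ODD customization, adding an element to an attribute class might be sketched as follows. The choice of <mentioned> is purely illustrative, and the sketch assumes that the element is not already a member of att.naming in the TEI release being customized.

```xml
<!-- Change the existing <mentioned> element from the core module... -->
<elementSpec ident="mentioned" module="core" mode="change">
 <!-- ...by adding to its class memberships rather than replacing them -->
 <classes mode="change">
  <!-- Membership of att.naming brings @key with it -->
  <memberOf key="att.naming"/>
 </classes>
</elementSpec>
```

This fragment would be placed inside the <schemaSpec> of the ODD file.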

3.8. att.global

The attributes provided by att.global include, among others:
  • @xml:id: a unique identifier
  • @xml:lang: the language of the element content
  • @n: a number or name for an element
  • @rend: how the element in question was rendered or presented in the source text
att.global also contains att.global.linking, so if the 'linking' module is loaded it provides attributes such as:
  • @corresp: points to elements that correspond to the current element in some way
  • @copyOf: points to an element of which the current element is a copy
  • @next: points to the next element of a virtual aggregate of which the current element is part
  • @prev: points to the previous element of a virtual aggregate of which the current element is part
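
A short fragment may make these concrete. It is a sketch with invented content: the first paragraph carries the basic global attributes @xml:id, @n, @xml:lang and @rend, and the second uses @next and @prev (available once the linking module is loaded) to join an interrupted quotation into one virtual whole.

```xml
<!-- Basic global attributes: identifier, number, language, rendition -->
<p xml:id="para12" n="12" xml:lang="fr" rend="centered">Liberté, égalité, fraternité</p>

<!-- Linking attributes: the two <q> elements point at each other by identifier -->
<p><q xml:id="q1" next="#q2">I would</q>, she said,
 <q xml:id="q2" prev="#q1">if I could.</q></p>
```

The values of @rend here are free-text descriptions chosen for the example, not a fixed vocabulary.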

3.9. Model Classes

  • Model classes contain groups of elements allowed in the same place
  • If we are adding an element that should be allowed wherever <bibl> is allowed, we simply add it to the model.biblLike class
  • Model classes are usually named with a Like or Part suffix:
    • model.divLike: structural class grouping elements for divisions
    • model.divPart: structural class grouping elements used inside divisions
    • model.nameLike: semantic class grouping name elements
    • model.persNamePart: semantic sub-class grouping elements that are part of a personal name
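
As a sketch of the model.biblLike case, a new element could be declared in an ODD like this. The element name shelfRef, its description and its namespace are hypothetical, invented for the example; the content model is written in RELAX NG, which is an assumption about how the ODD is being processed.

```xml
<!-- A new, non-TEI element, so it is given its own (hypothetical) namespace -->
<elementSpec ident="shelfRef" ns="http://example.org/ns/workshop" mode="add">
 <desc>a reference to a physical shelf copy (hypothetical element)</desc>
 <classes>
  <!-- Gets the global attributes -->
  <memberOf key="att.global"/>
  <!-- Becomes allowed wherever <bibl> is allowed -->
  <memberOf key="model.biblLike"/>
 </classes>
 <!-- Content model: plain text only -->
 <content>
  <rng:text xmlns:rng="http://relaxng.org/ns/structure/1.0"/>
 </content>
</elementSpec>
```

No existing content model has to be edited: every element whose content refers to model.biblLike now accepts the new element as well.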

3.10. Basic Model Class Structure

The TEI class system makes a threefold division of elements:
  • divisions: high level major divisions of texts
  • chunks: elements such as paragraphs appearing within texts or divisions, but not other chunks
  • phrase-level elements: elements such as highlighted phrases which can occur only within chunks
The TEI identifies the following groupings derived from these three:
  • inter-level elements: elements such as lists which can appear either in or between chunks
  • components: elements which can appear directly within texts or text divisions

3.11. Macros

  • macro.paraContent: content of paragraphs and similar elements
  • macro.limitedContent: content of prose elements that are not used for transcription of extant materials
  • macro.phraseSeq: a sequence of character data and phrase-level elements
  • macro.phraseSeq.limited: a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents
  • macro.specialPara: the content model of elements which either contain a series of component-level elements or else contain a series of phrase-level and inter-level elements

3.12. Datatype Macros

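  • data.code: a coded value
  • data.word: a single word or token
  • data.name: an XML Name
  • data.enumerated: a single XML name taken from a documented list
  • data.duration.w3c: a W3C duration
  • data.temporal.w3c: a W3C date
  • data.truthValue: a truth value
  • data.language: a language
  • data.sex: human or animal sex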

3.13. Next...?

Next James will tell us about the Overall Changes from P4 to P5.

James Cummings and Dot Porter. Date: 2007-10-31
Copyright University of Oxford