IT Services, University of Oxford


1. Overall P4 to P5

There have been many changes from TEI P4 to TEI P5, and we can't hope to cover them all here. Instead, we highlight some of the most significant ones.

1.1. Changes from P4 to P5

  • We can't mention every change since TEI P4, so we'll look at the main kinds of things that have changed
  • Conversion will always be a process of trial and error
  • Where possible the TEI has adopted external standards rather than reinvent the wheel
  • A long-term attempt to do things the 'better' way rather than the 'easier' way
  • As always the prose of the TEI P5 Guidelines is the final source for the current recommendations

1.2. Eight new things about P5

  1. One specification language for extensions and documentation
  2. Support for multiple schema languages
  3. Support for namespaces
  4. Reliance on XML, and hence on Unicode
  5. Validation of attributes and datatyping
  6. Use of W3C pointers and paths
  7. Verifiable conformance
  8. Some old annoyances removed and some new topics added

1.3. 1. One Specification Language

  • A set of TEI documents is described by an ODD, which is itself a TEI document that combines:
    • references to existing declarations
    • formal declarations for elements and attributes
    • documentation and usage notes
  • Underlying this:
    • a conceptual model which abstracts from specific elements to generic classes
    • a modular architecture for combining sets of definitions
  • specifications are chainable; modifications are written in ODD with ODD as input and output
  • Roma is one interface to this: there will be others
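
As a rough illustration of the points above, a very small ODD customization might look like the sketch below. The identifier myTEI, the selection of modules, and the decision to delete <mentioned> are all assumptions made up for this example, not a recommended profile.

```xml
<!-- A sketch only: "myTEI" and the selection of modules are hypothetical. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myTEI" start="TEI">
  <!-- References to existing declarations: pull in whole TEI modules. -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- A formal modification: remove one element from the generated schema. -->
  <elementSpec ident="mentioned" module="core" mode="delete"/>
</schemaSpec>
```

An ODD processor such as Roma would be expected to turn a specification like this into both a schema and matching documentation.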

1.4. 2. Support for many schema languages

  • TEI schemas can be generated for
    • Traditional XML DTD language
    • ISO RELAX NG language
    • W3C Schema Language
  • Content models are defined using RELAX NG syntax
  • Datatypes are defined in terms of W3C datatypes
  • Some facilities (e.g. alternation, namespaces) cannot be expressed in DTD
  • Additional constraints can be expressed in Schematron
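
To give a flavour of the last point, here is a hedged sketch of a Schematron rule that checks something beyond the content model. It assumes the ISO Schematron namespace, and the rule itself (each local pointer in who must match some xml:id) is a hypothetical project policy that assumes who holds a single pointer.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- Bind a prefix so the XPath expressions can address TEI elements. -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <!-- Only speeches whose who attribute is a local pointer are checked. -->
    <sch:rule context="tei:sp[starts-with(@who, '#')]">
      <!-- True if any xml:id in the document equals the pointer minus '#'. -->
      <sch:assert test="//@xml:id = substring(@who, 2)">
        The who attribute should point to an xml:id in this document.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```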

1.5. 3. Support for many namespaces

  • A key design goal for P5 was interoperability with other standards
  • By defining — and insisting on — a TEI namespace we facilitate that interoperability
  • Examples: embedding of MathML, SVG, KML, GML...
  • Not to mention: embedding of TEI within DocBook

User-defined extensions must use their own namespace

1.6. For example

Embedding SVG within TEI:
<figure>
 <svg xmlns="http://www.w3.org/2000/svg" width="6cm" height="5cm" viewBox="6 3 6 5">
  <ellipse style="fill: #ffffff" cx="9.75" cy="6.35" rx="2.75" ry="2.35"/>
  <!-- ... -->
 </svg>
</figure>

A user-defined extension:
<div xmlns:my="http://www.example.org/ns/nonTEI">
 <!-- ... -->
 <p n="12" my:topic="rabbits">Flopsy, Mopsy, Cottontail, and ...</p>
</div>

1.7. 4. Reliance on XML and Unicode

  • Replacing entity references such as &squiggle; with the actual character (or a numeric Unicode character reference, &#xxxx;) is highly recommended
  • If you really need to use non-Unicode characters...
    • wherever text is possible as content, <g> can be used, either as a pointer, or to hold any convenient representation
    • nonstandard characters and glyphs can now be defined in the header
  • we now use xml:lang (just as we now use xml:id and xml:base)
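
A minimal sketch of how a nonstandard glyph might be declared in the header and then referenced from the text. The identifier z103, the glyph name, and the mapping are illustrative assumptions; the two fragments would sit in different parts of a real document.

```xml
<!-- In the header: declare the nonstandard glyph once. -->
<encodingDesc>
  <charDecl>
    <glyph xml:id="z103">
      <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
      <!-- A standard character that can stand in for the glyph. -->
      <mapping type="standard">Z</mapping>
    </glyph>
  </charDecl>
</encodingDesc>

<!-- In the text: point at the declaration; the content is a convenient fallback. -->
<p>... <g ref="#z103">Z</g> ...</p>
```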

1.8. 5. Validation of attributes and datatyping

Attribute values at P5 cannot contain markup. Consequently, the <choice> element replaces ‘mirror’ tags. P4:
<reg orig="yeere">year</reg>
P5:
<choice><orig>yeere</orig><reg>year</reg></choice>
Text-like attributes become child elements. P4:
<event desc="transcriber dozes off"/>
P5:
<event>
 <desc xml:lang="en">transcriber dozes off</desc>
 <desc xml:lang="fr">transcripteur s'endort</desc>
</event>

1.9. Datatype validation

At P4, attribute values were ID, IDREF, enumerated list, or CDATA (only)

At P5, we introduce greater precision and variety, relying on more sophisticated schema processors able to check
  • validity of dates (ISO or W3C)
  • URIs
  • predefined ISO standards e.g. for sex and language
  • data facets and patterns
  • Schematron rules
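
A few fragments showing the kind of attribute values a datatype-aware processor can check. The values are invented for illustration.

```xml
<!-- Fragments only; the values are invented for illustration. -->

<!-- A date normalized in a W3C (ISO 8601-style) format. -->
<date when="2007-10-31">31 October 2007</date>

<!-- A pointer whose value must be a URI. -->
<ptr target="http://www.tei-c.org/"/>

<!-- A language identified by a standard language tag. -->
<foreign xml:lang="fr">transcripteur</foreign>
```

A value such as when="Halloween 2007" is well-formed XML, but a schema processor that checks datatypes should reject it.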

1.10. 6. Support for W3C pointer mechanisms

  • P4 vs P5
    • P4 had two different ways of linking:
      • internal: <ptr>: using ID/IDREF
      • external: <xptr>: using TEI-specific syntax
    • In P5, all pointing is done in the same way, using URIs
    • A URI may be absolute …
    • … or relative
        <ref target="details/dogs.xml">Dogs</ref>
        <ref target="details/cats.xml">Cats</ref>
    • … or local identifiers can still be used
        <sp who="#Macbeth">
         <speaker>Mac.</speaker> ...
        </sp>
    • In addition, you can qualify a URI using an XPointer framework scheme such as xpath()
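
The pointing styles listed above might look like this in practice. This is a sketch: the identifier spaniel and the XPath expression are hypothetical, namespace handling inside the path is glossed over, and the exact scheme names and syntax should be checked against the Guidelines.

```xml
<!-- An absolute URI. -->
<ref target="http://www.tei-c.org/">TEI Consortium</ref>

<!-- A relative URI narrowed to one identified element in that file
     ("spaniel" is a made-up xml:id). -->
<ref target="details/dogs.xml#spaniel">Spaniels</ref>

<!-- A URI qualified by an XPointer scheme; the path is illustrative only. -->
<ref target="details/dogs.xml#xpath(//div[2])">The second division</ref>
```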

1.11. 7. Verifiable conformance

  • What might it mean to say that a document is ‘TEI conformant’?
    • conformance to the TEI abstract model
    • appropriate use of TEI namespace
    • appropriate use of a TEI ODD
    • ‘clean’ modifications only
    • compiled schemas not DTD subsets - a different way of thinking

Standardization does not mean ‘Do what I do’, but ‘Explain what you do’

1.12. 8. Other novelties at P5

P5 includes significant new material:
  • manuscript description
  • manuscript transcription
  • data about persons and places
  • floating and overlapping texts
  • integration of text and graphics
  • markup documentation
  • internationalization features

2. Converting from P4 to P5

There is no magic solution for converting from TEI P4 to P5. If you used plain vanilla TEI, the chances of a purely mechanical conversion are better, but still not assured. Most problems arise because P5 gives attribute values datatypes, whereas P4 simply allowed text values.

2.1. Methods of converting

  • http://www.tei-c.org/P5/p4top5.xsl
  • http://www.tei-c.org.uk/wiki/index.php/Category:P4toP5
  • Roll your own
  • As with all data migration, small pipelined steps you can customize are generally a good idea
  • So far, none of the available conversion methods can process P4 extension files (ODD files are, in any case, better documentation)
  • In highly extended schemas, consider reducing to standard TEI P4 before converting to P5
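
If you do roll your own, one small pipeline step might look like the XSLT 1.0 sketch below. It is not the p4top5.xsl stylesheet; it is a hypothetical first pass that only moves elements into the TEI namespace, renames the root, and renames the id and lang attributes, leaving everything else for later steps.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy attributes, text, comments and processing instructions unchanged. -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

  <!-- Recreate every element under the same local name in the TEI namespace. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}" namespace="http://www.tei-c.org/ns/1.0">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- The P4 root TEI.2 becomes TEI. -->
  <xsl:template match="TEI.2">
    <xsl:element name="TEI" namespace="http://www.tei-c.org/ns/1.0">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- P4 id becomes xml:id; an id whose parent is lang is simply copied here
       and left for a later step. -->
  <xsl:template match="@id[not(parent::lang)]">
    <xsl:attribute name="xml:id"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>

  <!-- P4 lang becomes xml:lang. -->
  <xsl:template match="@lang">
    <xsl:attribute name="xml:lang"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

Keeping each step this small makes it easy to inspect the intermediate output. If the input is parsed together with its P4 DTD, defaulted attributes such as TEIform may appear in the result and would need a further step to remove.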

2.2. Some other conversion reminders

  • @id is now @xml:id (except when parent::lang) and @lang is @xml:lang
  • <TEI.2> is now <TEI>; <teiCorpus.2> is now <teiCorpus>
  • ODD means no need for @TEIform
  • Janus tags (corr/sic|abbr/expan, etc.) are inside <choice>
  • Most @url are now @target
  • <xptr>/<xref> are now <ptr>/<ref> and use XPointer schemes
  • date/@value and similar are now @when (using W3C date formats)
  • descriptive text attribute values are now child elements
  • many pointers are now URI-based: start the attribute value with '#' when pointing to a local identifier
  • everything has a namespace, and lots of values have datatypes
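
Putting several of these reminders together, a before-and-after fragment might look like the sketch below. The content and identifiers are invented for illustration.

```xml
<!-- P4 (fragment) -->
<div id="ch1" lang="en">
  <p>On <date value="2007-10-31">Halloween</date> they lost
    <sic corr="their">thier</sic> way; see <ptr target="ch2"/>.</p>
</div>

<!-- P5 (fragment; the TEI namespace is assumed to be declared on an ancestor) -->
<div xml:id="ch1" xml:lang="en">
  <p>On <date when="2007-10-31">Halloween</date> they lost
    <choice><sic>thier</sic><corr>their</corr></choice> way; see <ptr target="#ch2"/>.</p>
</div>
```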

2.3. Next...?

Next, Dot will tell us about TEI's Default Text Structure and Header.

James Cummings. Date: 2007-10-31
Copyright University of Oxford