IT Services, University of Oxford

1. Summary

How does a TEI user do the following?
  • Data capture
  • Editing
  • Schema design
  • Other forms of validation

2. What tools do we need?

  • Appropriately expressive vocabularies (eg TEI XML)
  • Syntax-checking document creation tools (ie editors)
  • Document transformation tools
  • Document delivery tools
  • Document storage and management tools
  • Programming interfaces
  • Specialized applications

3. Two stages to get a TEI text

  • capture the text
  • create the markup

Often they occur simultaneously; but often not.

Note that the markup does not necessarily all have to be in the same file.

4. Categories of creation tools

  • scanning/OCR
  • data-entry vendors
  • software to add tagging automatically
  • editors
followed by
  • validators, well-formedness checkers
  • proofing aids, data integrity checkers

5. Text capture ‘tools’

Scanning and OCR software generally produces only minimal HTML or Word output (e.g., recognizing paragraph breaks, font changes etc).

Data-entry vendors in theory would insert whatever markup you wanted, but at a price. They generally prefer HTML or TEI Lite or some such well-known DTD.

The TEI is negotiating discounts for members who use a particular TEI DTD (‘TEI tight’), as yet undeveloped.

6. Auto-tagging software

  • fully automatic
    • no human intervention
    • typically either
      • very light markup
      • very poor markup
      • very predictable input
  • semi-automatic
    • some human intervention required
    • software attempts to limit human intervention to where it is needed, i.e. to interpretative decisions, not typing
    • makes the markup process go more efficiently
    • often special-purpose
  • data-entry forms
    • highly constraining environment like a web form
    • excellent for very regular structured entry, e.g. cataloguing a relatively homogeneous collection

7. Useful daily tricks for data conversion from funny formats

  • Can you get it to HTML? If so, run W3C tidy to clean the HTML, then run a transformation / enricher to TEI XML
  • Does OpenOffice read it? Use simple TEI filters to export from OO
  • Is its native format XML under the hood? Write an XSL transformation
  • Can you make PDF? Consider a PDF to text extractor and add back markup
  • Can you print? Consider OCRing a printout
  • Don't assume rekeying is too expensive
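
The HTML route can be sketched in a few lines of Python. This is only an illustration, assuming the HTML has already been tidied into well-formed XHTML; the mapping table HTML_TO_TEI and the function names are hypothetical and cover only a handful of elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical mapping from tidied XHTML element names to TEI names
# plus fixed attributes; extend it after your own document analysis.
HTML_TO_TEI = {
    "p": ("p", {}),
    "em": ("hi", {"rend": "italic"}),
    "strong": ("hi", {"rend": "bold"}),
    "ul": ("list", {"type": "unordered"}),
    "ol": ("list", {"type": "ordered"}),
    "li": ("item", {}),
    "a": ("ref", {}),
}

def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def to_tei(html_elem):
    """Recursively copy an XHTML element into a new TEI element."""
    name = local_name(html_elem.tag)
    if name not in HTML_TO_TEI:
        raise ValueError(f"no TEI mapping for <{name}>")
    tei_name, attrs = HTML_TO_TEI[name]
    tei = ET.Element(f"{{{TEI_NS}}}{tei_name}", dict(attrs))
    # Carry a hyperlink's href across as a TEI target attribute.
    if name == "a" and "href" in html_elem.attrib:
        tei.set("target", html_elem.attrib["href"])
    tei.text = html_elem.text
    for child in html_elem:
        new_child = to_tei(child)
        new_child.tail = child.tail
        tei.append(new_child)
    return tei

sample = '<p>See <a href="http://www.tei-c.org/">the <em>TEI</em> site</a>.</p>'
tei_p = to_tei(ET.fromstring(sample))
ET.register_namespace("", TEI_NS)
print(ET.tostring(tei_p, encoding="unicode"))
```

Raising an error on unmapped elements is deliberate: it forces a decision about each new element rather than silently dropping markup.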

8. Editor types

Editing tools cover a wide spectrum:
  • Basic text editors
  • General programmers' editors
  • XML-aware programmers' editors
  • XML-specific editors
  • Word-processors which can export XML
  • Data-entry forms
It is likely that people in different roles need different tools.

9. Things to look for in specialist XML editors

  • schema-aware
  • constraining element entry
  • IDE features
  • customizable
  • validation, preferably continual
  • Multiple display views (as tree, with tags, formatted etc)
  • folding structures
  • context-sensitive help
Emacs, oXygen, jEdit, XMetaL, XMLSpy, Stylus Studio, Arbortext Adept are all worth a look.

10. Emacs (1)

11. Emacs (2)

12. OpenOffice

13. oXygen

14. XMetaL (1)

15. XMetaL (2)

16. XMetaL (3)

17. XMLspy

18. What is missing, or hard, in the TEI editing world

  • Editors like XMetaL which combine visual feedback with code editing
  • Visual, or WYSIWYG, editors in web applications (eg in a CMS); most web editors are for XHTML (cf Writely)
  • Reliable conversion to and from Word and OpenOffice styles. Note:
    • the general inability of word-processors to nest inline inside inline, or block inside block
    • the difficulty of extrapolating a hierarchical structure from a sequence of free-standing headings at assorted levels
    • the tedious programming required to trace the ancestry of styles in Word and OO
    • the lack of a facility in OO to stop the user formatting by hand
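
To make the heading problem concrete, here is a minimal Python sketch that turns a flat sequence of headings into nested <div> elements. It assumes the headings have already been extracted from the word-processor file as (level, title) pairs with levels starting at 1; the function name nest_headings is hypothetical.

```python
import xml.etree.ElementTree as ET

def nest_headings(headings):
    """Build nested <div> elements from (level, title) pairs in document order."""
    body = ET.Element("body")
    # Stack of (level, element); the body acts as level 0.
    stack = [(0, body)]
    for level, title in headings:
        # Close every open div at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        ET.SubElement(div, "head").text = title
        stack.append((level, div))
    return body

flat = [(1, "Introduction"), (2, "Scope"), (2, "Method"),
        (1, "Results"), (3, "Odd jump")]
body = nest_headings(flat)
print(ET.tostring(body, encoding="unicode"))
```

Note what happens to the last heading: a level 3 heading directly after a level 1 heading is simply attached to the nearest open <div>. Whether that is right is an interpretative decision the software cannot make for you.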

19. Design your schema

The TEI provides a wealth of elements arranged in classes and modules; many of them are useful to you, but not all. So:
  1. do your document analysis and tagging trials
  2. decide which elements you need from the TEI
  3. delete the rest
  4. decide where you need shorthand syntactic sugar: define new elements
  5. consider datatype constraints and adjust accordingly
  6. write a human-readable manual
  7. implement non-schema checks as a second line of defence

20. Analysis example

Looking at the XML files comprising http://www.oucs.ox.ac.uk and http://www.oss-watch.ac.uk, we see
  • Documents: 1698
  • Elements: 233840
  • Unique elements: 122
  • Elements occurring more than 10 times: 94

Note that TEI Lite defines 151 elements
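
Figures like these are cheap to produce. The following Python sketch shows one way to count them, assuming the documents are available as well-formed XML strings; the function names element_counts and summarize are hypothetical, and the two tiny sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_counts(xml_strings):
    """Count element occurrences by local name across several documents."""
    counts = Counter()
    for text in xml_strings:
        root = ET.fromstring(text)
        for elem in root.iter():
            # Drop any '{namespace}' prefix so that only the local name is counted.
            counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts

def summarize(counts, threshold=10):
    """Return total, unique and frequently used element counts."""
    return {
        "elements": sum(counts.values()),
        "unique": len(counts),
        "frequent": sum(1 for n in counts.values() if n > threshold),
    }

# Invented sample documents, for illustration only.
docs = [
    "<TEI><text><body><p>a</p><p>b</p></body></text></TEI>",
    "<TEI><text><body><p>c</p><list><item>d</item></list></body></text></TEI>",
]
counts = element_counts(docs)
print(summarize(counts, threshold=2))
```

Elements that fall below the threshold are the first candidates to question when deciding what to delete from the schema.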

21. Decisions for an example schema

  • no numbered <div> elements
  • no elements for marking-up transcribed text (<gap>, <reg> etc)
  • simplified models for <front>, <body>, <back>, <div>, <list>
  • new high-level <uList>, <oList> and <glossList> to replace old <list>
  • add special elements for code, file paths etc
  • some header extensions
  • no more than 100 TEI elements

22. Before: block-level choice

23. Before: inline choice

24. After: block-level choice

25. After: inline choice

26. Tricksy ODD: syntactic sugar

<elementSpec ident="uList">
 <equiv name="list" mimeType="text/xsl" filter="equivs.xsl"/>
 <gloss/>
 <desc>A sequence of items organized as an
   unordered list.</desc>
 <classes>
  <memberOf key="model.listLike"/>
 </classes>
 <content>
  <rng:zeroOrMore>
   <rng:ref name="item"/>
  </rng:zeroOrMore>
 </content>
</elementSpec>

27. What was that <equiv> thing?

It pointed to a template in an XSL file:
<xsl:template name="list">
 <list xmlns="http://www.tei-c.org/ns/1.0">
 <xsl:copy-of select="@xml:id|@n"/>
 <xsl:attribute name="type">
  <xsl:choose>
   <xsl:when test="local-name(.)='oList'">ordered</xsl:when>
   <xsl:when test="local-name(.)='uList'">unordered</xsl:when>
   <xsl:when test="local-name(.)='glossList'">gloss</xsl:when>
   <xsl:otherwise>
    <xsl:message terminate="yes">
     <xsl:value-of select="local-name(.)"/> is mapped to
         "list", but I do not know what to do with it</xsl:message>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:attribute>
 </list>
</xsl:template>
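
For readers less at home in XSLT, the same mapping can be sketched in Python. This is an illustration rather than part of the ODD processing chain: the function name desugar_list is hypothetical, and unlike the template as shown it also carries the element's children across.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Sugar element name -> value of the type attribute on the canonical <list>.
LIST_TYPES = {"oList": "ordered", "uList": "unordered", "glossList": "gloss"}

def desugar_list(elem):
    """Return a canonical TEI <list> equivalent to a sugar list element."""
    name = elem.tag.rsplit("}", 1)[-1]
    if name not in LIST_TYPES:
        # Mirrors the terminating xsl:message in the template.
        raise ValueError(f"{name} is mapped to 'list', but no type is known for it")
    new = ET.Element(f"{{{TEI_NS}}}list", {"type": LIST_TYPES[name]})
    # Copy xml:id and n, as the template does.
    for attr in (XML_ID, "n"):
        if attr in elem.attrib:
            new.set(attr, elem.attrib[attr])
    # Assumption beyond the template shown: keep the list's content too.
    new.text = elem.text
    new.extend(list(elem))
    return new

sugar = ET.fromstring(
    '<uList xmlns="http://www.tei-c.org/ns/1.0" n="1"><item>a</item></uList>'
)
plain = desugar_list(sugar)
print(plain.get("type"), len(plain))
```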

28. Types of validation

29. Beyond schemas and DTDs

You can:
  • run scripts to produce lists of values to check consistency
  • write Schematron rules to check things which schemas cannot
  • write checking scripts in XSLT
  • create an output format where tags are colour-coded, and ask humans to eyeball it

30. A list generator in XQuery

declare namespace tei="http://www.tei-c.org/ns/1.0";
for $text in //tei:text
return
<names>
{
for $name in distinct-values($text//tei:name/@key)
order by $name return
<name>{$name}</name>
}
</names>
which produces the following. Are they all spelt right?
<names>
 <name>Abanazar</name>
 <name>Aladdin</name>
 <name>Alexander</name>
 <name>Ali</name>
 <name>Badroulbadour</name>
 <name>Bahadur</name>
 <name>Bates</name>
 <name>Beetle</name>
 <name>Binjimin</name>
 <name>Borrow</name>
 <name>Braunton</name>
 <name>Braybrooke</name>
 <name>Brett</name>
 <name>Browning</name>
 <name>Burton</name>
 <name>Caesar</name>
 <name>Campbell</name>
 <name>Carson</name>
 <name>Carter</name>
 <name>Cathcart</name>
 <name>Catullus</name>
 <name>Chingangook</name>


</names>
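
If no XQuery engine is to hand, roughly the same list can be produced with a short script. The Python sketch below assumes a namespaced TEI document already parsed with ElementTree; the function name name_keys is hypothetical. Unlike the query above, it pools keys across the whole tree rather than reporting per <text>.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def name_keys(root):
    """Return the sorted, distinct key values of all <name> elements."""
    keys = {n.get("key") for n in root.iterfind(".//tei:name[@key]", TEI)}
    return sorted(keys)

# Invented sample document, for illustration only.
doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    '<name key="Beetle">Beetle</name> met <name key="Aladdin">him</name>; '
    '<name key="Beetle">B.</name> and <name>someone</name>'
    '</p></body></text></TEI>'
)
for key in name_keys(doc):
    print(key)
```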

31. A check written in XSLT

Are the target attributes sensible URLs?

<xsl:template name="checkThisLink">
 <xsl:param name="What"/>
 <xsl:choose>
  <xsl:when test="starts-with($What,'#')">
   <xsl:choose>
    <xsl:when
      test="not(key('IDS',substring-after($What,'#')))">

     <xsl:call-template name="Error">
      <xsl:with-param name="value" select="$What"/>
     </xsl:call-template>
    </xsl:when>
   </xsl:choose>
  </xsl:when>
  <xsl:when test="starts-with($What,'mailto:')"/>
  <xsl:when test="starts-with($What,'http:')"/>
  <xsl:when test="starts-with($What,'https:')"/>
  <xsl:otherwise>
   <xsl:call-template name="Error">
    <xsl:with-param name="value" select="$What"/>
   </xsl:call-template>
  </xsl:otherwise>
 </xsl:choose>
</xsl:template>
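
The same test is easy to express outside XSLT. A minimal Python sketch follows, assuming that every attribute named target should be checked and that local links point at xml:id values; the function names check_link and check_targets are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def check_link(target, known_ids):
    """Return None if the target looks sensible, else an error message."""
    if target.startswith("#"):
        # A local pointer must match an xml:id somewhere in the document.
        if target[1:] not in known_ids:
            return f"no element with xml:id '{target[1:]}'"
        return None
    if target.startswith(("mailto:", "http:", "https:")):
        return None
    return f"unrecognised link target '{target}'"

def check_targets(root):
    """Check every target attribute in the tree; return error messages."""
    ids = {e.get(XML_ID) for e in root.iter() if e.get(XML_ID)}
    errors = []
    for elem in root.iter():
        target = elem.get("target")
        if target is not None:
            message = check_link(target, ids)
            if message:
                errors.append(message)
    return errors

# Invented sample document, for illustration only.
doc = ET.fromstring(
    '<body><p xml:id="intro"><ref target="#intro"/><ref target="#nowhere"/>'
    '<ref target="ftp://x"/><ref target="https://www.tei-c.org/"/></p></body>'
)
for error in check_targets(doc):
    print(error)
```

Like the XSLT version, this checks only the form of a URL, not whether anything answers at the other end.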

32. Schematron example

<sch:schema extension-element-prefixes="date">
 <sch:ns uri="http://exslt.org/dates-and-times" prefix="date"/>
 <sch:title>Schematron rules for TEI</sch:title>
 <sch:pattern name="Complexity">
  <sch:rule context="list">
   <sch:report test="count(item)>8">Do not make lists with more than 8 items
   </sch:report></sch:rule></sch:pattern>
 <sch:pattern name="Alt tags">
  <sch:rule context="figure">
   <sch:report test="not(figDesc) and not(head)">You must provide information from which I
       can construct an alt attribute
   </sch:report></sch:rule></sch:pattern>
 <sch:pattern name="Metadata">
  <sch:rule role="Root" context="TEI.2">
   <sch:report test="not(@lang)">The primary language of a document should
       be identified with a lang attribute. </sch:report>
   <sch:assert test="teiHeader/fileDesc/titleStmt/author">You have
       not provided an author name</sch:assert>
   <sch:assert
     test="teiHeader/fileDesc/editionStmt/edition/date">
You have
       not provided a date for the document</sch:assert>
   <sch:assert
     test="teiHeader/revisionDesc/change/date[contains(.,'$LastChanged')]">
You must have a Subversion $LastChanged$ field
       in a revision statement</sch:assert></sch:rule>
  <sch:rule
    role="Date"
    context="TEI.2/teiHeader/revisionDesc/change/date">

   <sch:assert
     test="translate(substring-before(substring-after(.,'$LastChangedDate: '),' '),'-','') >translate(substring-before(date:add(date:date(),'-P6M'),'T'),'-','')">
Date value of <sch:value-of
      select="substring-before(substring-after(.,'$LastChangedDate: '),' ')"/>
       is more than 6 months older than <sch:value-of select="date:date()"/></sch:assert></sch:rule></sch:pattern>
 <sch:pattern name="Tables">
  <sch:rule context="table">
   <sch:report test="not(head)">A table should have a caption</sch:report>
   <sch:report test="parent::body">Do not use tables to lay out the
       document body</sch:report></sch:rule></sch:pattern></sch:schema>
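
Some of these rules can also be run as a plain script when no Schematron processor is available. The Python sketch below re-expresses the list, figure and table rules only, assuming an un-namespaced TEI tree as in the rules above; the function name check_document is hypothetical, and the metadata and date rules are left out.

```python
import xml.etree.ElementTree as ET

def check_document(root, max_items=8):
    """Return a list of human-readable problems found in the tree."""
    problems = []
    for lst in root.iter("list"):
        if len(lst.findall("item")) > max_items:
            problems.append(f"Do not make lists with more than {max_items} items")
    for fig in root.iter("figure"):
        if fig.find("figDesc") is None and fig.find("head") is None:
            problems.append("Figure has neither figDesc nor head for an alt attribute")
    for table in root.iter("table"):
        if table.find("head") is None:
            problems.append("A table should have a caption")
    # Tables that are direct children of <body> are probably layout tables.
    for _ in root.findall(".//body/table"):
        problems.append("Do not use tables to lay out the document body")
    return problems

# Invented sample document that breaks each rule once.
doc = ET.fromstring(
    "<TEI.2><text><body><table><row/></table><figure/><list>"
    + "<item/>" * 9
    + "</list></body></text></TEI.2>"
)
for problem in check_document(doc):
    print(problem)
```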

33. Conclusions

  • there are many ways to make a text
  • provide different tools for different people
  • the tightest schema is usually the best
  • do not just rely on the schema
  • analyse your texts


Sebastian Rahtz. Date: February 2007
Copyright University of Oxford