1 Getting started using TEI: What is TEI?

[Abstract:] This article is the first in a series that should help its readers get started with the TEI. It explains the principles of text encoding and what to expect from the rest of the series. It examines the reasons for using or not using TEI, and gives a full, albeit simplified, example of a real TEI text, showing the overall structure of a TEI document: header, facsimile, text consisting of front, body and back, and textual divisions. It also shows a number of optional extras that can help explain the text (notes), reconstruct the text's history (addition and deletion), or index the text (names and person description).

The article you are reading is the first instalment in a series of articles that should help you get started using TEI. The following instalments will appear in upcoming issues of TEI-EJ. The series cannot provide a full tutorial, and it cannot explain all of the detail of the underlying technologies, but it should make life a little easier for someone who is confronted with the TEI (Text Encoding Initiative) for the first time. This article begins by giving a brief explanation of the TEI (1.1 TEI: a very high-level overview), and goes on to describe what you can expect from this article series (1.2 What this article series does, and what it does not). We then discuss the question of when to use and when not to use TEI (1.3 When to use and when not to use TEI?). We give a first example of a complete TEI text in 1.4 Overall structure of a TEI text.

1.1 TEI: a very high-level overview

If you are reading this article you have probably decided to use, or are considering using, the TEI to create a digital version of one or more texts. Someone may have told you that you should do so, or you may have read about other projects that used the TEI. So what is TEI, and why is it important?

A common answer to that question is that the TEI is a set of Guidelines for the encoding of text. But TEI is not just a set of encodings and Guidelines for using them. The TEI is also an organization that maintains these Guidelines, and a community of users that applies the Guidelines. In fact the users are the organization: the TEI is a consortium of universities, libraries and research organizations, and the TEI Board and Council are elected by the member organizations. Using TEI is taking part in the community of TEI users that meet on the TEI mailing list, in Special Interest Groups, at the yearly TEI member meeting and on other occasions.

But what is Text Encoding? Text encoding is the addition of codes to text in order to make it possible for a computer to process that text. Text encoding assumes, usually, that you already have a text, obtained either by manual transcription or by using an OCR program. The codes that you add to the text will describe some aspect of the text: either by adding information such as author or place of publication (metadata), or by characterizing the role, meaning or other properties of pieces of text. The title of this section, to give a very simple example, might be encoded as:

<head>TEI: A very high-level overview</head>

You might ask why we do this. Word processing software, like Microsoft Word, can after all handle text very well without us having to type codes in angle brackets. The long answer is given in another article in this series (The rationale of declarative markup), but a shorter answer is that explicit coding gives you (the editor of the text) control over your text and its application, something you don't have when you use a commercial word processor. For an office document that may not represent a problem, but for a text that you edit you probably do not want to depend on the standard facilities that were created to handle business documents. And while it is true that there are programs that can compute wordlists and complex concordances without any need for explicit codes, that is only true for individual texts without structure, and such programs provide only limited means for interrogating these texts. Once you have multiple texts you need a way to associate texts with titles, and if your text consists of poetry with a prose commentary, you may want to have word counts by type of text. That is to say: you need a structured representation of text properties, and that is what text encoding provides.
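
As a minimal sketch of such a structured representation, the fragment below keeps verse and prose commentary apart. The wording is placeholder text, and the values 'poem' and 'commentary' are our own illustrative choices rather than anything the TEI prescribes; <lg> (line group) and <l> (verse line) are the TEI elements for verse.

<div type="poem">
 <head>Title of the poem</head>
 <lg>
  <l>First line of verse</l>
  <l>Second line of verse</l>
 </lg>
</div>
<div type="commentary">
 <p>A sentence of prose commenting on the poem.</p>
</div>

With an encoding along these lines, a program can count the words inside <l> elements separately from those inside <p> elements, and can report each <head> as the title of the poem it belongs to.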

If you know something about web pages, you may think by now that text encoding is like using HTML, the language of the web. Isn’t the <head> that we saw just now very much like the various <h1>, <h2>, … elements that HTML uses to encode headers at various levels? And indeed, the encoding that the TEI uses (so-called tags between angle brackets) is very similar to HTML's formalism, and the TEI shares some of its tags with HTML (such as <p> for paragraph and <div> for textual division). The formalism is called XML (more about it in the Technical Background article in this series). What makes TEI different from HTML is that the TEI provides tags for many different text types (e.g. verse, drama, dictionaries, spoken texts) and many different sorts of properties (to document e.g. a bibliographical description, or textual genesis), and that each of these tags has an explicit, well-documented meaning. To give an example: you can represent a play in HTML, but HTML has no special tags to indicate a speech, or a stage direction. TEI does have these, and a computer program that reads a TEI representation of a play can use that knowledge, e.g. to create an index of speeches by character or search only within a certain character's lines. It would be significantly more difficult to do so from an HTML representation.
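
To make the drama example concrete, here is a small sketch of how two speeches and a stage direction might be encoded. The characters and their words are placeholders, and the values of the who attribute are hypothetical identifiers that we assume to be defined elsewhere in the document (for instance in a cast list).

<sp who="#characterA">
 <speaker>Character A</speaker>
 <p>Words spoken by the first character.</p>
</sp>
<stage type="exit">Exit Character A.</stage>
<sp who="#characterB">
 <speaker>Character B</speaker>
 <p>Words spoken by the second character.</p>
</sp>

Because every speech is an <sp> element that names its speaker, a program can collect all speeches whose who attribute has a given value without having to guess from layout or typography.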

Your next question might be: so, once I have a TEI-encoded version of my text (i.e. an XML file that uses TEI tags), can I put that text on the web? The answer is: yes you can, but not without one further step. You need some sort of program that can turn your encoded text into a web page or collection of web pages. Remember, during the encoding process you explicitly describe the structure of your text, such as verse lines, headings, notes, etc. But you say nothing about how you want these to be represented on the web, or how you want them to be printed. There is a very good reason for this. For most texts, there is not a single best way to represent their content. One of the reasons why we do text encoding is that we want to do multiple things with the texts (e.g. show them on the web, present a search facility that can search by chapter, index personal names, etc.); for that very reason, presentation functionality is kept outside of the text. The idea is that the encoding should not be created for a single application program. Programs come and go; the encoding remains.

There are many programming languages and other tools that can be used to create a web page out of an XML file, but by far the most common is XSLT. We apply what is called an XSLT stylesheet to an XML file to transform it into one or more web pages. XSLT stylesheets can do much more than apply what one would ordinarily call 'style' to a document. They can define complex transformations including sorting, grouping, filtering, aggregation, and selection. The TEI provides a set of general sample stylesheets that can handle many types of text, but usually some stylesheet programming is necessary to create a suitable set of web pages for a project. The amount of programming depends very much on the ambition level and the properties of the text. More about XSLT in the Technical Background and Running stylesheets articles in this series. (By the way, there is also another type of stylesheet, viz. CSS stylesheets. CSS stylesheets define simpler rendering styles for XML or HTML (web) pages, including things like margins, colours, font sizes, etc.)
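
To give a first impression, the following is a minimal sketch of an XSLT stylesheet, not one of the TEI's own stylesheets. It assumes XSLT 1.0 and a source document whose elements are in the TEI namespace (http://www.tei-c.org/ns/1.0); the choice of <h2> for headings and the use of the type attribute as an HTML class are our own illustrative decisions.

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei">

 <xsl:output method="html"/>

 <!-- Each textual division becomes an HTML div -->
 <xsl:template match="tei:div">
  <div class="{@type}">
   <xsl:apply-templates/>
  </div>
 </xsl:template>

 <!-- A TEI heading becomes a second-level HTML header -->
 <xsl:template match="tei:head">
  <h2><xsl:apply-templates/></h2>
 </xsl:template>

 <!-- A TEI paragraph becomes an HTML paragraph -->
 <xsl:template match="tei:p">
  <p><xsl:apply-templates/></p>
 </xsl:template>

</xsl:stylesheet>

Each template says what to produce when a given element is encountered. Elements for which no template is given are not dropped: XSLT's built-in rules pass their text through unchanged, so a stylesheet for a real project would need rules for many more elements than this sketch contains.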

Two more questions that often come up when discussing TEI are (1) must I create this encoding by hand? And (2) can I modify the TEI encodings and add my own tags? As to the first question, the answer is: to some extent. Usually, people create XML files with a program called an XML editor (more in the Choosing and installing an editor article in this series). Most modern XML editors can help you create the encoding, based on a schema. The schema defines what tags are allowed and where they are allowed, and the XML editor can prompt the user with a list of the allowed tags. The program can also check whether the tags that are used are actually allowed, in a process called validation. Some XML editors can show the definitions of these tags during editing, or adjust the layout and formatting of the text based on the tags that are used. Some offer a tagless interface for editing. All of these help very much in making XML editing a fairly straightforward process, depending on your background.

As to the second question: yes, you can modify the TEI encoding when your text has properties that the existing encodings really do not handle adequately. You can describe the extra tags that you need, or the ones that you need to change, in a special document called a TEI ODD file. The TEI provides a web application called Roma (http://www.tei-c.org/Roma/) which helps create such a document. Roma can also generate a schema from the ODD document. That schema, as we just saw, helps you create and validate your TEI document. Much more about this in the Defining, Creating and Using TEI Schemas article in this series.
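
To give an idea of what such a document contains, here is a sketch of the central part of an ODD file, the <schemaSpec> element, which would itself sit inside the text of an ordinary TEI document. The identifier 'diary' and the decision to remove the <gap> element are hypothetical choices made purely for illustration.

<schemaSpec ident="diary" start="TEI">
 <!-- the four modules that nearly every TEI schema includes -->
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <!-- the figures module provides, among others, figure and figDesc -->
 <moduleRef key="figures"/>
 <!-- remove an element that this hypothetical project does not need -->
 <elementSpec ident="gap" module="core" mode="delete"/>
</schemaSpec>

A schema generated from this specification would allow only the elements of the modules listed, minus the deleted one, so that an XML editor offers encoders a much shorter list of tags to choose from.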

The technologies that we have mentioned in this subsection (TEI, XML, HTML, XSLT, schemas, and XML editors) are more or less what will be discussed in the remainder of this series of articles.

1.2 What this article series does, and what it does not

If you have toyed around with downloading computer programs, you will have read 'Getting started with xyz' documents. Usually they are one or two pages long. The Getting Started series that you are reading now may be about 50 pages long. That means that, depending on your purposes, it will probably take you longer to get started using TEI than it takes to start using your average computer program. The reason for this is that you are not just learning to use a computer program; you are learning a new view of text, a vocabulary for expressing that view, a formalism for writing that vocabulary, and a set of programs for manipulating that formalism. The effort will depend on your existing skills, on your environment (do you have people around who can help you when you're stuck?) and on your ambitions.

If you have had no previous experience of some of the more technical aspects of computing, getting to use TEI is going to present you with a sometimes steep learning curve. But having learned to use TEI will do more than enable you to encode, present and analyse your documents. You will also have acquired a deeper understanding of your texts, and of text in general; you will have mastered a set of technologies that will be of use throughout your professional life; and you will have met a crowd of friendly and innovative scholars, programmers and librarians, which may prove to be an asset in your career as well as a reward in itself.

So, recognising that learning TEI is never going to be easy, this series of articles will help you get started. It is not a full course. Neither can it provide a substantive tutorial in the main technologies that TEI uses, such as XML, XSLT and HTML. We refer to tutorials and online training opportunities elsewhere wherever possible.

The article series aims to serve a diverse audience. Its readers may include academics (postgraduate, PhD, researcher, professor) who want to start understanding and using the TEI: e.g. a graduate student who wants to make an edition of a book he/she is writing a thesis on. The audience may also include people who are not themselves scholars but who are, or will be, working as encoders with a project. These may be undergraduates, editorial assistants, etc. We assume readers who are willing to learn something new, ready to work hard and to try various solutions. We also assume readers who are reasonably comfortable using computers, but do not expect experience in programming or previous knowledge of XML or HTML.

By 'reasonably comfortable using computers' we mean, among other things, that we expect you to have basic computer skills, such as creating disk folders, moving files, downloading and installing programs, and using zip/unzip software. There is no point in this document trying to explain very elementary computer tasks. If you feel your basic computer skills are inadequate, you may want to attend elementary computer training or study a beginner's tutorial.

After this initial article, the following article will describe a number of technologies that the TEI uses in text encoding (2 Technical background). The reasons for encoding properties of the text rather than the desired output are examined in 3 The rationale of declarative markup. The series then goes on to discuss (in 4 Choosing and installing an editor) XML editing software, the principal tool for creating and modifying TEI texts. Once we have an editor we can load, modify and validate TEI documents (5 Load, modify, validate a complete ready-made document). We use stylesheets (6 Running stylesheets) to transform XML documents into other documents, e.g. to read them on the web. Once you have seen this, you will want to begin applying it to your own texts (7 Getting this to work on sample of own text).

A final piece of technology that is explained in the series is XML schemas (8 Defining, Creating and Using TEI Schemas). By using schemas we describe exactly which elements can occur in a given XML document, and what elements they may contain. Using the description, we can check the validity of the XML document. That is one of the reasons that we want to define schemas that closely fit the documents that we are encoding. The TEI provides a mechanism for creating such schemas. What remains then is to ask what to do once you have studied the series (9 Where to go from here). The article series comes with a Glossary (getstartgloss.xml) of technical terms.

This article series should be useful irrespective of the computer platform you are using (Windows, Unix, Apple). In fact, an important reason why the TEI was created was that computer files tended to be specific to a single platform. Work that scholars did to encode historic texts was lost when platforms changed. Platform independence is therefore very important to the TEI. The software that we discuss runs on all major platforms, though things may not look exactly the same everywhere. Where we have to describe things that are specific to one or more platforms, we will clearly indicate this.

1.3 When to use and when not to use TEI?

The TEI Guidelines have been in development over a period of more than 20 years. They have grown into a quite formidable apparatus for tackling textual issues that may appear intimidating at first. It is not uncommon for people to wonder: do I have to do this? Isn't there an easier alternative, something closer to the techniques that I already know? This section will discuss situations when using TEI is appropriate and when it is less so.

To be clear, you can encode any form of text-bearing object, and even objects that don't have any text, in TEI. People also use the TEI to record only metadata about objects. If you are undertaking a text-encoding project in linguistics, social sciences, library sciences, or humanities then it is likely that the TEI is a good fit for your needs. Nonetheless there are situations where the use of TEI may be overkill or otherwise less appropriate.

1.3.1 Why not just HTML?

Quite frequently people know that their main publication output will be in HTML on the web, and they question whether they should not just author directly in HTML. As mentioned earlier in this document, the TEI has more elements than HTML, and more specific ones. Experience has taught that it is more useful to describe what something is, rather than what it looks like. Therefore, while in HTML encoding you might want to display in italics something that is a title, emphasized, a foreign phrase, or a specialized term, in TEI one would mark this with specific elements for titles, emphasis, foreign phrases, or terms. The benefit of this is that the material can easily be repurposed for different uses. Also, separating the presentation from the encoding allows one to easily change the form of presentation for all similar elements at a later date. Another example would be line numbers. If you edit in HTML, you either use line numbers or you do not. If you use TEI you mark the line-breaks in your text as such, creating the basis for producing an HTML text with line numbers and one without.
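
The difference can be sketched with a single invented sentence, encoded first in HTML and then in TEI. The sentence and the book title are made up for the purpose of illustration; the xml:lang value 'es' simply records that the word is Spanish.

<!-- HTML: four different things, one appearance -->
<p>In <i>A Book Title</i> the word <i>fugon</i>, a <i>loanword</i>,
   is <i>never</i> translated.</p>

<!-- TEI: the same sentence, saying what each phrase is -->
<p>In <title>A Book Title</title> the word
   <foreign xml:lang="es">fugon</foreign>, a <term>loanword</term>,
   is <emph>never</emph> translated.</p>

From the TEI version a stylesheet can still render all four phrases in italics, but it can equally well build a list of the titles mentioned, or show foreign words with a translation, neither of which is possible from the HTML version without guesswork.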

1.3.2 What is your material like?

For many genres (prose, poetry, plays, dictionaries, charters, language corpora and more) examples exist of successful TEI encoding. The TEI website contains a list of projects using TEI that can inspire prospective encoders. But one reason not to use TEI could be that for your genre or discipline there is an existing standard which suits your needs better. For example, if you are encoding a mathematical manuscript then the Mathematical Markup Language (MathML; Carlisle 2003) recommendation from the W3C is most likely a better choice. That said, if you are editing documents which contain mathematical formulas but also exhibit textual variation, TEI would be a better choice, as you can encode the textual variation using TEI features and embed MathML in your document to describe the formulas. This ability to incorporate other standards as part of a TEI document is one of its many strengths. Many other standards exist and the TEI has tried not to reinvent the wheel where a perfectly good one exists, instead allowing you to embed markup from other vocabularies in your TEI document where necessary. Additionally, if the TEI does not cater for some textual phenomena important to your work, then it is possible and even recommended to extend the TEI (in an approved and documented manner) to add new elements to deal with these.
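
As a sketch of what such embedding might look like, the fragment below places a tiny MathML expression inside the TEI <formula> element. The surrounding sentence is placeholder text, the prefix 'm' is an arbitrary choice, and we assume a schema that has been customised to allow elements from the MathML namespace inside <formula>.

<p>The text then states that
 <formula notation="mathml">
  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML">
   <m:mi>a</m:mi>
   <m:mo>+</m:mo>
   <m:mi>b</m:mi>
  </m:math>
 </formula>
 is unchanged in the later version.</p>

The paragraph remains an ordinary TEI paragraph that can carry any other TEI markup, while the formula itself is described in the vocabulary designed for that purpose.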

Another technology that people use frequently instead of the TEI, or other forms of XML, is relational databases. It makes sense to use a standard relational database when the nature of the data benefits from this form of storage and retrieval. One way to think about this division of information is whether it is data-like or document-like. Relational databases are good at storing single static fields where no markup is present in any of the fields. XML is good at deeply nested, arbitrary fields. An example might be an address and telephone list. If one is simply recording a flat list of names, phone numbers, and some address lines then a relational database is almost certainly sufficient. If the individual fields have internal structure that you want to record, then using a structured markup language like TEI XML is probably more suitable. So-called 'document-like' information is often characterised by multiple levels of nesting of structures, for example divisions containing paragraphs containing various phrase-level markup such as names, abbreviations, or foreign phrases. See Bradley & Short (2005) for a discussion of some of these issues.

In some cases, source material may contain structures which cannot be represented in TEI markup (say you are visually representing a ballet performance). In these cases TEI is clearly not the answer (though it might be an ingredient in a solution). However, it may also be the case that your material could be represented in TEI but the discipline you are working in has a significant number of tools and expertise available for another format. This will necessarily be a factor in the format you choose.

1.3.3 What is your desired output?

It is one of the basic ideas of descriptive markup that the presentation and output should be left as a separate step from the encoding of the information. Part of the reason for this is that one is able to leverage the markup available in a TEI document to produce all sorts of outputs (e.g. not only a rendering of a text, but indices of certain aspects, rearrangements or different views, and any number of statistics). The TEI provides basic customisable XSLT stylesheets for transforming TEI texts to HTML and PDF, and additionally there are ways to import and export TEI to/from word processors. One of the benefits of using XML as a format is that it is fairly easy to transform XML into other formats required for later software. The need to create output in multiple forms or formats is a clear reason for choosing TEI. Conversely, if your output is to be accessed only in a single format, other than TEI, then choosing that other format may be more suitable. Still, we would argue for carefully considering the potential advantages of multiple output forms.

1.3.4 How hard is learning TEI, and where do I get help?

The competence of the person undertaking the encoding is necessarily a consideration in choosing technologies. However, it should be emphasized that a fairly good understanding of basic TEI XML can be obtained quite quickly in comparison to many of the other skills people learn in order to undertake similar scholarly activities. Spending a few days learning TEI XML is an investment of time that is well worthwhile. That said, such an undertaking does presuppose a certain amount of basic computer literacy. It is worth remembering that in some cases it may also be possible to out-source the required TEI encoding of your documents. One good reason for using the TEI is that there is a large international community who are willing to help you as you encounter problems. Nonetheless, if the context in which you are undertaking the encoding (workplace, institution, etc.) provides no support for TEI or XML, and actively supports alternative solutions, then that should certainly be borne in mind. What would be more beneficial, of course, is to educate your colleagues as to the benefits of the TEI, and perhaps host a TEI Workshop with invited speakers to help train your local technical support. There are many scholars with basic computer literacy who have been more than able to teach themselves the TEI. There are also materials from many TEI Workshops available online.

1.3.5 Do long-term benefits outweigh initial costs?

One of the benefits of using the TEI is that you specify the textual distinctions that are useful for your purposes. This is intellectually more satisfying than only encoding for one platform such as web distribution. And yet, if you have no experience with the TEI, and simply need to do a single web page very quickly, then deciding to learn the TEI, and investigating the various options for transformation to HTML, is probably not the quickest route. We would argue, of course, that learning TEI is never time spent poorly.

1.3.6 But can markup do justice to my text?

There is a school of thought that holds that XML markup cannot do justice to the phenomena that literary scholars study. The main reasons for these doubts are that

  • (i) XML encodes hierarchical structures, while most of the phenomena that literary scholars are interested in cross hierarchical boundaries: quotations run across paragraph boundaries, speeches in plays overlap with metrical units in poetry, and textual variation is also insensitive to structural boundaries;
  • (ii) XML encoding is often perceived as inflexible and rigid, not suited to represent the creative process of scholarly discovery;
  • (iii) the need to learn and consider the technicalities of XML encoding is sometimes thought to be at odds with the scholarly work which the encoding should after all facilitate.

For a forceful statement of these views, see e.g. Smith et al. (2006).

Should these criticisms keep you from using TEI XML? It should not come as a surprise that we do not think so. This tutorial is not the right place for an in-depth discussion of these fundamental criticisms of XML as a format. It is true that overlap between textual phenomena can be a cause of complexity in text encoding, but there are numerous ways of dealing with it, and for the majority of people it is usually not a problem. As to the perception of inflexibility: it is clearly unwise to begin TEI encoding of a text without previous thought about the phenomena worth researching and worth tagging (Lavagnino 2006). However, when preparing a (collection of) text(s) for publication, some measure of uniformity is necessary, whether one uses XML or not. And in fact the use of XML can be argued to increase flexibility, as it facilitates multiple presentations of the text. Finally, while it is undoubtedly true that heavily marked up text can become difficult to read, this is more of an interface problem than an XML limitation. The problem may be alleviated by a tagless editing interface in XML editors or by editing applications that hide the existence of XML altogether. Another solution may be the use of so-called stand-off markup, i.e. markup that is not embedded within the text that it comments on but that points to segments of that text.

1.3.7 TEI and the history of digital text

Summing up, there is no single answer to the question of when to use TEI. The decision will depend on characteristics of the texts, the discipline, the persons and the project involved. A final consideration may be the unique place of the TEI in the history of the development of digital textual studies, digital humanities and the way we conceive of electronic text in the modern age. As the TEI was originally a pre-web endeavour, its decisions have influenced many of the people who have designed the systems, formats and processes which make up the web, including XML itself (see DeRose 1999 and Barnard et al. 1995).

If embarking on a text encoding project of any sort, it will do one no harm to examine the current TEI Guidelines. They have been developed and updated over the last couple of decades with input from countless scholars and experts from many fields. They have been revised again and again to introduce new concepts, expand older ones, and correct mistakes. This does not mean that there is no room for improvement, but simply that they have benefited from a long history of many eyes and experience with all sorts of texts. The Guidelines then provide a good codification of knowledge about the markup of various textual phenomena. It is quite likely that your problems and concerns are not unique and that others have encountered similar textual phenomena before you. After all, damage to the carving on a stone tablet is very similar, in a structural sense, to damage to a text in a medieval manuscript, or indeed a modern printed book rescued from a fire. For many text types, if you decide not to use TEI, you are likely to have to reinvent several wheels.

1.4 Overall structure of a TEI text

To introduce the structure of a TEI document, we will begin at the level of a smaller text fragment, and then gradually add the larger document structure around it. Our example is based on the Diary of Robert Graves. For educational purposes, the encoding that we will show here is slightly different from the encoding applied in that project.

1.4.1 A text fragment

Suppose we were to create an edition of the diary of Robert Graves. We have a facsimile of a diary page, which looks like this:

[Figure: Page from the diary of Robert Graves, Oct. 10 1938]

An initial transcription of the text might look like this:

Oct 10 Monday Ghost, completing ch IX Dictionary with Alan. A lot of time goes to making charcoal for 'Marthe', Beryl's now using this fugon for warming her attic. Went to Montauban with David – first visit for about 10 days – ordered small wood for Dorothy's cresset. Now almost always win at Cambeluk: we are playing a correspondence game with Harry. Nono broke Laura's particular coffee cup, [sketch of cup] and she her blue glass bottle given by Karl.

The first thing to remark here is that in a diary transcription we will want to identify the date that the entry belongs to. We may also want to say that this piece of text actually is a diary entry. The next thing we notice is that the entry has a heading (‘Oct 10 Monday’) and consists of a number of paragraphs. We introduce some XML elements and attributes to account for these facts.

<div type="diaryentry" n="1938-10-10">
 <head>Oct 10 Monday</head>
 <p>Ghost, completing ch IX</p>
 <p>Dictionary with Alan.</p>
 <p>A lot of time goes to making charcoal for 'Marthe',
   Beryl's now using this fugon for warming her attic.</p>
 <p>Went to Montauban with David – first visit for about 10
   days – ordered small wood for Dorothy's cresset.</p>
 <p>Now almost always win at Cambeluk: we are playing a
   correspondence game with Harry.</p>
 <p>Nono broke Laura's particular coffee cup, [sketch of cup]
   and she her blue glass bottle given by Karl.</p>
</div>

A fuller explanation of XML will be given in the Technical Background article in this series. For now, we will limit the explanation to saying that XML elements such as 'p' (paragraph) are delimited by what are called tags. An opening tag looks like '<p>', a closing tag looks like '</p>'. In running prose, we use the opening tag in discussing the element. Elements can have properties attached to them by specifying attributes, as in the <div> element in the example. When discussing an attribute, we will prefix its name with '@', as in '@type'.

The <div> element is used to describe textual divisions. We use here the @n attribute to indicate the date, and the @type attribute to state that this piece of text is a diary entry. <head> is used to describe a heading, <p> describes a paragraph. A question might be why we don't simply indicate paragraphs by newline characters, the way many programs do (such as Notepad on Windows). One reason is that different operating systems use different characters to indicate the beginning of a new line. Another reason is that an accidental new line character in our source would cause a rendering application to begin a new paragraph. The best way to avoid ambiguity and to make our encoding portable between platforms is to explicitly indicate new paragraphs. We might, in fact, even have indicated the locations where Graves begins a new line. TEI provides the <lb> element for that purpose. This, like most other decisions about encoding, is a matter of editorial choice. Is the beginning of a new line important enough to be retained in an edition? The TEI does not prescribe editorial policies.

Before moving on to the larger document structure, we will add a few more refinements. Not every project will find these refinements necessary, but they give an indication of the kind of phenomena TEI can handle. To begin with, the word ‘fugon’ is a word taken from Spanish, and should probably be italicised in an edition. We use the <foreign> element to indicate this. To explain what it means, we add a <note>. Then, there is a small drawing of a cup, which our transcription renders as ‘[sketch of cup]’. From the transcription, no-one would know that these are not Graves' own words. We will use a <figure> element and an embedded <figDesc> (figure description) to place the description. We will also indicate that ‘ch’ is an abbreviation (for 'chapter'). And finally, we will want to indicate a number of changes that Graves made in his text. We will use <del> and <add> elements to indicate deletions and additions to the text. The result is as follows:

<div type="diaryentry" n="1938-10-10">
 <head>Oct 10 <del>Tuesday.</del>
  <add>Monday</add>
 </head>
 <p>Ghost, completing <abbr>ch</abbr> IX</p>
 <p>Dictionary with Alan.</p>
 <p>A lot of time goes to making charcoal for 'Marthe',
   Beryl's now using this
   <foreign>fugon</foreign>
   <note>charcoal-burner</note>
   for warming her attic.</p>
 <p>Went to Montauban with David – first visit for about 10
   days – <del>got</del>
  <add>ordered</add> small wood for
   Dorothy's cresset.</p>
 <p>Now almost always win at Cambeluk: we are playing a
   correspondence game with Harry.</p>
 <p>Nono broke Laura's particular coffee cup, <figure>
   <figDesc>sketch of cup</figDesc>
  </figure> and she her blue glass bottle given by
   Karl.</p>
</div>
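
To give a first impression of what such encoding makes possible, here is a minimal sketch, in Python, of how a program might derive two readings from the heading above: the text as Graves first wrote it and the text as he left it. The sketch uses only Python's standard xml.etree.ElementTree module. The function names reading and normalized are our own inventions for this illustration, and the heading is copied without the TEI namespace to keep things short. It is one possible approach, not a prescribed way of processing TEI.

```python
import xml.etree.ElementTree as ET

# Simplified, un-namespaced copy of the heading from the entry above.
SAMPLE = "<head>Oct 10 <del>Tuesday.</del> <add>Monday</add></head>"

def reading(elem, skip):
    """Return the text of elem, leaving out any descendant named `skip`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(reading(child, skip))
        # Text that follows a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

def normalized(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

head = ET.fromstring(SAMPLE)
final_text = normalized(reading(head, skip="del"))  # deletions left out
first_text = normalized(reading(head, skip="add"))  # additions left out

print(final_text)
print(first_text)
```

Because deletions and additions are marked explicitly, both readings can be produced from the same source. A transcription that silently kept only the corrected text could not offer that choice.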

Let's now put this fragment into context.

1.4.2 The larger document structure

Texts do not live in isolation. They are often introduced by title pages, prefaces, tables of contents and dedications; they are often followed by an index, appendices, and similar material. We represent this structure by a <text> element that contains a <front>, <body> and <back>. In our case, the editors of Graves' correspondence have created monthly abstracts of the diary entries. These might very well have been placed in the <front> element. Supposing for the moment that there is no other front matter and no back matter, that would give us this:

<text>
 <front>
  <div type="abstract">
   <head>OCTOBER 1938</head>
   <p>The rains set in, and Graves works in his bedroom
       with the fire going. ...</p>
  </div>
<!-- abstracts for other months -->
 </front>
 <body>
  <div type="diaryentry" n="1938-10-10">
<!-- text of entry -->
  </div>
<!-- other entries -->
 </body>
</text>

Notice, by the way, the use of <!-- ... --> to write comments in XML text. Comments are used to explain something about the encoding to a human reader and will usually be ignored by programs that process the XML.

Apart from the texts that surround a main text, such as forewords and appendices, TEI texts are also provided with metadata, information about the text. All TEI texts come with a header that contains these metadata. The element is called <teiHeader>. The <teiHeader> and the <text> are children of the top level <TEI> element. So the overall structure looks like this:


<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader></teiHeader>
<text>
<front></front>
<body></body>
<back></back>
</text>
</TEI>

Notice the 'xmlns' on the <TEI> element: all TEI elements are part of the TEI namespace, http://www.tei-c.org/ns/1.0.
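
The namespace has practical consequences for software. As a hedged illustration (assuming Python and its standard xml.etree.ElementTree module; the variable names and the "tei" prefix are our own choices), the following sketch shows that a program looking for an element by its bare name finds nothing, whereas a lookup that supplies the TEI namespace succeeds:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

SKELETON = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader/>
 <text>
  <front/>
  <body/>
  <back/>
 </text>
</TEI>"""

root = ET.fromstring(SKELETON)

# ElementTree stores each element name as "{namespace}localname".
root_tag = root.tag

# A lookup by bare name finds nothing: "body" without the
# namespace is, as far as XML is concerned, a different name.
bare = root.find("text/body")

# Mapping a prefix of our choosing to the TEI namespace makes the lookup work.
ns = {"tei": TEI_NS}
body = root.find("tei:text/tei:body", ns)

# Local names of the children of <text>, in document order.
children = [child.tag.split("}")[1] for child in root.find("tei:text", ns)]

print(root_tag, bare, children)
```

The prefix used in the lookup is arbitrary; what matters is the namespace name it stands for.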

We will not go into details here about the content of the TEI header, but show a very minimal example. Again, this is borrowed, with simplifications, from the Diary of Robert Graves.

<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Diary of Robert Graves 1935-39 and ancillary
       material</title>
   <author>Robert Graves</author>
   <editor>...</editor>
  </titleStmt>
  <publicationStmt>
   <publisher>...</publisher>
   <pubPlace>...</pubPlace>
   <availability status="unknown">
    <p>...</p>
   </availability>
   <date>...</date>
  </publicationStmt>
  <sourceDesc>
   <p>...</p>
  </sourceDesc>
 </fileDesc>
</teiHeader>

The elements that are used here are largely self-explanatory. <availability> is used to describe the (legal) conditions under which the text is available. <sourceDesc> describes the source from which the electronic document (the TEI document) was created.

1.4.3 Two refinements

The encoding that we have shown up to now is of course only a very partial encoding. We will show two ways of creating a more informative document.

1.4.3.1 Attaching a facsimile

First, we may want to make explicit the relation between the image file that we transcribed and the transcribed content. For that purpose, the TEI provides the <facsimile> element, which is placed between the <teiHeader> and the <text> elements. The <facsimile> element describes the transcribed object in terms of <surface>s and (optionally) <zone>s within these surfaces. In our case, each surface corresponds to a diary page, and we have no need for zones. For each surface, we can define one or more <graphic> elements, which hold information about the images of that page. Each <surface> element is given an xml:id attribute, so that the transcription can point to it using the facs attribute. Let us see what this looks like:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader> ... </teiHeader>
 <facsimile>
  <surface xml:id="graves1938-10-10-1">
   <graphic url="graves1938-10-10.jpg"/>
  </surface>
 </facsimile>
 <text>
  <front>...</front>
  <body>
   <div type="diaryentry" n="1938-10-10" facs="#graves1938-10-10-1"> ... </div>
  </body>
 </text>
</TEI>

As you can see, we have added a <facsimile> element containing a <surface> for the page that we transcribed. The <surface> has an xml:id attribute. On the <div> element in the transcription we have added a facs attribute, whose value is a pointer (a URI reference). The value '#graves1938-10-10-1' points to the element whose xml:id attribute is 'graves1938-10-10-1', that is, the <surface> element. The <surface> element contains a <graphic> element that refers to an image of the current page.

We do this in order to define explicitly the relation between the transcription and the corresponding image files. Later users of the transcription will know what belongs together. Perhaps more importantly, applications that render the transcription no longer need built-in knowledge of the image file names: they can fetch the files they need from the location recorded in the document, using the mechanism specified in the Guidelines. This is an important step towards making general-purpose TEI applications.
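
To make this concrete, here is a minimal sketch of how an application might follow the facs pointer from a diary entry to the image of its page. It assumes Python and its standard xml.etree.ElementTree module. The function name image_urls is hypothetical, and the sketch handles only a single pointer to a <surface> in the same document, which is all our example needs.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
# xml:id lives in the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader/>
 <facsimile>
  <surface xml:id="graves1938-10-10-1">
   <graphic url="graves1938-10-10.jpg"/>
  </surface>
 </facsimile>
 <text>
  <body>
   <div type="diaryentry" n="1938-10-10" facs="#graves1938-10-10-1"/>
  </body>
 </text>
</TEI>"""

def image_urls(root, elem):
    """Follow elem's facs pointer to a <surface> and list its image URLs."""
    pointer = elem.get("facs", "")
    if not pointer.startswith("#"):
        return []  # this sketch only handles same-document pointers
    target = pointer[1:]
    for surface in root.iterfind("tei:facsimile/tei:surface", TEI):
        if surface.get(XML_ID) == target:
            return [g.get("url") for g in surface.iterfind("tei:graphic", TEI)]
    return []

root = ET.fromstring(DOC)
entry = root.find("tei:text/tei:body/tei:div", TEI)
urls = image_urls(root, entry)
print(urls)
```

Nothing in this code depends on how the image files happen to be named: the file name is read from the document itself.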

1.4.3.2 Repurposeable notes about persons

A further enhancement of the document's value would be to provide some explanations about the many persons mentioned, but not identified, in the diary entries. We could create <note> elements, the way we did to explain fugon, but most persons recur many times throughout the diary. What we would like to have is a single explanation per person, that we can show whenever it is needed.

One way to create such reusable explanations is to make use of the participant description in the TEI header. The <particDesc> element can contain a list of persons (a <listPerson> element containing <person>s). For each <person> we could, if we wanted to do so, give a structured description in terms of a number of characteristics (sex, age, etc.). But we can also limit ourselves to an informal description using a <p> element. Whenever a specific person occurs in the diary, we can then refer to the description of that person in the header. The element that we will use to refer to the description is the <rs> element. The <rs> element (referring string) is somewhat like a name element, but more generic: names are referring strings, but so are 'her oldest son' or 'the gardener'. This gives us something like the following:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc/>
  <profileDesc>
   <particDesc>
    <listPerson>
     <person xml:id="AH">
      <p>Alan Hodge. Oxford history graduate.
             Became close friends with Laura Riding
             &amp; Robert Graves. First husband of
             Beryl Graves.</p>
     </person>
     <person xml:id="BP">
      <p>Beryl Pritchard. Daughter of Harry
             and Amy Pritchard, Robert Graves's
             second wife. Formerly married to Alan
             Hodge. Robert and Beryl had four
             children: William, Lucia, Juan and
             Tomas.</p>
     </person>
     <person xml:id="DR">
      <p>David Reeves. Brother of James
             Reeves.</p>
     </person>
    </listPerson>
   </particDesc>
  </profileDesc>
 </teiHeader>
 <facsimile>..</facsimile>
 <text>
  <front>...</front>
  <body>
   <div type="diaryentry" n="1938-10-10" facs="#graves1938-10-10-1">
    <head>Oct 10
    <del>Tuesday.</del>
     <add>Monday</add>
    </head>
    <p>Ghost, completing <abbr>ch</abbr> IX</p>
    <p>Dictionary with <rs ref="#AH">Alan</rs>.</p>
    <p>A lot of time goes to making charcoal for
         'Marthe', <rs ref="#BP">Beryl</rs>'s now
         using this
    <foreign>fugon</foreign>
     <note>charcoal-burner</note>
         for warming her attic.</p>
    <p>Went to Montauban with <rs ref="#DR">David</rs> – first visit for about ...
    </p>
   </div>
  </body>
 </text>
</TEI>

For three persons in the diary fragment, we have created a <person> element. We have surrounded the names in the diary with <rs> tags and let the <rs> elements point to the <person>s using the ref attribute.
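
The following sketch suggests how a program could use these pointers to show the single explanation wherever a person is mentioned. It assumes Python and its standard xml.etree.ElementTree module. The variable names people and annotations are our own, and both the descriptions and the diary text are shortened versions of the example above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"tei": TEI_NS}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abbreviated copy of the example: two persons, two references.
DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <profileDesc>
   <particDesc>
    <listPerson>
     <person xml:id="AH"><p>Alan Hodge. Oxford history graduate.</p></person>
     <person xml:id="DR"><p>David Reeves. Brother of James Reeves.</p></person>
    </listPerson>
   </particDesc>
  </profileDesc>
 </teiHeader>
 <text>
  <body>
   <div type="diaryentry" n="1938-10-10">
    <p>Dictionary with <rs ref="#AH">Alan</rs>.</p>
    <p>Went to Montauban with <rs ref="#DR">David</rs>.</p>
   </div>
  </body>
 </text>
</TEI>"""

root = ET.fromstring(DOC)

# Index every <person> by its xml:id, collapsing whitespace in the description.
people = {
    person.get(XML_ID): " ".join("".join(person.itertext()).split())
    for person in root.iter("{%s}person" % TEI_NS)
}

# Pair each referring string with the description its ref attribute points to.
annotations = [
    (rs.text, people.get(rs.get("ref", "").lstrip("#")))
    for rs in root.iterfind(".//tei:rs", TEI)
]

for name, description in annotations:
    print(name, "->", description)
```

Each description is stored once, in the header, however often the person is mentioned in the diary.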

This concludes the discussion of our first TEI example. The choices made in the encoding of this diary fragment are not meant to suggest what is mandatory and what not: the editorial policies of individual projects must decide, e.g., whether to retain or to expand abbreviations or to do both. Similarly, a transcription that ignores deleted text can be perfectly valid TEI. Decisions about what to encode should be driven by research interest (and, inevitably, practical feasibility).

1.5 Summary

In this article, we have given an overview of the TEI for those who want to get started applying it. We gave a high-level overview of what TEI encoding is, and explained the thinking behind the series of articles of which this one is the first instalment. We discussed the circumstances under which applying the TEI makes sense, and when it does not. Finally, we gave an extended example of the overall structure of a TEI text. We saw that a TEI text always contains a header (<teiHeader>) that describes that text. The text itself (<text>) may, in addition to the text <body>, contain <front> and <back> matter. Additionally, a TEI-encoded text may contain a <facsimile> element that relates the transcription to the pages that have been transcribed and to images of these pages. The transcription itself is usually structured using <div> elements for the textual divisions, which in turn contain <p> elements (if we are dealing with prose).

We also saw a number of extras: we identified abbreviations, we encoded additions and deletions, and we annotated references to persons. As said earlier, there is nothing in the TEI which obliges us to provide that information. Decisions about whether to encode such features depend, among other things, on the text being edited, the type of edition that is being planned, and the resources that are available. For those who do want to encode them, however, the TEI provides the necessary mechanisms.

But what exactly is XML, and why are we using it? Why do we provide our texts with this type of encoding at all? Why don't we use a WYSIWYG editing environment, the way modern word processors do? How can we process files such as the one we saw? The next instalments in this series will deal in depth with some of the questions that we could only touch upon in this article.

1.6 Literature

Overviews of the TEI are given in Cummings (2007), Romary (2009) and Vanhoutte and Van den Branden (2010). A cogent case for using the TEI is made by Sperberg-McQueen (2009). An argument against the use of embedded markup is made by Schmidt (2010).

  1. Barnard, David T., et al. (1995), Lessons for the World Wide Web from the Text Encoding Initiative, in: World Wide Web Journal, 1, 349-57.
  2. Bradley, John and Short, Harold (2005), Texts into Databases. The Evolving Field of New-style Prosopography, in: Literary and Linguistic Computing, 20 (Suppl. 1), 3-24.
  3. Carlisle, David, et al. (2003), Mathematical Markup Language (MathML) Version 2.0 (Second Edition). W3C Recommendation 21 October 2003. http://www.w3.org/TR/MathML2, accessed 2010-05-24.
  4. Cummings, James (2007), The Text Encoding Initiative and the Study of Literature, in: Ray Siemens and Susan Schreibman (eds.), Blackwell Companion to Digital Literary Studies (Malden (Ma): Blackwell), 451-76.
  5. DeRose, Steven (1999), XML and the TEI, in: Computers and the Humanities, 33 (1-2), 11-30.
  6. Graves, Robert (1895-1985), (2002), Diary of Robert Graves 1935-39 and ancillary material, compiled by Beryl Graves, C.G. Petter, L.R. Roberts. University of Victoria Libraries, Victoria B.C., Canada. http://graves.uvic.ca/.
  7. Lavagnino, John (2006), When Not to Use TEI, in: Lou Burnard, Katherine O'Brien O'Keeffe, and John Unsworth (eds.), Electronic Textual Editing (New York: MLA), 334-38.
  8. Romary, Laurent (2009), Questions & Answers for TEI Newcomers, in: Jahrbuch für Computerphilologie. http://computerphilologie.de/jg08/romary.pdf, accessed 2010-05-21.
  9. Schmidt, Desmond (2010), The inadequacy of embedded markup for cultural heritage texts, in: Literary and Linguistic Computing, [advance access], fqq007.
  10. Smith, Jeff, Deshaye, Joel, and Stoicheff, Peter (2006), Callimachus--Avoiding the Pitfalls of XML for Collaborative Text Analysis, in: Literary and Linguistic Computing, 21 (2), 199-218.
  11. Sperberg-McQueen, C. M. (2009), How to teach your edition how to swim, in: Literary and Linguistic Computing, 24, 27-39.
  12. Vanhoutte, Edward and Van den Branden, Ron (2010), The Text Encoding Initiative, in: Marcia J. Bates and Mary Niles Maack (eds.), Encyclopedia of Library and Information Sciences (1).
  13. Wisneski, R. and Dressler, V. (2009), Implementing TEI Projects and Accompanying Metadata for Small Libraries: Rationale and Best Practices, in: Journal of Library Metadata, 9 (3), 264-88.

1.7 Tutorials

Date: 2013-03-21