“I shall lose my life for want of language”: Shakespeare and digital formats

The Will’s World registry is intended to record three categories of data:

  1. Metadata about services
  2. Metadata harvested or contributed from those services
  3. Annotated plays

The first will describe the services that are available, and will support searching for services that have specific features and that produce particular types of data about aspects of Shakespeare’s work, history and contemporaries. These data will be contributed by a service during registration, although in the initial stage of the project they will be added manually.

The second category will be metadata that we have retrieved from services through queries or scraping from HTML or search interfaces, and will be searchable as an aggregated metadata resource.

The third type of data is different because it involves storing target data rather than just metadata – there will be the actual structure and content of the literary resource. Several questions arise: How will the data be searched, retrieved and visualised? Will it be possible to attach metadata search results to it? Which language will be used to mark up the text?

Why store data as well?

Will’s World should be a useful access point for a variety of Shakespeare resources. It should be possible to run a search and retrieve metadata on items from a range of services. What will a user then do with this metadata? The answer to this question is something we don’t know and do not intend to dictate, but one likely use is to link the metadata to the text of the plays. A marked up electronic text can be seen as a backbone which a user might develop with multimedia resources from a number of services, to produce anything from an enriched script for theatre or teaching use, to an online application linking plays to performances, to a mobile phone app. Some examples of what developers and creative types can devise with little more than basic marked-up text were produced at the recent hackday.

What about formats and schemas?

There are a handful of marked up versions of Shakespeare’s plays:

  1. Jon Bosak provides a full set of annotated Shakespeare as an XML test package, which is potentially a great starting point with a basic DTD.
  2. Shakespeare play schema by Susan Kelsch. This appears to be quite comprehensive in representing aspects of Shakespeare plays (including the major play groupings), to the extent that it is not generic enough for any other application. I am not aware of the DTD being put into use in any data.
  3. Open Source Shakespeare (OSS) provides comma-delimited, tilde-demarcated database fields representing the Globe Shakespeare. We converted this into XML using a simple Play-Act-Scene-Paragraph-Line hierarchical schema, which provides an easy-to-understand version of the play structure. The CharID element within each Paragraph indicates the speaker of the speech.
  4. Perseus has plays from the Globe Shakespeare, encoded using TEI and available under Creative Commons 3.0.
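
As a rough illustration of the conversion described in item 3, the following Python sketch builds a Play-Act-Scene-Paragraph-Line tree from delimited records. The field order (act, scene, CharID, text), the exact quoting with tildes, the "/" separator between verse lines and the "n" and "title" attributes are all assumptions made for the example; they are not taken from the actual OSS export or from the schema we used.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: tilde-wrapped, comma-separated fields in the order
# act, scene, CharID, text. "/" separates verse lines within a speech.
# The real OSS field layout may differ.
SAMPLE = """~1~,~1~,~BERNARDO~,~Who's there?~
~1~,~1~,~FRANCISCO~,~Nay, answer me: stand, and unfold yourself.~
~1~,~2~,~KING~,~Though yet of Hamlet our dear brother's death/The memory be green~"""


def records_to_xml(raw, title):
    """Build a Play > Act > Scene > Paragraph > Line tree from delimited records."""
    play = ET.Element("Play", {"title": title})
    acts, scenes = {}, {}
    for record in raw.splitlines():
        # Drop the outer tildes, then split on the tilde-comma-tilde boundary
        # so that commas inside the text field are left alone.
        act_no, scene_no, char_id, text = record.strip("~").split("~,~")
        if act_no not in acts:
            acts[act_no] = ET.SubElement(play, "Act", {"n": act_no})
        key = (act_no, scene_no)
        if key not in scenes:
            scenes[key] = ET.SubElement(acts[act_no], "Scene", {"n": scene_no})
        para = ET.SubElement(scenes[key], "Paragraph")
        # CharID identifies the speaker of the paragraph.
        ET.SubElement(para, "CharID").text = char_id
        for line in text.split("/"):
            ET.SubElement(para, "Line").text = line.strip()
    return play


play = records_to_xml(SAMPLE, "Hamlet")
print(ET.tostring(play, encoding="unicode"))
```

The resulting tree can be queried with simple paths such as `Act/Scene`, which is the main attraction of a fixed hierarchy like this one.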

Jon Bosak’s XML is useful to have but, as he says in his caveat, it “should not be relied upon for scholarly purposes” and is intended “purely as a learning exercise … a benchmark … and as a resource for testing”. The other schemas are somewhat ad hoc and very Shakespeare-specific. It is good that they are tailored closely to their purpose, but it also makes them non-transferable. When plays by different authors are marked up using subtly or markedly different schemes, they become difficult, if not impossible, to compare. It also makes it unlikely that such schemas will find a wider audience unless they really describe Shakespeare better than any other option out there.

The XML that we generated from the OSS texts retains typographical elements, such as square brackets indicating that a line is a stage direction, that should be encoded into TEI. It fulfilled the requirements of the Culture Hack event and provided a usable baseline of marked up text for the participants, but as an ad hoc format it is of limited use unless other people adopt it more widely.
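
To make the point about square brackets concrete, here is a minimal Python sketch of how such a typographical convention might be turned into structural markup. It assumes that a stage direction occupies a whole line wrapped in square brackets; inline directions and multi-line directions are not handled. The function name `encode_line` is hypothetical, while `<stage>` and `<l>` are the TEI element names for stage directions and verse lines.

```python
import re
import xml.etree.ElementTree as ET

# Matches a line that consists solely of a bracketed direction, e.g. "[Exit Ghost]".
BRACKETED = re.compile(r"^\[(.+)\]$")


def encode_line(text):
    """Return a <stage> element for a bracketed line, otherwise an <l> element."""
    stripped = text.strip()
    match = BRACKETED.match(stripped)
    if match:
        element = ET.Element("stage")
        # Keep the direction itself and drop the typographical brackets.
        element.text = match.group(1)
    else:
        element = ET.Element("l")
        element.text = stripped
    return element


for raw in ["Stay! speak, speak! I charge thee, speak!", "[Exit Ghost]"]:
    print(ET.tostring(encode_line(raw), encoding="unicode"))
```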

The question: is it better to use a schema developed specifically for Shakespeare’s plays (or plays in general), or something more general like TEI?

Literature Markup Languages (LML and XLDL)

Literature Markup Language and its predecessor and foundation, Literature Description Language, are attempts to provide an XML markup specifically for literature. They are designed to cover literature at all levels – pamphlets, prose, poetry, plays, criticism. These MLs look promising because they are pitched at a level wider than the works of a specific author, but within a particular field of writing that exhibits a range of common properties – things like figurative language, a variety of structures (some very well defined, such as haiku; others looser) and a range of utterance.

It is encouraging that there are efforts to define the appropriate terminology and structural elements of literature. The LML definition provides for act, scene and genre elements, and the ability to specify the tone of utterances. It is intended to use RDFa to enhance the semantics available in XHTML elements. It should be of much wider application than Shakespeare, though it is not clear whether it is capable of representing complex non-hierarchical structures.

It looks like the creator of LML, Dr Olaf Hoffmann, has thought carefully about the diverse requirements for representing literary works. However it seems an omission that there is no mention of the Text Encoding Initiative (TEI), whose long-lived and widely-used standard makes several recommendations for its application to dramatic and literary texts. I must also admit that I don’t understand why the rationale for LML’s creation is so closely linked with the vector graphics format SVG – I’m not sure of Dr Hoffmann’s use case for marking up text in SVG files.

These schemas have apparently shown little development in the past few years and I am not aware of any projects that have applied the languages.

Is LML/XLDL more appropriate than TEI?

LML contains hard definitions of what it considers the appropriate hierarchical levels of plays to be – namely act, scene, speech and line. While these are more or less universal in drama and are certainly appropriate for Shakespeare’s plays, they are not essential. For example, the act is a typically western construct, and modern literature regularly plays with the structural elements of literature. Defining the levels as elements therefore potentially limits the application of LML.

TEI does not define scene and act elements, but rather keeps the elements abstract and specifies their role through a type attribute. This allows for more possibilities but can render the markup somewhat verbose and less readable, and subsequent querying more convoluted. TEI is sprawling; it attempts to cover so many possibilities that its flexibility tends to render it too complex and too vague for many applications, and the ways it is applied so disparate that it can still be hard to compare marked up texts like-for-like.
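
The difference in querying can be sketched with Python's standard `xml.etree.ElementTree` module. Both fragments below are hand-written for illustration: the first uses dedicated element names in the manner described for LML (not checked against the LML definition), and the second uses generic divisions with a type attribute in the TEI manner, with the TEI namespace omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with dedicated element names for each level.
DEDICATED = "<play><act n='1'><scene n='1'/><scene n='2'/></act></play>"

# The same structure using generic divisions whose role is given by 'type'.
TYPED = (
    "<body><div type='act' n='1'>"
    "<div type='scene' n='1'/><div type='scene' n='2'/>"
    "</div></body>"
)

# Scenes of Act 1 when the levels are element names.
dedicated_scenes = ET.fromstring(DEDICATED).findall("./act[@n='1']/scene")

# The equivalent query when the levels are attribute values.
typed_scenes = ET.fromstring(TYPED).findall(
    "./div[@type='act'][@n='1']/div[@type='scene']"
)

print(len(dedicated_scenes), len(typed_scenes))
```

Both queries select the same scenes, but the second has to spell out the role of every division, which is the verbosity referred to above. In exchange, a new kind of division needs only a new attribute value, not a change to the schema.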

Although TEI has its limitations, it does have provenance and a large and active community.

Text Encoding Initiative (TEI)

Electronic Textual Editing: Drama Case Study: The Cambridge Edition of the Works of Ben Jonson by David Gants provides a good discussion of some of the challenges involved in marking up dramatic texts, and of how TEI attempts to handle them. See in particular the section on Encoding Drama. Quotes in this section are taken from this article.

TEI provides a set of guidelines for encoding texts at a semantic level which can be turned into a contextually appropriate representation when necessary. Within those guidelines it describes “encoding strategies designed specifically for drama”. This allows more details to be encoded than the baseline of straight act/scene/line markup, for example the shifting perception of a character. Attributes of a speaker can be modified through the course of the play, and characters can share lines. Even Ben Jonson’s plays provide a wide range of complex structures which can be difficult to properly represent. Typographical schemes were devised in the sixteenth century for representing these features on the page, but they can be difficult to transform into a structural representation in XML, which is essentially hierarchical.

The broad hierarchical features common to many plays are pretty straightforward to represent – acts, scenes and lines; speeches and speakers; individuals and chorus. However sometimes these features can overlap or otherwise become complicated. A handful of real-life textual features that TEI aims to support are:

  • The difference between the character/actor speaking and who is perceived to be speaking. For example “plays that deal with English historical subjects will often alter the speech prefix assigned a character as the title and status of that character changes, such as Bolingbroke/Henry IV or Gloucester/Richard III.”
  • Simultaneity and interweaving of utterances. The case study describes markup allowing the reconstruction of a letter (a fictional entity in the imagined world of the play, or a prop on stage) whose contents are read out, distributed over several character utterances with speeches overlapping. With appropriate markup it is then possible to extract objects/structures from the text.
  • Rhyme scheme <rhyme> and meter (the met attribute).
  • Stage directions <stage> and character movement <move>.
  • Poetic sub-structures (stanzas, line groups, verse paragraphs).
  • The rend attribute describes how to render the text of an element, for inline encoding of typographical features.
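
The first of these features can be sketched in a few lines of Python. The fragment is hand-written in the TEI style, with invented identifiers and placeholder lines in place of real text, and with the TEI namespace omitted. It assumes the common TEI pattern in which the who attribute on a speech points at the character, while the speaker element records the speech prefix as printed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written fragment in the TEI style; the ids and the lines are illustrative only.
FRAGMENT = """
<div type="scene">
  <sp who="#henry4"><speaker>Bolingbroke</speaker><l>First illustrative line.</l></sp>
  <stage type="entrance">Enter York</stage>
  <sp who="#york"><speaker>York</speaker><l>Second illustrative line.</l></sp>
  <sp who="#henry4"><speaker>King Henry</speaker><l>Third illustrative line.</l></sp>
</div>
"""

labels_by_character = defaultdict(list)
for speech in ET.fromstring(FRAGMENT).iter("sp"):
    # 'who' identifies the character; <speaker> holds the printed speech prefix.
    labels_by_character[speech.get("who")].append(speech.findtext("speaker"))

print(dict(labels_by_character))
```

Grouping by the who attribute collects all of a character's speeches even when the printed prefix changes with the character's title, which a search on the prefix alone would miss.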

Although a minimal set of tags tends to get used from any given TEI schema, it can be used to enrich the text with all sorts of supplementary metadata, to provide further semantic and analytical richness to the text (short of providing critical commentary), thus supporting a variety of literary analyses. This is not something we aim to represent, or have the knowledge or resources to produce, but it is one enrichment activity that someone might like to perform given the basic markup.

I’m not convinced about the use of number-suffixed div elements <div0>, <div1> which results in an arbitrary number of new element names. It seems more in the spirit of XML to put these indices in an attribute.
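
A minimal Python sketch of the alternative: the numbered element names are collapsed into a single div element and the number is moved into an attribute. The attribute name "level" is my own invention for this example and is not part of TEI.

```python
import re
import xml.etree.ElementTree as ET

NUMBERED_DIV = re.compile(r"^div(\d+)$")


def normalise_divs(root):
    """Rename <divN> elements to <div>, moving N into a 'level' attribute."""
    for element in root.iter():
        match = NUMBERED_DIV.match(element.tag)
        if match:
            element.tag = "div"
            # 'level' is a hypothetical attribute name chosen for this sketch.
            element.set("level", match.group(1))
    return root


source = "<body><div1 type='act'><div2 type='scene'/></div1></body>"
print(ET.tostring(normalise_divs(ET.fromstring(source)), encoding="unicode"))
```

TEI itself also permits un-numbered, nested div elements, where depth is implied by the nesting and no index is needed at all.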

Editions of Shakespeare

There are several editions of Shakespeare, including the following well-known publisher editions:

  • The Arden Shakespeare is described as “the world’s most recognised scholarly edition”. It is commercial and comes at a cost.
  • The Riverside Shakespeare is a long-running series started at the tail end of the 19th century, and includes a scholarly edition. I believe it also established some typographical standards for representation of aspects of the text.
  • The Globe Shakespeare, a 19th century edition produced by Cambridge editors Clark and Wright. Some dispute their inclusion of particular textual variations.

Which of these is the best option for inclusion depends very much on which edition is available in a marked up form. Ultimately it would be great to be able to store several versions (with concordances) and provide parallel access to them with comparison tools.


With such a variation in possible digital representations of the plays, we should consider what the purpose is of storing marked up texts, and what the possible applications are.

What is required in an electronic representation?

Searchability, the ability to distinguish between each component and reconstruct any hierarchy that inheres in the text.

Why is it important to agree on a scheme?

There is little need to stress the benefits of having a standardised representation. If a representation scheme is agreed and then adhered to by a number of people, the data of literary artefacts become well defined and their structure can be used predictably. Tools and interpretations can be built up around and from the common format.

What is required of a Shakespeare play schema?

Probably nothing too different to what is required of any other dramatic or literary schema.

Why is there no agreed format?

TEI is the most recognised format; the LML/XLDL effort is interesting but appears not to have been applied to anything as yet.


This post has looked at some of the options out there for Shakespeare markup and annotation. There are some interesting XML definitions, but apparently no core set of professionally annotated texts employing a well-defined schema. We do not have the resources to mark up a text ourselves, but we can process whatever editions are available.

The Perseus markup seems to provide a mix of an appropriate edition, a well-established and flexible encoding scheme, and a richness of annotation, and will form the primary text of our registry. It is more complex than other schemas, and so may take more processing to render for presentation; however, the extra effort appears worthwhile in order to provide as much well-organised information as we can.

In the future, it would be good to see a single encoding scheme become the standard for Shakespeare’s plays, and thereby contribute to a unified effort combining crowd-sourcing, annotation, and a rich diversity of interpretations, transformations, visual representations and unforeseen recombinations of texts, metadata and other resources.

Good Grieffe, More Robot Suits

Finally, a word of warning from T-Rex, about ensuring the long term preservation of Shakespeare in appropriate formats.



2 Responses to ““I shall lose my life for want of language”: Shakespeare and digital formats”

  1. Owen says:

    Thanks for a very interesting write-up on this – especially the detailed analysis of the use of TEI vs LML (I hadn’t come across the latter before).

    I wonder if you looked at any of the work to describe stories/narrative. I’m thinking specifically OntoMedia (http://contextus.net/ontomedia), the Stories Ontology (http://contextus.net/stories/) and related work at http://contextus.net. Of course the aims are slightly different, but hopefully it is of interest. Contextus.net has an RDF version of the story of A Midsummer Night’s Dream available from http://contextus.net/datastores

  2. nmayo says:


    Glad you found this useful, and thanks for the links to Contextus. I hadn’t come across that but am interested in how literary works may be marked up, particularly in the capacity of critical analysis – how does one mark up a text to indicate figurative language, cultural references, irony and so on? There is a rich set of options for marking up linguistic features, one example being the NITE XML Toolkit which supports overlapping hierarchies using ‘standoff’ annotation (disclaimer: this is a project I have worked on). Here is an example corpus annotation. However the semantics of a text are necessarily somewhat nebulous, and can shift with context.

    I like the use of XPath xpointers into the Perseus TEI texts; I didn’t realise that was supported. The RDF of A Midsummer Night’s Dream illustrates a great way to add value to the base text by adding a semantic level of ‘events’ without duplicating the content. This standoff annotation allows for the overlaying of multiple semantic or linguistic levels, which must be considered a requirement in a discipline which by its nature has multiple interpretations and a variety of levels of engagement (linguistic, critical, theatrical, narrative).

    It’s good that the stories ontology supports ‘narrative’ in a variety of contexts – fiction, history, news – while supporting the annotation of structural features like sub_story, and relationships between texts/things (contextualises, interprets).

    I may look at these resources in more detail in another blog post – thanks again!
