The Cavalry isn’t Coming

Of course, no computer program can define or create theoretical concepts for language description. Programs can only make it easier to handle the evidence of how the language is used, which amounts to making the data as findable, explorable, portable, and future-proof as possible. If software can help linguists show how well their grammatical categorizations characterize a corpus, then it serves to advance science. Data that are linked to categorizations are more accountable and thus more open to peer review.

Computer programs should model information about the real world, but information and the ways we interact with it are inherently complex. Thus, by the nature of the information which programs model, the process of learning to program must be an iterative one. There is no single “correct” way to represent information digitally; there are only specific computer programs which model information and make that model manipulable in a more or less usable and useful way.

If we wish to create software to serve the goals of documentary linguists, then what are the specific problems that software should assist us in solving? As we have discussed in the previous chapter, documentary linguists have already developed a set of concepts and accompanying nomenclature for describing human language. These include the very general notions (“word”, “sentence” and so forth) described in that chapter, but also the endless array of specialized theoretical concepts used throughout the field (grammatical categories, comparative concepts, universal tendencies, etc.).

Traditionally, software development has been considered to be outside the domain of linguistics itself. To be sure, there is a broad and fairly long history of software development designed specifically for linguists (or at least adopted by them), but the process of actually implementing software has mostly been treated as a black box by the linguistics community: linguists are expected only to use software tools, not to create them. But one change in the landscape of software development has proven to be epochal: the appearance and subsequent evolution of the World Wide Web. The web began primarily as a system for sharing and interlinking documents over computer networks (as “hypertext”), but it has acquired important new capabilities which are not necessarily familiar to the average user of the internet. The web and its surrounding technologies are more than sufficient to address all of the kinds of data that documentary linguists need in their work.

The problem of maintenance


The source is open, but the project is no longer active and hasn't been edited for 5 years.

Evaluating docling.js in terms of Bird & Simons

Seven Problems for Portability

How well does docling.js meet these criteria? [@@foreshadowed in ch 1]

Existing tools do not address the needs of documentary linguists

The problems in workflow

No generic data-editing software is designed to handle the particular nested structures necessary for documentary linguistics.

ELAN was not designed for linguistic data.

The DLx model of linguistic data is quite different from that used in the ELAN Annotation Format (EAF).

Most strikingly, EAF strives to make timestamps optional on a given annotation. An annotation without timestamps is represented in EAF as a REF_ANNOTATION element. Such an element is, however, required to have an ANNOTATION_REF attribute whose value is the ANNOTATION_ID of an annotation which does have timestamps, namely an ALIGNABLE_ANNOTATION. Other than this distinction with regard to anchoring within the media timeline, ALIGNABLE_ANNOTATIONs and REF_ANNOTATIONs are identical.

Here is a sample REF_ANNOTATION:

    <ANNOTATION_VALUE>my home is your home</ANNOTATION_VALUE>

And here is the ALIGNABLE_ANNOTATION to which it refers via its ANNOTATION_REF value:


Note that the value of the ANNOTATION_REF attribute in the first example is identical to the ANNOTATION_ID value of the ALIGNABLE_ANNOTATION. In this case, the first is an English translation of the second, which is a time-aligned transcription of an utterance in Spanish.
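
The reference mechanism can be sketched in Python. The fragment below is a hypothetical, pared-down EAF excerpt written only for illustration: the annotation IDs `a1` and `a2`, the time-slot names, and the millisecond values are assumptions, not taken from a real corpus, and `resolve_times` is an illustrative helper, not part of ELAN.

```python
import xml.etree.ElementTree as ET

# Hypothetical, pared-down EAF fragment; all IDs and time values are invented.
EAF_FRAGMENT = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="5000"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>mi casa es su casa</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" PARENT_REF="transcription">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>my home is your home</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def resolve_times(root, annotation_id):
    """Return (start_ms, end_ms) for an annotation, following ANNOTATION_REF links."""
    slots = {s.get("TIME_SLOT_ID"): int(s.get("TIME_VALUE"))
             for s in root.iter("TIME_SLOT")}
    alignable = {a.get("ANNOTATION_ID"): a
                 for a in root.iter("ALIGNABLE_ANNOTATION")}
    refs = {a.get("ANNOTATION_ID"): a for a in root.iter("REF_ANNOTATION")}
    # Follow references until a time-aligned annotation is reached.
    while annotation_id in refs:
        annotation_id = refs[annotation_id].get("ANNOTATION_REF")
    anchor = alignable[annotation_id]
    return slots[anchor.get("TIME_SLOT_REF1")], slots[anchor.get("TIME_SLOT_REF2")]


root = ET.fromstring(EAF_FRAGMENT)
print(resolve_times(root, "a2"))  # the translation borrows the transcription's span
```

The point of the sketch is that the translation has no position in the timeline of its own: any program that wants to know when it was spoken has to look it up through another annotation.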

To think this through, we can invent a much simpler example, and rename all the elements and attributes:

<translation annotationID="a2" translationOf="a1">
   my home is your home
</translation>

<transcription annotationID="a1" start="0" end="5">
   mi casa es su casa
</transcription>

This is a little weird. Why not think of <transcription>s and <translation>s as being associated together with a single set of timestamps? Like this:

<utterance start="0" end="5">
  <transcription>mi casa es su casa</transcription>
  <translation>my home is your home</translation>
</utterance>
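
The same idea can be written as a data structure. The following is a minimal Python sketch, assuming a hypothetical `Utterance` class and a hypothetical `merge_linked` helper (neither name comes from DLx or ELAN), of how reference-linked transcriptions and translations might be folded into single utterances that each own one time span.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One stretch of speech: a single time span shared by all its annotations."""
    start: float            # seconds from the start of the recording
    end: float
    transcription: str
    translation: str = ""   # may be filled in later


def merge_linked(transcriptions, translations):
    """Fold reference-linked records into utterances.

    transcriptions: {annotation_id: (start, end, text)}
    translations:   list of (translation_of, text) pairs
    """
    by_target = {target: text for target, text in translations}
    return [
        Utterance(start, end, text, by_target.get(ann_id, ""))
        for ann_id, (start, end, text) in transcriptions.items()
    ]


utterances = merge_linked(
    {"a1": (0, 5, "mi casa es su casa")},
    [("a1", "my home is your home")],
)
print(utterances[0])
```

Once merged, the translation no longer needs an identifier or a reference of its own; it is simply a property of the utterance.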

The documentation of the EAF format is a highly technical document, and doesn’t give much in the way of explanation for the design decisions which went into it. Brugman et al. (2004) give some clarification (emphasis added):

> An informal use-case driven method was used to create this design. In contrast to the previously discussed relational model that merely models concepts and their relations, this object oriented model is expressed mostly as a set of related interface definitions. It is therefore more of an operational model than a data model. The model is called Abstract Corpus Model (ACM) since it models concepts from the domain of annotated corpora in an abstract way. It is realized in first instance as a set of abstract classes that implement common behavior. These abstract classes each have concrete subclasses, one for each of the annotation file formats that ACM currently supports (CHAT (MacWhinney, 1999) Shoebox [3], MediaTagger’s relational database, Tipster via the GATE API [4], several varieties of XML).

EAF was explicitly not designed to model linguistic data, much less fieldwork data. ELAN was built as a generic transcription tool, and its data format lets users define a system of tiers which expresses their own constraints on their data.

Looking forward

We can define operations on instances of such classes of data which correspond to searching, sorting, and other useful kinds of analysis of documentary data. But such work presupposes the existence of a well-structured body of data (a corpus) with which to work. The construction of a corpus, whether that construction consists of a restructuring of existing digital data or ab ovo keyboarding, is a significant part of the (digital) documentary linguist’s work. If we wish to build tools which are useful at this stage of the process, then, we must first consider how data “flows into” a digital corpus. This process of inputting documentary data is closely related to the linguist’s chosen workflow.

Modeling data consists of specifying what objects should be represented, and what their constituent attributes should be: we chose to conceptualize the individual ‘words’ in the Mesopotamian tablet as being representable as a conjunction of a ‘form’ attribute and a ‘gloss’ attribute. That these attributes were defined as criterial was a documentary decision[^context] about how to represent that lexical resource: to qualify as a ‘word’, both attributes must be present.
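
A minimal Python sketch of this model, with search and sort operations defined over it, might look as follows. The `Word` class, the `search` and `sort_by_gloss` functions, and the three lexicon entries are all illustrative assumptions: the entries are simplified Sumerian forms chosen for the example, not a transcription of the tablet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    gloss: str

    def __post_init__(self):
        # Both attributes are criterial: an empty form or gloss is rejected.
        if not self.form or not self.gloss:
            raise ValueError("a word needs both a form and a gloss")


def search(words, text):
    """Return the words whose form or gloss contains the given text."""
    return [w for w in words if text in w.form or text in w.gloss]


def sort_by_gloss(words):
    """Return the words in alphabetical order of their glosses."""
    return sorted(words, key=lambda w: w.gloss)


# Illustrative entries only.
lexicon = [Word("udu", "sheep"), Word("e", "house"), Word("lugal", "king")]

print(search(lexicon, "sheep"))
print([w.gloss for w in sort_by_gloss(lexicon)])
```

Making the check part of the class is one way to encode the documentary decision: a record lacking either attribute cannot be constructed as a `Word` at all.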

[^context]: Note that the context of such documentary decisions is far from obvious. To Ešguzi-gin-a, the ‘form’ was a Sumerian form in cuneiform, whereas to a modern linguistic typologist, a Latin-alphabet transcription with morphological boundaries ‘is’ the form. We shall see later that adding additional properties is perfectly possible (the modern typologist may very well wish to record a cuneiform attribute with the original signs in addition to a form with a modern transliteration).

This is more than a side-issue: whatever model we choose for the data itself (e.g., “a word shall be defined as having a form and a gloss”), that data model is itself distinct from the process of recording particular values in accordance with that model. Concretely, merely stating that a ‘word’ should include ‘form’ and ‘gloss’ attributes says nothing at all about the order in which those values ‘should’ be recorded. Either could be recorded first, and indeed, both correspond to recognizable techniques in linguistic fieldwork. Furthermore, a linguist might prepare a list of glosses beforehand in order to elicit responses within a set domain. In that case the “input data” in the workflow is a set of partially-documented words (they have glosses but no forms), and some of those prompts may turn out to be irrelevant if there is no corresponding form in the language.

These sorts of approaches are given many names in manuals of documentary fieldwork: “elicitation”, “transcription”, “questionnaires”, “surveys”, etc. And in practice the linguist may switch from one workflow to another during the process. Questions of “naturalness” are often brought into the discussion of choice of workflow. It is true that a request to a speaker like “Can you tell me all of the words you can think of that have to do with ‘sheep’?” is “less natural” in some sense than a transcription of a story that happens to involve sheep-shearing, simply because reciting a list of varieties of sheep isn’t typical usage. But it is perfectly possible that at least some of the very same instances of a ‘word’ could be documented through the application of either workflow.
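
The independence of model and workflow can be made concrete with a small Python sketch. Everything here is hypothetical: `DraftWord` stands for a partially-documented word, `elicit` for a gloss-first workflow, and `transcribe` for a form-first one, and the prompts and responses are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftWord:
    """A word in progress: either attribute may still be missing."""
    form: Optional[str] = None
    gloss: Optional[str] = None

    def is_complete(self):
        return bool(self.form) and bool(self.gloss)


def elicit(prompts, responses):
    """Gloss-first workflow: start from prepared glosses, fill in forms as offered."""
    drafts = [DraftWord(gloss=g) for g in prompts]
    for draft in drafts:
        draft.form = responses.get(draft.gloss)  # None if no form is offered
    return drafts


def transcribe(forms, glossary):
    """Form-first workflow: start from transcribed forms, add glosses afterwards."""
    drafts = [DraftWord(form=f) for f in forms]
    for draft in drafts:
        draft.gloss = glossary.get(draft.form)
    return drafts


# Invented data: one prompt has no corresponding form in the language.
from_elicitation = elicit(["sheep", "snowmobile"], {"sheep": "udu"})
from_transcription = transcribe(["udu"], {"udu": "sheep"})

complete = [d for d in from_elicitation if d.is_complete()]
print(complete == from_transcription)
```

Both workflows arrive at the same completed record, while the irrelevant prompt remains behind as an incomplete draft rather than a word.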

We shall aim for flexibility in associating applications with workflows. It is probably impossible to enumerate all possible documentary workflows. Instead, we will take a more pragmatic approach, setting forth the following workflow-first “mantra”:

Use documentary workflows as the basis for defining user interfaces.

We shall see that this principle may result in subtly or drastically different interface designs, even if the data produced by those different designs is nearly identical. In order to see how this mantra plays out in practice, we will attack a more complicated documentary workflow than simply recording forms and glosses.

Brugman, Hennie & Albert Russel. 2004. Annotating Multi-media/Multi-modal Resources with ELAN. LREC.