Structure in data

The origins of language documentation predate the conception of computerized databases. But creating a digital database today still requires decision-making about what kinds of information are to be captured, and what kinds of interrelationships exist within that information. All databases must by necessity be selective, since the “real world” is in effect infinitely complex. Decisions about which particular characteristics of the world should be recorded are made by a database designer, and depend on their research goals. In a complex domain such as that of language documentation, it might seem that standardized answers to the question of what counts as “worth recording” would be untenable: why should we expect, for example, that the way of arranging information to successfully describe one language would be reusable for another, perhaps very typologically distinct, language? Might it not be the case that research projects are so diverse in goals and methodology that data structures are effectively unique to each project?

It is indeed not the case that all descriptions can be expected to share a single exhaustive list of characteristics (repeatedly answering “none” to a questionnaire about tone in a language without tone is hardly an effective use of effort). But there is a small number of categories of documentary data which can be shown to be nearly universally applicable in documentation. And we need not start from scratch in enumerating those categories: we have “inherited,” from the non-digital predecessors of modern digital language documentation, a fairly consistent, high-level “notional” design for the structure of a documentary database. While earlier practitioners of documentation and linguistic scholarship did not have access to computers, they did work to find ways to describe and notate documentary information in a manner which was as non-redundant as possible. That work is precisely the kind of analysis that a database designer today would have to carry out in order to produce a robust design for storing and using information in digital form. The process of database design is, after all, firstly one of enumerating “kinds” of information (each kind with its specified list of relevant characteristics), and secondly one of defining the specifics of the interrelationships between those kinds. (We shall have more to say about the nature of such interrelationships below.) Exactly such a process of information analysis has already taken place as grammatical analysis and language documentation have evolved. This “design” has developed organically over the long arc of linguistic scholarship: written conventions, the layout of “physical” documents, and modern linguistic notations (including, for example, the Leipzig Glossing Rules) all reflect a kind of simple, usable abstract “data model” of documentary data, that is, a conventionalized approach to arranging documentary information.
Below we will investigate the structure of this abstract data model in concrete terms, tracing its outlines across pre-digital documentary and scholarly artifacts, drawing conclusions about what sorts of information have proven “worth recording”.

Note that there is a difference between specifying a structured way to think about data and actually implementing those data structures in such a way that they may be stored in a computer. There are endless avenues to pursue in modeling information in a computer. In this work, an implementation will be developed on top of the technology on which web browsers are built (about which more below). Most programming languages come with a few very general “data structures” of their own: standardized abstractions for arranging information in a useful way. The web browser’s default programming language is called JavaScript, and we will use just two of its “built-in” data structures for the lion’s share of our data modeling: the object and the array.1

So our goal is to travel from the “notional” realm of data as it is embodied in familiar documentary artifacts, to an explicitly expressed realm which is clear enough to be recorded in terms of objects and arrays.2 There are two possible approaches to understanding how to use these two data structures: we might begin with how they are implemented in a technical sense, looking at the notions of objects and arrays as they are used in computer programming, and then proceed to investigate how the notions we use in documentation (“words”, “morphemes”, etc.) could be fitted into those mechanisms; or else we might begin with the linguistic notions, and proceed to the computational concepts using familiar documentary examples as building blocks. Because it is expected that linguistic concepts will be more familiar to readers of this work than programming concepts, here I will follow the second route: we will, in effect, “trust our predecessors,” and look into the evidence of basic structures in familiar and not-so-familiar formats. We will then approach the task of organizing those in terms of objects and arrays.

This chapter begins with some further background on the state of computer use within language documentation specifically, starting with observations made by modern linguists on how they relate to their own digital work. There is considerable evidence of frustration among practitioners, but what exactly is the nature of the complaint? The answer to this question is more of a human issue than a technical one. Frustration arises in the modern era when those familiar and intuitive mental models of linguistic structure are not “usable” once transferred to a digital milieu. Linguists intuitively understand how the information they want to record is structured, but they find that the software they use does not allow them to search their corpora in terms of those intuitive notions. To explore what such usability requires, we will consider one artifact from ancient linguistic scholarship (a Mesopotamian clay tablet dictionary), two physical artifacts from the history of language documentation proper (a fileslip dictionary and an unusual “edge-notched card” database), and finally the typographical and layout structure of a documented text which was published in two formats, first in a “parallel” form and later as an interlinear glossed text. Each of these four objects affords certain uses and precludes others, and by interpreting those affordances we will be in a position to see just what kind of utility a purely digital representation of the very same data affords.

We will discuss how to implement simple but usable interactive interfaces in Chapter 3. In the current chapter, we will focus on the question of what it means to say that documentary data is “structured” in a way that can serve as the basis for those interactive applications. We will see that four characteristics are necessary for designing such a database. These characteristics are:

  1. labeled - Each “piece” of information has an unambiguous label. §2.3
  2. sortable - There must be defined attributes for sorting (and re-sorting) objects in arrays. §2.4
  3. searchable - There must be defined procedures for finding objects which match particular criteria in arrays. §2.5
  4. hierarchical - The notion of “containment” must be encoded into the data. §2.6

Large, Windless Apartments

Swimming as we all are in a sea of information, it is difficult for modern linguists to extract themselves from the present and imagine a time when their work was not mediated in some way by computers. But a pair of quotes from early practitioners of language documentation bring the sea change from the age of paper to the digital era crashing home. Kathryn Klar (2002) tracked down a revealing insight into the working methods of J.P. Harrington, a linguist whose vast and ramshackle output needs no introduction. Klar recovered an outline by Harrington for a 1914 lecture on fieldwork methodology, perhaps the only explicit record of Harrington’s working methods. In the following quote, he describes how would-be field linguists should sort their fileslips:

The sorting of slips should be done in a large, windless appartments [sic] where the work will be undisturbed. Many Large [sic] tables are most convenient, although it often becomes necessary to resort also to placing slips on the floor. The sorting of many thousands of slips is a most tedious and laborious task. To many it is mere drudgery. It cannot well be done by anyone other than the collector. It is easy and interesting to record words on these slips by the thousand. The sorting however takes many times as long as the recording.

We imagine Harrington in a bleak rented room somewhere in California, surrounded by his innumerable slips of paper, deathly afraid of a breeze blowing away a year’s worth of careful organization of precious notes on Chumash or Ohlone.

He was not alone in suffering this curse of “sorting”: it also plagued Leonard Bloomfield, whose complaints of the ailment were recounted by Hockett in the introduction to Bloomfield (1967), (emphasis added):

Bloomfield was speaking of the tremendous difficulty of obtaining a really adequate account of any language, and suggested, half humorously, that linguists dedicated to this task should not get married, nor teach: instead they should take a vow of celibacy, spend as long a summer as feasible each year in the field, and spend the winter collating and filing the material. With this degree of intensiveness, Bloomfield suggested, a linguist could perhaps produce good accounts of three languages in his lifetime.

It turns out that Bloomfield did not live up to his own expectations: he did not live to see the publication of his one full-scale grammar, The Menomini Language, and it was Hockett who would edit and see through the posthumous publication of that work (Bloomfield 1969). While Bloomfield seems to have been speaking somewhat tongue-in-cheek, Harrington’s true evaluation of his own claim can of course be measured in (mostly unpublished) tons. But whatever the correct numerical interpretation of these admonitions, the import is clear enough: two well-known linguists agree in advising aspiring linguists to expect to spend a significant portion of their careers — perhaps one half of their careers — engaged in ‘sorting’ data.

It is humbling to imagine what such productive and prolific linguists could have achieved had they not been constrained in this way. One is reminded of the 18th-century lexicographer Samuel Johnson’s self-definition as ‘a harmless drudge’. Surely our easy modern access to relatively cheap and powerful computers obligates us to view hand-sorting in the era of computers as drudgery of a “harmful” sort? And yet some modern linguists seem to exhibit a curious reluctance to let go of the old paper-based ways of documentation, or at least a lingering nostalgia for those methods:

The first dictionary I did (a preliminary version of my Mojave dictionary) was compiled in three-inch by five-inch slips (some linguists, I know, prefer four-inch by six-inch slips!) - not cards (too thick!), but slips of ordinary paper, which were arranged alphabetically in a file box (one hundred slips take up only a little more than half an inch). Reluctantly, I have stopped introducing field methods classes to the joys of using file slips, which I still feel are unparalleled for their ability to be freely manipulated and arranged in different ways. But I don’t use paper slips much myself any more, so it doesn’t seem right to require students to make a slip file, as I once did. (Munro 2001)

In a more explicit example of reluctance to embrace technology in fieldwork, hear R.M.W. Dixon:

In pre-computer days I’d fill in a 5 inch by 3 inch (12.5 cm by 7.5 cm) card for each lexeme. I continue with the same procedure today, as do many other linguists. Electricity supply is non-existent in many field locations, or else it is intermittent and unreliable. Reliance on a computer leads to frustration and a feeling of impotence when it ceases to work properly. This is why many linguists, working in difficult field situations, prefer to leave their computer back at the university. (Dixon 2009:297)

Both of these voices accept some role for computers (even if it’s only “on campus”), but resignation is not far from the surface: here we see experienced linguists looking dimly at the shift to the digital milieu: there is a loss of “joy,” replaced with “frustration,” and “a feeling of impotence.” Why is this? Would not Harrington and Bloomfield have given a king’s ransom for one of today’s mid-range, off-the-shelf laptops?

Those seeking support for carrying out fieldwork today are not permitted to vacillate on the question of whether to work digitally. Funding sources stipulate that corpora be deposited in digital form — fileslips and notebooks, nostalgia notwithstanding, are hardly acceptable academic output today. As the field of documentary linguistics has quickly expanded — there are now many conferences focusing on or foregrounding language documentation — we have seen considerable progress toward the principles of “portable documentation” articulated in Bird and Simons’ seminal 2003 paper. In particular, significant progress has been made in archival practice. (For a useful history of these developments, see Henke and Berez-Kroeker (2016).)

But despite this intense activity in broadening the scope and improving the methodology of language documentation, the feelings of “impotence” and “frustration” seem to persist. Why do so many documentary linguists remain vexed by the difficulties of using computers to improve their research? Why is so much of our effort dedicated to troubleshooting software for documentation, as opposed to simply doing documentation? Is there something more behind Munro and Dixon’s complaints that the way we use technology in documentary linguistics is “joyless”?

I will suggest that the reason underlying such frustration has nothing to do with a reconceptualization of what kinds of information should be collected and organized in language documentation. The basic “mental models” of the information structures used to describe human languages have proven remarkably stable. Insofar as “words,” for instance, are a basic unit of linguistic analysis, one might even say that some of the basic conceptualizations of linguistic structure have histories measured in millennia: while the notations with which such structures were recorded have evolved considerably, basic notions of sounds, components of words (roots, affixes, etc.), words, sequences of words (“sentences”), texts, corpora, etc., are notions which are at least roughly contemporaneous with the earliest days of writing, if not earlier. In a sense, language documentation is nothing new.

Clay tablets - words as objects

Consider for a moment one of the many distant philological ancestors of language documentation: the Mesopotamian scribe. Many scribes are known by name; let us acquaint ourselves with one in particular, Ešguzi-gin-a. He would almost certainly have been familiar with what have come to be called Lexical Lists, like the one below:

Scribes like Ešguzi-gin-a were concerned with the preservation of the Sumerian language for religious and cultural reasons. There was a fairly organized educational system (including schools known as “edubba”) where scribes such as Ešguzi-gin-a would have been trained in reading and writing both their own Akkadian language (a Semitic language) and the unrelated Sumerian (thought of now as an isolate, for lack of evidence of any related language). The lexical list was a conventionalized dictionary arranged on a semantic basis (the tablet above is from the section enumerating various kinds of sheep). The leftmost image is a photograph of the actual tablet, the middle image is a realistic drawing, and the rightmost is a somewhat normalized representation which bears a resemblance to a modern, two-column spreadsheet. Borrowing transliterations of the cuneiform text from Von Dassow & Jursa (1988), the first five lines read:

[
  {
    "Sumerian": "udu.níta",
    "Akkadian": "im-me-ru",
    "English": "sheep"
  },
  {
    "Sumerian": "udu.ni-gu ŠE",
    "Akkadian": "MIN ma-ru-ú",
    "English": "grain-fed sheep"
  },
  {
    "Sumerian": "udu.ŠE.sig₅",
    "Akkadian": "MIN MIN dam-qa",
    "English": "grain-fed beautiful sheep"
  },
  {
    "Sumerian": "udu.gír.gu.la",
    "Akkadian": "ar-ri",
    "English": "sheep “for the big knife”"
  },
  {
    "Sumerian": "udu.gír.ak.a",
    "Akkadian": "kaṣ-ṣa",
    "English": "shorn sheep"
  }
]

It is helpful to put ourselves into Ešguzi-gin-a’s sandals for a moment: how did he think of such a tablet? For Ešguzi-gin-a, the Sumerian column contained the linguistic forms for which he “needed” documentation. Namely, he needed the second column, with glosses3 in Akkadian. I suspect that the current reader was not trained in the interpretation of lexical lists in an edubba, and so you will conceptualize this table in roughly the same way that this author does: it is mostly useless, and largely meaningless. Without a translation of some kind into a language which is recognized by a reader, documentary data is largely unusable. Had Ešguzi-gin-a added a column of glosses into English, the Sumerian forms would have been made intelligible to modern readers:

Armed with this data, we are equipped to begin participating in Mesopotamian sheep scholarship alongside Ešguzi-gin-a. But simply labeling the columns with a language name is insufficient in an important respect: it doesn’t specify what role each piece of information is playing in the documentation as documentation. Namely, there is no explicit statement of the fact that while both Ešguzi-gin-a and we moderns may consider the Sumerian form to be the linguistic datum which is being documented, the Akkadian and English columns play different roles for each of us. It is even possible to imagine (although chronologically unlikely) that a Sumerian scholar might have used this document for the inverse purpose: to study Akkadian. In all of these cases, the key distinction is not between the names of the languages of each column’s content. It is, rather, the distinction between how a documentarian is using the content of those columns with respect to each other. For this reason, it makes more sense to compare these permutations by changing the column labels from language names to labels which specify what kind of role each column is playing in documentation. Thus, we might label which column is a “form” in the language of study, and which is a “gloss” into the meta-language used in documentation:

There is a deeper issue than mere labeling at play here: simply changing the column labels may seem to be mere terminological juggling, but it reflects a crucial step in meeting the definition of (digital) documentary data as defined in the previous chapter:

all recorded information about a language, organized in such a way that all of its component parts (and component subsets) may be recovered, either by retrieving the value of an attribute by property or by numerical offset within an array.

As the terminology here will be ubiquitous in this work, let us pause to define attributes, properties, and values:4

Note the distinction between an attribute as a whole and its property, which is the name itself of the attribute. Thus, we specify a particular Sumerian word (for an English-speaking documentarian) by listing two attributes, one whose property is form and whose value is udu.níta, and another whose property is gloss and whose value is ‘sheep’. This object contains enough information to identify a “documentary word” in the same way that Ešguzi-gin-a did: although we are using a different meta-language to translate the content of the Sumerian words, in both cases there are two “attributes” of each word object, and each of the two attributes plays the same documentary role.
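
As a minimal sketch of this terminology in JavaScript (the variable names `word` and `words` are illustrative assumptions, not part of the definition above), a word can be written as an object with two attributes, and its parts recovered either by property or by numerical offset within an array:

```javascript
// One "documentary word": an object with two attributes.
// In the attribute `form: "udu.níta"`, the property is "form"
// and the value is "udu.níta".
const word = { form: "udu.níta", gloss: "sheep" };

// Retrieval by property.
const form = word.form;      // "udu.níta"
const gloss = word["gloss"]; // "sheep" (bracket syntax is equivalent)

// An array holds objects in order; retrieval by numerical offset
// counts from 0, so offset 1 is the second word.
const words = [
  { form: "udu.níta", gloss: "sheep" },
  { form: "udu.gír.ak.a", gloss: "shorn sheep" },
];
const second = words[1];

// Object.entries lists every attribute as a [property, value] pair.
const attributes = Object.entries(word);

console.log(form, gloss, second.gloss, attributes);
```

Both example words are taken from the tablet rows transliterated above, with English as the meta-language.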

Of course, lugging the full several-dozen-tablet-edition of the Sumerian Lexical Lists was hardly a convenient way to store that information, although it may have proven unbelievably effective in preserving it.6 But it’s not just the physical medium which would have proven unwieldy over time — it was the affordances that that medium allowed and did not allow which are the true measures of utility. One glaring shortcoming of impressing such a dictionary into clay is that doing so simultaneously freezes the order of its entries. Not only was there no “finder list” which Ešguzi-gin-a could have used to look up terms in his own language (say, all glosses that contained the string knife or shorn), creating such a resource would have taken nearly as much effort as producing the original.
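
What the clay could not afford becomes cheap once the same rows are held in an array. The sketch below is only an illustration: the `tablet` array repeats three of the rows transliterated above (with their English glosses), and `findByGloss` is a hypothetical helper name, not an established function.

```javascript
// Three rows from the lexical list above; `gloss` holds the English gloss.
const tablet = [
  { form: "udu.níta", gloss: "sheep" },
  { form: "udu.gír.gu.la", gloss: "sheep “for the big knife”" },
  { form: "udu.gír.ak.a", gloss: "shorn sheep" },
];

// Hypothetical helper: every word whose gloss contains the given string.
function findByGloss(words, text) {
  return words.filter((word) => word.gloss.includes(text));
}

console.log(findByGloss(tablet, "knife").map((word) => word.form));
console.log(findByGloss(tablet, "shorn").map((word) => word.form));
```

The order of the entries in `tablet` is never disturbed by the lookup, which is exactly what a physical tablet could not offer.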

To summarize, then, we may think of the “data structure” of each of the words in this tablet (one per row) as being representable by two attributes, one called form, and another called gloss:

{
  "form": "udu.níta",
  "gloss": "im-me-ru"
}
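
The move from language-name labels to role labels can itself be sketched as a small function. This is an illustrative assumption about how one might write it: `toWord` and its parameters are hypothetical names, and the row is the first line of the tablet as transliterated above.

```javascript
// A tablet row labeled by language name.
const row = { Sumerian: "udu.níta", Akkadian: "im-me-ru", English: "sheep" };

// Hypothetical helper: relabel a row by documentary role. The caller states
// which language is under study and which serves as the meta-language.
function toWord(labeledRow, formLanguage, glossLanguage) {
  return { form: labeledRow[formLanguage], gloss: labeledRow[glossLanguage] };
}

// The scribe's view: Sumerian forms glossed in Akkadian.
const forScribe = toWord(row, "Sumerian", "Akkadian");

// A modern English-speaking reader's view of the same row.
const forModernReader = toWord(row, "Sumerian", "English");

// The inverse use imagined earlier: studying Akkadian by way of Sumerian.
const inverse = toWord(row, "Akkadian", "Sumerian");

console.log(forScribe, forModernReader, inverse);
```

All three results have the same two properties, form and gloss; only the values differ according to who is doing the documenting.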

These two documentary roles of “form” and “gloss” are not somehow “inherent” in the clay tablet itself. Because we know something about the way that such artifacts were used, we can infer that scribes such as Ešguzi-gin-a would have thought of the tablet as expressing something like those documentary roles. This will be a recurrent phenomenon as we analyze more artifacts: different objects are able to more or less readily mirror some of the ways in which we can think about documentary data. In the next example, we shall see an artifact where the order of words is not frozen in place.

Slipfile dictionaries - arrays of words

Prior to the widespread availability of personal computers, many — perhaps most — documentary linguists relied on a mechanism for recording documentary data called a “slipfile” or “fileslip dictionary.” This mechanism consisted of a series of paper slips sorted into one or more boxes of the sort seen below. This particular slipfile is one of about a dozen used to describe Kashaya Pomo, a Pomoan language of Northern California. These tools were used as a working catalog of ongoing lexicographical work on a language.

(Usually each “word” slip was followed by multiple examples of “usage slips” — for the purposes of the current discussion I focus only on the fileslip as a dictionary).

A fileslip dictionary. This artifact corresponds to an (ordered) array of objects. Oswalt/Kashaya Pomo Box #16

Consider this rather typical fileslip from a dictionary of Kashaya Pomo:

Kashaya Pomo: k̓ili  ‘black, dark in color’

This fileslip is from the archives of Robert L. Oswalt, who documented the Pomoan languages of northern California (especially Kashaya Pomo) over multiple decades, beginning in the late 1950s. The typed content records the simple fact that the Kashaya Pomo word k̓ili may be glossed ‘black, dark in color.’ This physical format is hardly different in import from that of a spreadsheet layout:

k̓ili

black, dark in color

In this regard it is entirely equivalent in “structure” to the content of a single row of the tablet above. Namely, this single Kashaya Pomo word may be represented by the same type of object (here I have abbreviated the full definition ‘black, dark in color’ to the sort of single-form representation that might be used in an interlinear gloss):

{
  "form": "k̓ili",
  "gloss": "black"
}

It is instructive to consider that adding a duplicate of this card to the fileslip dictionary would probably be a mistake, or at least, redundant. This is because the object uniquely identifies a single Kashaya Pomo word. Note that even on this simple, two-datum fileslip, a few “extra” annotations have been added (they happen to describe cognates in other Pomoan languages). In practice, the accumulation of attributes beyond the identifying form/gloss pair is typical. The example below demonstrates how a single fileslip could acquire information far beyond that necessary to uniquely identify a word:

Kashaya Pomo: síːṭóṭo ‘robin’

Here the form in question is the Kashaya word síːṭóṭo, and the gloss is ‘robin’. But like a palimpsest, the fileslip functioned as a surface onto which all manner of annotations were added. Oswalt continued to edit his fileslips for years. By my count this single fileslip accumulated over a dozen distinct assertions about the word for ‘robin’:

Labeled version of the previous figure.
  1. the form síːṭóṭo
  2. the (English) gloss ‘robin’
  3. Initials of speaker who attested the form (Alan James)
  4. A phonetic note-to-self: Oswalt was wavering over whether he was hearing a long vowel in the first syllable
  5. References to a cognate which Oswalt knows to exist in two other Pomoan languages, in Oswalt’s special notation — PS indicates Southern Pomo, PD is most probably the Dry Creek variant of Southern Pomo (Alexander Walker, personal communication).
  6. The cognate form itself, with a distinct geminate consonant.
  7. A variant form by another speaker of Kashaya, where the second stop was geminate.
  8. The speaker’s initials (Essie Parrish)
  9. A “timestamp” of some sort, which Oswalt added with a mechanical stamp.
  10. A citation to another source, which may or may not be the same word.
  11. The source citation, abbreviated Gf 77
  12. The related form from that source
  13. Another attestation from the very same speaker specified in #3 (Alan James), this time with a form matching more closely that in #7.
  14. The date of the attestation (fourteen years after the timestamp in #9)
  15. James’ second form.
  16. The semantic domain of the form: a type of bird.

We could even imagine representing all this information in a manner like that with which we displayed the core attributes, something like:

{
  "form": "síːṭóṭo",  
  "gloss": "robin",
  "speaker": "AJ",
  "notes": [
   "check whether the first syllable has a long vowel [Oswalt]"
  ],
  "variants": [
    {
      "form": "síːṭóṭṭo",
      "gloss": "robin",
      "language": "Kashaya Pomo",
      "speaker": "EP",
      "recorded": "UNKNOWN",
      "notes": [ "note gemination of ṭ on last syllable"]
    },
    {
      "form": "síːṭóṭṭo",
      "gloss": "robin",
      "language": "Kashaya Pomo",
      "speaker": "AJ",
      "notes": [ "note gemination of ṭ on last syllable"],
      "recorded": "August, 1974"

    }
  ],
  "recorded": "July 4, 1960",
  "citations": [
    "“Gf 77 wrote chitoto — another bird?” [Oswalt]"
  ],
  "semanticDomains": ["bird"]
}

(On the complex value for the variants property, which contains a “nested” array of two variant attestations, see section 2.5.1, below.)

Even this rather more complicated tabulation fails to express all of the information contained in this fileslip — there are also faint pencil marks around the main form, in multiple colors, which record Oswalt’s evolving understanding of the complicated stress and length patterning in Kashaya. Obviously, this is a far cry from the rather bald facts of a form with a gloss, as encoded into the Mesopotamian tablet above. But only those two attributes are required to identify the word as a distinct entity. Put another way, all information aside from the first two attributes (form and gloss) could be removed from the fileslip, and it would still be meaningful to say that the word had been “documented”.7

(We shall see in later chapters that this distinction between “criterial” attributes and other “annotative” attributes is important for the way in which computer programs are written, and especially for sorting and searching.)
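
A minimal sketch of that distinction, under the assumption that form and gloss are the only criterial attributes (the names `CRITERIAL`, `criterialOnly`, and `isSameWord` are hypothetical, chosen for illustration):

```javascript
// Assumption for this sketch: a word is identified by its form and gloss.
const CRITERIAL = ["form", "gloss"];

// Keep only the criterial attributes of a word, dropping annotations.
function criterialOnly(word) {
  return Object.fromEntries(
    CRITERIAL.map((property) => [property, word[property]])
  );
}

// Two objects describe the same word when their criterial attributes match,
// whatever else has been annotated on either one.
function isSameWord(a, b) {
  return CRITERIAL.every((property) => a[property] === b[property]);
}

const annotated = {
  form: "síːṭóṭo",
  gloss: "robin",
  speaker: "AJ",
  semanticDomains: ["bird"],
};
const bare = { form: "síːṭóṭo", gloss: "robin" };
const other = { form: "k̓ili", gloss: "black" };

console.log(criterialOnly(annotated)); // only form and gloss remain
console.log(isSameWord(annotated, bare), isSameWord(annotated, other));
```

A check like `isSameWord` is one way a program could notice the redundant duplicate fileslip discussed earlier, even when the two slips carry different annotations.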

Note also that this word contains the first instance we’ve seen of a phenomenon which might be called nesting: the value for the variants property is itself an array, containing two distinct values, each of which is an object.
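
Reaching into a nested value is a matter of chaining the two kinds of retrieval, by property and by offset. The sketch below abbreviates the robin entry above to the attributes needed for the illustration; the variable names are assumptions.

```javascript
// Abbreviated from the robin entry above.
const robin = {
  form: "síːṭóṭo",
  gloss: "robin",
  variants: [
    { form: "síːṭóṭṭo", speaker: "EP", recorded: "UNKNOWN" },
    { form: "síːṭóṭṭo", speaker: "AJ", recorded: "August, 1974" },
  ],
};

// By property, then by offset, then by property again.
const firstVariantSpeaker = robin.variants[0].speaker;

// Because `variants` is itself an array, array operations apply to it.
const variantSpeakers = robin.variants.map((variant) => variant.speaker);
const datedVariants = robin.variants.filter(
  (variant) => variant.recorded !== "UNKNOWN"
);

console.log(firstVariantSpeaker, variantSpeakers, datedVariants.length);
```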

Another key difference between the tablet and the fileslip is that the order of words within the tablet is immutable. A fileslip dictionary like the one below is, of course, physically sortable, and in all probability that is one of the features of the format that led Munro to praise fileslips for their ability to be “freely manipulated and arranged.” Most commonly the sorting criterion is some form of alphabetization, but one can imagine grouping the Kashaya Pomo forms by semantic class (#16) simply by making a pass through the entire dictionary and creating a single pile for each class: birds here, trees there, etc.8

And so beyond the mere representation of a single word object, the second abstraction for data becomes apparent: each word is represented as an individual object on a single fileslip, but an ordered sequence of such objects constitutes a list or “array” of objects. Note that while the order of attributes within an object is not meaningful, the order of objects within an array is meaningful. Searching for anything at all in a shuffled array of objects is highly inefficient, and that inefficiency grows as the database grows. Alphabetization and other forms of ordering (such as grouping by semantic domain) are a bulwark against such inefficient arrangements.
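
Both kinds of ordering can be sketched in a few lines. The helper names `sortBy` and `pileByDomain` are hypothetical, and the semantic domain “color” assigned to k̓ili is an assumption made for this example (only “bird” comes from the robin fileslip).

```javascript
// Hypothetical helper: a sorted copy of an array of word objects, ordered by
// the value of one property. The original array is left as it was.
function sortBy(words, property) {
  return [...words].sort((a, b) => a[property].localeCompare(b[property]));
}

// Hypothetical helper: one "pile" per semantic domain, built in a single
// pass. A word tagged for several domains lands in several piles.
function pileByDomain(words) {
  const piles = {};
  for (const word of words) {
    for (const domain of word.semanticDomains) {
      if (!piles[domain]) piles[domain] = [];
      piles[domain].push(word);
    }
  }
  return piles;
}

// The two Kashaya words discussed above. The domain "color" is assumed.
const slipfile = [
  { form: "síːṭóṭo", gloss: "robin", semanticDomains: ["bird"] },
  { form: "k̓ili", gloss: "black", semanticDomains: ["color"] },
];

const byGloss = sortBy(slipfile, "gloss");
const piles = pileByDomain(slipfile);

console.log(byGloss.map((word) => word.gloss));
console.log(Object.keys(piles));
```

Note that `localeCompare` applies a general-purpose ordering. Alphabetizing forms according to the conventions of a particular orthography would likely require a custom comparison function.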

But even in a well-ordered fileslip dictionary (a physical “array” of words), re-ordering data for the purposes of research becomes increasingly infeasible as the dictionary grows. A relevant anecdote: this author photographed the entirety of the Oswalt Kashaya Pomo dictionary, over 14,000 fileslips, and it took over a week of mind-numbing work. While it is easy to state that in principle any array may be rearranged by any attribute of its constituent objects, in practice actually doing the sorting with a large fileslip dictionary is clearly just as excruciatingly slow as Harrington’s and Bloomfield’s earlier statements warned.

Purely digital “arrays” thus have many advantages over their physical correlates. By learning to programmatically sort and search the contents of arrays, linguists may use the blinding speed of computers to significantly increase the efficiency with which they investigate their data. The latter half of chapter 3 will provide the bare minimum of guidance as to how this may be done with the JavaScript programming language.

Edge-notched cards: searching attributes

Samarin (1967) is perhaps unique in the history of linguistic fieldwork manuals, as it gives a highly detailed account of how to produce and sort fieldwork data in a time when computers existed, but were not widely available. Samarin and other linguists (see for example Grimes (1959)) understood that arranging lexical or textual data on fileslips into a sorted slip file dictionary was not terribly helpful for actually searching for content. As Samarin was interested in a class of words called ideophones, and wished to study the possible semantic sub-categorizations of such words, he wanted to essentially create a checklist of semantic domains, and then use that checklist to characterize each ideophone in his database.

So, of the semantic domains:

  1. design
  2. dimension
  3. emotion
  4. form
  5. intensity
  6. movement
  7. number
  8. odor
  9. order
  10. quality
  11. sound
  12. state
  13. taste
  14. temperature
  15. time
  16. touch
  17. weight

Samarin wished to be able to find all words which had a given value of “true” or “false” for each of these properties. Note that attributes with only “true” or “false” as possible values have a special name in programming: booleans. Simple batch filtering by these domain values was of crucial importance to Samarin’s research, and accordingly, Samarin made use of another revealing “physical database,” although one which is far less well-remembered today than the fileslip dictionary. This mechanism was known as an “edge-notched card file”.
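
As a rough sketch of such a checklist in JavaScript, each domain can be an attribute whose value is a boolean. The forms and the true/false values below are invented placeholders, not Samarin’s data, and `withValue` is a hypothetical helper name.

```javascript
// A few of the domains from the checklist above.
const DOMAINS = ["movement", "sound", "taste"];

// Placeholder cards: the forms and true/false values are invented for
// illustration and are not taken from Samarin's data.
const cards = [
  { form: "ideophone-1", movement: true, sound: false, taste: false },
  { form: "ideophone-2", movement: false, sound: true, taste: false },
  { form: "ideophone-3", movement: true, sound: true, taste: false },
];

// Batch filtering: every card with the given value for the given domain.
function withValue(allCards, domain, value) {
  return allCards.filter((card) => card[domain] === value);
}

const moving = withValue(cards, "movement", true);
const silent = withValue(cards, "sound", false);

// Tally how many cards are "true" for each domain on the checklist.
const tally = Object.fromEntries(
  DOMAINS.map((domain) => [domain, withValue(cards, domain, true).length])
);

console.log(moving.map((card) => card.form));
console.log(silent.map((card) => card.form));
console.log(tally);
```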

A typical edge-notched card produced by Samarin for Bambara data was as follows:

Bambara edge-notched card (Samarin 1967)

In Samarin’s usage, each card corresponded to an individual word (specifically, an ideophone) in the Bambara language. A set of predetermined categories is represented as holes in specific locations on each card. If a given word “has” a given value for a given category, then the hole which corresponds to that value is “notched”. One might say that a particular word with a notch at a particular value is “tagged” for that value (just as modern social media “hashtags” prefixed with a number sign (#) indicate that a post “has” some characterizing label). A collection of such cards could then be searched for all words with a particular value, using a batch lookup with a knitting needle, as shown in the following image:

Looking up a single attribute in an edge-notched card file.

In the case of Figure 8, to select those word-cards whose word is known to contain an [m], one would insert the needle through the corresponding hole and lift up. Only those cards that are notched to indicate that they do contain [m] will escape the filtering action of the needle. Put simply, this physical action is the essence of search over an array.

Not depicted in the previous image is the fact that the filtered cards could be filtered again, perhaps to find those that were “marked” (notched) for the phoneme spelled, in Samarin’s orthography, «ny». The intersection of the two searches would match the card above, which represents the Bambara ideophone mɔnyɔ mɔnyɔ.

Of course, a query of this edge-notched card database for those words which contain an [m] and an [ny] is ambiguous as to the order of occurrence of those phones: the “needling” procedure Samarin describes does not indicate where each phone occurred in a word, so an (imaginary) form nyɔmɔ nyɔmɔ would have matched as well as mɔnyɔ mɔnyɔ.
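The two-pass needling just described can be sketched in JavaScript as repeated filtering of an array. This is only an illustrative sketch: the `notches` attribute, the `needle` function, and the third card are assumptions invented here and are not drawn from Samarin's materials.

```javascript
// Each "card" is an object; `notches` stands in for the notched holes.
// Only the first two forms appear in the text (the second is the
// imaginary form); the third card is invented for illustration.
const cards = [
  { form: "mɔnyɔ mɔnyɔ", notches: ["m", "ny"] },
  { form: "nyɔmɔ nyɔmɔ", notches: ["m", "ny"] },
  { form: "kala kala", notches: ["k", "l"] },
];

// One pass of the needle: keep only the cards notched for `value`.
function needle(deck, value) {
  return deck.filter((card) => card.notches.includes(value));
}

// Filter once for [m], then filter the result again for «ny».
const withM = needle(cards, "m");
const withMAndNy = needle(withM, "ny");

// Both remaining cards match, because a notch records presence, not position.
console.log(withMAndNy.map((card) => card.form));
```

Because each notch records only that a phone is present, the sketch reproduces the ambiguity noted above: the real and the imaginary form are indistinguishable to the query.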

In this example query, Samarin was looking for the intersection of two phonetic values as recorded in his notched-card database. But in another publication (Samarin 1965), describing a distinct language, Gbeya, he seems to have recognized that some kinds of query were beyond such physical databases, and would require, as he rather charmingly put it, a “computer machine”:

…is it only coincidental that in all 15 of the Gbeya ideophones I collected to describe the surface of a brush, every tone is low? Is it not significant that several of the forms meaning "many" or "many different" have the phoneme /k/? For example, vɔk vɔk, ɗɛ́k ɗɛ́k, ŋgboŋ ŋgbok. It would be relatively easy to study the correlation between form and meaning. Punch-cards or computer machines would show us all the combinations with little difficulty. (Samarin 1966)9

Consider the small application below, which implements a rudimentary computational equivalent to Samarin’s card database:

Samarin’s technique was a true proxy for a “search engine” in the days before search engines were available. Conceptually, the notion of “filtering” data by repeatedly asking whether each element in a collection of some kind (in this case, an array of ideophones) “meets” some criterion or criteria is entirely unchanged in the digital milieu. (E.g., Does the first word have the domain “touch” as a property? Does the second? …) We shall see in the next chapter that actually implementing these “boolean” kinds of criteria requires another computational notion: the function.
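As a preview of that idea, the question "does this word have the domain touch?" might be written as a small function and then asked of every word in turn. The sketch below is hypothetical throughout: the placeholder forms, the boolean values, and the name `hasDomain` are assumptions made for illustration.

```javascript
// Placeholder records: the forms and the true/false values are invented.
// Each semantic domain is stored as a boolean attribute of the word.
const ideophones = [
  { form: "form-a", touch: true, sound: false },
  { form: "form-b", touch: false, sound: true },
  { form: "form-c", touch: true, sound: true },
];

// A predicate: a function that answers a yes/no question about one word.
function hasDomain(word, domain) {
  return word[domain] === true;
}

// "Does the first word have the domain 'touch'? Does the second?"
console.log(hasDomain(ideophones[0], "touch"));
console.log(hasDomain(ideophones[1], "touch"));

// Asking the same question of every word yields the matching subset.
const touchWords = ideophones.filter((word) => hasDomain(word, "touch"));
console.log(touchWords.map((word) => word.form));
```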

The use of edge-notched cards was not unique to Samarin. Grimes (1959), to mention one source, also used such cards to analyze Huichol tone systems. So they proved useful for their time. But while this “concrete” data structure brought some new functionality to the table, it had physical limitations as well. The attribute values Samarin used had to be defined before he printed his cards, and once they were printed they were hard to extend. (He also notes that they were quite expensive to produce.) By contrast, changing the data behind the interface above is a matter of typing a few words. Note also that he is only encoding values from attributes which are more or less “closed”. It is not really practical to carry out text search using this mechanism: what combination of needling could specify that you want to find all words which include the sequence «ʔd» as such? Programming languages (including JavaScript), however, do have mechanisms for finding “substrings” in longer strings.
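One such mechanism in JavaScript is the built-in string method `includes`, which tests whether one string occurs inside another. In the minimal sketch below the word forms are placeholders invented for illustration; only the sequence «ʔd» comes from the discussion above.

```javascript
// Placeholder forms, invented for illustration.
const forms = ["taʔda", "daʔa", "ʔdo", "dada"];

// Keep only the forms in which ʔ is immediately followed by d.
const withSequence = forms.filter((form) => form.includes("ʔd"));

console.log(withSequence); // the two forms containing the sequence «ʔd»
```

Unlike a notch, which can only record that ʔ and d each occur somewhere in a word, the substring test is sensitive to adjacency and order.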

Hierarchical structure in interlinears

Complex objects as a model of hierarchy

Unlike the three artifacts described so far in this chapter, whose data were conceptualized either as individual objects representing words (section 2.2), or arrays of objects representing words (sections 2.3-2.4), we shall see that the nature of the data involved in representing an entire interlinear text is not easily reduced to an array of objects whose attributes are simple values. This is because the data required to encode interlinearized text is hierarchical by nature: morphemes are “contained” in words, words are “contained” in sentences, etc.

This notion of nested data is so key to the digitization of language documentation that I will begin metaphorically, using a non-linguistic explanation. One way to visualize hierarchical data is to picture matryoshka dolls, a traditional Russian toy in which hollow wooden dolls are nested together such that they may be opened and revealed in sequence.

Matryoshka dolls are nested by size.

Note that in this arrangement, each doll is essentially alike in kind to every other, and furthermore, each level contains only one element (a single doll). As we saw in 2.4, arrays are useful for representing an ordered list of objects, but the hierarchical dolls do not constitute a simple ordered list, because a “flat” array of objects does not express the idea of nesting, only that the objects are in a specified linear order. Even so, we may think of each doll as an object with two attributes, name and contains, where name is simply a label and contains references the name of the doll which it directly contains.

Five independently labeled matryoshkas encoding nesting hierarchy.

How then to represent matryoshka dolls (or any data containing a hierarchical structure) in terms of the two generic data structures we have been discussing thus far, objects and arrays? The information in this diagram would seem to be entirely equivalent to the structure in the clay tablet in section 2.3: we see an ordered array of “doll” objects, each with two attributes, one called name and another called contains.

A “spreadsheet”-style representation of data describing five named matryoshka dolls. Hierarchical relationships are not directly captured, only referenced by value.10

But this representation fails to encode the most important feature of this data: the fact that it is hierarchical. Note that the value of the contains attribute is represented textually, with a string which matches the value for name of another doll object, but that relationship is only implicit. A more useful way to nest data is to embed another object directly as a value, rather than referring to it indirectly through a string of text which happens to match the value of one of that object’s attributes. This technique is sufficient to capture hierarchical relationships, because it expresses a tree structure directly.

Considering the dolls in the image above, there is a nesting “depth” of five. By specifying for each doll which other doll “nests” inside, we can encode the whole hierarchy. Returning to the visualization of objects used in Chapter One and above, notice how in the tabulation below, objects are themselves serving as values. Again, in contrast to the previous diagram, this structure is not an array. This is a hierarchical data structure.

{
  "name": "Irina",
  "contains": {
    "name": "Svetlana",
    "contains": {
      "name": "Yulia",
      "contains": {
        "name": "Nataliya",
        "contains": {
          "name": "Ekaterina",
          "contains": {}
        }
      }
    }
  }
}
An abstract representation of the Matryoshka dolls. The hierarchy is directly encoded.
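Written as a JavaScript object, the same structure can be walked from the outermost doll inward. The following is a minimal sketch rather than a definitive implementation: the helper name `namesInside` is an assumption, and treating the empty object as "nothing further is nested" simply follows the convention of the representation above.

```javascript
// The nested dolls, transcribed from the representation above.
const dolls = {
  name: "Irina",
  contains: {
    name: "Svetlana",
    contains: {
      name: "Yulia",
      contains: {
        name: "Nataliya",
        contains: { name: "Ekaterina", contains: {} },
      },
    },
  },
};

// Collect names from the outside in. An object without a `name`
// (the empty object) marks the point where nesting stops.
function namesInside(doll) {
  if (doll.name === undefined) {
    return [];
  }
  return [doll.name, ...namesInside(doll.contains)];
}

const names = namesInside(dolls);
console.log(names.length); // the nesting depth
console.log(dolls.contains.contains.name); // the doll two levels inside Irina
```

Because each doll is embedded directly as a value, reaching an inner doll is a matter of following `contains` repeatedly; no lookup by matching name strings is needed.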

Arrays of words in attributes of a sentence

In Chapter One the crucial role of hierarchical data relationships to the task of digitizing language documentation was introduced. In the current section, we will apply the notion of “nested objects” as described in the previous section to the domain of interlinear texts. Again, we will begin with some concrete artifacts from the history of documentation, in this case, two print versions of a text in Takelma (Isolate, Oregon) entitled How a Takelma house was built.

“How a Takelma house was built” from Sapir’s ‘Takelma Texts’

The image above is a scan of the first version, published by Edward Sapir in his 1909 Takelma Texts. The book was published in “parallel” form: printed with transcription and translation on facing pages, aligned into paragraphs.

There is much value in this parallel format: one key benefit is that the visual representation conveys the source language’s equal importance to the target language, in that the Takelma is presented with exactly the same constraints of space, punctuation, and capitalization as the English. But for tracing a close comparison from the source language to the translation, the format is less usable, and may give rise to difficulties in interpreting the associations between transcriptions and translations, or even to errors in recording that alignment, if the parallel format is the original format of recording.

In order to use such a text, a reader (at least, any reader who does not know Takelma) must read it in “chunks,” zig-zagging between the Takelma and the English. Most probably, the reader would attempt to follow the text by reading one sentence at a time in each language, since the only available points of alignment between the running transcription and the running translation are 1) paragraph boundaries delimited by indentation and line spacing, and 2) sentence boundaries, delimited by standard punctuation. The table below, then, encodes essentially all the recoverable information about how the Takelma and English versions of the text relate to each other; the “misaligned” Takelma sentence which happens to correspond to two English sentences is highlighted.

 <tr>
   <th>¶</th>
   <th>#</th>
   <th>transcription</th>
   <th>translation</th>
 </tr>
 <tr>
   <td>1</td>
   <td>1</td>
   <td lang="tkm">Yap!a wíliⁱ k!emèi.</td>
   <td lang="en">The people are making a house.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>2</td>
   <td lang="tkm">Bẽm p!a-idī<sup>ɛ</sup>lóᵘk‘, emé<sup>ɛ</sup>s·i<sup>ɛ</sup> hono<sup>ɛ</sup> p!a-idī<sup>ɛ</sup>lóᵘk‘, hé<sup>ɛ</sup>me<sup>ɛ</sup> honó<sup>ɛ</sup> p!a-idī<sup>ɛ</sup>lóᵘk‘, hagamgamàn p!a-idī<sup>ɛ</sup>- lóᵘk‘.</td>
   <td lang="en">A post they set in the ground, and here again they set one in the ground, yonder again they set one in the ground, in four places they set them in the ground.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>3</td>
   <td lang="tkm">Hé<sup>ɛ</sup>ne hono<sup>ɛ</sup> hangilíp‘ gadàk‘ hagamgamàn, gadák‘s·i<sup>ɛ</sup> mǖ<sup>ɛ</sup>xdánhi hangilíp‘.</td>
   <td lang="en">Then also they place beams across on top in four places, and above (these) they put one across just once.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>4</td>
   <td lang="tkm">He<sup>ɛ</sup>ne yáᵃs·i<sup>ɛ</sup> wíli s·idibíⁱ k!emèĩ; he<sup>ɛ</sup>ne gadák‘s·i<sup>ɛ</sup> mats!àk‘ wiliⁱ heᵉlàm, t‘gàl ga hᵉlám k!emèĩ.</td>
   <td lang="en">And just then they make the house wall; and then on top they place the house boards, those they make out of sugar-pine lumber.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>5</td>
   <td lang="tkm">Ganē dak‘dát‘ dat!abàk‘, hā′<sup>ɛ</sup>ya dat!abàk‘.</td>
   <td lang="en">Then they finish it on top, on either side they finish it.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>6</td>
   <td lang="tkm">Ganē dede- wilíⁱdadís k!emèĩ dak‘dat‘s·í<sup>ɛ</sup> dahók‘wal k!emèĩ k!iyī′x ganàu ba-igináxdaᵃ.</td>
   <td lang="en">Then they make the door, and on top they make a hole for the going out of the smoke.</td>
 </tr>
 <tr>
   <td>1</td>
   <td>7</td>
   <td lang="tkm">Gan<sup>ɛ</sup>s·i<sup>ɛ</sup> gák!an k!emèĩ, xā<sup>ɛ</sup>īsgip!ísgap‘, gwelt‘gāũ gináx k!emèi; wili s·idibíⁱs·i<sup>ɛ</sup> k!emèī.</td>
   <td lang="en">And then they make a ladder, they notch out (a pole), for going down to the floor they make it; and the house wall they make.</td>
 </tr>
 <tr>
   <td>2</td>
   <td>8</td>
   <td lang="tkm">Ganē dat!abàk‘ ha<sup>ɛ</sup>īt‘bǖ′xt‘bixik‘ʷ.</td>
   <td lang="en">Then they finish it, all cleaned inside.</td>
 </tr>
 <tr style="background:lemonchiffon">
   <td>2</td>
   <td>1</td>
   <td lang="tkm">Ganē lep!ẽs hahū- wúᵘ<sup>ɛ</sup>k‘i, ganát‘ gidĩ alxalĩ yap!à; p!iⁱ yogáᵃ has·s·õᵘ, gas·i<sup>ɛ</sup> alxalīyaná<sup>ɛ</sup> hᵃ′<sup>ɛ</sup>ya p!iyà.</td>
   <td lang="en">Now rush mats they spread out inside, on such the people sit. The fireplace is in the center, so that they are seated on either side of the fire.</td>
 </tr>
 <tr>
   <td>2</td>
   <td>2</td>
   <td lang="tkm">Gana<sup>ɛ</sup>néx hop!è′<sup>ɛ</sup>n yap!a<sup>ɛ</sup>a wílⁱ; lep‘níxa wilíⁱ ganàt‘.</td>
   <td lang="en">In that way, indeed, was the house of the people long ago; in winter their house was such.</td>
 </tr>
 <tr>
   <td>2</td>
   <td>3</td>
   <td lang="tkm">Samáxas·i<sup>ɛ</sup> ana<sup>ɛ</sup>néx alxalĩ, ánī<sup>ɛ</sup> wíli ganàu.</td>
   <td lang="en">But in summer they were sitting like now, not in the house.</td>
 </tr>
 <tr>
   <td>2</td>
   <td>4</td>
   <td lang="tkm">Gwás· wili yaxa wit‘géyeᵉ<sup>ɛ</sup>k‘i, gas·i<sup>ɛ</sup> p!iⁱ yogáᵃ k!emèĩ habinì.</td>
   <td lang="en">Just a brush shelter they placed around, so that the fireplace they made in the middle.</td>
 </tr>
 <tr>
   <td>2</td>
   <td>5</td>
   <td lang="tkm">Gana<sup>ɛ</sup>nex samáxa alxalĩ, anī<sup>ɛ</sup> lep‘níxa nat‘ wíli ganàu.</td>
   <td lang="en">Thus they dwelt in summer, not as in winter in a house.</td>
 </tr>
 Aligned sentences from Sapir 1909

The highlighted row (paragraph 2, sentence 1) is of interest because it encodes information which is arguably misaligned in the source. While the transcription and translation of the first paragraph contain the same number of sentences (seven), the second paragraph contains a differing number of sentences in the Takelma and English equivalents (six and seven, respectively). It is unclear why Sapir set the Takelma sentence ¶2.1 with a semicolon instead of a period (after the word yap!à). He may have been trying to foreground some sort of difference in clausal linkage in Takelma as opposed to English. More likely, the period on the English “side” of sentence ¶2.1 was simply a typographical or printing error. This seems all the more likely since in the interlinear version below, semicolons appear on both sides. The misalignment of this one transcription/translation pair is of little consequence to the text’s worth or usability as documentation. Rather, we may view this inconsistency as a sort of side effect of the format in which the data was presented: the constraints of particular arrangements of data can affect how both the linguist and the reader conceptualize the information that is presented in the format.

There are of course two main problems with such a document as documentation: first, it says nothing about word-level meaning (there are no glosses), and second, it says nothing about analysis “below” the word level. Both lexical and morphological information are present, but only implicitly, and can be recovered only by a level of deductive reasoning which puts an enormous burden on the reader, who must essentially attempt to carry out a new analysis of which Takelma forms correspond to the meanings expressed in the English translation, and of the identity of Takelma morphological elements. Sapir, of course, was aware that the parallel text format did not suffice to effectively explain the grammatical structure of Takelma. The existence of a second, later version of the same text with interlinear annotations and extensive footnotes is evidence of that.

It is worth comparing the table above to the bilingual Sumerian/Akkadian lexicon in section 1.3. The resources are structurally identical: they are both tables. In the terminology used here, they may both be represented as arrays of objects. The differences between the two are thus not in the data structure per se (both the tablet and the parallel text may be “captured” by an array of objects), but rather in the class11 of the objects in each case: in the lexicon, the objects are words, and in the current case, the objects are “sentences”. Whereas we have represented words with the attributes form and gloss, sentences in the parallel text are represented with the attributes transcription and translation. [^optional-attributes]

Thus, our inventory of classes of objects has increased from one (Words) to two (Words and Sentences). These two classes of data are intertwined in documentation: sentences, in some sense, are “made up of” words, and words are in some sense “part of” sentences.
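The structural parallel between the two classes can be made concrete with a short sketch. The word objects below reuse Takelma forms and glosses from this chapter in place of the lexicon entries, and the sentence objects are two rows of the table above with the superscript markup flattened; the variable names are assumptions chosen for illustration.

```javascript
// An array of word objects: each has `form` and `gloss`.
const words = [
  { form: "yap!a", gloss: "people" },
  { form: "k!emèi", gloss: "they make it" },
];

// An array of sentence objects: each has `transcription` and `translation`.
const sentences = [
  {
    transcription: "Yap!a wíliⁱ k!emèi.",
    translation: "The people are making a house.",
  },
  {
    transcription: "Ganē dat!abàk‘ haɛīt‘bǖ′xt‘bixik‘ʷ.",
    translation: "Then they finish it, all cleaned inside.",
  },
];

// The data structure is the same in both cases: an array of objects.
console.log(Array.isArray(words), Array.isArray(sentences));

// What differs is the class of object, visible in the attribute names.
const wordAttributes = Object.keys(words[0]);
const sentenceAttributes = Object.keys(sentences[0]);
console.log(wordAttributes, sentenceAttributes);
```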

Turning, then, to the second artifact: Sapir also published a second version of this text in his grammar of Takelma, The Takelma language of southwest Oregon (Sapir 1912):

First page of interlinear version of ‘How a Takelma house was built’ Second page of interlinear version of ‘How a Takelma house was built’

Interlinear version of Takelma Text (Sapir 1912:294-296). Glosses are periphrastic, and there are no morpheme boundaries in forms or glosses.

Let us consider the first sentence of the previous text, viewed in the familiar four-tier interlinear format (I have used a serif typeface here to simulate appearance in a typical print journal):

{
  "transcription": "yap!a wi’lī' k!emèi",
  "translation": "The people are making a house.",
  "words": [
    {
      "form": "yap!a",
      "gloss": "people"
    },
    {
      "form": "wi’lī'",
      "gloss": "house"
    },
    {
      "form": "k!emèi",
      "gloss": "they make it"
    }
  ]
}

It is unsurprising that such a presentation is often described as interlinear, or as consisting of “lines” or “tiers”. But all such terminology de-emphasizes the fact that the relationships between lines are not equivalent for all pairs of lines. That is to say, the relationship between the transcription yap!a wi’lī' k!emèi and the free translation ‘The people are making a house’ is in a sense a symmetrical one: both are direct attributes of a “sentence.” But the second and third “lines” do not relate to the sentence in the same way.

A fairly simple way to see this is to consider not the shortest, but the longest sentence in the text, and how it appears as an interlinear:

{
  "transcription": "Bẽm p!a-idīɛlóᵘk‘, eméɛs·iɛ honoɛ p!a-idīɛlóᵘk‘, héɛmeɛ honóɛ p!a-idīɛlóᵘk‘, hagamgamàn p!a-idīɛ- lóᵘk‘.",
  "translation": "A post they set in the ground, and here again they set one in the ground, yonder again they set one in the ground, in four places they set them in the ground.",
  "words": [
    {
      "form": "bēm",
      "gloss": "post"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "eme’ɛs·iɛ",
      "gloss": "and here"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "he’ɛmeɛ",
      "gloss": "yonder"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "hagamgama`n",
      "gloss": "in four places"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set them down"
    }
  ],
  "tags": [],
  "metadata": {}
}

This presentation is no longer “literally” a four-tier gloss: while the transcription (first line) and the free translation (last line) wrap individually, the second and third “lines” wrap together. It is not difficult to see why this must be the case if we treat the sequences of forms and glosses as if they truly were independent tiers on the same level as the transcription and translation, as follows:

Bẽm p!a-idīɛlóᵘk‘, eméɛs·iɛ honoɛ p!a-idīɛlóᵘk‘, héɛmeɛ honóɛ p!a-idīɛlóᵘk‘, hagamgamàn p!a-idīɛ- lóᵘk‘.


bēm p!a-idīɛlō’k‘ eme’ɛs·iɛ hono p!a-idīɛlō’k‘ he’ɛmeɛ hono p!a-idīɛlō’k‘ hagamgama`n p!a-idīɛlō’k‘
post they set it down and here again they set it down yonder again they set it down in four places they set them down

A post they set in the ground, and here again they set one in the ground, yonder again they set one in the ground, in four places they set them in the ground.

We may visualize this special relationship of the form and gloss “tiers” by placing a dashed border around the form/gloss “tier”, and a dotted line around each individual word. This is a visual expression of the notion of hierarchy: the contents of the second and third “tiers” are not tiers at all. Or, if we choose to think of them in terms of tiers, they constitute a single tier which is complex: it contains a distinct sequence of words, each of which pairs a form with its gloss.

{
  "transcription": "Bẽm p!a-idīɛlóᵘk‘, eméɛs·iɛ honoɛ p!a-idīɛlóᵘk‘, héɛmeɛ honóɛ p!a-idīɛlóᵘk‘, hagamgamàn p!a-idīɛ- lóᵘk‘.",
  "translation": "A post they set in the ground, and here again they set one in the ground, yonder again they set one in the ground, in four places they set them in the ground.",
  "words": [
    {
      "form": "bēm",
      "gloss": "post"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "eme’ɛs·iɛ",
      "gloss": "and here"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "he’ɛmeɛ",
      "gloss": "yonder"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "hagamgama`n",
      "gloss": "in four places"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set them down"
    }
  ],
  "tags": [],
  "metadata": {}
}

The first word (bēm ‘post’) is not itself represented directly as an attribute of the sentence; it is, rather, represented as the first object of an array which is the value of a property of the sentence, which we shall choose to name words. Such a representation is at the heart of the notion of hierarchy, and any software for documentary data must address it. In principle, it is entirely comparable to the kind of nesting we saw in the matryoshka dolls above, except that rather than nesting a hierarchy of objects, in this case we have nested a single level of an array, which in turn contains objects representing words. Both the word objects and their collection into a list are captured in the data structure.
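The difference between a direct attribute and a nested one shows up clearly in how each is reached in code. The sketch below uses the shorter first sentence of the text for brevity; the variable names `firstWord`, `formTier`, and `glossTier` are assumptions introduced for illustration.

```javascript
// The first sentence of the text, with its words nested in an array.
const sentence = {
  transcription: "yap!a wi’lī' k!emèi",
  translation: "The people are making a house.",
  words: [
    { form: "yap!a", gloss: "people" },
    { form: "wi’lī'", gloss: "house" },
    { form: "k!emèi", gloss: "they make it" },
  ],
};

// The translation is a direct attribute of the sentence.
console.log(sentence.translation);

// A word is reached through the `words` array, by its position.
const firstWord = sentence.words[0];
console.log(firstWord.form, firstWord.gloss);

// The form and gloss "tiers" are not stored as such: each can be
// derived from the nested word objects whenever it is needed.
const formTier = sentence.words.map((word) => word.form).join(" ");
const glossTier = sentence.words.map((word) => word.gloss).join(" ");
console.log(formTier);
console.log(glossTier);
```

Since the "tiers" are derived rather than stored, a form and its gloss cannot drift out of alignment: they travel together inside a single word object.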

We may also view this sentence data structure in the now-familiar tabulation format, with visible property labels:

{
  "transcription": "Bẽm p!a-idīɛlóᵘk‘, eméɛs·iɛ honoɛ p!a-idīɛlóᵘk‘, héɛmeɛ honóɛ p!a-idīɛlóᵘk‘, hagamgamàn p!a-idīɛ- lóᵘk‘.",
  "translation": "A post they set in the ground, and here again they set one in the ground, yonder again they set one in the ground, in four places they set them in the ground.",
  "words": [
    {
      "form": "bēm",
      "gloss": "post"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "eme’ɛs·iɛ",
      "gloss": "and here"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "he’ɛmeɛ",
      "gloss": "yonder"
    },
    {
      "form": "hono",
      "gloss": "again"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set it down"
    },
    {
      "form": "hagamgama`n",
      "gloss": "in four places"
    },
    {
      "form": "p!a-idīɛlō’k‘",
      "gloss": "they set them down"
    }
  ]
}

So we have seen that it is helpful to view an array of word objects as a nested attribute of a sentence object. But this document is by no means equivalent in information to the way it would be presented in a modern grammar. The publication of this text is, after all, from an early date in the history of Americanist linguistics, and conventions for interlinear texts were still evolving. Despite the fact that individual words have word-level translations, the word-by-word, periphrastic glossing style now seems quite old-fashioned, if not inadequate. Why is this?

Arrays of morphemes in attributes of a word

The primary shortcoming of such an analysis is, of course, that while something about the lexical meaning of individual words has been captured, the meanings of individual morphemes have not. Modern styles of notating grammatical categories, such as the system now described by the Leipzig Glossing Rules, did not arise in a vacuum. Those conventions evolved through a process of experimentation in the typographical formatting of interlinear glossing.

Indeed, in the same year that the first version of Sapir’s text was published (1909), one of his contemporaries in Europe was already using something quite close to what would become the modern notation for morpheme-level glossing. As pointed out in Lehmann (2004), Franz Nikolaus Finck12 used morpheme delimiters in his interlinear analysis of Turkish (Finck 1909:83):

Franz Nikolaus Finck’s interlinear analysis of a Turkish sentence. Finck (1909:83), cited in Lehmann (2004). Morpheme boundaries are found within forms and glosses.

Even accounting for the antique Fraktur typeface, Finck’s glossing style does look remarkably modern:13

{
  "transcription": "xodža-da esbāb-ın dzümle-si-ni ateš-e vur-up yak-ar",
  "translationRoman": "Der Meister warf nun sämtliche Kleider ins Feuer und verbrannte sie.",
  "translation": "𝔇𝔢𝔯 𝔐𝔢𝔦𝔰𝔱𝔢𝔯 𝔴𝔞𝔯𝔣 𝔫𝔲𝔫 𝔰ä𝔪𝔱𝔩𝔦𝔠𝔥𝔢 𝔎𝔩𝔢𝔦𝔡𝔢𝔯 𝔦𝔫𝔰 𝔉𝔢𝔲𝔢𝔯 𝔲𝔫𝔡 𝔳𝔢𝔯𝔟𝔯𝔞𝔫𝔫𝔱𝔢 𝔰𝔦𝔢.",
  "words": [
    {
      "form": "xodža-da",
      "gloss": "𝔐𝔢𝔦𝔰𝔱𝔢𝔯=𝔞𝔲𝔠𝔥",
      "tags": [],
      "metadata": {},
      "roman": "Meister=auch"
    },
    {
      "form": "esbāb-ın",
      "gloss": "𝔎𝔩𝔢𝔦𝔡𝔢𝔯=(𝔡𝔢𝔯)",
      "tags": [],
      "metadata": {},
      "roman": "Kleider=(der)"
    },
    {
      "form": "dzümle-si-ni",
      "gloss": "𝔊𝔢𝔰𝔞𝔪𝔱𝔥𝔢𝔦𝔱=𝔦𝔥𝔯𝔢=𝔡𝔦𝔢",
      "tags": [],
      "metadata": {},
      "roman": "Gesamtheit=ihre=die"
    },
    {
      "form": "ateš-e",
      "gloss": "𝔉𝔢𝔲𝔢𝔯=𝔷𝔲",
      "tags": [],
      "metadata": {},
      "roman": "Feuer=zu"
    },
    {
      "form": "vur-up",
      "gloss": "𝔴𝔢𝔯𝔣=𝔢𝔫𝔡𝔢𝔯𝔴𝔢𝔦𝔰𝔢",
      "tags": [],
      "metadata": {},
      "roman": "werf=enderweise"
    },
    {
      "form": "yak-ar",
      "gloss": "𝔳𝔢𝔯𝔟𝔯𝔢𝔫𝔫=𝔢𝔫𝔡",
      "tags": [],
      "metadata": {},
      "roman": "verbrenn=end"
    }
  ],
  "tags": [],
  "metadata": {}
}

While the theoretical concept of the “morpheme” was coined by Baudouin de Courtenay in the 1880s, and would not be popularized within the field until the publication of Bloomfield’s Language in 1933 (Aronoff and Volpe 2006), the more basic notion of words being analyzable into constituent parts has very ancient antecedents: Babylonian scholars, for instance, produced systematic tabulations of structure which amount to paradigmatic analyses of Sumerian words (Black 1984). Indeed, Sapir’s and Finck’s “mental models” of the structure of words were probably almost identical. It was only the technology of glossing notation that differed between these two displays, not the intuitive model. That this is the case can be inferred by a close reading of Sapir’s critical apparatus: his voluminous footnotes, where morphemes are delimited with hyphens just as in Finck’s text. The form p!a-idīɛlóᵘk‘, for instance, periphrastically glossed “they set it down,” is analyzed in footnote five as follows:

The original footnote reads:

  1. p!a-i- DOWN § 37, 13; dīⁱ- § 36, 10. lōʹᵘkᵉ third personal subject, third personal object aorist of verb lōʹᵘgwᵋn Type 6 I SET IT; §§ 63; 40, 6.

Associating each morpheme with the relevant part of the analysis and a new (Leipzig-style) gloss, we may summarize as follows:

[
  {
    "morpheme": "p!a-i-",
    "footnote": "DOWN § 37, 13;",
    "grammar": "§37.13",
    "gloss": "DOWN"
  },
  {
    "morpheme": "dīɛ-",
    "footnote": "§ 36, 10.",
    "grammar": "§36.10",
    "gloss": "BEHIND"
  },
  {
    "morpheme": "lóᵘk‘",
    "footnote": "third personal subject, third personal object aorist of verb lō´ᵘgwaɛn Type 6 I set it; §§ 63; 40, 6.",
    "grammar": "§63, §40.6",
    "gloss": "3S.3O.AOR.set"
  }
]

Accumulating each of the added morphemic glosses into a single word-level gloss, we might expect to see the form glossed something like this:

{
  "form": "p!a-i-dīɛ-lóᵘk‘",
  "gloss": "DOWN-BEHIND-3S.3O.AOR.set"
}

…where the grammatical category abbreviations stand for:

{
  "3": "third person",
  "S": "subject",
  "O": "object",
  "AOR": "aorist tense"
}
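The step of "accumulating" morpheme-level glosses into a word-level gloss is mechanical enough to sketch in code. The morpheme objects below are abridged from the summary above to the two attributes needed; the choice to strip each trailing boundary hyphen before rejoining is an assumption about how one might implement the step, not a convention taken from Sapir or Finck.

```javascript
// Morphemes and glosses from the summary above (other attributes omitted).
const morphemes = [
  { morpheme: "p!a-i-", gloss: "DOWN" },
  { morpheme: "dīɛ-", gloss: "BEHIND" },
  { morpheme: "lóᵘk‘", gloss: "3S.3O.AOR.set" },
];

// Remove the trailing boundary hyphen from each prefix, then rejoin,
// so that exactly one hyphen separates neighboring morphemes.
const form = morphemes
  .map((m) => m.morpheme.replace(/-$/, ""))
  .join("-");

// The word-level gloss is the morpheme glosses joined the same way.
const gloss = morphemes.map((m) => m.gloss).join("-");

console.log({ form, gloss });
```

Deriving the word from its morphemes in this way, rather than typing the word-level form and gloss separately, keeps the two levels of analysis from contradicting one another.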

This reformatted presentation of (some of) the information which is encoded into footnotes in Sapir’s publication is thus not very different in kind from the information that Finck encoded: both were simply using different typographical conventions to capture word structure. Both (as have many linguists since their time) used typographical means to express the notion of hierarchy in documentary data. But it cannot have failed to occur to Sapir that his footnotes were highly redundant. For instance, the annotation third personal subject, third personal object aorist of verb ~ appears in no fewer than five footnotes. See:

  1. Third personal subject, third personal object aorist of verb k!emēᵋn Type 3 I MAKE IT; §§ 63; 65.
  2. p!a-i- DOWN § 37, 13; dīⁱ- § 36, 10. lōʹᵘkᵉ third personal subject, third personal object aorist of verb lōʹᵘgwᵋn Type 6 I SET IT; §§ 63; 40, 6.
  3. han- ACROSS § 37, 1. -gili`p' third personal subject, third personal object aorist of verb -gilibaᵉn
  4. Third personal subject, third personal object aorist of verb mats!aga'ʹᵋn Type 3 I PUT IT; §§ 63; 40, 3.
  5. da- § 36, 2 end; -t!aba`kᵉ third personal subject, third personal object aorist of verb -t!abagaʹᵋn Type 3

This approach is unwieldy at best, both in production (individually typesetting each repeated structure) and in use (the unfortunate reader who must trace each repetitive structure’s description between the primary text and the footnote).

Finck’s approach of juxtaposing grammatical category labels directly beside the morphemes to which they refer is in a sense more “cryptic,” in that it requires readers to remember and mentally expand abbreviations. But in the same way that keeping the forms and glosses of individual words together is more visually meaningful, the “visual closeness” of the individual morphemes and their glosses is less ambiguous, more readable, and less error-prone. We will not trace out the full hierarchy from sentences through morphemes here, which would consist of two levels of nesting, because at this point we have seen enough of the “physical” or “visual” expressions of hierarchical representations that we can proceed to the task of implementation. While we have used only tabular representations to investigate hierarchical documentary data in this chapter, in the next chapter we will see that encoding these structures into a computer program is very direct indeed. Furthermore, this process of representing nested structures as attributes whose values are arrays of objects can work in both directions. Thus, we shall see that this simple mechanism is sufficient to represent texts, corpora, lexicons, and other useful documentary structures.

It is these data structures that will serve as the basis for applications which allow users to create, manipulate, and analyze significant amounts of documentary data. Thus, in the next chapter we will proceed to some basic programming, using the browser’s “built-in” programming language, JavaScript. We will begin by learning the simple programming syntax for writing down objects and arrays in a machine-readable way. Then we will proceed to learn the basics of 1) “variables,” or program-internal names for particular objects; 2) functions, or named procedures which can transform data; and 3) classes, a sort of “factory” for creating particular types of data object.

  1. Dereferencing (attributes of objects are retrievable by property name)
  2. Sorting (an array of such objects may be grouped or sorted by dereferencing one or more attributes)
  3. Searching (subsets of an array may be retrieved by iterating through the array and retrieving all items which match one or more attribute queries)
  4. Hierarchy (the three previous criteria must be usable with nested data: find all sentences which contain a word which is ergative; find all words which contain a morpheme which is an exponent of the “pronoun” system (which might in turn be defined as “words with explicit number, person, and case marking”))
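The four operations listed above can be sketched in a few lines of Javascript. The sample data, the attribute names, and the choice of a past-tense query (standing in for the ergative example, since the illustrative data is English) are all assumptions made for this sketch.

```javascript
// Hypothetical sample data: two sentences, each with nested words and morphemes.
const sentences = [
  {
    transcription: "dogs barked",
    words: [
      { form: "dogs", gloss: "dog-PL",
        morphemes: [{ form: "dog", gloss: "dog" }, { form: "-s", gloss: "PL" }] },
      { form: "barked", gloss: "bark-PST",
        morphemes: [{ form: "bark", gloss: "bark" }, { form: "-ed", gloss: "PST" }] }
    ]
  },
  {
    transcription: "cats sleep",
    words: [
      { form: "cats", gloss: "cat-PL",
        morphemes: [{ form: "cat", gloss: "cat" }, { form: "-s", gloss: "PL" }] },
      { form: "sleep", gloss: "sleep",
        morphemes: [{ form: "sleep", gloss: "sleep" }] }
    ]
  }
];

// 1. Dereferencing: retrieve an attribute by its property name.
const firstTranscription = sentences[0].transcription;

// 2. Sorting: order a copy of the array by one dereferenced attribute.
const sorted = [...sentences].sort((a, b) =>
  a.transcription.localeCompare(b.transcription)
);

// 3. Searching: collect every word, then keep those matching a query.
const allWords = sentences.flatMap(s => s.words);
const pluralWords = allWords.filter(w => w.gloss.endsWith("-PL"));

// 4. Hierarchy: find sentences containing a word containing a morpheme
//    glossed "PST" (the query reaches down through both levels of nesting).
const pastSentences = sentences.filter(s =>
  s.words.some(w => w.morphemes.some(m => m.gloss === "PST"))
);

console.log(firstTranscription);                 // "dogs barked"
console.log(sorted.map(s => s.transcription));   // [ "cats sleep", "dogs barked" ]
console.log(pluralWords.map(w => w.form));       // [ "dogs", "cats" ]
console.log(pastSentences.map(s => s.transcription)); // [ "dogs barked" ]
```

Note that the hierarchical query is simply a search whose matching condition is itself a search over a nested array, so no mechanism beyond arrays and objects is needed.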

References


  1. I will use a monospace font to point out terms which have a specific meaning within a programming context.

  2. In Chapter 3 we will add a handful of more “primitive” data types, including “strings”, “numbers”, and true/false values (“booleans”).

  3. The distinction between a “gloss” and a “definition” is being handled loosely here, by design. The Akkadian content in this tablet is not a “gloss” in the modern sense as employed, for instance, in the Leipzig Glossing Rules. An implementation of the difference between the two concepts is detailed in chapter 3, but for now we define a “gloss” as “a relatively short translation of a form”; it is recognized that such a unit would often be referred to as a “definition.” We shall see further on that words may be represented as containing both glosses and definitions, if desired.

  4. In online documentation for Javascript the term property is sometimes used in the sense given to the term attribute here.

  5. Terms with double gray underlines are defined terms which may be found in the index.

  6. Modulo a few razings of cities and incinerations of libraries of tablets.

  7. The sense of documentation that I am using here is a restricted one, in the sense that I am referring to documentation which is intended for an audience which includes those who are not speakers of the language being studied. This is by no means meant to imply that monolingually-oriented documentation projects by native speakers of a language should not be considered to be documentation. To the contrary, such projects are of great benefit. In addition to their inherent value as records of the speaker/linguist’s language, such work may also serve as the basis for building “cross-lingual” documentation in the restricted sense described here.

  8. Interestingly, because of a peculiarity of Pomoan grammar, Oswalt’s own sort-order was not a direct alphabetization. Pomoan has a series of some twenty verbal prefixes which are so common in the lexicon that Oswalt chose to sort forms by their second syllable.

  9. Samarin, William J. 1965. Perspective on African ideophones. African Studies 24(2). 117–121.

  10. This table may resemble a “join table” to readers with experience of relational (SQL) databases: in such a system this table of data representing matryoshka dolls could be joined on itself, with each name acting as a “primary key”. In effect, that approach emulates a tree structure from “flat” tables like the one here. The approach taken in this dissertation, for the sake of newcomers to programming, is to represent the core data as a tree structure without the use of join tables. While there may be a cost in efficiency for this approach, I believe it is pedagogically defensible and sufficiently efficient for the scale of databases described here. For some thoughts on scaling this system to “bigger” data, see Chapter 6.

  11. For the notion of class in Javascript, see chapter 3.

  12. Interestingly, Finck was J.P. Harrington’s advisor until Finck’s untimely death one year after the publication of this work. See Harrington’s obituary in American Anthropologist.

  13. Note that the double hyphens in the glosses are unrelated to the use of equals signs to represent clitic boundaries in modern glosses; rather, the so-called “double oblique hyphen” is the only typographical form of hyphens in the Fraktur typeface used by Finck to transcribe German.