Understanding data and data files

1.1 Some simple “delimited” data formats: CSV and TSV

Two kinds of “delimited text files” are in common use: .csv for “comma-separated values” and .tsv for “tab-separated values”. The difference might seem trivial, but I find tab-delimited files a bit easier to work with than CSV (ordinary text is full of commas but rarely contains tabs), so we’ll use TSV here.
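To see why tabs can be easier, consider a line whose content contains commas of its own. The snippet below is a small illustration (the variable names are mine, and the line is one from the sample file): splitting on commas cuts the sentence apart, while splitting on tabs yields exactly the five fields.

```javascript
// One line of the transcript; the Content field contains commas of its own.
const line = "3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242";

// Splitting naively on commas cuts the sentence into pieces.
const byComma = line.split(",");

// Splitting on tabs gives one token per column.
const byTab = line.split("\t");

console.log(byComma.length); // 7
console.log(byTab.length);   // 5
```

A real CSV file would usually get around this by putting the field in quotes, which is more than a simple split can handle.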

The input file we’ll be using is called DCA_se1_ag2_f_01_1.txt (the full file is linked in the citation below). Here are the first few lines:

Line Spkr    StTime  Content EnTime
        1   DCA_int_06  0.5165  What kind of games did you play as a child? 2.1765
        2   DCA_int_06  2.1765  (pause 1.74)    3.9158
        3   DCA_int_06  3.9158  Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,  8.6242
        4   DCA_int_06  8.6242  (pause 0.40)    9.0252
        5   DCA_int_06  9.0252  Capture the Flag, Hide and Seek, games with bottle caps?    12.3860
        6   DCA_int_06  12.3860 (pause 2.27)    14.6521
        7   DCA_int_06  14.6521 What type of games did you play?    16.2748
        8   DCA_se1_ag2_f_01    16.4136 Baseball (laughing),    17.9290
        9   DCA_se1_ag2_f_01    17.9290 (pause 0.73)    18.6611
        10  DCA_se1_ag2_f_01    18.6611 marbles, you know.  19.9713
        11  DCA_se1_ag2_f_01    19.9713 (pause 0.81)    20.7827
        12  DCA_se1_ag2_f_01    20.7827 Uh, 21.4029
        13  DCA_se1_ag2_f_01    21.4029 (pause 2.87)    24.2694
        14  DCA_se1_ag2_f_01    24.2694 May I?  24.9408
        15  DCA_int_06  25.9713 Yeah.   26.4283
          

Kendall, Tyler and Charlie Farrington. 2020. The Corpus of Regional African American Language. Version 2020.05. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal (File: http://lingtools.uoregon.edu/coraal/explorer/display.php?what=DCA_se1_ag2_f_01_1.txt&audio=1)

(There is also an accompanying audio file, which we’ll use in the next section.)

You can pretty much tell just by looking at it that it’s a TSV file, not a CSV file: the columns are separated by tabs, and the only commas are the ones inside the transcribed speech.

1.2 Our parsing strategy

Here’s our plan:

  1. SEGMENT Split the text (on the newline character) into an array of lines.
  2. TOKENIZE LINES For each line:
    1. TOKENIZE Split the line (on the tab character) into an array of tokens.
    2. ISOLATE HEADER LINE The first line of tokens is the header line, the rest of the lines are content. As we process each content line, we’re going to need to reference the header line, but we don’t want to treat the header line as data.
  3. ITERATE THROUGH LINES For each line:
    1. ALIGN Pair up each header with its corresponding token.
    2. OBJECT-IZE Convert that array of pairs into a single object, with the header as key and line token as value.
  4. RETURN Return the array of objects.
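To make the plan concrete, here is a sketch of what the data looks like after each step, using a two-line sample rather than the whole file. The variable names (sample, lines, tokenized, and so on) are placeholders chosen for illustration.

```javascript
// A tiny sample: the header line and one content line.
const sample = "Line\tSpkr\tStTime\tContent\tEnTime\n15\tDCA_int_06\t25.9713\tYeah.\t26.4283";

// 1. SEGMENT: one string per line.
const lines = sample.split("\n");

// 2.1 TOKENIZE: one array of tokens per line.
const tokenized = lines.map(line => line.split("\t"));

// 2.2 ISOLATE HEADER LINE: the first array holds the column names.
const headers = tokenized[0];
const tokens = tokenized[1];

// 3.1 ALIGN: pair each header with the token in the same position,
// e.g. ["Line", "15"], ["Spkr", "DCA_int_06"], ...
const pairs = headers.map((header, j) => [header, tokens[j]]);

// 3.2 OBJECT-IZE: turn the pairs into a single object,
// with keys Line, Spkr, StTime, Content and EnTime.
const record = Object.fromEntries(pairs);

console.log(record);
```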

To keep ourselves sane, let’s try to write little functions that carry out these tasks, and then combine them into something called parseTSV.

Let’s say we’re starting out with some data like this:

let tsv = `Line	Spkr	StTime	Content	EnTime
1	DCA_int_06	0.5165	What kind of games did you play as a child?	2.1765
2	DCA_int_06	2.1765	(pause 1.74)	3.9158
3	DCA_int_06	3.9158	Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,	8.6242
4	DCA_int_06	8.6242	(pause 0.40)	9.0252
5	DCA_int_06	9.0252	Capture the Flag, Hide and Seek, games with bottle caps?	12.3860
6	DCA_int_06	12.3860	(pause 2.27)	14.6521
7	DCA_int_06	14.6521	What type of games did you play?	16.2748
8	DCA_se1_ag2_f_01	16.4136	Baseball (laughing),	17.9290
9	DCA_se1_ag2_f_01	17.9290	(pause 0.73)	18.6611
10	DCA_se1_ag2_f_01	18.6611	marbles, you know.	19.9713
11	DCA_se1_ag2_f_01	19.9713	(pause 0.81)	20.7827
12	DCA_se1_ag2_f_01	20.7827	Uh,	21.4029
13	DCA_se1_ag2_f_01	21.4029	(pause 2.87)	24.2694
14	DCA_se1_ag2_f_01	24.2694	May I?	24.9408
15	DCA_int_06	25.9713	Yeah.	26.4283`

Here comes our first function.

let segmentTSV = tsv => tsv.split("\n") // step 1

You could also write this:

let segmentTSV = tsv => {
  return tsv.split("\n") // step 1
}

That’s exactly the same as the previous version. You have to use curly braces (and an explicit return) if your function body takes more than one statement, but ours is so simple we can make it pretty. 💕
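As a quick check of that equivalence, the sketch below defines both forms under different names (segmentShort and segmentLong, invented here so the two can coexist) and adds a third that shows the usual pitfall: a body in curly braces with no return gives back undefined.

```javascript
// Concise body: the value of the expression is returned implicitly.
const segmentShort = tsv => tsv.split("\n");

// Block body: curly braces need an explicit return.
const segmentLong = tsv => {
  return tsv.split("\n");
};

// Pitfall: a block body without return yields undefined.
const segmentBroken = tsv => {
  tsv.split("\n");
};

const text = "a\tb\nc\td";
console.log(segmentShort(text));  // an array of two strings
console.log(segmentLong(text));   // the same two strings
console.log(segmentBroken(text)); // undefined
```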

So what do we end up with? If we run:

segmentTSV(tsv)

…in the console, then we end up with an array of strings, one per line. Note that things look a little weird here: all the alignment is gone, and we see \t instead of an “actual” tab.

[
          "Line\tSpkr\tStTime\tContent\tEnTime",
          "1\tDCA_int_06\t0.5165\tWhat kind of games did you play as a child?\t2.1765",
          "2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158",
          "3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242",
          "4\tDCA_int_06\t8.6242\t(pause 0.40)\t9.0252",
          "5\tDCA_int_06\t9.0252\tCapture the Flag, Hide and Seek, games with bottle caps?\t12.3860",
          "6\tDCA_int_06\t12.3860\t(pause 2.27)\t14.6521",
          "7\tDCA_int_06\t14.6521\tWhat type of games did you play?\t16.2748",
          "8\tDCA_se1_ag2_f_01\t16.4136\tBaseball (laughing),\t17.9290",
          "9\tDCA_se1_ag2_f_01\t17.9290\t(pause 0.73)\t18.6611",
          "10\tDCA_se1_ag2_f_01\t18.6611\tmarbles, you know.\t19.9713",
          "11\tDCA_se1_ag2_f_01\t19.9713\t(pause 0.81)\t20.7827",
          "12\tDCA_se1_ag2_f_01\t20.7827\tUh,\t21.4029",
          "13\tDCA_se1_ag2_f_01\t21.4029\t(pause 2.87)\t24.2694",
          "14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408",
          "15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
        ]
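The \t in that output is only how the console writes a tab character inside a string; the data itself has not changed. A small check, assuming nothing beyond standard JavaScript strings:

```javascript
// "\t" is an escape sequence: two characters in source code, one in the string.
const tab = "\t";
console.log(tab.length);        // 1
console.log(tab.charCodeAt(0)); // 9, the character code for a tab

// JSON.stringify shows the string in escaped form, much as the console does.
const header = "Line\tSpkr";
console.log(JSON.stringify(header)); // "Line\tSpkr"

// console.log prints the string itself, so the tab appears as whitespace again.
console.log(header);
```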

Now, the first line is special, right? It’s like the row of column headers in a spreadsheet: it holds the names of the data, not the values. So we need to set it apart from the content lines, and then use it to label the tokens in each of them.
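One way to set the header apart, sketched here with a shortened sample of my own rather than the full file, is array destructuring with a rest element: the first line goes into one variable and everything else into another.

```javascript
// Shortened sample: the header line plus two content lines.
const lines = [
  "Line\tSpkr\tContent",
  "14\tDCA_se1_ag2_f_01\tMay I?",
  "15\tDCA_int_06\tYeah."
];

// Tokenize every line, then peel the header off the front.
const [headers, ...rows] = lines.map(line => line.split("\t"));

console.log(headers);     // the three column names
console.log(rows.length); // 2
```

The header tokens are now available for labelling, and the rows array contains only data.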

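let parseTSV = tsv => tsv
  .split('\n') // step 1: segment into lines
  .map(line => line.split('\t')) // step 2: tokenize each line
  .map((tokens, i, lines) => lines[0].map((header, j) => [header, tokens[j]])) // step 3.1: align
  .slice(1) // drop the header line, which is not data
  .map(pairs => Object.fromEntries(pairs)) // step 3.2: object-ize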
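For comparison, here is a sketch of the same pipeline built from one small function per step of the plan. The helper names (segment, tokenize, align, and parseTSVBySteps) are my own inventions, not part of any library, and the sample is cut down to three lines. Note that every value comes out as a string, so numeric columns such as StTime need converting before you do arithmetic with them.

```javascript
const segment = tsv => tsv.split("\n");        // step 1
const tokenize = line => line.split("\t");     // step 2.1
const align = (headers, tokens) =>             // step 3.1
  headers.map((header, j) => [header, tokens[j]]);

const parseTSVBySteps = tsv => {
  // step 2.2: the first tokenized line is the header, the rest are content
  const [headers, ...rows] = segment(tsv).map(tokenize);
  // steps 3.2 and 4: one object per content line, returned as an array
  return rows.map(tokens => Object.fromEntries(align(headers, tokens)));
};

const sample = [
  "Line\tSpkr\tStTime\tContent\tEnTime",
  "14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408",
  "15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
].join("\n");

const records = parseTSVBySteps(sample);
console.log(records[1].Content); // Yeah.

// Values are strings, so convert them before doing arithmetic.
const duration = Number(records[1].EnTime) - Number(records[1].StTime);
console.log(duration.toFixed(3)); // 0.457
```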