1.1 Some simple “delimited” data formats: CSV and TSV
There are two kinds of “delimited text files” in common use: .csv, for “comma-separated values”, and .tsv, for “tab-separated values”. The difference might seem trivial, but I find tab-delimited files a bit easier to work with than CSV, so we’ll use TSV here.
The input file we’ll be using is called DCA_se1_ag2_f_01_1.txt (the citation below the excerpt links to it). Here are the first few lines:
Line Spkr StTime Content EnTime
1 DCA_int_06 0.5165 What kind of games did you play as a child? 2.1765
2 DCA_int_06 2.1765 (pause 1.74) 3.9158
3 DCA_int_06 3.9158 Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I, 8.6242
4 DCA_int_06 8.6242 (pause 0.40) 9.0252
5 DCA_int_06 9.0252 Capture the Flag, Hide and Seek, games with bottle caps? 12.3860
6 DCA_int_06 12.3860 (pause 2.27) 14.6521
7 DCA_int_06 14.6521 What type of games did you play? 16.2748
8 DCA_se1_ag2_f_01 16.4136 Baseball (laughing), 17.9290
9 DCA_se1_ag2_f_01 17.9290 (pause 0.73) 18.6611
10 DCA_se1_ag2_f_01 18.6611 marbles, you know. 19.9713
11 DCA_se1_ag2_f_01 19.9713 (pause 0.81) 20.7827
12 DCA_se1_ag2_f_01 20.7827 Uh, 21.4029
13 DCA_se1_ag2_f_01 21.4029 (pause 2.87) 24.2694
14 DCA_se1_ag2_f_01 24.2694 May I? 24.9408
15 DCA_int_06 25.9713 Yeah. 26.4283
Kendall, Tyler and Charlie Farrington. 2020. The Corpus of Regional African American Language. Version 2020.05. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal (File: http://lingtools.uoregon.edu/coraal/explorer/display.php?what=DCA_se1_ag2_f_01_1.txt&audio=1)
(There is also an accompanying audio file; we’ll use that in the next section.)
You can pretty much tell just by looking at it that it’s a TSV file, not a CSV file, since there aren’t commas all over the place.
1.2 Our parsing strategy
Here’s our plan:
- SEGMENT Split (on the newline character) into an array of lines.
- TOKENIZE LINES For each line:
  - TOKENIZE Split (on the tab character) into an array of tokens.
- ISOLATE HEADER LINE The first line of tokens is the header line; the rest of the lines are content. As we process each content line, we’re going to need to reference the header line, but we don’t want to treat the header line as data.
- ITERATE THROUGH LINES For each content line:
  - ALIGN Pair up each header with its corresponding token.
  - OBJECT-IZE Convert that array of pairs into a single object, with the header as key and the line token as value.
- RETURN Return the array of objects.
To keep ourselves sane, let’s try to write little functions that carry out these tasks, and then combine them into something called parseTSV. So our code will look like this:
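As a rough sketch of how the steps in the plan might fit together: only the names segmentTSV and parseTSV come from this tutorial; tokenizeLines and objectizeLine are hypothetical names chosen for illustration, and the use of Object.fromEntries is an assumption about how the object-izing step could be done.

```javascript
// SEGMENT: split the whole file into an array of lines.
let segmentTSV = tsv => tsv.split("\n")

// TOKENIZE LINES: split every line on the tab character.
// (tokenizeLines is a hypothetical name.)
let tokenizeLines = lines => lines.map(line => line.split("\t"))

// ALIGN + OBJECT-IZE: pair each header with its token,
// then turn the pairs into one object.
// (objectizeLine is a hypothetical name.)
let objectizeLine = (headers, tokens) =>
  Object.fromEntries(headers.map((header, i) => [header, tokens[i]]))

let parseTSV = tsv => {
  let lines = segmentTSV(tsv)
  // ISOLATE HEADER LINE: the first array of tokens is the header, the rest is data.
  let [headers, ...rows] = tokenizeLines(lines)
  // ITERATE THROUGH LINES, then RETURN the array of objects.
  return rows.map(tokens => objectizeLine(headers, tokens))
}

console.log(parseTSV("Line\tSpkr\n1\tDCA_int_06"))
// [ { Line: '1', Spkr: 'DCA_int_06' } ]
```

Note that every value stays a string in this sketch; converting StTime and EnTime to numbers would be a separate step.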
Let’s say we’re starting out with some code like this:
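In practice the file’s contents would be loaded rather than typed in. As a stand-in, here is a minimal sketch that hard-codes the header and the first three content lines from the excerpt above into a variable called tsv (the name used in the console call later on). Building the string with join is an assumption made to keep the tabs and newlines explicit.

```javascript
// A small stand-in for the file contents: fields separated by tabs (\t),
// lines separated by newlines (\n).
let tsv = [
  "Line\tSpkr\tStTime\tContent\tEnTime",
  "1\tDCA_int_06\t0.5165\tWhat kind of games did you play as a child?\t2.1765",
  "2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158",
  "3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242"
].join("\n")

console.log(tsv)
```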
Here comes our first function.
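Judging from the longer version that follows, the concise form is presumably a one-line arrow function with an implicit return. A minimal sketch under that assumption:

```javascript
// SEGMENT: split the TSV string on the newline character.
// With no curly braces, the arrow function returns the expression's value.
let segmentTSV = tsv => tsv.split("\n") // step 1

console.log(segmentTSV("Line\tSpkr\n1\tDCA_int_06").length) // 2
```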
You could also write this:
let segmentTSV = tsv => {
  return tsv.split("\n") // step 1
}
That’s exactly the same as the previous version. You need the curly braces (and an explicit return) if your function body contains more than one statement, but ours is so simple we can make it pretty. 💕
So what do we end up with? If we run:
segmentTSV(tsv)
…in the console, we get an array of strings, one per line. Note that things look a little weird here: suddenly all the alignment is gone, and we see \t instead of an “actual” tab.
[
"Line\tSpkr\tStTime\tContent\tEnTime",
"1\tDCA_int_06\t0.5165\tWhat kind of games did you play as a child?\t2.1765",
"2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158",
"3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242",
"4\tDCA_int_06\t8.6242\t(pause 0.40)\t9.0252",
"5\tDCA_int_06\t9.0252\tCapture the Flag, Hide and Seek, games with bottle caps?\t12.3860",
"6\tDCA_int_06\t12.3860\t(pause 2.27)\t14.6521",
"7\tDCA_int_06\t14.6521\tWhat type of games did you play?\t16.2748",
"8\tDCA_se1_ag2_f_01\t16.4136\tBaseball (laughing),\t17.9290",
"9\tDCA_se1_ag2_f_01\t17.9290\t(pause 0.73)\t18.6611",
"10\tDCA_se1_ag2_f_01\t18.6611\tmarbles, you know.\t19.9713",
"11\tDCA_se1_ag2_f_01\t19.9713\t(pause 0.81)\t20.7827",
"12\tDCA_se1_ag2_f_01\t20.7827\tUh,\t21.4029",
"13\tDCA_se1_ag2_f_01\t21.4029\t(pause 2.87)\t24.2694",
"14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408",
"15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
]
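The \t in that output is only the console’s way of writing a tab inside a quoted string; the string itself still contains a single tab character at each of those spots. A quick check, using one line from the output above (the variable names line and tokens are just for illustration):

```javascript
let line = "2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158"

// Each \t escape is one character, not two.
console.log("a\tb".length) // 3

// Splitting on the tab character recovers the five fields.
let tokens = line.split("\t")
console.log(tokens.length) // 5

// Logging the string itself (rather than inspecting it) shows real tabs.
console.log(line)
```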
Now, the first line is special, right? It’s like the row of column headers in a spreadsheet: it holds the names of the data, not the values. So we need to