First off, don’t do it the way I’m about to explain to you; use a real library like PapaParse. There, full disclosure.
Okay, now for the way I was about to explain. Because what fun is using a library?
There are two kinds of “delimited text files” in common use: .csv for “comma-separated values” and .tsv for “tab-separated values”. The difference might seem trivial, but I think there’s a good argument that tab-separated is a better choice.
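One such argument (my guess at the reasoning, not something spelled out here) is that commas turn up inside ordinary sentences all the time, while tabs almost never do. A minimal sketch, using a made-up line written out in both formats:

```javascript
// Hypothetical example line: the same three fields, delimited two ways.
const csvLine = "3,DCA_int_06,Perhaps, did you play uh, marbles";
const tsvLine = "3\tDCA_int_06\tPerhaps, did you play uh, marbles";

// Naive splitting on commas chops the sentence into extra "fields".
const csvFields = csvLine.split(",");
// Splitting on tabs leaves the commas inside the sentence alone.
const tsvFields = tsvLine.split("\t");

console.log(csvFields.length); // 5
console.log(tsvFields.length); // 3
```

Real CSV files get around this by wrapping such fields in quotes, which is exactly the kind of thing that makes a CSV parser harder to write by hand.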
Let’s assume that you have read your file into a string. It might look like this:
Line Spkr StTime Content EnTime
1 DCA_int_06 0.5165 What kind of games did you play as a child? 2.1765
2 DCA_int_06 2.1765 (pause 1.74) 3.9158
3 DCA_int_06 3.9158 Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I, 8.6242
4 DCA_int_06 8.6242 (pause 0.40) 9.0252
5 DCA_int_06 9.0252 Capture the Flag, Hide and Seek, games with bottle caps? 12.3860
6 DCA_int_06 12.3860 (pause 2.27) 14.6521
7 DCA_int_06 14.6521 What type of games did you play? 16.2748
8 DCA_se1_ag2_f_01 16.4136 Baseball (laughing), 17.9290
9 DCA_se1_ag2_f_01 17.9290 (pause 0.73) 18.6611
10 DCA_se1_ag2_f_01 18.6611 marbles, you know. 19.9713
11 DCA_se1_ag2_f_01 19.9713 (pause 0.81) 20.7827
12 DCA_se1_ag2_f_01 20.7827 Uh, 21.4029
13 DCA_se1_ag2_f_01 21.4029 (pause 2.87) 24.2694
14 DCA_se1_ag2_f_01 24.2694 May I? 24.9408
15 DCA_int_06 25.9713 Yeah. 26.4283
Kendall, Tyler and Charlie Farrington. 2020. The Corpus of Regional African American Language. Version 2020.05. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal (File: http://lingtools.uoregon.edu/coraal/explorer/display.php?what=DCA_se1_ag2_f_01_1.txt&audio=1)
You can pretty much tell just by looking at it that it’s a TSV file, not a CSV file, since there aren’t commas all over the place.
Here’s our plan:
- SEGMENT Split (on the newline character) into an array of lines.
- TOKENIZE LINES For each line:
  - TOKENIZE Split (on the tab character) into an array of tokens.
- ISOLATE HEADER LINE The first line of tokens is the header line; the rest of the lines are content. As we process each content line, we’re going to need to reference the header line, but we don’t want to treat the header line as data.
- ALIGN Pair up each header with its corresponding token.
- OBJECT-IZE Convert that array of pairs into a single object, with the header as key and line token as value.
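The last two steps are the least obvious, so here is a rough sketch of them on a single line. The variable names are made up for illustration, and the header and tokens are typed out by hand rather than produced by the earlier steps:

```javascript
// Hypothetical mini-example: one header line and one content line, already tokenized.
const headers = ["Line", "Spkr", "StTime", "Content", "EnTime"];
const tokens = ["14", "DCA_se1_ag2_f_01", "24.2694", "May I?", "24.9408"];

// ALIGN: pair each header with the token in the same position.
const pairs = headers.map((header, j) => [header, tokens[j]]);

// OBJECT-IZE: turn the array of [key, value] pairs into one object.
const row = Object.fromEntries(pairs);

console.log(pairs[3]);    // the pair ["Content", "May I?"]
console.log(row.Content); // May I?
```

Note that every value is still a string at this point, including the ones that look like numbers.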
To keep ourselves sane, let’s try to write little functions that carry out these tasks, and then combine them into something called parseTSV.
Let’s say we’re starting out with some code like this:
let tsv = `Line Spkr StTime Content EnTime
1 DCA_int_06 0.5165 What kind of games did you play as a child? 2.1765
2 DCA_int_06 2.1765 (pause 1.74) 3.9158
3 DCA_int_06 3.9158 Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I, 8.6242
4 DCA_int_06 8.6242 (pause 0.40) 9.0252
5 DCA_int_06 9.0252 Capture the Flag, Hide and Seek, games with bottle caps? 12.3860
6 DCA_int_06 12.3860 (pause 2.27) 14.6521
7 DCA_int_06 14.6521 What type of games did you play? 16.2748
8 DCA_se1_ag2_f_01 16.4136 Baseball (laughing), 17.9290
9 DCA_se1_ag2_f_01 17.9290 (pause 0.73) 18.6611
10 DCA_se1_ag2_f_01 18.6611 marbles, you know. 19.9713
11 DCA_se1_ag2_f_01 19.9713 (pause 0.81) 20.7827
12 DCA_se1_ag2_f_01 20.7827 Uh, 21.4029
13 DCA_se1_ag2_f_01 21.4029 (pause 2.87) 24.2694
14 DCA_se1_ag2_f_01 24.2694 May I? 24.9408
15 DCA_int_06 25.9713 Yeah. 26.4283`
Here comes our first function.
let segmentTSV = tsv => tsv.split("\n") // step 1
You could also write this:
let segmentTSV = tsv => {
return tsv.split("\n") // step 1
}
That’s exactly the same as the previous version. You need the curly braces (and an explicit return) when your function body is more than a single expression, but ours is so simple we can make it pretty. 💕
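If you want to convince yourself that the two spellings behave the same, a quick check along these lines should do it. The function names and the sample string are invented for the comparison, and the third function shows a slip that is easy to make when switching to braces:

```javascript
// Two spellings of the same function (names are made up for this comparison).
const segmentShort = tsv => tsv.split("\n");
const segmentLong = tsv => {
  return tsv.split("\n");
};

// A common slip: braces but no return, so the function returns undefined.
const segmentOops = tsv => {
  tsv.split("\n");
};

const sample = "a\tb\n1\t2";
console.log(segmentShort(sample).length); // 2
console.log(segmentLong(sample).length);  // 2
console.log(segmentOops(sample));         // undefined
```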
So what do we end up with? An array of strings, one per line. If we run:
segmentTSV(tsv)
…in the console, we get the array below. Note that things look a little weird here: suddenly all the alignment is gone and we see \t instead of an “actual” tab. That’s just how the console displays a tab character inside a string; the data itself hasn’t changed.
[
"Line\tSpkr\tStTime\tContent\tEnTime",
"1\tDCA_int_06\t0.5165\tWhat kind of games did you play as a child?\t2.1765",
"2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158",
"3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242",
"4\tDCA_int_06\t8.6242\t(pause 0.40)\t9.0252",
"5\tDCA_int_06\t9.0252\tCapture the Flag, Hide and Seek, games with bottle caps?\t12.3860",
"6\tDCA_int_06\t12.3860\t(pause 2.27)\t14.6521",
"7\tDCA_int_06\t14.6521\tWhat type of games did you play?\t16.2748",
"8\tDCA_se1_ag2_f_01\t16.4136\tBaseball (laughing),\t17.9290",
"9\tDCA_se1_ag2_f_01\t17.9290\t(pause 0.73)\t18.6611",
"10\tDCA_se1_ag2_f_01\t18.6611\tmarbles, you know.\t19.9713",
"11\tDCA_se1_ag2_f_01\t19.9713\t(pause 0.81)\t20.7827",
"12\tDCA_se1_ag2_f_01\t20.7827\tUh,\t21.4029",
"13\tDCA_se1_ag2_f_01\t21.4029\t(pause 2.87)\t24.2694",
"14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408",
"15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
]
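The next step in the plan, TOKENIZE, can be sketched in the same style. The name tokenizeLine is my own placeholder rather than something defined above, and the two input lines are copied from the output shown here:

```javascript
// tokenizeLine is a hypothetical name for the TOKENIZE step.
const tokenizeLine = line => line.split("\t");

// Two lines as they come out of the SEGMENT step.
const lines = [
  "Line\tSpkr\tStTime\tContent\tEnTime",
  "15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
];

// Apply the tokenizer to every line: an array of arrays of tokens.
const tokenizedLines = lines.map(tokenizeLine);

console.log(tokenizedLines[0].length); // 5
console.log(tokenizedLines[1][3]);     // Yeah.
```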
Now, the first line is special, right? It’s like the row of column headers in a spreadsheet: it holds the names of the data, not the values. So we need to set it aside and use it to label the tokens in every other line:
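let parseTSV = tsv => tsv
  .split('\n')                      // SEGMENT: one string per line
  .map(line => line.split('\t'))    // TOKENIZE: one array of tokens per line
  .map((tokens, i, lines) =>        // ALIGN: lines[0] holds the header tokens
    lines[0].map((header, j) => [header, tokens[j]])
  )
  .slice(1)                         // ISOLATE: drop the header line, which got paired with itself
  .map(pairs => Object.fromEntries(pairs)) // OBJECT-IZE: one object per content line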
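For comparison, here is a sketch of the same plan assembled from small named functions, one per step. Every helper name other than parseTSV is hypothetical, and the sample string is a cut-down version of the transcript with only three columns:

```javascript
// Helper names are hypothetical; each one mirrors a step in the plan.
const segment = tsv => tsv.split("\n");                // SEGMENT
const tokenize = line => line.split("\t");             // TOKENIZE
const align = (headers, tokens) =>
  headers.map((header, j) => [header, tokens[j]]);     // ALIGN
const objectize = pairs => Object.fromEntries(pairs);  // OBJECT-IZE

const parseTSV = tsv => {
  // ISOLATE HEADER LINE: the first tokenized line is the headers, the rest is content.
  const [headers, ...rows] = segment(tsv).map(tokenize);
  return rows.map(tokens => objectize(align(headers, tokens)));
};

// A small sample with explicit \t so the tabs are visible.
const sample =
  "Line\tSpkr\tContent\n14\tDCA_se1_ag2_f_01\tMay I?\n15\tDCA_int_06\tYeah.";
const parsed = parseTSV(sample);

console.log(parsed.length);     // 2
console.log(parsed[1].Spkr);    // DCA_int_06
console.log(parsed[0].Content); // May I?
```

Everything comes back as a string, so a column like StTime would still need something like Number() before you could do arithmetic with it. A file that ends with a newline would also produce one extra, mostly empty object, which is one more reason to reach for a real library.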