How to convert a CSV (or TSV) file into JSON

First off, don’t do it the way I’m about to explain to you. Use a real library like PapaParse. There, full disclosure.

Okay, now for the way I was about to explain. Because what fun is using a library?

There are two kinds of “delimited text files” in common use: .csv for “comma-separated values” and .tsv for “tab-separated values”. The difference might seem trivial, but I think there’s a good argument that tab-separated is a better choice: ordinary text is full of commas, but it almost never contains a tab.
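Here’s a quick sketch of why the delimiter matters, using one row from the transcript we’re about to work with. The variable names are made up for illustration, and the comma version is only my guess at what the row would look like if someone naively wrote it out with commas and no quoting (real CSV files get around this with quoting rules, which is part of why you want a library).

```javascript
// One row of data, delimited with tabs.
const tabLine = "3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242";

// The same row with every tab swapped for a comma (an assumed, naive CSV).
const commaLine = tabLine.split("\t").join(",");

// Splitting on tabs gives the five fields we expect.
const tabFields = tabLine.split("\t");
console.log(tabFields.length); // 5

// Splitting on commas also chops the Content field apart at every comma.
const commaFields = commaLine.split(",");
console.log(commaFields.length); // 11
```

Five columns in, eleven “columns” out. The tab version has no such problem, because nobody says a tab out loud in an interview.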

Let’s assume that you have read your file into a string. It might look like this:

Line	Spkr	StTime	Content	EnTime
1	DCA_int_06	0.5165	What kind of games did you play as a child?	2.1765
2	DCA_int_06	2.1765	(pause 1.74)	3.9158
3	DCA_int_06	3.9158	Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,	8.6242
4	DCA_int_06	8.6242	(pause 0.40)	9.0252
5	DCA_int_06	9.0252	Capture the Flag, Hide and Seek, games with bottle caps?	12.3860
6	DCA_int_06	12.3860	(pause 2.27)	14.6521
7	DCA_int_06	14.6521	What type of games did you play?	16.2748
8	DCA_se1_ag2_f_01	16.4136	Baseball (laughing),	17.9290
9	DCA_se1_ag2_f_01	17.9290	(pause 0.73)	18.6611
10	DCA_se1_ag2_f_01	18.6611	marbles, you know.	19.9713
11	DCA_se1_ag2_f_01	19.9713	(pause 0.81)	20.7827
12	DCA_se1_ag2_f_01	20.7827	Uh,	21.4029
13	DCA_se1_ag2_f_01	21.4029	(pause 2.87)	24.2694
14	DCA_se1_ag2_f_01	24.2694	May I?	24.9408
15	DCA_int_06	25.9713	Yeah.	26.4283

Kendall, Tyler and Charlie Farrington. 2020. The Corpus of Regional African American Language. Version 2020.05. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal (File: http://lingtools.uoregon.edu/coraal/explorer/display.php?what=DCA_se1_ag2_f_01_1.txt&audio=1)

You can pretty much tell just by looking at it that it’s a TSV file, not a CSV file, since there aren’t commas all over the place.

Here’s our plan:

SEGMENT Split (on the newline character) into an array of lines.
TOKENIZE LINES For each line:
1. TOKENIZE Split (on the tab character) into an array of tokens.
2. ISOLATE HEADER LINE The first line of tokens is the header line; the rest of the lines are content. As we process each content line, we’re going to need to reference the header line, but we don’t want to treat the header line as data.
ITERATE THROUGH LINES For each line:
1. ALIGN Pair up each header with its corresponding token.
2. OBJECT-IZE Convert that array of pairs into a single object, with the header as key and line token as value.
RETURN Return the array of objects.
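The ALIGN and OBJECT-IZE steps are the least obvious ones, so here is a minimal sketch of them for a single line, with the tokens typed out by hand. The names headers, tokens, pairs, and row are just for illustration.

```javascript
// The header tokens and one content line's tokens, written out by hand.
const headers = ["Line", "Spkr", "StTime", "Content", "EnTime"];
const tokens = ["15", "DCA_int_06", "25.9713", "Yeah.", "26.4283"];

// ALIGN: pair each header with the token in the same position.
const pairs = headers.map((header, j) => [header, tokens[j]]);
// pairs starts with ["Line", "15"], ["Spkr", "DCA_int_06"], and so on.

// OBJECT-IZE: turn the array of [key, value] pairs into a single object.
const row = Object.fromEntries(pairs);
console.log(row.Spkr);    // "DCA_int_06"
console.log(row.Content); // "Yeah."
```

Note that every value is still a string, including the numbers. Converting "25.9713" into an actual number would be a separate step.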

To keep ourselves sane, let’s try to write little functions that carry out these tasks, and then combine them into something called parseTSV.

Let’s say we’re starting out with some code like this:


let tsv = `Line	Spkr	StTime	Content	EnTime
1	DCA_int_06	0.5165	What kind of games did you play as a child?	2.1765
2	DCA_int_06	2.1765	(pause 1.74)	3.9158
3	DCA_int_06	3.9158	Perhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,	8.6242
4	DCA_int_06	8.6242	(pause 0.40)	9.0252
5	DCA_int_06	9.0252	Capture the Flag, Hide and Seek, games with bottle caps?	12.3860
6	DCA_int_06	12.3860	(pause 2.27)	14.6521
7	DCA_int_06	14.6521	What type of games did you play?	16.2748
8	DCA_se1_ag2_f_01	16.4136	Baseball (laughing),	17.9290
9	DCA_se1_ag2_f_01	17.9290	(pause 0.73)	18.6611
10	DCA_se1_ag2_f_01	18.6611	marbles, you know.	19.9713
11	DCA_se1_ag2_f_01	19.9713	(pause 0.81)	20.7827
12	DCA_se1_ag2_f_01	20.7827	Uh,	21.4029
13	DCA_se1_ag2_f_01	21.4029	(pause 2.87)	24.2694
14	DCA_se1_ag2_f_01	24.2694	May I?	24.9408
15	DCA_int_06	25.9713	Yeah.	26.4283`

Here comes our first function.


let segmentTSV = tsv => tsv.split("\n") // step 1

You could also write this:


  let segmentTSV = tsv => {
    return tsv.split("\n") // step 1
  }

That’s exactly the same as the previous version. You have to use curly braces (and an explicit return) if your function body is more than a single expression, but ours is so simple we can make it pretty. 💕

So what do we end up with? If we run:

segmentTSV(tsv)

…in the console, we get an array of strings, one per line. Things look a little weird here: all the alignment is gone, and we see \t instead of an “actual” tab. That’s just how the console displays a tab character inside a string.

[
  "Line\tSpkr\tStTime\tContent\tEnTime",
  "1\tDCA_int_06\t0.5165\tWhat kind of games did you play as a child?\t2.1765",
  "2\tDCA_int_06\t2.1765\t(pause 1.74)\t3.9158",
  "3\tDCA_int_06\t3.9158\tPerhaps, did you play uh, marbles, Red Rover, Kick the Can, May I,\t8.6242",
  "4\tDCA_int_06\t8.6242\t(pause 0.40)\t9.0252",
  "5\tDCA_int_06\t9.0252\tCapture the Flag, Hide and Seek, games with bottle caps?\t12.3860",
  "6\tDCA_int_06\t12.3860\t(pause 2.27)\t14.6521",
  "7\tDCA_int_06\t14.6521\tWhat type of games did you play?\t16.2748",
  "8\tDCA_se1_ag2_f_01\t16.4136\tBaseball (laughing),\t17.9290",
  "9\tDCA_se1_ag2_f_01\t17.9290\t(pause 0.73)\t18.6611",
  "10\tDCA_se1_ag2_f_01\t18.6611\tmarbles, you know.\t19.9713",
  "11\tDCA_se1_ag2_f_01\t19.9713\t(pause 0.81)\t20.7827",
  "12\tDCA_se1_ag2_f_01\t20.7827\tUh,\t21.4029",
  "13\tDCA_se1_ag2_f_01\t21.4029\t(pause 2.87)\t24.2694",
  "14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408",
  "15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
]
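The next step in the plan is TOKENIZE: split each of those strings on the tab character. Here’s a small sketch of that, using just two of the lines above to keep things short. The name tokenizeLines is my own invention, in the spirit of segmentTSV.

```javascript
// Two of the lines from the array above.
const lines = [
  "Line\tSpkr\tStTime\tContent\tEnTime",
  "15\tDCA_int_06\t25.9713\tYeah.\t26.4283"
];

// TOKENIZE: split every line on the tab character.
// (tokenizeLines is a hypothetical helper name.)
const tokenizeLines = lines => lines.map(line => line.split("\t"));

const tokenized = tokenizeLines(lines);
console.log(tokenized);
// [
//   ["Line", "Spkr", "StTime", "Content", "EnTime"],
//   ["15", "DCA_int_06", "25.9713", "Yeah.", "26.4283"]
// ]
```

Now we have an array of arrays: one inner array per line, one string per column.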

Now, the first line is special, right? It’s like the row of column headers in a spreadsheet: it’s the names of data, not values of data. So we need to set it aside as the header line and treat the rest of the lines as content.


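let parseTSV = tsv => tsv
  .split('\n')                              // SEGMENT: one string per line
  .map(line => line.split('\t'))            // TOKENIZE: one array of tokens per line
  .map((tokens, i, lines) =>                // ALIGN: lines[0] is the header line
    lines[0].map((header, j) => [header, tokens[j]])
  )
  .slice(1)                                 // ISOLATE HEADER LINE: drop it from the data
  .map(pairs => Object.fromEntries(pairs))  // OBJECT-IZE: [key, value] pairs to object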
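For comparison, here is one way the “little functions” version of the plan might look, with one function per step. This is a sketch, not the definitive way to do it: only segmentTSV and parseTSV are names from above, the rest (tokenizeLines, isolateHeader, align, objectize) are ones I made up. It also assumes a clean file: no trailing blank line, no Windows line endings, and the same number of columns in every row.

```javascript
const segmentTSV = tsv => tsv.split("\n");                          // SEGMENT
const tokenizeLines = lines => lines.map(line => line.split("\t")); // TOKENIZE
const isolateHeader = ([header, ...rows]) => ({ header, rows });    // ISOLATE HEADER LINE
const align = (header, tokens) =>
  header.map((name, j) => [name, tokens[j]]);                       // ALIGN
const objectize = pairs => Object.fromEntries(pairs);               // OBJECT-IZE

const parseTSV = tsv => {
  const { header, rows } = isolateHeader(tokenizeLines(segmentTSV(tsv)));
  return rows.map(tokens => objectize(align(header, tokens)));      // RETURN
};

// A shortened version of the sample data, to keep the output readable.
const sample =
  "Line\tSpkr\tStTime\tContent\tEnTime\n" +
  "14\tDCA_se1_ag2_f_01\t24.2694\tMay I?\t24.9408\n" +
  "15\tDCA_int_06\t25.9713\tYeah.\t26.4283";

const parsed = parseTSV(sample);

// JSON.stringify turns the array of objects into an actual JSON string.
console.log(JSON.stringify(parsed, null, 2));
```

The result is an array of two objects, each with the keys Line, Spkr, StTime, Content, and EnTime. It’s longer than the chained version, but each piece can be tested in the console on its own.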