Memory-efficient CSV transformation in Node.js

Date: Fri Jun 16 2017
Those of us who consume, edit, modify, and publish CSV files must from time to time transform a CSV file. Maybe you need to delete columns, rearrange columns, add columns, rename columns, or compute some values, taking one CSV file and producing another. In my case, I have a raw CSV file with no column headers that's organized in a way which makes sense for one team in our company, but we need the same data organized a different way, with different column names and containing only selected fields. The following script is what came from that need, and I managed to write it in a fairly generic way. It not only extracts and renames columns, but with a bit of coding it could perform other transformations.

As such, this script performs a map operation: it takes an input CSV and produces an output CSV with the same number of rows. The row contents are of course different, but the row count is the same for input and output. A reduce or filter operation, either of which changes the number of rows, would be difficult to perform with this script as written.

The script relies on the CSV Suite for Node.js: http://csv.adaltas.com/
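
These are the packages the script requires, and they can be installed from npm in the usual way:

npm install csv-parse csv-stringify stream-transform fs-extra-promise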


/*
 * This script demonstrates a simple CSV transformation that's
 * formulated to use minimal memory.  The processing is done via
 * piping using the Node.js Streams interface.
 *
 * This transformation is to extract selected columns from the
 * input file, then write to another file using different column names.
 *
 * The `transform` section could make other changes such as adding
 * columns together. 
 */
'use strict';

const parse     = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');
const fs        = require('fs-extra-promise');

const infname   = process.argv[2];
const outfname  = process.argv[3];

const inputFields = [
    // List field names for input file
];

const extractFields = [
    // List field names to extract from input
];

const outputFields = [
    // List field names in the output file
];

fs.createReadStream(infname)
.pipe(parse({
    delimiter: ',',
    relax_column_count: true,
    skip_empty_lines: true,
    // Use columns: true if the input has column headers.
    // Otherwise list the input field names in the array above.
    columns: inputFields
}))
.pipe(transform(function(data) {
    // This sample transformation selects out fields
    // that will make it through to the output.  Simply
    // list the field names in the array above.
    return extractFields.map(nm => data[nm]);
}))
.pipe(stringify({
    delimiter: ',',
    header: true,
    // This names the resulting columns for the output file.
    columns: outputFields
}))
.pipe(fs.createWriteStream(outfname));

The input file name and output file name are given on the command line. It's best if the input file has CSV headers, but as written the script does not require them. Column headers are a feature not used in all CSV files: in some cases the first row of a CSV file gives a name for each column. Such a file is more useful, since documentation of the fields is carried in the file itself. But obviously not everyone does this, and some software might choke on a header row.
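
For example, assuming the script is saved as transform.js, and with hypothetical file names, you'd run it like so:

node transform.js input-raw.csv output-renamed.csv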

In this script, if your input file has column names, then make a change in the first stage:


.pipe(parse({
    delimiter: ',',
    columns: true
}))
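
With that setting, the parser takes the column names from the first row of the file, and each subsequent row arrives in the transform stage as an object keyed by those names. As a sketch, with hypothetical column names:

// Hypothetical input file with a header row:
//   name,email,phone
//   Jane Doe,jane@example.com,555-1212
//
// With columns: true, each data row arrives in the
// transform stage as an object keyed by the header names:
//   { name: 'Jane Doe', email: 'jane@example.com', phone: '555-1212' }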

Otherwise, list the column names in the inputFields array.
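
For instance, for a headerless three-column input file, it might look like this (the field names here are hypothetical):

const inputFields = [
    // Names to assign, in order, to the columns of a headerless input
    'name', 'email', 'phone'
];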

The second stage is the transformation. The algorithm shown here simply extracts the fields named in the extractFields array. You can rename columns, reorder columns, and eliminate columns this way.

Other transformations can be performed. This function is called once per row, and its return value becomes the new value for that row. Hence the transformation can neither add nor delete rows, meaning the output file has the same number of rows as the input.
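
For instance, the transform function could compute a value from other fields, in line with the note in the script about adding columns together. Here is a minimal sketch, assuming hypothetical quantity and unitPrice fields in the input:

.pipe(transform(function(data) {
    // Compute a derived value from two input fields, then
    // append it to the extracted fields for this row.
    const total = Number(data.quantity) * Number(data.unitPrice);
    return extractFields.map(nm => data[nm]).concat([ total ]);
}))

The outputFields array would then need one extra column name to account for the computed column.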

The last stage outputs the CSV using the column names you specify in outputFields.
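
Taken together, extractFields and outputFields are what rename the columns. A sketch with hypothetical names:

const extractFields = [
    // Fields to pull from the input, in the desired output order
    'email', 'name'
];

const outputFields = [
    // Header names for those same columns in the output file
    'Email Address', 'Full Name'
];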

Since the process uses pipes, it is extremely memory-efficient. In an earlier version of this script I used a variant of the CSV parser which read the entire CSV into an array before processing could begin. For a large CSV file the Node.js process ran out of memory, and I had to learn how to adjust the Node.js heap size. With pipes, the memory footprint at any one time is minimal.
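
For comparison, that earlier all-in-memory approach looked roughly like the following sketch. It buffers the entire file, and then the entire parsed array of rows, before any processing can begin, so memory use grows with the size of the file:

const fs = require('fs');
const parse = require('csv-parse');

fs.readFile(infname, 'utf8', (err, text) => {
    if (err) throw err;
    // Parses the whole file into one big array of rows.
    parse(text, { delimiter: ',' }, (err, rows) => {
        if (err) throw err;
        // ... transform and stringify the full rows array ...
    });
});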