Parsing huge logfiles in Node.js - read in line-by-line
I searched for a solution to parse very large files (several GB) line by line using a stream. None of the third-party libraries and examples I found suited my needs, since they either did not process the files line by line or read the entire file into memory.
The following solution can parse very large files, line by line, using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.
First, install the event-stream package:
npm install event-stream
Then:
var fs = require('fs'),
    es = require('event-stream');

var lineNr = 0;

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line) {
        // pause the readstream
        s.pause();

        lineNr += 1;

        // process line here and call s.resume() when ready
        // function below was for logging memory usage
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err) {
        console.log('Error while reading file.', err);
    })
    .on('end', function() {
        console.log('Read entire file.');
    })
);
Please let me know how it goes!
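If the per-line work is asynchronous (a database insert, an HTTP call, etc.), resume the stream from the completion callback rather than immediately. A minimal sketch of that pattern, where processLine(line, cb) is a hypothetical async worker you would supply yourself:

var fs = require('fs'),
    es = require('event-stream');

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line) {
        // stop reading until this line has been handled
        s.pause();

        // processLine is a hypothetical async worker (e.g. a DB insert)
        processLine(line, function(err) {
            if (err) console.log('Error processing line.', err);
            // pull the next chunk only after the work is done
            s.resume();
        });
    })
    .on('end', function() {
        console.log('Read entire file.');
    }));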
You can use the built-in readline module, see the docs here. I use stream to create a new output stream.
var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

var instream = fs.createReadStream('/path/to/file');
var outstream = new stream.Stream();
outstream.readable = true;
outstream.writable = true;

var rl = readline.createInterface({
    input: instream,
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    console.log(line);
    // Do your stuff ...
    // Then write to the output stream
    rl.write(line);
});
Large files will take some time to process. Do tell if it works.
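On recent Node.js versions the readline interface is also async iterable, so the same idea can be written without an output stream at all. A minimal sketch, using a placeholder file path:

const fs = require('fs');
const readline = require('readline');

async function processLineByLine() {
    const rl = readline.createInterface({
        input: fs.createReadStream('/path/to/file'),
        crlfDelay: Infinity // treat \r\n as a single line break
    });

    let lineNr = 0;
    for await (const line of rl) {
        lineNr += 1;
        // process each line here; the file is never loaded fully into memory
        console.log(lineNr + ': ' + line);
    }

    console.log('Read entire file.');
}

processLineByLine().catch(console.error);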
I really liked @gerard's answer, which actually deserves to be the accepted answer here. I made some improvements:
- Code is in a class (modular)
- Parsing is included
- Ability to resume is given to the caller, in case an asynchronous job is chained to reading the CSV, such as inserting into a DB or an HTTP request (see the sketch at the end of this answer)
- Reading in chunk/batch sizes that the user can declare. I took care of encoding in the stream too, in case you have files in a different encoding.
Here's the code:
'use strict'

const fs = require('fs'),
    util = require('util'),
    stream = require('stream'),
    es = require('event-stream'),
    parse = require('csv-parse'),
    iconv = require('iconv-lite');

class CSVReader {
  constructor(filename, batchSize, columns) {
    this.reader = fs.createReadStream(filename).pipe(iconv.decodeStream('utf8'))
    this.batchSize = batchSize || 1000
    this.lineNumber = 0
    this.data = []
    this.parseOptions = {delimiter: '\t', columns: true, escape: '/', relax: true}
  }

  read(callback) {
    this.reader
      .pipe(es.split())
      .pipe(es.mapSync(line => {
        ++this.lineNumber
        parse(line, this.parseOptions, (err, d) => {
          this.data.push(d[0])
        })
        if (this.lineNumber % this.batchSize === 0) {
          // pause until the caller has processed the batch and calls continue()
          this.reader.pause()
          callback(this.data)
        }
      })
      .on('error', function() {
        console.log('Error while reading file.')
      })
      .on('end', function() {
        console.log('Read entire file.')
      }))
  }

  continue() {
    this.data = []
    this.reader.resume()
  }
}

module.exports = CSVReader
So basically, here is how you will use it:
const CSVReader = require('./csv-reader') // adjust the path to wherever you saved the class

let reader = new CSVReader('path_to_file.csv')
reader.read(data => reader.continue())
I tested this with a 35 GB CSV file and it worked for me, which is why I chose to build on @gerard's answer. Feedback is welcome.
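For the asynchronous case mentioned in the list above (inserting each batch into a DB before reading on), a minimal sketch, where insertBatch(rows, cb) is a hypothetical async function you would provide:

const CSVReader = require('./csv-reader') // adjust the path to wherever you saved the class

let reader = new CSVReader('path_to_file.csv', 500)

reader.read(rows => {
    // insertBatch is a hypothetical async DB insert; only resume reading
    // once the batch has been persisted
    insertBatch(rows, err => {
        if (err) console.log('Error inserting batch.', err)
        reader.continue()
    })
})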