Parsing large xml files (1G+) in node.js


The most obvious, but not very helpful, answer is that it depends on the requirements.

In your case, however, it seems pretty straightforward: you need to load large chunks of data that may or may not fit into memory, do some simple processing, and write the results to a database. I think that alone is a good reason to externalise the CPU-heavy work into separate processes. So it would probably make more sense to first focus on which XML parser does the job for you, rather than on which Node wrapper to use for it.
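If you do go down the separate-process route, a minimal sketch could look roughly like the following; parse-xml.js is a hypothetical worker script that would do the actual stream parsing and database writes, reporting back over IPC:

import { fork } from 'child_process'

// parse-xml.js is a hypothetical worker script that streams one XML file
// into the database, keeping the main event loop free.
const worker = fork('./parse-xml.js')

// Tell the worker which file to process; it replies over the IPC channel.
worker.send({ file: '/path/to/huge.xml' })
worker.on('message', (msg) => console.log('worker reported:', msg))
worker.on('exit', (code) => console.log('worker exited with code', code))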

Obviously, any parser that requires the entire document to be loaded into memory before processing is not a valid option. You will need a streaming, SAX-style parser that supports that kind of sequential, event-driven processing.
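To illustrate what that sequential, event-driven style looks like in practice, here is a minimal sketch using the sax package (purely as an illustration, not necessarily the wrapper you should pick):

import { createReadStream } from 'fs'
import * as sax from 'sax'

// strict mode: the parser raises errors on malformed XML instead of guessing.
const saxStream = sax.createStream(true)

saxStream.on('opentag', (node) => {
  // Only the current tag is in memory here; the rest of the document
  // has not been read yet, so memory use stays flat.
  if (node.name === 'Person') console.log(node.attributes)
})

saxStream.on('error', (err) => console.error('parse error:', err))

createReadStream('/path/to/huge.xml').pipe(saxStream)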

This leaves you with a few options:

- Saxon
- Libxml
- Expat

Saxon seems to have the highest level of conformance to the recent W3C specs, so if schema validation and such is important, then it might be a good candidate. Otherwise, both Libxml and Expat seem to stack up pretty well performance-wise and come preinstalled on most operating systems.

There are Node wrappers available for all of these; the example below happens to use xml-stream, which is built on top of Expat.

My Node implementation would look something like this:

import XmlStream from 'xml-stream'
import { get } from 'http'
import { createWriteStream } from 'fs'

const databaseWriteStream = createWriteStream('/path/to/file.csv')

// http.get hands us the response stream; xml-stream consumes it
// incrementally, so the whole document is never held in memory.
get('http://external.path/to/xml', (xmlFileReadStream) => {
  const xmlParseStream = new XmlStream(xmlFileReadStream)

  // Fires once per fully parsed <Person> element.
  xmlParseStream.on('endElement: Person', ({ name, phone, age }) =>
    databaseWriteStream.write(`"${name}","${phone}","${age}"\n`))

  xmlParseStream.on('end', () => databaseWriteStream.end())
})

Of course, I have no idea what your database write stream would look like, so here I am just writing to a file; a rough sketch of one possibility follows.
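For completeness, a database sink could be a custom Writable that batches rows before handing them to whatever client library you use. Everything below, including the insertBatch stub, is hypothetical and would need to be swapped for your real client calls:

import { Writable } from 'stream'

// Stand-in for your database client's bulk insert; replace with the real call.
const insertBatch = async (rows) => console.log(`inserting ${rows.length} rows`)

// Collects incoming rows and flushes them in batches so the database
// is not hit once per parsed element.
class DatabaseWriteStream extends Writable {
  constructor(batchSize = 500) {
    super({ objectMode: true })
    this.batchSize = batchSize
    this.batch = []
  }

  _write(row, _encoding, callback) {
    this.batch.push(row)
    if (this.batch.length >= this.batchSize) this.flush(callback)
    else callback()
  }

  _final(callback) {
    this.flush(callback) // flush whatever is left when end() is called
  }

  flush(callback) {
    const rows = this.batch
    this.batch = []
    if (rows.length === 0) return callback()
    insertBatch(rows).then(() => callback(), callback)
  }
}

With something like that in place, new DatabaseWriteStream() could take the place of the createWriteStream call in the example above.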