
large amount of data in many text files - how to process?


(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, tab as a separator is just as easy to handle), and of course it gets just about as much bandwidth in I/O operations as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores ;-).
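For instance, here is a minimal sketch of reading one tab-separated file into a numpy array with the standard csv module. The file name, the header row, and the assumption that all data columns are numeric are made up for illustration -- adapt to your actual layout:

    import csv
    import numpy as np

    def load_tsv(path):
        """Read one tab-separated file into a numpy array of floats.
        Assumes a header row and all-numeric data columns (illustrative only)."""
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            header = next(reader)                  # keep the column names
            rows = [[float(x) for x in row] for row in reader]
        return header, np.array(rows)

    header, data = load_tsv("measurements_001.txt")   # hypothetical file name
    print(data.shape, data.mean(axis=0))              # e.g. per-column means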


Ok, so just to be different, why not R?

  • You seem to know R, so you may get to working code quickly
  • 30 MB per file is not large on a standard workstation with a few GB of RAM
  • the read.csv() variant of read.table() can be very efficient if you specify the types of the columns via the colClasses argument: instead of guesstimating types for conversion, these will be handled efficiently
  • the bottleneck here is I/O from the disk, and that is the same for every language
  • R has multicore to set up parallel processing on machines with multiple cores (similar to Python's multiprocessing, it seems)
  • Should you want to exploit the 'embarrassingly parallel' structure of the problem, R has several packages that are well-suited to data-parallel problems: e.g. snow and foreach can each be deployed on just one machine, or on a set of networked machines (for comparison, the sketch after this list shows the same per-file pattern with Python's multiprocessing)
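Since the answers above draw the parallel to Python's multiprocessing, here is a minimal sketch of that one-worker-per-file, embarrassingly parallel pattern. The glob path and the per-file summary function are hypothetical placeholders for your real analysis:

    import glob
    from multiprocessing import Pool

    def summarize(path):
        """Per-file work: line and field counts here, as a stand-in for the real analysis."""
        lines = 0
        fields = 0
        with open(path) as f:
            for line in f:
                lines += 1
                fields += len(line.rstrip("\n").split("\t"))
        return path, lines, fields

    if __name__ == "__main__":
        paths = glob.glob("data/*.txt")    # hypothetical location of the ~30 MB files
        with Pool() as pool:               # one worker per core by default
            for path, lines, fields in pool.map(summarize, paths):
                print(path, lines, fields)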


Have a look at Disco. It is a lightweight distributed MapReduce engine, written in about 2,000 lines of Erlang, but specifically designed for Python development. It supports not only working on your data, but also storing and replicating it reliably. They've just released version 0.3, which includes an indexing and database layer.
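To give a sense of what a Disco job looks like, here is a sketch along the lines of the word-count example in Disco's tutorial. This assumes the 0.3-era Job/result_iterator API, and the input URL is just a placeholder; exact signatures may differ between versions:

    from disco.core import Job, result_iterator

    def fun_map(line, params):
        # emit (word, 1) for every word in a line of input
        for word in line.split():
            yield word, 1

    def fun_reduce(iter, params):
        # sum the counts for each word; import inside the function,
        # since map/reduce functions are shipped to the worker nodes
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == "__main__":
        inputs = ["http://example.com/some_text_file.txt"]   # placeholder input
        job = Job().run(input=inputs, map=fun_map, reduce=fun_reduce)
        for word, count in result_iterator(job.wait()):
            print(word, count)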