Hadoop for Large Image Processing


Is your data in HDFS? What exactly do you expect to leverage from Hadoop/Spark? It seems to me that all you need is a queue of filenames and a bunch of machines to work through it.
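
To make the "queue of filenames" idea concrete, here is a minimal worker sketch, assuming the filenames are S3 object keys sitting in an SQS queue and that the existing algorithm is wrapped in an executable; the queue URL, bucket name, and the process_image binary are all placeholders:

    # Minimal worker sketch: pull one filename at a time from an SQS queue,
    # download that object from S3, and run the existing executable on it.
    import subprocess

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tif-jobs"  # placeholder
    BUCKET = "my-tif-bucket"  # placeholder

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained, worker exits

        msg = messages[0]
        key = msg["Body"]  # each message body is one S3 object key
        local_path = "/tmp/input.tif"
        s3.download_file(BUCKET, key, local_path)

        # Run the existing image-processing executable on the downloaded file.
        subprocess.run(["./process_image", local_path], check=True)

        # Delete the message only after processing succeeded, so failed files
        # reappear on the queue after the visibility timeout.
        sqs.delete_message(QueueUrL=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]) if False else sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Each machine just runs this loop; adding machines makes the queue drain faster, with no cluster framework involved.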

You can pack your app into AWS Lambda (see Running Arbitrary Executables in AWS Lambda) and trigger an event for each file. Or you can pack your app into a Docker container, start up a bunch of them in ECS, and let them loose on a queue of filenames (or URLs, or S3 object keys) to process.
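
For the Lambda route, a rough sketch of what the handler could look like, assuming the .tif files are uploaded to S3 and the bucket's object-created notifications trigger the function; process_image stands in for the executable bundled into the deployment package:

    # Sketch of a Lambda handler fired by S3 object-created events: copy the
    # uploaded .tif to /tmp and invoke the bundled executable on it.
    import os
    import subprocess

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:  # one record per uploaded object
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            local_path = os.path.join("/tmp", os.path.basename(key))
            s3.download_file(bucket, key, local_path)
            # process_image is the (hypothetical) executable shipped inside
            # the deployment package alongside this handler.
            subprocess.run(["./process_image", local_path], check=True)

Bear in mind that a ~1GB file per invocation means raising the function's ephemeral /tmp storage above the default and staying inside the execution-time limit, which is one reason the Docker/ECS option can be more comfortable.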

I think Hadoop/Spark is overkill, especially since they're quite bad at handling 1GB splits as input, and your processing is not a map/reduce job (there are no key-value pairs for reducers to merge). If you must, you could adapt your C++ app to read from stdin and use Hadoop Streaming.
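
If you do go the Streaming route, one way to avoid feeding 1GB splits through Hadoop is to make the job's input a small text file listing one image path per line and have a thin wrapper mapper shell out to the C++ binary. A minimal sketch, assuming the binary is called process_image and each path is readable from the worker node (a mounted filesystem or a pre-fetched local copy):

    #!/usr/bin/env python3
    # Map-only Hadoop Streaming mapper sketch: the job's input is a list of
    # image paths, one per line, not the 50TB of image bytes themselves.
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        # process_image is the existing C++ executable (hypothetical name),
        # shipped to the nodes together with this script.
        result = subprocess.run(["./process_image", path],
                                capture_output=True, text=True)
        # Emit a simple status record so failures show up in the job output.
        print(f"{path}\t{result.returncode}")

You would ship both the script and the binary to the nodes (for example with the streaming job's -files option) and run it as a map-only job with zero reducers.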

Ultimately, the question is: where is the 50TB of data stored, and in what format? The solution depends a lot on the answer, since you want to bring the compute to where the data is and avoid transferring 50TB into AWS, or even uploading it into HDFS.


  • You have 50TB of ~1GB .tif files.
  • You want to run the same algorithm on each file.

One aspect of solving a problem in the MapReduce paradigm that most developers are not aware of is this:

If you do complex calculations on your data nodes, the system will limp.

A big reason why you mostly see simple, text-based examples is that those are the kinds of problems you can actually run on commodity hardware. In case you don't know, or have forgotten, I'd like to point out that:

The MapReduce programming paradigm is for running the kind of jobs that need scaling out rather than scaling up.


Some hints:

  • With data this large, it makes sense to take the computation to where the data is rather than bringing the data to the computation.
  • Running this job on commodity hardware is clearly a bad idea. You need machines with many cores - 16 or 32, perhaps.
  • After you have procured the required hardware, you should optimize your software to parallelize the algorithm wherever necessary/useful.
  • Your problem is definitely one that can benefit from scaling up. For files this large, and a large collection of them, increasing RAM and using faster processors is undoubtedly a sensible thing to do.
  • Lastly, if you are concerned about ingesting the input, you can read the images as raw binary (see the sketch after this list). This will limit your ability to work with the .tif format directly, and you may have to rework your processing algorithm.
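
To illustrate that last point, here is a minimal PySpark sketch of reading whole images as binary, assuming the files already sit in HDFS and process_image is a stand-in for your real algorithm; binaryFiles hands each complete .tif to a task as raw bytes, so nothing tries to split the file into lines:

    # Hedged sketch: read every .tif under a directory as one (path, bytes)
    # record and apply a placeholder processing function to each.
    from pyspark import SparkContext

    def process_image(path, data):
        # Placeholder for the real algorithm; data is the raw bytes of one .tif.
        return (path, len(data))

    sc = SparkContext(appName="tif-processing")

    # binaryFiles reads each whole file as a single (path, bytes) record.
    images = sc.binaryFiles("hdfs:///data/tifs/")
    results = images.map(lambda kv: process_image(kv[0], kv[1]))
    results.saveAsTextFile("hdfs:///data/tif-results")

Note that each task then holds an entire ~1GB file in memory, which is exactly why the earlier point about generous RAM per node matters.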