Processing images using hadoop
For processing images on Hadoop the best way to organize the computations would be:

  1. Store images in a sequence file. Key - image name or its ID, Value - image binary data. This way you will have a single file with all the images you need to process. If images are added to your system dynamically, consider aggregating them into daily sequence files. I don't think you should use any compression for this sequence file, as general compression algorithms do not work well with images
  2. Process the images. Here you have a number of options to choose from. The first is Hadoop MapReduce with a program written in Java: with Java you can read the sequence file and directly obtain the "Value" from it on each map step, where the "Value" is the binary file data, and run any processing logic on it. The second option is Hadoop Streaming. It has a limitation that all the data goes to the stdin of your application and the result is read from its stdout, but you can overcome this by writing your own InputFormat in Java that serializes the image binary data from the sequence file as a Base64 string and passes it to your generic application. The third option is to use Spark to process this data, but again you are limited in the choice of programming languages: Scala, Java or Python.
  3. Hadoop was developed to simplify batch processing over large amounts of data. Spark is essentially similar - it is a batch tool. This means you cannot get any result before all the data is processed. Spark Streaming is a slightly different case - there you work with micro batches of 1-10 seconds and process each of them separately, so in general you can make it work for your case.
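To make steps 1 and 2 concrete, here is a minimal Java sketch of both the sequence-file packing and a mapper that receives the image bytes. It assumes `hadoop-client` is on the classpath; the output path `images.seq` and input directory `/data/images` are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ImagePacker {

    // Step 1: pack a directory of images into one sequence file.
    // Key = image file name, Value = raw image bytes, no compression
    // (JPEG/PNG data does not compress further).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("images.seq"); // placeholder output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE))) {
            for (File img : new File("/data/images").listFiles()) { // placeholder dir
                byte[] bytes = Files.readAllBytes(img.toPath());
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            }
        }
    }

    // Step 2, option 1: with SequenceFileInputFormat, each map() call
    // receives one (name, bytes) pair, i.e. one complete image.
    public static class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text name, BytesWritable data, Context ctx)
                throws IOException, InterruptedException {
            byte[] image = data.copyBytes();              // raw image bytes
            String result = String.valueOf(image.length); // placeholder "processing"
            ctx.write(name, new Text(result));
        }
    }
}
```

Set `job.setInputFormatClass(SequenceFileInputFormat.class)` in the driver so the mapper is fed the key/value pairs directly.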

I don't know your complete case, but one possible solution is Kafka + Spark Streaming: your application puts the images in a binary format onto a Kafka queue, while Spark consumes and processes them in micro batches on the cluster, updating the users through some third component (at the very least by putting the image processing status back into Kafka for another application to pick up).
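A sketch of the consuming side of that pipeline, using the `spark-streaming-kafka-0-10` integration. The broker address, topic name `images`, and group id are placeholders; the detection logic itself is left as a comment.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ImageStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("image-stream");
        // 5-second micro batches, as discussed above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");          // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "image-processors");             // placeholder

        // The producer writes raw image bytes as the message value.
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
            KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(
                    Arrays.asList("images"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreach(record -> {
            byte[] image = record.value();
            // run your image processing here, then publish a status
            // message back to Kafka for the notifying component
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```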

But in general, the information you provided is not complete enough to recommend a good architecture for your specific case.


As 0x0FFF says in another answer, the question does not provide enough details to recommend a proper architecture. Though this question is old, I'm adding the research I did on this topic in case it helps anyone with theirs.

Spark is a great way of doing processing on distributed systems, but it doesn't have a strong community working on OpenCV. Storm is another free and open-source distributed real-time computation system from Apache. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

StormCV is an extension of Apache Storm specifically designed to support the development of distributed computer-vision pipelines. StormCV enables the use of Storm for video processing by adding computer vision (CV) specific operations and a data model. The platform uses OpenCV for most of its CV operations, and it is relatively easy to use this library for other functions.

There are a few examples of using Storm with OpenCV on the project's official GitHub page. You might want to look at this face detection example and adapt it to do human detection: https://github.com/sensorstorm/StormCV/blob/master/stormcv-examples/src/nl/tno/stormcv/example/E2_FacedetectionTopology.java.


You can actually build your custom logic using the Apache Storm framework. You can easily integrate any functionality of a specific computer-vision library and distribute it across the bolts of this framework. Besides, Storm has a great extension called the DRPC server, which allows you to consume your logic as simple RPC calls. You can find a simple example of how to process video files through Storm using OpenCV face detection in my article Consuming OpenCV through Hadoop Storm DRPC Server from .NET.
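A minimal sketch of the DRPC pattern in plain Java, assuming `storm-core` is on the classpath. The function name `detect`, the bolt, and its dummy result are placeholders; a real bolt would invoke the CV library where the comment indicates.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.LocalDRPC;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DetectTopology {

    // In a linear DRPC topology, field 0 is the request id and must be
    // passed through unchanged; field 1 carries the DRPC call's argument.
    public static class DetectBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            Object requestId = input.getValue(0);
            String frameRef = input.getString(1); // e.g. a path/URL to a frame
            String result = "no-faces";           // placeholder: call OpenCV here
            collector.emit(new Values(requestId, result));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("detect");
        builder.addBolt(new DetectBolt(), 4); // 4 parallel bolt instances

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("detect-demo", new Config(),
                builder.createLocalTopology(drpc));

        // A client consumes the distributed logic as a plain RPC call:
        System.out.println(drpc.execute("detect", "frame-001.jpg"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```

In production you would submit with `builder.createRemoteTopology()` and call a DRPC server over the network instead of `LocalDRPC`.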