
Using Amazon MapReduce/Hadoop for Image Processing


There are several problems with your task.

Hadoop does not natively process images, as you've seen. But you can export all the file names and paths to a text file and call some map function on it. So calling ImageMagick on the files on local disk should not be a big deal.
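For example, building that input file is just a matter of listing the image paths and pushing the list (not the images themselves) into HDFS as the job input. The directories and file names below are placeholders:

    # List the local images (hypothetical path) and make the list the job input.
    find /mnt/images -type f \( -name '*.jpg' -o -name '*.png' \) > image_paths.txt
    hadoop fs -put image_paths.txt /jobs/image_paths.txt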

But how do you deal with data locality?

You can't run ImageMagick on files in HDFS (there is only the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling. So, for example, a map task can be scheduled to a host where the image does not exist.

Sure, you could simply use a single machine and a single task, but then you get no improvement; you'd just have a bunch of overhead.

Also, there is a memory problem when you shell out from a Java task; I wrote a blog post about it [1].

"and should be able to be done using Bash"

That is the next problem: you'd have to write at least the map task yourself. You need a ProcessBuilder to call ImageMagick with a specific path and function.

"I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action to each of the files, and then writing out the new file's output as its own file."

Guess why? :D Hadoop is not the right thing for this task.

So basically I would recommend manually splitting your images across multiple EC2 hosts and running a bash script over them. It is less stress and is faster. To parallelize on a single host, split your files into one folder per core and run the bash script over each folder. This should utilize your machines quite well, and better than Hadoop ever could.
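A minimal sketch of that per-core approach, assuming the images already sit on local disk and GNU coreutils plus ImageMagick are installed; the paths and the convert arguments are placeholders:

    #!/usr/bin/env bash
    # Split the image list into one chunk per core and process each chunk in parallel.
    CORES=$(nproc)
    find /mnt/images -name '*.jpg' > all_images.txt   # hypothetical image location
    split -n l/$CORES all_images.txt chunk_
    for chunk in chunk_*; do
        (
            while read -r img; do
                # Placeholder ImageMagick call -- substitute your own conversion.
                convert "$img" -resize 50% "${img%.jpg}_small.jpg"
            done < "$chunk"
        ) &
    done
    wait   # block until all per-core workers have finished

GNU parallel or xargs -P would do the same with less scripting; the point is simply to keep every core busy without any Hadoop machinery.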

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html


I would think you could look at the example in "Hadoop: The Definitive Guide", 3rd Edition. Appendix C outlines a way, in bash, to get a file (from HDFS), unzip it, create a folder, create a new file from the files in the unzipped folder, and then put that file into another HDFS location.

I customized this script myself so that the initial HDFS get is a curl call to a webserver hosting the input files I need - I didn't want to put all the files in HDFS. If your files are already in HDFS, you can use the commented-out line instead. The HDFS get or the curl will ensure the file is available locally for the task. There's a lot of network overhead in this.

There's no need for a reduce task.

The input file is a list of the URLs of the files to download and convert.

    #!/usr/bin/env bash
    # NLineInputFormat gives a single line: key is offset, value is Isotropic URL
    read offset isofile
    # Retrieve file from Isotropic server to local disk
    echo "reporter:status:Retrieving $isofile" >&2
    target=`echo $isofile | awk '{split($0,a,"/");print a[5] a[6]}'`
    filename=$target.tar.bz2
    #$HADOOP_INSTALL/bin/hadoop fs -get $isofile ./$filename
    curl $isofile -o $filename
    # Un-bzip and un-tar the local file
    mkdir -p $target
    echo "reporter:status:Un-tarring $filename to $target" >&2
    tar jxf $filename -C $target
    # Take the file and do what you want with it.
    echo "reporter:status:Converting $target" >&2
    convert .... $target/$filename $target.all
    # Put gzipped version into HDFS
    echo "reporter:status:Gzipping $target and putting in HDFS" >&2
    gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
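For context, a script like this is typically launched with Hadoop Streaming, using NLineInputFormat so each map task receives exactly one URL, and with reducers disabled. The jar location, input/output paths, and script name below are assumptions, not taken from the original setup:

    hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
        -input isotropic_urls.txt \
        -output image_conversion_out \
        -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
        -mapper convert_image.sh \
        -numReduceTasks 0 \
        -file convert_image.sh

The reporter:status lines the script writes to stderr are how a Streaming task reports progress, which keeps long-running conversions from being counted as unresponsive and killed.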

The New York Times processed 4 TB of raw image data into PDFs in 24 hours using Hadoop. It sounds like they took a similar approach: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. They used the Java API, but the rest is the same: get the file locally, process it, and then stick it back into HDFS/S3.


You can take a look at CombineFileInputFormat in Hadoop, which can implicitly combine multiple files into a single split, based on the files' sizes and locations.

But I'm not sure how you are going to process the 100 MB-500 MB images, as each one is quite big, in fact bigger than Hadoop's default split size. Maybe you can try a different approach and split one image into several parts.

Anyway, good luck.