What is the best way to run Map/Reduce stuff on data from Mongo?

mongodb


Amazon S3 provides a utility called S3DistCp for getting data in and out of S3. It is commonly used with Amazon's EMR product when you don't want to host your own cluster or dedicate instances to storing data: S3 holds all the data, and EMR reads from and writes to S3.

However, transferring 100GB takes time, and if you plan on doing this more than once (i.e. it's not a one-off batch job), that transfer will be a significant bottleneck in your processing, especially if the data is expected to grow.

It looks like you may not need to use S3 at all. MongoDB has an adapter (mongo-hadoop) that lets you run Hadoop map/reduce jobs directly on top of your MongoDB data: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb

This looks appealing since it lets you implement the map/reduce jobs in Python, JavaScript, or Ruby via Hadoop streaming.
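As a rough sketch of what the Python route looks like (the pymongo_hadoop helper module with its BSONMapper/BSONReducer classes ships with the streaming support described in that post; the 'country' field and the counting logic are just placeholder assumptions about your documents):

    # mapper.py -- receives BSON documents from MongoDB via Hadoop streaming
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        # Emit one {key, count} pair per input document.
        for doc in documents:
            yield {'_id': doc['country'], 'count': 1}

    BSONMapper(mapper)

    # reducer.py -- sums the counts emitted for each key
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        return {'_id': key, 'count': sum(v['count'] for v in values)}

    BSONReducer(reducer)

You would then point the Hadoop streaming job at these two scripts, and the connector takes care of reading from and writing back to MongoDB as BSON.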

I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.

UPDATE: There is an example of using map/reduce with Mongo here.
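Since that link isn't reproduced above, here is a minimal sketch of MongoDB's built-in mapReduce driven from PyMongo (the database name, collection name, and 'country' field are placeholders; the map and reduce functions are JavaScript executed inside mongod):

    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient('mongodb://localhost:27017')
    db = client.mydb

    # Map and reduce are JavaScript snippets run server-side by mongod.
    map_fn = Code("function () { emit(this.country, 1); }")
    reduce_fn = Code("function (key, values) { return Array.sum(values); }")

    # Results are written to the 'country_counts' collection.
    result = db.events.map_reduce(map_fn, reduce_fn, 'country_counts')

    for doc in result.find():
        print(doc)

This runs entirely inside MongoDB, which is convenient for smaller aggregations, while the mongo-hadoop route above lets you scale the work out across a Hadoop cluster.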