
Amazon MapReduce best practices for logs analysis


That's a very open-ended question, but here are some thoughts you could consider:

  • Using Amazon SQS: this is a distributed queue and is very useful for workflow management. You can have one process that writes a message to the queue as soon as a log is available, and another that reads from the queue, processes the log described in the message, and deletes the message once processing is done. This keeps track of which logs still need work (note that SQS delivery is at-least-once, so processing should be idempotent); see the sketch after this list.
  • Apache Flume, as you mentioned, is very useful for log aggregation. It is worth considering even if you don't need real-time processing, since it at least gives you a standardized aggregation process.
  • Amazon recently released Simple Workflow (SWF). I have only just started looking into it, but it sounds promising for managing every step of your data pipeline.

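For the SQS point above, here is a minimal sketch of the producer/consumer pattern. It uses Python with boto3 and a hypothetical queue URL, neither of which comes from the original answer; treat it as an illustration of the flow rather than a finished implementation:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
# Hypothetical queue URL for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-jobs"


def enqueue_log(log_location):
    """Producer: announce that a new log file is ready for processing."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=log_location)


def consume_logs(process):
    """Consumer: read one message at a time, process the referenced log,
    and delete the message only after processing succeeds, so that a
    crashed worker lets the message reappear and be retried."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])  # e.g. kick off an EMR step on this log
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```

Since the delete only happens after `process()` returns, a failure leaves the message in the queue to be picked up again once its visibility timeout expires.
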
Hope that gives you some clues.