MapReduce or Spark for Batch processing on Hadoop?


Spark is an order of magnitude faster than MapReduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVMs.
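As a rough illustration, here is a minimal Scala sketch of the kind of iterative job that benefits from this caching; the input path, the parsing and the number of iterations are made-up placeholders:

// The same parsed input is scanned on every iteration, so caching it
// avoids re-reading and re-parsing it from HDFS each time.
val points = sc.textFile("hdfs:///path/to/input")   // placeholder path
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                          // keep the parsed data in executor memory

var total = 0.0
for (i <- 1 to 10) {                                // 10 iterations, chosen arbitrarily
  // each pass reuses the cached RDD instead of going back to disk
  total = points.map(p => p.sum).reduce(_ + _)
}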

Spark 1.1 brought a new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on Netty instead of using the block manager for sending shuffle data) and a new external shuffle service. With these, Spark performed the fastest petabyte sort (on 190 nodes with 46 TB of RAM) as well as a terabyte sort, breaking Hadoop's old record.

Spark can easily handle datasets that are orders of magnitude larger than the cluster's aggregate memory. So my thought is that Spark is heading in the right direction and will eventually get even better.
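As a hedged sketch of that point, Spark lets you persist a dataset with a storage level that spills partitions which do not fit in memory to local disk; the path below is a placeholder:

import org.apache.spark.storage.StorageLevel

// Assumed to be larger than the cluster's aggregate memory.
val big = sc.textFile("hdfs:///path/to/very/large/dataset")   // placeholder path

// MEMORY_AND_DISK keeps as many partitions in memory as possible
// and spills the rest to local disk instead of failing.
big.persist(StorageLevel.MEMORY_AND_DISK)

println(big.count())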

For reference, this blog post explains how Databricks performed the petabyte sort.


I'm assuming when you say Hadoop you mean HDFS.

There are a number of benefits to using Spark over Hadoop MR.

  1. Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (which need to perform a number of iterations over the same dataset) it can be a few orders of magnitude faster, because MapReduce writes the output of each stage to HDFS.

    1.1. Spark can cache (depending on the available memory) these intermediate results and therefore reduce the latency due to disk IO.

    1.2. Spark operations are lazy. This means Spark can perform certain optimizations before it starts processing the data, because it can reorder operations that have not executed yet (a small sketch of this appears right after this list).

    1.3. Spark keeps a lineage of operations and, in case of failure, recreates the lost partial state based on this lineage.

  2. Unified Ecosystem: Spark provides a unified programming model for various types of analysis - batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx) and SQL queries (Spark SQL).

  3. Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.

  4. Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch, to name a few. I'm sure support for many other popular data stores is coming soon. This essentially means that if you want to adopt Spark, you don't have to move your data around.
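To make point 1.2 concrete, here is a minimal sketch (the log path and filter condition are invented for illustration) showing that transformations only record a plan and nothing runs until an action is called:

// Nothing executes here: filter and map are lazy transformations,
// Spark only records the lineage of operations.
val errors = sc.textFile("hdfs:///path/to/app/logs")   // placeholder path
  .filter(line => line.contains("ERROR"))
  .map(line => line.toLowerCase)

// The first action triggers the actual computation, which gives Spark
// the chance to pipeline the transformations together in one pass.
val numErrors = errors.count()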

For example, here is what the code for word count looks like in Spark (Scala).

val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

I'm sure you have to write a few more lines if you are using standard Hadoop MR.

Here are some common misconceptions about Spark.

  1. Spark is just an in-memory cluster computing framework. This is not true. Spark excels when your data fits in memory, because memory access latency is lower, but you can make it work even when your dataset doesn't completely fit in memory.

  2. You need to learn Scala to use Spark. Spark is written in Scala and runs on the JVM, but it provides most of the common APIs in Java and Python as well, so you can easily get started with Spark without knowing Scala.

  3. Spark does not scale: it is for small datasets (GBs) only and doesn't scale to a large number of machines or TBs of data. This is also not true; it has been used successfully to sort petabytes of data.

Finally, if you do not have a legacy codebase in Hadoop MR, it makes perfect sense to adopt Spark; all major Hadoop vendors are moving towards Spark, and for good reason.


Apache Spark runs in memory, making it much faster than MapReduce. Spark started as a research project at Berkeley.

MapReduce uses disk extensively (for external sorts, shuffles, ...).

Since the input size for a Hadoop job is typically on the order of terabytes, Spark's memory requirements will be higher than those of a traditional Hadoop setup.

So basically, for smaller jobs, or when your cluster has plenty of memory, Spark wins. In practice, that is not the case for most clusters.
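As a hedged sketch of what tuning for that memory pressure looks like, executor memory can be raised when building a job; the app name and the 4g value below are arbitrary examples, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory controls how much heap each executor gets;
// the value here is only an illustrative placeholder.
val conf = new SparkConf()
  .setAppName("batch-job-example")          // hypothetical app name
  .set("spark.executor.memory", "4g")

val sc = new SparkContext(conf)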

Refer to spark.apache.org for more details on Spark.