
How does Apache Spark achieve a 100x speedup over Hadoop MapReduce and in what scenarios?


  1. Spark does its data processing in memory (see the sketch just after this list).
  2. There are no intermediate files written between stages as in MapReduce, so disk I/O is eliminated or negligible.
  3. It does not run 100x faster in all scenarios, especially for shuffle-heavy work such as joins and sorting.
  4. Because it is memory-intensive, it can saturate a cluster quickly. You might be able to run one job 100x faster at a given point in time, but you will not be able to run as many concurrent jobs/applications as you can with the traditional Hadoop approach.
  5. RDDs and DataFrames are the data structures you use to process the data. RDDs are distributed, in-memory collections of records; DataFrames add named columns and schema metadata on top of that model. They are how data is represented in Spark.
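
As a rough sketch of points 1 and 2, here is a minimal Scala example (the input path and the number of passes are made up for illustration) of an iterative job that reads its input once, caches it in executor memory, and reuses it on every pass, where MapReduce would write intermediate files and re-read from HDFS each time:

```scala
import org.apache.spark.sql.SparkSession

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-demo")
      .master("local[*]") // local mode just for illustration
      .getOrCreate()

    // Hypothetical input; read from storage once.
    val numbers = spark.sparkContext
      .textFile("hdfs:///data/numbers.txt")
      .map(_.toDouble)
      .cache() // keep the parsed RDD in executor memory

    // Every pass reuses the cached RDD. MapReduce would persist
    // intermediate output to disk and re-read it between jobs.
    var result = 0.0
    for (_ <- 1 to 10) {
      result = numbers.map(x => x * x).sum() / numbers.count()
    }
    println(s"mean of squares = $result")

    spark.stop()
  }
}
```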

The problem is that most of these claims are not benchmarked against true production use cases. There may be volume, but not the kind of data quality that represents actual business applications. Spark can be very handy for streaming analytics, where you want to understand data in near real time. But for true batch processing, Hadoop can be the better solution, especially on commodity hardware.


Spark is faster than Hadoop due to in-memory processing, but the headline numbers come with caveats.

Spark still has to rely on HDFS (or another storage layer) for many use cases.

Have a look at the complete presentation (especially slide 6) and this benchmarking article.

Spark is faster for real-time analytics due to in-memory processing; Hadoop is good for batch processing. If you are not worried about job latency, you can still use Hadoop.

But one thing is sure: Spark and Hadoop have to co-exist. Neither can replace the other.


  • Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after each map or reduce action. The trade-off is that Spark needs a lot of memory.

  • Spark keeps data loaded in memory across operations (via caching) instead of reloading it for each step.

  • The Resilient Distributed Dataset (RDD) lets you transparently store data in memory and persist it to disk when needed (see the persistence sketch after this list).

  • Since intermediate results stay in memory, there is no per-job synchronisation barrier forcing a write to disk between steps. This is a major reason for Spark's performance.

  • Rather than only processing batches of stored data, as is the case with MapReduce, Spark can also manipulate data in near real time using Spark Streaming (sketched after this list).

  • The DataFrames API was inspired by data frames in R and Python (pandas), but was designed from the ground up as an extension to the existing RDD API.

  • A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed (see the DataFrame sketch after this list).

  • Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend, you are dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.

  • Spark lets you develop complex, multi-step data pipelines using a directed acyclic graph (DAG) execution model. It supports in-memory data sharing across stages, so different jobs can work with the same data. The DAG scheduler is a major part of Spark's speed (see the lazy-evaluation sketch after this list).

  • Spark's code base is also much smaller.
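
For the RDD persistence point above, a minimal sketch, assuming an existing SparkSession named `spark` and a hypothetical log path: `MEMORY_AND_DISK` keeps partitions in memory and transparently spills them to disk when they don't fit.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkSession `spark`; the path is hypothetical.
val events = spark.sparkContext.textFile("hdfs:///logs/events")

// Keep partitions in memory and spill to disk only if they don't fit.
events.persist(StorageLevel.MEMORY_AND_DISK)

println(events.count()) // first action materializes and caches the data
println(events.filter(_.contains("ERROR")).count()) // served from cache
```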
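For the Spark Streaming point, a minimal word-count sketch over 5-second micro-batches, again assuming an existing SparkSession `spark`; the host and port are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkSession `spark`; host and port are placeholders.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Count words arriving on a TCP socket, in 5-second micro-batches.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```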
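For the DataFrame point, a small sketch (the rows are made up) showing the named-column API; because the operations are declarative, Spark's optimizer can plan the execution instead of running opaque user functions:

```scala
// Assumes an existing SparkSession `spark`; the rows are made up.
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 29), ("carol", 41))
  .toDF("name", "age")

// Declarative, column-oriented operations: Spark plans the physical
// execution of this query rather than running arbitrary closures.
people
  .filter($"age" > 30)
  .groupBy($"name")
  .count()
  .show()
```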
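And for the DAG point, a sketch showing lazy evaluation: transformations only record lineage in the DAG, and a single action triggers one optimized, pipelined pass over the data.

```scala
// Assumes an existing SparkSession `spark`. Transformations only build
// up the DAG; nothing executes until an action is called.
val pipeline = spark.sparkContext
  .parallelize(1 to 1000000)
  .map(_ * 2)         // transformation: recorded, not executed
  .filter(_ % 3 == 0) // transformation: recorded, not executed

// The action triggers one pass over the data; the scheduler pipelines
// the map and filter into a single stage with no intermediate files.
println(pipeline.count())
```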

Hope this helps.