Is HDFS necessary for Spark workloads?


Spark is a distributed processing engine and HDFS is a distributed storage system.

If HDFS is not an option, Spark has to rely on some alternative storage layer, such as Apache Cassandra or Amazon S3.

Have a look at this comparison:

S3 – non-urgent batch jobs. S3 fits fairly specific use cases, where data locality isn't critical.

Cassandra – perfect for streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
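One practical consequence of this comparison: the Spark job itself barely changes between backends, because Spark dispatches on the URI scheme of the input path. A minimal stdlib-only sketch of that dispatch (the hostnames and bucket names are placeholders, not real endpoints):

```python
from urllib.parse import urlparse

# The same Spark job can target any backend just by changing the path URI.
# "namenode" and "my-bucket" below are placeholders.
paths = {
    "local": "file:///data/events.csv",
    "hdfs":  "hdfs://namenode:8020/data/events.csv",
    "s3":    "s3a://my-bucket/data/events.csv",
}

def storage_backend(path: str) -> str:
    """Return the filesystem scheme Spark would dispatch on."""
    return urlparse(path).scheme

for name, path in paths.items():
    print(name, "->", storage_backend(path))

# In a real job, the read itself is identical for every backend, e.g.:
#   df = spark.read.csv(paths["hdfs"], header=True)
```

What changes between backends is not the code but the behavior behind it: with `hdfs://` Spark can schedule tasks near the data, while with `s3a://` every read crosses the network.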

When should you use HDFS as the storage layer for Spark's distributed processing?

  1. If you already have a big Hadoop cluster in place and want real-time analytics on your data, Spark can use the existing cluster. This reduces development time.

  2. Spark is an in-memory computing engine. Since data cannot always fit in memory, it has to be spilled to disk for some operations. Spark benefits from HDFS in this case; the large-scale sorting benchmark record achieved by Spark used HDFS storage for the sorting operation.

  3. HDFS is a scalable, reliable, and fault-tolerant distributed file system (especially since the Hadoop 2.x releases). Thanks to the data-locality principle, processing speed improves because computation moves to the data rather than the other way around.

  4. It is best suited for batch-processing jobs.
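For point 1, reusing an existing Hadoop cluster mostly means pointing Spark at the cluster's configuration. A hedged sketch of the submission, assuming YARN; all hostnames and paths are placeholders you would adapt to your cluster:

```shell
# Sketch only: paths below are placeholders for your cluster layout.
# Point Spark at the cluster's Hadoop configuration so it can find HDFS/YARN.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# spark.local.dir controls where shuffle/spill data lands when it does not
# fit in memory (point 2 above); the input path uses the hdfs:// scheme.
spark-submit \
  --master yarn \
  --conf spark.local.dir=/mnt/spill \
  my_job.py hdfs:///data/input
```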


The shortest answer is: "No, you don't need it." You can analyse data even without HDFS, but of course you then need to replicate the data on all your nodes.

The long answer is quite counterintuitive, and I'm still trying to understand it with the help of the Stack Overflow community:

Spark local vs HDFS performance


HDFS (or any distributed filesystem) makes distributing your data much simpler. With a local filesystem you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS also handles node failures for you. Spark and HDFS also integrate well: Spark knows about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
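The locality-aware scheduling mentioned above can be illustrated with a toy sketch. This is not Spark's actual scheduler, just the core idea: each HDFS block reports which nodes hold a replica, and the scheduler prefers launching a task on one of those nodes when a slot is free; block and node names are made up:

```python
# Toy sketch of locality-preferring scheduling (not Spark's real scheduler).
# Each block lists the nodes holding one of its replicas.
block_replicas = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 1}

def schedule(block: str) -> str:
    """Pick a node holding a replica if one has a free slot, else any free node."""
    for node in block_replicas[block]:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node  # data-local task: reads from local disk
    node = next(n for n, slots in free_slots.items() if slots > 0)
    free_slots[node] -= 1
    return node  # remote read: data must travel over the network

assignments = {block: schedule(block) for block in block_replicas}
print(assignments)
```

With a local (non-distributed) filesystem there is no replica map to consult, so every task may end up reading over the network or require the data pre-copied to every node.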

Second: which problems exactly did you face with the instructions?

BTW: if you are just looking for an easy setup on AWS, DC/OS allows you to install HDFS with a single command...