Is HDFS necessary for Spark workloads?


Spark is a distributed processing engine and HDFS is a distributed storage system.

If HDFS is not an option, Spark has to rely on some alternative storage layer, such as Apache Cassandra or Amazon S3.

Have a look at this comparison:

S3 – non-urgent batch jobs. S3 fits fairly specific use cases, where data locality isn't critical.

Cassandra – perfect for streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
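One practical consequence of this comparison: the Spark job itself barely changes between backends, because Spark dispatches on the URI scheme of the input path. A minimal stdlib-only sketch of that dispatch (the hostnames and bucket names are placeholders, not real endpoints):

```python
from urllib.parse import urlparse

# The same Spark job can target any backend just by changing the path URI.
# "namenode" and "my-bucket" below are placeholders.
paths = {
    "local": "file:///data/events.csv",
    "hdfs":  "hdfs://namenode:8020/data/events.csv",
    "s3":    "s3a://my-bucket/data/events.csv",
}

def storage_backend(path: str) -> str:
    """Return the filesystem scheme Spark would dispatch on."""
    return urlparse(path).scheme

for name, path in paths.items():
    print(name, "->", storage_backend(path))

# In a real job, the read itself is identical for every backend, e.g.:
#   df = spark.read.csv(paths["hdfs"], header=True)
```

What changes between backends is not the code but the behavior behind it: with `hdfs://` Spark can schedule tasks near the data, while with `s3a://` every read crosses the network.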

When should you use HDFS as the storage layer for Spark's distributed processing?

  1. If you already have a big Hadoop cluster in place and want real-time analytics on your data, Spark can use the existing cluster. This reduces development time.

  2. Spark is an in-memory computing engine. Since data cannot always fit in memory, it has to be spilled to disk for some operations. Spark benefits from HDFS in this case; the large-scale sorting benchmark record achieved by Spark used HDFS storage for the sorting operation.

  3. HDFS is a scalable, reliable, and fault-tolerant distributed file system (especially since the Hadoop 2.x releases). Thanks to the data-locality principle, processing speed improves because computation moves to the data rather than the other way around.

  4. It is best suited for batch-processing jobs.
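For point 1, reusing an existing Hadoop cluster mostly means pointing Spark at the cluster's configuration. A hedged sketch of the submission, assuming YARN; all hostnames and paths are placeholders you would adapt to your cluster:

```shell
# Sketch only: paths below are placeholders for your cluster layout.
# Point Spark at the cluster's Hadoop configuration so it can find HDFS/YARN.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# spark.local.dir controls where shuffle/spill data lands when it does not
# fit in memory (point 2 above); the input path uses the hdfs:// scheme.
spark-submit \
  --master yarn \
  --conf spark.local.dir=/mnt/spill \
  my_job.py hdfs:///data/input
```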


The shortest answer is: "No, you don't need it." You can analyse data even without HDFS, but of course you then need to replicate the data on all your nodes.

The long answer is quite counterintuitive, and I'm still trying to understand it with the help of the Stack Overflow community:

Spark local vs HDFS performance


HDFS (or any distributed filesystem) makes distributing your data much simpler. With a local filesystem you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS also handles node failures for you. Spark and HDFS also integrate well: Spark knows about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
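The locality-aware scheduling mentioned above can be illustrated with a toy sketch. This is not Spark's actual scheduler, just the core idea: each HDFS block reports which nodes hold a replica, and the scheduler prefers launching a task on one of those nodes when a slot is free; block and node names are made up:

```python
# Toy sketch of locality-preferring scheduling (not Spark's real scheduler).
# Each block lists the nodes holding one of its replicas.
block_replicas = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 1}

def schedule(block: str) -> str:
    """Pick a node holding a replica if one has a free slot, else any free node."""
    for node in block_replicas[block]:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node  # data-local task: reads from local disk
    node = next(n for n, slots in free_slots.items() if slots > 0)
    free_slots[node] -= 1
    return node  # remote read: data must travel over the network

assignments = {block: schedule(block) for block in block_replicas}
print(assignments)
```

With a local (non-distributed) filesystem there is no replica map to consult, so every task may end up reading over the network or require the data pre-copied to every node.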

Second: which problems exactly did you face with the instructions?

BTW: if you are just looking for an easy setup on AWS, DC/OS allows you to install HDFS with a single command...