
spark over kubernetes vs yarn/hadoop ecosystem [closed]


Can someone help me understand the differences between running Spark on Kubernetes and running it on the YARN/Hadoop ecosystem?

Be forewarned: this is a theoretical answer, because I don't run Spark anymore, and thus I haven't run Spark on kubernetes, but I have maintained both a Hadoop cluster and now a kubernetes cluster, and so I can speak to some of their differences.

Kubernetes is as battle-hardened a resource manager, with API access to all its components, as a reasonable person could wish for. It provides very painless declarative resource limitations (both cpu and ram, plus even syscall capabilities), very, very painless log egress (both back to the user via kubectl and out of the cluster using multiple flavors of log management approaches), an unprecedented level of metrics gathering and egress allowing one to keep an eye on the health of the cluster and the jobs therein, and the list goes on and on.
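
As a concrete illustration (the namespace and pod names below are hypothetical), the resource limits live declaratively on each pod spec and the logs are one kubectl command away:

```bash
# Hypothetical namespace/pod names; shows the kind of per-container limits and
# log access Kubernetes gives you without any extra plumbing.

# Inspect the declarative cpu/memory requests and limits on a running executor pod:
kubectl -n spark-jobs get pod spark-pi-exec-1 \
  -o jsonpath='{.spec.containers[0].resources}'

# Stream the driver's logs straight back to your terminal (the same logs can be
# shipped off-cluster by a log collector such as fluentd, without touching the app):
kubectl -n spark-jobs logs -f spark-pi-driver
```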

But perhaps the biggest reason one would choose to run Spark on kubernetes is the same reason one would choose to run kubernetes at all: shared resources rather than having to create new machines for different workloads (well, plus all of those benefits above). So if you have a dedicated Spark cluster, it is very, very likely going to burn $$$ while a job isn't actively running on it, whereas kubernetes will cheerfully schedule other jobs onto those Nodes while they aren't running Spark jobs. Yes, I am aware that Mesos and YARN are "generic" cluster resource managers, but it has not been my experience that they are as painless or ubiquitous as kubernetes.
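
For concreteness, here is roughly what a submission looks like with Spark 2.3+'s native Kubernetes support (the API server host, namespace, image and jar path are placeholders): the driver and executors become ordinary pods that the scheduler packs onto whatever nodes have room, alongside everything else in the cluster.

```bash
# A minimal sketch of a cluster-mode submission against a Kubernetes API server;
# every host, image and path here is a placeholder.
bin/spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=my-registry/spark:2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```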

I would welcome someone posting the counter-narrative, or contributing more hands-on experience of Spark on kubernetes.


To complement Matthew L Daniel's opinion, mine focuses on 2 interesting concepts that Kubernetes can bring to data pipelines:

- namespaces + resource quotas make it easier to separate and share resources, for instance by reserving much more resources for the data-intensive/more unpredictable/business-critical parts without provisioning new nodes every time (a sketch follows after this list)
- horizontal scaling - when the Kubernetes scheduler doesn't succeed in allocating new pods (which may one day be created by Spark's dynamic resource allocation, not implemented yet for the Kubernetes backend), it is able to provision the necessary nodes dynamically (e.g. through https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#introduction). That said, horizontal scaling is currently difficult to achieve in Apache Spark, since it requires keeping the external shuffle service around even for a shut-down executor. So even if our load decreases, we still keep the nodes created to handle its increase. But once this problem is solved, Kubernetes autoscaling will be an interesting option to reduce costs, improve processing performance and make pipelines elastic.
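
As a rough sketch of the first point (the namespace, quota name and figures are all made up), a namespace plus a ResourceQuota is enough to carve out a bounded slice of a shared cluster for the data-intensive pipelines:

```bash
# Hypothetical namespace, quota name and sizes - adjust to your cluster.
kubectl create namespace spark-pipelines

# Cap the total cpu/memory that all pods in this namespace may request or consume,
# so the pipelines get a well-defined slice of the shared cluster without
# dedicating whole nodes to them.
kubectl create quota spark-pipelines-quota \
  --namespace=spark-pipelines \
  --hard=requests.cpu=40,requests.memory=160Gi,limits.cpu=64,limits.memory=256Gi
```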

However, please note that all of the above is based only on personal observations and some local tests of the early Spark on Kubernetes feature (Spark 2.3.0).