On demand user cluster for Apache Zeppelin + Spark?


There are several ways to solve it.

I'm assuming you're running the cluster anyway, so "on demand" here means a fixed pool of resources that YARN allocates dynamically. You should first take a look at YARN queues and YARN authorization. This way you can manage resource availability effectively, according to the fairness criteria of your organization.
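To make the queue idea concrete, here is a minimal sketch that builds Capacity Scheduler properties for one queue per team. I'm assuming the Capacity Scheduler is in use; the queue names, percentages and user names are made up, and `capacity_scheduler_properties` is a hypothetical helper, not part of any Hadoop tooling.

```python
def capacity_scheduler_properties(queues):
    """Build Capacity Scheduler properties for leaf queues under root.

    queues maps a queue name to (capacity_percent, comma-separated users
    allowed to submit). Names and numbers here are hypothetical.
    """
    total = sum(capacity for capacity, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    prefix = "yarn.scheduler.capacity.root"
    props = {f"{prefix}.queues": ",".join(queues)}
    for name, (capacity, users) in queues.items():
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        props[f"{prefix}.{name}.acl_submit_applications"] = users
    return props


# Hypothetical teams: 60% of the cluster for analytics, 40% for research.
example = capacity_scheduler_properties({
    "analytics": (60, "alice,bob"),
    "research": (40, "carol"),
})
for key, value in example.items():
    print(f"{key}={value}")
```

The resulting key/value pairs would go into capacity-scheduler.xml. As far as I know the ACLs only take effect when `yarn.acl.enable` is true, and the root queue's own submit ACL has to be tightened as well because child queues inherit it, so treat this as a starting point and check it against your Hadoop version.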

On the Zeppelin side, make sure to also enable authentication. The user identity then passes through to YARN and HDFS, effectively segregating users. If you have differing requirements and want to make sure that interpreters do not collide, you can:

  • Use isolated mode. This is easiest to set up, but tricky to maintain.
  • Set up one interpreter per team/org-unit (OU). This has minor overhead, because you manage all your OUs in one Zeppelin instance, but it may be the best way to centrally manage different requirements.
  • Use deployable Zeppelins (e.g. Dockerized). This isolates OUs from one another, but you also have to maintain configurations per OU and inject them into the images at deploy time, or manage a whole bunch of customized images.
  • Just have the OUs manage their own Zeppelin and use cluster access rights to restrict what they can actually do cluster-side. Since there is no "general" Zeppelin user, this recommendation depends on the technical finesse of the users. Maintaining this shouldn't be too difficult, and the flexibility might make it worthwhile. Expect a higher support/assistance workload, obviously.
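As a rough illustration of the second option, the sketch below derives one set of Spark interpreter properties per OU, each pointing at its own YARN queue. `spark.yarn.queue` and `spark.app.name` are regular Spark properties; the `spark_<ou>` naming scheme, the OU names and the `interpreter_settings` helper are my own assumptions.

```python
def interpreter_settings(org_units):
    """Return one Spark interpreter setting per org unit.

    org_units maps a hypothetical OU name to the YARN queue it may use.
    """
    settings = {}
    for ou, queue in org_units.items():
        settings[f"spark_{ou}"] = {
            # Queue that Spark-on-YARN jobs of this OU are submitted to.
            "spark.yarn.queue": queue,
            # Distinct app name so each OU's jobs are easy to spot in YARN.
            "spark.app.name": f"zeppelin-{ou}",
        }
    return settings


# Hypothetical OUs, each mapped to a queue of the same name.
settings = interpreter_settings({"analytics": "analytics", "research": "research"})
for name, props in settings.items():
    print(name, props)
```

You would then enter these properties for each interpreter setting in Zeppelin and restrict who may use which setting; how you apply them (UI or otherwise) is up to you.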

Which solution is the right one for you depends a lot on the organizational makeup, the technical skills and the variety of requirements of your users. One of the things to keep in mind is dependency management: this is potentially the biggest issue once cluster access has been solved. The more people use Zeppelin and share one interpreter setting, the more likely you are to encounter dependency conflicts, which will ruin everyone's day. I would personally recommend my second and fourth propositions, but I have also seen the third one used in large enterprises; it can work if variety isn't too high.
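To show what I mean by dependency conflicts, here is a small sketch that flags artifacts requested in different versions by people sharing one interpreter setting. It assumes dependencies are written as Maven-style `group:artifact:version` coordinates; the team names, the coordinates and the `find_version_conflicts` helper are all hypothetical.

```python
from collections import defaultdict


def find_version_conflicts(requested):
    """Find artifacts that are requested in more than one version.

    requested maps a user or team to a list of "group:artifact:version"
    strings. Returns {"group:artifact": {version: [who, ...]}} for every
    artifact with at least two distinct versions.
    """
    versions = defaultdict(lambda: defaultdict(list))
    for who, coordinates in requested.items():
        for coordinate in coordinates:
            group, artifact, version = coordinate.split(":")
            versions[f"{group}:{artifact}"][version].append(who)
    return {
        name: dict(by_version)
        for name, by_version in versions.items()
        if len(by_version) > 1
    }


# Hypothetical teams sharing one interpreter setting.
conflicts = find_version_conflicts({
    "team_a": ["com.example:parser:1.2.0", "org.example:client:3.1.0"],
    "team_b": ["com.example:parser:2.0.0", "org.example:client:3.1.0"],
})
print(conflicts)
```

Here both teams agree on the client library but want different parser versions, which is exactly the situation where splitting them into separate interpreter settings pays off.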

One thing I would NOT do is create one instance of Zeppelin per user. Zeppelin serves mostly to share information, so one instance of Zeppelin should be shared amongst a group of users who are looking to benefit from each other's work. I think you could use net-mounted notebook directories to re-merge the notebooks, but there may be write-contention issues, with unintended overwrites or reverts of previous writes.