
Running Apache Hive on Kubernetes (without YARN)


Hive on MR3 runs on Kubernetes, as MR3 (a new execution engine for Hadoop and Kubernetes) provides native support for Kubernetes.

https://mr3docs.datamonad.com/docs/k8s/
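If you want to try Hive on MR3 quickly, the MR3 project ships scripts for starting the Metastore and HiveServer2 on Kubernetes. The repository layout and script names below are assumptions based on the quick start guide, so verify them against the docs above:

# Rough sketch; repository layout and script names may differ by MR3 release.
git clone https://github.com/mr3project/mr3-run-k8s.git
cd mr3-run-k8s/kubernetes
./run-metastore.sh   # starts the Metastore on Kubernetes
./run-hive.sh        # starts HiveServer2 with MR3 as the execution engine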


Please take a look at my blog related to this topic:

Assuming that you are running Spark as the batch execution engine for your data lake, it is easy to run HiveServer2 on Spark, namely as the Spark Thrift Server, which is compatible with HiveServer2.

Before submitting the Spark Thrift Server to Kubernetes, you should install a Hive Metastore on Kubernetes. There is a good approach to installing the Hive Metastore on Kubernetes here: https://github.com/joshuarobinson/presto-on-k8s/tree/master/hive_metastore
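After applying the manifests from that repository, it is worth verifying that the Metastore service is reachable at the URI used later in the spark-submit command. The namespace and service name below are assumptions matching thrift://metastore.hive-metastore.svc.cluster.local:9083:

# Hypothetical check; the actual manifests come from the repository above.
kubectl create namespace hive-metastore
kubectl apply -n hive-metastore -f hive_metastore/
kubectl get svc -n hive-metastore   # expect a 'metastore' service on port 9083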

Because spark-submit prevents the Spark Thrift Server class from being run in cluster mode on Kubernetes, you can write a simple wrapper class that invokes the Spark Thrift Server main class, like this:

public class SparkThriftServerRunner {

    public static void main(String[] args) {
        org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(args);
    }
}

Then build your Spark application uber-jar with the Maven Shade plugin.
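A minimal build sketch, assuming the Shade plugin is bound to the package phase in your pom.xml and that the module path and final jar name match the spark-submit command below:

# Assumed module path and artifact name, taken from the spark-submit below.
cd /home/pcp/spongebob/examples/spark-thrift-server
mvn clean package
ls target/spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar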

Now you are ready to submit the Spark Thrift Server to Kubernetes. To do so, run the following command:

spark-submit \
--master k8s://https://10.233.0.1:443 \
--deploy-mode cluster \
--name spark-thrift-server \
--class io.spongebob.hive.SparkThriftServerRunner \
--packages com.amazonaws:aws-java-sdk-s3:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 \
--conf spark.kubernetes.file.upload.path=s3a://mykidong/spark-thrift-server \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image=mykidong/spark:v3.0.0 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.hadoop.hive.metastore.client.connect.retry.delay=5 \
--conf spark.hadoop.hive.metastore.client.socket.timeout=1800 \
--conf spark.hadoop.hive.metastore.uris=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
--conf spark.hadoop.hive.server2.enable.doAs=false \
--conf spark.hadoop.hive.server2.thrift.http.port=10002 \
--conf spark.hadoop.hive.server2.thrift.port=10016 \
--conf spark.hadoop.hive.server2.transport.mode=binary \
--conf spark.hadoop.metastore.catalog.default=spark \
--conf spark.hadoop.hive.execution.engine=spark \
--conf spark.hadoop.hive.input.format=io.delta.hive.HiveInputFormat \
--conf spark.hadoop.hive.tez.input.format=io.delta.hive.HiveInputFormat \
--conf spark.sql.warehouse.dir=s3a://mykidong/apps/spark/warehouse \
--conf spark.hadoop.fs.defaultFS=s3a://mykidong \
--conf spark.hadoop.fs.s3a.access.key=bWluaW8= \
--conf spark.hadoop.fs.s3a.secret.key=bWluaW8xMjM= \
--conf spark.hadoop.fs.s3a.endpoint=http://10.233.25.63:9099 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=2G \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=1G \
--conf spark.jars=/home/pcp/delta-lake/connectors/dist/delta-core-shaded-assembly_2.12-0.1.0.jar,/home/pcp/delta-lake/connectors/dist/hive-delta_2.12-0.1.0.jar \
file:///home/pcp/spongebob/examples/spark-thrift-server/target/spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar

The Spark Thrift Server driver and executors will then run on Kubernetes in cluster mode.

Take a look at S3 paths like s3a://mykidong/spark-thrift-server, where your Spark application uber-jar and dependency jar files will be uploaded; they will later be downloaded by the Spark Thrift Server driver and executors and loaded into their classloaders. You need such an external repository, like an S3 bucket or HDFS, for the uploaded files.
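A small sketch for preparing that upload path on the MinIO endpoint used above, here with the AWS CLI (any S3-compatible client works; bucket name and endpoint are taken from the command):

# Export credentials matching the s3a keys above, then create the bucket.
aws --endpoint-url http://10.233.25.63:9099 s3 mb s3://mykidong
aws --endpoint-url http://10.233.25.63:9099 s3 ls s3://mykidong/spark-thrift-server/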

To access the Spark Thrift Server as HiveServer2, first look up the pods:

[pcp@master-0 ~]$ kubectl get po -n spark -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
spark-thrift-server-54001673a399bdb7-exec-1   1/1     Running   0          116m   10.233.69.130   minion-2   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-2   1/1     Running   0          116m   10.233.67.207   minion-0   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-3   1/1     Running   0          116m   10.233.68.14    minion-1   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-4   1/1     Running   0          116m   10.233.69.131   minion-2   <none>           <none>
spark-thrift-server-ac08d873a397a201-driver   1/1     Running   0          118m   10.233.67.206   minion-0   <none>           <none>

The IP address of the pod spark-thrift-server-ac08d873a397a201-driver will be used to connect to the Spark Thrift Server.
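You can also look up the driver pod IP programmatically, since Spark on Kubernetes labels the driver pod with spark-role=driver:

# Fetch the driver pod IP via the spark-role=driver label.
kubectl get pod -n spark -l spark-role=driver \
  -o jsonpath='{.items[0].status.podIP}'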

Now, connect to it with Beeline:

cd <spark-home>
bin/beeline -u jdbc:hive2://10.233.67.206:10016
# type some queries....
show tables;
...
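For a quick non-interactive check, Beeline's -e flag runs a statement and exits (the query is just an example):

# One-shot query instead of an interactive session.
bin/beeline -u jdbc:hive2://10.233.67.206:10016 -e 'show tables;'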