
Running Apache Hive on Kubernetes (without YARN)


Hive on MR3 runs on Kubernetes, as MR3 (a new execution engine for Hadoop and Kubernetes) provides native support for Kubernetes.

https://mr3docs.datamonad.com/docs/k8s/
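If you want to try Hive on MR3 quickly, the MR3 project ships scripts for starting the Metastore and HiveServer2 on Kubernetes. The repository layout and script names below are assumptions based on the quick start guide, so verify them against the docs above:

# Rough sketch; repository layout and script names may differ by MR3 release.
git clone https://github.com/mr3project/mr3-run-k8s.git
cd mr3-run-k8s/kubernetes
./run-metastore.sh   # starts the Metastore on Kubernetes
./run-hive.sh        # starts HiveServer2 with MR3 as the execution engine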


Please take a look at my blog related to this topic:

Assuming that you are running Spark as the batch execution engine for your data lake, it is easy to run HiveServer2 on Spark, namely as the Spark Thrift Server, which is compatible with HiveServer2.

Before submitting the Spark Thrift Server to Kubernetes, you should install a Hive Metastore on Kubernetes. There is a good approach to installing the Hive Metastore on Kubernetes here: https://github.com/joshuarobinson/presto-on-k8s/tree/master/hive_metastore
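After applying the manifests from that repository, it is worth verifying that the Metastore service is reachable at the URI used later in the spark-submit command. The namespace and service name below are assumptions matching thrift://metastore.hive-metastore.svc.cluster.local:9083:

# Hypothetical check; the actual manifests come from the repository above.
kubectl create namespace hive-metastore
kubectl apply -n hive-metastore -f hive_metastore/
kubectl get svc -n hive-metastore   # expect a 'metastore' service on port 9083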

Because spark-submit prevents the Spark Thrift Server class from being run in cluster mode on Kubernetes, you can write a simple wrapper class that invokes the Spark Thrift Server main class, like this:

public class SparkThriftServerRunner {

    public static void main(String[] args) {
        org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(args);
    }
}

Then build your Spark application uber-jar with the Maven Shade plugin.
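A minimal build sketch, assuming the Shade plugin is bound to the package phase in your pom.xml and that the module path and final jar name match the spark-submit command below:

# Assumed module path and artifact name, taken from the spark-submit below.
cd /home/pcp/spongebob/examples/spark-thrift-server
mvn clean package
ls target/spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar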

Now you are ready to submit the Spark Thrift Server to Kubernetes. To do so, run the following command:

spark-submit \
--master k8s://https://10.233.0.1:443 \
--deploy-mode cluster \
--name spark-thrift-server \
--class io.spongebob.hive.SparkThriftServerRunner \
--packages com.amazonaws:aws-java-sdk-s3:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 \
--conf spark.kubernetes.file.upload.path=s3a://mykidong/spark-thrift-server \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image=mykidong/spark:v3.0.0 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.hadoop.hive.metastore.client.connect.retry.delay=5 \
--conf spark.hadoop.hive.metastore.client.socket.timeout=1800 \
--conf spark.hadoop.hive.metastore.uris=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
--conf spark.hadoop.hive.server2.enable.doAs=false \
--conf spark.hadoop.hive.server2.thrift.http.port=10002 \
--conf spark.hadoop.hive.server2.thrift.port=10016 \
--conf spark.hadoop.hive.server2.transport.mode=binary \
--conf spark.hadoop.metastore.catalog.default=spark \
--conf spark.hadoop.hive.execution.engine=spark \
--conf spark.hadoop.hive.input.format=io.delta.hive.HiveInputFormat \
--conf spark.hadoop.hive.tez.input.format=io.delta.hive.HiveInputFormat \
--conf spark.sql.warehouse.dir=s3a://mykidong/apps/spark/warehouse \
--conf spark.hadoop.fs.defaultFS=s3a://mykidong \
--conf spark.hadoop.fs.s3a.access.key=bWluaW8= \
--conf spark.hadoop.fs.s3a.secret.key=bWluaW8xMjM= \
--conf spark.hadoop.fs.s3a.endpoint=http://10.233.25.63:9099 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=2G \
--conf spark.executor.cores=2 \
--conf spark.driver.memory=1G \
--conf spark.jars=/home/pcp/delta-lake/connectors/dist/delta-core-shaded-assembly_2.12-0.1.0.jar,/home/pcp/delta-lake/connectors/dist/hive-delta_2.12-0.1.0.jar \
file:///home/pcp/spongebob/examples/spark-thrift-server/target/spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar

The Spark Thrift Server driver and executors will then run on Kubernetes in cluster mode.

Take a look at S3 paths like s3a://mykidong/spark-thrift-server, where your Spark application uber-jar and dependency jar files will be uploaded; they will later be downloaded by the Spark Thrift Server driver and executors and loaded into their classloaders. You need such an external repository, like an S3 bucket or HDFS, for the uploaded files.
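A small sketch for preparing that upload path on the MinIO endpoint used above, here with the AWS CLI (any S3-compatible client works; bucket name and endpoint are taken from the command):

# Export credentials matching the s3a keys above, then create the bucket.
aws --endpoint-url http://10.233.25.63:9099 s3 mb s3://mykidong
aws --endpoint-url http://10.233.25.63:9099 s3 ls s3://mykidong/spark-thrift-server/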

To access the Spark Thrift Server as HiveServer2, first look up the pods:

[pcp@master-0 ~]$ kubectl get po -n spark -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
spark-thrift-server-54001673a399bdb7-exec-1   1/1     Running   0          116m   10.233.69.130   minion-2   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-2   1/1     Running   0          116m   10.233.67.207   minion-0   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-3   1/1     Running   0          116m   10.233.68.14    minion-1   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-4   1/1     Running   0          116m   10.233.69.131   minion-2   <none>           <none>
spark-thrift-server-ac08d873a397a201-driver   1/1     Running   0          118m   10.233.67.206   minion-0   <none>           <none>

The IP address of the pod spark-thrift-server-ac08d873a397a201-driver will be used to connect to the Spark Thrift Server.
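You can also look up the driver pod IP programmatically, since Spark on Kubernetes labels the driver pod with spark-role=driver:

# Fetch the driver pod IP via the spark-role=driver label.
kubectl get pod -n spark -l spark-role=driver \
  -o jsonpath='{.items[0].status.podIP}'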

Now, connect to it with Beeline:

cd <spark-home>
bin/beeline -u jdbc:hive2://10.233.67.206:10016
# type some queries....
show tables;
...
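For a quick non-interactive check, Beeline's -e flag runs a statement and exits (the query is just an example):

# One-shot query instead of an interactive session.
bin/beeline -u jdbc:hive2://10.233.67.206:10016 -e 'show tables;'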