Running Apache Hive on Kubernetes (without YARN)
Hive on MR3 runs on Kubernetes, as MR3 (a new execution engine for Hadoop and Kubernetes) provides native support for Kubernetes.
Please take a look at my blog post related to this topic:
Assuming that you are running Spark as the batch execution engine for your data lake, it is easy to run HiveServer2 on Spark, namely the Spark Thrift Server, which is compatible with HiveServer2.
Before submitting the Spark Thrift Server to Kubernetes, you should install Hive Metastore on Kubernetes. A good approach to installing Hive Metastore on Kubernetes is described here: https://github.com/joshuarobinson/presto-on-k8s/tree/master/hive_metastore
Because spark-submit does not allow running the Spark Thrift Server on Kubernetes in cluster mode directly, you can write a simple wrapper class that invokes the Spark Thrift Server main class, like this:
public class SparkThriftServerRunner {

    public static void main(String[] args) {
        org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(args);
    }
}
Then build your Spark application uber-jar with the Maven Shade Plugin.
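A minimal maven-shade-plugin configuration might look like the following sketch; the plugin version is an assumption, while the spark-job classifier matches the uber-jar name used in the spark-submit command below.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <!-- Version is an assumption; use whatever is current for your build. -->
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <!-- Produces spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar
             alongside the regular artifact. -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>spark-job</shadedClassifierName>
        <transformers>
          <!-- Merge META-INF/services entries from the shaded dependencies. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```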
Now you are ready to submit the Spark Thrift Server to Kubernetes. To do so, run the following command:
spark-submit \
  --master k8s://https://10.233.0.1:443 \
  --deploy-mode cluster \
  --name spark-thrift-server \
  --class io.spongebob.hive.SparkThriftServerRunner \
  --packages com.amazonaws:aws-java-sdk-s3:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 \
  --conf spark.kubernetes.file.upload.path=s3a://mykidong/spark-thrift-server \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=mykidong/spark:v3.0.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.hive.metastore.client.connect.retry.delay=5 \
  --conf spark.hadoop.hive.metastore.client.socket.timeout=1800 \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore.hive-metastore.svc.cluster.local:9083 \
  --conf spark.hadoop.hive.server2.enable.doAs=false \
  --conf spark.hadoop.hive.server2.thrift.http.port=10002 \
  --conf spark.hadoop.hive.server2.thrift.port=10016 \
  --conf spark.hadoop.hive.server2.transport.mode=binary \
  --conf spark.hadoop.metastore.catalog.default=spark \
  --conf spark.hadoop.hive.execution.engine=spark \
  --conf spark.hadoop.hive.input.format=io.delta.hive.HiveInputFormat \
  --conf spark.hadoop.hive.tez.input.format=io.delta.hive.HiveInputFormat \
  --conf spark.sql.warehouse.dir=s3a://mykidong/apps/spark/warehouse \
  --conf spark.hadoop.fs.defaultFS=s3a://mykidong \
  --conf spark.hadoop.fs.s3a.access.key=bWluaW8= \
  --conf spark.hadoop.fs.s3a.secret.key=bWluaW8xMjM= \
  --conf spark.hadoop.fs.s3a.endpoint=http://10.233.25.63:9099 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=2G \
  --conf spark.executor.cores=2 \
  --conf spark.driver.memory=1G \
  --conf spark.jars=/home/pcp/delta-lake/connectors/dist/delta-core-shaded-assembly_2.12-0.1.0.jar,/home/pcp/delta-lake/connectors/dist/hive-delta_2.12-0.1.0.jar \
  file:///home/pcp/spongebob/examples/spark-thrift-server/target/spark-thrift-server-1.0.0-SNAPSHOT-spark-job.jar
The Spark Thrift Server driver and executors will then run on Kubernetes in cluster mode.
Take a look at the S3 path s3a://mykidong/spark-thrift-server set via spark.kubernetes.file.upload.path: your Spark application uber-jar and dependency jars will be uploaded there, then downloaded by the Spark Thrift Server driver and executors and loaded into their classloaders. You need such an external repository, for example an S3 bucket or HDFS, to hold the uploaded files.
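To check what spark-submit uploaded, you can list the bucket with any S3-compatible client; the aws CLI here is just one option (an assumption, configured with the same credentials as in the submit command):

```shell
# List the jars spark-submit uploaded under spark.kubernetes.file.upload.path.
# --endpoint-url points at the MinIO endpoint from the submit command above.
aws s3 ls s3://mykidong/spark-thrift-server/ --recursive \
  --endpoint-url http://10.233.25.63:9099
```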
To access the Spark Thrift Server as HiveServer2, first look up its driver pod:
[pcp@master-0 ~]$ kubectl get po -n spark -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
spark-thrift-server-54001673a399bdb7-exec-1   1/1     Running   0          116m   10.233.69.130   minion-2   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-2   1/1     Running   0          116m   10.233.67.207   minion-0   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-3   1/1     Running   0          116m   10.233.68.14    minion-1   <none>           <none>
spark-thrift-server-54001673a399bdb7-exec-4   1/1     Running   0          116m   10.233.69.131   minion-2   <none>           <none>
spark-thrift-server-ac08d873a397a201-driver   1/1     Running   0          118m   10.233.67.206   minion-0   <none>           <none>
The IP address of the driver pod spark-thrift-server-ac08d873a397a201-driver, 10.233.67.206 in this case, will be used to connect to the Spark Thrift Server.
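If you prefer not to copy the IP from the pod listing by hand, Spark on Kubernetes labels the driver pod with spark-role=driver, so a label selector can fetch it; kubectl port-forward is an alternative that avoids using the pod IP at all. A sketch:

```shell
# Fetch the driver pod IP via the spark-role=driver label Spark sets on driver pods.
kubectl get pod -n spark -l spark-role=driver \
  -o jsonpath='{.items[0].status.podIP}'

# Alternatively, forward the Thrift port locally and point beeline
# at localhost:10016 instead of the pod IP.
kubectl port-forward -n spark spark-thrift-server-ac08d873a397a201-driver 10016:10016
```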
Now, connect to it with beeline:
cd <spark-home>
bin/beeline -u jdbc:hive2://10.233.67.206:10016
# type some queries...
show tables;
...