
Spark UI History server on Kubernetes?


Yes, it is possible. Briefly, you will need to ensure the following:

  • Make sure all your applications store event logs in a specific location (filesystem, S3, HDFS, etc.).
  • Deploy the History Server in your cluster with access to the above event log location.

By default, Spark reads event logs only from a filesystem path, so I will elaborate on this case in detail using the Spark Operator:

  • Create a PVC with a volume type that supports ReadWriteMany mode, for example an NFS volume. The following snippet assumes you already have a storage class for NFS (nfs-volume) configured:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-pvc
  namespace: spark-apps
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 5Gi
  storageClassName: nfs-volume
```
  • Make sure all your Spark applications have event logging enabled and pointed at the correct path:
```yaml
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:/mnt"
```
  • Mount the event logs volume into each application pod (you can also use the operator's mutating webhook to centralize this). An example manifest with the above configuration is shown below:
```yaml
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-java-pi
  namespace: spark-apps
spec:
  type: Java
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.4
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  imagePullPolicy: Always
  sparkVersion: 2.4.4
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "file:/mnt"
  restartPolicy:
    type: Never
  volumes:
    - name: spark-data
      persistentVolumeClaim:
        claimName: spark-pvc
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.4
    serviceAccount: spark
    volumeMounts:
      - name: spark-data
        mountPath: /mnt
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.4
    volumeMounts:
      - name: spark-data
        mountPath: /mnt
```
  • Install the Spark History Server with the shared volume mounted. You will then be able to see the events in the History Server UI:
```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: spark-history-server
  namespace: spark-apps
spec:
  replicas: 1
  template:
    metadata:
      name: spark-history-server
      labels:
        app: spark-history-server
    spec:
      containers:
        - name: spark-history-server
          image: gcr.io/spark-operator/spark:v2.4.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "100m"
          command:
            - /sbin/tini
            - -s
            - --
            - /opt/spark/bin/spark-class
            - -Dspark.history.fs.logDirectory=/data/
            - org.apache.spark.deploy.history.HistoryServer
          ports:
            - name: http
              protocol: TCP
              containerPort: 18080
          readinessProbe:
            timeoutSeconds: 4
            httpGet:
              path: /
              port: http
          livenessProbe:
            timeoutSeconds: 4
            httpGet:
              path: /
              port: http
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: spark-pvc
            readOnly: true
```
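As an aside, instead of repeating the event-log properties in every SparkApplication manifest, you could bake them into your Spark image via `spark-defaults.conf`. A minimal sketch (the `/opt/spark/conf/spark-defaults.conf` location is Spark's usual default and assumed here):

```
# /opt/spark/conf/spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      file:/mnt
```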

Feel free to configure an Ingress or Service for accessing the UI.
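For example, a minimal Service and Ingress exposing port 18080 might look like the sketch below; the hostname `spark-history.example.com` is a placeholder you would replace with your own:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-history-server
  namespace: spark-apps
spec:
  selector:
    app: spark-history-server
  ports:
    - name: http
      port: 80
      targetPort: 18080
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: spark-history-server
  namespace: spark-apps
spec:
  rules:
    - host: spark-history.example.com  # placeholder hostname
      http:
        paths:
          - path: /
            backend:
              serviceName: spark-history-server
              servicePort: http
```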

Also, you can use Google Cloud Storage, Azure Blob Storage, or AWS S3 as the event log location. For this you will need to install some extra jars, so I would recommend having a look at the Lightbend Spark History Server image and charts.
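As a rough sketch of the S3 variant, assuming the hadoop-aws and matching aws-java-sdk jars are on the image and a bucket name of `my-spark-logs` (both assumptions you would adapt), the application side would look like this, with `spark.history.fs.logDirectory` pointed at the same `s3a://` path on the History Server side:

```yaml
  sparkConf:
    "spark.eventLog.enabled": "true"
    # bucket name is a placeholder
    "spark.eventLog.dir": "s3a://my-spark-logs/events"
    # credentials can instead come from an IAM role / instance profile
    "spark.hadoop.fs.s3a.access.key": "<access-key>"
    "spark.hadoop.fs.s3a.secret.key": "<secret-key>"
```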