Is there a way to monitor Kubernetes CronJobs using Prometheus?
I'm using these rules with kube-state-metrics:
    groups:
    - name: job.rules
      rules:
      - alert: CronJobRunning
        expr: time() - kube_cronjob_next_schedule_time > 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
          summary: CronJob didn't finish after 1h
      - alert: JobCompletion
        expr: kube_job_spec_completions - kube_job_status_succeeded > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: Job completion is taking more than 1h to complete cronjob {{$labels.namespace}}/{{$labels.job}}
          summary: Job {{$labels.job}} didn't finish to complete after 1h
      - alert: JobFailed
        expr: kube_job_status_failed > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
          summary: Job failed
The tricky part here is that the cronjobs themselves have no useful status, so you have to match them to the jobs they create. I've written up an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article goes into a bit of detail as to how things work, but the alert config is as follows:
    groups:
    - name: kube-cron
      rules:
      - record: job_cronjob:kube_job_status_start_time:max
        expr: |
          label_replace(
            label_replace(
              max(
                kube_job_status_start_time
                * ON(exported_job) GROUP_RIGHT()
                kube_job_labels{label_cronjob!=""}
              ) BY (exported_job, label_cronjob)
              == ON(label_cronjob) GROUP_LEFT()
              max(
                kube_job_status_start_time
                * ON(exported_job) GROUP_RIGHT()
                kube_job_labels{label_cronjob!=""}
              ) BY (label_cronjob),
              "job", "$1", "exported_job", "(.+)"),
            "cronjob", "$1", "label_cronjob", "(.+)")
      - record: job_cronjob:kube_job_status_failed:sum
        expr: |
          clamp_max(
            job_cronjob:kube_job_status_start_time:max,
          1)
          * ON(job) GROUP_LEFT()
          label_replace(
            label_replace(
              (kube_job_status_failed != 0),
              "job", "$1", "exported_job", "(.+)"),
            "cronjob", "$1", "label_cronjob", "(.+)")
      - alert: CronJobStatusFailed
        expr: |
          job_cronjob:kube_job_status_failed:sum
          * ON(cronjob) GROUP_RIGHT()
          kube_cronjob_labels
          > 0
        for: 1m
        annotations:
          description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
The jobTemplate must include a label called cronjob
that matches the name of the cronjob object.
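
To make the label requirement concrete, here is a minimal sketch of a CronJob manifest that would satisfy it. The name `nightly-report`, the schedule, and the container are placeholders, not taken from the article. The point is that the label sits under `jobTemplate.metadata.labels`, so it ends up on each Job the CronJob creates, which is where `kube_job_labels` picks it up as `label_cronjob`.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # hypothetical CronJob name
spec:
  schedule: "0 2 * * *"             # placeholder schedule
  jobTemplate:
    metadata:
      labels:
        cronjob: nightly-report     # must equal metadata.name above
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report            # placeholder container
            image: busybox:1.36
            command: ["sh", "-c", "echo generating report"]
```

If the label value and the CronJob name drift apart, the final join against `kube_cronjob_labels` presumably returns nothing and the alert never fires, so it is worth checking after renaming a CronJob.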
One way to monitor cronjobs with Prometheus is to have them push a metric indicating the last time they succeeded to the Pushgateway. You can then alert if the cronjob hasn't succeeded recently enough.
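
As a rough sketch of the alerting side, assuming the job pushes a gauge holding a Unix timestamp as its last step on success: the metric name `cronjob_last_success_timestamp_seconds`, the job name `nightly-report`, and the 25h threshold are all illustrative choices, not something the Pushgateway defines. It also assumes the Pushgateway scrape config sets `honor_labels: true`, so the pushed `job` label is kept rather than renamed to `exported_job`.

```yaml
groups:
- name: cronjob-pushgateway
  rules:
  - alert: CronJobNotSucceededRecently
    # Hypothetical gauge pushed by the job itself after a successful run.
    # 90000s = 25h: a daily schedule plus one hour of slack.
    expr: time() - cronjob_last_success_timestamp_seconds{job="nightly-report"} > 90000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: CronJob nightly-report has not succeeded in over 25h
```

Note that this expression returns nothing if the metric has never been pushed, so a job that has never succeeded would need a separate `absent(...)` rule.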