Timeout trying to start flink job master for checkpointed job


I'm staging jars on Flink before execution using the /jars/upload endpoint. It seems that Flink's performance tanks when too many jars have been uploaded. All the endpoints became unresponsive, including the /jobs/<job_id> endpoint, and it was taking 1 to 2 minutes to load the job graph overview in the Flink UI. I imagine this REST endpoint uses the same Akka actor the job manager does, and I think I must have hit a tipping point where the slowdown started causing timeouts. I've reduced the number of jars from 30-odd to just the 4 latest versions and Flink is responsive again.
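If you want to automate that cleanup, here is a minimal sketch of how I'd prune old jars through the REST API. It assumes that GET /jars returns a JSON object with a "files" list whose entries carry "id" and "uploaded" (a timestamp) fields, and that DELETE /jars/<jar_id> removes one jar; check both against the REST docs for your Flink version. The function names, the base URL, and the default of keeping 4 jars are my own choices for illustration, not anything Flink prescribes.

```python
import json
import urllib.parse
import urllib.request


def jars_to_delete(files, keep=4):
    """Return the ids of every jar except the `keep` most recently uploaded.

    `files` is assumed to look like the "files" list from GET /jars:
    dicts with at least an "id" and an "uploaded" timestamp.
    """
    newest_first = sorted(files, key=lambda f: f["uploaded"], reverse=True)
    return [f["id"] for f in newest_first[keep:]]


def prune_jars(base_url, keep=4):
    """Delete all but the `keep` newest uploaded jars (hypothetical helper)."""
    # List the jars currently staged on the cluster.
    with urllib.request.urlopen(f"{base_url}/jars") as resp:
        files = json.load(resp)["files"]

    deleted = []
    for jar_id in jars_to_delete(files, keep):
        # Jar ids can contain characters that need escaping in a URL path.
        url = f"{base_url}/jars/{urllib.parse.quote(jar_id, safe='')}"
        req = urllib.request.Request(url, method="DELETE")
        urllib.request.urlopen(req).close()
        deleted.append(jar_id)
    return deleted
```

Calling something like prune_jars("http://localhost:8081") after each upload would keep the count low; the address here is just a placeholder for wherever your job manager's REST port is exposed. Sorting by upload time is a stand-in for "latest versions", so if your jar names encode versions you may prefer to sort on those instead.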