How We Broke the Kubernetes Job Controller with 7,500 Crashlooping Pods
2026-02-04

We had noticed a gradual slowdown for a long time but did not think much of it, since the cluster was seeing increased adoption and heavier workloads. Eventually we reached a point where Jobs were taking ages, more than 5 minutes, from the moment they were created until they actually spawned a Pod. The control plane felt sluggish. We checked node resources, but CPU and memory looked good, with plenty available. API server latency was acceptable.
Confused, we decided to zoom out. Instead of looking at our usual workloads, we looked at the entire cluster state.
```bash
kubectl get jobs -A | grep Running | wc -l
```
The result gave us a collective heart attack: >7,500.
A quick investigation showed that almost every single one of those pods belonged to a single, unassuming CronJob in a system-related namespace, intended to sync some configuration. They were all failing and crashlooping.
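Attributing that many pods to a single owner is easier with a grouping one-liner. Here is a sketch that tallies non-running pods per namespace; the canned sample stands in for real `kubectl get pods -A --no-headers` output, and all names and counts are made up:

```bash
# Sample lines mimicking `kubectl get pods -A --no-headers` output.
# In a live cluster you would pipe the real command in instead.
sample='kube-utils  sync-28451-abcde  0/1  CrashLoopBackOff  312  26h
default     web-1             1/1  Running           0    3d
kube-utils  sync-28452-fghij  0/1  CrashLoopBackOff  298  25h'

# Drop Running pods, keep the namespace column, count per namespace.
printf '%s\n' "$sample" \
  | grep -v Running \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn
```

The namespace with the highest count jumps straight to the top, which in our case pointed at the CronJob's namespace immediately.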
The Mistake
We dug up the CronJob manifest. In hindsight it looks obviously wrong, but at the time nobody noticed, and it is easy to copy/paste something like this without thinking too much about it.
Here is the deadly combination that brought our Job Controller to its knees:
```yaml
spec:
  schedule: '*/5 * * * *'
  jobTemplate:
    spec:
      # Missing backoffLimit! (Defaults to 6)
      template:
        spec:
          restartPolicy: OnFailure # <-- The culprit
          containers:
            - command: ["/bin/bash", "/scripts/failing_script.sh"]
```
If you have some Kubernetes® experience, you might spot the issue immediately. If not, here is the mechanism of failure:
- The script inside the container failed (exit code 1).
- Because of `restartPolicy: OnFailure`, the Kubelet on the node intercepted the failure and restarted the container within the same Pod.
- The Pod status went into `CrashLoopBackOff`.
- Crucially, because the Pod itself never reached a “Failed” terminal phase, the Job Controller didn’t count it against the default `backoffLimit` of 6 immediately.
Every 5 minutes, a new Job spun up, created a pod, which immediately entered an infinite, local crashloop.
We didn’t just have failing jobs; we had an ever-growing army of zombie pods thrashing the Kubelet and flooding the API server with constant status updates. The Job Controller was so busy processing thousands of crashloop events that it couldn’t schedule legitimate work.
The Fix
The fix was simple. We needed to tell Kubernetes: “If this container fails, let the Pod die completely so the Controller knows about it.”
```yaml
spec:
  jobTemplate:
    spec:
      backoffLimit: 2 # Give up after 2 tries
      template:
        spec:
          restartPolicy: Never # <-- Let it die
```
By changing to `restartPolicy: Never`, a script failure terminates the Pod. The Job Controller sees the failure, increments the backoff counter, and eventually marks the Job as failed, cleaning up the mess.
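Beyond the restart policy, a few more CronJob fields limit the blast radius of a failing schedule. A sketch of the knobs involved (the values are illustrative, not our actual manifest):

```yaml
spec:
  schedule: '*/5 * * * *'
  concurrencyPolicy: Forbid      # don't start a new Job while the previous one still runs
  startingDeadlineSeconds: 120   # skip a run entirely if it can't start in time
  failedJobsHistoryLimit: 3      # keep only a few failed Jobs around
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 300 # hard-stop the Job after 5 minutes, no matter what
      template:
        spec:
          restartPolicy: Never
```

With `concurrencyPolicy: Forbid` alone, the pile-up would have been capped at one Job instead of an ever-growing backlog.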
The Real Lesson
It is very easy to look at that bad YAML now and say, “Well, that was stupid.”
But Kubernetes manifests are complex, and defaults can bite you. It’s an easy mistake to make, especially when rushing a small utility script into production.
The real failure here wasn’t the bad YAML itself. The real failure was that we didn’t know about it until it had reached 7,500 pods.
We were monitoring for node health and API availability but did not yet have robust enough alerts for “total non-running pods in the cluster”.
If you don’t have high-level alerts for abnormally high pod counts or aggregate crashloop rates across your entire cluster, you are just waiting for a silent, misconfigured CronJob to choke your control plane. Don’t rely on users telling you the cluster feels slow.
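If you run kube-state-metrics, Prometheus alerts along these lines are a reasonable starting point; the metric names come from kube-state-metrics, and the thresholds are illustrative, not tuned values:

```yaml
groups:
  - name: cluster-pod-health
    rules:
      - alert: TooManyNonRunningPods
        # kube_pod_status_phase is exported by kube-state-metrics
        expr: sum(kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Abnormally many non-running pods cluster-wide"
      - alert: ClusterWideCrashLoops
        expr: sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 50
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "High aggregate CrashLoopBackOff count across the cluster"
```

Either rule would have fired hours before anyone noticed the control plane slowing down.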