
Reduce Kubernetes Pod Errors

Kubernetes pod errors disrupt services and degrade availability. Learn to monitor pod health, diagnose crash loops, optimize resource limits, and maintain reliable container orchestration.

16 min read · Atatus Team · Updated March 15, 2025 · 7 sections
01

Understanding Kubernetes Pod Lifecycle and Failure Modes

Pod failures have distinct patterns that indicate specific root causes.

Kubernetes pods progress through defined lifecycle phases: Pending (waiting to be scheduled), Running (executing), Succeeded (all containers exited successfully), Failed (all containers have terminated and at least one exited in failure), and Unknown (state cannot be determined). Pod phase transitions and the reasons for each transition provide the first layer of diagnostic information. A pod stuck in Pending indicates scheduling failure—no node has sufficient resources, no matching node selector, or no available PersistentVolumes. A pod transitioning repeatedly between Running and Failed indicates a crash loop.

CrashLoopBackOff is one of the most common and visible pod error states. It occurs when a container crashes and Kubernetes attempts to restart it repeatedly, applying an exponential backoff delay between restarts to prevent thrashing. The backoff period starts at 10 seconds and doubles with each restart, up to a maximum of 5 minutes. A pod in CrashLoopBackOff with many restarts may wait a full 5 minutes between attempts, making the pod appear idle even though Kubernetes will retry it once the backoff expires.
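To diagnose a crash loop, the essential step is retrieving logs from the container instance that crashed, not the current one. A typical inspection sequence (substitute your own pod name and namespace) might look like:

```shell
# Logs from the previous (crashed) container instance — the original error
# is often only visible here, not in the freshly restarted container
kubectl logs <pod-name> -n <namespace> --previous

# Restart count, last termination state, and backoff events
kubectl describe pod <pod-name> -n <namespace>
```

The Events section at the bottom of `kubectl describe pod` output shows the restart history and the current backoff state.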

OOMKilled (Out of Memory Killed) occurs when a container exceeds its memory limit and the Linux kernel's OOM killer terminates it. The pod's containers exit with reason OOMKilled, and unless restart policy is Never, the pod is restarted. Unlike CrashLoopBackOff from application errors, OOMKilled happens silently without application-level error output—the container simply disappears and is restarted. Check the lastState.terminated.reason field in pod describe output to confirm OOMKilled status.
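Because OOMKilled leaves no application-level trace, the confirmation lives in the pod's container status. A JSONPath query (pod name and namespace are placeholders) can pull the termination reason directly:

```shell
# Prints "OOMKilled" if the kernel OOM killer terminated the container;
# an application crash typically shows "Error" with a non-zero exit code
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```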

ImagePullBackOff occurs when Kubernetes cannot pull the container image from the registry. This can be caused by an incorrect image name or tag, a non-existent image version, network connectivity issues from the node to the registry, or missing or invalid registry authentication credentials (imagePullSecrets). ImagePullBackOff blocks pod startup entirely, making it critical to resolve before any other pod health issues. Check node-to-registry network connectivity and verify that imagePullSecrets reference valid registry credentials.

02

Monitor Pod Status and Lifecycle Events

Continuous monitoring of pod state changes enables rapid detection of pod failures and faster response.

Track pod restart counts as the primary indicator of pod health instability. A container restart count of 0 indicates no failures since the last deployment. Restart counts above 5 in a 24-hour period indicate a persistent issue that needs investigation rather than just tolerating Kubernetes's automatic recovery. Set up alerts on container restart rate—more than 2 restarts per minute for any container—to detect crash loops before the backoff period extends to 5-minute intervals and delays your alerting.
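If you run Prometheus with kube-state-metrics (an assumption; adjust the metric source to your stack), the restart-rate alert described above can be sketched as a recording rule like this:

```yaml
groups:
  - name: pod-restarts
    rules:
      - alert: ContainerCrashLooping
        # Fires when a container restarts more than ~2 times per minute,
        # before CrashLoopBackOff reaches its 5-minute backoff ceiling
        expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 > 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting rapidly"
```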

Monitor pod evictions, which occur when nodes experience resource pressure (CPU, memory, or disk) and evict lower-priority pods to reclaim resources for higher-priority pods. Evictions are usually silent from the application perspective—the pod terminates and is rescheduled elsewhere—but they can cause service disruption if multiple pods are evicted simultaneously or if replacement pods cannot be scheduled. Track eviction events with alerts so that node resource pressure is investigated before it becomes a systemic cluster health problem.

Liveness and readiness probe failures provide early warning of application health issues before pods enter the failure cycle. Liveness probe failures trigger container restarts; readiness probe failures remove pods from service load balancing without restarting them. Track probe failure rates separately: high liveness probe failure rates indicate application crashes or hangs, while high readiness probe failure rates indicate the application is running but not ready to serve traffic (slow startup, database connection failures, memory pressure). Tune probe thresholds and delays to match your application's actual startup and health check behavior.

Kubernetes events provide a rich stream of operational information that is often overlooked in favor of pod status monitoring. Events include scheduling decisions, resource warnings, image pull attempts, probe failures, and node assignments. Events have a short default retention period (1 hour in most clusters), so export events to your centralized logging system to retain them for post-incident analysis. Alert on Warning-type events and on events indicating backoff conditions to catch emerging issues before they become visible user-facing failures.
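Before a dedicated event exporter is in place, Warning-type events can be inspected ad hoc across all namespaces:

```shell
# All Warning events cluster-wide, most recent last
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
```

For retention beyond the default window, run a dedicated event-exporter component that ships events to the same backend as your container logs.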

03

Optimize Pod Resource Requests and Limits

Correct resource configuration is essential for pod stability and efficient cluster utilization.

Resource requests tell the Kubernetes scheduler how much CPU and memory a pod needs; a node must have that much unallocated capacity for the pod to be scheduled there. Resource limits tell the kubelet the maximum CPU and memory the container is allowed to consume. The gap between requests and limits determines resource headroom—a pod requesting 512Mi memory with a 1Gi limit can burst to 1Gi but is guaranteed 512Mi. Setting requests too low causes pods to be scheduled on under-resourced nodes and compete for resources; setting limits too low causes OOMKilled events when the application exceeds the limit.

Right-size resource requests by measuring actual resource consumption under production load. Monitor the p95 CPU and memory usage of each container over at least one week to understand normal consumption patterns under typical and peak load. Set memory requests slightly above the p95 memory consumption to handle normal variation, and set memory limits at 20 to 50% above the request to allow for burst capacity without causing instability. For CPU, requests can be set conservatively because CPU is compressible—exceeding the limit throttles the container rather than killing it.
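Applied to a container spec, the sizing guidance above might look like this (the values are illustrative; derive them from your own p95 measurements):

```yaml
resources:
  requests:
    cpu: 250m        # conservative; CPU is compressible, excess is throttled
    memory: 512Mi    # slightly above measured p95 memory consumption
  limits:
    memory: 768Mi    # ~50% headroom above the request for burst capacity
```

Some teams deliberately omit the CPU limit to avoid throttling entirely and rely on requests alone for scheduling fairness; whether that is acceptable depends on how strictly you need to cap noisy neighbors.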

OOMKilled events require immediate memory limit increases, but simply raising limits without understanding why the application is using more memory than expected is not sufficient. Examine heap profiler data and application metrics to determine whether the increased memory usage is due to legitimate growth (more data, more users), a memory leak, or a misconfigured cache that is consuming more memory than intended. OOMKilled events that correlate with specific request patterns or data volumes indicate correct application behavior that requires a limit adjustment; OOMKilled events that correlate with runtime duration indicate memory leaks.

CPU throttling occurs when a container attempts to use more CPU than its limit. Unlike memory limits, CPU limits do not kill the container—they throttle it, slowing execution in proportion to how much the container exceeds its limit. Heavy CPU throttling (above 20% of CPU time being throttled) causes latency increases and can manifest as event loop lag in Node.js or GC pauses in JVM applications. Monitor CPU throttling separately from CPU utilization—a container with 50% average CPU utilization but 30% throttling has a CPU limit that is too low for its peak usage patterns.
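The throttled fraction can be computed from the cAdvisor CFS metrics, assuming a Prometheus setup. A sketch of an alert on the 20% threshold mentioned above:

```yaml
- alert: ContainerCpuThrottled
  # Fraction of CFS scheduling periods in which the container was throttled
  expr: |
    rate(container_cpu_cfs_throttled_periods_total[5m])
      / rate(container_cpu_cfs_periods_total[5m]) > 0.20
  for: 10m
  labels:
    severity: warning
```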

04

Diagnose Issues from Pod Logs and Events

Log analysis is the primary method for diagnosing application-level errors within pods.

Centralized log aggregation from all pods in all namespaces is essential for effective Kubernetes debugging. Without centralized logging, accessing logs from restarted or evicted pods requires interacting with each node directly, and logs from terminated pods may already be lost by the time you investigate. Use Fluentd, Fluent Bit, or a similar log collector with a storage backend (Elasticsearch, Loki, or Splunk) to collect, index, and retain logs from all containers. Configure log retention to at least 7 days to support incident investigations.
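As one concrete collection path, a minimal Fluent Bit pipeline tailing container logs and shipping them to Loki might be sketched as follows (the Loki host is a placeholder for your own endpoint):

```
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri
    Tag     kube.*

[FILTER]
    # Enriches each record with pod name, namespace, and labels
    Name    kubernetes
    Match   kube.*

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.logging.svc
```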

Correlate error log entries with pod events to build a timeline of failures. When a pod is OOMKilled, the sequence is typically: application logs showing increasing memory pressure (GC warnings, allocation failures), followed by a kernel OOM kill event in system logs, followed by a Kubernetes restart event in pod status. Understanding this sequence before investigating helps you know where to look first. Build log queries that join application error logs with Kubernetes event logs by pod name and time range.

Distinguish between application errors and infrastructure errors in your log analysis. An application that logs 'database connection refused' is experiencing an infrastructure problem (the database is unreachable) rather than an application bug. An application that logs 'null pointer exception' has a code bug. Infrastructure errors often correlate across multiple pods simultaneously, while application bugs typically affect individual pods or appear after specific user actions. This distinction guides the investigation toward either infrastructure remediation or code debugging.

Implement structured logging in your applications to make log queries in Kubernetes environments efficient and reliable. Unstructured log lines like 'Error processing order 12345: insufficient inventory' require complex regex parsing to analyze at scale. Structured JSON log entries with explicit fields (error_type, order_id, user_id, component, severity) allow you to aggregate, filter, and analyze logs using exact field matches rather than fragile text patterns. Most Kubernetes log aggregation systems handle JSON logs natively, enabling powerful analysis without preprocessing.
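A minimal structured-logging sketch in Python shows the idea: a custom formatter emits each record as one JSON object, and structured fields ride along via the standard library's `extra` mechanism (field names like `error_type` and `order_id` are illustrative):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "severity": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields attached via logger's `extra` argument
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One JSON line instead of 'Error processing order 12345: ...'
logger.error(
    "insufficient inventory",
    extra={"fields": {"error_type": "inventory", "order_id": 12345}},
)
```

Log backends that parse JSON natively can then filter on `order_id` or `error_type` as exact field matches rather than regex patterns.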

05

Configure Health Checks and Graceful Shutdown

Well-configured health probes and shutdown handling prevent request loss and unnecessary restarts.

Readiness probes determine whether a pod should receive traffic from services and load balancers. A new pod that has started successfully but has not yet completed its warm-up phase—loading cache data, establishing database connections, or initializing in-memory state—should fail readiness checks until it is fully prepared. Configure readiness probe initialDelaySeconds to wait for your application's typical startup time, failureThreshold to require multiple consecutive failures before marking unready (to avoid flapping during brief slowness), and periodSeconds to check frequently enough that unhealthy pods are quickly removed from rotation.

Liveness probes determine whether a pod should be restarted. They are intended to detect situations where an application is running but cannot make progress—a deadlock, an infinite loop, or an unresponsive state that the application cannot recover from on its own. Liveness probes should be simpler and more permissive than readiness probes. A liveness check that calls a complex health endpoint can itself cause a restart if the health endpoint becomes slow under load—use a lightweight check like a simple HTTP GET to a /healthz endpoint that returns 200 with no database or external dependency calls.
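Put together, the readiness and liveness guidance above might look like this in a container spec (the paths, port, and timings are illustrative assumptions, not prescriptions):

```yaml
readinessProbe:
  httpGet:
    path: /ready          # passes only once warm-up completes
    port: 8080
  initialDelaySeconds: 10  # typical startup time for this app
  periodSeconds: 5
  failureThreshold: 3      # tolerate brief slowness before removing from rotation
livenessProbe:
  httpGet:
    path: /healthz        # lightweight, no database or external calls
    port: 8080
  periodSeconds: 10
  failureThreshold: 6      # permissive: restart only on sustained failure
```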

Startup probes are specifically designed for applications with slow or variable startup times. Without startup probes, Kubernetes uses liveness probes during startup, and a liveness probe failure during a slow startup triggers unnecessary pod restarts. Startup probes disable liveness and readiness checks during the startup phase, allowing the application to take as long as it needs to complete initialization. Configure the startup probe so that failureThreshold × periodSeconds covers the slowest expected startup: failureThreshold=30 with periodSeconds=10 allows up to 300 seconds of startup time.
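The 300-second example translates directly into a probe spec (path and port are illustrative):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 attempts x 10s = up to 300s of startup time
  periodSeconds: 10      # liveness/readiness probes stay disabled until this passes
```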

Graceful shutdown prevents request loss when pods are terminated. When Kubernetes terminates a pod—for scaling down, rolling updates, or eviction—it sends SIGTERM to the container and waits for terminationGracePeriodSeconds (default 30 seconds) before sending SIGKILL. Applications must handle SIGTERM by stopping accepting new requests, completing in-flight requests, and then exiting. If your application ignores SIGTERM or takes longer than terminationGracePeriodSeconds to shut down, in-flight requests are dropped with connection errors. Monitor shutdown duration and configure terminationGracePeriodSeconds to match your application's actual shutdown time.
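The SIGTERM handling described above can be sketched in Python: a signal handler flips a flag that gates new work, so the readiness check starts failing and the load balancer drains the pod during the grace period. The function names are illustrative; wire the flag into your actual server's accept path and readiness endpoint.

```python
import signal
import threading

# Set once SIGTERM arrives; in-flight work is allowed to drain before exit
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Stop accepting new requests; existing requests complete normally."""
    shutting_down.set()


signal.signal(signal.SIGTERM, handle_sigterm)


def accept_request():
    # Readiness-style gate: refuse new work after SIGTERM so the load
    # balancer removes this pod within terminationGracePeriodSeconds
    return not shutting_down.is_set()
```

If draining routinely takes longer than 30 seconds, raise `terminationGracePeriodSeconds` in the pod spec to match the measured shutdown time rather than letting SIGKILL drop in-flight requests.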

06

Manage Pod Scheduling and Anti-Affinity

Scheduling configuration affects both availability and performance of deployed applications.

Pod anti-affinity rules prevent Kubernetes from scheduling multiple replicas of the same service on the same node, ensuring that a single node failure does not take down all instances. By default, Kubernetes may schedule all replicas of a deployment on the same node if it has sufficient capacity, creating a single point of failure. Configure requiredDuringSchedulingIgnoredDuringExecution anti-affinity rules for critical services to enforce hard distribution across nodes, or preferredDuringSchedulingIgnoredDuringExecution for softer distribution that is maintained when possible but not required.
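A hard anti-affinity rule spreading replicas across nodes might look like this in a deployment's pod template (the `app: checkout` label is a hypothetical example):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: checkout
        # One replica per node; use a zone topology key for zone-level spread
        topologyKey: kubernetes.io/hostname
```

Note that with the `required` form, the deployment cannot scale beyond the number of eligible nodes; the `preferred` form degrades gracefully instead.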

Pod Disruption Budgets (PDBs) limit the number of pods that can be simultaneously disrupted during voluntary disruptions like cluster maintenance, node drains, or rolling updates. Without a PDB, a cluster administrator draining multiple nodes for maintenance could simultaneously remove all instances of your service from the cluster. Configure PDBs with minAvailable or maxUnavailable settings that ensure enough replicas remain available to handle your expected traffic during disruptions. PDBs are respected by Kubernetes's eviction API but not by forceful terminations.
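A minimal PDB for the same hypothetical service, guaranteeing two replicas survive any voluntary disruption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2        # node drains pause rather than drop below this
  selector:
    matchLabels:
      app: checkout
```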

Resource quotas and LimitRanges enforce resource governance at the namespace level, preventing any single application from consuming disproportionate cluster resources. LimitRanges set default request and limit values for containers that do not specify them explicitly, ensuring all containers have resource boundaries even if developers forget to set them. ResourceQuotas cap total resource consumption per namespace, preventing one team's deployment from consuming resources needed by other teams' critical services.
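Sketched as namespace-level objects (values are illustrative placeholders):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits limits
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "10"       # cap on total requested CPU in the namespace
    requests.memory: 20Gi
```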

Node affinity and taints/tolerations direct specific workloads to appropriate nodes. Assign GPU nodes to ML workloads using node selectors, direct memory-intensive databases to high-memory nodes with node affinity, and isolate sensitive workloads on dedicated nodes using taints that only specific pods with matching tolerations can schedule on. This scheduling control improves both performance (workloads run on appropriate hardware) and security (sensitive workloads are isolated from shared infrastructure).
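For example, after tainting GPU nodes with `kubectl taint nodes <node> workload=ml:NoSchedule`, only pods carrying a matching toleration can land there (the label and taint values here are hypothetical):

```yaml
tolerations:
  - key: workload
    value: ml
    effect: NoSchedule       # allows scheduling onto the tainted GPU nodes
nodeSelector:
  accelerator: nvidia-gpu    # and steers the pod specifically to them
```

The taint keeps general workloads off the expensive nodes; the selector or node affinity ensures the ML pods actually prefer them.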

07

Implement Reliable Rolling Updates and Rollbacks

Deployment strategies determine how much risk is introduced during updates and how quickly problems can be remediated.

Rolling updates replace old pods with new pods gradually, maintaining service availability throughout the update process. Configure maxUnavailable to control how many pods can be simultaneously unavailable (reducing service capacity) and maxSurge to control how many extra pods can be created above the desired replica count (increasing resource consumption temporarily). A conservative update strategy of maxUnavailable=0 and maxSurge=1 ensures zero capacity reduction during updates by always adding a new pod before removing an old one.
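The conservative strategy described above is a small fragment of the Deployment spec:

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # add one new pod before removing an old one
```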

Health check gates in rolling updates prevent a bad update from propagating across all replicas. The rolling update process only proceeds to terminate old pods when new pods pass their readiness probes. If new pods fail readiness checks (because the new application version is crashing, cannot connect to the database, or has a configuration error), the update pauses with some old pods still running. This automatic pause limits the blast radius of bad deployments to the maxSurge extra pods, while the majority of traffic continues to be served by healthy old pods.

Automated rollbacks triggered by alert conditions provide faster recovery than manual detection and response. Configure your deployment pipeline to monitor error rates and response latency for a defined period after each deployment, and automatically trigger kubectl rollout undo if metrics degrade beyond a threshold. Automated rollbacks can recover from bad deployments in 2 to 5 minutes, compared to 15 to 30 minutes for manual detection, escalation, and rollback initiation. Most GitOps tools and deployment pipelines support automated rollback based on monitoring alerts.
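The decision logic behind such an automated rollback can be sketched independently of any particular pipeline tool. This hypothetical function compares post-deploy metrics against a pre-deploy baseline; on a True result, the pipeline would invoke `kubectl rollout undo deployment/<name>` (the thresholds are illustrative, not recommendations):

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    baseline_p95_ms, current_p95_ms,
                    error_ratio_threshold=2.0,
                    latency_ratio_threshold=1.5):
    """Return True if post-deployment metrics degraded past thresholds."""
    # Error rate doubled (or worse) relative to the pre-deploy baseline
    if baseline_error_rate > 0 and \
            current_error_rate / baseline_error_rate >= error_ratio_threshold:
        return True
    # Errors appeared where there were none before
    if baseline_error_rate == 0 and current_error_rate > 0.01:
        return True
    # p95 latency regressed by 50% or more
    if current_p95_ms / baseline_p95_ms >= latency_ratio_threshold:
        return True
    return False
```

Running this check continuously for, say, 10 minutes after each rollout gives the 2-to-5-minute recovery window described above without waiting for a human to notice the alert.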

Blue-green deployments and canary deployments provide lower-risk alternatives to rolling updates for high-stakes changes. Blue-green deployments run the new version alongside the old version with a complete second set of pods, then switch traffic from old to new in a single step using a service selector update. Canary deployments route a small fraction of traffic (typically 5 to 10%) to the new version while monitoring for errors, then gradually increase the fraction until the full rollout is complete. Both strategies provide a safe escape hatch of reverting to the previous version without a rolling update cycle.

Key Takeaways

  • CrashLoopBackOff indicates repeated container crashes with exponential backoff delays—check container logs immediately after the first restart to capture the original error before log rotation
  • OOMKilled happens silently without application error output—check pod.status.containerStatuses[].lastState.terminated.reason to confirm OOMKill vs application error
  • Set memory requests to p95 actual consumption and limits at 20-50% above requests to balance scheduling efficiency against OOMKill risk
  • Readiness probes should reflect application readiness to serve traffic (pass only when fully initialized); liveness probes should be simple checks with high tolerance to avoid restart loops under load
  • Pod anti-affinity rules prevent all replicas from being scheduled on the same node—without them, a single node failure can take down all instances of a service
  • Automated rollbacks triggered by post-deployment metric degradation can recover from bad deployments in minutes rather than the 15-30 minutes typical for manual detection and response