Troubleshooting · Intermediate

Reduce 502 Bad Gateway Errors

502 errors indicate upstream server issues. Monitor gateway errors, diagnose backend failures, optimize upstream timeouts, and maintain reliable service communication.

13 min read
Atatus Team
Updated March 15, 2025
6 sections
01

Understanding 502 Bad Gateway Errors

502 errors are gateway-level failures with multiple distinct root causes.

A 502 Bad Gateway error is returned by a gateway or proxy (nginx, Apache, AWS ALB, Kubernetes Ingress) when the upstream server it is proxying to returns an invalid response or does not respond at all. The key distinction from a 503 is that 502 indicates the upstream server was reached but responded incorrectly or not at all, while 503 typically indicates the service is intentionally unavailable (overloaded or in maintenance). A 502 can originate from the upstream process crashing or being killed unexpectedly, network connectivity issues between the gateway and the upstream, or the upstream returning a malformed (non-HTTP) response.

Common 502 scenarios and their signatures: Application crashes generate 502s that correlate with error log entries in the upstream application logs showing unhandled exceptions. Upstream process restarts (from OOM kills, deployment restarts, or process manager restarts) generate brief 502 storms that resolve within seconds as the process restarts. Configuration errors (wrong backend port, incorrect upstream address) generate 502s that begin immediately after a configuration change and persist until corrected. Resource exhaustion (too many concurrent connections, file descriptor limits exceeded) generates 502s that correlate with high traffic and resource utilization metrics.

The gateway's error log is the first source of 502 diagnostic information. nginx logs 502 errors with context: the upstream address that returned the error, the error code (502), and often an additional error code (recv() failed, connect() failed, upstream timed out). The error code provides the specific failure mechanism: recv() failed indicates the upstream closed the connection unexpectedly; connect() failed indicates the upstream refused the connection or was unreachable; upstream timed out indicates the upstream took too long to respond. Each error code points to different root causes.
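As a quick illustration, the failure mechanisms can be tallied straight out of the error log with grep. The log lines below are synthetic samples in the standard nginx error-log format (the real file usually lives at /var/log/nginx/error.log); the tally shows at a glance which root cause dominates.

```shell
# Synthetic nginx error-log sample showing the three common 502 signatures
cat > /tmp/nginx_502_sample.log <<'EOF'
2025/03/15 10:01:02 [error] 123#0: *45 recv() failed (104: Connection reset by peer) while reading response header from upstream, upstream: "http://10.0.1.5:8080/api"
2025/03/15 10:01:03 [error] 123#0: *46 connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.1.6:8080/api"
2025/03/15 10:01:04 [error] 123#0: *47 upstream timed out (110: Connection timed out) while reading response header from upstream, upstream: "http://10.0.1.5:8080/api"
EOF

# Tally failure mechanisms to see which root cause dominates
grep -oE 'recv\(\) failed|connect\(\) failed|upstream timed out' /tmp/nginx_502_sample.log \
  | sort | uniq -c | sort -rn
```

Run against a real error log during an incident, a tally dominated by one mechanism immediately narrows the investigation: mostly connect() failures point at a dead or unreachable process, mostly recv() failures point at crashes mid-request.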

502 error spikes have characteristic temporal patterns that indicate different root causes. A brief 5-second spike of 502s that resolves on its own is consistent with an application server restart or a brief connectivity interruption. Sustained 502s at a consistent rate indicate a subset of backend instances are unavailable—the load balancer is routing some requests to healthy instances (returning 200) and some to unhealthy instances (returning 502). 502 rates that correlate exactly with traffic spikes indicate capacity issues where the backend cannot handle the request rate.

02

Track All 502 Errors

Comprehensive 502 monitoring enables rapid detection and impact assessment.

Monitor 502 error rates as a proportion of total requests, not just as an absolute count. Ten 502 errors per hour on a low-traffic internal API may be more concerning than 100 per hour on a high-traffic public API if the former represents 10% of requests and the latter represents 0.01%. Track 502 rates by service, endpoint, geographic region, and time of day to identify patterns. Alert on 502 rates exceeding 0.1% for user-facing endpoints and 1% for internal service-to-service communication, adjusting thresholds based on each endpoint's historical baseline and business impact.
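The proportion-versus-count point can be made concrete with a few lines. This is a minimal sketch (the function name and the 0.1% threshold are illustrative, not from any particular monitoring product):

```python
def should_alert(errors_502: int, total_requests: int, threshold: float) -> bool:
    """Alert on the 502 proportion of traffic, not the absolute count."""
    if total_requests == 0:
        return False
    return errors_502 / total_requests > threshold

# 10 errors in 100 requests on a low-traffic internal API: 10% of traffic
low_traffic_alert = should_alert(10, 100, 0.001)          # True
# 100 errors in 1,000,000 requests on a busy public API: 0.01% of traffic
high_traffic_alert = should_alert(100, 1_000_000, 0.001)  # False
```

The low-traffic API with fewer absolute errors is the one that should page, because a tenth of its requests are failing.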

Track 502 errors at multiple levels of the stack to understand the blast radius of an upstream issue. If a single backend service is failing, the load balancer serving it generates 502s (at the gateway level), the calling application service receives errors (at the application level), and end users experience failures (at the browser/client level). Correlating 502 rates across gateway logs, application error monitoring, and end-user error tracking reveals the complete scope of impact and validates that a remediation at the gateway level (health-checking the failing instance out of rotation) actually resolved the end-user experience.

Distinguish between 502 errors from different upstream services in your error monitoring to enable targeted response. In a microservices architecture, a gateway may proxy to 10 to 20 different backend services. When 502 rates increase, identifying whether the increase is concentrated on requests to a specific upstream service (suggesting a single-service incident) or spread across all services (suggesting gateway or network infrastructure issues) determines the appropriate escalation path. Tag 502 errors with the upstream service identifier in your monitoring system to enable this segmentation.
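One way to get the upstream identifier into every log line is a custom nginx access-log format using the built-in `$upstream_addr` and `$upstream_status` variables. A sketch (the format name and paths are placeholders):

```nginx
# Tag each request with the upstream that served (or failed) it,
# so 502s can be segmented per backend service
log_format upstream_diag '$remote_addr [$time_local] "$request" '
                         '$status upstream=$upstream_addr '
                         'upstream_status=$upstream_status '
                         'rt=$request_time urt=$upstream_response_time';
access_log /var/log/nginx/access.log upstream_diag;
```

With this in place, a log pipeline can group 502s by `upstream=` and immediately show whether one service or all of them are failing.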

User impact measurement for 502 errors—how many unique users received 502 responses, which pages or features were affected, and which user sessions were completely interrupted—provides the business context needed to prioritize incident response. A 502 rate of 1% affecting checkout endpoint users means 1% of users attempting to purchase received errors, with direct revenue impact. The same 502 rate on a secondary analytics endpoint has negligible user impact. Prioritize incident response by affected user count and business criticality of the affected endpoint.

03

Identify Upstream Failures

Upstream failure diagnosis requires examining both gateway metrics and upstream service health simultaneously.

Health check failures are a leading indicator of upcoming 502 errors. Load balancers and Kubernetes services periodically send health check requests to backend instances and remove instances that fail health checks from the rotation. Monitor health check failure rates per instance—a single instance failing health checks will have its traffic redistributed to healthy instances, potentially overloading them. When health check failure rates increase across multiple instances simultaneously, investigate the root cause (OOM kills, slow startup after deployment, database connectivity issues) rather than just waiting for instances to recover.

Application crash analysis using crash logs and APM error monitoring reveals why upstream processes are exiting unexpectedly. A Java application generating OutOfMemoryError stacktraces, a Node.js process receiving uncaught exceptions, or a Python process hitting memory limits are all application-level failures that cause the process to exit, resulting in 502 errors from the gateway. Correlate gateway 502 spikes with application error logs and process restart events to confirm application crashes as the 502 cause and identify the specific exception that caused the crash.

Resource limit violations (OOMKill in Kubernetes, memory limit exceeded in Docker) terminate processes without generating application logs—the process is killed by the kernel before it can log. Identify these resource-related 502 causes by checking Kubernetes pod events for OOMKilled status, Linux dmesg logs for OOM kill messages, and container runtime metrics for memory limit violations. The distinctive signature is 502 errors with no corresponding application error logs—the process was terminated before it could log anything.
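A couple of diagnostic commands cover both levels of this check. These assume access to `kubectl` against the affected cluster and shell access to the node (pod names and output fields follow standard Kubernetes conventions):

```shell
# List pods whose last container termination reason was OOMKilled
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled

# Kernel-level confirmation on the node itself
dmesg -T | grep -i 'killed process'
```

If the pod shows `OOMKilled` and dmesg shows a matching kill timestamp that lines up with the 502 spike, the memory limit is the root cause despite the silent application logs.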

Network connectivity issues between the gateway and upstream services cause connection-refused or connection-timed-out 502s. These are visible in gateway error logs as 'connect() to upstream failed' or 'upstream connection refused'. Check network ACLs, security group rules, and firewall policies between the gateway and upstream services. In Kubernetes, check that Service selector labels match the pod labels, that the correct port is defined in the Service spec, and that NetworkPolicy rules allow traffic from the ingress controller to the application pods. Verify that the upstream port is actively listening with a direct connectivity test.
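A short checklist of commands covers these checks; the service name, IP, and port below are placeholders for your own values:

```shell
# An empty ENDPOINTS column means the Service selector matches no pods
kubectl get endpoints my-app

# Confirm the upstream port is actually listening, from the gateway's network
nc -zv 10.0.1.5 8080

# Full HTTP round trip with a short connect timeout
curl -sv --connect-timeout 2 http://10.0.1.5:8080/health
```

A successful `nc` but failing `curl` points at the application layer; a failing `nc` points at networking, firewalls, or a process that is not listening.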

04

Configure Gateway Timeout Settings

Incorrect timeout configuration is a common cause of 502 errors that look like backend failures.

Gateway timeout settings must be calibrated to upstream service response times. nginx's proxy_connect_timeout (time to establish connection to upstream), proxy_send_timeout (time to send request to upstream), and proxy_read_timeout (time to read response from upstream) all default to 60 seconds. If your upstream service consistently responds in under 5 seconds but occasionally takes 70 seconds for complex operations, the 60-second default proxy_read_timeout will cut off those occasional slow requests, surfacing them to clients as gateway errors (502 or 504, depending on the proxy and its retry configuration). Increase gateway timeouts to match your P99 upstream response time plus appropriate headroom.
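Calibrated against the example above, a timeout block might look like this. The values are illustrative, derived from the hypothetical ~70-second P99 mentioned in the text; yours should come from measured latency plus headroom:

```nginx
location /api/ {
    proxy_pass http://backend;        # "backend" is a placeholder upstream
    proxy_connect_timeout 5s;         # connection setup should always be fast
    proxy_send_timeout    30s;
    proxy_read_timeout    90s;        # P99 (~70s) plus headroom
}
```

Keeping proxy_connect_timeout short while lengthening proxy_read_timeout distinguishes "cannot reach the backend" (fail fast) from "the backend is legitimately slow" (wait).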

The AWS Application Load Balancer (ALB) idle timeout default of 60 seconds causes 502 errors for long-running requests or WebSocket connections. The ALB drops connections idle for more than 60 seconds, which manifests as 502 errors for requests that take longer than the idle timeout. Increase the ALB idle timeout in the Load Balancer settings to 180 or 300 seconds for applications with long-running requests, and configure keep-alive settings on backend instances to ensure the ALB's connection to backends stays active. The backend keep-alive timeout must be longer than the ALB idle timeout to prevent connection resets.
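The idle timeout can be changed from the AWS CLI as well as the console; the load balancer ARN below is a placeholder:

```shell
# Raise the ALB idle timeout to 300 seconds (value illustrative)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```

Remember the pairing rule from above: after raising this value, raise the backend keep-alive timeout above it as well.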

Upstream keep-alive connection management affects 502 error rates under load. When nginx maintains persistent connections to upstream backends and the upstream closes a connection that nginx believes is still alive, the first request on that dead connection receives a 502. nginx should retry the request on a fresh connection (configured with proxy_next_upstream error timeout http_502), but incorrect configuration can cause the 502 to be returned to the client. Configure proxy_http_version 1.1 and proxy_set_header Connection '' in nginx to enable HTTP/1.1 keep-alive with backends, and set proxy_next_upstream to retry on connection errors.
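The directives described translate into a small nginx fragment (placed in the location or server block that proxies to the backend; the retry count is an illustrative choice):

```nginx
proxy_http_version 1.1;                        # required for upstream keep-alive
proxy_set_header Connection "";                # strip the default "close" header
proxy_next_upstream error timeout http_502;    # retry on dead-connection failures
proxy_next_upstream_tries 2;                   # bound retries to avoid amplification
```

With this in place, a request that lands on a stale keep-alive connection is transparently retried on a fresh one instead of returning a 502 to the client.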

Circuit breaker configuration at the gateway level protects upstream services from being overwhelmed by requests when they are degraded. Nginx with the nginx_upstream_check_module or HAProxy passive health checks remove unresponsive backends from rotation. Kubernetes Ingress controllers with custom circuit breaker annotations, or service meshes like Istio with outlier detection, can automatically eject endpoints that return 502 errors above a configurable rate. Configure circuit breakers to open when an upstream returns 502 for more than 20% of requests in a 30-second window, preventing continued routing to a failing backend.
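In Istio, outlier detection is configured on a DestinationRule. Note that Istio ejects hosts after a number of consecutive 5xx errors rather than an error percentage, so the values below approximate the rate-based policy described above; the rule name and host are placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-outlier-detection
spec:
  host: backend.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # eject after 5 consecutive 5xx responses
      interval: 30s               # analysis sweep interval
      baseEjectionTime: 60s       # how long an ejected host stays out
      maxEjectionPercent: 50      # never eject more than half the pool
```

The maxEjectionPercent cap matters: without it, a cluster-wide dependency failure could eject every backend and convert a partial outage into a total one.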

05

Implement Health Checks and Graceful Handling

Proactive health management prevents users from being routed to failing backends.

Active health checks poll backend instances independently of user traffic, allowing the load balancer to detect and remove unhealthy instances before users are routed to them. Configure health checks that test the full application stack—not just that the process is running (TCP check) but that it can actually serve requests (HTTP check to a /health endpoint). A health endpoint that queries the database, checks cache connectivity, and validates application-level readiness catches more failure modes than a simple TCP ping. Set health check intervals appropriate to your tolerance for routing users to unhealthy backends: 5 to 10 second intervals are standard for latency-sensitive applications.
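The "full stack" health endpoint can be sketched as a function that runs each dependency probe and only reports healthy if all pass. This is a framework-agnostic sketch; the check names and lambdas are hypothetical stand-ins for real database and cache pings:

```python
from typing import Callable, Dict, Tuple

def health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency probe; return 200 only if every one passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A probe that raises counts as a failure, not a crash
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

# Hypothetical probes — replace with a real "SELECT 1" and redis PING
status, body = health_check({
    "database": lambda: True,
    "cache":    lambda: True,
})
```

Returning 503 (not 502) from the health endpoint itself is deliberate: it tells the load balancer "remove me from rotation" without looking like a gateway failure.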

Passive health checks (outlier detection) complement active health checks by detecting instances that are responding to traffic with errors. When a backend instance returns 502 or 5xx responses for more than a configured percentage of requests, passive health checks eject it from the rotation without waiting for the next active health check cycle. This provides faster failure detection for instances that are partially degraded (responding to some requests but failing others) compared to active health checks that may happen to succeed even when the instance is intermittently failing.

Graceful shutdown handling in upstream services prevents 502 errors during deployments and scaling events. When an upstream instance receives a shutdown signal (SIGTERM), it should stop accepting new connections, complete all in-flight requests, and then exit. Without graceful shutdown, the instance exits immediately when signaled, causing in-flight requests to receive 502 errors and causing the gateway to return 502 for new requests routed to the shutting-down instance before health checks have detected the change. Configure terminationGracePeriodSeconds in Kubernetes and SIGTERM handlers in your application to ensure graceful shutdown.

Upstream connection pooling in your reverse proxy or load balancer affects how quickly new connections can be established when existing ones fail. nginx's keepalive directive in upstream blocks configures connection reuse between nginx and upstream backends. With keepalive, nginx maintains a pool of persistent connections to each backend, reducing connection establishment overhead and providing faster failover when a connection fails (the pool creates a replacement). Setting nginx upstream keepalive to a value matching your expected concurrency per backend instance provides efficient connection management while preventing connection pool exhaustion.
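The pool is configured with the keepalive directive inside the upstream block (the addresses and pool size below are placeholders; keepalive counts idle connections kept open per worker process, and it requires the HTTP/1.1 proxy settings described earlier):

```nginx
upstream backend {
    server 10.0.1.5:8080;
    server 10.0.1.6:8080;
    keepalive 32;    # idle persistent connections cached per worker
}
```

Sizing the pool near your expected per-backend concurrency avoids both constant reconnection (pool too small) and holding large numbers of idle sockets open on the backends (pool too large).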

06

Monitor and Reduce 502 Errors Continuously

Sustainable 502 reduction requires systematic monitoring and organizational practices.

Post-mortem analysis of 502 incidents identifies systemic issues that cause repeated incidents. When 502 errors occur due to the same root cause multiple times—repeated OOM kills, repeated connection pool exhaustion during traffic spikes, repeated deployment-related 502 storms—the recurring cause represents an unresolved systemic issue. Document post-mortems with root cause analysis, contributing factors, timeline, impact metrics, and action items. Track action item completion to ensure systemic fixes are implemented rather than just describing the incident without structural improvement.

Automated 502 rate alerting with runbooks enables consistent and fast incident response. When 502 rates exceed thresholds, alerts should fire with enough context for the on-call engineer to begin investigation immediately: which service is generating 502s, the 502 rate and trend, and a link to the runbook describing common causes and diagnostic steps for this service. Runbooks that document the 3 to 5 most common causes of 502s for each service, how to identify each cause from available logs and metrics, and the remediation steps for each cause reduce mean time to resolution from hours to minutes.
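As one way to wire this up, a Prometheus alerting rule can carry the runbook link in its annotations. The metric name below assumes an nginx exporter exposing per-status request counters, and the runbook URL is a placeholder; adapt both to your stack:

```yaml
groups:
  - name: gateway-502
    rules:
      - alert: High502Rate
        expr: |
          sum(rate(nginx_http_requests_total{status="502"}[5m])) by (service)
            / sum(rate(nginx_http_requests_total[5m])) by (service) > 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "502 rate above 0.1% for {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/502/{{ $labels.service }}"
```

Because the expression groups by service, the alert that fires already answers the first triage question: which upstream is failing.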

Load testing with synthetic traffic validates that 502 rates remain acceptable under peak load before deploying changes. A load test that drives 150% of expected peak traffic while monitoring 502 rates, gateway logs, and backend health metrics provides confidence that the system can handle traffic spikes without generating 502 errors. Gate deployments on load test pass/fail criteria that include 502 rate targets: deployments that cause 502 rates above 0.1% during load tests require investigation before production deployment.

Track 502 error trends as a long-term reliability metric alongside availability, error rate, and latency. A team that tracks 30-day 502 rates as part of its error budget consumption makes 502 reduction a concrete, measurable reliability goal. When 502 rates trend upward over weeks—even gradually—it indicates that reliability is degrading faster than it is being improved. Regular review of 502 trends at weekly engineering meetings keeps the team aware of reliability trajectory and creates accountability for addressing the root causes of increases rather than dismissing individual incidents.

Key Takeaways

  • 502 errors from nginx include specific error codes (recv() failed, connect() failed, upstream timed out) that identify the failure mechanism—check the gateway error log before the application log
  • Resource limit violations (OOMKill) terminate processes without generating application logs—the absence of application error logs during 502 spikes is itself diagnostic of memory limit issues
  • Gateway timeout values must exceed P99 upstream response times by 20-30%—timeouts set below P99 generate 502s for normal slow requests rather than only for genuinely failed requests
  • Graceful shutdown handling is mandatory to prevent 502 storms during deployments—configure SIGTERM handlers that complete in-flight requests before exiting, with terminationGracePeriodSeconds set appropriately
  • Active health checks (polling /health endpoints) combined with passive health checks (outlier detection on 5xx responses) provide the fastest detection of partially degraded instances before they affect many users
  • ALB and nginx idle timeouts default to 60 seconds—WebSocket connections and long-running requests require increasing these timeouts, and backend keep-alive timeouts must exceed gateway idle timeouts