
Fix Timeout Errors

Timeout errors disrupt user experience and indicate performance issues. Identify timeout causes, optimize slow operations, configure appropriate timeouts, and prevent timeout failures.

Atatus Team
Updated March 15, 2025
01

Understanding Timeout Types and Their Causes

Timeouts have multiple distinct causes, each requiring a different remediation strategy.

Timeout errors occur when an operation takes longer than a configured maximum time limit and is aborted before completion. Timeouts exist at multiple levels of the stack: client-side HTTP request timeouts, API gateway and load balancer timeouts, application-level timeouts for external service calls, database query timeouts, and operating system socket timeouts. When a timeout triggers, the client receives an error but the upstream service may continue processing the request—creating a situation where the client retries while the original request is still running, potentially multiplying load on the slow component.

Distinguish between timeout misconfiguration (the timeout threshold is set too low for the operation's legitimate execution time) and genuine performance problems (the operation is taking longer than expected). A database query timeout of 100ms on a report generation query that legitimately needs 2 seconds is a misconfiguration. The same timeout on a simple user lookup query that normally takes 5ms but is taking 500ms due to a missing index is a genuine performance problem. Fixing misconfiguration requires adjusting timeout values; fixing performance problems requires optimizing the slow operation.

Cascading timeout failures amplify individual service slowness across the entire system. When Service A calls Service B and Service B calls Service C, and Service C is slow, Service B experiences timeouts when calling Service C. Service B then responds slowly to Service A, causing Service A to experience timeouts when calling Service B. If Service A does not handle these timeouts gracefully (fast failing rather than waiting), Service A's users also experience timeouts. A single slow service can cascade through the entire call chain, making multiple services appear unavailable even though only one is actually slow.

Timeout errors from different components have different characteristics: HTTP 408 Request Timeout (client did not send request in time), HTTP 504 Gateway Timeout (upstream server did not respond in time), HTTP 503 Service Unavailable (often from circuit breakers tripping due to timeouts), and connection reset errors (TCP connection terminated due to timeout). Each HTTP status code and error type provides context about where in the stack the timeout occurred, guiding you to the right component to investigate.

02

Monitor All Timeout Errors

Comprehensive timeout monitoring across all stack layers enables rapid diagnosis.

Track timeout errors separately from other error types because they indicate a different class of problem—not a code bug but a performance or configuration issue. Monitor timeout rate as a percentage of total requests per service and endpoint, and alert when the rate exceeds your SLO threshold. A timeout rate of 0.1% may be acceptable for a batch processing endpoint but unacceptable for a user-facing search endpoint. Set distinct alert thresholds for each endpoint class based on its impact on user experience and business metrics.

Correlate timeout spikes with infrastructure events: deployments, database maintenance windows, traffic spikes, memory pressure events, and network disruptions. Most timeout spikes are caused by specific events rather than gradual degradation. A deployment that introduced an N+1 query pattern may cause database timeouts that appear 10 to 30 minutes later as slow queries accumulate. A traffic spike that overwhelms the database connection pool causes connection acquisition timeouts. Deployment and infrastructure event annotations on timeout rate graphs accelerate root cause identification from hours to minutes.

Distributed tracing of timed-out requests provides the complete execution timeline showing how long the request ran before timing out. A trace for a timed-out request shows every operation with its duration—database queries, external API calls, internal service calls, middleware processing. You can see whether the request was 10% or 90% complete when it timed out, which component was executing at the timeout moment, and whether any individual operation appears responsible for the entire timeout. Trace data for slow and timed-out requests is the single most actionable data source for timeout investigation.

External service dependency timeouts require separate tracking from internal service timeouts because they have different remediation strategies. Internal service slowness is fixed by optimizing the service's code, infrastructure, or configuration. External service slowness requires implementing timeouts, circuit breakers, and fallbacks that degrade gracefully when the external service is slow. Monitor each external dependency's timeout rate and response time percentiles separately to identify which external services are causing the most timeout impact on your system.

03

Identify Why Operations Time Out

Root cause analysis of timeouts requires examining the operations executing at timeout time.

Database query timeouts are the most common category of timeout errors in web applications. A query that normally executes in 50ms may time out at 5 seconds when database CPU is saturated during a traffic spike, when a table lock blocks the query from executing, or when the query execution plan changes due to outdated statistics and performs a full table scan instead of an index scan. Use the distributed trace for timed-out requests to identify which specific query was executing at the moment of timeout, capture the query execution plan with EXPLAIN ANALYZE, and determine whether the slowness is due to data volume growth, contention, or execution plan degradation.

External API timeouts occur when third-party services (payment processors, shipping providers, authentication services, data providers) respond slowly. These timeouts are outside your control but can be mitigated with appropriate fallback strategies. Measure each external API's P50, P95, and P99 response times and set timeouts slightly above P99 to avoid timing out legitimate slow responses while still failing fast on completely unresponsive services. Implement circuit breakers that open when external API timeouts exceed a threshold, returning graceful error responses rather than waiting for timeouts on every request.
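The "set timeouts slightly above P99" guidance can be sketched in Python. This is a minimal illustration using a nearest-rank percentile; the function names and the headroom factor are assumptions for the example, not recommendations:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_timeout_ms(samples, headroom=1.25):
    """Client timeout set slightly above measured P99 latency."""
    return percentile(samples, 99) * headroom
```

Feed this with real response-time samples from your APM or access logs, and re-run it periodically so the timeout tracks the dependency's actual behavior.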

Memory pressure causes slowness that manifests as timeouts. When an application process is under memory pressure and garbage collection runs frequently, GC pause times add to operation duration and can cause operations that were previously within timeout thresholds to exceed them. Monitor JVM GC pause times (for Java applications), V8 GC events (for Node.js), or Python GC collection frequencies alongside timeout rates. When GC pauses and timeout rates are correlated, the fix is addressing the underlying memory issue—memory leak investigation, garbage collection tuning, or increasing heap allocation.

Resource contention under concurrent load causes timeouts that do not appear in isolation. A lock contention scenario where 50 concurrent requests compete for the same row-level lock causes all but one of them to wait—potentially exceeding timeout thresholds for the waiting requests. Database lock wait timeouts are a specific category distinct from query execution timeouts. Monitor lock wait times and deadlock frequency alongside query execution times. Lock contention is addressed by reducing transaction scope, using optimistic locking, or restructuring update patterns to avoid holding locks longer than necessary.

04

Configure Appropriate Timeout Values

Timeout values must be calibrated to actual operation performance to be effective.

Set timeout values based on measured P99 latency for each operation, not on round-number guesses. If your user authentication service responds in under 150ms for 99% of requests, a 500ms timeout gives adequate headroom while still failing fast when the service is genuinely unresponsive. If your report generation query takes 3 seconds at P99, a 100ms timeout will trigger constantly even when the service is healthy. Collect P99 latency data for all operations that have timeouts and review whether current timeout values are above P99 (allowing legitimate slow requests) or below P99 (triggering constantly for normal traffic).

Implement tiered timeout hierarchies where inner timeouts are shorter than outer timeouts. If a user-facing API endpoint has a 10-second timeout, the database queries it calls should have timeouts of 2 to 3 seconds, and the external APIs it calls should have timeouts of 1 to 2 seconds. This ensures that the outer timeout is reached only when inner operations have already failed and been handled rather than having the outer timeout expire while an inner operation is still running. Tiered timeouts enable graceful degradation: when a database query times out, the API can return a partial result or cached fallback rather than waiting for the full outer timeout.

Configure read timeouts and write timeouts separately where the technology supports it. Read operations (queries, GET requests) typically have lower timeout tolerance than write operations (inserts, updates, POST/PUT requests) because write operations may legitimately take longer due to transaction commit overhead, disk I/O, and replication. Setting the same timeout for reads and writes either allows reads to be slow (if set high for write compatibility) or causes false write failures (if set low for read latency). PostgreSQL supports statement_timeout and lock_timeout separately; HTTP clients support separate connect timeout and read timeout.

Document timeout decisions alongside their rationale in configuration files and runbooks. A timeout value of 3,000ms looks arbitrary without context; a comment noting 'set to 3x P99 query time of 900ms, last reviewed 2024-01' provides context for future reviewers who need to evaluate whether the value is still appropriate. As service performance improves through optimization, timeout values may become conservatively high; as data volumes grow, previously adequate timeouts may become too tight. Regular review of timeout configurations against current P99 measurements ensures timeouts remain correctly calibrated.

05

Implement Resilience Patterns for Timeout Handling

Resilience patterns prevent timeout failures from cascading and degrading the entire system.

Circuit breakers prevent cascade failures by detecting when a downstream service is timing out repeatedly and switching to a fast-fail mode that returns errors immediately without waiting for timeouts. A circuit breaker in the closed state allows requests through normally. When timeout failures exceed a threshold (e.g., 50% of requests in a 10-second window), the circuit opens, and subsequent requests fail immediately with a circuit-open error rather than waiting for a timeout. After a configured delay, the circuit transitions to half-open, allowing a test request through to check whether the service has recovered. If the test request succeeds, the circuit closes; if it fails, the circuit remains open.

Retry logic with exponential backoff handles transient failures that resolve within a few seconds. When a request times out, immediately retrying may encounter the same conditions that caused the timeout. Exponential backoff waits progressively longer between retries (100ms, 200ms, 400ms, 800ms) to give the system time to recover. Add jitter (random variation) to retry delays to prevent synchronized retry storms from multiple clients. Critically, only retry idempotent operations—retrying a payment charge or an order creation can cause duplicate operations if the original request completed despite appearing to time out from the client's perspective.

Fallback responses provide degraded-but-functional results when operations time out. Instead of propagating a timeout error to the user, return a cached version of the data, a best-effort partial result, a default value, or a user-friendly error message. A product recommendation API that times out can fall back to returning best-selling items from a fast cache rather than returning an error that breaks the page layout. A user profile service that times out can fall back to a minimal cached profile rather than causing authentication to fail. Design fallback responses for each timeout scenario and test them explicitly.

Bulkhead patterns isolate failures by dividing resources into separate pools for different clients or request types, preventing one slow operation from consuming all resources and timing out unrelated operations. Dedicate a separate thread pool, connection pool, or request queue to your highest-priority operations (user-facing requests) separate from lower-priority operations (analytics, batch processing, background sync). When batch processing creates database connection pool pressure, it affects the batch processing pool but not the user-facing request pool. Bulkheads contain the blast radius of slow operations and prevent them from degrading unrelated functionality.

06

Prevent Timeout Errors Proactively

Proactive performance management reduces timeout frequency more effectively than reactive incident response.

Performance testing that measures P99 latency under realistic load prevents deploying code that will time out under production traffic. Load tests that simulate production traffic patterns, concurrent user counts, and data volumes identify endpoints where P99 latency approaches timeout thresholds before deployment. Set performance test exit criteria based on SLO compliance: fail the test if P99 latency exceeds 80% of the timeout threshold under peak expected load. This ensures a 20% buffer between measured worst-case performance and the timeout threshold.

Capacity planning prevents resource exhaustion that causes timeouts under traffic growth. When database CPU, connection pool utilization, or application server memory approaches the threshold where performance degrades, timeouts begin to occur for the users unlucky enough to arrive during resource-constrained windows. Monitor resource utilization trends and plan infrastructure scaling when P95 utilization exceeds 60 to 70% of maximum capacity—providing enough lead time for infrastructure provisioning before timeout rates become user-visible.

Slow query detection and automated alerting catches database performance regressions before they cause timeouts. A query that takes 5ms in development might take 500ms in production after 6 months of data growth if it lacks proper indexes. Continuous slow query log monitoring (queries exceeding 100ms) and APM query performance tracking detect this drift early. When a previously fast query suddenly appears in the slow query log, investigate immediately—the transition from fast to slow is often a single threshold crossing (table size exceeds buffer pool size, for example) that will not self-resolve.

Dependency health checks and SLO monitoring for external services provide early warning of degrading external APIs before their slowness causes timeout errors in your application. Many third-party services publish status pages and incident notifications, but these lag behind actual degradation. Monitoring external API response times from your application provides direct measurement of the service as experienced by your requests, with no lag. Configure alerts when external API P99 response time exceeds 50% of your client-side timeout value, providing a warning window to implement fallbacks before timeouts begin.

Key Takeaways

  • Set timeout values to slightly above P99 latency for each operation—timeouts set below P99 trigger constantly during normal operation; timeouts too far above P99 allow slow operations to cascade
  • Circuit breakers prevent cascade failures by detecting timeout threshold violations and switching to fast-fail mode, returning errors immediately rather than waiting for timeouts on every request
  • Tiered timeout hierarchies ensure inner operation timeouts (database queries, external APIs) are shorter than outer request timeouts, enabling graceful degradation and partial results
  • Retry only idempotent operations—retrying payment charges, order creation, or other non-idempotent operations after a timeout can cause duplicate operations if the original request completed silently
  • Bulkhead patterns isolate resource pools for different operation types, preventing batch processing or analytics operations from consuming connection pool capacity needed for user-facing requests
  • Distributed traces for timed-out requests reveal the exact operation executing at timeout time—this data immediately identifies whether the timeout is from database slowness, external API delay, or memory pressure