Troubleshooting · Intermediate

How to Fix High Error Rates in Production

High error rates damage user trust and business outcomes. Track, categorize, and systematically eliminate errors across your application stack.

14 min read
Atatus Team
Updated March 15, 2025
7 sections
01

Define What Constitutes a High Error Rate

Before you can fix high error rates, you need a clear threshold and classification system.

Error rates need to be measured in context to be meaningful. A 2% error rate on a payment endpoint is catastrophic, while a 2% error rate on a non-critical analytics endpoint may be acceptable. Define error rate thresholds for each endpoint class based on business impact: critical paths like checkout, authentication, and data submission should have near-zero tolerance for errors, while informational endpoints can tolerate higher rates before requiring immediate action.

Distinguish between different error categories: 4xx client errors indicate bad requests or unauthorized access, 5xx server errors indicate your backend is failing, and JavaScript exceptions indicate frontend code failures. These categories have very different root causes and remediation strategies. A spike in 400 Bad Request errors may indicate a client-side validation bug, an API contract change, or an attack, while a spike in 500 Internal Server Errors indicates your backend code is throwing unhandled exceptions.

Track error rates as a rolling window metric rather than a point-in-time count. A rate of 50 errors per minute is alarming on a low-traffic endpoint but entirely normal on a high-traffic service processing thousands of requests per minute. Calculate error rate as the percentage of requests resulting in errors, and measure this across 1-minute, 5-minute, and 1-hour windows to distinguish spikes from sustained degradation.

Establish a baseline error rate for each service and endpoint during normal operation. Error rates of exactly zero are extremely rare in production—network glitches, client disconnections, and edge cases produce occasional errors even in healthy systems. Your alert thresholds should be set relative to the baseline, such as alerting when the error rate exceeds baseline by more than 3x or crosses an absolute threshold of 1% for critical endpoints.
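The windowed measurement and baseline-relative alerting described above can be sketched as follows. This is an illustrative implementation, not a specific tool's API; the names (`RequestSample`, `errorRate`, `shouldAlert`) and the 3x/1% thresholds are taken from the text as example policy, to be tuned per endpoint.

```typescript
interface RequestSample {
  timestampMs: number;
  isError: boolean;
}

// Error rate as a percentage of requests in a trailing window,
// not a raw point-in-time count.
function errorRate(samples: RequestSample[], nowMs: number, windowMs: number): number {
  const inWindow = samples.filter((s) => nowMs - s.timestampMs <= windowMs);
  if (inWindow.length === 0) return 0;
  const errors = inWindow.filter((s) => s.isError).length;
  return (errors / inWindow.length) * 100;
}

// Alert relative to baseline (3x) or past an absolute floor (1%)
// on critical endpoints, as suggested above.
function shouldAlert(rate: number, baselineRate: number, isCritical: boolean): boolean {
  if (rate > baselineRate * 3) return true;
  return isCritical && rate > 1;
}
```

Computing the same rate over 1-minute, 5-minute, and 1-hour windows is then just three calls with different `windowMs` values, which is what distinguishes a brief spike from sustained degradation.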

02

Catch All Errors with Complete Stack Traces

Comprehensive error capture is the foundation of effective debugging.

Automatic error detection should capture every unhandled exception and rejection across both frontend and backend code without requiring manual instrumentation of each potential failure point. On the backend, this means installing error-handling middleware in frameworks like Express, Django, Rails, or Spring so exceptions are captured before they are swallowed or converted to generic 500 responses. On the frontend, this means registering window.onerror and window.onunhandledrejection handlers (or listening for the error and unhandledrejection events) to capture JavaScript exceptions and promise rejections that occur outside of try/catch blocks.
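A minimal sketch of the frontend half, assuming a hypothetical `send` transport to your error-tracking backend (the `ErrorReport` shape and `toReport` normalizer are illustrative, not from any particular SDK):

```typescript
interface ErrorReport {
  message: string;
  stack?: string;
  source: "onerror" | "unhandledrejection";
}

// Normalize whatever the browser hands us (Error, string, anything
// a promise was rejected with) into one report shape.
function toReport(err: unknown, source: ErrorReport["source"]): ErrorReport {
  if (err instanceof Error) return { message: err.message, stack: err.stack, source };
  return { message: String(err), source };
}

function installGlobalHandlers(send: (r: ErrorReport) => void): void {
  const w = (globalThis as any).window;
  if (!w) return; // guarded so the sketch is importable outside a browser

  w.onerror = (msg: unknown, _src: unknown, _line: unknown, _col: unknown, error: unknown) => {
    send(toReport(error ?? msg, "onerror"));
    return false; // let the browser's default handling run too
  };
  w.onunhandledrejection = (event: { reason: unknown }) => {
    send(toReport(event.reason, "unhandledrejection"));
  };
}
```

A real SDK would also attach the contextual attributes discussed below (user ID, URL, browser version) to each report before sending it.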

Complete stack traces with source code context are the difference between a bug that takes 5 minutes to find and one that takes 5 hours. A stack trace that shows only minified JavaScript filenames and line numbers provides little actionable information. Source map support allows your error tracking tool to transform minified stack traces back to their original source code, showing the exact function name, file path, and line number where the error occurred.

Capture contextual information alongside each error: the user ID, session ID, request URL, HTTP method, request headers, response status code, browser and OS version, and any custom attributes relevant to your application. This context transforms a cryptic error message into a reproducible bug report. When you know an error is only occurring for users on iOS 16 with Safari, or only on requests to a specific endpoint from a particular geographic region, you can immediately narrow down the investigation.

Group similar errors to identify patterns and measure true frequency. When the same null pointer exception occurs 10,000 times with slightly different stack traces, treating each occurrence as a unique error creates noise that makes it impossible to identify the highest-priority issues. Smart grouping algorithms that normalize variable data—such as user IDs, timestamps, and request-specific values—allow you to see that those 10,000 occurrences are actually one bug affecting a large fraction of your users.
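The normalization step can be sketched as a fingerprint function. The regexes below are illustrative stand-ins for the "smart grouping" described above — real grouping algorithms are considerably more sophisticated — but they show the core idea: strip variable data from the message, keep the stable top stack frames.

```typescript
// Build a grouping key from an error message and its top stack frames,
// normalizing away timestamps, hex addresses, and numeric IDs so that
// 10,000 occurrences of one bug collapse into one issue.
function fingerprint(message: string, topFrames: string[]): string {
  const normalized = message
    .replace(/\b\d{4}-\d{2}-\d{2}T[\d:.]+Z?\b/g, "<timestamp>")
    .replace(/\b0x[0-9a-f]+\b/gi, "<addr>")
    .replace(/\b\d+\b/g, "<n>");
  return [normalized, ...topFrames.slice(0, 3)].join("|");
}
```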

03

Understand Error Impact and Root Causes

Impact analysis helps you prioritize which errors to fix first.

Not all errors are equally important. An error affecting 1% of users on a rarely-visited page has less impact than an error affecting 0.1% of users on the checkout page. Measure error impact in terms of affected users and sessions, not just raw error counts. When you know that a specific database connection error is affecting 847 unique users per hour and blocking them from completing purchases, you have a clear business case for immediate escalation.
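Counting unique affected users per error group, rather than raw occurrences, is a small aggregation. The `ErrorOccurrence` shape here is illustrative:

```typescript
interface ErrorOccurrence {
  groupId: string;
  userId: string;
}

// Unique affected users per error group -- a noisy group that fires
// thousands of times for two users ranks below a quiet group that
// touches many distinct users.
function impactByGroup(events: ErrorOccurrence[]): Map<string, number> {
  const users = new Map<string, Set<string>>();
  for (const e of events) {
    if (!users.has(e.groupId)) users.set(e.groupId, new Set());
    users.get(e.groupId)!.add(e.userId);
  }
  return new Map([...users].map(([group, set]) => [group, set.size]));
}
```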

Correlation with deployments is one of the most powerful tools for identifying error root causes. When error rates spike within minutes of a deployment, the deployment is almost certainly responsible. Connect your error tracking to your deployment pipeline so that every release creates an annotation on your error rate timeline. This allows you to see at a glance that error rates were stable for three weeks, spiked after the Tuesday 2pm deployment, and returned to baseline after the rollback.

Distributed error tracing allows you to see not just that an error occurred but the full chain of events that led to it. In a microservices architecture, a database connection failure in a data access service might manifest as a generic 503 error in the API gateway. Without distributed tracing, you see only the 503 at the edge; with tracing, you see the complete causal chain from the user's request through every service that was called before the failure occurred.

Identify error clusters by analyzing common attributes across errors. If 80% of your 500 errors share the same database query signature, the same external API call, or the same authentication provider, that shared attribute is likely the root cause. Automatic clustering by stack trace fingerprint, error message, affected URL, and user segment can surface these patterns automatically rather than requiring manual analysis of thousands of individual error reports.
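The "80% share one attribute" heuristic above reduces to finding a dominant value among an attribute column of your error events. A minimal sketch (the 0.8 threshold is the example figure from the text, not a standard):

```typescript
// Return the attribute value (e.g. query signature, external API host)
// shared by at least `threshold` of the errors, or null if no single
// value dominates -- a dominant value is a root-cause candidate.
function dominantValue(values: string[], threshold = 0.8): string | null {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [value, count] of counts) {
    if (count / values.length >= threshold) return value;
  }
  return null;
}
```

Running this per attribute (stack fingerprint, URL, auth provider, query signature) across a spike's errors surfaces the shared factor without reading thousands of individual reports.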

04

Prioritize and Fix Critical Errors First

A triage system ensures maximum impact per unit of engineering effort.

Establish a priority matrix that scores errors by severity, frequency, and user impact. A severity-1 error is one that is completely blocking a critical user path, such as the inability to log in or complete a checkout. A severity-2 error is one that degrades functionality but does not completely block users. Assign on-call engineers to severity-1 errors with a response SLA of under 15 minutes, and severity-2 errors with a response SLA of under 4 hours during business hours.
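One way to make such a matrix operational is a numeric score. The weights and log-damping below are illustrative assumptions, not a standard formula — the point is that severity dominates, while frequency and user reach break ties within a severity band:

```typescript
type Severity = 1 | 2 | 3;

// Priority = severity weight x log-damped frequency x log-damped user reach.
// Log damping keeps a million-occurrence sev-3 nuisance from outranking
// a sev-1 checkout blocker.
function priorityScore(severity: Severity, occurrencesPerHour: number, uniqueUsers: number): number {
  const weight = { 1: 100, 2: 10, 3: 1 }[severity];
  return weight * Math.log10(1 + occurrencesPerHour) * Math.log10(1 + uniqueUsers);
}
```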

Alert on new errors or sudden increases in known error rates immediately, before users begin submitting support tickets. The goal is for your engineering team to be aware of and investigating a production error before the first user contacts customer support. This requires low-latency alerting—error detection to alert delivery should take no more than 1 to 2 minutes—and alert routing that pages the right team rather than everyone.

Implement error budgets aligned with your Service Level Objectives. If your SLO is 99.9% availability, your monthly error budget is approximately 43 minutes of downtime equivalent. Tracking error budget consumption in real time allows you to make informed decisions about shipping velocity versus reliability investment. When you have consumed 80% of your monthly error budget in the first week, that is a signal to slow down new feature releases and invest in stability.
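The budget arithmetic is simple enough to verify directly. For a 99.9% SLO over a 30-day window, the budget is (1 − 0.999) × 30 × 24 × 60 = 43.2 minutes — the "approximately 43 minutes" above:

```typescript
// Downtime-equivalent minutes allowed by an availability SLO
// over a rolling window (default: 30 days).
function budgetMinutes(slo: number, windowDays = 30): number {
  return (1 - slo) * windowDays * 24 * 60;
}

// Fraction of the budget already consumed; values near or above 1.0
// are the signal to trade shipping velocity for stability work.
function budgetConsumedFraction(downtimeMinutes: number, slo: number, windowDays = 30): number {
  return downtimeMinutes / budgetMinutes(slo, windowDays);
}
```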

Track error resolution status and verify fixes in production before marking them resolved. An error is not fixed until production monitoring confirms the error rate has returned to baseline—not when a developer marks the bug as resolved in the issue tracker. Implement a verification workflow that automatically checks whether an error recurs within 24 hours of being marked resolved, and reopens it automatically if the rate has not decreased.
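The recurrence check in that verification workflow can be sketched as a pure function. The `Issue` shape and the 24-hour window are taken from the text as an example policy:

```typescript
interface Issue {
  resolvedAtMs: number;
  status: "resolved" | "reopened";
}

// Reopen an issue if any occurrence lands within 24 hours after the
// resolution timestamp; earlier occurrences are the bug being fixed,
// and much later ones are treated as a fresh signal.
function verifyResolution(issue: Issue, occurrenceTimesMs: number[]): Issue["status"] {
  const windowMs = 24 * 60 * 60 * 1000;
  const recurred = occurrenceTimesMs.some(
    (t) => t > issue.resolvedAtMs && t - issue.resolvedAtMs <= windowMs
  );
  return recurred ? "reopened" : issue.status;
}
```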

05

Implement Error Prevention Practices

Proactive practices reduce the rate at which new errors are introduced.

Code review practices that include performance and error handling checklist items prevent many production errors before they are deployed. Review every pull request for missing null checks, unhandled promise rejections, missing error boundaries in React components, and database queries without appropriate error handling. Automated static analysis tools can enforce many of these checks programmatically, freeing reviewers to focus on logical errors that tools cannot detect.

Comprehensive test coverage for error scenarios is as important as testing the happy path. Every external dependency call should have test cases for timeout, connection failure, and unexpected response format. Every user input should have test cases for empty values, extremely long strings, special characters, and invalid formats. Teams that invest in testing error scenarios in development discover and fix the vast majority of production error causes before they reach users.

Feature flags allow you to gradually roll out changes to a small percentage of users before full deployment. This limits the blast radius of bugs to the fraction of users in the canary group. If error rates spike in the canary group, you can disable the flag and investigate without affecting the full user base. This practice alone can reduce the frequency and severity of production incidents by an order of magnitude compared to full-release deployments.
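The bucketing behind a gradual rollout is usually a deterministic hash, so a given user stays in or out of the canary across page loads. A sketch using FNV-1a (the hash choice and flag/user key format are illustrative; real flag systems add targeting rules on top):

```typescript
// Deterministically map (flag, user) into a bucket in [0, 100).
function bucket(flag: string, userId: string): number {
  let h = 2166136261; // FNV-1a offset basis
  for (const ch of flag + ":" + userId) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 16777619); // FNV-1a prime, 32-bit multiply
  }
  return (h >>> 0) % 100;
}

// A user sees the feature when their bucket falls under the rollout
// percentage; raising the percentage only ever adds users.
function isEnabled(flag: string, userId: string, rolloutPercent: number): boolean {
  return bucket(flag, userId) < rolloutPercent;
}
```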

Chaos engineering—intentionally injecting failures in controlled ways—tests your error handling code under realistic conditions. Many error handling code paths are written but never tested because the failures they handle are rare in development. Running chaos experiments that kill database connections, introduce network latency, and exhaust memory validates that your error handling code actually works and that errors are surfaced to monitoring rather than silently swallowed.
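At its smallest, failure injection is a wrapper around a dependency call. This sketch (names and the injected error message are illustrative) takes an injectable random source so experiments are reproducible in tests:

```typescript
// Wrap a dependency call so it fails with probability `failureRate`,
// exercising error-handling paths that rarely fire in development.
function withChaos<T>(
  fn: () => T,
  failureRate: number,
  random: () => number = Math.random
): () => T {
  return () => {
    if (random() < failureRate) {
      throw new Error("chaos: injected dependency failure");
    }
    return fn();
  };
}
```

A chaos experiment then replaces, say, a database client method with `withChaos(client.query, 0.05)` in a staging environment and checks that the resulting errors appear in monitoring rather than being silently swallowed.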

06

Monitor Frontend JavaScript Errors

JavaScript errors affect user experience directly and often go undetected without dedicated monitoring.

Frontend JavaScript errors typically receive far less monitoring investment than backend errors, yet they directly degrade user experience in ways that backend errors may not. A JavaScript exception that breaks the checkout button is invisible to backend monitoring but completely blocks users from converting. Implement browser error monitoring that captures and reports all unhandled exceptions, rejected promises, and resource loading failures to your error tracking system.

Browser compatibility errors are a class of frontend errors that predominantly affect specific browser versions, operating systems, or screen sizes. A CSS Grid layout issue may only manifest on specific versions of Safari on iOS, while a JavaScript API that is not polyfilled may fail only on older Chrome versions. Real User Monitoring data that includes browser version, OS, and device type alongside error reports makes these issues immediately identifiable.

Network errors on the frontend—failed API calls, CORS errors, and request timeouts—need to be tracked separately from JavaScript code exceptions. A user who cannot load their account data due to a network error has a completely degraded experience even though no JavaScript exception occurred. Monitor HTTP error rates from the browser's perspective by instrumenting fetch and XMLHttpRequest (Resource Timing entries alone do not reliably expose response status codes across browsers), and report them alongside your other error metrics.
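One common instrumentation pattern is a fetch wrapper that reports both HTTP-level failures and outright network errors. This is a sketch, not a specific SDK's API; `FetchLike` is deliberately narrowed, and the fetch implementation and `report` sink are injectable:

```typescript
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

// Wrap fetch so 4xx/5xx responses and network failures are reported
// to the error-tracking sink, while behavior seen by callers is unchanged.
function monitoredFetch(
  fetchImpl: FetchLike,
  report: (e: { url: string; status: number }) => void
): FetchLike {
  return async (url) => {
    try {
      const res = await fetchImpl(url);
      if (!res.ok) report({ url, status: res.status });
      return res;
    } catch (err) {
      report({ url, status: 0 }); // network failure: no status code exists
      throw err;
    }
  };
}
```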

Error boundaries in React applications prevent a single component error from crashing the entire application, but they need to be combined with error reporting to be effective for debugging. When an error boundary catches an error and renders a fallback UI, report that error to your monitoring system with the full component stack trace. Without this reporting, you have graceful degradation from the user's perspective but invisible errors from the engineering perspective.

07

Build a Culture of Error Ownership

Technical tools are only effective when paired with organizational practices that drive accountability.

Assign clear ownership to error types so that every error has a team responsible for investigating and resolving it. Unowned errors in a shared queue tend to be ignored until they become severe. When the authentication team owns all errors from the auth service and the payments team owns all errors from the checkout flow, error rates in those areas improve because the owning team experiences the cost of the errors in their own metrics.

Include error rate and error budget metrics in weekly engineering reviews and sprint retrospectives. Teams that review their error rates regularly develop better intuition for what causes them and build better habits around defensive coding, error handling, and pre-production testing. Error rates should be visible to the entire organization—including product managers and engineering leaders—to create alignment on the trade-offs between feature velocity and reliability.

Post-incident reviews (also called post-mortems) after significant error events create organizational learning that prevents recurrence. A blameless post-mortem analyzes what happened, why the error was not caught before production, how long it took to detect and resolve, and what systemic changes will prevent similar errors in the future. Document these reviews and share them broadly to propagate lessons across teams.

Set quarterly error rate reduction goals alongside feature delivery goals to signal that reliability is a first-class priority. When engineering teams are measured only on feature delivery velocity, error rates tend to increase over time as shortcuts accumulate. Explicit error rate targets with executive visibility create the organizational pressure needed to invest in reliability work before problems become crises.

Key Takeaways

  • Measure error rates as percentages of total requests rather than raw counts, and track them across 1-minute, 5-minute, and 1-hour windows by endpoint criticality
  • Complete stack traces with source maps and contextual attributes (user ID, session, browser) are essential for efficient debugging
  • Correlate error spikes with deployments immediately—most production error events are directly caused by recent code changes
  • Prioritize errors by the product of severity, frequency, and user impact rather than treating all errors with equal urgency
  • Verify fixes in production monitoring data before marking errors resolved—a fix is confirmed only when error rates return to baseline
  • Error budgets tied to SLOs create an objective framework for balancing feature velocity against reliability investment
Get started today

Monitor your applications with Atatus

Put the concepts from this guide into practice. Set up full-stack observability in minutes with no credit card required.

No credit card required · 14-day free trial · Setup in minutes
