Troubleshooting · Intermediate

Fix Slow API Endpoints

Slow API endpoints frustrate users and degrade application performance. This guide shows how to identify bottlenecks, optimize queries, implement caching, and deliver consistently fast API responses.

14 min read
Atatus Team
Updated March 15, 2025
6 sections
01

Systematically Identify Your Slowest Endpoints

Data-driven prioritization ensures optimization effort delivers maximum user impact.

Start with a comprehensive inventory of all API endpoints sorted by P95 response time. The P95 metric shows the response time that 95% of requests complete within—it reveals endpoints where a significant fraction of requests are slow, even if the median (P50) appears acceptable. An endpoint with P50 of 80ms and P95 of 2,000ms has a severe tail latency problem affecting 5% of requests. In a system handling 1,000 requests per second, that is 50 slow requests per second, and on a critical endpoint like checkout or search, this significantly degrades user experience.

Weight endpoint priority by both P95 latency and request volume. An endpoint taking 5 seconds at P95 but receiving only 10 requests per day has lower total user impact than an endpoint taking 500ms at P95 and receiving 10,000 requests per hour. Calculate impacted user-seconds per day: (P95 latency - P50 latency) × requests per day gives the total excess latency absorbed by the worst-performing fraction of your users. This metric helps prioritize endpoints where optimization delivers the most user experience improvement per engineering hour invested.
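The prioritization arithmetic above can be sketched in a few lines of Python. The endpoint names and stats here are hypothetical, purely for illustration:

```python
# Hypothetical per-endpoint stats: (P50 ms, P95 ms, requests per day).
endpoints = {
    "/checkout":     (80, 2000, 200_000),
    "/admin/report": (900, 5000, 10),
    "/search":       (120, 500, 240_000),
}

def excess_latency_seconds(p50_ms, p95_ms, requests_per_day):
    """Daily excess latency absorbed by the slow tail, in seconds."""
    return (p95_ms - p50_ms) * requests_per_day / 1000

# Rank endpoints by cumulative user-experienced excess latency.
ranked = sorted(
    endpoints.items(),
    key=lambda kv: excess_latency_seconds(*kv[1]),
    reverse=True,
)
for path, stats in ranked:
    print(f"{path}: {excess_latency_seconds(*stats):,.0f} excess seconds/day")
```

Note how the low-volume admin report, despite a 5-second P95, ranks far below the high-volume checkout endpoint.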

Track SLO compliance per endpoint to identify when endpoints begin degrading before they become critical. Define SLOs for each endpoint class (100ms for cached reads, 500ms for database reads, 2,000ms for complex computations) and monitor what fraction of requests meet the SLO. An endpoint that met its SLO for 99.5% of requests two weeks ago but meets it for only 95% today is degrading—investigating before it drops further is much faster than an emergency response once it reaches 80% compliance. SLO dashboards that show historical compliance trends enable proactive rather than reactive optimization.

Separate baseline latency from regression investigation. Some endpoints are slow because they perform genuinely expensive operations (aggregating millions of records, generating large reports), while others are slow due to specific bugs or regressions. For genuinely expensive operations, the goal is optimization to reduce complexity. For regressions, the goal is identifying what changed and reverting or fixing the change. Correlating latency increases with deployment timestamps immediately distinguishes regressions from baseline characteristics and guides the appropriate remediation path.

02

Diagnose Performance Bottlenecks with Tracing

Distributed tracing reveals the complete breakdown of time spent within each slow request.

Distributed tracing provides a complete timeline of every operation within a slow request. When an API endpoint is slow, a trace shows exactly how much time was spent in each operation: authentication middleware (50ms), database query 1 (80ms), external API call (350ms), database query 2 (120ms), serialization (20ms)—total 620ms. Without tracing, you know the endpoint is slow but not why or where. With tracing, you immediately know the external API call is consuming 57% of the time and is the highest-priority optimization target.
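Real distributed tracing comes from an APM agent or an instrumentation library such as OpenTelemetry, but the per-operation breakdown it produces can be sketched with a minimal hand-rolled span timer. The handler and sleep durations below are invented stand-ins for real operations:

```python
import time
from contextlib import contextmanager

# Minimal manual "span" timer — a sketch of what tracing tools record
# automatically for every operation inside a request.
spans = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

def handle_request():
    with span("auth"):
        time.sleep(0.005)       # stand-in for auth middleware
    with span("db_query"):
        time.sleep(0.010)       # stand-in for a database query
    with span("serialize"):
        time.sleep(0.002)       # stand-in for response serialization

handle_request()
total = sum(ms for _, ms in spans)
for name, ms in sorted(spans, key=lambda s: -s[1]):
    print(f"{name}: {ms:.1f}ms ({ms / total:.0%} of total)")
```

The sorted output immediately names the dominant operation — exactly the question a trace waterfall answers at a glance.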

Capture traces for outlier requests specifically, not just for a random sample. P99 performance issues are caused by rare conditions—lock contention, external API degradation, cache miss storms, garbage collection pauses—that are invisible in P50-weighted random sampling. Tail-based sampling that preserves 100% of requests exceeding your SLO thresholds provides a complete record of all slow requests for investigation, while sampling only a fraction of fast, successful requests to manage storage costs. This ensures you have tracing data for the requests that need investigation.

Identify serial operations in traces that could be parallelized. When a trace shows three sequential database queries that are independent of each other—each starting only after the previous completes—you have a parallelization opportunity. Executing independent operations concurrently with Promise.all() (Node.js), asyncio.gather() (Python), or parallel executor service calls (Java) reduces total time from the sum of all operations to the duration of the slowest one. A common scenario of three 100ms queries taking 300ms sequentially can be reduced to 100ms in parallel.
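A minimal sketch of the sequential-versus-parallel difference using Python's asyncio.gather; the fetch coroutine is a stand-in for any independent I/O-bound operation such as a database query:

```python
import asyncio
import time

async def fetch(name, delay):
    # Stand-in for an independent I/O-bound operation (e.g. a DB query).
    await asyncio.sleep(delay)
    return name

async def sequential():
    # Each await starts only after the previous one completes.
    return [await fetch("a", 0.1), await fetch("b", 0.1), await fetch("c", 0.1)]

async def parallel():
    # All three run concurrently; total time ≈ the slowest one.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))

start = time.perf_counter()
asyncio.run(sequential())
seq_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
results = asyncio.run(parallel())
par_ms = (time.perf_counter() - start) * 1000

print(f"sequential: {seq_ms:.0f}ms, parallel: {par_ms:.0f}ms")
```

The parallel version finishes in roughly the duration of one operation instead of three, matching the 300ms-to-100ms scenario above.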

Compare traces for slow versus fast requests to identify what is different. A fast execution of an endpoint at 150ms and a slow execution of the same endpoint at 800ms have traces you can compare side by side. The difference in trace structure—an additional cache miss, a longer database query, a retry on an external service, an extra authentication lookup—directly reveals the cause of the slow execution. This differential analysis is one of the most effective debugging techniques because it eliminates the noise of normal operation and highlights only what changed.

03

Optimize Database Queries in API Endpoints

Database operations are the most common bottleneck in API endpoints.

Analyze database queries executed by each slow endpoint using APM query traces. Look for four categories of problematic patterns: N+1 queries (many small queries that should be one bulk query), missing indexes (queries performing full table scans), oversized result sets (fetching all columns and rows when only a subset is needed), and unnecessarily complex joins (multiple JOINs that could be simplified or replaced with application-level joins). Each category has a specific fix, and a single endpoint often has multiple categories simultaneously.
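The N+1 pattern and its bulk-query fix can be illustrated with a toy in-memory data layer that counts "queries"; in a real codebase the fix is your ORM's bulk-loading API (for example select_related/prefetch_related in Django, or includes in ActiveRecord):

```python
# Toy data layer that counts queries to make the N+1 pattern visible.
ORDERS = {1: ["o1", "o2"], 2: ["o3"], 3: []}
query_count = 0

def fetch_orders(user_id):
    global query_count
    query_count += 1                 # one query per call
    return ORDERS[user_id]

def fetch_orders_bulk(user_ids):
    global query_count
    query_count += 1                 # one IN (...) query for all users
    return {uid: ORDERS[uid] for uid in user_ids}

user_ids = [1, 2, 3]

# N+1 version: one query per user (3 queries here, N for N users).
n_plus_one = {uid: fetch_orders(uid) for uid in user_ids}

# Bulk version: a single query regardless of how many users.
query_count = 0
bulk = fetch_orders_bulk(user_ids)
```

Both versions return identical data; only the query count differs, which is why N+1 problems hide so easily until request volume or list sizes grow.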

Add EXPLAIN ANALYZE output for the top 5 slowest queries in each slow endpoint. Most APM tools can capture query execution plans alongside timing data. The execution plan shows whether indexes are being used, how many rows are being scanned versus returned (a large ratio indicates poor index efficiency), and where the most time is spent within the query. A plan showing 'Seq Scan' on a large table is an immediate indicator that an index would help. A plan showing 1,000,000 rows scanned to return 10 results indicates either a missing index or a query that needs fundamental redesign.

Implement query result caching at the repository or service layer to eliminate repeated identical database queries. For read-heavy endpoints that serve the same or similar data to many users, caching query results in Redis with appropriate TTLs can reduce database load by 80 to 95% and cut endpoint latency by 70 to 90%. The cache key should include all query parameters that affect the result—user ID for user-specific data, filters for filtered lists, page number for paginated results. Cache invalidation based on data modification events (cache tags, event-driven invalidation) ensures stale data is not served.

Optimize data fetching to retrieve only the data your endpoint actually needs. If an endpoint returns a list of user names and emails, do not query for all user fields and discard the extras. SELECT id, name, email FROM users is faster than SELECT * FROM users both because less data is transferred and because the query optimizer can often satisfy the query from an index without a table scan (index-only scan). Apply the same principle at the ORM level: User.objects.only('id', 'name', 'email') in Django, User.select(:id, :name, :email) in ActiveRecord.

04

Implement Caching at Multiple Levels

Multi-layer caching eliminates redundant work at different levels of the request stack.

HTTP response caching with Cache-Control headers allows clients and CDNs to cache API responses without any server involvement for repeat requests. Public endpoints (product catalog, public pricing, publicly accessible content) can be cached at CDNs with TTLs matching the data's change frequency. Private endpoints (user-specific data) can be cached at the client level with appropriate max-age values. Even a 10-second cache TTL on a high-traffic public endpoint collapses origin traffic to roughly one request per TTL window: 1,000 requests per second becomes approximately 0.1 requests per second, a 99.99% reduction during traffic peaks.

Application-level caching with Redis or Memcached stores computation results that are expensive to generate and reusable across multiple users or requests. Complex aggregations, expensive analytical queries, external API responses with slow providers, and computed recommendation results are good candidates for application-level caching. Design cache keys that uniquely identify the specific data being cached and include all parameters that affect the result. Monitor cache hit rates per cache key type and alert on significant hit rate drops that indicate cache efficiency degradation.

In-memory caching within the API process avoids network round trips to Redis for very hot data. A dictionary or LRU cache in process memory provides sub-millisecond access compared to 0.5 to 2ms for Redis. Use in-memory caching for data that changes rarely and is the same for all users: application configuration, feature flags, reference data, and lookup tables. Limit in-memory cache size to prevent memory issues, and implement a refresh mechanism that periodically updates from the authoritative source without service restarts.
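A minimal in-process TTL cache along these lines; TTLCache and load_feature_flags are illustrative names, not a specific library:

```python
import time

class TTLCache:
    """In-process cache that refreshes from an authoritative source
    when its TTL expires — no service restart needed."""
    def __init__(self, loader, ttl_seconds):
        self.loader = loader
        self.ttl = ttl_seconds
        self.value = None
        self.expires_at = 0.0

    def get(self):
        now = time.monotonic()
        if now >= self.expires_at:          # expired (or first call): reload
            self.value = self.loader()
            self.expires_at = now + self.ttl
        return self.value

loads = 0

def load_feature_flags():
    # Stand-in for fetching from the authoritative source (DB, config service).
    global loads
    loads += 1
    return {"new_checkout": True}

flags = TTLCache(load_feature_flags, ttl_seconds=30)
flags.get()   # first call loads from the source
flags.get()   # subsequent calls are served from process memory
```

A production version would also cap entry count or memory and handle loader failures; this sketch shows only the refresh-on-expiry core.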

Conditional HTTP requests allow clients to revalidate cached responses efficiently. When a client has a cached response with an ETag or Last-Modified header, it can send a conditional request (If-None-Match or If-Modified-Since) to check whether the cached response is still valid. If the data has not changed, the server returns 304 Not Modified with no response body, and the client uses its cached version. This saves bandwidth and reduces response time for cases where data has not changed since the last request—common for settings, profile data, and other slowly-changing resources.
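The ETag revalidation handshake can be sketched as follows; make_etag and respond are hypothetical helpers, and most web frameworks provide equivalents out of the box:

```python
import hashlib
import json

def make_etag(payload):
    """Derive a strong ETag from the canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(payload, if_none_match=None):
    """Return (status, body, headers) honoring If-None-Match revalidation."""
    etag = make_etag(payload)
    if if_none_match == etag:
        return 304, None, {"ETag": etag}   # unchanged: client reuses its cache
    return 200, payload, {"ETag": etag}

profile = {"name": "Ada", "plan": "pro"}
status1, body1, headers1 = respond(profile)                      # 200 + body
status2, body2, _ = respond(profile, if_none_match=headers1["ETag"])  # 304, empty
```

The second request transfers no body at all — the bandwidth and serialization savings the section describes.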

05

Optimize Payload Size and Serialization

Large response payloads waste bandwidth and slow serialization, directly adding to endpoint latency.

Response payload size directly affects serialization time (CPU time to convert objects to JSON), transmission time (bandwidth to send the response), and deserialization time on the client. A 1MB JSON response takes 10 to 50ms to serialize on the server and 20 to 100ms to deserialize on mid-range mobile devices. Reduce payload size by implementing sparse fieldsets (allowing clients to specify which fields to include), removing unused legacy fields from responses, and avoiding repeated data in responses by using IDs with a separate lookup table for shared reference data.

Pagination prevents endpoints from returning unbounded result sets that grow as data accumulates. An endpoint that returns all user orders might return 5 records in development but 50,000 records for an active production customer, causing multi-second response times and potentially running out of memory. Implement cursor-based pagination for all list endpoints, enforce maximum page sizes (100 records is a common limit), and document pagination for API consumers. For existing endpoints with large callers, add pagination gradually with a deprecation notice for the unpaginated version.
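A sketch of cursor-based pagination with a server-enforced maximum page size, over a toy in-memory dataset (the opaque cursor here is just a base64-encoded last-seen ID):

```python
import base64

ORDERS = [{"id": i, "total": i * 10} for i in range(1, 251)]  # toy dataset
MAX_PAGE_SIZE = 100

def list_orders(cursor=None, limit=50):
    limit = min(limit, MAX_PAGE_SIZE)          # enforce the cap server-side
    after_id = 0
    if cursor:
        after_id = int(base64.urlsafe_b64decode(cursor).decode())
    page = [o for o in ORDERS if o["id"] > after_id][:limit]
    next_cursor = None
    if page and len(page) == limit:
        last_id = str(page[-1]["id"]).encode()
        next_cursor = base64.urlsafe_b64encode(last_id).decode()
    return {"data": page, "next_cursor": next_cursor}

first = list_orders(limit=100)                             # ids 1..100
second = list_orders(cursor=first["next_cursor"], limit=100)  # ids 101..200
```

Unlike offset pagination, the cursor stays stable as rows are inserted, and the enforced cap means response time no longer grows with total data volume.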

Response compression with Brotli or gzip reduces transmission time for text-based JSON responses by 70 to 90%. A 500KB JSON response compresses to approximately 50 to 150KB, reducing network transfer time proportionally. Enable compression on your web server (nginx: gzip on, gzip_types application/json) or application framework, and verify your CDN is serving compressed responses. Compression adds a small CPU cost on the server (1 to 5ms for typical response sizes) but provides a much larger benefit in reduced network transfer time, especially on slow connections.

Binary serialization formats like Protocol Buffers, MessagePack, and CBOR can replace JSON for high-volume internal API calls between services, reducing both payload size and serialization/deserialization CPU time. Protocol Buffers are typically 30 to 80% smaller than equivalent JSON and deserialize 5 to 10 times faster. For public-facing APIs, JSON remains necessary for browser compatibility, but for microservice-to-microservice communication where both ends are under your control, adopting a binary format can significantly reduce per-request overhead for high-throughput services.

06

Implement Async Processing and Response Patterns

Asynchronous patterns decouple response time from processing time for expensive operations.

Background job processing decouples the API response from the time required to complete work. Instead of making an API client wait 30 seconds for a report to generate, a PDF to export, or a bulk email to send, the endpoint accepts the request, queues the work as a background job, and immediately returns 202 Accepted with a job ID. The client polls a status endpoint or receives a webhook notification when the job completes. This pattern keeps all API response times under 100ms and moves the variability of job execution time to a background system where it does not affect user experience.
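The accept-queue-respond flow can be sketched with an in-process queue standing in for a real job system such as Celery or Sidekiq; all names here are illustrative:

```python
import uuid
from queue import Queue

jobs = {}            # job_id -> status; a real system would use Redis or a DB
work_queue = Queue()

def create_report(request_body):
    """Endpoint handler: enqueue the work, respond immediately with 202."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    work_queue.put((job_id, request_body))
    return 202, {"job_id": job_id, "status_url": f"/jobs/{job_id}"}

def worker():
    """Background worker: drains the queue off the request path."""
    while not work_queue.empty():
        job_id, body = work_queue.get()
        jobs[job_id] = {"status": "done", "result": f"report for {body}"}

def get_job(job_id):
    """Status endpoint the client polls."""
    return 200, jobs[job_id]

status, resp = create_report("2025-Q1")   # returns immediately
worker()                                   # runs outside the request
_, job = get_job(resp["job_id"])
```

The endpoint's latency is now the cost of a queue insert, no matter how long report generation takes.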

Request deduplication prevents duplicate expensive operations when clients retry requests. When a client makes a request and does not receive a response (due to network failure), it may retry the same request, potentially triggering the same database write, payment charge, or email send multiple times. Implement idempotency keys—unique client-provided identifiers included in request headers—that allow the server to detect and return cached responses for duplicate requests. Idempotency is essential for payment and order endpoints where duplicate operations have direct financial consequences.
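A minimal sketch of idempotency-key handling, with in-memory dicts standing in for the durable store a payment system would actually require:

```python
charges = []
idempotency_store = {}   # key -> saved response; must be durable in practice

def charge_payment(idempotency_key, amount):
    """Charge once per idempotency key; replay the saved response on retries."""
    if idempotency_key in idempotency_store:
        return idempotency_store[idempotency_key]   # retry: replay, no new charge
    charges.append(amount)                           # side effect happens once
    response = {"status": "charged", "amount": amount}
    idempotency_store[idempotency_key] = response
    return response

charge_payment("key-123", 4999)   # original request
charge_payment("key-123", 4999)   # client retry after a network timeout
```

Both calls return the same response, but the charge itself executed exactly once — the property that makes retries safe for payment endpoints.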

Rate limiting and request prioritization protect slow endpoints from traffic spikes. An endpoint that handles 100 requests per second comfortably may degrade severely under 1,000 requests per second due to database connection pool exhaustion. Configure endpoint-specific rate limits that match the endpoint's capacity, and implement request prioritization that processes high-value user requests (paid customers, critical operations) before best-effort requests (analytics updates, background refreshes). A well-tuned rate limiter that gracefully degrades by returning 429 responses is better than an unprotected endpoint that degrades for all users.
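A token bucket is one common way to implement such a limit; this is a minimal single-process sketch, whereas a production limiter typically lives in an API gateway or a shared store so all instances enforce the same budget:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond with HTTP 429

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(10)]   # burst of 10 instant requests
```

The first five requests in the burst are admitted; the rest are rejected until tokens refill, which is exactly the graceful 429 degradation described above.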

Asynchronous validation and side effects improve endpoint response times by deferring non-critical work to after the response is sent. For an order creation endpoint, validating payment, reserving inventory, and creating the database record must happen synchronously before responding. Sending confirmation emails, updating analytics counters, and publishing events to downstream services can happen asynchronously after the response—the user does not need to wait for email delivery to receive their order confirmation. Use an async task queue or event bus to trigger these post-response operations reliably without blocking the response.

Key Takeaways

  • Prioritize endpoints by (P95 latency - P50 latency) × requests per day to target optimization effort where it reduces the most cumulative user-experienced latency
  • Distributed tracing with per-operation timing immediately reveals which component within a slow endpoint—database, external API, serialization—is the bottleneck to fix
  • Background job processing with 202 Accepted responses keeps all API endpoint response times fast regardless of how expensive the underlying operation is
  • Add EXPLAIN ANALYZE to the top 5 slowest database queries per endpoint—execution plans immediately identify missing indexes and full table scans
  • Multi-level caching (HTTP Cache-Control, application Redis cache, in-memory process cache) matches cache type to data characteristics: shared/public data at CDN, user-specific at Redis, ultra-hot at in-memory
  • Pagination is required on all list endpoints to prevent response time from growing as data accumulates—implement and enforce maximum page sizes before data volumes make the problem severe