
Troubleshoot Cache Issues

Cache issues cause performance degradation and stale data. Monitor cache effectiveness, debug cache invalidation, optimize cache strategies, and maintain reliable caching.

14 min read
Atatus Team
Updated March 15, 2025
6 sections
01

Understanding Cache Behavior and Failure Modes

Caches fail in distinct ways that require different diagnostic and remediation approaches.

Caches fail in three fundamentally different ways: cache misses (data is not in cache when requested), cache staleness (data is in cache but is outdated), and cache poisoning (incorrect data is stored in cache). Each failure mode has different symptoms and different root causes. Cache misses manifest as performance degradation—requests that should be fast are slow because they fall through to the underlying database or service. Cache staleness manifests as inconsistent data—users see outdated information even after changes are made. Cache poisoning manifests as systematic incorrectness—multiple users receive the same wrong response from a shared cache.

The thundering herd problem (also called cache stampede) occurs when a heavily requested cache entry expires, causing many concurrent requests to miss the cache at the same time and all attempt to regenerate the entry by querying the underlying data source. If the underlying query takes 500ms and 200 concurrent requests miss the cache simultaneously, the database receives 200 concurrent queries for the same data—potentially overwhelming the database with work that would normally be handled by 1 cache population request per TTL period. This can cause database overload that appears as a sudden spike in response times aligned with cache TTL expirations.

Cache penetration is an attack pattern where requests for data that does not exist in the database repeatedly bypass the cache and hit the database directly, because caches typically do not cache negative results. If an attacker sends requests for millions of non-existent user IDs (GET /users/99999999, GET /users/99999998, ...), each request misses the cache and hits the database, which returns empty results. Neither the miss nor the empty result is cached, so every subsequent request for the same non-existent ID also hits the database. Cache non-existent results with short TTLs to prevent this pattern.
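To make the negative-caching defense concrete, here is a minimal in-process sketch. The dict-based `_cache`, the `NOT_FOUND` sentinel, and the TTL values are all illustrative stand-ins for a real Redis or Memcached client; the point is that a database miss is cached too, with a deliberately short TTL.

```python
import time

NOT_FOUND = object()  # sentinel marking a cached negative result
NEGATIVE_TTL = 30     # seconds; keep short so newly created data appears quickly

_cache = {}  # key -> (value, expires_at); stands in for Redis/Memcached


def get_user(user_id, db_lookup, now=time.time):
    key = f"user:{user_id}"
    entry = _cache.get(key)
    if entry is not None and entry[1] > now():
        value = entry[0]
        return None if value is NOT_FOUND else value
    value = db_lookup(user_id)  # falls through to the database on a miss
    if value is None:
        # Cache the miss as well, so repeated probes for the same
        # non-existent ID stop reaching the database.
        _cache[key] = (NOT_FOUND, now() + NEGATIVE_TTL)
        return None
    _cache[key] = (value, now() + 300)
    return value
```

With this in place, a flood of requests for the same non-existent ID costs one database query per `NEGATIVE_TTL` window instead of one per request.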

Cache inconsistency in distributed environments occurs when multiple cache nodes serve different versions of the same data after an update. In a Redis cluster or Memcached cluster, cache invalidation signals may not reach all nodes simultaneously, leaving some nodes serving stale data while others serve fresh data. This is particularly problematic for financial data, inventory counts, and any information where consistency is important. Understand your cache topology's consistency guarantees before relying on cache for consistency-sensitive data, and implement read-your-writes consistency for operations where users must immediately see their own changes.

02

Track Cache Performance Metrics

Comprehensive cache metrics provide early warning of both performance and consistency issues.

Cache hit rate is the primary health metric for any caching system. It measures the percentage of requests that are served from cache versus falling through to the underlying data source. A healthy cache hit rate varies by use case—a static asset CDN cache should achieve 90 to 99% hit rate; an application-level API response cache may achieve 50 to 80% depending on data uniqueness. What matters is not the absolute hit rate but whether it is meeting the target for your specific caching strategy and whether it is declining over time without explanation.
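Redis exposes cumulative `keyspace_hits` and `keyspace_misses` counters via `INFO stats`; a small helper can turn those into a hit rate. This sketch assumes a dict shaped like what redis-py's `INFO` returns, and shows why you should diff two snapshots rather than use the lifetime counters directly.

```python
def hit_rate(stats):
    """Compute cache hit rate from Redis INFO-style counters."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


def windowed_hit_rate(before, after):
    """Hit rate over a window, from two INFO snapshots.

    The cumulative counters average over the server's entire uptime,
    which hides recent declines; diffing snapshots surfaces them.
    """
    delta = {
        "keyspace_hits": after["keyspace_hits"] - before["keyspace_hits"],
        "keyspace_misses": after["keyspace_misses"] - before["keyspace_misses"],
    }
    return hit_rate(delta)
```

Sampling `windowed_hit_rate` every few minutes and alerting on a sustained drop catches hit-rate regressions that the lifetime average would smooth away.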

Cache eviction rate indicates whether your cache has sufficient capacity for your working data set. When eviction rate is high, frequently accessed items are being evicted before the next access, causing cache misses that would not occur with a larger cache. Monitor both the total eviction count and the eviction policy being triggered (LRU evictions versus TTL expirations versus explicit invalidations). High LRU evictions indicate a cache that is smaller than the working data set; high TTL expirations are expected and healthy; high explicit invalidations may indicate over-aggressive cache invalidation.

Cache response time should be consistently under 1ms for in-memory caches (local process memory, Memcached, Redis with fast network). Response times above 2 to 5ms indicate network issues, Redis/Memcached server load issues, or inefficient data structures. Monitor cache response time percentiles (P50, P95, P99) separately from application response time to distinguish between cache performance degradation and application logic performance degradation. A cache that responds in 10ms is adding rather than removing latency for simple lookups.

Memory utilization and eviction patterns require monitoring to ensure caches are sized appropriately. When a Redis instance approaches 100% memory utilization and begins evicting keys, cache effectiveness degrades unpredictably because the eviction decisions are based on LRU approximation rather than your application's access patterns. Monitor memory utilization trends and alert at 75 to 80% utilization to provide time to add capacity before the cache becomes a bottleneck. Track the memory overhead of data structures—Redis hashes versus strings have different memory footprints for the same data.

03

Debug Cache Invalidation Problems

Cache invalidation is famously difficult—bugs in invalidation logic cause stale data that undermines cache correctness.

Cache invalidation bugs often present as intermittent data inconsistency rather than systematic failures, making them difficult to reproduce and diagnose. A user who updates their profile but still sees the old version, a product price that was changed 30 minutes ago but still shows the old price, or a permission change that takes 10 minutes to take effect—all indicate invalidation failures. Instrument your cache invalidation code to log every invalidation operation (what key was invalidated, when, and what triggered the invalidation) so you can correlate data inconsistency reports with the invalidation log.

Race conditions in cache invalidation occur when a write and a cache population operation execute concurrently in an interleaved order that leaves stale data in the cache. The typical race condition: thread 1 reads stale data from database, thread 2 writes new data to database and invalidates cache, thread 1 caches the stale data it read in step 1. After this sequence, the cache contains stale data that will not be invalidated again until the next write. Prevent this race condition with cache-aside patterns that validate data freshness before caching, or with distributed locks that prevent concurrent cache population for the same key.

TTL-based expiration is the safest invalidation mechanism but has inherent staleness during the TTL period. When you cannot guarantee timely event-driven invalidation (high write volume, complex cache key relationships, multiple cache nodes with different invalidation latencies), TTL-based expiration provides a bounded staleness guarantee: data is at most TTL-seconds old. Match TTL values to your consistency requirements: authentication tokens might require 30-second TTLs; product catalog data might accept 5-minute TTLs; static reference data might accept 24-hour TTLs.

Cache tag-based invalidation allows you to invalidate related cache entries together by associating them with a logical tag. When a product is updated, invalidating all cache entries tagged with that product ID (product detail, category listing, search results that include that product) ensures all stale entries are cleared with a single operation. Without tag-based invalidation, you must explicitly track every cache key that might contain a product's data—a fragile and error-prone approach. Redis modules like RediSearch and application-level tag tracking in Redis hashes or sets provide tag-based invalidation capabilities.

04

Optimize Cache Strategy and Configuration

Cache design decisions determine the effectiveness and reliability of your caching layer.

Cache key design is one of the most critical and underappreciated aspects of caching. A cache key must uniquely identify the specific data being cached while being as reusable as possible. Keys that are too specific (including request timestamps, trace IDs, or other unique request identifiers) will never cache hit. Keys that are too general (not including user ID for user-specific data) will serve wrong data to different users. Design cache keys systematically: start with the logical entity (user, product, order), add the specific operation (list, detail, count), then add the parameters that affect the result (filters, pagination, user ID for personalized data).
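The entity/operation/parameters recipe can be captured in a small helper. The key format below (`:`-separated, parameters sorted alphabetically) is one reasonable convention, not a standard; the important property is that the same logical request always produces the same key.

```python
def cache_key(entity, operation, **params):
    """Build a deterministic cache key: entity, then operation, then
    only the parameters that actually affect the result, sorted so
    that argument order never produces a different key."""
    parts = [entity, operation]
    for name in sorted(params):
        parts.append(f"{name}={params[name]}")
    return ":".join(parts)
```

Note what is deliberately absent: timestamps, trace IDs, and anything else unique per request. For personalized data, the user ID goes in as a parameter so two users never share an entry.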

Implement cache warming to pre-populate the cache with frequently accessed data before it is requested, avoiding cold cache performance degradation after deployments or cache flushes. Identify your highest-traffic, slowest-to-generate cache entries and add them to a warming process that runs after deployment or cache clear. For a product catalog, a warming script that loads the top 1,000 products after deployment ensures the first 1,000 product page views after deployment hit the cache rather than all simultaneously querying the database during a potentially high-traffic deployment period.
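A warming script can be as simple as the sketch below, run from a post-deploy hook. The key format and the `loader` callable are hypothetical; in practice `loader` would query the database for one product and `ids` would come from a top-N-by-traffic report.

```python
def warm_cache(cache, loader, ids):
    """Pre-populate `cache` after a deploy or flush.

    `loader(entity_id)` fetches one entry from the source of truth;
    `ids` is the list of hot entries (e.g. the top 1,000 product IDs).
    Returns the number of entries actually warmed.
    """
    warmed = 0
    for entity_id in ids:
        value = loader(entity_id)
        if value is not None:  # skip entries deleted since the hot list was built
            cache[f"product:{entity_id}:detail"] = value
            warmed += 1
    return warmed
```

Logging the return value after each deploy gives you a quick sanity check that warming actually ran before traffic arrives.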

Cache serialization format affects both performance and correctness. Storing complex objects requires serialization to a storable format (JSON, MessagePack, serialized binary) and deserialization on retrieval. JSON serialization is human-readable and widely supported but slow for large objects. MessagePack or Protocol Buffers are faster and more compact for high-throughput caches. Whatever format you choose, ensure that type information is preserved correctly—floating point numbers, dates, and null values serialize differently in different formats and can cause subtle bugs when values round-trip through the cache.
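The type-preservation pitfall is easy to demonstrate with JSON and a datetime. The `default=str` workaround below is a common pattern, and it silently changes the value's type on the way back out of the cache.

```python
import json
from datetime import datetime, timezone

record = {
    "price": 19.99,
    "updated_at": datetime(2025, 3, 15, tzinfo=timezone.utc),
    "note": None,
}

# datetime is not natively JSON-serializable; default=str is a common
# workaround that stringifies anything json.dumps does not understand...
payload = json.dumps(record, default=str)
restored = json.loads(payload)

# ...but the type information is gone after the round trip: the consumer
# now receives a string where the producer stored a datetime, and any
# comparison against a real datetime will fail until it is re-parsed.
print(type(restored["updated_at"]).__name__)  # str, not datetime
```

The same caution applies to floats (precision), `NaN` (not valid JSON), and formats that distinguish null from a missing field differently than your language does.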

Multi-tier caching architectures match cache types to data access patterns. In-process memory caching (application-level Map or LRU cache) provides sub-millisecond access but is not shared between processes and is lost on restart—appropriate for lookup tables, configuration values, and computed constants. Distributed caches (Redis, Memcached) provide 1 to 3ms access shared across all processes—appropriate for user session data, API responses, and computed aggregations. CDN edge caches provide 1 to 50ms access globally—appropriate for public static assets, publicly accessible HTML, and public API responses.

05

Handle Cache Failures Gracefully

Caches should improve performance when available but not cause failures when unavailable.

Implement cache-aside (lazy loading) pattern as the default caching pattern for most use cases. The pattern: try to read from cache; on miss, read from the underlying data source, store the result in cache with appropriate TTL, return the result. This pattern degrades gracefully when the cache is unavailable—requests fall through to the underlying data source and return correct data, just more slowly. Compare this to a write-through pattern that stores to both cache and database on every write—if the cache write fails, the write fails, which is unacceptable when the cache is non-critical infrastructure.
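The cache-aside pattern with graceful degradation fits in a few lines. This sketch assumes a cache client exposing `get(key)` and `set(key, value, ttl)`; with redis-py the write would be `set(key, value, ex=ttl)`, and in production you would log the swallowed errors rather than discard them silently.

```python
def cache_aside_get(cache, key, load, ttl=300):
    """Cache-aside read: try the cache, fall back to the source of
    truth on a miss or on any cache error. Cache failures slow the
    request down but never fail it."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception:
        pass  # cache unavailable: degrade to the underlying source
    value = load()
    try:
        cache.set(key, value, ttl)
    except Exception:
        pass  # a failed cache write is not a failed request
    return value
```

Contrast this with write-through: here, a dead cache costs latency only, because both the read and the write paths treat the cache as best-effort.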

Circuit breakers for cache dependencies prevent cascading failures when your caching infrastructure degrades. When Redis is unresponsive—due to network issues, server overload, or failover—every request that tries to use the cache will timeout, potentially making application response times worse than if no cache existed. A circuit breaker that detects Redis failures (via timeout monitoring) and temporarily bypasses the cache (falling through directly to the database) allows the application to continue functioning at reduced performance rather than completely failing. Reset the circuit breaker periodically to detect when the cache has recovered.
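A minimal cache circuit breaker needs only a failure counter and a cooldown clock. The thresholds below are illustrative and should be tuned to your cache timeout budget; the `clock` parameter exists so the behavior can be tested deterministically.

```python
import time


class CacheCircuitBreaker:
    """Bypass the cache after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None while the breaker is closed

    def allow(self):
        """Should this request attempt the cache at all?"""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: let attempts through to probe for recovery.
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

The application calls `allow()` before each cache operation; when it returns False, the request skips straight to the database instead of waiting out a cache timeout.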

Cache stampede prevention techniques protect your underlying data source when a popular cache entry expires. Probabilistic early expiration (or jitter on TTL) adds a small random probability of treating an entry as expired slightly before its TTL ends, allowing one request to proactively refresh the entry before the actual expiration—preventing the simultaneous expiration storm. Request coalescing (only one request fetches while others wait) is another approach: when a cache miss is detected, a distributed lock allows only one request to repopulate the cache while all other concurrent requests for the same key wait for the lock holder to complete.
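Probabilistic early expiration can be sketched with the XFetch-style formula (Vattani et al.): treat an entry as expired when `delta * beta * -ln(rand())` exceeds the remaining TTL, where `delta` is the time the recomputation takes. The probability of an early refresh rises smoothly as expiry approaches, so with many concurrent readers one of them refreshes ahead of the stampede.

```python
import math
import random
import time


def should_refresh_early(expires_at, delta, beta=1.0,
                         now=time.time, rng=random.random):
    """Decide whether to treat a cache entry as expired slightly early.

    `delta` is the measured cost (seconds) of regenerating the value;
    `beta` > 1 makes early refresh more aggressive. `now` and `rng`
    are injectable for testing.
    """
    ttl_remaining = expires_at - now()
    if ttl_remaining <= 0:
        return True  # actually expired
    # -ln(rand) is a positive, occasionally large number; the comparison
    # succeeds with higher probability the closer we are to expiry.
    return -delta * beta * math.log(rng()) >= ttl_remaining
```

On a hit, the caller checks `should_refresh_early(...)` and, if it returns True, regenerates and re-caches the value instead of serving the soon-to-expire copy.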

Redis Sentinel and Redis Cluster provide high availability for cache infrastructure that your application depends on. Redis Sentinel monitors master/replica nodes and automatically promotes a replica to master when the master becomes unavailable, typically within 10 to 30 seconds. Applications connected via Sentinel-aware clients (most Redis client libraries support this) automatically reconnect to the new master. Cluster mode distributes keys across multiple nodes, reducing the impact of any single node failure. For critical caching use cases, high availability is essential—a cache that causes application failures when it goes down is worse than no cache.

06

Monitor and Debug Cache in Production

Production cache debugging requires different tools and techniques than development debugging.

Redis MONITOR command (or keydb-cli MONITOR) displays all commands received by the Redis server in real time, allowing you to observe exactly which keys are being accessed, when, and by which operations. This is invaluable for debugging cache key correctness, understanding access patterns, and identifying unexpected cache operations. However, MONITOR has significant performance overhead (it logs every command) and should only be used briefly in production for debugging purposes, not left running continuously. MONITOR itself has no server-side filtering, so pipe its output through a tool like grep to capture only operations on the key prefixes relevant to your investigation.

Redis keyspace notifications send events to subscribed clients when specific cache operations occur—key expiration, key deletion, key modification. Subscribe to expiration events (the `__keyevent@0__:expired` channel, after enabling notifications with `notify-keyspace-events`) to observe which cache keys are expiring and when. This allows you to validate that your TTL configuration matches expected behavior and to detect unexpectedly frequent expiration of hot keys. Keyspace notifications are also useful for building reactive cache invalidation systems that propagate invalidation events to secondary caches or application processes.

Key access pattern analysis using Redis SCAN with pattern matching allows you to audit cache contents without the full-scan performance impact of KEYS command. Analyze cache key distributions to identify patterns: keys with similar prefixes that can be grouped for bulk operations, keys that are much larger than expected indicating serialization bugs, or keys that should have been evicted but are still present due to eviction policy misconfiguration. Monitoring key count by prefix family over time reveals whether specific cache categories are growing without bound.
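Once keys are collected, grouping them by prefix is a simple aggregation. The helper below is pure Python so it works on any key list; in production you would feed it from a non-blocking iterator such as redis-py's `scan_iter(match="user:*")` rather than the blocking KEYS command.

```python
from collections import Counter


def keys_by_prefix(keys, sep=":", depth=2):
    """Group cache keys by their leading prefix segments.

    At depth 2, "user:42:profile" groups under "user:42";
    at depth 1 it groups under "user".
    """
    counts = Counter()
    for key in keys:
        prefix = sep.join(key.split(sep)[:depth])
        counts[prefix] += 1
    return counts
```

Running this periodically and charting the counts per prefix family over time is a cheap way to spot a cache category growing without bound.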

Integrated cache observability in APM tools captures cache hit/miss events, response times, and key access patterns as part of the overall request trace. When a slow API request trace shows 3 cache misses followed by database queries that would have been cache hits, the trace immediately reveals whether the cache miss is due to an empty cache (cold start), an expired entry (TTL too short), or a cache key mismatch (wrong key format). APM-integrated cache tracing eliminates the need to manually correlate cache logs with application logs to understand cache behavior in the context of specific request executions.

Key Takeaways

  • Cache hit rate, eviction rate, and response time are the three primary metrics—declining hit rate without explanation indicates a change in data access patterns or cache key design issues
  • Thundering herd protection through probabilistic early expiration or request coalescing prevents simultaneous cache expiry from creating a database overload spike
  • Cache-aside pattern degrades gracefully when the cache is unavailable—requests fall through to the underlying source, preserving correctness at the cost of performance
  • Cache key design must uniquely identify specific data while being reusable—keys too specific never hit; keys too general serve incorrect data to different users
  • Circuit breakers for Redis/Memcached dependencies prevent cache unavailability from causing application failures—bypass the cache temporarily on persistent errors and retry periodically
  • Tag-based cache invalidation eliminates the need to explicitly track every cache key containing related data—associate cache entries with logical tags and invalidate all entries by tag on data changes