Understanding WebSocket Connection Lifecycle
WebSocket connections have distinct phases and failure modes that require targeted responses.
WebSocket connections begin with an HTTP Upgrade handshake—the client sends an HTTP GET request with an Upgrade: websocket header, and the server responds with 101 Switching Protocols if it accepts the upgrade. After this handshake, the connection transitions from HTTP to the WebSocket protocol and remains open as a persistent, bidirectional TCP connection. This upgrade mechanism means that any HTTP-level issue—authentication failures, proxy misconfigurations, load balancer limitations—will prevent the WebSocket connection from being established at all.
WebSocket connections are persistent TCP connections that can remain open for hours or days if the application and infrastructure allow it. Unlike HTTP/1.1 requests that complete in milliseconds and release resources, WebSocket connections hold resources—file descriptors, memory, connection state—on both client and server for their entire duration. This persistence creates benefits (zero connection establishment overhead for messages) and challenges (connections must be explicitly managed, cleaned up on disconnect, and protected against resource exhaustion from too many concurrent connections).
Close codes communicate the reason for connection termination and are essential for understanding disconnection patterns. Close code 1000 indicates normal closure; 1001 indicates the server or client is going away; 1006 is never sent by the WebSocket protocol itself but indicates an abnormal closure where the TCP connection was terminated without a proper WebSocket close frame; 1011 indicates an unexpected server error. When your client receives close code 1006, it means the TCP connection was dropped without warning—indicating a network issue, proxy timeout, or server crash rather than an intentional disconnection.
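As a minimal sketch of how a client or monitoring pipeline might act on these codes (the category names and function are illustrative, not part of any WebSocket API):

```python
# Sketch: classify WebSocket close codes (RFC 6455) into coarse categories
# used to decide logging, alerting, and reconnection behavior.

EXPECTED_CODES = {1000, 1001}   # normal closure, endpoint going away
ABNORMAL_CODE = 1006            # TCP dropped without a close frame

def classify_close(code: int) -> str:
    """Return a coarse category for a received close code."""
    if code in EXPECTED_CODES:
        return "expected"
    if code == ABNORMAL_CODE:
        return "abnormal"       # network issue, proxy timeout, or server crash
    return "error"              # e.g. 1011 unexpected server error

print(classify_close(1000))     # expected
print(classify_close(1006))     # abnormal
```

Routing 1006 closures to a separate alerting path keeps intentional disconnections from drowning out the infrastructure signal.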
WebSocket connections traverse multiple network hops—browser, client-side proxy, firewall, load balancer, reverse proxy (nginx/HAProxy), and finally the WebSocket server. Each intermediary can terminate the connection, impose timeout limits, or modify headers that affect WebSocket behavior. Load balancers that are not configured for sticky sessions may route WebSocket messages to different servers, breaking the connection state. Proxies that do not understand the WebSocket protocol may close connections they do not recognize as valid HTTP traffic.
Track WebSocket Connection Health
Proactive monitoring of WebSocket connection state enables rapid problem detection.
Monitor active WebSocket connection count as a fundamental health metric. Normal variation in connection count follows your user traffic patterns—connections increase during business hours and decrease overnight. Sudden drops in active connection count that are not correlated with user traffic changes indicate a systemic disconnection event (server crash, infrastructure issue, or network disruption). Alert on active connection count dropping by more than 20% within a 5-minute window relative to the same time window on the previous day.
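The alert rule above can be sketched as a comparison function; the 20% threshold comes from the text, while the metric source and function name are assumptions:

```python
# Sketch: flag a systemic disconnection event by comparing the current
# active-connection count to the same 5-minute window on the previous day.
def systemic_drop(current: int, yesterday: int, threshold: float = 0.20) -> bool:
    """True when the count fell by more than `threshold` versus yesterday."""
    if yesterday == 0:
        return False            # no baseline to compare against
    return (yesterday - current) / yesterday > threshold

print(systemic_drop(700, 1000))   # a 30% drop warrants an alert
```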
Track disconnection frequency and close codes separately to distinguish intentional closures from unexpected drops. Expected disconnections (users navigating away, mobile apps being backgrounded, users going offline) generate 1000 and 1001 close codes and are normal. Unexpected disconnections generating 1006 close codes in large numbers indicate infrastructure or network problems. Calculating the ratio of 1006 close codes to total disconnections over time reveals whether unexpected disconnections are increasing, which is an early warning sign of infrastructure degradation.
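A rolling version of this ratio can be computed over the last N disconnection events; the window size and class name here are illustrative:

```python
# Sketch: rolling ratio of abnormal (1006) closures to total disconnections.
# A rising ratio is an early warning of infrastructure degradation.
from collections import deque

class DisconnectRatio:
    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)  # True = close code 1006

    def record(self, close_code: int) -> None:
        self.events.append(close_code == 1006)

    def abnormal_ratio(self) -> float:
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

m = DisconnectRatio()
for code in (1000, 1000, 1006, 1001):
    m.record(code)
# m.abnormal_ratio() is now 0.25
```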
Measure connection duration distributions to identify patterns in when connections drop. If connections reliably disconnect after 30 to 60 minutes, a load balancer idle timeout is likely the cause. If connections drop after exactly 60 seconds during periods of message inactivity, a proxy with a 60-second idle timeout is disconnecting quiet connections. Plotting connection duration as a histogram reveals these characteristic patterns that point to specific infrastructure components as the root cause.
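A minimal bucketing sketch for building that histogram; the bucket edges are illustrative and should be tuned to your suspected timeout values:

```python
# Sketch: bucket connection durations (seconds) to expose timeout signatures.
# A spike in a single bucket (e.g. just past 60s) points at a specific
# idle timeout somewhere in the network path.
def duration_histogram(durations, edges=(30, 60, 120, 600, 1800, 3600)):
    buckets = [0] * (len(edges) + 1)
    for d in durations:
        i = sum(1 for e in edges if d >= e)   # count edges at or below d
        buckets[i] += 1
    return buckets

# connections dying at ~61s land in the [60, 120) bucket
print(duration_histogram([10, 59, 61, 3600]))
```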
Monitor reconnection success rates and time-to-reconnect alongside disconnection rates. A high disconnection rate is less severe if reconnections succeed quickly; a low disconnection rate with high reconnection failure rate is more concerning because clients are unable to recover. Track the sequence: disconnection occurred at T0, reconnection attempt at T1, successful reconnection at T2. Alert when time-to-reconnect (T2-T0) exceeds acceptable thresholds for your real-time use case, as extended reconnection windows translate directly to gaps in real-time data delivery.
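Pairing the disconnection and reconnection events per client gives the T2-T0 metric directly; the class and client-id scheme below are assumptions for illustration:

```python
# Sketch: per-client time-to-reconnect tracking (T2 - T0 from the text).
from typing import Optional

class ReconnectTracker:
    def __init__(self):
        self.disconnected_at = {}   # client_id -> T0 (seconds)

    def on_disconnect(self, client_id: str, t0: float) -> None:
        self.disconnected_at[client_id] = t0

    def on_reconnect(self, client_id: str, t2: float) -> Optional[float]:
        """Return T2 - T0, or None if no disconnect was recorded."""
        t0 = self.disconnected_at.pop(client_id, None)
        return None if t0 is None else t2 - t0

tr = ReconnectTracker()
tr.on_disconnect("c1", 100.0)
# tr.on_reconnect("c1", 103.5) yields 3.5 seconds offline
```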
Diagnose Disconnection Root Causes
Systematic diagnosis identifies the specific infrastructure component causing disconnections.
Load balancer timeout configuration is the most common cause of WebSocket disconnections in production. Most load balancers ship with timeout settings designed for HTTP request-response cycles, which assume idle connections can safely be closed after 60 to 120 seconds. WebSocket connections that are idle (no messages in either direction) will be terminated by these timeouts. AWS ALB has a default idle connection timeout of 60 seconds; nginx has a default proxy_read_timeout of 60 seconds; HAProxy has separate timeout settings for different connection states. Check load balancer and proxy timeout configurations as the first step in diagnosing WebSocket disconnection issues.
Proxy and firewall state tables have limited capacity and remove inactive connections to free resources. Stateful firewalls, NAT gateways, and enterprise proxy servers maintain connection state tables that map external connections to internal connections. These state tables have configurable timeout periods—connections that appear inactive (no packets) for the timeout duration are removed from the state table. When the client then sends a message, the firewall or NAT has no state for the connection and drops the packet; the client receives no response and the connection eventually times out.
Server-side resource exhaustion causes disconnections under high concurrency. WebSocket servers have limits on the number of concurrent connections based on available file descriptors (each connection uses one file descriptor), available memory (connection state is stored in memory), and thread or event loop capacity. On Linux, the default per-process file descriptor limit is 1,024, which must be increased to support thousands of concurrent WebSocket connections. Monitor file descriptor usage alongside connection count to detect when approaching the file descriptor limit.
Network partitions and packet loss cause WebSocket connections to become zombies—the server believes the connection is alive and the client believes the connection is alive, but messages are not flowing. TCP's keep-alive mechanism eventually detects and closes these zombie connections, but the default TCP keep-alive timeout (2 hours on Linux) is too long for real-time applications. WebSocket heartbeat messages at the application level provide faster detection of zombie connections and allow the application to reconnect before users notice a disruption.
Configure WebSocket Infrastructure Correctly
Infrastructure configuration determines baseline WebSocket stability before application-level optimization.
Configure load balancer timeout settings to accommodate long-lived WebSocket connections. Increase the load balancer's idle connection timeout to 3,600 seconds (1 hour) or longer for applications with sustained WebSocket sessions. In AWS ALB, the idle timeout is an attribute of the load balancer itself (the idle_timeout.timeout_seconds attribute), not a per-target-group setting. In nginx, increase proxy_read_timeout and proxy_send_timeout to match your expected connection duration. In HAProxy, configure timeout tunnel for WebSocket connections, which applies a separate timeout to upgraded connections than the rules governing regular HTTP traffic.
Enable WebSocket support explicitly in reverse proxies and load balancers. nginx requires specific configuration to proxy WebSocket connections: the Upgrade and Connection headers must be passed through using proxy_set_header directives, and HTTP/1.1 must be used for the proxy connection (proxy_http_version 1.1). Missing this configuration causes nginx to silently fail the WebSocket upgrade. AWS ALB supports WebSocket connections natively without additional configuration. Recent HAProxy versions detect the Upgrade handshake and switch to tunnel mode automatically; the essential setting there is an adequate timeout tunnel rather than a special upgrade option.
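A minimal nginx location block for the configuration described above; the upstream name `websocket_backend` and the `/ws/` path are illustrative:

```nginx
# Proxy WebSocket upgrades; upstream name and path are placeholders.
location /ws/ {
    proxy_pass http://websocket_backend;
    proxy_http_version 1.1;                  # HTTP/1.1 required for Upgrade
    proxy_set_header Upgrade $http_upgrade;  # forward the upgrade handshake
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;                # match expected session length
    proxy_send_timeout 3600s;
}
```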
Configure sticky sessions (session affinity) in your load balancer for WebSocket connections. When a WebSocket client reconnects after a disconnection, it must reconnect to the same server instance that holds its session state, unless you have implemented a distributed session store. Configure load balancer sticky sessions using cookies or source IP affinity to ensure reconnections route to the same backend. In Kubernetes, use sessionAffinity: ClientIP in the Service resource or implement cookie-based affinity using an ingress annotation.
Increase operating system file descriptor limits on WebSocket server hosts to support high connection counts. The default per-process limit on Linux (ulimit -n, typically 1,024 or 4,096) restricts the number of concurrent WebSocket connections. Increase it to 65,536 or higher for production WebSocket servers by adding the appropriate limits.conf entries or systemd service LimitNOFILE setting. Also increase the system-wide file descriptor limit (fs.file-max sysctl) and the TCP socket backlog (net.core.somaxconn) to prevent connection queuing under high connection rates.
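As a sketch of those settings, assuming a service user named `myapp` (the user name and the specific sysctl values are placeholders to adapt to your capacity planning):

```
# /etc/security/limits.conf — raise the per-process file descriptor limit
myapp  soft  nofile  65536
myapp  hard  nofile  65536

# For a systemd-managed service, set it in the unit file instead:
# [Service]
# LimitNOFILE=65536

# /etc/sysctl.d/99-websocket.conf — system-wide limits
fs.file-max = 2097152
net.core.somaxconn = 65535
```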
Implement Application-Level Heartbeats
Heartbeat messages maintain connections and provide fast detection of network disruptions.
WebSocket ping/pong frames are the protocol-level mechanism for testing connection liveness. The WebSocket protocol defines ping and pong control frames specifically for this purpose: the server sends a ping frame, and the client must respond with a pong frame. If no pong is received within a configurable timeout, the server closes the connection. Implement server-side pings every 15 to 30 seconds and close connections that do not respond within 10 seconds. This prevents zombie connections from accumulating and provides 15 to 30 second detection windows for network disruptions. Note that the browser WebSocket API does not expose ping/pong frames to JavaScript (browsers answer pings with pongs automatically), so protocol-level pings must originate from the server.
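The server-side bookkeeping can be kept independent of any particular WebSocket library; this sketch returns an action for the transport layer to execute, with the class name and tick-based design being assumptions:

```python
# Sketch: server-side ping/pong state machine, decoupled from the transport.
# Call tick() periodically; it tells the caller when to ping or close.
import time

class PingPongMonitor:
    def __init__(self, interval=20.0, pong_timeout=10.0, clock=time.monotonic):
        self.interval = interval          # seconds between pings
        self.pong_timeout = pong_timeout  # grace period for the pong reply
        self.clock = clock
        self.last_ping = None
        self.awaiting_pong = False

    def tick(self) -> str:
        """Return 'ping', 'close', or 'ok' for the transport to act on."""
        now = self.clock()
        if self.awaiting_pong and now - self.last_ping > self.pong_timeout:
            return "close"                # no pong: terminate zombie connection
        if not self.awaiting_pong and (self.last_ping is None
                                       or now - self.last_ping >= self.interval):
            self.last_ping = now
            self.awaiting_pong = True
            return "ping"
        return "ok"

    def on_pong(self) -> None:
        self.awaiting_pong = False        # client proved liveness
```

Injecting the clock keeps the state machine deterministic and easy to unit-test.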
Application-level heartbeats in your message protocol provide additional liveness checking beyond the protocol-level ping/pong. Send heartbeat messages (JSON objects with type: 'ping') every 30 seconds and expect heartbeat responses within 10 seconds. This allows your application to distinguish between a connection that is alive at the network level but not processing messages (a server-side processing deadlock or overloaded server) versus a connection that is simply idle. Application heartbeats also pass through proxies that may silently drop WebSocket control frames.
Heartbeat intervals should be configured based on the keepalive timeouts of intermediary network components. If your firewall terminates connections idle for 300 seconds, configure heartbeats at 250-second intervals to ensure packets flow before the firewall closes the connection. This creates a buffer between your heartbeat interval and the firewall timeout. The optimal heartbeat interval is slightly less than the most restrictive timeout in your network path from client to server, which requires profiling your network infrastructure to determine.
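The derivation reduces to taking the most restrictive timeout in the path and applying a safety margin; the 0.8 margin below is an illustrative default consistent with the 250s/300s example above:

```python
# Sketch: derive a heartbeat interval from known intermediary idle timeouts.
# The interval must undercut the most restrictive timeout with a margin.
def heartbeat_interval(timeouts_s, margin: float = 0.8) -> float:
    """timeouts_s: idle timeouts (seconds) of every hop in the path."""
    return min(timeouts_s) * margin

# firewall 300s, load balancer 3600s, proxy 60s -> ping every 48 seconds
print(heartbeat_interval([300, 3600, 60]))
```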
Client-side connection health monitoring using heartbeat timeouts allows clients to detect disconnections proactively. When a client has not received any message (including heartbeat responses) from the server for a configurable duration, the client should consider the connection dead and initiate reconnection, rather than waiting for the TCP stack to detect the disconnection (which can take minutes). Implement a watchdog timer that resets on every received message and fires an event to trigger reconnection when no message is received within twice the expected heartbeat interval.
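A minimal watchdog sketch of that client-side check; the class name and injected clock are assumptions for illustration and testability:

```python
# Sketch: client-side watchdog that declares the connection dead when no
# message (heartbeat responses included) arrives within 2x the heartbeat
# interval, instead of waiting minutes for the TCP stack to notice.
import time

class ConnectionWatchdog:
    def __init__(self, heartbeat_interval_s: float, clock=time.monotonic):
        self.timeout = 2 * heartbeat_interval_s
        self.clock = clock
        self.last_message = clock()

    def on_message(self) -> None:
        """Call for every received frame, heartbeats included."""
        self.last_message = self.clock()

    def is_dead(self) -> bool:
        return self.clock() - self.last_message > self.timeout

# poll is_dead() on a timer; a True result triggers reconnection logic
```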
Build Robust Reconnection Logic
Resilient reconnection prevents temporary disruptions from becoming extended outages for users.
Exponential backoff with jitter is the standard algorithm for WebSocket reconnection attempts. Starting with a short initial delay (500ms) and doubling on each failed attempt (1s, 2s, 4s, 8s, 16s, 32s) up to a maximum delay (60s to 300s) prevents thundering herd behavior where thousands of clients simultaneously attempt to reconnect after a server restart or network disruption. Adding random jitter (randomizing each delay within a range) further distributes reconnection attempts across time, preventing synchronized waves of reconnection traffic.
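The schedule above can be expressed in a few lines; this sketch uses "full jitter" (randomizing over the whole interval), one of several common jitter strategies:

```python
# Sketch: exponential backoff with full jitter for reconnection attempts.
import random

def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Delay in seconds before reconnection attempt `attempt` (0-based)."""
    exp = min(cap, base * (2 ** attempt))   # 0.5, 1, 2, 4, ... capped at 60
    return random.uniform(0, exp)           # jitter spreads the herd in time

# attempt 6 draws uniformly between 0 and 32 seconds
```

Full jitter trades a slightly longer average delay for maximal desynchronization across clients, which is exactly what prevents the thundering herd.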
State reconciliation on reconnection is necessary for stateful real-time applications. When a client reconnects after a disconnection, it may have missed messages that were published during the disconnection window. Implement a sequence number or timestamp-based message acknowledgment system that allows clients to request missed messages after reconnecting. The reconnection message should include the last sequence number or timestamp received before disconnection, allowing the server to replay missed events since that point.
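A server-side replay buffer sketch for this scheme; the class, capacity, and eviction policy are assumptions (a production system would bound the buffer by time or memory and persist it across restarts):

```python
# Sketch: bounded replay buffer keyed by sequence number, so a reconnecting
# client can request every message after the last sequence number it saw.
from collections import OrderedDict

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.messages: "OrderedDict[int, object]" = OrderedDict()
        self.seq = 0

    def publish(self, payload) -> int:
        """Assign the next sequence number and retain the message."""
        self.seq += 1
        self.messages[self.seq] = payload
        if len(self.messages) > self.capacity:
            self.messages.popitem(last=False)   # evict the oldest message
        return self.seq

    def replay_after(self, last_seen: int):
        """Messages the client missed while disconnected."""
        return [(s, p) for s, p in self.messages.items() if s > last_seen]

buf = ReplayBuffer()
for msg in ("a", "b", "c"):
    buf.publish(msg)
# a client that last saw seq 1 requests replay_after(1) -> [(2, "b"), (3, "c")]
```

If the client's last-seen sequence number has already been evicted, the server should signal a full state resync instead of a partial replay.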
Connection status UI indication allows users to understand when they are temporarily disconnected and when real-time data may be stale. Display a subtle connection status indicator that shows 'Connected', 'Reconnecting...', and 'Disconnected' states. When reconnecting, disable user actions that require real-time communication and show the current data with a staleness indicator. This manages user expectations during brief disruptions and prevents users from taking actions that will fail because the connection is not ready.
Implement a maximum reconnection attempt limit with a graceful fallback for cases where the server is genuinely unavailable. After 10 to 20 reconnection attempts spread over 5 to 15 minutes, stop automatic reconnection and display a message asking the user to check their connection or refresh the page. Continuing reconnection attempts indefinitely wastes battery on mobile devices and creates a poor user experience. Provide a manual 'Retry' button so users who want to try again can do so at any time.
Key Takeaways
- Load balancer and proxy timeout settings (often 60 seconds for idle connections) are the most common cause of WebSocket disconnections—increase timeouts to match your connection duration requirements
- Close code 1006 indicates abnormal TCP termination without a WebSocket close frame—network drops, proxy timeouts, and server crashes all generate 1006, which requires infrastructure investigation
- Heartbeat pings every 15-30 seconds prevent firewall state table timeouts and detect zombie connections faster than TCP keep-alive's default 2-hour detection window
- Sticky sessions (session affinity) in load balancers route reconnections to the same server instance holding session state—required unless you implement distributed session storage
- Exponential backoff with jitter for reconnections prevents thousands of clients from simultaneously reconnecting after a server restart, which would create a secondary overload event
- Message sequence numbers enable clients to request missed messages after reconnection, maintaining data consistency in stateful real-time applications