Understanding Docker Container Failure Modes
Container failures have distinct categories that map to specific debugging approaches.
Docker container failures fall into three broad categories: image issues (the container cannot start because the image is malformed, missing, or has the wrong architecture), startup failures (the container starts but the application exits immediately or fails health checks), and runtime failures (the container starts successfully but fails during operation due to resource exhaustion, external dependency failures, or application errors). Each category requires a different debugging approach and is visible in different parts of Docker's tooling.
Exit codes provide the first clue about why a container failed. Exit code 0 indicates a clean shutdown; exit code 1 indicates a generic application error; exit code 137 (128 + 9) indicates the container was killed by SIGKILL (typically from the OOM killer or a manual docker kill); exit code 143 (128 + 15) indicates the container received and handled SIGTERM for a graceful shutdown. Exit code 125 means the docker run command itself failed, 126 means the container command could not be invoked (for example, a permissions problem), and 127 means the command was not found. Check exit codes as the first step in any container failure investigation.
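The 128 + signal-number convention behind codes 137 and 143 can be verified in any POSIX shell, no Docker daemon required:

```shell
# Killed-by-signal exit codes follow the 128 + signal-number convention.
sh -c 'kill -KILL $$'    # SIGKILL is signal 9
echo $?                  # 137 = 128 + 9
sh -c 'kill -TERM $$'    # SIGTERM is signal 15
echo $?                  # 143 = 128 + 15
```

This is the same arithmetic Docker reports in `docker inspect` and `docker ps -a` after a container dies.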
Container restart policies determine what happens after a failure. The default 'no' policy never restarts; 'on-failure' restarts on non-zero exit codes with an optional maximum retry count; 'always' restarts regardless of exit code, including after a Docker daemon restart; and 'unless-stopped' behaves like 'always' except that a container you explicitly stopped stays stopped, even across daemon restarts. In Docker Swarm and Kubernetes, restart policies are managed by the orchestrator rather than Docker directly. Choosing the appropriate restart policy prevents both unnecessary downtime (no restart policy in production) and resource-wasting infinite restart loops (always restart policy without a maximum retry count).
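In Compose, the policy is set per service. A minimal sketch (image names are placeholders; the `:5` retry-count suffix is the docker run syntax and is accepted by recent Compose versions):

```yaml
services:
  api:
    image: example/api:1.4.2        # placeholder image
    restart: "on-failure:5"         # retry at most 5 times on non-zero exit
  worker:
    image: example/worker:1.4.2
    restart: unless-stopped         # survive daemon restarts, respect manual stops
```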
Container logs are the primary debugging tool for application-level failures. Docker stores container logs in files on the host filesystem by default, with configurable log drivers (json-file, syslog, journald, fluentd, AWS CloudWatch) for production log management. The docker logs command retrieves logs from the configured log driver, with --since and --until flags for time-range filtering and --tail for limiting output to recent lines. In the json-file format, each log line is tagged with its source stream, 'stdout' or 'stderr', which lets you separate normal output from error output when filtering.
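The log driver and its rotation settings are typically configured host-wide in /etc/docker/daemon.json. A minimal json-file example with rotation so container logs cannot fill the host disk:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```

Changes to daemon.json require a Docker daemon restart and apply only to containers created afterwards.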
Diagnose Container Startup Failures
Startup failures prevent containers from serving traffic and have distinctive diagnostic patterns.
Application startup failures are the most common container issue in development and staging environments. The container starts, the application process launches, encounters an error (missing environment variable, wrong database connection string, missing configuration file, port already in use), and exits. The error is typically visible in the container logs. Run docker logs {container_name} immediately after a failed startup to capture the error message before the container is removed (for example, by the --rm flag or an orchestrator's cleanup). On a crash-looping container, each restart appends new output to the same log, so use --since or --tail to isolate the most recent attempt.
Environment variable errors cause many startup failures in containerized applications. Applications that read configuration from environment variables fail at startup when required variables are missing or have incorrect values. Use docker inspect {container_name} to view the full environment variables set for a container, and verify that all required configuration is present with correct values. Common mistakes include typos in variable names, mixing up staging and production variable names, and forgetting to pass secrets from Docker secrets or Kubernetes secrets to the container environment.
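A cheap defense against missing variables is to fail fast with a clear message, which any POSIX shell supports via the `${VAR:?message}` expansion (the variable name below is an example):

```shell
# ${VAR:?message} makes the shell exit with an error when VAR is unset
# or empty -- useful as a fail-fast guard in a container entrypoint script.
DATABASE_URL="postgres://db:5432/app" \
  sh -c ': "${DATABASE_URL:?DATABASE_URL must be set}"'
echo "set:   $?"     # 0 -- the variable was present

sh -c 'unset DATABASE_URL; : "${DATABASE_URL:?DATABASE_URL must be set}"' 2>/dev/null
echo "unset: $?"     # non-zero -- the shell aborted with the message
```

Placed at the top of an entrypoint script, these guards turn a confusing mid-startup stack trace into an explicit one-line error in docker logs.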
Port binding conflicts prevent containers from starting when the host port they need is already in use. The error message 'address already in use' in container logs or 'port is already allocated' in Docker error output indicates this problem. Check which process is using the conflicting port with lsof -i :{port} on Linux/macOS or netstat -ano | findstr :{port} on Windows, and either stop the conflicting process or map the container to a different host port. In development, port conflicts are common when starting multiple instances of the same service.
Permission errors occur when containerized applications attempt to access files, directories, or Unix sockets that the container's process user does not have permission to use. In Kubernetes and Docker with security contexts, containers often run as non-root users to improve security, but application files may have been created as root or may have restrictive permissions from the base image. Check the application process user with docker exec {container} id and verify that application directories and files have appropriate ownership and permissions for the running user.
Track Container Health and Performance
Runtime monitoring detects resource issues and performance degradation before they cause failures.
Monitor container CPU and memory usage continuously using docker stats for single-host environments or Prometheus with cAdvisor for multi-host and Kubernetes environments. docker stats provides a real-time stream of CPU percentage, memory usage versus limit, memory percentage, network I/O, and block I/O for each running container. Set up persistent metric collection with a time-series database so you can examine historical trends, not just current state. Correlate resource usage spikes with application behavior changes, deployment events, and traffic patterns.
Memory limit violations cause containers to be terminated with exit code 137. When a container hits its memory limit, the Linux OOM killer selects a process to kill—usually the container's main process. Unlike CPU limits that throttle execution, memory limits kill containers, causing service interruption. Monitor memory usage trends to detect gradual growth toward the limit before OOMKills occur. Alert when container memory usage exceeds 80% of its limit to provide warning time for investigation and remediation.
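With cAdvisor metrics in Prometheus, the 80% threshold can be encoded as an alerting rule. A sketch (the metric names are the standard cAdvisor ones; label matchers and thresholds are illustrative):

```yaml
groups:
  - name: container-memory
    rules:
      - alert: ContainerMemoryNearLimit
        # Working-set memory as a fraction of the configured limit.
        # The (limit > 0) filter drops containers with no limit set,
        # which report a limit of 0 and would otherwise divide by zero.
        expr: |
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} > 0)
            > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} is above 80% of its memory limit"
```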
CPU throttling under CPU limits causes latency increases without visible failures. A container with a CPU limit of 0.5 cores (500m millicores in Kubernetes notation) that attempts to use 1.5 cores will have 67% of its CPU work throttled—meaning it runs at one-third its normal speed. This manifests as slower API responses, higher request latency, and reduced throughput without any error messages or container restarts. Monitor CPU throttling percentage separately from CPU utilization to identify this category of performance problem.
Network I/O metrics reveal containers that are transmitting or receiving unexpectedly large amounts of data. A container performing database queries, receiving file uploads, or streaming video will show predictably high network I/O. Unexpected spikes in network I/O may indicate runaway logging, data exfiltration, or recursive data retrieval bugs. Monitor per-container network throughput and alert on values that exceed normal baselines by more than 3x, enabling rapid detection of anomalous network behavior.
Optimize Docker Images for Production
Well-built images improve deployment speed, security, and runtime reliability.
Multi-stage builds reduce production image size by using a full build environment to compile application code and then copying only the compiled artifacts into a minimal runtime image. A Node.js application built in a full node:18 image might be 1.2GB including all development dependencies and build tools. The same application in a production image using only node:18-alpine with production dependencies is typically 100 to 250MB. Smaller images pull faster during deployments, start faster, and have a smaller attack surface. Use multi-stage builds as the default for all languages with compilation steps (Go, Java, TypeScript, Rust).
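A minimal multi-stage sketch for a Node.js service (file paths, the build script, and the entry file are assumptions about the project layout):

```dockerfile
# Stage 1: full build environment with dev dependencies
FROM node:18 AS build
WORKDIR /app
# Copy manifests before source so the dependency layer stays cached
COPY package.json package-lock.json ./
RUN npm ci
COPY src/ ./src/
RUN npm run build               # assumes a build script that emits dist/

# Stage 2: minimal runtime image with production dependencies only
FROM node:18-alpine
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node                       # node images ship a non-root 'node' user
CMD ["node", "dist/server.js"]
```

Only the second stage ships to the registry; the 1.2GB build environment is discarded after the COPY --from=build step.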
Layer caching optimization reduces image build times and speeds up CI/CD pipelines. Docker builds each Dockerfile instruction as a separate layer and caches layers that have not changed since the last build. Instructions that change frequently (like copying application source code) should appear late in the Dockerfile, while instructions that change rarely (like installing system packages and dependencies) should appear early. Separating COPY package.json and npm install from COPY src/ ensures that the dependency installation layer is cached when only source code changes, saving 30 to 120 seconds per build.
Base image selection affects image size, security vulnerability surface, and startup time. Alpine Linux base images are minimal (under 10MB) but may have compatibility issues with some applications due to their musl libc implementation. Distroless images contain only the application runtime without a shell or package manager, providing a minimal attack surface at the cost of reduced debugging capability. Debian slim images balance size, compatibility, and debuggability. Choose your base image based on your security requirements, compatibility needs, and acceptable image size.
Regular vulnerability scanning of your container images prevents deploying containers with known security vulnerabilities. Scan images in your CI/CD pipeline using tools like Trivy, Clair, Snyk, or Amazon ECR scanning before pushing to your registry. Configure your pipeline to fail on critical or high-severity vulnerabilities. Pin specific base image versions (FROM node:18.19.0-alpine3.19) rather than using mutable tags (FROM node:18) to ensure reproducible builds and avoid inadvertently pulling a vulnerable updated base image.
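As one example of gating a pipeline on scan results, a GitHub Actions fragment using the Trivy action (the image name and registry are placeholders):

```yaml
# Fragment of a CI job; the image reference is a placeholder
- name: Build image
  run: docker build -t registry.example.com/api:${{ github.sha }} .

- name: Scan for vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/api:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: "1"        # non-zero exit fails the pipeline on findings
```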
Manage Container Networking and Inter-Service Communication
Networking issues between containers are a common source of runtime failures in multi-container applications.
Docker's internal DNS resolution allows containers in the same network to communicate using container names or service names as hostnames rather than IP addresses. When a container attempts to connect to another container using its IP address directly, the connection breaks whenever the target container is restarted and receives a new IP. Use service names (the name field in docker-compose.yml or the service name in Kubernetes) as hostnames in all inter-service connection strings, and rely on Docker's internal DNS to resolve them to the current container IP.
Docker Compose networking creates an isolated network for each compose project where all services can reach each other by service name. However, issues arise when services start up in the wrong order—a service that depends on a database starting before the database container is ready will fail on its first connection attempt. Use health checks combined with depends_on conditions to enforce startup ordering: define a healthcheck on the database container and configure dependent services to wait for the database to be healthy before starting.
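A Compose sketch of this pattern, using Postgres as the dependency (the api image is a placeholder):

```yaml
services:
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    image: example/api:latest        # placeholder
    depends_on:
      db:
        condition: service_healthy   # wait until the healthcheck passes
```

Without the condition, depends_on only orders container creation; the api would still start before Postgres accepts connections.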
Container-to-host and container-to-external network connectivity requires proper firewall configuration and routing. On Linux, Docker manages iptables rules to enable container networking, and custom iptables rules that run after Docker starts can interfere with container networking. On systems with fail2ban, ufw, or other firewall managers, Docker's iptables modifications may be overridden or blocked. If containers cannot reach external services or each other, examine iptables rules and verify that Docker's DOCKER and DOCKER-USER chains are present and not being bypassed.
DNS resolution failures within containers cause intermittent connection errors that are difficult to reproduce. Each container inherits DNS configuration from the host or from Docker's internal DNS resolver (127.0.0.11 on user-defined networks). If your host's DNS resolver is slow or unreliable, container DNS lookups will also be slow. Configure explicit DNS servers in your Docker daemon configuration (daemon.json) using reliable resolvers like Google (8.8.8.8) or Cloudflare (1.1.1.1) for external lookups, and verify that your internal service discovery DNS is functioning correctly for internal service resolution.
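The daemon-level DNS override is a one-key change in daemon.json (followed by a daemon restart):

```json
{
  "dns": ["1.1.1.1", "8.8.8.8"]
}
```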
Implement Container Security Best Practices
Security configuration prevents both external attacks and inadvertent container breakouts.
Running containers as non-root users is the most impactful security improvement for container deployments. Root containers have full access to the host filesystem via volume mounts and can potentially escape the container namespace in some configurations. Add a USER instruction to your Dockerfile to run the application as a dedicated non-root user, and ensure that application files have appropriate permissions for that user. Some official base images ship a pre-created non-root user (the node images include a 'node' user, for example); where none exists, create one with useradd or adduser in your Dockerfile and reference it in the USER instruction.
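A sketch for a Python image, which does not ship an application user (the entry file name is a placeholder):

```dockerfile
FROM python:3.12-slim
# Create a dedicated non-root user with no login shell
RUN useradd --create-home --shell /usr/sbin/nologin appuser
WORKDIR /app
# --chown ensures the application files belong to the runtime user
COPY --chown=appuser:appuser . .
USER appuser
CMD ["python", "main.py"]
```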
Read-only root filesystems prevent compromised containers from writing malicious code or modifying system binaries. Configure containers with readOnlyRootFilesystem: true in Kubernetes security contexts or --read-only in docker run, then explicitly mount writable volumes for directories that legitimately need write access (temporary files, log files, application state). This limits the blast radius of container compromises to the mounted writable directories rather than the entire container filesystem.
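In Compose terms, the same pattern looks like this (the image name and writable paths are illustrative):

```yaml
services:
  api:
    image: example/api:latest     # placeholder
    read_only: true               # root filesystem mounted read-only
    tmpfs:
      - /tmp                      # in-memory scratch space
    volumes:
      - app-logs:/var/log/app     # the only persistent writable path
volumes:
  app-logs:
```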
Secrets management should never involve building secrets into Docker images or passing them as plain environment variables in Compose files or Kubernetes manifests. Secrets embedded in images are retrievable by anyone with access to the image registry. Instead, use Docker secrets for Swarm deployments, Kubernetes secrets (ideally encrypted at rest and managed by an external secrets manager), or dynamic secrets from HashiCorp Vault or cloud provider secret management services. Mount secrets as files in the container filesystem rather than environment variables where possible, as environment variables are visible in process listings and may be logged.
Capability dropping reduces the privileges available to container processes. Docker containers by default include a set of Linux capabilities that are not needed by most applications: NET_BIND_SERVICE, CHOWN, FOWNER, and others. Drop all default capabilities and add back only those specifically required by your application using the cap_drop: ALL and cap_add: [SPECIFIC_CAPABILITY] pattern. Running with minimal capabilities limits the impact of container compromise and prevents privilege escalation attempts.
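The drop-all-then-add-back pattern in Compose syntax (whether you need NET_BIND_SERVICE depends on whether the process binds a port below 1024):

```yaml
services:
  api:
    image: example/api:latest     # placeholder
    cap_drop:
      - ALL                       # start from zero capabilities
    cap_add:
      - NET_BIND_SERVICE          # only if binding ports below 1024
```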
Debug Live Container Issues Efficiently
Effective debugging techniques reduce mean time to resolution for container incidents.
Executing into a running container for interactive debugging is one of the most powerful tools for container diagnosis. docker exec -it {container} /bin/sh (or /bin/bash for images based on Debian/Ubuntu) opens an interactive shell inside the running container, allowing you to inspect the filesystem, run diagnostic commands, check environment variables with printenv, test connectivity with curl or nc, and examine process state with ps and top. Be cautious with this approach in production—interactive access to production containers should be logged and restricted to authorized personnel.
Debug containers and ephemeral containers (in Kubernetes 1.23+) allow you to attach a debug container with full tooling to an existing pod without modifying the production container. kubectl debug -it {pod} --image=busybox --target={container} attaches a busybox container that shares the target container's process namespace, allowing you to inspect the target container's processes, network connections, and filesystem. This is invaluable for debugging distroless images that have no shell or debugging tools of their own.
Container inspection with docker inspect provides detailed configuration and state information that is not visible in docker ps output. Inspect reveals the complete container configuration including environment variables, volume mounts, network settings, health check status, and restart count. Use jq to extract specific fields from the JSON output: docker inspect {container} | jq '.[0].State'. The State.Health field holds the health check status and the most recent check results, and is the first place to look when a container is reported as unhealthy.
Network debugging from within containers requires appropriate tools to be available. tcpdump for packet capture, curl and wget for HTTP testing, nc (netcat) for TCP connectivity testing, dig and nslookup for DNS debugging, and ss or netstat for socket state inspection are the essential toolkit. Distroless images and minimal Alpine images may not include these tools. Maintain a debug sidecar image with these tools available that can be temporarily deployed alongside production containers during incident investigation, then removed after the issue is resolved.
Key Takeaways
- Exit codes provide the first diagnostic signal: 137 means OOM kill, 143 means SIGTERM handled, 1 means application error—check exit code before examining logs
- Multi-stage Docker builds reduce production image sizes by 80-90% compared to single-stage builds, improving deployment speed and security surface area
- Layer caching optimization in Dockerfiles—copying package manifests before source code—saves 30-120 seconds per build in CI/CD pipelines
- Never build secrets into Docker images or pass them as environment variables in committed configuration files—use Docker secrets, Kubernetes secrets, or external vault solutions
- Container DNS uses service names as hostnames rather than IP addresses to maintain connectivity across restarts and replacements
- Health checks with depends_on conditions in Docker Compose enforce correct service startup ordering and prevent connection failures from services starting before their dependencies are ready