
How We Monitor 1,500 Customer Applications with Atatus

An in-depth enterprise case study on operating full-stack observability across 1,500 customer-facing applications. Covers architecture, alert strategy, dashboard organization, cost optimization, and results.

25 min read
Atatus Team
Updated March 15, 2026
11 sections
01

The Challenge: Scaling Observability Across a Growing SaaS Platform

What it looks like when monitoring requirements outpace your tooling

When a SaaS platform grows from 200 to 1,500 enterprise customers in three years, the monitoring infrastructure that served well at the start becomes a critical bottleneck. Each customer in a multi-tenant SaaS model runs their own isolated application stack — dedicated compute, isolated databases, separate API gateways. At 1,500 customers, this translates to monitoring not one application but 1,500 distinct, independently operating application environments, each with its own health state, performance characteristics, and failure modes.

The operational complexity at this scale is qualitatively different from monitoring a single large application. Questions emerge that have no parallel in single-application monitoring: which of 1,500 customer environments is experiencing degraded performance right now? Is the issue isolated to one customer, or does it indicate a platform-wide problem? How do you give each customer team visibility into their own environment without exposing other customers' data? How do you detect when a customer workload is growing to the point where allocated resources will soon be insufficient?

Alert management at 1,500-environment scale is its own engineering challenge. If each customer environment generates even one spurious alert per week, the total alert volume reaches 214 per day before any real incidents occur. Without intelligent alert grouping, deduplication, and routing, the oncall team is buried in noise and real incidents get missed because engineers develop alert fatigue and start ignoring pages.

Dashboard organization for 50+ engineering teams supporting 1,500 customer environments requires a systematic, hierarchical approach. Individual dashboards for each customer environment are impractical to navigate. Platform-wide health views that aggregate across all customers hide individual customer issues. A multi-tier dashboard hierarchy is needed: platform-wide health at the top, customer tier health in the middle, and individual customer environment detail at the bottom — with efficient navigation between tiers.

Cost at scale is a forcing function for architecture decisions. Monitoring 1,500 customer environments on a per-host billing model where each environment has 5 to 10 hosts means 7,500 to 15,000 monitored hosts. At Datadog pricing of $46 per host including APM, this translates to $345,000 to $690,000 per month in monitoring costs — an amount that is simply not sustainable for most SaaS businesses and that consumes a significant fraction of gross margin.
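The cost arithmetic above can be sketched directly, using the figures from this section:

```python
# Per-host monitoring cost at scale (figures from the text:
# 1,500 environments, 5-10 hosts each, $46 per host with APM).
def monthly_cost(environments, hosts_per_env, price_per_host):
    return environments * hosts_per_env * price_per_host

low = monthly_cost(1500, 5, 46)    # 7,500 hosts
high = monthly_cost(1500, 10, 46)  # 15,000 hosts
print(low, high)  # 345000 690000
```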

The platform team managing this monitoring transformation defined clear requirements: monitoring coverage must scale to customer count without linear cost growth, alert quality must be maintainable by a team of fewer than 10 platform engineers, each customer team must have self-service access to its environment's observability data, and the total monitoring cost must be below $50,000 per month for 1,500 customer environments with full-stack monitoring.

02

Why Previous Tools Failed at This Scale

Lessons from Datadog and New Relic at 1,500-application scale

The platform started on Datadog at 200 customers, where the cost was manageable at approximately $85,000 per month for full-stack coverage. As the customer base grew to 500, then 1,000, the cost grew in near-perfect proportion. At 1,000 customers, the monitoring bill had reached $420,000 per month — a number that triggered board-level concern about unit economics.

Datadog's billing model created perverse operational incentives at scale. To control costs, the platform team implemented aggressive log sampling, reduced log retention from 30 days to 7 days, and disabled APM for standard-tier customer environments. This cost optimization reduced the bill by approximately $95,000 per month but created observability blind spots. Three separate P1 incidents in a four-month period were extended by 2 to 4 hours because the log data needed for diagnosis had been sampled or expired.

New Relic was evaluated as an alternative during the cost crisis. The user-based pricing model was initially appealing because it decoupled cost from infrastructure scale. But the per-full-platform-user pricing became prohibitive when the requirement was to give the technical contacts at each of 1,500 customers access to their environment's observability data. Even at a discounted enterprise rate of $250 per user, 1,500 customer users would cost $375,000 per month — equivalent to the Datadog bill being replaced.

Self-hosted observability was evaluated seriously. A Prometheus, Grafana, Loki, and Tempo stack for 1,500 environments was prototyped. The prototype worked technically, but the operational overhead was immediately apparent: managing 1,500 separate Prometheus instances, handling federation and aggregation across all environments, and maintaining the entire stack required an estimated 3 to 4 additional platform engineers. At $200,000 or more per engineer per year, the self-hosting savings were largely consumed by operational staffing costs.

The evaluation criteria that emerged from the failed alternatives were specific: the pricing model must not grow linearly with the number of monitored entities, the platform must support multi-tenant data isolation natively, the alerting system must support intelligent grouping at scale to prevent noise at 1,500x alert volume, and the total cost must be below $50,000 per month for 1,500 customer environments with full-stack monitoring.

Atatus met these criteria through a combination of its host-inclusive pricing model and its enterprise multi-tenant architecture. By right-sizing host allocations to match actual customer workload requirements and negotiating a volume commitment discount, the billable host count was optimized to approximately 8,000 hosts at a blended rate of $12 per host: $96,000 per month for complete coverage across all 1,500 customer environments.

03

Evaluation Criteria and Selection Process

How to score observability platforms when standard feature comparisons are insufficient

The evaluation process required a more rigorous methodology than typical APM tool evaluations. Standard feature checklists and demo environments are designed for small to medium deployments — they do not stress-test the scenarios that matter at enterprise scale. The team designed a Proof of Concept that specifically validated the platform under conditions representative of production scale.

The POC required each evaluated platform to demonstrate: simultaneous monitoring of 100 test environments representing 6.7% of the target scale, alert routing that correctly delivered environment-specific alerts to the correct team without cross-contamination, dashboard navigation that allowed an operator to move from a platform-wide health view to a specific customer environment detail view in under 3 clicks, and log search performance for queries spanning 100 environments returning results in under 5 seconds.

Multi-tenant data isolation was evaluated with a red-team test. In each candidate platform, a test user with permissions limited to one customer environment was used to attempt to access data from another customer environment — through direct URL manipulation, API calls, and dashboard widget queries. Any platform that failed this isolation test was immediately disqualified. Data isolation is a trust and compliance requirement that cannot be compromised.

Agent deployment automation was evaluated by measuring the time required to instrument a new customer environment from zero to full monitoring visibility. The target was under 15 minutes for a new environment to show APM traces, infrastructure metrics, and log data in the monitoring platform without manual intervention from the platform team. This requirement reflects the production constraint that new customer environments are provisioned automatically.

Alert quality was evaluated by replaying 30 days of historical incident data through each platform's alerting system and measuring signal-to-noise ratio: what percentage of alert firings corresponded to real customer-impacting incidents? Atatus scored highest on this evaluation, with 78% of alerts during the replay period corresponding to genuine customer-impacting conditions.

API completeness was evaluated by confirming that every configuration action needed — creating alert policies, provisioning customer workspaces, configuring agent deployments, exporting metrics data — was available via API. At 1,500 customer environments, manual configuration is not an option. Platforms with incomplete or inconsistent APIs were evaluated as significantly higher operational risk.

04

The Migration Journey: From Legacy Tools to Atatus

How the transition was managed for 1,500 live customer environments

The migration to Atatus was executed as a 16-week program divided into four phases: foundation (weeks 1 to 3), pilot migration (weeks 4 to 7), tier-based rollout (weeks 8 to 13), and decommission and optimization (weeks 14 to 16). Each phase had explicit entry criteria, exit criteria, and rollback triggers. No phase began until the previous phase exit criteria were met and documented.

The foundation phase established the Atatus account structure, workspace organization, team access controls, and API automation before any customer environment was migrated. This upfront investment in infrastructure — configuring Terraform-managed Atatus resources, building the provisioning scripts that would create new customer workspaces automatically, setting up the alert routing configuration — prevented the chaos that typically occurs when infrastructure is built reactively during migration.

The pilot migration phase selected 50 customer environments across multiple customer tiers and geographic regions. These 50 environments were migrated to Atatus while maintaining Datadog in parallel. Each environment ran dual instrumentation for two weeks, and automated comparison scripts validated that metric values were within 3% between platforms, log volumes matched, and APM error rates were equivalent. The pilot phase surfaced 12 configuration issues — all resolved before the broader rollout began.
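The comparison logic behind the validation scripts can be sketched as follows; the metric names and data shapes are hypothetical placeholders — only the 3% tolerance is from the text:

```python
# Sketch of the dual-instrumentation validation: compare the same metric
# sampled from both platforms and flag deviations beyond 3%.
TOLERANCE = 0.03

def within_tolerance(old_value, new_value, tolerance=TOLERANCE):
    """True if new_value deviates from old_value by at most `tolerance`."""
    if old_value == 0:
        return new_value == 0
    return abs(new_value - old_value) / abs(old_value) <= tolerance

def validate_environment(datadog_metrics, atatus_metrics):
    """Return the metric names that disagree between the two platforms."""
    return [
        name for name, old in datadog_metrics.items()
        if not within_tolerance(old, atatus_metrics.get(name, 0.0))
    ]

# Example: p95 latency matches within 3%, error rate does not.
mismatches = validate_environment(
    {"p95_ms": 210.0, "error_rate": 0.010},
    {"p95_ms": 214.0, "error_rate": 0.013},
)
print(mismatches)  # ['error_rate']
```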

The tier-based rollout migrated customer environments in cohorts organized by tier: enterprise customers (100 environments) first, because these customers have dedicated support relationships and could provide prompt feedback on any issues; standard customers (600 environments) second over a four-week period at approximately 150 environments per week; and starter customers (800 environments) last.

Each weekly cohort migration followed a Monday-to-Friday sequence: Monday — deploy Atatus agents alongside existing Datadog agents; Tuesday and Wednesday — automated validation comparing both platforms; Thursday — alert routing switched to Atatus for the cohort; Friday — validate oncall experience for the new routing; following Monday — Datadog agents decommissioned if validation passed.

The decommission phase began after the final cohort was validated and involved removing Datadog agents from all environments, closing the Datadog account, and completing the final cost comparison. The team preserved the Datadog configuration exports in an archive repository as a historical reference. The Datadog contract was terminated with 30 days' notice, realizing the first month of full cost savings in week 18 of the program.

05

Architecture: Agents, Collectors, and Data Pipeline

The technical infrastructure that makes monitoring 1,500 environments manageable

Each customer environment runs an Atatus APM agent embedded in the application Docker image, deployed as part of the environment's standard Terraform module. The agent configuration is templated with environment-specific values — API key, environment tag, customer ID tag — injected at provisioning time. New environments are born with monitoring enabled; there is no separate step to add monitoring later, which would inevitably be skipped for some environments.
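A minimal sketch of the provisioning-time templating step, using Python's `string.Template`; the config keys and values here are illustrative, not the actual Atatus agent schema:

```python
# Environment-specific values are injected into a shared agent config
# template at provisioning time. Key names are illustrative placeholders.
from string import Template

AGENT_CONFIG_TEMPLATE = Template(
    "api_key: $api_key\n"
    "environment: $environment\n"
    "tags:\n"
    "  customer_id: $customer_id\n"
)

def render_agent_config(api_key, environment, customer_id):
    return AGENT_CONFIG_TEMPLATE.substitute(
        api_key=api_key, environment=environment, customer_id=customer_id
    )

print(render_agent_config("key-123", "prod-eu", "cust-0042"))
```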

Infrastructure metrics collection uses the Atatus Infrastructure agent deployed as a DaemonSet on each customer's Kubernetes cluster. The DaemonSet approach ensures every node in the cluster is monitored without requiring per-pod agent configuration. Node metrics including CPU, memory, disk I/O, and network are all collected through the single DaemonSet deployment.

Log collection uses Fluent Bit deployed as a DaemonSet alongside the Atatus Infrastructure agent. Fluent Bit collects container logs from the Docker socket, applies customer-specific metadata tags including customer ID, environment tier, and region, then forwards to the Atatus log ingest endpoint with TLS encryption.

The customer workspace isolation model is the architectural cornerstone of the multi-tenant monitoring setup. Each customer environment is assigned to a dedicated Atatus workspace, provisioned automatically by the environment creation automation. Atatus RBAC controls enforce workspace-level data isolation, preventing any user from querying data across workspace boundaries.

A central platform workspace aggregates metrics from all customer workspaces using Atatus cross-workspace query capability. This aggregation layer enables the platform-wide health views that the internal operations team uses to monitor the overall SaaS platform health — detecting when multiple customer environments in the same region show correlated degradation, indicating a shared infrastructure problem rather than individual customer issues.

The data pipeline from agent to Atatus dashboard operates with end-to-end latency of under 30 seconds for metrics and traces, and under 60 seconds for logs. For the subset of enterprise customers with sub-minute SLA requirements, synthetic monitoring checks running every 30 seconds provide faster degradation detection than passive metric collection.

06

Monitoring Strategy: Tiered Coverage by Criticality

How different service tiers receive different monitoring depth and alerting sensitivity

Not all 1,500 customer environments are equal in business criticality, and the monitoring strategy reflects this reality. Enterprise customers (100 environments) represent approximately 60% of ARR despite being 7% of the customer count. These environments receive the highest monitoring depth: full APM tracing with 100% trace sampling, 90-day log retention, synthetic checks every 30 seconds, custom metric tracking for SLA-relevant business events, and dedicated Slack channels for their environment alerts.

Standard customers (600 environments) represent approximately 35% of ARR. These environments are monitored with full APM tracing at 10% trace sampling (capturing all errors and slow transactions regardless of sampling rate), 30-day log retention, synthetic checks every 5 minutes, and alert routing to shared tier-specific notification channels. The reduced trace sampling rate reduces data volume without meaningfully impacting incident diagnosis capability.

Starter customers (800 environments) represent approximately 5% of ARR and receive baseline monitoring: infrastructure metrics collection, application error rate monitoring, and synthetic availability checks every 10 minutes. APM tracing is configured at 1% sampling for normal traffic with 100% error sampling. Log retention is 7 days.

Dynamic tier promotion is automated for environments that consistently exceed their tier resource allocation or exhibit elevated alert frequency. An environment that triggers more than 10 alerts per week is automatically reviewed for tier promotion — high alert frequency often indicates that the environment has grown to a level of operational complexity that warrants higher monitoring depth.
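The promotion trigger can be sketched as a simple weekly check; the data shape and environment IDs are illustrative:

```python
# Flag environments whose trailing-week alert count exceeds the
# tier-promotion review threshold stated above (10 alerts per week).
ALERTS_PER_WEEK_THRESHOLD = 10

def environments_to_review(weekly_alert_counts):
    """weekly_alert_counts: {environment_id: alerts in the trailing week}."""
    return sorted(
        env for env, count in weekly_alert_counts.items()
        if count > ALERTS_PER_WEEK_THRESHOLD
    )

flagged = environments_to_review({"cust-001": 3, "cust-002": 14, "cust-003": 11})
print(flagged)  # ['cust-002', 'cust-003']
```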

Development and staging environments for internal teams are monitored at a minimal footprint — infrastructure metrics and application error tracking only, with no synthetic monitoring and 3-day log retention. Separate alert policies for development environments with much higher thresholds keep development noise from contaminating production alert channels.

The tiered monitoring strategy is codified in the Terraform modules that provision each environment tier. There is no manual configuration decision for monitoring depth — the tier assigned at customer onboarding determines the Terraform module used, which automatically configures the Atatus workspace, alert policies, and retention settings for that tier.

07

Alert Strategy and Noise Reduction at Scale

Managing alert quality when 1,500 environments each generate their own signals

The alert noise problem at 1,500-environment scale is severe if not addressed architecturally. Without deliberate noise reduction, even a 1% daily false-positive rate per environment generates 15 spurious alerts per day — enough to desensitize an oncall engineer within the first week of a rotation. The alert strategy must be built around three principles: high signal-to-noise ratio as the primary metric, environment-appropriate sensitivity, and intelligent grouping to correlate related alerts.

Atatus alert grouping is the most impactful noise reduction mechanism at scale. Rather than sending one PagerDuty notification per alert condition per environment, the alert grouping configuration correlates alerts that share common attributes — same region, same customer tier, same metric type, within a 10-minute time window — into a single grouped incident. A regional infrastructure issue causing 50 customer environments to trigger availability alerts simultaneously generates one grouped incident rather than 50 individual pages.
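The grouping behavior described above can be sketched as follows, assuming a simple alert dict shape (not the actual Atatus data model):

```python
# Correlate alerts sharing (region, tier, metric_type) within a
# 10-minute window into a single incident group.
from collections import defaultdict

WINDOW_SECONDS = 600

def group_alerts(alerts):
    """alerts: dicts with region, tier, metric_type, timestamp (epoch seconds).
    Returns a list of incident groups (each a list of correlated alerts)."""
    buckets = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["region"], alert["tier"], alert["metric_type"])
        groups = buckets[key]
        # Join the latest group if this alert falls inside its window.
        if groups and alert["timestamp"] - groups[-1][0]["timestamp"] <= WINDOW_SECONDS:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return [g for groups in buckets.values() for g in groups]

# 50 simultaneous availability alerts in one region -> one incident group.
storm = [
    {"region": "eu-west-1", "tier": "standard", "metric_type": "availability",
     "timestamp": 1000 + i}
    for i in range(50)
]
print(len(group_alerts(storm)))  # 1
```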

Baseline-aware alerting replaces fixed-threshold alerting for metrics that exhibit strong time-of-day and day-of-week patterns. Customer application request rates follow business-hours patterns — peak during business hours in the customer timezone, quiet overnight and on weekends. Atatus anomaly detection mode compares current metric values to the historical baseline for the same time window, alerting only when the deviation is statistically significant.
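A minimal sketch of baseline comparison, using a z-score against historical samples for the same weekly time window; the 3-sigma threshold is an illustrative assumption, not Atatus's actual anomaly detection algorithm:

```python
# Alert only when the current value deviates significantly from the
# historical baseline for the same time-of-day/day-of-week window.
from statistics import mean, stdev

def is_anomalous(current, baseline_samples, z_threshold=3.0):
    """baseline_samples: historical values for the same weekly time window."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Tuesdays at 10:00 over recent weeks: ~1000 req/min is normal.
history = [980, 1010, 995, 1022, 990, 1005]
print(is_anomalous(1015, history))  # False: within the normal band
print(is_anomalous(400, history))   # True: statistically significant drop
```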

The platform team maintains a weekly alert quality review where they calculate the signal-to-noise ratio for the previous week's alerts: what percentage of alert firings resulted in a human taking a meaningful action? Alert conditions where the signal-to-noise ratio falls below 50% are reviewed and recalibrated. This process is the continuous improvement mechanism that keeps alert quality high as the environment evolves.

Customer-facing alert visibility is managed through a customer status portal that is updated automatically when Atatus alerts fire for a customer environment. The integration between Atatus webhooks and the status portal API ensures that customers see their environment health status in real time without requiring the internal operations team to manually update the portal during incidents.

Maintenance windows are integrated into the alert pipeline to suppress expected alerts during scheduled maintenance. The provisioning automation creates Atatus maintenance windows as part of the maintenance scheduling workflow — any alert that fires during a customer environment's scheduled maintenance window is acknowledged automatically without waking an oncall engineer. This automation alone saves an estimated 3 to 5 unnecessary pages per week across the customer fleet.
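The suppression decision can be sketched like so; the environment IDs and window representation are illustrative:

```python
# An alert firing inside a scheduled maintenance window for its
# environment is auto-acknowledged instead of paging the oncall engineer.
def should_page(alert_env, alert_time, maintenance_windows):
    """maintenance_windows: {env_id: list of (start, end) epoch tuples}."""
    for start, end in maintenance_windows.get(alert_env, []):
        if start <= alert_time <= end:
            return False  # expected during maintenance: auto-acknowledge
    return True

windows = {"cust-007": [(1000, 2000)]}
print(should_page("cust-007", 1500, windows))  # False
print(should_page("cust-007", 2500, windows))  # True
```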

08

Dashboard Organization for 50+ Engineering Teams

A hierarchical dashboard architecture that scales to any number of environments

The dashboard architecture uses a three-tier hierarchy: the Platform Health dashboard at the top, Customer Tier dashboards in the middle, and Customer Environment dashboards at the bottom. Navigation between tiers uses URL-based linking with environment ID parameters so that operations team members can drill from a platform-wide anomaly to a specific customer environment detail in two clicks.

The Platform Health dashboard is designed for the VP of Engineering and the platform operations team. It shows total environments healthy versus degraded versus down, p99 response time and error rate aggregated across all environments with separate lines per customer tier, alert count by severity for the trailing 24 hours, and a geographic heat map showing which AWS regions have the highest current error rates. This dashboard refreshes every 60 seconds and is displayed on a wall monitor in the operations center.

Customer Tier dashboards provide a mid-level view for the team responsible for each tier. The Enterprise Tier dashboard shows all 100 enterprise environments as a table with columns for availability, p95 response time, error rate, and last alert time. Rows where any column is in a degraded state are highlighted in amber or red. Clicking a row navigates to that environment's detail dashboard.

Customer Environment dashboards follow a standardized template generated automatically when a new customer workspace is provisioned. The template includes: application response time and throughput time series, error rate and error breakdown by type, database query performance showing the top 10 slow queries, infrastructure resource utilization, active alerts summary, and recent deployment events.

Dashboard version control is implemented using the Atatus API and a Git repository. Dashboard configurations are exported from Atatus nightly via API call, committed to the repository, and any drift between the committed configuration and the live Atatus configuration is flagged in a weekly audit report. This version control approach prevents the dashboard sprawl phenomenon where dashboards are created ad hoc during incidents and then abandoned.
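Only the diffing half of the audit is sketched here; the nightly export call is elided because the exact Atatus API endpoint and payload shape are not specified in the text:

```python
# Compare committed dashboard configs against a live export and report
# dashboards that changed, were added, or were removed.
import json

def drifted_dashboards(committed, live):
    """Both args: {dashboard_id: config dict}. Returns drifted ids, sorted."""
    drifted = []
    for dash_id in sorted(set(committed) | set(live)):
        # Canonicalize each config so key ordering doesn't cause false drift.
        a = json.dumps(committed.get(dash_id), sort_keys=True)
        b = json.dumps(live.get(dash_id), sort_keys=True)
        if a != b:
            drifted.append(dash_id)
    return drifted

committed = {"platform-health": {"widgets": 12}, "tier-enterprise": {"widgets": 8}}
live = {"platform-health": {"widgets": 13}, "tier-enterprise": {"widgets": 8}}
print(drifted_dashboards(committed, live))  # ['platform-health']
```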

Team workspace dashboards are maintained by each engineering team for their own internal use and organizational metrics. The 50+ engineering teams that contribute to the SaaS platform have their own workspace within Atatus where they can create custom dashboards tracking metrics relevant to their components without affecting the standardized operational dashboards.

09

Cost Optimization Techniques at Scale

How to reduce monitoring costs while maintaining coverage across 1,500 environments

Host right-sizing is the highest-impact cost optimization technique. Each customer environment's host allocation is determined at provisioning time based on the customer's contracted workload tier. As customers' actual usage data accumulates in Atatus, the platform team runs a monthly right-sizing analysis to identify environments where the allocated host count significantly exceeds actual utilization. Downsizing overprovisioned environments reduces both compute costs and monitoring costs simultaneously.
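The monthly pass can be sketched as follows; the 40% utilization ceiling and data shape are illustrative assumptions:

```python
# Flag environments whose peak utilization is well below their allocation,
# making them candidates for downsizing.
def oversized_environments(envs, utilization_ceiling=0.4):
    """envs: {env_id: (allocated_hosts, peak_hosts_used)}.
    Flags environments using under `utilization_ceiling` of their allocation."""
    return sorted(
        env for env, (allocated, used) in envs.items()
        if allocated > 0 and used / allocated < utilization_ceiling
    )

fleet = {"cust-101": (10, 3), "cust-102": (8, 7), "cust-103": (6, 2)}
print(oversized_environments(fleet))  # ['cust-101', 'cust-103']
```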

Trace sampling optimization by tier reduces the volume of APM data stored without meaningfully impacting incident diagnosis capability. The team's analysis showed that 95% of production incidents could be diagnosed from error traces and traces with response time in the top 1% — both captured at 100% regardless of the overall sampling rate. Reducing the overall sampling rate from 100% to 10% for standard tier environments reduced APM data volume by approximately 85% for that tier.
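The sampling rule can be sketched like this, with a fixed 1-second cutoff standing in for the "top 1% response time" criterion (an approximation for illustration; the trace shape is also assumed):

```python
# Always keep error traces and very slow traces; sample the rest at the
# tier rate (e.g. 0.10 for the standard tier).
import random

def keep_trace(trace, tier_sample_rate, slow_threshold_ms=1000, rng=random.random):
    if trace["error"]:
        return True                    # 100% of error traces
    if trace["duration_ms"] >= slow_threshold_ms:
        return True                    # 100% of slow traces
    return rng() < tier_sample_rate    # sampled normal traffic

print(keep_trace({"error": True, "duration_ms": 50}, 0.10))     # True
print(keep_trace({"error": False, "duration_ms": 2500}, 0.10))  # True
```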

Log sampling for health-check and heartbeat endpoints is a simple optimization that yields significant volume reduction. Application health check endpoints typically generate high-frequency log entries that carry no diagnostic value. Configuring Fluent Bit to drop log lines from known health check paths eliminates 15 to 25% of total log volume for most customer environments.
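A Fluent Bit grep filter along these lines drops health-check lines before they are forwarded; the match pattern and endpoint paths are illustrative, not the team's actual configuration:

```
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  log (healthz|readyz|heartbeat)
```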

Metric aggregation for customer tier views reduces the custom metric count required for platform-wide dashboards. Rather than querying individual metrics from all 1,500 workspaces and aggregating in the dashboard query, the team implemented a metric aggregation pipeline that pre-computes tier-level and platform-level aggregates every minute. These pre-computed aggregates dramatically reduce the query complexity and cost of platform-wide dashboards.
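The roll-up step can be sketched minimally; the per-minute sample shape (tier, environment, error count) is an assumption for illustration:

```python
# Pre-compute tier-level aggregates once per minute so platform dashboards
# query a handful of series instead of 1,500 per-environment metrics.
from collections import defaultdict

def tier_totals(samples):
    """samples: list of (tier, env_id, error_count). Returns {tier: total}."""
    totals = defaultdict(int)
    for tier, _env, errors in samples:
        totals[tier] += errors
    return dict(totals)

minute = [("enterprise", "c1", 4), ("enterprise", "c2", 6), ("standard", "c3", 2)]
print(tier_totals(minute))  # {'enterprise': 10, 'standard': 2}
```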

Retention tier management assigns different log retention periods to different log types based on their diagnostic value over time. Application error logs and APM traces have high diagnostic value for incident retrospectives — these are retained for the full tier-defined retention period. Access logs, heartbeat logs, and routine batch operation logs have low diagnostic value after 24 to 48 hours — these are retained for 3 days regardless of the customer tier retention setting. This hybrid retention approach reduces total log storage volume by approximately 40%.

Annual contract optimization with the Atatus enterprise team resulted in a volume commitment discount for the total host count across all customer environments. By committing to a minimum host count for the full annual contract term with provisions for growth above the committed floor, the blended per-host rate was reduced by approximately 20% compared to pay-as-you-go pricing.

10

Results: Measured Outcomes After Full Migration

Quantified improvements in cost, MTTR, coverage, and engineer satisfaction

Cost reduction: Total monitoring spend dropped from $420,000 per month on Datadog at 1,000 customer environments to $96,000 per month on Atatus at 1,500 customer environments — a 73% cost reduction while simultaneously increasing the monitored environment count by 50%. On a per-environment basis, the cost dropped from $420 per environment per month to $64 per environment per month. The annual savings were approximately $3.9 million, redirected to product engineering hiring.

MTTR improvement: Mean time to resolution for customer-impacting incidents dropped from 47 minutes on the legacy tooling to 16 minutes on Atatus — a 66% reduction. The improvement was attributed to three factors: faster alert delivery with Atatus alert-to-PagerDuty latency averaging 45 seconds versus 3 minutes for the previous configuration, better trace context in alert notifications, and faster log search with proper customer ID tagging.

Monitoring coverage improvement: Under the cost-constrained Datadog configuration, 340 environments (23% of the total at that time) were operating with reduced monitoring depth — APM disabled, log retention reduced to 7 days, or synthetic monitoring absent. After the Atatus migration, 100% of customer environments receive full monitoring depth appropriate to their tier. This coverage improvement meant that incidents in previously under-monitored environments were detected proactively rather than through customer support tickets.

Alert quality improvement: The alert signal-to-noise ratio improved from 38% on the legacy configuration to 81% on Atatus. This improvement was driven by the implementation of baseline-aware alerting, alert grouping for correlated conditions, and a systematic reduction of the alert inventory from 4,200 discrete alert conditions to 890 well-designed conditions covering the same scope more intelligently.

Engineer satisfaction improvement: An internal survey of 50 platform engineers conducted 90 days after migration completion showed that 82% rated Atatus as easier to use for incident investigation compared to the previous tooling, and 91% reported feeling more confident during oncall shifts. Oncall engineer burnout scores improved by 31% over the 12 months following migration.

New customer time-to-monitoring improvement: The automated provisioning pipeline reduced the time from new customer environment creation to full monitoring visibility in Atatus from 48 minutes with manual Datadog agent configuration to 8 minutes with fully automated Atatus workspace provisioning, agent deployment, and dashboard generation.

11

Lessons Learned and Recommendations

Hard-won insights for teams planning large-scale observability deployments

Invest more heavily in alert quality upfront. The team spent the first 60 days post-migration tuning alert conditions to reduce noise that was introduced by directly migrating Datadog monitor configurations to Atatus without reviewing them for quality. The migration was an opportunity to redesign the alert strategy from scratch, and the team wishes they had taken that opportunity rather than treating alert migration as a translation exercise.

Build the automation before starting the migration, not during it. The provisioning scripts, validation automation, and alert routing configuration that were built during the migration program would have been more effective if designed and tested before the first customer environment was migrated. Several migration delays were caused by automation issues discovered mid-migration.

Establish data retention policies before you have data. Retention configuration decisions made under time pressure during migration frequently prove to be wrong decisions that require costly remediation later. Spend time analyzing actual log access patterns — which logs are actually queried more than 30 or 60 days after ingestion — before deciding on retention tiers.

Involve customer success and support teams in the migration design. The monitoring tool serves not only internal engineering teams but also customer-facing teams who use monitoring data to respond to customer support tickets. The initial migration design did not adequately consult these teams, leading to dashboard gaps discovered through support team feedback after migration.

Treat monitoring configuration as production infrastructure. Every alert policy, dashboard, and retention rule should be version controlled, reviewed via pull request, and deployed through automation — never created or modified directly through the monitoring platform UI in production. Teams that manage monitoring through UI clicks accumulate configuration debt that makes incidents more dangerous and migrations more expensive.

Plan for multi-tier retention from the beginning. Not all observability data has the same value over time — high-frequency metrics and access logs are primarily useful in the immediate post-incident window, while error logs and APM traces retain diagnostic value for post-mortems weeks later. Designing retention tiers that reflect actual data access patterns from the start avoids the costly data archaeology exercises that occur when retention is set too aggressively.

Key Takeaways

  • Monitoring 1,500 customer environments requires a qualitatively different architecture than monitoring a single large application — multi-tenant data isolation, hierarchical dashboards, and intelligent alert grouping are essential architectural requirements, not nice-to-haves.
  • Linear cost growth with host count on usage-based pricing platforms is unsustainable at enterprise scale; at 1,500 customer environments, the difference between platforms is measured in millions of dollars per year.
  • The migration from Datadog achieved a 73% cost reduction from $420,000 per month to $96,000 per month while increasing monitored environment count by 50% and improving MTTR by 66%.
  • Alert signal-to-noise ratio is the primary metric for alert system quality at scale; achieving above 80% requires baseline-aware alerting, intelligent grouping, and ongoing quality review.
  • Monitoring configuration belongs in version control and infrastructure-as-code — teams that manage monitoring through UI clicks accumulate configuration debt that makes incidents more dangerous and migrations more expensive.
  • Customer-facing observability — customer portal dashboards, SLA reports, incident notifications — drives measurable business outcomes including reduced support ticket volume, higher NPS, and lower churn.