Why Migrate a Metrics Pipeline at All?
When you're running a platform like Airbnb, metrics are the nervous system of your infrastructure. For years, the team relied on StatsD with a Veneur sidecar, a setup that worked but showed its age as scale grew. The problems were classic: UDP packet loss under high throughput, limited histogram types, and vendor lock-in.
The team decided to go all-in on open source: OpenTelemetry (OTLP) for instrumentation and Prometheus for storage. But the gap between where they were and where they wanted to be was enormous. This is the story of how they bridged it—without breaking production.
The Dual-Write Strategy: Don't Migrate, Parallelize
One of the most practical lessons from this migration is the dual-write approach. Instead of a risky cut-over, Airbnb enabled their shared metrics library to emit both StatsD (to the old pipeline) and OTLP (to the new OTel Collector) simultaneously. This allowed them to:
- Validate correctness by comparing both streams
- Expose scaling bottlenecks early (and they found big ones)
- Let users migrate dashboards and alerts at their own pace
Key takeaway: If you're planning a metrics migration, start by instrumenting dual emission. It's more work upfront, but it saves months of firefighting later.
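To make the dual-emission idea concrete, here is a minimal sketch of a counter wrapper that writes to both pipelines, plus a comparison step. It is not Airbnb's shared metrics library: `InMemorySink`, `DualWriteCounter`, and `find_mismatches` are hypothetical names, and the in-memory sinks stand in for real StatsD and OTLP clients.

```python
from collections import defaultdict


class InMemorySink:
    """Hypothetical stand-in for a StatsD or OTLP client; it only records totals."""

    def __init__(self):
        self.totals = defaultdict(float)

    def increment(self, name, value, tags):
        # Key each series by metric name plus its sorted tag pairs
        key = (name, tuple(sorted(tags.items())))
        self.totals[key] += value


class DualWriteCounter:
    """Emit every increment to both the legacy and the new pipeline."""

    def __init__(self, name, legacy_sink, new_sink):
        self.name = name
        self.legacy_sink = legacy_sink
        self.new_sink = new_sink

    def increment(self, value=1, **tags):
        self.legacy_sink.increment(self.name, value, tags)
        self.new_sink.increment(self.name, value, tags)


def find_mismatches(legacy, new, rel_tolerance=0.01):
    """Return series whose totals differ by more than rel_tolerance."""
    mismatches = {}
    for key in set(legacy.totals) | set(new.totals):
        old_value = legacy.totals.get(key, 0.0)
        new_value = new.totals.get(key, 0.0)
        baseline = max(abs(old_value), abs(new_value), 1e-9)
        if abs(old_value - new_value) / baseline > rel_tolerance:
            mismatches[key] = (old_value, new_value)
    return mismatches


statsd_sink, otlp_sink = InMemorySink(), InMemorySink()
bookings = DualWriteCounter("bookings_total", statsd_sink, otlp_sink)
bookings.increment(region="us-east")
bookings.increment(2, region="eu-west")

# Simulate a sample that reached only the new pipeline (e.g. UDP loss on the old one)
otlp_sink.increment("bookings_total", 1, {"region": "us-east"})

mismatches = find_mismatches(statsd_sink, otlp_sink)
```

Only the `us-east` series is reported, which is the kind of signal a validation job can surface before anyone migrates a dashboard. The 1% tolerance is an arbitrary choice for the sketch.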
The OTLP Performance Surprise
While most services handled dual-write smoothly, the highest-volume metric emitters (10K+ samples/sec per instance) hit a wall. Memory pressure spiked, GC activity increased, and heap grew. The culprit? Cumulative temporality in the OpenTelemetry SDK, which retains full state for every label combination between exports.
The fix was simple but surgical: switch those services to delta temporality. Delta mode only tracks changes between exports, dramatically reducing in-process memory. The trade-off? Gaps in data if the service crashes mid-interval. For most services, cumulative mode stayed on; only the top 1% of emitters needed the switch.
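# Example: preferring delta temporality in the Python OTel SDK.
# In Python, temporality is chosen per instrument type on the exporter (or reader),
# not on MeterProvider.
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import Counter, Histogram, MeterProvider
from opentelemetry.sdk.metrics.export import AggregationTemporality, PeriodicExportingMetricReader

# Export deltas for the instrument types that hold the most per-series state
delta_preference = {
    Counter: AggregationTemporality.DELTA,
    Histogram: AggregationTemporality.DELTA,
}
exporter = OTLPMetricExporter(preferred_temporality=delta_preference)
provider = MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])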
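Why delta mode relieves memory pressure can be seen in a toy model. The sketch below is not the OTel SDK's internals: `CumulativeAggregator` and `DeltaAggregator` are hypothetical classes that only mimic the retention behavior described above, and the workload (label sets that never repeat) is an assumed worst case.

```python
class CumulativeAggregator:
    """Keeps a running total for every label set ever seen."""

    def __init__(self):
        self.state = {}

    def record(self, labels, value):
        self.state[labels] = self.state.get(labels, 0) + value

    def export(self):
        # State survives the export, so memory grows with total cardinality
        return dict(self.state)


class DeltaAggregator:
    """Keeps only what changed since the last export."""

    def __init__(self):
        self.state = {}

    def record(self, labels, value):
        self.state[labels] = self.state.get(labels, 0) + value

    def export(self):
        # Hand off the interval's changes and start the next interval empty
        exported, self.state = self.state, {}
        return exported


cumulative, delta = CumulativeAggregator(), DeltaAggregator()

# Three export intervals, each touching 1,000 label sets that never repeat
for interval in range(3):
    for i in range(1000):
        labels = (("bucket", f"{interval}-{i}"),)
        cumulative.record(labels, 1)
        delta.record(labels, 1)
    cumulative.export()
    delta.export()

retained_cumulative = len(cumulative.state)  # every label set seen so far
retained_delta = len(delta.state)            # nothing held after an export
```

The cumulative model holds all 3,000 label sets after three intervals, while the delta model holds none. The same hand-off is what makes delta mode lossy: whatever sits in `state` when the process dies is gone.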
Streaming Aggregation at Airbnb Scale
Raw metrics are expensive. A single pod emits thousands of label combinations (pod, hostname, region, etc.). Storing all of them would multiply costs by 10x. The solution: streaming aggregation—strip instance-level labels before storage.
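As a rough illustration of what "strip instance-level labels" means, the sketch below sums counter increments across series that differ only in the dropped labels. The label set in `INSTANCE_LABELS` and the sample data are assumptions made for the example. vmagent expresses this as configuration rather than code.

```python
from collections import defaultdict

# Assumed instance-level labels; the real list depends on the deployment
INSTANCE_LABELS = frozenset({"pod", "hostname"})


def aggregate(samples, drop_labels=INSTANCE_LABELS):
    """Sum counter increments across series that differ only in dropped labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


raw_samples = [
    ({"service": "search", "region": "us-east", "pod": "search-1", "hostname": "h1"}, 5),
    ({"service": "search", "region": "us-east", "pod": "search-2", "hostname": "h2"}, 7),
    ({"service": "search", "region": "eu-west", "pod": "search-3", "hostname": "h3"}, 2),
]

aggregated = aggregate(raw_samples)
```

Three input series collapse into two stored series, one per region. With thousands of pods per service, the reduction is correspondingly larger.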
Airbnb evaluated several options (Veneur rewrite, recording rules, m3aggregator, OTel Collector, Vector) and landed on vmagent from VictoriaMetrics. Why vmagent won:
| Feature | vmagent | Alternatives (OTel Collector, Vector, etc.) |
|---|---|---|
| Streaming aggregation | ✅ Built-in | ❌ Not supported or experimental |
| Horizontal scaling via sharding | ✅ Simple config | ❌ Complex or missing |
| Codebase size | ~10K LOC (easy to modify) | 50K-100K+ LOC |
| Documentation | Excellent | Varies |
The architecture uses two layers of vmagents: routers (stateless, hash metrics by labels) and aggregators (stateful, maintain running totals). This design scaled to hundreds of aggregators ingesting over 100 million samples per second.
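The router layer's job can be sketched in a few lines: every sample that contributes to the same aggregated series must reach the same aggregator, so the shard is derived from a stable hash of the labels that survive aggregation. This is an illustration, not vmagent's code. `output_series_key`, `pick_aggregator`, the choice of SHA-256, and the dropped label set are all assumptions.

```python
import hashlib

INSTANCE_LABELS = frozenset({"pod", "hostname"})  # assumed; excluded before hashing


def output_series_key(name, labels):
    """Identity of the aggregated series: name plus the labels that survive aggregation."""
    kept = sorted((k, v) for k, v in labels.items() if k not in INSTANCE_LABELS)
    return name + "{" + ",".join(f"{k}={v}" for k, v in kept) + "}"


def pick_aggregator(name, labels, num_aggregators):
    """Stateless routing: a stable hash of the output series picks the shard."""
    digest = hashlib.sha256(output_series_key(name, labels).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_aggregators


shard_a = pick_aggregator("requests_total", {"service": "search", "pod": "search-1"}, 16)
shard_b = pick_aggregator("requests_total", {"service": "search", "pod": "search-2"}, 16)
```

Both pods map to the same shard because `pod` is excluded from the key, so one aggregator sees every increment for `requests_total{service=search}`. Plain modulo hashing reshuffles series when the aggregator count changes, so a production router would likely need a more stable scheme.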
Practical advice: If you're building a metrics pipeline at scale, don't underestimate the value of a small, focused codebase. vmagent's ~10K lines made it easy for Airbnb to add custom features like native histogram support and Mimir-style multitenancy—and contribute back upstream.
The Sparse Counter Problem (and the Zero Injection Fix)
After migration, a nasty bug surfaced: PromQL queries using rate() were consistently undercounting compared to the old StatsD system.
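The root cause is a known edge case in Prometheus semantics. rate() is computed from the differences between consecutive samples in the query window, so the first sample of a series has nothing to be compared against. When a counter is created and incremented before Prometheus first scrapes it, the series first appears at 1 rather than 0, and that initial increment is lost. For high-rate counters this is negligible, but for sparse counters (e.g., requests per currency per user per region, with perhaps one increment per day), every increment matters.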
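A small worked example shows the undercount. `simple_increase` below is a deliberately simplified stand-in for PromQL's increase(): it sums the differences between consecutive samples and treats a drop as a counter reset, but it skips the window extrapolation that real Prometheus applies.

```python
def simple_increase(samples):
    """Sum of increases between consecutive samples, treating drops as counter resets.

    Ignores the window extrapolation that real PromQL applies.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase
        total += (current - previous) if current >= previous else current
    return total


# Counter created and incremented once before the first scrape
first_seen_at_one = [1, 1, 1]
# The same counter if it had been exposed at zero before the increment
pre_initialized = [0, 1, 1]

lost = simple_increase(first_seen_at_one)
counted = simple_increase(pre_initialized)
```

The first series yields an increase of 0 because no sample ever differs from its predecessor. The second yields 1. For a counter that fires once a day, that difference is the entire signal.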
Airbnb considered several workarounds (pre-initialize all counters, use logs, emit gauges, hacky PromQL), but all were impractical. Instead, they built a transparent fix: zero injection in the aggregation tier.
How Zero Injection Works
When the aggregator flushes a counter for the first time, it injects a synthetic zero instead of the actual value. The real increment is delayed by one flush interval. This ensures Prometheus's rate() sees a proper baseline and doesn't lose the first increment.
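# Simplified sketch of the zero injection logic (not Airbnb's actual implementation)
class SparseCounterAggregator:
    def __init__(self):
        self.first_flush = True
        self.running_total = 0  # cumulative total, as a Prometheus counter expects

    def add(self, delta):
        # Accumulate increments received between flushes
        self.running_total += delta

    def flush(self):
        if self.first_flush:
            self.first_flush = False
            return 0  # inject a synthetic zero as the baseline sample
        # Later flushes expose the running total, which includes the
        # increments that were held back during the first interval
        return self.running_total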
The trade-off: the first increment is delayed by one flush interval (e.g., 10 seconds). For sparse counters, that's invisible. The benefit: all counters are implicitly initialized to zero, matching Prometheus semantics perfectly.
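An aggregator tracks many series at once, so the "first flush" state has to be kept per series. The sketch below extends the idea under that assumption. `ZeroInjectingAggregator` and the series name are hypothetical, and it assumes the aggregator emits cumulative totals rather than per-interval deltas.

```python
class ZeroInjectingAggregator:
    """Tracks cumulative totals per series and emits 0 the first time a series is flushed."""

    def __init__(self):
        self.totals = {}        # series key -> cumulative total
        self.announced = set()  # series that have already emitted their zero baseline

    def add(self, series, delta):
        self.totals[series] = self.totals.get(series, 0) + delta

    def flush(self):
        output = {}
        for series, total in self.totals.items():
            if series in self.announced:
                output[series] = total
            else:
                # First flush for this series: publish the baseline, hold the real value
                self.announced.add(series)
                output[series] = 0
        return output


agg = ZeroInjectingAggregator()
agg.add("payments_total{currency=ISK}", 1)
first = agg.flush()   # baseline only
second = agg.flush()  # the held-back increment becomes visible
agg.add("payments_total{currency=ISK}", 2)
third = agg.flush()
```

The stored samples for this series are 0, 1, 3, so a query over them accounts for all three increments, at the cost of the first one appearing one flush late.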
Why this matters: This fix is a great example of solving a problem at the right layer. Instead of pushing complexity to every team writing dashboards and alerts, Airbnb fixed it once in the aggregation pipeline.
Limitations and Caveats
- Delta temporality trade-off: Services using delta mode lose data if they crash mid-interval. This is acceptable for the highest-volume emitters, where memory pressure forces the choice, but not for critical counters.
- Zero injection delay: The first increment of any counter is delayed by one flush interval. For real-time alerting on brand-new counters, this could be an issue.
- vmagent customizations: Airbnb's changes to vmagent (native histograms, multitenancy) required Go expertise. If you're not comfortable modifying the codebase, you might face limitations.
- Complexity of dual-write: Running two pipelines simultaneously doubles operational overhead during migration. Plan for a multi-month transition.
What's Next? Learning Path
If you're considering a similar migration, here's your roadmap:
- Start small: Pick one service with low cardinality metrics. Dual-emit OTLP + StatsD for a week. Validate counts match.
- Scale up: Move to higher-volume services. Monitor memory and latency. Switch to delta temporality if needed.
- Build aggregation: Deploy vmagent (or another streaming aggregator) and test with a subset of metrics.
- Handle edge cases: Implement zero injection or similar fixes for sparse counters. Test with real dashboards.
- Cut over: Once you're confident, remove the old StatsD pipeline.
Based on the original engineering blog post by Eugene Ma and Natasha Aleksandrova at Airbnb. All product names, logos, and brands are property of their respective owners.