Why Routing Matters in ML Serving
When you're serving hundreds of model types and versions at over 1 million requests per second, the way you route traffic to the right model instance becomes a critical architectural decision. Netflix's ML serving platform powers personalized experiences across title recommendations, commerce, and fraud detection. The core challenge? How to route traffic to the correct model, on the right cluster shard, for the right user and use case — while keeping the API abstraction simple for both client services and model researchers.
This post (the first in a series from the Netflix Tech Blog) unpacks the evolution of their routing layer: from a centralized proxy called Switchboard to a decoupled, metadata-driven approach named Lightbulb. It's a masterclass in balancing abstraction with operational reality at hyperscale.
Why this matters for your architecture: If you're building any sort of model serving infrastructure — whether for recommendations, fraud detection, or generative AI — the routing patterns described here directly impact your latency, reliability, and iteration speed. The tradeoffs Netflix encountered are universal.

The Original Design: Switchboard as a Central Routing Proxy
Netflix's first approach was Switchboard, a custom-built service acting as the single entry point for all model inference requests. It sat directly in the request path, performing context-aware routing and request enrichment.
Key Capabilities
- Common Client Abstraction: Clients only needed to integrate once; model iterations, A/B experiments, and shard migrations then happened behind the abstraction, invisible to clients.
- Context-Aware Routing: Switchboard could route based on user device, locale, ranking surface (homepage vs. search), or A/B test allocation.
- Dynamic Traffic Splitting: Real-time canary deployments and gradual rollouts (a minimal sketch of weighted splitting follows this list).
- Model Versioning & Lifecycle: Shadow mode testing, instant rollbacks.
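To make the traffic-splitting capability concrete, here is a minimal sketch of a weighted canary split. The rule shape and the `pickModelVariant` helper are illustrative assumptions, not Switchboard's actual internals:

```javascript
// Hypothetical rule shape: route 95% of traffic to the production model
// and 5% to a canary. Not Netflix's actual Switchboard internals.
const splitRule = {
  objective: "ContinueWatchingRanking",
  variants: [
    { model: "netflix-continue-watching-model-default", weight: 0.95 },
    { model: "netflix-continue-watching-model-canary", weight: 0.05 }
  ]
};

// Sample the cumulative weight distribution to pick a variant per request.
function pickModelVariant(rule) {
  let r = Math.random();
  for (const v of rule.variants) {
    if (r < v.weight) return v.model;
    r -= v.weight;
  }
  return rule.variants[0].model; // numeric edge case: fall back to production
}

console.log(pickModelVariant(splitRule));
```

Under this model, a gradual rollout amounts to republishing the same rule with shifted weights.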
The Glue: Switchboard Rules
Researchers defined routing logic via a JavaScript configuration:
```javascript
/**
 * Configuration rule written by a Model Researcher to add an A/B experiment.
 * Cell 1: Uses the default, currently productized model
 * Cell 2 and Cell 3: Use different experimental (candidate) models
 */
function defineAB12345Rule() {
  const abTestId = 12345;
  const objectives = Objectives.ContinueWatchingRanking;
  const abTestCellToModel = {
    1: {name: "netflix-continue-watching-model-default"},
    2: {name: "netflix-continue-watching-model-cell-2"},
    3: {name: "netflix-continue-watching-model-cell-3"}
  };
  return {
    cellToModel: abTestCellToModel,
    abTestId: abTestId,
    targetObjectives: [objectives],
    modelInputType: constants.TITLE_INPUT_TYPE,
    modelType: 'SCORER'
  };
}
```
This configuration was consumed by both Switchboard and the serving clusters via a pub-sub system (Gutenberg), enabling independent release cycles.
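As a rough sketch of that pattern, the snippet below swaps an immutable in-memory rule table whenever a new config version arrives. The `gutenbergClient` stub and its subscribe/publish API are assumptions for illustration; Gutenberg's real interface isn't shown in the post:

```javascript
// In-memory stand-in for a versioned pub-sub client; this API is assumed
// for illustration, not Gutenberg's real interface.
const gutenbergClient = {
  handlers: {},
  subscribe(topic, handler) { this.handlers[topic] = handler; },
  publish(topic, version) { this.handlers[topic]?.(version); }
};

// Each consumer (Switchboard, serving clusters) keeps an immutable rule table
// and swaps the reference atomically per published version, so readers never
// observe a partially applied config.
let activeRules = new Map();

gutenbergClient.subscribe("switchboard-rules", (version) => {
  activeRules = new Map(version.rules.map((rule) => [rule.abTestId, rule]));
  console.log(`Applied rules version ${version.id}`);
});

// Example: publishing a new version of the A/B rule defined above.
gutenbergClient.publish("switchboard-rules", {
  id: "v42",
  rules: [{ abTestId: 12345, cellToModel: { 1: { name: "netflix-continue-watching-model-default" } } }]
});
```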
The Pain Points at Scale
- Single point of failure: A bug or traffic surge in Switchboard could take down all ML-powered experiences.
- Added latency: 10–20ms per request due to serialization/deserialization, plus tail latency amplification.
- Reduced client flexibility: Obscured visibility into request origins, making tenant isolation and test traffic separation difficult.

The Lightbulb Evolution: Decoupling Routing Metadata from the Request Path
Rather than abandoning Switchboard's design, Netflix refactored where and how its responsibilities were executed. The new architecture, Lightbulb, separates routing metadata resolution from actual request forwarding.
Architecture Changes
- Lightbulb (new service): Consumes minimal request context (use-case info) and returns a `routingKey` and an `ObjectiveConfig` (model ID, execution parameters). It is not in the request path; it is called out of band to resolve metadata.
- Envoy Proxy: Handles actual request routing based on the `routingKey` header. Envoy is already used for all egress communication at Netflix.
- Client-side enrichment: The client attaches the `ObjectiveConfig` to the request body (not headers) to avoid header bloat.
How the Data Plane Flow Changed
```
// Before (Switchboard):
// Client -> Switchboard (deserialize, route, re-serialize) -> Serving Cluster
//
// After (Lightbulb + Envoy):
// 1. Client calls Lightbulb (out of band) to get routingKey + ObjectiveConfig
// 2. Client sends request directly to Envoy, with routingKey in header
// 3. Envoy routes to correct cluster VIP based on routingKey
// 4. Serving host reads ObjectiveConfig from request body
```
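Here's a hedged JavaScript sketch of the client side of that flow. The endpoints, header name (`x-routing-key`), and response fields are assumptions based on the post's description, not a documented Netflix API:

```javascript
// Illustrative client flow; hostnames, paths, and field names are assumed.
async function scoreTitles(useCase, titles) {
  // Step 1 (out of band, cacheable): resolve routing metadata from Lightbulb.
  const resp = await fetch(`https://lightbulb.example.net/resolve?useCase=${useCase}`);
  const { routingKey, objectiveConfig } = await resp.json();

  // Step 2: send the inference request through Envoy. The routingKey travels
  // in a header so Envoy can route without parsing the payload (step 3), and
  // the ObjectiveConfig rides in the body, where the serving host reads it
  // (step 4), keeping headers small.
  const result = await fetch("https://ml-serving.example.net/score", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-routing-key": routingKey
    },
    body: JSON.stringify({ objectiveConfig, titles })
  });
  return result.json();
}
```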
Benefits Realized
| Aspect | Switchboard | Lightbulb + Envoy |
|---|---|---|
| Request path involvement | In-path (mandatory) | Out-of-band (metadata only) |
| Latency overhead | 10–20ms per request | Minimal (header-based routing) |
| Failure domain | Single service can take down all ML | Routing failure isolated to metadata service; fallback via client caching |
| Tenant isolation | Poor (shared routing cluster) | Envoy per-tenant routing policies |
| Payload handling | Full deserialization required | Only headers inspected for routing |
Limitations and Considerations
- Increased client complexity: Clients now need to resolve metadata via Lightbulb before sending requests through Envoy. This adds a small coordination step, though resolved metadata is cached on the client.
- Configuration drift risk: Two configs (model serving + routing) must stay in sync. Netflix's Gutenberg system mitigates this with versioned pub-sub.
- Not for all scales: If you're serving <10K requests/second, a centralized proxy like Switchboard is simpler and perfectly adequate.
Next Steps for Learning
- Explore Envoy's routing capabilities (header-based, weight-based, fault injection) for your own stack.
- Study Netflix's Gutenberg configuration management system for dynamic rule propagation.
- Consider client-side caching of routing metadata for latency-sensitive use cases (Netflix used this as a fallback).
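For that last item, a minimal TTL cache with a stale-on-error fallback might look like the sketch below. The `fetchFn` parameter stands in for the Lightbulb call from the earlier sketch, and the TTL value is an illustrative choice:

```javascript
// Minimal TTL cache with stale-on-error fallback for routing metadata.
function createMetadataCache(fetchFn, ttlMs = 30_000) {
  const entries = new Map(); // useCase -> { value, expiresAt }

  return async function getMetadata(useCase) {
    const cached = entries.get(useCase);
    if (cached && Date.now() < cached.expiresAt) return cached.value;
    try {
      const value = await fetchFn(useCase);
      entries.set(useCase, { value, expiresAt: Date.now() + ttlMs });
      return value;
    } catch (err) {
      // If the metadata service is unreachable, serve the stale entry rather
      // than failing the request path (the fallback behavior described above).
      if (cached) return cached.value;
      throw err;
    }
  };
}
```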
For another perspective on building resilient, decoupled infrastructure, check out our analysis of Why Airbnb Built Its Own Embedded Workflow Engine.

Key Takeaways for Your Architecture
Netflix's journey from Switchboard to Lightbulb teaches us several universal lessons:
- Abstract clients from model internals — but don't let that abstraction become a bottleneck.
- Separate routing decisions from request forwarding — metadata services are easier to scale and fail independently.
- Leverage existing infrastructure — using Envoy (already present for inter-service communication) avoided introducing a new critical dependency.
- Plan for operational complexity — the decoupled architecture requires more coordination (two configs, client-side caching), but the reliability and latency wins at scale are worth it.
If you're building or evolving a model serving platform, start with a simple proxy, but design for the eventual need to decouple. The patterns here — context-aware routing, dynamic traffic splitting, and metadata-driven forwarding — are directly applicable whether you're using Kubernetes, serverless, or bare metal.
This analysis is based on the Netflix Tech Blog post "State of Routing in Model Serving" (2025).