Why Routing Matters in ML Serving
When you're serving hundreds of model types and versions at over 1 million requests per second, the way you route traffic to the right model instance becomes a critical architectural decision. Netflix's ML serving platform powers personalized experiences across title recommendations, commerce, and fraud detection. The core challenge? How to route traffic to the correct model, on the right cluster shard, for the right user and use case — while keeping the API abstraction simple for both client services and model researchers.
This post (the first in a series from the Netflix Tech Blog) unpacks the evolution of their routing layer: from a centralized proxy called Switchboard to a decoupled, metadata-driven approach named Lightbulb. It's a masterclass in balancing abstraction with operational reality at hyperscale.
Why this matters for your architecture: If you're building any sort of model serving infrastructure — whether for recommendations, fraud detection, or generative AI — the routing patterns described here directly impact your latency, reliability, and iteration speed. The tradeoffs Netflix encountered are universal.

The Original Design: Switchboard as a Central Routing Proxy
Netflix's first approach was Switchboard, a custom-built service acting as the single entry point for all model inference requests. It sat directly in the request path, performing context-aware routing and request enrichment.
Key Capabilities
- Common Client Abstraction: Clients only needed to integrate once; model iterations, A/B experiments, and shard migrations then happened behind the abstraction, invisible to clients.
- Context-Aware Routing: Switchboard could route based on user device, locale, ranking surface (homepage vs. search), or A/B test allocation.
- Dynamic Traffic Splitting: Real-time canary deployments and gradual rollouts (a minimal sketch of weighted splitting follows this list).
- Model Versioning & Lifecycle: Shadow mode testing, instant rollbacks.
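To make the traffic-splitting capability concrete, here is a minimal sketch of a weighted canary split. The rule shape and the `pickModelVariant` helper are illustrative assumptions, not Switchboard's actual internals:

```javascript
// Hypothetical rule shape: route 95% of traffic to the production model
// and 5% to a canary. Not Netflix's actual Switchboard internals.
const splitRule = {
  objective: "ContinueWatchingRanking",
  variants: [
    { model: "netflix-continue-watching-model-default", weight: 0.95 },
    { model: "netflix-continue-watching-model-canary", weight: 0.05 }
  ]
};

// Sample the cumulative weight distribution to pick a variant per request.
function pickModelVariant(rule) {
  let r = Math.random();
  for (const v of rule.variants) {
    if (r < v.weight) return v.model;
    r -= v.weight;
  }
  return rule.variants[0].model; // numeric edge case: fall back to production
}

console.log(pickModelVariant(splitRule));
```

Under this model, a gradual rollout amounts to republishing the same rule with shifted weights.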
The Glue: Switchboard Rules
Researchers defined routing logic via a JavaScript configuration:
```javascript
/**
 * Configuration rule written by a Model Researcher to add an A/B experiment.
 * Cell 1: Uses the default, currently productized model
 * Cell 2 and Cell 3: Use different experimental (candidate) models
 */
function defineAB12345Rule() {
  const abTestId = 12345;
  const objectives = Objectives.ContinueWatchingRanking;
  const abTestCellToModel = {
    1: {name: "netflix-continue-watching-model-default"},
    2: {name: "netflix-continue-watching-model-cell-2"},
    3: {name: "netflix-continue-watching-model-cell-3"}
  };
  return {
    cellToModel: abTestCellToModel,
    abTestId: abTestId,
    targetObjectives: [objectives],
    modelInputType: constants.TITLE_INPUT_TYPE,
    modelType: 'SCORER'
  };
}
```
This configuration was consumed by both Switchboard and the serving clusters via a pub-sub system (Gutenberg), enabling independent release cycles.
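As a rough sketch of that pattern, the snippet below swaps an immutable in-memory rule table whenever a new config version arrives. The `gutenbergClient` stub and its subscribe/publish API are assumptions for illustration; Gutenberg's real interface isn't shown in the post:

```javascript
// In-memory stand-in for a versioned pub-sub client; this API is assumed
// for illustration, not Gutenberg's real interface.
const gutenbergClient = {
  handlers: {},
  subscribe(topic, handler) { this.handlers[topic] = handler; },
  publish(topic, version) { this.handlers[topic]?.(version); }
};

// Each consumer (Switchboard, serving clusters) keeps an immutable rule table
// and swaps the reference atomically per published version, so readers never
// observe a partially applied config.
let activeRules = new Map();

gutenbergClient.subscribe("switchboard-rules", (version) => {
  activeRules = new Map(version.rules.map((rule) => [rule.abTestId, rule]));
  console.log(`Applied rules version ${version.id}`);
});

// Example: publishing a new version of the A/B rule defined above.
gutenbergClient.publish("switchboard-rules", {
  id: "v42",
  rules: [{ abTestId: 12345, cellToModel: { 1: { name: "netflix-continue-watching-model-default" } } }]
});
```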
The Pain Points at Scale
- Single point of failure: A bug or traffic surge in Switchboard could take down all ML-powered experiences.
- Added latency: 10–20ms per request due to serialization/deserialization, plus tail latency amplification.
- Reduced client flexibility: Obscured visibility into request origins, making tenant isolation and test traffic separation difficult.

The Lightbulb Evolution: Decoupling Routing Metadata from the Request Path
Rather than abandoning Switchboard's design, Netflix refactored where and how its responsibilities were executed. The new architecture, Lightbulb, separates routing metadata resolution from actual request forwarding.
Architecture Changes
- Lightbulb (new service): Consumes minimal request context (use-case info) and returns a `routingKey` and an `ObjectiveConfig` (model ID, execution parameters). It is not in the request path; it is called out of band to resolve metadata.
- Envoy Proxy: Handles actual request routing based on the `routingKey` header. Envoy is already used for all egress communication at Netflix.
- Client-side enrichment: The client attaches the `ObjectiveConfig` to the request body (not headers) to avoid header bloat.
How the Data Plane Flow Changed
```
// Before (Switchboard):
// Client -> Switchboard (deserialize, route, re-serialize) -> Serving Cluster
//
// After (Lightbulb + Envoy):
// 1. Client calls Lightbulb (out of band) to get routingKey + ObjectiveConfig
// 2. Client sends request directly to Envoy, with routingKey in header
// 3. Envoy routes to correct cluster VIP based on routingKey
// 4. Serving host reads ObjectiveConfig from request body
```
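Here's a hedged JavaScript sketch of the client side of that flow. The endpoints, header name (`x-routing-key`), and response fields are assumptions based on the post's description, not a documented Netflix API:

```javascript
// Illustrative client flow; hostnames, paths, and field names are assumed.
async function scoreTitles(useCase, titles) {
  // Step 1 (out of band, cacheable): resolve routing metadata from Lightbulb.
  const resp = await fetch(`https://lightbulb.example.net/resolve?useCase=${useCase}`);
  const { routingKey, objectiveConfig } = await resp.json();

  // Step 2: send the inference request through Envoy. The routingKey travels
  // in a header so Envoy can route without parsing the payload (step 3), and
  // the ObjectiveConfig rides in the body, where the serving host reads it
  // (step 4), keeping headers small.
  const result = await fetch("https://ml-serving.example.net/score", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-routing-key": routingKey
    },
    body: JSON.stringify({ objectiveConfig, titles })
  });
  return result.json();
}
```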
Benefits Realized
| Aspect | Switchboard | Lightbulb + Envoy |
|---|---|---|
| Request path involvement | In-path (mandatory) | Out-of-band (metadata only) |
| Latency overhead | 10–20ms per request | Minimal (header-based routing) |
| Failure domain | Single service can take down all ML | Routing failure isolated to metadata service; fallback via client caching |
| Tenant isolation | Poor (shared routing cluster) | Envoy per-tenant routing policies |
| Payload handling | Full deserialization required | Only headers inspected for routing |
Limitations and Considerations
- Increased client complexity: Clients now need to resolve metadata via Lightbulb before sending requests through Envoy. This adds a small coordination step, though resolved metadata is cached on the client.
- Configuration drift risk: Two configs (model serving + routing) must stay in sync. Netflix's Gutenberg system mitigates this with versioned pub-sub.
- Not for all scales: If you're serving <10K requests/second, a centralized proxy like Switchboard is simpler and perfectly adequate.
Next Steps for Learning
- Explore Envoy's routing capabilities (header-based, weight-based, fault injection) for your own stack.
- Study Netflix's Gutenberg configuration management system for dynamic rule propagation.
- Consider client-side caching of routing metadata for latency-sensitive use cases (Netflix used this as a fallback).
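For that last item, a minimal TTL cache with a stale-on-error fallback might look like the sketch below. The `fetchFn` parameter stands in for the Lightbulb call from the earlier sketch, and the TTL value is an illustrative choice:

```javascript
// Minimal TTL cache with stale-on-error fallback for routing metadata.
function createMetadataCache(fetchFn, ttlMs = 30_000) {
  const entries = new Map(); // useCase -> { value, expiresAt }

  return async function getMetadata(useCase) {
    const cached = entries.get(useCase);
    if (cached && Date.now() < cached.expiresAt) return cached.value;
    try {
      const value = await fetchFn(useCase);
      entries.set(useCase, { value, expiresAt: Date.now() + ttlMs });
      return value;
    } catch (err) {
      // If the metadata service is unreachable, serve the stale entry rather
      // than failing the request path (the fallback behavior described above).
      if (cached) return cached.value;
      throw err;
    }
  };
}
```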
For another perspective on building resilient, decoupled infrastructure, check out our analysis of Why Airbnb Built Its Own Embedded Workflow Engine.

Key Takeaways for Your Architecture
Netflix's journey from Switchboard to Lightbulb teaches us several universal lessons:
- Abstract clients from model internals — but don't let that abstraction become a bottleneck.
- Separate routing decisions from request forwarding — metadata services are easier to scale and fail independently.
- Leverage existing infrastructure — using Envoy (already present for inter-service communication) avoided introducing a new critical dependency.
- Plan for operational complexity — the decoupled architecture requires more coordination (two configs, client-side caching), but the reliability and latency wins at scale are worth it.
If you're building or evolving a model serving platform, start with a simple proxy, but design for the eventual need to decouple. The patterns here — context-aware routing, dynamic traffic splitting, and metadata-driven forwarding — are directly applicable whether you're using Kubernetes, serverless, or bare metal.
This analysis is based on the Netflix Tech Blog post "State of Routing in Model Serving" (2025).