The Problem: Data Demand Outpaced Human Capacity

At Spotify, with over 70,000 datasets and petabytes of data, answering a simple question like 'What was the DAU for podcast ads last week?' could take hours: find the right dashboard, ping the right data scientist on Slack, wait for a reply. Demand for data insights had grown beyond what individual experts could handle.

Throwing all schemas into an LLM doesn’t work at this scale. Context windows are limited, even at a million tokens. And schemas don't capture semantics: a column typed INT64 doesn't tell you that values under 100 are legacy test data, or what 'active user' really means. Without that knowledge, the model confidently picks the wrong table.
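A back-of-the-envelope check makes the scale problem concrete. The sketch below uses only the two figures quoted above (over 70,000 datasets, a one-million-token window). The function name is illustrative, and real schemas vary widely in size.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # "a million tokens", from the text
DATASET_COUNT = 70_000             # "over 70,000 datasets", from the text


def tokens_per_dataset(window: int, datasets: int) -> float:
    """Average token budget per dataset if every schema had to fit at once."""
    return window / datasets


budget = tokens_per_dataset(CONTEXT_WINDOW_TOKENS, DATASET_COUNT)
print(f"{budget:.1f} tokens available per dataset")  # prints 14.3
```

Roughly 14 tokens is barely enough for a table name and a couple of column names, before any profiling, examples, or business context.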

Spotify needed a middle layer—a curated context owned by domain experts that captures what actually matters about a slice of the warehouse.

[Figure: Spotify data assistant cluster model, showing datasets, pairs, and docs]

The Solution: A Cluster Model with Human Curation

Spotify’s data assistant, internally called Vedder, has been active since August 2025. Over 2,100 Spotifiers have used it in 13,000+ conversations, sending 60,000+ messages across 177 clusters covering advertising, podcasts, music, audiobooks, finances, and more. More than a quarter of those users had never written SQL before.

When a question comes in, Vedder picks the right context, writes the SQL, runs it, and returns the answer along with the query and sources. It works in a ReAct loop, alternating reasoning steps with actions such as running a query. You can see how the answer was produced, not just what it was.
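The loop described above can be sketched in a few lines. This is a minimal illustration of the general ReAct pattern, not Vedder's implementation: `react_loop`, `scripted_decide`, the tool names, and the toy `plays` table are all hypothetical, and a fixed script stands in for the LLM so the example runs offline.

```python
import sqlite3


def react_loop(question, decide, tools, max_steps=5):
    """Alternate reasoning and acting until `decide` chooses to finish.

    `decide(question, trace)` stands in for the LLM call and returns
    (thought, tool_name, tool_input). All names here are hypothetical.
    """
    trace = []  # one (thought, tool_name, tool_input, observation) per step
    for _ in range(max_steps):
        thought, tool_name, tool_input = decide(question, trace)
        if tool_name == "finish":
            return tool_input, trace
        observation = tools[tool_name](tool_input)
        trace.append((thought, tool_name, tool_input, observation))
    raise RuntimeError("no answer within max_steps")


# Toy warehouse: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_id INTEGER)")
conn.executemany("INSERT INTO plays VALUES (?)", [(1,), (1,), (2,)])

tools = {
    "pick_cluster": lambda q: "podcast_ads",              # stubbed context selection
    "run_sql": lambda sql: conn.execute(sql).fetchall(),  # act: run the query
}


def scripted_decide(question, trace):
    """Fixed script in place of a model, so the example is deterministic."""
    if len(trace) == 0:
        return ("Find the relevant context first.", "pick_cluster", question)
    if len(trace) == 1:
        sql = "SELECT COUNT(DISTINCT user_id) FROM plays"
        return ("Context found; write and run SQL.", "run_sql", sql)
    _, _, sql, rows = trace[-1]
    result = {"answer": rows[0][0], "sql": sql, "sources": [trace[0][3]]}
    return ("The query returned a result.", "finish", result)


result, trace = react_loop("How many distinct users played?", scripted_decide, tools)
print(result)
```

The returned `result` carries the answer together with the SQL and the context it came from, and `trace` records each intermediate step, which is what makes the answer inspectable.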

The Cluster Model

Each cluster represents a data domain (e.g., advertising, podcast analytics) and is owned by a named team of domain experts. It has three components:

  • Datasets: warehouse tables with full schema and profiling—column cardinality, common values, partition structure. When the model generates a WHERE clause, it helps to know country has values like 'US', 'GB', 'SE' rather than guessing.
  • Pairs: vetted question-and-SQL examples. This is the few-shot mechanism. Domain experts write or approve each pair, teaching the LLM both how to query the data and what the data means.
  • Docs: additional business context—terminology, gotchas, definitions that vary by team, which columns to use and which to avoid.
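One way to picture the three components is as a small data structure that gets rendered into prompt context. The sketch below is an assumption about shape only: the class names, field names, the `build_context` helper, and the sample table are hypothetical and not taken from Spotify's system.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnProfile:
    name: str
    dtype: str
    common_values: list[str] = field(default_factory=list)  # from profiling


@dataclass
class Dataset:
    table: str
    columns: list[ColumnProfile]


@dataclass
class Pair:
    question: str
    sql: str
    approved_by: str  # every pair is written or approved by an expert


@dataclass
class Cluster:
    domain: str
    owner_team: str
    datasets: list[Dataset]
    pairs: list[Pair]
    docs: list[str]


def build_context(cluster: Cluster) -> str:
    """Render a cluster as plain text suitable for few-shot prompting."""
    lines = [f"Domain: {cluster.domain} (owner: {cluster.owner_team})"]
    for ds in cluster.datasets:
        lines.append(f"Table {ds.table}")
        for col in ds.columns:
            values = ", ".join(col.common_values) or "n/a"
            lines.append(f"  {col.name} {col.dtype} common values: {values}")
    for doc in cluster.docs:
        lines.append(f"Note: {doc}")
    for pair in cluster.pairs:
        lines.append(f"Q: {pair.question}")
        lines.append(f"SQL: {pair.sql}")
    return "\n".join(lines)


cluster = Cluster(
    domain="advertising",
    owner_team="ads-insights",  # hypothetical team name
    datasets=[Dataset("ad_impressions", [ColumnProfile("country", "STRING", ["US", "GB", "SE"])])],
    pairs=[Pair("Impressions in Sweden?",
                "SELECT COUNT(*) FROM ad_impressions WHERE country = 'SE'",
                "expert@example.com")],
    docs=["Values under 100 in legacy_id are test data."],
)
context = build_context(cluster)
print(context)
```

Keeping profiled values next to each column is what lets a generated WHERE clause use 'SE' rather than a guessed spelling such as 'Sweden'.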

Human Judgement > Query History

A tempting shortcut: use the warehouse’s complete query history to auto-generate question-SQL pairs. Just take a query, ask an LLM to infer the question, and use those pairs to teach the model. It looks scalable.

But when Spotify tried this during curation, domain experts accepted only 12.5% of the proposed pairs. The other 87.5% were ad-hoc exploration, debugging sessions, one-off answers, wrong tables, or technically correct but misleading patterns. Query history is rich—but mostly noise.

Every example runs through an expert. The model reasons over context; it doesn’t decide what’s true about the data. That’s how you build trust.
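The review gate can be sketched as a status on each candidate pair, with only approved pairs reaching the model. The names below (`Review`, `CandidatePair`, `approved_only`, `acceptance_rate`) are hypothetical, and the eight sample candidates are made up to mirror the one-in-eight acceptance rate quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class CandidatePair:
    question: str
    sql: str
    review: Review = Review.PENDING


def approved_only(candidates):
    """Only expert-approved pairs are allowed into a cluster."""
    return [c for c in candidates if c.review is Review.APPROVED]


def acceptance_rate(candidates):
    """Share of reviewed candidates that experts approved."""
    reviewed = [c for c in candidates if c.review is not Review.PENDING]
    if not reviewed:
        return 0.0
    return len(approved_only(reviewed)) / len(reviewed)


# Illustrative data: eight auto-generated candidates, one approved.
candidates = [CandidatePair(f"question {i}", f"SELECT {i}", Review.REJECTED) for i in range(7)]
candidates.append(CandidatePair("Weekly podcast ad DAU?", "SELECT 1", Review.APPROVED))

print(acceptance_rate(candidates))  # 1 of 8 reviewed
print(len(approved_only(candidates)))
```

Defaulting new candidates to `PENDING` means nothing generated from query history can influence answers until a person has looked at it.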

Keeping Clusters Healthy: Automated Health Scoring

Data changes. Schemas evolve, columns get renamed, tables get deprecated. A context that was accurate last month can be wrong today. Vedder needs current information without constant manual attention.

Each cluster has a health score made up of continuously monitored signals:

  • How healthy is the underlying data?
  • How many curated pairs are still valid after schema changes? (If a column gets renamed, pairs referencing it degrade immediately.)
  • How well does the context cover questions people are actually asking?
  • How reproducible is the generated SQL?
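Two of these signals lend themselves to a short sketch: checking pairs against the current schema, and rolling signals into one number. The source does not say how the signals are combined, so the equal-weight average below is purely an assumption, as are the names `CuratedPair`, `pair_validity`, and `health_score` and the idea of recording referenced columns at curation time.

```python
from dataclasses import dataclass


@dataclass
class CuratedPair:
    question: str
    table: str
    columns: frozenset  # columns the SQL references, recorded at curation time


def pair_validity(pairs, schema):
    """Fraction of pairs whose referenced columns still exist.

    `schema` maps table name -> set of current column names.
    """
    if not pairs:
        return 0.0
    valid = sum(1 for p in pairs if p.columns <= schema.get(p.table, set()))
    return valid / len(pairs)


def health_score(signals):
    """Equal-weight mean of signals in [0, 1] (an assumed combination rule)."""
    return sum(signals.values()) / len(signals)


# The column `country` was renamed to `country_code`, so one pair degrades.
schema = {"ad_impressions": {"user_id", "country_code", "ts"}}
pairs = [
    CuratedPair("Impressions by country?", "ad_impressions", frozenset({"country"})),
    CuratedPair("Daily active users?", "ad_impressions", frozenset({"user_id", "ts"})),
]

validity = pair_validity(pairs, schema)
score = health_score({
    "data_health": 1.0,          # placeholder values for the other signals
    "pair_validity": validity,
    "question_coverage": 1.0,
    "sql_reproducibility": 0.5,
})
print(validity, score)
```

A check like `pair_validity` can also drive stale-pair flagging, since any pair that fails it is a candidate for expert attention.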

Data experts see the score and underlying signals on their cluster dashboard, and use them to decide where to spend curation time. Every conversation with Vedder becomes a data point that feeds back into the system—questions, answers, generated SQL, and user feedback are shown to cluster owners.

Limitations & Caveats

  • Cold start problem: A new cluster requires significant manual effort from domain experts to seed pairs and docs before it becomes useful.
  • Scalability of curation: As the number of clusters grows, the burden on expert curators increases. Automating some curation (e.g., flagging stale pairs) helps, but human-in-the-loop remains essential.
  • Schema-only blind spots: Even with profiling, some business logic lives outside the database—in documentation, runbooks, or team practices. The current model doesn’t ingest those automatically.

Next Steps & Learning Path

  • Explore external knowledge ingestion: How to bring documentation and process definitions into the context layer.
  • Improve automated pair generation: Use human-approved pairs as seed data to train a smaller model that can suggest candidate pairs for domain experts to review.
  • Extend beyond Spotify: The core principle—domain experts curate context—is architecture-agnostic. Any organization with a data warehouse and subject matter experts can adopt this model.

Conclusion: Context Curation Is the New Bottleneck

Spotify’s Vedder shows that the real bottleneck in scaling data insights isn’t the model; it’s the context. The people who understand a data domain best are the right ones to curate what the model sees. This doesn’t replace data scientists; it gives them more leverage. They spend less time answering one-off questions and more time shaping the knowledge layer that answers thousands.

For teams building similar systems, start small: pick one domain, have an expert write 10–20 question-SQL pairs, and measure how much that improves query accuracy. Then iterate.
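One simple way to measure the improvement is execution accuracy: run the generated SQL and a trusted reference query, then compare the rows. The sketch below assumes that metric (the source does not name one). The `ad_events` table, its rows, and the two stub generators standing in for a model with and without curated pairs are all made up.

```python
import sqlite3


def execution_accuracy(conn, cases, generate_sql):
    """Fraction of cases where generated SQL returns the same rows as the reference."""
    if not cases:
        return 0.0
    correct = 0
    for question, reference_sql in cases:
        expected = conn.execute(reference_sql).fetchall()
        try:
            got = conn.execute(generate_sql(question)).fetchall()
        except sqlite3.Error:
            continue  # SQL that fails to run counts as wrong
        # Compare as unordered row sets; repr keeps mixed types sortable.
        if sorted(got, key=repr) == sorted(expected, key=repr):
            correct += 1
    return correct / len(cases)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_events (user_id INTEGER, is_test INTEGER)")
conn.executemany("INSERT INTO ad_events VALUES (?, ?)", [(1, 0), (2, 0), (3, 1)])

cases = [(
    "How many real users saw an ad?",
    "SELECT COUNT(DISTINCT user_id) FROM ad_events WHERE is_test = 0",
)]


def without_pairs(question):
    # Stub: a model with no curated context forgets the test-data filter.
    return "SELECT COUNT(DISTINCT user_id) FROM ad_events"


def with_pairs(question):
    # Stub: a model that has seen an expert pair applies the filter.
    return "SELECT COUNT(DISTINCT user_id) FROM ad_events WHERE is_test = 0"


baseline = execution_accuracy(conn, cases, without_pairs)
curated = execution_accuracy(conn, cases, with_pairs)
print(baseline, curated)
```

In practice the evaluation questions should be held out from the pairs given to the model, otherwise the score mostly measures memorization.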

Source: Spotify Engineering Blog - Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant

This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.