The Old Mental Model Is Broken
For years, the narrative was simple: training is expensive, evaluation is cheap. A frontier LLM costs $50–100 million to train, but running a few benchmarks? A rounding error. That mental model is now dangerously outdated.
In 2026, the cost of a single comprehensive evaluation can exceed the cost of training the model being tested. The Holistic Agent Leaderboard (HAL) recently spent approximately $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And PaperBench, a benchmark that requires replicating 20 ICML papers from scratch, costs $9,500 per evaluation.
This isn't an anomaly—it's a structural shift. Evaluation has become its own compute budget, with its own statistical methods, failure modes, and economic consequences. If you can't afford the eval, you can't write the leaderboard.
Why Agent Eval Costs Exploded
Static benchmarks like MMLU or HELM were relatively cheap to run because they required only a single forward pass per item. Agents change everything. Each benchmark task is now a multi-turn rollout involving tool calls, code execution, web navigation, and iterative reasoning. The unit of cost is no longer a single model call; it is an entire session.
Consider the spread: on the Holistic Agent Leaderboard, the cost of a single benchmark run varies by four orders of magnitude across tasks, and by three orders within some individual benchmarks. A TAU-bench airline task can cost $0.31 or $180 depending on the agent configuration. That's not a bug—it's a feature of the agentic paradigm.
The pricing gap between models compounds this. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output tokens; Gemini 2.0 Flash charges $0.10 and $0.40. That is a 150× spread on input price alone, before any difference in how many tokens a run consumes. Token volume is set by the scaffold (the agent framework, prompt design, and tool-use pattern), which makes scaffold choice a first-order cost driver in its own right: the HAL paper found a 33× cost spread on identical tasks driven entirely by scaffold choice.
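To make the price gap concrete, here is a minimal sketch of per-rollout API cost. The per-million-token prices are the ones quoted above; the token counts are hypothetical, chosen only for illustration, and the sketch ignores caching and any discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, as quoted in the text above."""
    input_per_mtok: float
    output_per_mtok: float


CLAUDE_OPUS_4_1 = Pricing(input_per_mtok=15.00, output_per_mtok=75.00)
GEMINI_2_0_FLASH = Pricing(input_per_mtok=0.10, output_per_mtok=0.40)


def rollout_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for one rollout, ignoring caching and discounts."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Hypothetical rollout: 2M input tokens (context is re-sent on each turn)
# and 100K output tokens. These counts are assumptions, not measurements.
opus = rollout_cost(CLAUDE_OPUS_4_1, 2_000_000, 100_000)
flash = rollout_cost(GEMINI_2_0_FLASH, 2_000_000, 100_000)
print(f"Opus 4.1: ${opus:.2f}  Flash: ${flash:.2f}  ratio: {opus / flash:.0f}x")
```

Because cost is linear in token counts, a scaffold that doubles the tokens per task doubles the bill at any price point, which is why scaffold and model choice multiply rather than add.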
The Hidden Multiplier: Reliability
Most of the costs above buy only single-run measurements. But single-run accuracy is a noisy, unreliable metric. The field is slowly waking up to this fact.
Yao et al.'s τ-bench showed that performance can drop from 60% on a single run to 25% under an 8-run consistency check. Kapoor et al.'s "AI Agents That Matter" found that simple baseline agents Pareto-dominate complex SOTA agents on HumanEval at 50× lower cost. The HAL paper notes that a "do-nothing" agent passes 38% of τ-bench airline tasks under the original construction—and HAL's own log analysis revealed data leakage in the TAU-bench Few Shot scaffold, forcing its removal in December 2025.
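One way to formalize such a consistency check is the pass^k metric: the probability that an agent solves a task in all k independent attempts. Below is a minimal sketch of the usual estimator from n recorded trials with c successes, C(c, k) / C(n, k), averaged over tasks. The function names and the sample outcomes are illustrative assumptions, not taken from any benchmark's codebase.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k attempts on one task all succeed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # Fraction of size-k subsets of the trials that contain only successes.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks in a benchmark."""
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes for three tasks, eight trials each.
results = [
    [True] * 8,            # always solved
    [True] * 7 + [False],  # solved 7 of 8 times
    [True, False] * 4,     # solved half the time
]
print(benchmark_pass_hat_k(results, k=1))  # ordinary single-run accuracy
print(benchmark_pass_hat_k(results, k=8))  # all eight attempts must succeed
```

With k = 1 the metric reduces to mean accuracy; as k grows, any task the agent solves only intermittently contributes almost nothing, which is how a respectable single-run score can fall sharply under a consistency check.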
To get statistically reliable results, you need multiple seeds per cell, that is, per model-benchmark combination. A statistically credible HAL-style evaluation with k = 8 reruns per cell takes the $40K aggregate to roughly $320K. The same multiplier on PaperBench's $9,500-per-run cost pushes a single agent's evaluation to about $76K. Reliability acts as a multiplier on every cost category.
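The arithmetic behind these figures is a straight multiplication, sketched below with the dollar amounts quoted above. The second helper adds one assumption not stated in the text: that reruns are independent, in which case the standard error of a mean score shrinks only with the square root of k while cost grows linearly.

```python
from math import sqrt


def budget_with_reruns(single_run_cost_usd: float, k: int) -> float:
    """Total cost when every evaluation is repeated k times."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return single_run_cost_usd * k


def relative_standard_error(k: int) -> float:
    """Standard error of a k-run mean relative to one run (independent runs assumed)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 / sqrt(k)


hal_total = budget_with_reruns(40_000, k=8)
paperbench_total = budget_with_reruns(9_500, k=8)
print(f"HAL-style sweep: ${hal_total:,.0f}")
print(f"PaperBench, one agent: ${paperbench_total:,.0f}")
print(f"Noise relative to a single run: {relative_standard_error(8):.2f}")
```

Under that independence assumption, eight times the spend buys a bit less than a threefold reduction in noise.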
The Training-in-the-Loop Benchmarks
Some benchmarks escape the API-cost framing entirely because their evaluation protocol trains models from scratch. The Well, a scientific ML benchmark, requires 3,840 H100-hours for a full four-baseline sweep—roughly $9,600 at current cloud rates. A single new architecture still costs about 960 H100-hours ($2,400).
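A quick check of these figures, using only the numbers quoted above. The hourly rate is derived from them rather than taken from any price list, and the sketch assumes the sweep divides evenly across the four baselines.

```python
FULL_SWEEP_H100_HOURS = 3_840
FULL_SWEEP_COST_USD = 9_600
N_BASELINES = 4

# Hourly rate implied by the quoted sweep cost (derived, not a quoted price).
implied_rate_usd_per_hour = FULL_SWEEP_COST_USD / FULL_SWEEP_H100_HOURS

# Assumes the sweep splits evenly across the four baselines.
hours_per_architecture = FULL_SWEEP_H100_HOURS / N_BASELINES
cost_per_architecture = hours_per_architecture * implied_rate_usd_per_hour

print(f"Implied rate: ${implied_rate_usd_per_hour:.2f} per H100-hour")
print(f"One architecture: {hours_per_architecture:.0f} hours, ${cost_per_architecture:,.0f}")
```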
PaperBench requires replicating 20 ICML 2024 Spotlight or Oral papers from scratch, graded against rubric trees with 8,316 leaf-node criteria. Each rollout uses an A10 GPU for 12 hours. The per-paper math is brutal:
- $400 in API per o1 IterativeAgent rollout, times 20 papers = $8,000 per evaluation
- $66 per paper for grading with o3-mini judge = $1,320 for the full benchmark
- Total: roughly $9,500 per agent evaluation (the two line items above sum to $9,320)
OpenAI built PaperBench Code-Dev—a variant that drops execution—because many groups cannot afford the full benchmark. That variant halves rollout cost to about $4,000 and cuts grading to $10 per paper. The fact that a frontier lab needs to create a cheaper version of its own benchmark for the community to use it tells you everything about the current state of evaluation economics.
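The two variants can be compared with a few lines of arithmetic. This sketch uses only the per-paper figures quoted above and treats the Code-Dev rollout cost of about $4,000 as spread evenly over the 20 papers, which is an assumption.

```python
N_PAPERS = 20


def evaluation_cost(rollout_per_paper: float, grading_per_paper: float,
                    n_papers: int = N_PAPERS) -> float:
    """API cost in USD of one agent evaluation across all papers."""
    return n_papers * (rollout_per_paper + grading_per_paper)


# Full benchmark: $400 rollout and $66 grading per paper.
full = evaluation_cost(rollout_per_paper=400, grading_per_paper=66)

# Code-Dev: about $4,000 of rollouts in total (assumed evenly split), $10 grading per paper.
code_dev = evaluation_cost(rollout_per_paper=4_000 / N_PAPERS, grading_per_paper=10)

print(f"Full: ${full:,.0f}  Code-Dev: ${code_dev:,.0f}  ratio: {full / code_dev:.1f}x")
```

The itemized full-benchmark costs come to $9,320, slightly under the quoted total of $9,500, and the Code-Dev variant lands at a bit less than half of that.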
The Field Can't Keep Paying Retail
One reason these numbers stay high is that everyone pays for the same eval over and over. A frontier lab pays for a HAL sweep. An academic group pays again for a partial reproduction. An audit organization pays a third time. A journalist pays a fourth to spot-check the leaderboard. Almost none of the underlying instance-level outputs end up in a place where the next team can build on them.
Standardized documentation is the cheapest lever available. If a $9,500 PaperBench rollout exports its full grading trace in a shared schema, the next group studying the same papers can spend its budget on new perturbations instead of repeating the baseline. Even a 2× reuse rate on the high-cost benchmarks would put more money back in the ecosystem than every compression technique combined.
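As a rough illustration of what a shared schema could look like, the sketch below serializes one rollout per line of JSON. Every field name here is hypothetical; no existing standard or benchmark format is implied.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable


@dataclass
class RolloutRecord:
    """One instance-level result. All field names are hypothetical."""
    benchmark: str
    task_id: str
    model: str
    scaffold: str
    seed: int
    score: float
    cost_usd: float
    grading_trace: list = field(default_factory=list)  # per-criterion judgments


def to_jsonl(records: Iterable[RolloutRecord]) -> str:
    """Serialize records as JSON Lines, one rollout per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> list:
    """Load records written by to_jsonl."""
    return [RolloutRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Placeholder values for illustration only.
record = RolloutRecord(
    benchmark="example-bench", task_id="task-001", model="example-model",
    scaffold="example-scaffold", seed=0, score=0.5, cost_usd=12.34,
    grading_trace=[{"criterion": "example criterion", "passed": True}],
)
restored = from_jsonl(to_jsonl([record]))
print(restored[0] == record)
```

Keeping cost and seed alongside the score is the design choice that matters: a later team can then recompute reliability and cost-adjusted comparisons without paying for the rollouts again.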
What This Means for the ML Community
The Compute Divide Now Includes Evaluation
Ahmed, Wahed and Thompson (Science 2023) documented that industry models in 2021 were 29× larger than academic ones by parameter count, and that about 70% of AI PhDs went to industry in 2020 versus 21% in 2004. The original "compute divide" story mostly ignored evaluation because evaluation used to look cheap next to training. That has reversed. A lab that can fine-tune a 7B model can no longer assume it can afford the benchmarks the field takes seriously.
Cost-Blind Leaderboards Reward Waste
When leaderboards report raw accuracy and omit cost, researchers can rationally pour tokens into a problem until the number ticks up. The HAL paper finds that higher reasoning effort actually reduces accuracy in the majority of runs—extra inference compute does not reliably improve even the metric it is supposed to optimize. Pareto frontiers fix the comparison by ranking accuracy against cost. HAL implements them, but most leaderboards still do not.
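A minimal sketch of how a cost-accuracy Pareto frontier can be computed: an entry stays on the frontier only if no other entry is at least as cheap and at least as accurate. The agent names and numbers are made up for illustration, and this is not HAL's implementation.

```python
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float


def pareto_frontier(entries: Iterable[Entry]) -> list:
    """Return the non-dominated entries, ordered from cheapest to most expensive."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; at equal cost, try the more accurate entry first.
    for entry in sorted(entries, key=lambda e: (e.cost_usd, -e.accuracy)):
        # Anything not strictly more accurate than a cheaper entry is dominated.
        if entry.accuracy > best_accuracy:
            frontier.append(entry)
            best_accuracy = entry.accuracy
    return frontier


# Hypothetical leaderboard entries.
entries = [
    Entry("simple-baseline", cost_usd=5.0, accuracy=0.40),
    Entry("mid-agent", cost_usd=50.0, accuracy=0.55),
    Entry("heavy-agent", cost_usd=200.0, accuracy=0.50),
    Entry("max-effort", cost_usd=400.0, accuracy=0.60),
]
for entry in pareto_frontier(entries):
    print(entry.name, entry.cost_usd, entry.accuracy)
```

In this made-up example the heavy agent drops off the frontier because a cheaper entry is also more accurate, which is the pattern a raw-accuracy ranking cannot show.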
Governance and Accountability
Evaluation cost is now an accountability barrier. Academic groups, AI Safety Institutes, and journalists now hit the budget constraint before the technical one when they try to evaluate frontier agents independently. A single GAIA run can exceed an annual graduate student travel budget. If only frontier-lab compute budgets can produce statistically reliable benchmark numbers, the social process of evaluating AI systems becomes concentrated inside the same labs that build them.
Limitations and Caveats
- Compression techniques are partial. Flash-HELM, tinyBenchmarks, and Anchor Points work well for static benchmarks (100× to 200× reduction), but agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Mid-difficulty filtering achieves a 2× to 3.5× reduction—useful, but far short of the static-era gains.
- Training-in-the-loop benchmarks have no general compression method. Tabular precomputation and tight budget caps can reduce cost only by narrowing what the benchmark measures. The fundamental asymmetry—evaluation compute exceeding training compute by two orders of magnitude—is structural.
- The cost figures are lower bounds. They assume optimal pricing and no retries, so real-world costs are often higher, and many evaluators are already priced out at these levels.
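For a sense of how mid-difficulty filtering works, here is a minimal sketch that keeps only tasks whose historical pass rate falls in a middle band, on the reasoning that tasks nearly every agent solves or fails do little to separate agents. The thresholds, task names, and pass rates are all assumptions for illustration, not the published procedure.

```python
def mid_difficulty_subset(pass_rates: dict, low: float = 0.2, high: float = 0.8) -> list:
    """Keep task IDs whose historical pass rate lies within [low, high]."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("need 0 <= low <= high <= 1")
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)


# Hypothetical pass rates of earlier agents on each task.
pass_rates = {
    "task-a": 0.00,
    "task-b": 0.05,
    "task-c": 0.30,
    "task-d": 0.55,
    "task-e": 0.75,
    "task-f": 0.95,
    "task-g": 1.00,
}
kept = mid_difficulty_subset(pass_rates)
reduction = len(pass_rates) / len(kept)
print(kept, f"{reduction:.1f}x fewer tasks")
```

The saving is bounded by how many tasks sit at the extremes, which fits the modest reductions reported for agent benchmarks.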
The Bottom Line
Evaluation now has its own compute budgets, statistical methods, and failure modes. Its price shapes who gets to evaluate powerful systems in the first place. The field still talks as if capability sets the main constraint, but evaluation points to reliability as the tighter one. Governance institutions should want to measure the gap between single-run accuracy and pass^k consistency, yet that gap costs the most to measure.
The economics have changed. Whoever can pay for the evaluation gets to write the leaderboard.
This analysis is based on the EvalEval Coalition Blog post by Ghosh, Mai, Channing, and Choshen (2026).