The Efficiency Challenge at Hyperscale
When your infrastructure serves over 3 billion users, even a 0.1% performance regression translates to massive additional power consumption. For years, Meta’s Capacity Efficiency organization operated a two-pronged strategy:
- Offense: Proactively searching for code optimizations to make existing systems more efficient.
- Defense: Monitoring production resource usage to detect regressions, root-cause them to a specific pull request, and deploy mitigations.
Both approaches worked well, but they hit a common bottleneck: human engineering time. Engineers had to manually query profiling data, review documentation, investigate recent deployments, and interpret results. No matter how good the tooling, there simply weren’t enough hours in the day.
The breakthrough came when the team realized that offense and defense share the same underlying structure. Both require gathering context (profiling data, code changes, documentation) and applying domain expertise to decide what to do next. This insight led to a unified AI agent platform that treats both problems with the same foundational architecture.

Architecture: Tools + Skills = Domain Expert Agents
Meta built the platform on two layers:
1. MCP Tools (Standardized Interfaces)
Each tool does one thing—query profiling data, fetch experiment results, retrieve configuration history, search code, or extract documentation. These are the atomic building blocks that any agent can invoke.
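As a rough illustration of the "one tool, one job" idea, the sketch below models a tool as a named callable held in a registry. It is not the MCP SDK or Meta's implementation: the `Tool` and `ToolRegistry` classes, the `fetch` method, and the canned profiling data are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """A single-purpose tool: a name, a description, and one callable."""
    name: str
    description: str
    handler: Callable[..., Any]

    def fetch(self, *args: Any, **kwargs: Any) -> Any:
        # Delegate to the underlying handler; the tool itself holds no logic.
        return self.handler(*args, **kwargs)


class ToolRegistry:
    """Holds tools by name so any agent can look them up and invoke them."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        return self._tools[name].fetch(*args, **kwargs)

    def names(self) -> List[str]:
        return sorted(self._tools)


# Hypothetical stand-in for a profiling backend; it returns canned data.
def fake_query_profiling(function_name: str) -> dict:
    return {"function": function_name, "cpu_pct": 1.25}


registry = ToolRegistry()
registry.register(
    Tool("query_profiling", "Fetch the CPU profile for a function", fake_query_profiling)
)
```

Calling `registry.call("query_profiling", "serialize_user")` returns the canned profile. Keeping each tool this narrow is what lets different agents reuse it without modification.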
2. Skills (Encoded Domain Expertise)
Skills capture reasoning patterns that senior efficiency engineers developed over years. For example:
- "Consult the top GraphQL endpoints for endpoint latency regressions."
- "Look for recent schema changes if the affected function handles serialization."
A skill tells the LLM which tools to use and how to interpret results. Together, tools and skills transform a general-purpose language model into a specialized efficiency engineer.
# Simplified example of how a skill orchestrates tools.
# Tool names and their methods (fetch, create, apply) are illustrative.
class RegressionMitigationSkill:
    def __init__(self, tools: list):
        # Index tools by name so each step can look up the one it needs
        self.tools = {t.name: t for t in tools}

    def run(self, regression_event: dict):
        # Step 1: Gather context
        profiling_data = self.tools['query_profiling'].fetch(regression_event['function'])
        pr_history = self.tools['get_config_history'].fetch(regression_event['time_window'])

        # Step 2: Apply domain heuristic
        if 'logging' in regression_event['type']:
            # Logging regressions can be mitigated by sampling more aggressively,
            # i.e. logging a smaller fraction of events
            mitigation = self.tools['generate_code_patch'].create(
                file=pr_history['changed_files'][0],
                change="reduce log sampling rate from 0.1 to 0.01",
            )
        else:
            # Fallback: optimize the hot path (the alternative is a rollback)
            mitigation = self.tools['find_optimization_pattern'].apply(
                function=regression_event['function'],
                pattern='memoization',
            )
        return mitigation
The same tools power both offense and defense—only the skills differ. This reuse dramatically reduces integration overhead and accelerates the addition of new capabilities.
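To make the reuse concrete, here is a minimal sketch in which a defensive skill and an offensive skill call the same two tools and differ only in how they interpret the results. The tool names, the canned return values, and the 1.0% CPU threshold are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical shared tools, each a plain function returning canned data.
TOOLS: Dict[str, Callable[..., Any]] = {
    "query_profiling": lambda fn: {"function": fn, "cpu_pct": 2.0},
    "search_code": lambda fn: [f"src/{fn}.py"],
}

calls: List[str] = []  # records which tools were invoked, in order


def use(tool_name: str, *args: Any) -> Any:
    """Invoke a shared tool by name and record the call."""
    calls.append(tool_name)
    return TOOLS[tool_name](*args)


def defense_skill(function_name: str) -> str:
    """Defense: point an investigation at a function already flagged as regressed."""
    profile = use("query_profiling", function_name)
    files = use("search_code", function_name)
    return f"investigate {files[0]}: {profile['cpu_pct']}% CPU after regression"


def offense_skill(function_name: str) -> str:
    """Offense: propose an optimization when a function is hot enough to matter."""
    profile = use("query_profiling", function_name)
    files = use("search_code", function_name)
    if profile["cpu_pct"] >= 1.0:  # assumed threshold for "worth optimizing"
        return f"propose memoization in {files[0]}"
    return "no action"
```

Both skills issue the same tool calls. Only the decision logic layered on top changes, so adding a third skill in this sketch would require no new integrations.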

Defense: AI Regression Solver
Meta’s internal regression detection tool, FBDetect, catches regressions as small as 0.005% in noisy production environments. Traditionally, when a regression was found, engineers were notified and expected to manually create a fix or rollback.
Now, the AI Regression Solver automates the entire resolution:
- Gather context: Find the regressed functions, look up the root-cause PR, and identify the exact files and lines changed.
- Apply domain expertise: Use a mitigation skill tailored to the codebase, language, or regression type (e.g., for logging regressions, sample more aggressively so that fewer events are logged).
- Create resolution: Produce a new pull request and send it to the original author for review.
This compresses ~10 hours of manual investigation into ~30 minutes of AI processing, with the engineer only needing to review and approve the generated fix.
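The three steps above can be sketched as a small pipeline. This is an assumption-heavy outline, not the AI Regression Solver itself: the event fields, the `Context` and `PullRequestDraft` types, and both mitigation functions are invented for the example, and the lookups return placeholder data where a real system would query detection and version-control tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Context:
    function: str
    root_cause_pr: str
    changed_files: List[str]


@dataclass
class PullRequestDraft:
    title: str
    files: List[str]
    reviewer: str


def gather_context(event: Dict[str, str]) -> Context:
    # Step 1: collect the regressed function, the suspect PR, and its files.
    # Hypothetical lookup: the file name is derived from the function name.
    return Context(
        function=event["function"],
        root_cause_pr=event["suspect_pr"],
        changed_files=[f"{event['function']}.py"],
    )


def logging_mitigation(ctx: Context) -> str:
    return f"lower log sampling rate in {ctx.changed_files[0]}"


def default_mitigation(ctx: Context) -> str:
    return f"revert {ctx.root_cause_pr}"


# Step 2: map a regression type to the mitigation skill that handles it.
SKILLS: Dict[str, Callable[[Context], str]] = {"logging": logging_mitigation}


def resolve(event: Dict[str, str]) -> PullRequestDraft:
    ctx = gather_context(event)
    skill = SKILLS.get(event["type"], default_mitigation)
    plan = skill(ctx)
    # Step 3: draft a PR for the original author; a human still approves it.
    return PullRequestDraft(title=plan, files=ctx.changed_files, reviewer=event["author"])
```

The output is a draft routed to the original author rather than a merged change, which mirrors the review-and-approve step described above.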
Offense: AI-Assisted Opportunity Resolution
On the offensive side, engineers identify "efficiency opportunities"—conceptual code changes that could improve performance. The AI agent then:
- Looks up opportunity metadata, documentation, and past examples.
- Applies a skill encoding expert knowledge (e.g., memoization patterns for CPU reduction).
- Generates a candidate fix with guardrails, verifies syntax and style, and surfaces the code in the engineer’s editor ready to apply with one click.
What used to require hours of investigation now takes minutes to review and deploy.
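As a small illustration of a memoization fix passing through guardrails, the sketch below holds a candidate patch as source text and checks it before it would be shown to an engineer. The candidate function, the two checks, and the 100-character line limit are assumptions for this example; the source does not describe which checks Meta actually runs.

```python
import ast
from typing import List

# Hypothetical candidate fix an agent might produce: memoize a pure function.
CANDIDATE_FIX = '''
from functools import lru_cache

@lru_cache(maxsize=1024)
def normalize_locale(code: str) -> str:
    return code.strip().lower().replace("_", "-")
'''

MAX_LINE_LENGTH = 100  # assumed style limit, not a documented Meta rule


def check_candidate(source: str) -> List[str]:
    """Return a list of guardrail violations; an empty list means it passed."""
    problems: List[str] = []
    try:
        ast.parse(source)  # syntax check only; the code is not executed here
    except SyntaxError as exc:
        problems.append(f"syntax error: {exc.msg}")
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"line {number} exceeds {MAX_LINE_LENGTH} characters")
    return problems
```

Here `check_candidate(CANDIDATE_FIX)` returns an empty list, so the patch would be surfaced for review. Memoization with `functools.lru_cache` is only safe when the function is pure, which is the kind of condition a skill would need to encode.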
One Platform, Compounding Returns
Within a year of deploying the platform, the same foundation powered additional applications:
- Conversational assistants for efficiency questions
- Capacity planning agents
- Personalized opportunity recommendations
- Guided investigation workflows
- AI-assisted validation
Each new capability required few or no new data integrations—they simply composed existing tools with new skills.

Impact and Key Takeaways
Meta’s Capacity Efficiency Program has recovered hundreds of megawatts of power—enough to power hundreds of thousands of American homes for a year. But the deeper change is cultural:
- Engineers who spent mornings on defensive triage now review AI-generated analyses in minutes.
- The daunting question of "where do I even start?" has been replaced by reviewing and deploying high-impact fixes.
- The platform scales megawatt (MW) savings without proportionally scaling headcount.
Limitations and Caveats
- Skill maintenance: Domain expertise encoded in skills must be continuously updated as codebases and best practices evolve.
- False positives: AI-generated fixes can be wrong or unnecessary, so they still require human review; the agent is a copilot, not an autopilot.
- Generalization: The approach works best in environments with rich telemetry and well-documented code; it may not transfer directly to smaller organizations.
Next Steps for Learning
- Explore the MCP (Model Context Protocol) specification used by Meta for tool interfaces.
- Read about building robust AI agents with retrieval-augmented generation for similar patterns.
- For a deeper dive into responsible use of AI coding tools, check out our guide: "Beyond the Hype: A Responsible Developer's Guide to AI Coding Tools".
- To see how Meta applied similar principles to deprecate its internal FFmpeg fork, see: "How Meta Deprecated Its Internal FFmpeg Fork: A Deep Dive into Open Source Collaboration at Scale".
Final Thought
The most powerful insight from Meta’s journey is that offense and defense share the same structure. By building a unified platform with reusable tools and composable skills, they created a self-sustaining efficiency engine where AI handles the long tail of performance work. For any organization operating at scale, this pattern is worth studying and adapting.
Source: Meta Engineering Blog