Tool definitions are expensive at scale. We benchmark traditional tool loading against Agent-CoreX's semantic retrieval on the same task, with the Anthropic leak as the motivating backdrop.
One of the clearer revelations from the Anthropic Claude Code leak is just how much lives in the tool layer. The base tool definition in the leaked harness spans 29,000 lines. At scale, tool definitions are a significant driver of per-request token cost — and they're mostly invisible until you start measuring.
This post runs a concrete benchmark: the same task, the same model, the same result — with all tools loaded versus with semantic retrieval.
Task: "Commit the current changes to the main branch on GitHub"
Tools available: a realistic production setup of 27 tools across 6 MCP servers, spanning GitHub, database, search, and calendar capabilities, among others.
Each tool definition averages ~250 tokens (name, description, input schema with type annotations and descriptions). Some are simpler; some — like create_pr with its body, labels, reviewers, and branch fields — are considerably larger.
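For a sense of scale, here's what one of the larger definitions might look like in the Anthropic tools format. This is a hypothetical sketch; the field names are illustrative, not copied from the leak, but the shape and rough size are typical:

// Hypothetical create_pr definition in the Anthropic `tools` format.
// Illustrative only — not the actual schema from the leaked harness.
const createPrTool = {
  name: "create_pr",
  description: "Open a pull request from a source branch into a target branch.",
  input_schema: {
    type: "object",
    properties: {
      title: { type: "string", description: "Pull request title" },
      body: { type: "string", description: "Markdown body of the pull request" },
      branch: { type: "string", description: "Source branch containing the changes" },
      base: { type: "string", description: "Target branch to merge into" },
      labels: { type: "array", items: { type: "string" }, description: "Labels to apply" },
      reviewers: { type: "array", items: { type: "string" }, description: "Usernames to request review from" },
    },
    required: ["title", "branch", "base"],
  },
}

Serialized, a definition like this lands in the few-hundred-token range, and every request pays for it whether or not the task touches GitHub.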
Traditional agent setup: load all enabled tools and inject them into every request.
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment
const ACX_API_BASE = process.env.ACX_API_BASE // Agent-CoreX config, assumed set
const ACX_API_KEY = process.env.ACX_API_KEY

// Fetch all tools from all enabled MCP servers
const { tools: allTools } = await fetch(`${ACX_API_BASE}/tools`, {
  headers: { Authorization: `Bearer ${ACX_API_KEY}` },
}).then(r => r.json())

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024, // required by the Messages API
  tools: allTools, // all 27 tool definitions injected
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})
Token overhead from tool definitions: ~6,750 tokens (27 tools × ~250 tokens each)
The model receives all 27 tool definitions — database tools, search tools, calendar tools — none of which are relevant to this git commit task. It has to scan all of them to find the 1–2 that actually apply.
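This overhead stays invisible until you measure it. A quick way to do that, using the rough ~4-characters-per-token heuristic (an approximation; use a real tokenizer for exact counts):

// Estimate token overhead by serializing the definitions.
// chars/4 is a rough heuristic, not an exact tokenizer count.
const estimateTokens = (obj: unknown): number =>
  Math.round(JSON.stringify(obj).length / 4)

const overhead = allTools.reduce(
  (sum: number, tool: unknown) => sum + estimateTokens(tool),
  0,
)
console.log(`~${overhead} tokens of tool definitions per request`)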
Retrieve only the tools relevant to the current query before building the prompt:
// Retrieve only tools relevant to this specific query
const { tools } = await fetch(
  `${ACX_API_BASE}/retrieve_tools?query=commit+current+changes+to+main+branch&top_k=3`,
  { headers: { Authorization: `Bearer ${ACX_API_KEY}` } },
).then(r => r.json())
// Retrieved: create_commit, push_branch, list_branches — 3 tools

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  tools, // just the 3 retrieved tools
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})
Token overhead from tool definitions: ~750 tokens (3 tools × ~250 tokens each)
The model receives exactly the tools it needs. No database schemas. No search API definitions. No calendar fields.
| Metric | All tools | Semantic retrieval | Delta |
|---|---|---|---|
| Tools injected | 27 | 3 | −89% |
| Tool definition tokens | ~6,750 | ~750 | −89% |
| Cost per request (tool overhead only, at $15/1M input tokens) | $0.101 | $0.011 | −89% |
| Monthly cost at 100K requests | ~$10,125 | ~$1,125 | −$9,000 |
| Monthly cost at 1M requests | ~$101,250 | ~$11,250 | −$90,000 |
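The table's arithmetic is easy to reproduce with your own numbers. The $15/1M figure is an assumed input-token price; substitute your model's actual rate:

// Cost of tool-definition overhead at a given input-token price ($/1M tokens).
const overheadCost = (tokens: number, pricePerMTok: number): number =>
  (tokens / 1_000_000) * pricePerMTok

const allTools27 = overheadCost(6_750, 15) // ≈ $0.101 per request
const retrieved3 = overheadCost(750, 15)   // ≈ $0.011 per request

console.log(allTools27 * 100_000) // ≈ $10,125/month at 100K requests
console.log(retrieved3 * 100_000) // ≈ $1,125/month at 100K requests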
These numbers reflect tool definition overhead only, not the cost of the actual conversation, tool results, or model responses. The full token count includes the user message, conversation history, and tool outputs. But tool definition overhead is pure waste: it is context that doesn't change the outcome, and it is only there because every tool was loaded.
The cost reduction is straightforward arithmetic. The accuracy argument is subtler but equally important.
When you present a model with 27 tools for a simple git commit task, you're asking it to correctly identify 1–2 relevant tools out of 27. For a capable model like claude-opus-4-6, it usually gets this right — but "usually" isn't "always," and the probability of a wrong tool selection increases with the number of irrelevant tools in context.
Semantic retrieval narrows the selection space before the model ever sees it. With 3 highly relevant tools instead of 27 mixed ones, the model is more likely to select correctly and less likely to produce a confused, multi-step response trying to reconcile unrelated capabilities.
The benchmark above uses 27 tools; production setups grow well beyond that. Retrieval keeps the injected set fixed while the catalog grows, so with top_k=5 the fraction of the catalog you load shrinks as the total climbs:
| Total tools | Loaded (top_k=5) | Token savings |
|---|---|---|
| 50 | 10% | ~90% |
| 100 | 5% | ~95% |
| 200 | 2.5% | ~97.5% |
As your tool count grows, the absolute savings increase — and so does the accuracy benefit from reduced context noise.
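The savings column is just 1 − top_k / total_tools, assuming roughly uniform definition sizes. A one-liner if you want to project your own catalog:

// Fractional token savings from fixed-size retrieval.
// Assumes tool definitions of roughly uniform size (~250 tokens here).
const savings = (totalTools: number, topK: number): number =>
  1 - topK / totalTools

console.log(savings(50, 5))  // 0.9   → ~90%
console.log(savings(100, 5)) // 0.95  → ~95%
console.log(savings(200, 5)) // 0.975 → ~97.5%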
Enable MCP servers in Dashboard → MCP Servers, then call /retrieve_tools instead of /tools when building your prompts. The Playground shows relevance scores alongside retrieved tools so you can verify which tools are being selected and tune top_k before writing integration code.
For most single-domain queries, top_k=3 is sufficient. For multi-step or cross-domain workflows, top_k=5–8 gives the model enough breadth without reintroducing the noise problem.
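In code, top_k is just a query parameter. A minimal helper, assuming the same /retrieve_tools response shape used above:

// Retrieve tools with a tunable top_k before each request.
async function retrieveTools(query: string, topK = 3) {
  const res = await fetch(
    `${ACX_API_BASE}/retrieve_tools?query=${encodeURIComponent(query)}&top_k=${topK}`,
    { headers: { Authorization: `Bearer ${ACX_API_KEY}` } },
  )
  const { tools } = await res.json()
  return tools
}

// Single-domain query: keep the selection tight
const gitTools = await retrieveTools("commit current changes to main", 3)
// Cross-domain workflow: give the model more breadth
const reviewTools = await retrieveTools("summarize open PRs and schedule a review", 8)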