Benchmark · Token Cost · Claude · Tool Routing · Cost Optimization

Claude Tool Loading vs Agent-CoreX: Token Cost Benchmark

Tool definitions are expensive at scale. Benchmark: traditional tool loading vs Agent-CoreX's semantic retrieval, using data from the Anthropic leak.

April 1, 2026 · 5 min read · by Agent-CoreX

One of the clearer revelations from the Anthropic Claude Code leak is just how much lives in the tool layer. The base tool definition in the leaked harness spans 29,000 lines. At scale, tool definitions are a significant driver of per-request token cost — and they're mostly invisible until you start measuring.

This post runs a concrete benchmark: the same task, the same model, the same result — with all tools loaded versus with semantic retrieval.

Benchmark Setup

Task: "Commit the current changes to the main branch on GitHub"

Tools available (a realistic production setup):

  • GitHub (create_commit, push_branch, create_pr, list_issues, search_code, create_branch, merge_pr) — 7 tools
  • PostgreSQL (query, insert, update, delete, list_tables, describe_table) — 6 tools
  • File system (read_file, write_file, list_directory, search_files, move_file) — 5 tools
  • Slack (send_message, list_channels, get_thread, create_reminder) — 4 tools
  • Brave Search (search, get_page) — 2 tools
  • Calendar (create_event, list_events, update_event) — 3 tools

Total: 27 tools across 6 servers.

Each tool definition averages ~250 tokens (name, description, input schema with type annotations and descriptions). Some are simpler; some — like create_pr with its body, labels, reviewers, and branch fields — are considerably larger.
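For back-of-envelope planning, a rough character-based heuristic (~4 characters per token) is enough to see why a schema-heavy tool like create_pr weighs more than a simple one. Both the heuristic and the sample schema below are illustrative sketches, not the model's actual tokenizer or the real tool definitions:

```typescript
// Rough token estimate for a tool definition: ~4 characters per token.
// This is a planning heuristic, not an exact BPE count.
type ToolDef = { name: string; description: string; input_schema: object };

function estimateTokens(tool: ToolDef): number {
  return Math.ceil(JSON.stringify(tool).length / 4);
}

// Hypothetical create_pr schema, sketched for illustration
const createPr: ToolDef = {
  name: "create_pr",
  description: "Open a pull request on GitHub",
  input_schema: {
    type: "object",
    properties: {
      title: { type: "string", description: "PR title" },
      body: { type: "string", description: "PR description in Markdown" },
      labels: { type: "array", items: { type: "string" } },
      reviewers: { type: "array", items: { type: "string" } },
      base: { type: "string", description: "Target branch" },
      head: { type: "string", description: "Source branch" },
    },
    required: ["title", "head", "base"],
  },
};

console.log(estimateTokens(createPr)); // field-rich schemas estimate much higher
```

Run this against your own tool definitions before deciding whether retrieval is worth wiring up; the spread between the smallest and largest tools is usually wider than the ~250-token average suggests.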

Approach 1: All Tools, Every Request

Traditional agent setup: load all enabled tools and inject them into every request.

import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

// Fetch all tools from all enabled MCP servers
const { tools: allTools } = await fetch(`${ACX_API_BASE}/tools`, {
  headers: { Authorization: `Bearer ${ACX_API_KEY}` }
}).then(r => r.json())

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  tools: allTools, // 27 tools
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})

Token overhead from tool definitions: ~6,750 tokens

The model receives all 27 tool definitions — database tools, search tools, calendar tools — none of which are relevant to this git commit task. It has to scan all of them to find the 1–2 that actually apply.

Approach 2: Semantic Retrieval with Agent-CoreX

Retrieve only the tools relevant to the current query before building the prompt:

// Retrieve only tools relevant to this specific query
const { tools } = await fetch(
  `${ACX_API_BASE}/retrieve_tools?query=commit+current+changes+to+main+branch&top_k=3`,
  { headers: { Authorization: `Bearer ${ACX_API_KEY}` } }
).then(r => r.json())

// Retrieved: create_commit, push_branch, create_branch — 3 tools

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  tools, // 3 tools
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})

Token overhead from tool definitions: ~750 tokens

The model receives exactly the tools it needs. No database schemas. No search API definitions. No calendar fields.

The Numbers

| Metric | All tools | Semantic retrieval | Delta |
| --- | --- | --- | --- |
| Tools injected | 27 | 3 | −89% |
| Tool definition tokens | ~6,750 | ~750 | −89% |
| Cost per request (tool overhead only, $15/1M input tokens) | $0.101 | $0.011 | −89% |
| Monthly cost at 100K requests | ~$10,125 | ~$1,125 | −$9,000 |
| Monthly cost at 1M requests | ~$101,250 | ~$11,250 | −$90,000 |

These numbers reflect tool definition overhead only, not the cost of the actual conversation, tool results, or model responses. The full token count also includes the user message, conversation history, and tool outputs. But tool definition overhead is pure waste: it is context that doesn't change the outcome; it is only there because every tool was loaded.
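The table's arithmetic can be reproduced in a few lines. The only inputs are the ~250-token average per definition, the 27-vs-3 tool counts, and the $15/1M input price used above:

```typescript
// Reproduce the table's numbers from the benchmark inputs above.
const PRICE_PER_TOKEN = 15 / 1_000_000; // $15 per million input tokens
const AVG_TOKENS_PER_TOOL = 250;        // benchmark average per definition

// Monthly cost of tool-definition overhead alone
function toolOverheadCost(toolCount: number, requests: number): number {
  return toolCount * AVG_TOKENS_PER_TOOL * PRICE_PER_TOKEN * requests;
}

const allTools = toolOverheadCost(27, 100_000);  // ≈ $10,125/month
const retrieved = toolOverheadCost(3, 100_000);  // ≈ $1,125/month
console.log(allTools - retrieved);               // ≈ $9,000/month saved
```

Swap in your own tool count, average definition size, and request volume to estimate the overhead in your setup; the percentage saved depends only on the tool-count ratio, while the dollar figure scales with volume and price.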

The Accuracy Argument

The cost reduction is straightforward arithmetic. The accuracy argument is subtler but equally important.

When you present a model with 27 tools for a simple git commit task, you're asking it to correctly identify 1–2 relevant tools out of 27. For a capable model like claude-opus-4-6, it usually gets this right — but "usually" isn't "always," and the probability of a wrong tool selection increases with the number of irrelevant tools in context.

Semantic retrieval narrows the selection space before the model ever sees it. With 3 highly relevant tools instead of 27 mixed ones, the model is more likely to select correctly and less likely to produce a confused, multi-step response trying to reconcile unrelated capabilities.

Scaling the Tool Set

The benchmark above uses 27 tools. Production setups grow beyond that. The pattern scales linearly:

| Total tools | Retrieved (top_k = 5) | Token savings |
| --- | --- | --- |
| 50 | 10% loaded | ~90% |
| 100 | 5% loaded | ~95% |
| 200 | 2.5% loaded | ~97.5% |

As your tool count grows, the absolute savings increase — and so does the accuracy benefit from reduced context noise.
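Under the simplifying assumption that tool definitions are roughly equal in size, the savings column is just 1 − top_k/N:

```typescript
// Fraction of tool-definition tokens saved when only top_k of N tools are
// injected, assuming roughly equal definition sizes across tools.
function tokenSavings(totalTools: number, topK: number): number {
  return 1 - topK / totalTools;
}

console.log(tokenSavings(50, 5));  // 90% saved
console.log(tokenSavings(100, 5)); // 95% saved
console.log(tokenSavings(200, 5)); // 97.5% saved
```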

Setting Up Retrieval

Enable MCP servers in Dashboard → MCP Servers, then call /retrieve_tools instead of /tools when building your prompts. The Playground shows relevance scores alongside retrieved tools so you can verify which tools are being selected and tune top_k before writing integration code.

For most single-domain queries, top_k=3 is sufficient. For multi-step or cross-domain workflows, top_k=5–8 gives the model enough breadth without reintroducing the noise problem.
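As a sketch of how this might look in integration code. The chooseTopK heuristic is our illustrative assumption, not part of the Agent-CoreX API; the /retrieve_tools endpoint and its query/top_k parameters are the ones shown earlier:

```typescript
// Pick a top_k based on query shape. The thresholds here are an assumption
// following the guidance above, not an API feature.
function chooseTopK(crossDomain: boolean): number {
  // 3 covers single-domain queries; cross-domain workflows need more breadth
  return crossDomain ? 6 : 3;
}

// Thin wrapper around the /retrieve_tools endpoint shown above
async function retrieveTools(query: string, crossDomain = false) {
  const url =
    `${process.env.ACX_API_BASE}/retrieve_tools` +
    `?query=${encodeURIComponent(query)}&top_k=${chooseTopK(crossDomain)}`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.ACX_API_KEY}` },
  });
  const { tools } = await res.json();
  return tools; // pass directly as the `tools` array in messages.create
}
```

Keeping top_k selection in one helper makes it easy to tune later from Playground relevance scores without touching call sites.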

Create a free account and run the comparison yourself →

Try Agent-CoreX for free

Connect 100+ MCP tools. Cut LLM costs by 60%. Setup in 2 minutes.

Get started free