Benchmark · Token Cost · Claude · Tool Routing · Cost Optimization

Claude Tool Loading vs Agent-CoreX: Token Cost Benchmark

Tool definitions are expensive at scale. Benchmark: traditional tool loading vs Agent-CoreX's semantic retrieval, using data from the Anthropic leak.

April 1, 2026 · 5 min read · by Agent-CoreX

One of the clearer revelations from the Anthropic Claude Code leak is just how much lives in the tool layer. The base tool definition in the leaked harness spans 29,000 lines. At scale, tool definitions are a significant driver of per-request token cost — and they're mostly invisible until you start measuring.

This post runs a concrete benchmark: the same task, the same model, the same result — with all tools loaded versus with semantic retrieval.

Benchmark Setup

Task: "Commit the current changes to the main branch on GitHub"

Tools available (a realistic production setup):

  • GitHub (create_commit, push_branch, create_pr, list_issues, search_code, create_branch, merge_pr) — 7 tools
  • PostgreSQL (query, insert, update, delete, list_tables, describe_table) — 6 tools
  • File system (read_file, write_file, list_directory, search_files, move_file) — 5 tools
  • Slack (send_message, list_channels, get_thread, create_reminder) — 4 tools
  • Brave Search (search, get_page) — 2 tools
  • Calendar (create_event, list_events, update_event) — 3 tools

Total: 27 tools across 6 servers.

Each tool definition averages ~250 tokens (name, description, input schema with type annotations and descriptions). Some are simpler; some — like create_pr with its body, labels, reviewers, and branch fields — are considerably larger.
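For back-of-envelope planning, a rough character-based heuristic (~4 characters per token) is enough to see why a schema-heavy tool like create_pr weighs more than a simple one. Both the heuristic and the sample schema below are illustrative sketches, not the model's actual tokenizer or the real tool definitions:

```typescript
// Rough token estimate for a tool definition: ~4 characters per token.
// This is a planning heuristic, not an exact BPE count.
type ToolDef = { name: string; description: string; input_schema: object };

function estimateTokens(tool: ToolDef): number {
  return Math.ceil(JSON.stringify(tool).length / 4);
}

// Hypothetical create_pr schema, sketched for illustration
const createPr: ToolDef = {
  name: "create_pr",
  description: "Open a pull request on GitHub",
  input_schema: {
    type: "object",
    properties: {
      title: { type: "string", description: "PR title" },
      body: { type: "string", description: "PR description in Markdown" },
      labels: { type: "array", items: { type: "string" } },
      reviewers: { type: "array", items: { type: "string" } },
      base: { type: "string", description: "Target branch" },
      head: { type: "string", description: "Source branch" },
    },
    required: ["title", "head", "base"],
  },
};

console.log(estimateTokens(createPr)); // field-rich schemas estimate much higher
```

Run this against your own tool definitions before deciding whether retrieval is worth wiring up; the spread between the smallest and largest tools is usually wider than the ~250-token average suggests.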

Approach 1: All Tools, Every Request

Traditional agent setup: load all enabled tools and inject them into every request.

import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

// Fetch all tools from all enabled MCP servers
const { tools: allTools } = await fetch(`${ACX_API_BASE}/tools`, {
  headers: { Authorization: `Bearer ${ACX_API_KEY}` }
}).then(r => r.json())

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  tools: allTools, // 27 tools
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})

Token overhead from tool definitions: ~6,750 tokens

The model receives all 27 tool definitions — database tools, search tools, calendar tools — none of which are relevant to this git commit task. It has to scan all of them to find the 1–2 that actually apply.

Approach 2: Semantic Retrieval with Agent-CoreX

Retrieve only the tools relevant to the current query before building the prompt:

// Retrieve only tools relevant to this specific query
const { tools } = await fetch(
  `${ACX_API_BASE}/retrieve_tools?query=commit+current+changes+to+main+branch&top_k=3`,
  { headers: { Authorization: `Bearer ${ACX_API_KEY}` } }
).then(r => r.json())

// Retrieved: create_commit, push_branch, create_branch — 3 tools

const message = await anthropic.messages.create({
  model: "claude-opus-4-6",
  tools, // 3 tools
  messages: [{ role: "user", content: "Commit the current changes to main" }],
})

Token overhead from tool definitions: ~750 tokens

The model receives exactly the tools it needs. No database schemas. No search API definitions. No calendar fields.

The Numbers

| Metric | All tools | Semantic retrieval | Delta |
| --- | --- | --- | --- |
| Tools injected | 27 | 3 | −89% |
| Tool definition tokens | ~6,750 | ~750 | −89% |
| Cost per request (tool overhead only, $15/1M input tokens) | $0.101 | $0.011 | −89% |
| Monthly cost at 100K requests | ~$10,125 | ~$1,125 | −$9,000 |
| Monthly cost at 1M requests | ~$101,250 | ~$11,250 | −$90,000 |

These numbers reflect tool definition overhead only, not the cost of the actual conversation, tool results, or model responses. The full token count also includes the user message, conversation history, and tool outputs. But tool definition overhead is pure waste: it is context that doesn't change the outcome; it is only there because every tool was loaded.
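The table's arithmetic can be reproduced in a few lines. The only inputs are the ~250-token average per definition, the 27-vs-3 tool counts, and the $15/1M input price used above:

```typescript
// Reproduce the table's numbers from the benchmark inputs above.
const PRICE_PER_TOKEN = 15 / 1_000_000; // $15 per million input tokens
const AVG_TOKENS_PER_TOOL = 250;        // benchmark average per definition

// Monthly cost of tool-definition overhead alone
function toolOverheadCost(toolCount: number, requests: number): number {
  return toolCount * AVG_TOKENS_PER_TOOL * PRICE_PER_TOKEN * requests;
}

const allTools = toolOverheadCost(27, 100_000);  // ≈ $10,125/month
const retrieved = toolOverheadCost(3, 100_000);  // ≈ $1,125/month
console.log(allTools - retrieved);               // ≈ $9,000/month saved
```

Swap in your own tool count, average definition size, and request volume to estimate the overhead in your setup; the percentage saved depends only on the tool-count ratio, while the dollar figure scales with volume and price.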

The Accuracy Argument

The cost reduction is straightforward arithmetic. The accuracy argument is subtler but equally important.

When you present a model with 27 tools for a simple git commit task, you're asking it to correctly identify 1–2 relevant tools out of 27. For a capable model like claude-opus-4-6, it usually gets this right — but "usually" isn't "always," and the probability of a wrong tool selection increases with the number of irrelevant tools in context.

Semantic retrieval narrows the selection space before the model ever sees it. With 3 highly relevant tools instead of 27 mixed ones, the model is more likely to select correctly and less likely to produce a confused, multi-step response trying to reconcile unrelated capabilities.

Scaling the Tool Set

The benchmark above uses 27 tools. Production setups grow beyond that. The pattern scales linearly:

| Total tools | Retrieved (top_k = 5) | Token savings |
| --- | --- | --- |
| 50 | 10% loaded | ~90% |
| 100 | 5% loaded | ~95% |
| 200 | 2.5% loaded | ~97.5% |

As your tool count grows, the absolute savings increase — and so does the accuracy benefit from reduced context noise.
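Under the simplifying assumption that tool definitions are roughly equal in size, the savings column is just 1 − top_k/N:

```typescript
// Fraction of tool-definition tokens saved when only top_k of N tools are
// injected, assuming roughly equal definition sizes across tools.
function tokenSavings(totalTools: number, topK: number): number {
  return 1 - topK / totalTools;
}

console.log(tokenSavings(50, 5));  // 90% saved
console.log(tokenSavings(100, 5)); // 95% saved
console.log(tokenSavings(200, 5)); // 97.5% saved
```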

Setting Up Retrieval

Enable MCP servers in Dashboard → MCP Servers, then call /retrieve_tools instead of /tools when building your prompts. The Playground shows relevance scores alongside retrieved tools so you can verify which tools are being selected and tune top_k before writing integration code.

For most single-domain queries, top_k=3 is sufficient. For multi-step or cross-domain workflows, top_k=5–8 gives the model enough breadth without reintroducing the noise problem.
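As a sketch of how this might look in integration code. The chooseTopK heuristic is our illustrative assumption, not part of the Agent-CoreX API; the /retrieve_tools endpoint and its query/top_k parameters are the ones shown earlier:

```typescript
// Pick a top_k based on query shape. The thresholds here are an assumption
// following the guidance above, not an API feature.
function chooseTopK(crossDomain: boolean): number {
  // 3 covers single-domain queries; cross-domain workflows need more breadth
  return crossDomain ? 6 : 3;
}

// Thin wrapper around the /retrieve_tools endpoint shown above
async function retrieveTools(query: string, crossDomain = false) {
  const url =
    `${process.env.ACX_API_BASE}/retrieve_tools` +
    `?query=${encodeURIComponent(query)}&top_k=${chooseTopK(crossDomain)}`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.ACX_API_KEY}` },
  });
  const { tools } = await res.json();
  return tools; // pass directly as the `tools` array in messages.create
}
```

Keeping top_k selection in one helper makes it easy to tune later from Playground relevance scores without touching call sites.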

Create a free account and run the comparison yourself →

Try Agent-CoreX for free

Connect 100+ MCP tools. Cut LLM costs by 60%. Setup in 2 minutes.

Get started free