Cost Optimization · LLM · Tool Routing · AI Infrastructure

Reduce LLM Costs with Semantic Tool Routing

Every tool definition costs tokens. Learn how Agent-CoreX's retrieve_tools API cuts waste by selecting only the tools relevant to each query.

March 25, 2026 · 4 min read · by Agent-CoreX

If you're running AI agents in production, token costs compound quickly. One of the biggest hidden drivers is tool definitions — the JSON schemas you pass to your model so it knows what tools exist.

This post explains exactly why that's expensive, and how Agent-CoreX's semantic retrieval approach fixes it.

Why Tool Definitions Cost So Much

Every MCP tool has a name, description, input schema, and often examples. A typical tool definition is 100–300 tokens. If you have 30 enabled tools and pass them all to Claude on every request, that's 3,000–9,000 tokens of overhead — before your user's message even starts.

At scale:

| Tools per request | Token overhead | Monthly cost at 100K requests ($15/1M tokens) |
|---|---|---|
| 30 tools (all) | ~6,000 | ~$9,000 |
| 5 tools (routed) | ~1,000 | ~$1,500 |
| 3 tools (routed) | ~600 | ~$900 |

The difference between passing all tools versus routing to 3–5 relevant ones is significant at any real traffic level.
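The table's figures follow from straightforward arithmetic. A quick sketch to verify them (the function name is ours, purely illustrative):

```typescript
// Monthly overhead = tokens per request × requests per month × price per token.
function monthlyOverheadCost(
  tokensPerRequest: number,
  requestsPerMonth: number,
  dollarsPerMillionTokens: number,
): number {
  return (tokensPerRequest * requestsPerMonth / 1_000_000) * dollarsPerMillionTokens;
}

console.log(monthlyOverheadCost(6_000, 100_000, 15)); // 9000 — the "30 tools" row
console.log(monthlyOverheadCost(600, 100_000, 15));   // 900  — the "3 tools" row
```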

How Agent-CoreX's Retrieval API Works

Agent-CoreX exposes a /retrieve_tools endpoint that takes your user's query and returns only the tools semantically relevant to it.

Under the hood, it embeds both your query and your tool descriptions into a vector space, then returns the closest matches. The V2 version of the endpoint uses Qdrant for even more accurate vector search.

The result: instead of dumping your entire toolset into Claude's context, you pass a small, focused subset.
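To build intuition for what "closest matches in a vector space" means, here's a toy sketch of ranking tools by cosine similarity. The hard-coded vectors and tool names are illustrative only — in Agent-CoreX the embeddings come from a real embedding model:

```typescript
type EmbeddedTool = { name: string; vector: number[] };

// Cosine similarity: dot product of the vectors divided by their magnitudes.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every tool against the query vector, return the k best names.
function topK(query: number[], tools: EmbeddedTool[], k: number): string[] {
  return tools
    .map(t => ({ name: t.name, score: cosine(query, t.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(t => t.name);
}

// Toy embeddings: an email-like query lands closest to the email tool.
const tools: EmbeddedTool[] = [
  { name: "send_email", vector: [0.9, 0.1, 0.0] },
  { name: "query_database", vector: [0.1, 0.9, 0.1] },
  { name: "create_calendar_event", vector: [0.2, 0.1, 0.9] },
];
const queryVector = [0.85, 0.15, 0.05]; // pretend embedding of "email Alice the report"
console.log(topK(queryVector, tools, 2)); // first element is "send_email"
```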

Before: All Tools, Every Request

```typescript
// ❌ Expensive — passes all tool definitions every time
const allTools = await fetch(`${ACX_API_BASE}/tools`, {
  headers: { Authorization: `Bearer ${ACX_API_KEY}` },
}).then(r => r.json())

const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  tools: allTools.tools,  // could be 30+ tools = thousands of tokens
  messages: [{ role: "user", content: userQuery }],
})
```

After: Semantic Retrieval Per Query

```typescript
// ✅ Efficient — retrieves only relevant tools for this specific query
const retrieved = await fetch(
  `${ACX_API_BASE}/retrieve_tools?query=${encodeURIComponent(userQuery)}&top_k=5`,
  {
    headers: { Authorization: `Bearer ${ACX_API_KEY}` },
  }
).then(r => r.json())

const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  tools: retrieved.tools,  // 3–5 tools, not 30
  messages: [{ role: "user", content: userQuery }],
})
```

Same result for Claude. Fraction of the token cost.

The V2 Endpoint: User-Scoped Qdrant Retrieval

The V2 endpoint (/v2/retrieve_tools) adds user-scoped retrieval backed by Qdrant. This means tool vectors are stored and searched per user, so the retrieval is personalized to the specific servers and packs that user has enabled:

```typescript
const retrieved = await fetch(
  `${ACX_API_BASE}/v2/retrieve_tools?query=${encodeURIComponent(userQuery)}&top_k=5`,
  {
    headers: { Authorization: `Bearer ${ACX_API_KEY}` },
  }
).then(r => r.json())
```

The system automatically falls back to V1 if V2 is unavailable — the same behavior you can test in the Playground.
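If you'd rather control the fallback explicitly in your own integration, a client-side version might look like this. The endpoint paths and the response shape follow the examples above; the helper itself is our sketch, not part of the Agent-CoreX SDK:

```typescript
// Try V2 first; on a non-OK response or network error, retry against V1.
async function retrieveTools(
  apiBase: string,
  apiKey: string,
  query: string,
  topK = 5,
): Promise<{ tools: unknown[] }> {
  const params = `query=${encodeURIComponent(query)}&top_k=${topK}`;
  for (const path of ["/v2/retrieve_tools", "/retrieve_tools"]) {
    try {
      const res = await fetch(`${apiBase}${path}?${params}`, {
        headers: { Authorization: `Bearer ${apiKey}` },
      });
      if (res.ok) return res.json();
    } catch {
      // network error — fall through to the next endpoint
    }
  }
  throw new Error("Both V2 and V1 retrieval endpoints failed");
}
```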

Query Logging for Analytics

To track which tools are being selected and why, log each routing decision with the /query/log endpoint:

```typescript
await fetch(`${ACX_API_BASE}/query/log`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ACX_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    query: userQuery,
    source: "api",
    selected_tools: retrieved.tools.map(t => t.name),
    scores: retrieved.scores,  // relevance scores per tool
  }),
})
```

You can then review this data in Dashboard → Queries to see which tools are being selected, how often, and with what confidence scores.

Practical Guidelines

Set top_k to match your use case:

  • Simple single-domain queries: top_k=3
  • Multi-step or cross-domain queries: top_k=5–8
  • Exploratory or ambiguous queries: top_k=10
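One way to apply these guidelines mechanically is a small heuristic that picks top_k from the query itself. The keyword list and thresholds below are illustrative assumptions, not part of the Agent-CoreX API:

```typescript
// Rough heuristic mapping query shape to a top_k value.
function pickTopK(query: string): number {
  const wordCount = query.trim().split(/\s+/).length;
  const multiStep = /\b(and then|after that|also|then)\b/i.test(query);
  if (wordCount <= 3) return 10; // short/ambiguous: cast a wide net
  if (multiStep) return 8;       // multi-step: leave room for several tools
  return 3;                      // focused single-domain query
}

console.log(pickTopK("email"));                                      // 10
console.log(pickTopK("send the report to Alice and then archive it")); // 8
console.log(pickTopK("summarize the latest sales figures"));           // 3
```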

Organize tools into Custom Packs: Dashboard → Custom Packs lets you group servers by domain. Retrieval within a pack is faster and more accurate than searching across all your servers.

Monitor tool utilization: Dashboard → Usage shows per-query token counts and tool call patterns. Tools with very low utilization rates are candidates for disabling — they contribute noise to the retrieval space without adding value.

What This Does Not Do

Semantic retrieval is not a silver bullet:

  • It doesn't replace good tool descriptions. Poorly written tool descriptions hurt retrieval accuracy.
  • It doesn't help if your top_k is still too large. Tune it to your actual usage patterns.
  • It doesn't eliminate tool execution costs — only the context overhead of tool definitions.

Getting Started

Enable MCP servers in Dashboard → MCP Servers, then try the retrieval API in the Playground to see which tools get selected for your queries. The Playground shows relevance scores alongside results, which makes it easy to tune top_k and verify your server setup before writing any integration code.
