Every tool definition costs tokens. Learn how Agent-CoreX's retrieve_tools API cuts waste by selecting only the tools relevant to each query.
If you're running AI agents in production, token costs compound quickly. One of the biggest hidden drivers is tool definitions — the JSON schemas you pass to your model so it knows what tools exist.
This post explains exactly why that's expensive, and how Agent-CoreX's semantic retrieval approach fixes it.
Every MCP tool has a name, description, input schema, and often examples. A typical tool definition is 100–300 tokens. If you have 30 enabled tools and pass them all to Claude on every request, that's 3,000–9,000 tokens of overhead — before your user's message even starts.
At scale:
| Tools per request | Tokens overhead | Monthly cost at 100K requests ($15/1M tokens) |
|---|---|---|
| 30 tools (all) | ~6,000 | ~$9,000 |
| 5 tools (routed) | ~1,000 | ~$1,500 |
| 3 tools (routed) | ~600 | ~$900 |
The difference between passing all tools versus routing to 3–5 relevant ones is significant at any real traffic level.
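The arithmetic behind that table is simple enough to sketch, assuming a ~200-token average per tool definition and the $15/1M-token rate used above:

```javascript
// Back-of-envelope estimate of monthly tool-definition overhead.
// Assumes ~200 tokens per tool definition and $15 per 1M input tokens,
// matching the table above.
function monthlyToolOverheadUSD(toolsPerRequest, requestsPerMonth, {
  tokensPerTool = 200,
  dollarsPerMillionTokens = 15,
} = {}) {
  const tokensPerRequest = toolsPerRequest * tokensPerTool;
  const totalTokens = tokensPerRequest * requestsPerMonth;
  return (totalTokens / 1_000_000) * dollarsPerMillionTokens;
}

console.log(monthlyToolOverheadUSD(30, 100_000)); // all tools → 9000
console.log(monthlyToolOverheadUSD(5, 100_000));  // routed subset → 1500
```

Tweak `tokensPerTool` to match your own schemas; verbose input schemas with examples land at the high end of the 100–300 range.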
Agent-CoreX exposes a /retrieve_tools endpoint that takes your user's query and returns only the tools semantically relevant to it.
Under the hood, it embeds both your query and your tool descriptions into a vector space, then returns the closest matches. The V2 version of the endpoint uses Qdrant for even more accurate vector search.
The result: instead of dumping your entire toolset into Claude's context, you pass a small, focused subset.
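Conceptually, the retrieval step works like the following sketch. This is an illustration of similarity-based ranking, not Agent-CoreX's actual implementation: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the tool list is invented for the example.

```javascript
// Illustrative sketch of semantic tool retrieval. A real system uses a
// learned embedding model; this toy `embed` just counts vocabulary hits.
const VOCAB = ["email", "send", "calendar", "event", "file", "search"];

function embed(text) {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map(v => words.filter(w => w === v).length);
}

function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Score every tool description against the query, keep the top_k closest.
function retrieveTools(query, tools, topK = 5) {
  const q = embed(query);
  return tools
    .map(t => ({ ...t, score: cosine(q, embed(t.description)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const tools = [
  { name: "send_email", description: "send an email message" },
  { name: "create_event", description: "create a calendar event" },
  { name: "file_search", description: "search file contents" },
];
console.log(retrieveTools("send an email to the team", tools, 1)[0].name);
// → "send_email"
```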
```javascript
// ❌ Expensive — passes all tool definitions every time
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

const allTools = await fetch(`${ACX_API_BASE}/tools`, {
  headers: { Authorization: `Bearer ${ACX_API_KEY}` },
}).then(r => r.json())

const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  tools: allTools.tools, // could be 30+ tools = thousands of tokens
  messages: [{ role: "user", content: userQuery }],
})
```
```javascript
// ✅ Efficient — retrieves only relevant tools for this specific query
const retrieved = await fetch(
  `${ACX_API_BASE}/retrieve_tools?query=${encodeURIComponent(userQuery)}&top_k=5`,
  {
    headers: { Authorization: `Bearer ${ACX_API_KEY}` },
  }
).then(r => r.json())

const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  tools: retrieved.tools, // 3–5 tools, not 30
  messages: [{ role: "user", content: userQuery }],
})
```
Same result for Claude. Fraction of the token cost.
The V2 endpoint (/v2/retrieve_tools) adds user-scoped retrieval backed by Qdrant. This means tool vectors are stored and searched per user, so the retrieval is personalized to the specific servers and packs that user has enabled:
```javascript
const retrieved = await fetch(
  `${ACX_API_BASE}/v2/retrieve_tools?query=${encodeURIComponent(userQuery)}&top_k=5`,
  {
    headers: { Authorization: `Bearer ${ACX_API_KEY}` },
  }
).then(r => r.json())
```
The system automatically falls back to V1 if V2 is unavailable — the same behavior you can test in the Playground.
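That fallback happens server-side, but if you want the same resilience in your own client code, a minimal sketch could look like the following. The endpoint paths and response shape are taken from the examples above; the function signature and parameter names are this sketch's own invention.

```javascript
// Client-side V2 → V1 fallback sketch. `apiBase` and `apiKey` correspond
// to the ACX_API_BASE / ACX_API_KEY values used throughout this post.
async function retrieveToolsWithFallback(userQuery, { apiBase, apiKey, topK = 5 }) {
  const params = `query=${encodeURIComponent(userQuery)}&top_k=${topK}`;
  const headers = { Authorization: `Bearer ${apiKey}` };

  // Prefer user-scoped V2 retrieval...
  const v2 = await fetch(`${apiBase}/v2/retrieve_tools?${params}`, { headers });
  if (v2.ok) return v2.json();

  // ...and fall back to V1 if V2 is unavailable.
  const v1 = await fetch(`${apiBase}/retrieve_tools?${params}`, { headers });
  return v1.json();
}
```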
To track which tools are being selected and why, log each routing decision with the /query/log endpoint:
```javascript
await fetch(`${ACX_API_BASE}/query/log`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ACX_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    query: userQuery,
    source: "api",
    selected_tools: retrieved.tools.map(t => t.name),
    scores: retrieved.scores, // relevance scores per tool
  }),
})
```
You can then review this data in Dashboard → Queries to see which tools are being selected, how often, and with what confidence scores.
Set top_k to match your use case:
- top_k=3 for narrow, single-domain agents where queries map cleanly to a handful of tools
- top_k=5–8 for general-purpose assistants that span several domains
- top_k=10 when queries are broad or unpredictable and recall matters more than token savings

Organize tools into Custom Packs: Dashboard → Custom Packs lets you group servers by domain. Retrieval within a pack is faster and more accurate than searching across all your servers.
Monitor tool utilization: Dashboard → Usage shows per-query token counts and tool call patterns. Tools with very low utilization rates are candidates for disabling — they contribute noise to the retrieval space without adding value.
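If you also export those routing logs, a quick aggregation surfaces the low-utilization tools. This is an illustrative sketch assuming log entries shaped like the /query/log payload shown earlier; the sample data is invented.

```javascript
// Aggregate logged routing decisions to find rarely selected tools —
// candidates for disabling. Each entry mirrors the /query/log body above.
function toolSelectionRates(logs) {
  const counts = {};
  for (const log of logs) {
    for (const name of log.selected_tools) {
      counts[name] = (counts[name] || 0) + 1;
    }
  }
  return Object.entries(counts)
    .map(([name, n]) => ({ name, rate: n / logs.length }))
    .sort((a, b) => a.rate - b.rate); // least-selected first
}

const logs = [
  { query: "send mail", selected_tools: ["send_email", "contacts_lookup"] },
  { query: "book meeting", selected_tools: ["create_event", "send_email"] },
  { query: "email team", selected_tools: ["send_email", "create_event"] },
];
console.log(toolSelectionRates(logs)[0]); // least-selected tool and its rate
```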
Semantic retrieval is not a silver bullet: you can still overspend if top_k is set higher than your agents actually need. Tune it to your actual usage patterns.

Enable MCP servers in Dashboard → MCP Servers, then try the retrieval API in the Playground to see which tools get selected for your queries. The Playground shows relevance scores alongside results, which makes it easy to tune top_k and verify your server setup before writing any integration code.
Connect 100+ MCP tools. Cut LLM costs by 60%. Setup in 2 minutes.
Get started free