Fewer tool calls, not quicker ones
Faster Inference Won't Save You: Part 1
Every modern coding agent is trying to accelerate code search. Faster models, parallel tool calls, RL-trained search agents, vector indexes. All of it aimed at the same goal: kill the latency and context-bloat of code search.
They're optimizing the wrong variable.
Code search is graph search
A codebase is a directed graph G = (V, E).
- V is the set of code entities: functions, classes, types, modules
- E is the set of directed edges between them, each labeled with a relationship type
When function A calls function B, that's an edge. When a file imports a module, that's an edge. When a type extends another type, that's an edge. A codebase isn't a bag of files. It's a web of relationships.
Every code search task is a search over this graph:
- "Where is getUserById defined?" — find a vertex by name
- "What calls it?" — find its neighbors
- "What's the blast radius if I change it?" — find every vertex reachable from it
These are textbook graph problems. BFS finds reachable nodes. DFS computes transitive closures. The algorithms are fast — linear in the number of edges. The question isn't which algorithm to use. It's who runs it.
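As a concrete sketch, the "blast radius" query is plain BFS over reversed call edges. The call graph below is invented for illustration (none of these function names are real); it just shows that the traversal itself is cheap and linear in the number of edges:

```javascript
// Hypothetical mini call graph: each entry maps a function to the
// functions that call it (reverse edges, callee -> callers).
const callers = {
  validateToken: ['requireAuth', 'refreshSession'],
  requireAuth: ['getUser', 'updateUser'],
  refreshSession: [],
  getUser: [],
  updateUser: [],
}

// BFS from a function to everything that transitively depends on it:
// the "blast radius" of a change.
function blastRadius(fn) {
  const visited = new Set([fn])
  const queue = [fn]
  while (queue.length > 0) {
    const current = queue.shift()
    for (const caller of callers[current] ?? []) {
      if (!visited.has(caller)) {
        visited.add(caller)
        queue.push(caller)
      }
    }
  }
  visited.delete(fn) // the changed function itself isn't "affected"
  return visited
}
```

Running `blastRadius('validateToken')` on this toy graph visits all four dependents in a handful of operations. The algorithm was never the bottleneck; the question is whether it runs in a sandbox or turn by turn inside the model.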
The model as search loop
With grep, the model is the search algorithm. Each turn is one iteration of the loop: pick a node, text-search for it, process the results, decide which node to visit next.
Each iteration costs:
- T_ttft — time-to-first-token. The model waking up.
- T_reason — the thinking tokens burned deciding what to do next. Invisible to you, not free.
- T_gen — the output tokens. The actual tool call.
- T_net — the round-trip over the wire.
- T_tool — tool execution time.
For search tools, T_tool ≈ 0. Ripgrep returns in milliseconds. The search tool is fast. The model thinking about what to search, and then thinking about the results — that's what costs you.
Total time for a code search task:

T_total = N × (T_ttft + T_reason + T_gen + T_net + T_tool)

N is the number of iterations — one per turn. If the answer lives at depth d in the graph, the model needs at least d turns to reach it, because grep can only see one hop at a time.
The per-turn costs have hard floors:
- T_ttft is bounded by model size and hardware
- T_net is bounded by physics
- T_reason scales with noise — 50 grep matches cost more reasoning than 3 structured results
None of these can be optimized to zero.
Traditionally, N and the per-turn costs are coupled. Smarter models use fewer turns — they make better search decisions, take fewer wrong paths, need less backtracking. But they're slower per turn: higher TTFT, more reasoning tokens, longer generation. Dumber models are fast per turn but waste turns on bad decisions. You can pick a fast model that needs 12 turns or a smart model that needs 6 but takes twice as long each time. Either way you end up in roughly the same place.
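The coupling is easy to see in the arithmetic. The per-turn numbers below are invented for illustration, not measurements: a fast model that needs 12 turns and a smart model that needs 6 turns at roughly twice the per-turn cost end up within a few seconds of each other.

```javascript
// T_total = N * (T_ttft + T_reason + T_gen + T_net + T_tool)
// All numbers are illustrative (seconds), not benchmarks.
function totalTime(turns, perTurn) {
  const { ttft, reason, gen, net, tool } = perTurn
  return turns * (ttft + reason + gen + net + tool)
}

// Fast model: cheap turns, but bad search decisions mean more of them.
const fast = totalTime(12, { ttft: 0.5, reason: 1.0, gen: 0.5, net: 0.3, tool: 0.01 })

// Smart model: half the turns, each roughly twice as expensive.
const smart = totalTime(6, { ttft: 1.0, reason: 2.0, gen: 1.0, net: 0.3, tool: 0.01 })

console.log(`fast: ${fast.toFixed(2)}s, smart: ${smart.toFixed(2)}s`)
```

Trading turn count against turn cost moves you along the curve; it doesn't get you off it.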
Most of the industry tries to push within this tradeoff. Claude Code runs an Explore subagent on a lighter model so each turn costs fewer tokens. Cognition's SWE-grep fires eight parallel grep calls per turn, trained with a parallelism penalty so the model doesn't search sequentially. Relace does something similar with 4 to 12 parallel calls. Cursor pre-indexes the entire codebase into a vector database. Aider builds a tree-sitter repo map that front-loads structural context before the model even asks.
Good engineering, all of it. But none of it breaks the tradeoff. The agent still greps through the codebase in turns. Packing more parallel calls into each turn helps, but you still need multiple turns of think-search-think. And each turn burns TTFT, reasoning tokens, and network latency — costs with hard floors that no amount of optimization can eliminate.
The better move is reducing N without touching the per-turn costs. Not a smarter model — better tools. A mediocre model with the right tools takes 1 turn. A brilliant model with grep still takes 6+.
How we reduce N
The difference between grep and code mode isn't speed per turn. It's what you can express in one turn.
With grep, each turn does one thing: search for a text pattern. The result comes back, the model reasons about it, decides what to search next, and issues another turn. Each step of the graph traversal is a separate think-search-think cycle.
Code mode flips this. Instead of calling a search tool, the agent writes a script. That script runs inside a sandbox with access to ripgrep search, file tree listing, tree-sitter AST parsing, LSP queries, and arbitrary control flow. All of it executes in one turn. The agent programs its search instead of doing it one grep at a time.
Operations that would take 9 turns of grep-think-grep collapse into a single script that traverses the graph programmatically. The agent pays TTFT + reasoning + generation + network once, then the sandbox does the rest.
And it compounds. When N drops, T_reason drops with it. A grep agent processing 50 text matches burns hundreds of reasoning tokens parsing them and deciding what to search next. A code mode script returns precise, structured results — not walls of text to reason through. Reducing N from 9 to 1 doesn't divide time by 9. It divides by more, because each remaining turn is also cheaper.
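A back-of-envelope sketch of the compounding, again with invented numbers: cut 9 turns to 1, and because structured results also shrink per-turn reasoning, the speedup comfortably exceeds 9x.

```javascript
// Illustrative per-turn costs in seconds: ttft + reasoning + gen + net.
// A grep turn carries heavy reasoning over ~50 noisy text matches;
// a code mode turn reasons over structured results instead.
const grepTurn = 0.5 + 3.0 + 0.5 + 0.3
const codeModeTurn = 0.5 + 0.8 + 0.5 + 0.3

const grepTotal = 9 * grepTurn       // nine think-search-think cycles
const codeModeTotal = 1 * codeModeTurn // one turn; the sandbox does the rest

const speedup = grepTotal / codeModeTotal
console.log(`speedup: ${speedup.toFixed(1)}x`)
```

Under these assumptions the speedup is roughly 18x, not 9x: fewer turns times cheaper turns.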
Three examples show how this scales.
1. Transitive impact analysis
Task: "What functions are affected if I change validateToken?"
With grep, the agent discovers each level of the call tree in a separate turn. Grep for validateToken, process results, grep for callers of those callers, process results, keep going. Nine turns across four levels, each one burning TTFT + reasoning + network. And it has to manually track which functions it already visited to avoid cycles. If it misses a branch, the analysis is wrong.
With code mode, the agent writes a script in a single turn:
// "What's affected if I change validateToken?"
const fn = file('src/auth/token.ts').symbol('validateToken')
const visited = new Set()

function blast(span) {
  const refs = lsp.references(span.file.path, span.line, span.column)
  for (const ref of refs) {
    const key = `${ref.file.path}:${ref.line}`
    if (visited.has(key)) continue
    visited.add(key)
    ref.expand(2).log() // show each caller with context
    blast(ref) // recurse into their callers
  }
}

blast(fn[0])
console.log(`${visited.size} functions affected`)
One turn. The script chains file().symbol() into lsp.references() into recursion into ref.expand().log() — four different APIs and arbitrary control flow, all executing inside the sandbox. With grep, each of those steps is a separate turn. The agent greps, reasons about the results, greps again, reasons again. The graph traversal happens outside the sandbox, one think-search-think cycle at a time. In code mode, the agent programs the traversal and the sandbox runs it.
2. Dead code detection
Task: "Find all exported functions in src/utils/ that nothing imports."
With grep, the agent has to grep the entire codebase once per exported function. If src/utils/ exports 30 functions, that's 30 separate turns, each one paying TTFT + reasoning + network just to check whether a single function is used. Then the agent cross-references results across turns to figure out which functions had zero matches outside their own file.
With code mode, one turn:
// "Find dead exports in src/utils/"
for (const f of tree('src/utils/**/*.ts')) {
  for (const sym of f.symbols()) {
    if (!sym.text.startsWith('export')) continue
    const refs = lsp.references(f.path, sym.line, sym.column)
    const external = refs.filter(r => r.file.path !== f.path)
    if (external.length === 0) {
      console.log(`Dead: ${sym.name} in ${f.path}`)
      sym.expand(1).log()
    }
  }
}
The script pipes tree() into .symbols() into lsp.references() into .filter() — a pipeline you can't express in grep. You can't pipe grep results into an LSP query. Code mode can because it's code. Thirty symbol lookups, zero additional turns.
3. Type flow tracing
Task: "Trace the type of the user parameter from the API handler down to the database query."
With grep, the agent greps for "user" and gets hundreds of matches — every variable, comment, and string containing the word "user" across the entire codebase. It reads files to narrow down which user is the right one, greps for the function that receives it, reads that file, greps for the next function in the chain. Eight or more turns, most of them spent reasoning through noise from a common variable name.
With code mode, the agent uses the type system instead of text matching:
// "Trace user from handler to DB"
const param = file('src/api/handler.ts').symbol('user')
const hover = lsp.hover(param[0].file.path, param[0].line, param[0].column)
console.log(`Type: ${hover.text}`) // "(parameter) user: AuthenticatedUser"

const refs = lsp.references(param[0].file.path, param[0].line, param[0].column)
console.log(`Flows to ${refs.length} locations:`)
for (const ref of refs) {
  const h = lsp.hover(ref.file.path, ref.line, ref.column)
  console.log(`  ${ref.file.path}:${ref.line} — ${h.text}`)
  ref.expand(3).log()
}
Tree-sitter resolves the AST node, lsp.hover() reads the type, lsp.references() follows the symbol, and a for-loop ties them together — all in one script, one turn. No sequence of grep calls can replicate this because the type-awareness comes from having tree-sitter and LSP available together in the same execution. Grep for "user" returns hundreds of text matches and the model has to reason through all of them. This returns the ones that matter.