Your agent is fast. Your plumbing isn't.
Faster Inference Won't Save You: Part 3
The model is someone else's problem. You call an API, you pay per token, the provider worries about the GPUs. But everything between "agent decides to act" and "agent gets the result" is your infrastructure. Network hops. Serialization. Memory allocation. Tool execution. State management. At one agent, none of this matters. At a hundred concurrent agents, every inefficiency multiplies.
You're not optimizing for one fast agent. You're optimizing for how many agents fit on the same hardware. The model cost is fixed — the provider charges per token regardless. The infrastructure cost per agent is yours. A bloated conversation loop that allocates 500MB of heap per agent means you run 16 agents where you could run 200. A naive sync protocol that retransmits 100k tokens every turn means your network is the bottleneck, not the model.
Plumbing efficiency is the scaling multiplier.
Keep the loop next to the model
Most coding agents run the conversation loop on the user's machine. The CLI calls the LLM API, waits for a response, processes tool calls, calls the API again. Every turn is a round trip from wherever the laptop is to wherever the model lives. Home Wi-Fi to US-East, back again. For a 10-turn task, that's 10 round trips through consumer internet.
Move the loop to the edge. Run each agent's conversation state machine in a Durable Object on Cloudflare, co-located with the model provider's API endpoint. The DO calls the provider directly — same region, often same datacenter. The round trip drops from 200ms consumer internet to single-digit milliseconds between machines in the same facility.
The user's machine stops driving the loop. It receives events over a WebSocket — streaming text, tool call suggestions, status updates — but it doesn't orchestrate anything. The DO decides when to call the LLM, when to retry, when to compress. The user's connection can drop and reconnect without losing state. The loop keeps running.
At a hundred agents, this matters because each agent's loop runs independently in its own DO. No central orchestrator queuing LLM calls. No single process managing a hundred conversations. Each DO is an isolated unit with its own SQLite database, its own WebSocket connections, its own event log. They scale horizontally by definition.
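In sketch form, the loop is just a state machine with the provider call and the event fan-out injected. The real version lives inside a Durable Object; everything here (names, shapes) is illustrative:

```typescript
// Minimal sketch of an edge-hosted conversation loop. The provider and the
// event sink are injected, so the loop itself is a plain state machine.
type ToolCall = { name: string; args: string };
type ModelTurn = { text: string; toolCalls: ToolCall[] };

interface Provider {
  complete(history: string[]): Promise<ModelTurn>;
}

interface EventSink {
  emit(event: { type: string; payload: string }): void; // WebSocket fan-out in practice
}

async function runLoop(
  provider: Provider,
  sink: EventSink,
  runTool: (call: ToolCall) => Promise<string>,
  task: string,
  maxTurns = 10,
): Promise<string[]> {
  const history: string[] = [task];
  for (let turn = 0; turn < maxTurns; turn++) {
    const out = await provider.complete(history); // same-region hop, not consumer internet
    history.push(out.text);
    sink.emit({ type: "assistant", payload: out.text });
    if (out.toolCalls.length === 0) break; // model is done
    for (const call of out.toolCalls) {
      const result = await runTool(call);
      history.push(result);
      sink.emit({ type: "tool_result", payload: result });
    }
  }
  return history;
}
```

The client is just another `EventSink` subscriber: it can disconnect and resubscribe without the loop noticing.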
Only send what changed
The naive approach: serialize the full conversation and send it on every turn. A 100k-token conversation is roughly 400KB of JSON. Every turn, every agent, full retransmission. At a hundred agents doing 5 turns per minute, that's 200MB/min of redundant data over the wire.
None of that retransmission is necessary. The DO already has the conversation history in its SQLite event log. The client doesn't need to resend it. It sends a delta — only the new messages since the last sync. A tool result, a user message, maybe 2-5KB instead of 400KB.
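Cursor-based delta sync fits in a few lines. The types here are illustrative, not the actual wire format:

```typescript
// Sketch of cursor-based delta sync. The log is append-only and ordered by
// seq; each peer tracks the last seq it has seen, and only the tail past
// that cursor goes over the wire.
type Message = { seq: number; body: string };

function deltaSince(log: Message[], cursor: number): Message[] {
  // Everything at or before the cursor is already on the other side.
  return log.filter((m) => m.seq > cursor);
}

function applyDelta(local: Message[], delta: Message[]): number {
  local.push(...delta);
  // The new cursor is the highest seq we now hold.
  return local.length ? local[local.length - 1].seq : 0;
}
```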
The same principle applies to compression. The DO maintains a frozen prefix — the portion of the conversation that's already been compressed in prior turns. On each new turn, only messages after the frozen cursor get re-analyzed. The previous compression output is prepended unchanged. This means compression work scales with the number of new messages, not the length of the conversation. An agent 200 turns deep does the same compression work as one 5 turns in.
Bandwidth and CPU now scale with the rate of new information, not total accumulated history. A long-running agent doesn't get more expensive to maintain.
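The frozen-prefix shape, as a sketch. `summarize` stands in for the real compression pass, and this simplified version freezes everything after each turn; the point is that already-compressed output is reused verbatim:

```typescript
// Sketch of frozen-prefix compression. Messages before the cursor were
// compressed on a previous turn; their output is prepended unchanged and
// only newer messages are re-analyzed.
type CompressState = { frozenCursor: number; frozenOutput: string[] };

function compressTurn(
  messages: string[],
  state: CompressState,
  summarize: (msgs: string[]) => string[],
): { context: string[]; state: CompressState } {
  const fresh = messages.slice(state.frozenCursor); // O(delta), not O(history)
  const compressedFresh = summarize(fresh);
  const frozenOutput = [...state.frozenOutput, ...compressedFresh];
  return {
    context: frozenOutput,
    state: { frozenCursor: messages.length, frozenOutput },
  };
}
```

An agent 200 turns deep hands `summarize` only the messages from the current turn, exactly like an agent 5 turns in.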
Prefix caching
LLM providers charge per input token, but they don't necessarily process every token from scratch. Anthropic's API and Google's Gemini API both support prompt caching — if the first N tokens of a request match a previously cached prefix, the provider skips reprocessing them. Cached tokens are cheaper and faster.
This aligns naturally with how the frozen prefix works. The compressed conversation prefix is stable across turns. Turn 50 has the same prefix as turn 49, plus a few new messages appended. The provider caches the prefix automatically. Each turn, the provider processes only the delta — the new messages and the latest tool results.
Concrete example: a 100k-token conversation where each turn adds 2k tokens of new content. Without caching, the provider processes 100k tokens per turn. With caching, it processes 2k. Time-to-first-token drops proportionally because the model doesn't re-attend to the cached prefix.
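The arithmetic, with placeholder numbers. The prices and the cached-token discount below are made up for illustration, not any provider's real rates:

```typescript
// Back-of-envelope for per-turn input cost with a cached prefix. The point
// is the ratio between fresh and cached input, not the absolute dollars.
function inputCostPerTurn(
  totalTokens: number,
  freshTokens: number,
  pricePerToken: number, // placeholder rate
  cachedDiscount: number, // e.g. 0.1 means cached tokens cost 10% of fresh
): number {
  const cached = totalTokens - freshTokens;
  return freshTokens * pricePerToken + cached * pricePerToken * cachedDiscount;
}

const noCache = inputCostPerTurn(100_000, 100_000, 3e-6, 0.1); // every token fresh
const withCache = inputCostPerTurn(100_000, 2_000, 3e-6, 0.1); // 98k-token cached prefix
// withCache is a small fraction of noCache; the exact ratio depends on the
// provider's cached-token discount.
```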
Each agent builds its own cached prefix with the provider independently. No infrastructure cost on our side — the provider handles it. But it only works if your conversation structure keeps a stable prefix across turns. The frozen prefix pattern isn't just a compression optimization. It's a cache alignment strategy.
Execute tools while the model is still talking
Most agents wait for the LLM to finish its entire response, parse the complete output for tool calls, then execute them one by one, then send all results back. The model generates for 5 seconds, the agent sits idle while tools run for 3 seconds, then the next turn starts. Tool execution and model generation never overlap.
With streaming tool detection, tool calls are identified as soon as they're fully generated — while the model may still be producing more output. The system validates the tool call inline, checks autonomy rules (can this tool run without human approval?), and if approved, starts execution immediately. The model generates tool call #2 while tool call #1 is already running. By the time the model finishes its response, some tools have already completed and their results are ready for the next turn.
This isn't speculative execution. The tool call is complete and validated before execution starts. The model is just still generating additional text or additional tool calls. The overlap is between execution of early tool calls and generation of later ones.
Without overlap, each agent spends roughly 40% of its wall-clock time idle — either the model waits for tools, or tools wait for the model. Multiply that by a hundred agents and you're wasting almost half your compute capacity on nothing. With streaming overlap, that dead time approaches zero.
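A sketch of the overlap, assuming a stream that yields either text chunks or fully-formed tool calls (the chunk shape and names are illustrative):

```typescript
// Sketch of streaming tool execution: a tool call starts running the moment
// it is complete in the stream, while the model keeps generating.
type Chunk = { kind: "text"; text: string } | { kind: "tool"; name: string };

async function consumeStream(
  stream: AsyncIterable<Chunk>,
  isAutoApproved: (name: string) => boolean, // autonomy rules check
  runTool: (name: string) => Promise<string>,
): Promise<{ text: string; results: string[] }> {
  let text = "";
  const inFlight: Promise<string>[] = [];
  for await (const chunk of stream) {
    if (chunk.kind === "text") {
      text += chunk.text;
    } else if (isAutoApproved(chunk.name)) {
      // Start executing now; don't wait for the rest of the response.
      inFlight.push(runTool(chunk.name));
    }
  }
  // By the time generation ends, early tools may already have finished.
  return { text, results: await Promise.all(inFlight) };
}
```

Tool calls that fail the autonomy check would be queued for human approval instead; that branch is omitted here.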
Rust for the hot paths, TypeScript for the rest
Business logic lives in TypeScript. Event handlers, state machines, tool registration, prompt assembly, autonomy rules. This code changes constantly as the product evolves. TypeScript is the right choice — readable, fast to iterate, good enough performance for logic that runs once per turn.
But some operations run on every token, every event, or every byte of file content. Diff computation between file versions. Token counting for compression decisions. Conflict detection across concurrent edits. Compression analysis over long conversation histories. These are the hot loops, and they run in Rust.
Two compilation targets depending on where the code runs. NAPI-RS compiles Rust to native Node addons for the local server — full CPU performance with zero serialization overhead. WASM compiles the same Rust to run in Cloudflare Workers where native binaries aren't allowed. Same code, different targets, both faster than the equivalent TypeScript by 10-50x for tight numerical loops.
IO-bound work — filesystem reads, process spawning, network calls — runs on tokio's async runtime via the native bindings. Non-blocking, no thread-per-operation overhead. An agent reading 50 files does it concurrently without spawning 50 threads.
TypeScript's single-threaded model is fine for business logic when the hot paths aren't in it. Each agent's event handler runs quickly and yields. CPU-intensive work never blocks the JS thread. IO never blocks a thread either. The per-agent overhead ends up being a few megabytes of JS heap for state — not a thread, not a process, just a lightweight object.
Incremental computation, not batch reprocessing
The naive version of every system operation: recompute from scratch each turn. Re-compress the full conversation. Re-project state from the entire event log. Re-evaluate all pending tasks. Re-sync all events to every connected client.
The incremental version: cache intermediate results and only recompute what changed.
Compression uses the frozen cursor. Everything before it is immutable — the output is cached in SQLite and prepended on each turn without re-analysis. Event replay uses a cursor per client — reconnecting clients receive only events after their last-known position, not the full log. State projection uses snapshots — on startup, load the latest snapshot and replay only subsequent events, not the entire history from event zero.
Same pattern everywhere: segment the computation, cache stable segments, only run the new part. O(delta) per turn instead of O(history).
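As a sketch, with a trivial running total standing in for the projected state:

```typescript
// Sketch of snapshot-plus-replay projection. State is a fold over the event
// log; on startup, load the latest snapshot and fold only events past its
// cursor instead of replaying from event zero.
type LogEvent = { seq: number; delta: number };
type Snapshot = { seq: number; total: number };

function project(snapshot: Snapshot, log: LogEvent[]): Snapshot {
  let { seq, total } = snapshot;
  for (const e of log) {
    if (e.seq <= seq) continue; // already folded into the snapshot
    total += e.delta;
    seq = e.seq;
  }
  return { seq, total };
}
```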
An agent with 500 events in its log does the same per-turn work as one with 50, because the first 450 are frozen. Long-running agents don't get progressively more expensive. Memory stays bounded. CPU stays bounded. The system scales with the rate of new work, not the accumulated total — which is the only way to run hundreds of agents without the machine grinding to a halt as sessions grow.
The scaling equation
Part 1 reduced the number of turns an agent needs. Part 2 eliminated context rot and compaction latency. This post is about the infrastructure cost of each remaining turn.
The model is fast. The question is whether your plumbing can keep up when you multiply it by a hundred.