Plan mode is a crutch
Faster Inference Won't Save You: Part 5
Plan mode: the agent stops, writes a plan, waits for you to approve it, then executes step by step. Every modern coding agent has some version of this. It's treated as a feature — "look, the agent thinks before it acts." It's actually a symptom of a single-agent architecture that can't parallelize work.
If your agent could fork tasks to workers that run concurrently, you wouldn't need to plan everything upfront. You'd identify the first actionable task, hand it off, and keep thinking about the rest while the first one runs. Planning and execution would overlap. Plan mode exists because a single agent has to finish thinking before it can start doing.
Planning is serial. Execution doesn't have to be.
Plan mode forces a serial pipeline: plan all steps, get approval, execute step 1, execute step 2, all the way down the list. Every step waits for the previous one. Including the planning step itself.
The planning step is often the longest part. The agent reads files, reasons about dependencies, figures out ordering, writes a multi-step plan. Only after all of that does it touch a single file. For a complex refactor, planning alone might take 30-60 seconds of model time. Then execution starts — also serial, one step at a time.
Most of this serialization is unnecessary. Plans routinely contain tasks that are independent of each other — refactor module A and module B, update the API layer and the test suite. These could run in parallel. And the first task almost never depends on knowing what task 8 will be. The agent could start task 1 while still figuring out the rest.
Plan mode serializes work that doesn't need to be sequential. It does this because a single agent can only do one thing at a time. The plan is a serialization artifact.
The orchestrator never stops thinking
Replace the single agent with two roles: an orchestrator that plans and delegates, and workers that execute. The orchestrator's job is to decompose the task, identify dependencies, and hand off work. It never writes code itself.
The orchestrator reads the codebase, identifies the first actionable task, and delegates it to a worker agent immediately. Then it keeps thinking. While the worker implements the auth module, the orchestrator is figuring out the database schema migration. By the time the auth module merges back, the next task is already scoped and ready to delegate.
This is pipelining. Planning and execution run concurrently. The orchestrator is always either scoping the next task or reviewing the results of the last one. Workers are always either executing or merging. Nobody waits for a plan document to be finished and approved before anything happens.
In plan mode, the timeline looks like: plan everything (30s) → task 1 (20s) → task 2 (15s) → task 3 (25s) → done (90s total). With pipelining, the orchestrator delegates task 1 after 10 seconds of planning and keeps going. Task 1 finishes while the orchestrator is still planning. Tasks 2 and 3 might run in parallel if they're independent. Wall-clock time drops by most of the planning phase (everything after the initial scoping overlaps with execution) plus whatever parallel overlap you get between independent tasks.
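The arithmetic is easy to check. A minimal sketch, using the example durations above and assuming tasks 2 and 3 are independent and fully scoped by the time task 1 finishes:

```python
# Illustrative timing: serial plan mode vs. pipelined delegation.
# All numbers are the example durations from the text, not measurements.

plan = 30             # seconds spent planning everything upfront
tasks = [20, 15, 25]  # execution time for tasks 1-3

# Plan mode: everything is serial.
serial_total = plan + sum(tasks)  # 30 + 20 + 15 + 25 = 90s

# Pipelined: the orchestrator delegates task 1 after 10s of scoping and keeps
# planning while it runs; tasks 2 and 3 are independent and run in parallel.
first_scope = 10
pipelined_total = first_scope + tasks[0] + max(tasks[1], tasks[2])  # 10 + 20 + 25 = 55s

print(serial_total, pipelined_total)  # 90 55
```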
Fork-join and pipelining
Two primitives replace plan mode:
Fork-join handles independent tasks. The orchestrator identifies work that doesn't depend on each other — refactoring two separate modules, writing tests for different components — and forks both to separate workers. They run in parallel, potentially on different machines (Part 4 gave each agent its own sandbox with dedicated compute). When both finish, results merge back to the orchestrator. Two tasks that would have taken 40 seconds serially finish in 20.
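The fork-join shape can be sketched in a few lines of asyncio. `run_worker` is a hypothetical stand-in for dispatching a task to a worker agent in its own sandbox; here it just sleeps to simulate work:

```python
import asyncio

# Fork-join sketch. `run_worker` stands in for handing a task to a worker
# agent; the sleep simulates execution time.

async def run_worker(task: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{task}: done"

async def fork_join() -> list[str]:
    # Fork: two independent tasks start at the same time, so total wall-clock
    # time is max(a, b), not a + b.
    results = await asyncio.gather(
        run_worker("refactor module A", 0.2),
        run_worker("refactor module B", 0.2),
    )
    # Join: both results are back; the orchestrator can merge them.
    return results

print(asyncio.run(fork_join()))
```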
Pipelining handles sequential dependencies. When task B needs the output of task A, the orchestrator delegates A first. While A runs, the orchestrator plans B — reads the relevant code, identifies the constraints, scopes the work. When A finishes and merges, B is ready to start immediately. No gap between A finishing and B starting, because the planning happened during A's execution.
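Pipelining has a different shape: A is delegated immediately, B's planning overlaps A's execution, and B starts the moment A merges. A minimal sketch, with `execute` and `plan` as illustrative stand-ins rather than a real agent API:

```python
import asyncio

# Pipelining sketch. `execute` stands in for a worker running a task;
# `plan` stands in for the orchestrator scoping the next one.

async def execute(task: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{task} merged"

async def plan(task: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{task} scoped"

async def pipeline() -> str:
    a = asyncio.create_task(execute("task A", 0.2))  # delegate A immediately
    await plan("task B", 0.1)                        # scope B while A runs
    await a                                          # A finishes and merges
    return await execute("task B", 0.1)              # B starts with no gap

print(asyncio.run(pipeline()))
```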
These compose. A complex task might have a pipeline of fork-join stages: tasks 1-3 in parallel, then task 4 depends on all three, then tasks 5-6 in parallel again. The orchestrator builds this structure dynamically as it understands the problem — not as a static upfront document, but as a running process that adapts to what the workers produce.
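The composition can be sketched as a scheduler that repeatedly forks every task whose dependencies are met and joins before the next wave. This wave-based version is a simplification (a real orchestrator would also grow the graph dynamically), using the dependency structure from the example above:

```python
import asyncio

# Composed fork-join + pipelining sketch: tasks 1-3 in parallel, task 4
# depends on all three, tasks 5-6 in parallel after task 4.

deps = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [4], 6: [4]}

async def run(task: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for a worker executing the task
    return task

async def schedule() -> list[list[int]]:
    done: set[int] = set()
    waves: list[list[int]] = []
    while len(done) < len(deps):
        # A task is ready when all of its dependencies are done.
        ready = [t for t in deps if t not in done and set(deps[t]) <= done]
        # Fork every ready task at once; join before the next wave.
        finished = await asyncio.gather(*(run(t) for t in ready))
        done.update(finished)
        waves.append(sorted(finished))
    return waves

print(asyncio.run(schedule()))  # [[1, 2, 3], [4], [5, 6]]
```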
Subagents are parallelism, not context management
Most of the industry frames subagents as a context management tool. "Spawn a subagent so it gets a clean context window." The subagent explores something, comes back with a summary, and the main agent continues with a tidy context. Claude Code's Explore subagent does this. So do most implementations of "research agents."
Context isolation is a real benefit, but it's secondary. The primary value of subagents is that they run concurrently. Two subagents on two machines do twice the work in the same wall-clock time. Ten subagents on ten machines do ten times the work.
When subagents are just context helpers, the main agent still runs serially. It spawns a subagent, waits for the summary, processes it, then continues. The subagent saved context space but didn't save time. The overall execution is still one thread doing one thing at a time.
When subagents are parallelism primitives, the orchestrator delegates real work — implementation, not just research. Multiple workers write code simultaneously on separate machines. The wall-clock time for a ten-task project isn't ten times the single-task time. It's closer to the length of the critical path through the dependency graph — however long the longest chain of sequential dependencies takes, regardless of how many independent branches exist.
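The critical-path bound is easy to compute. A sketch with made-up durations and dependencies: total work is the sum of all task durations, but with unlimited workers, wall-clock time is only the longest dependency chain:

```python
from functools import lru_cache

# Critical-path sketch. Durations and the dependency graph are illustrative.
duration = {"a": 3, "b": 2, "c": 4, "d": 1, "e": 2}
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"]}

@lru_cache(maxsize=None)
def finish_time(task: str) -> int:
    # A task can start once all of its predecessors have finished.
    start = max((finish_time(d) for d in deps[task]), default=0)
    return start + duration[task]

critical_path = max(finish_time(t) for t in duration)  # chain a -> c -> e = 9
total_work = sum(duration.values())                    # 12 task-seconds

print(critical_path, total_work)  # 9 12
```

Five tasks totaling 12 seconds of work finish in 9 seconds of wall-clock time; add more independent branches and total work grows while the critical path stays put.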
DAGs, not documents
Traditional plan mode produces a document. A numbered list of steps. The agent reads it and follows it linearly.
An orchestrator produces a DAG — tasks with dependencies and statuses. The DAG is executable. When a task's upstream dependencies are all done, it's ready to run. The orchestrator doesn't scan a document and figure out what's next. It queries the graph: which nodes have all predecessors completed?
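That query is a one-liner over the graph. A minimal sketch, with hypothetical task names; a task is ready when it's pending and every upstream dependency is done:

```python
# Executable-DAG sketch: tasks, dependencies, and statuses.
deps = {
    "auth module": [],
    "schema migration": [],
    "api layer": ["auth module", "schema migration"],
    "test suite": ["api layer"],
}
status = {"auth module": "done", "schema migration": "running",
          "api layer": "pending", "test suite": "pending"}

def ready_tasks() -> list[str]:
    # Which nodes have all predecessors completed?
    return [t for t, upstream in deps.items()
            if status[t] == "pending"
            and all(status[u] == "done" for u in upstream)]

print(ready_tasks())  # [] -- the api layer still waits on the migration
status["schema migration"] = "done"
print(ready_tasks())  # ['api layer']
```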
The DAG also drives the compression strategy from Part 2. Active tasks protect their context. Completed terminal tasks compress. The plan and the context management system are the same data structure. A plan document sits in the context window and takes up space. A task DAG actively manages what stays and what goes.
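One way to read "completed terminal tasks compress" as a policy over the same graph (a sketch of the idea, not the Part 2 implementation): a task's full context can be compressed once it's done and nothing pending remains downstream of it.

```python
# DAG-as-compression-policy sketch. Task names are illustrative.
deps = {"task 1": [], "task 2": ["task 1"], "task 3": ["task 2"]}
status = {"task 1": "done", "task 2": "done", "task 3": "running"}

# Invert the edges once so each task knows its downstream consumers.
downstream = {t: [d for d, ups in deps.items() if t in ups] for t in deps}

def compressible(task: str) -> bool:
    # Done, and every task that consumes its output is also done: the full
    # context is no longer needed, only a summary.
    return status[task] == "done" and all(status[d] == "done"
                                          for d in downstream[task])

print([t for t in deps if compressible(t)])  # ['task 1']
```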
Documents are for humans to review. DAGs are for agents to execute. Plan mode produces the wrong artifact for the wrong audience.
The parallelism payoff
Parts 1 through 3 made individual turns fast. Part 4 gave agents unlimited compute via cloud sandboxes. Part 5 is about using that compute — fork-join and pipelining fill those machines with parallel work.
Plan mode can't do this. A single agent can't fork. It can write a plan that describes parallel work, but it still executes that plan one step at a time. An orchestrator with worker agents actually runs the work in parallel. The plan isn't a document. It's a running process.