Compaction is a symptom, not a strategy
Faster Inference Won't Save You: Part 2
Every coding agent eventually hits its context window. When it does, it calls the LLM to summarize the conversation so far. This is compaction. The agent stops working. The LLM reads the entire history, produces a summary, and the summary replaces the original. Then the agent resumes, working from a compressed version of what it used to know.
Compaction is slow. It's an LLM call's worth of dead time — the same TTFT, reasoning, and generation costs from Part 1. The agent is frozen while the compactor runs. On a long session this happens multiple times.
But the speed hit isn't the real problem. Compaction fires when the context window is already full. The model spent the last 30 turns reasoning over a bloated history, and its performance degraded on every single one. Attention quality drops as token count rises. The agent picks the wrong file, misreads a type, forgets a constraint from 40 turns ago. This is context rot, and it happens gradually — no hard failure, just a slow decline in quality across every turn leading up to compaction.
By the time compaction triggers, the damage is done. Compaction responds to context rot after the fact. It doesn't prevent it.
Compaction is slow. Context rot is worse.
A 100k-token conversation takes real time to compact. The compactor reads the full history, reasons about what to keep, and generates a summary. It pays the full cost of an inference call, same as any productive turn — except this one produces nothing. The agent sits idle.
The speed cost matters. But context rot matters more.
Transformer attention doesn't scale linearly with context length. As the window fills, the model spreads attention across more tokens. Each token gets less of it. The agent starts making subtly worse decisions on every turn, long before the window is actually full. There's no cliff. It's a slope.
Claude Code compacts when approaching the context limit. Cursor retrieves over past context via vector search. Others use sliding windows. All reactive. They let the window fill, let rot accumulate turn after turn, and then act. The compaction call is just the bill arriving. The 30 turns of degraded performance that preceded it — that's the actual cost.
The todo list is the compression strategy
The question isn't how to build a better compactor. It's how to keep the context window lean on every turn, so rot never accumulates and compaction never needs to happen.
This is a task-tracking problem, not a context-management problem. If you know what the agent is working on, you know what context it needs. Everything else can go.
Formalize it. The agent maintains a task graph G = (T, E):
- T is a set of task nodes, each with a status: pending, active, done, or abandoned
- E is a set of dependency edges — an edge (u, v) means task u must finish before task v can start
This is a DAG. Tasks have dependencies. Completing one task can unblock the next.
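A minimal sketch of this structure in Python (the `Task`/`TaskGraph` names and API are illustrative, not taken from any particular agent framework):

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    ABANDONED = "abandoned"

@dataclass
class TaskGraph:
    # Status of every task node, keyed by task id: T with its labels.
    status: dict[str, Status] = field(default_factory=dict)
    # Dependency edges E: (u, v) means u must finish before v can start.
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_task(self, task_id: str, depends_on: tuple[str, ...] = ()) -> None:
        """Register a new pending task and its incoming dependency edges."""
        self.status[task_id] = Status.PENDING
        for dep in depends_on:
            self.edges.add((dep, task_id))
```

Completing a task is then just a status flip plus an edge check: a task becomes startable once every (u, v) edge pointing at it has u marked done.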
The set that matters is the active frontier — every task currently in progress, plus every task on a dependency path leading to one:

F = { t ∈ T : status(t) = active } ∪ { t ∈ T : t has a path in E to an active task }
Every conversation turn maps to the task it contributed to. The compression rule falls out of the DAG:
- A turn that serves a task on the active frontier → protected. Don't touch it.
- A turn that serves a done task with no path to anything active → compress it now.
- A turn that serves an abandoned task → compress it aggressively.
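The frontier query and the three-way rule above can be written as plain graph code. This is a sketch under the definitions in this section; the function names are mine:

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    ABANDONED = "abandoned"

def active_frontier(status: dict[str, Status],
                    edges: set[tuple[str, str]]) -> set[str]:
    """Active tasks plus every task with a dependency path to one."""
    frontier = {t for t, s in status.items() if s is Status.ACTIVE}
    # Walk dependency edges backwards until no new ancestors appear.
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if v in frontier and u not in frontier:
                frontier.add(u)
                changed = True
    return frontier

def turn_policy(task_id: str, status: dict[str, Status],
                edges: set[tuple[str, str]]) -> str:
    """Decide what to do with a conversation turn, given the task it served."""
    if task_id in active_frontier(status, edges):
        return "protect"
    if status[task_id] is Status.ABANDONED:
        return "compress aggressively"
    return "compress"  # done, with no path to anything active
```

Note there is no model call anywhere in the decision path — the policy is pure set membership and edge traversal.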
No LLM call. No heuristics. No scoring function trying to guess which turns are "important." The task graph already knows. A task is either on the active frontier or it isn't. Its context is either needed or it isn't. The compression decision is a graph query, not a judgment call.
A compactor has no task graph. It sees a bag of conversation turns and has to decide, on the fly, which ones to keep. It compresses by age, or by token count, or by asking the LLM to score importance — all of which are proxies for information the task graph provides directly. Compaction is what you do when you don't know what your agent is working on.
With a task graph, compression is proactive and continuous. As soon as a task completes and nothing downstream depends on it, its context compresses. The window stays lean on every turn. Not just after a compaction event — always. Context rot doesn't accumulate because the context never bloats.
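One way to wire this up, as a sketch: compression fires as a side effect of marking a task done, rather than as a scheduled event. The `mark_done` hook and the string statuses here are hypothetical simplifications:

```python
def active_frontier(status: dict[str, str],
                    edges: set[tuple[str, str]]) -> set[str]:
    """Active tasks plus every task with a dependency path to one."""
    frontier = {t for t, s in status.items() if s == "active"}
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if v in frontier and u not in frontier:
                frontier.add(u)
                changed = True
    return frontier

def mark_done(task_id: str, status: dict[str, str],
              edges: set[tuple[str, str]],
              compress) -> None:
    """Completing a task immediately compresses every turn it strands."""
    status[task_id] = "done"
    frontier = active_frontier(status, edges)
    for t, s in status.items():
        # A finished task with no path to anything active is a dead branch.
        if s == "done" and t not in frontier:
            compress(t)
```

The design choice is that `compress` runs inline with task completion, so dead context never survives past the turn on which it died.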
Knowledge artifacts, not conversation memory
Compression handles the dead branches. But what about the knowledge inside them?
When an agent spends 15 turns investigating a module, it builds understanding: which files matter, how the types connect, what the edge cases are. If those turns compress after the task completes, the understanding goes with them.
Unless the agent wrote it down.
The agent externalizes knowledge into persistent files. Task lists with dependencies and statuses. Notes about specific modules or directories. Architectural observations. Per-file annotations. These files live outside the context window. They survive compression. When the agent needs that knowledge again — because a downstream task touches the same module — it reads the file. The knowledge is there, exactly as the agent wrote it, not as a compactor's summary of it.
This is not retrieval-augmented generation. RAG retrieves chunks of past conversation — fragments of what the agent said or saw, re-embedded and re-ranked. Knowledge artifacts are different. They are structured documents the agent authored to capture its own understanding. Not transcripts. Not embeddings of old turns. Deliberate, organized notes written for future use.
With both in place, the agent compresses old turns knowing it already extracted and externalized anything worth keeping. Compression becomes garbage collection, not information triage.
Compression, not compaction
Put these together and compaction drops out of the system.
Task graph determines what's still needed. Artifacts preserve what's been learned. Compression runs continuously, clearing dead branches as tasks complete. The window stays small. The agent doesn't degrade over long sessions because the context never bloats. It doesn't pause for an LLM summarization call because there's nothing to summarize — dead context is already gone, live context is still in the window.
The gap isn't between a good compactor and a bad one. It's between needing one and not needing one. Compression still runs. Dead branches still get cleaned up. But no LLM reads the conversation history and tries to decide what matters. The task graph already decided.
Part 1 reduced the latency of each inference call. This post eliminates the two costs of context limits: compaction latency and the rot that precedes it.