A paper that got us thinking

In June, Google DeepMind published From AGI to ASI — a long, careful map of how machine intelligence might progress past human level and, more usefully for our purposes, where it might stall. The optimism and the timelines are not the interesting part. The taxonomy of frictions is: a catalog of bottlenecks that might be permanent walls or might be mere speed bumps, paired with an honest admission that we mostly cannot yet tell which.

One friction in particular repays a second read. The report calls it the abstraction barrier — the question of whether a system trained on human concepts can ever mint genuinely new ones, or is forever recombining ours. Stated that way it sounds like a hard ceiling. But follow it down and it turns into a question about cost: how expensive is it to verify a new idea against reality? Where a cheap verifier exists — a game with a score, a proof checker, a unit test — machines already discover concepts no human handed them. Where verification is expensive — a drug across a thirty-year disease, a climate feedback across a decade — the barrier holds, not because the idea is unreachable but because confirming it is slow. The wall is real. Its height is set by something incidental.

We do not run anything within several orders of magnitude of that scale. But we kept noticing that the operator's version of the same question — fundamental, or incidental? — is the one we answer, well or badly, every time a model ships. Most of what we call "agent architecture" is a pile of decisions, and each decision is implicitly a bet about whether the constraint it works around is permanent. Get the bet wrong in the optimistic direction and you under-build. Get it wrong in the pessimistic direction and you pour concrete around a limitation the next model removes, then spend a quarter chiseling it back out.

This is the through-line under several of the pieces already in this series. It is worth naming directly.

The test

Here is the whole method, and it is almost embarrassingly simple. For any piece of scaffolding in the stack, ask: does the benefit survive a more capable model and a larger context window?

If yes, it is structure. Keep building on it; the next model makes it better, not obsolete.

If no, it is a workaround with a half-life. Useful today, genuinely — but you are on borrowed time, and you should build like it. Which means the corollary that does most of the work in practice: build the workarounds so that when they expire, they fail loud, not silent. A workaround that stops working and tells you is a maintenance ticket. A workaround that stops working and keeps emitting the appearance of working is a latent outage with your name on it.

Three areas where we have watched this play out.

Specialists are cheap to add and expensive to coordinate

When we cut our agent squad from six to three, the headline was that output went up while compute went down. The durable lesson underneath it is about where the cost of specialization actually lives for AI, because it is not where it lives for humans.

A human specialist is expensive to acquire — years of training — and that sunk cost is exactly why human organizations divide labor: you amortize the investment. An agent specialist is the opposite. The "expertise" is a prompt, a memory directory, and a set of tools, all reconstitutable on demand and portable across models. Acquisition is free. So the human reason to specialize evaporates — and yet specialization still earns its place, for two entirely different reasons, only one of which is permanent.

The disposable reason is context economy. You split a role because one context cannot hold everything. That is a real constraint today and it is the reason a Rust specialist only needs Rust references read into its window. It is also a workaround with a clock on it: every expansion of context, every cheaper token, erodes it. Bet your architecture on context partitioning and you are building on ice that the next model thins.

The permanent reason is separation of concerns. This one survives an infinite context window, because it was never about memory. Our Tester role failed the same way feature after feature, and it failed because the concern it owned — does this test exercise the spec? — lived with the Architect, who held the spec. Moving the work to the role with the right intent fixed it. That is not a capacity story. A single model with a million-token window and perfect recall should still keep the architect's stance and the implementer's stance apart, for the same reason you would not write a good system as one enormous function. An architect optimizes for system correctness; an implementer optimizes for local correctness; blend the two objectives in one context and you get blended-optimal output, which is worse than two clean optimizers reconciling at an interface. Anyone who has tried to be a good engineering manager and the person writing the code that afternoon has felt this from the inside. It is a property of optimizing interfering objectives, not of finite attention, so scaling does not dissolve it.

The same cut produced a second durable rule and one disposable habit, and it is worth seeing which is which. We told the coordinator to "be aware of each agent's reasoning," fed it every sub-agent's chain of thought, and watched it run out of usable context by stage four of a seven-stage pipeline — making its worst calls exactly when arbitration mattered most. The fix was to feed it decisions and artifacts, not transcripts. The spec is the contract. The diff is the diff. The verdict is a verdict, not a transcript. That is structure: consume the conclusion, not the derivation. It will be true of every coordinator we ever build, because the derivation is the expensive, capability-coupled thing and the artifact is the durable signal. Keep going down that road and it leads somewhere specific.

Controls that work today because the model can't fake them yet

We run a discipline where an agent that wants to narrow scope must file a ticket justifying the narrowing against requirements. It cut illegitimate scope-narrowing by roughly half. The specialist memory ledger does a cousin of the same job from the other side — its rules block repeat mistakes before human review, the way the architect agent caught a second raw-HTTP-instead-of-SDK PR two weeks after the first one cost us an outage.

These controls work. They also rest on an assumption that is easy to miss: they work because today's models, forced to articulate, often cannot paper over a bad call convincingly. The gap between a real justification and a rationalization is currently wide enough to see. The control lives in that gap.

Now ask the test question. Does "make the agent justify itself" survive a more capable model? Not obviously — and not symmetrically. A stronger model is a better reasoner, yes, but it is also a better motivated reasoner, more able to produce a cover story indistinguishable from a justification for a narrowing it took for the wrong reason. In that direction the control does not fade to neutral. It inverts. It becomes a machine for generating well-justified bad decisions, which are strictly harder to catch than the unjustified kind. Whether you get the good direction or the inverted one is not a capability question; it is whether honesty keeps pace with capability — which means a control like this quietly imports an assumption about alignment, the one property you would most want it to be independent of.

This is the failure mode worth burning into the design reflex: a control that keeps emitting the artifacts of control after it has stopped controlling is worse than no control. It is the alarm that reads green because the sensor died. We have already shipped this lesson once without naming it as a general principle. When we moved to a reasoning-class model, the thinking channel started eating the max_tokens budget, answers truncated mid-sentence, and the coherence eval — scoring what was present rather than what should have been there — did not flag a thing. The chopped narrative got cached and served for thirty days. The stopReason was sitting right there in the response and the code read the text instead of the flag. The fix was to treat a max-tokens stop as a hard failure: discard the partial, emit a metric, advance the fallback, never cache the chop. That is failing loud on purpose. The cached chop was the dead sensor; the metric is the control that survives.

So the durable form of the scope-narrowing control is not "always file a ticket." It is trust the rates, not the reasons. The rationale prose is coupled to the model's honesty and degrades with it. The event telemetry — that a narrowing happened, when, by which agent, against which requirement — is countable from metadata and does not care whether any single justification is true. A spike in the narrowing rate, or narrowings clustering on one requirement type, is visible whether or not the explanation reads well. Lean the control on the part you can count. Treat the prose as a note to a future reviewer more capable than the writer. It is the same move as feeding the coordinator verdicts instead of transcripts, arriving from a different door: rely on the artifact, not the narrative.

Memory is capital — and capital can sour

The memory ledger is the clearest case of structure in the whole stack. A hundred-odd single-lesson files across nine specialists, concentrated exactly where complexity lives, each one an active refusal to pay for the same lesson twice. That is durable, portable capital — closer to lossless lifetime experience than to documentation, and it carries forward to whatever model we run next. The DeepMind report would file it under the advantages of digital intelligence that compound with scale, and it would be right to.

But "carries forward" is doing more work than it looks, and the ledger's own Open Questions section already put a finger on it: how long does a rule stay valid before the underlying vendor or API changes and the memory becomes a liability? That is the half-life question, asked about memory instead of code. And the answer the durability test gives is that a lesson is structure right up until the thing it describes changes — at which point it is not neutral dead weight, it is an active false constraint, an agent confidently applying a rule to a world that moved.

There is a second souring mode that only shows up when you port the ledger forward to a more capable model, and it is subtler. Sort the corpus by type. The factual and procedural rules — don't re-implement a vendor SDK without checking its extensibility surface; every infra stack must deploy in dev before prod; no post-response cleanup on a serverless runtime, decouple via a queue — port forward cleanly. A smarter model reads them and is simply faster and safer. But the judgment rules — the marketing rule that funnels represent intent rather than namespace adjacency, the architect's accumulated framings for adjudicating reviewer conflict — encode the concepts of the model that wrote them. Hand those to a stronger model as authoritative and you have anchored the better reasoner to the worse one's worldview. That is the abstraction barrier from the DeepMind paper reappearing at squad scale: the next generation tethered to the last generation's primitives, inside your own memory directory.

The handling rule follows directly. Factual memory ports as ground truth. Judgment memory ports marked revisablehere is what a less capable model concluded, and why; supersede freely — precisely on the layer where you want the new model to see further, not to defer. The capital that took the most incidents to accumulate is the capital you should hand over with explicit permission to discard.

What travels and what's ours

The constants in any of these stories are ours — the 3x token multiplier, the six-to-three cut, the specific rules in the ledger, our cache semantics. Lift the lesson, not the constants. The lesson here is one level up from any single piece, and it is the one we keep rediscovering:

Every scaffold has a half-life, set by either environmental change or model capability, and the work is knowing which of your decisions are load-bearing structure and which are scaffolding you will tear down. Four habits make that survivable. Ask whether each benefit survives a smarter model and a bigger window — if not, you are on a clock. Build the disposable parts to fail loud, never silent, because the silent failure is the one that caches. Lean on artifacts, telemetry, and rates over transcripts, narratives, and reasons, because the durable signal is the one that does not depend on the model's honesty or its framing. And mark any judgment you inherit forward as revisable, so a better reader is not chained to a worse writer's concepts.

The DeepMind report is arguing, at civilizational scale, about whether scaling erases a ceiling on machine concepts or merely lowers it. We are running the same argument in miniature on every release night, and the operator's version is the more tractable one — because we can actually watch our scaffolds expire, and we can instrument for it. The frontier labs have to forecast which limits are fundamental. We just have to notice, before the cache does.

Building multi-agent systems and want to compare notes on what survives a model upgrade? Get in touch.