What LLMs Actually Are Changes Everything You Build on Them
The reservoir vs. reactor debate has a technically precise answer — and it determines every serious AI architecture decision.
There is a question that should sit at the foundation of every serious AI architecture decision, and almost nobody asks it precisely enough: when you build a complex system on top of a large language model — agents, workflows, persistent memory, tool use, multi-agent coordination — are you genuinely creating capabilities that exceed what the model contains? Or are you just getting better at extraction?
This is not a philosophical question. It has a technically precise answer, and that answer determines everything: what to invest in, what is possible on a one-year versus five-year horizon, and where the durable competitive advantage of any AI-native system actually lives.
The intuitive poles are the "reservoir" model — the LLM is a fixed deposit of compressed knowledge, and prompting is mining — and the "reactor" model — the LLM is a computational engine, and what matters is not what's stored but what it can compute. The evidence supports neither extreme. What it supports is something more specific, and more useful.
The Model Is a Compressor — and That's Profound
Training a language model is, mathematically, building a compressor. This is not a metaphor. Shannon's source coding theorem establishes that minimizing cross-entropy loss — the standard LLM training objective — is formally identical to minimizing expected code length under arithmetic coding.¹ A 2024 paper at ICLR made the consequence explicit: predictive models and lossless compressors are interconvertible.³
The empirical numbers are staggering. GPT-3 compressed roughly 45 terabytes of text into 17 gigabytes of parameters — a 2,600:1 ratio.³ Frontier models today likely achieve something closer to 22,000:1. The Chinchilla scaling law established that the optimal training ratio is approximately 20 tokens per parameter² — a discoverable mathematical relationship, not an engineering convention.
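The compression link is directly computable: under Shannon's bound, an arithmetic coder driven by the model's predictions spends exactly the cross-entropy (in bits) per token. A minimal sketch, with illustrative numbers throughout (the token count, loss, and bytes-per-token below are assumptions, not figures from the text):

```python
import math

def compressed_bytes(n_tokens: float, xent_nats: float) -> float:
    """Shannon's source coding bound: an arithmetic coder driven by the
    model's next-token distribution spends xent/ln(2) bits per token."""
    bits_per_token = xent_nats / math.log(2)
    return n_tokens * bits_per_token / 8

# Illustrative only: 400B tokens at 2.0 nats/token, ~4 raw bytes/token.
raw = 400e9 * 4
packed = compressed_bytes(400e9, 2.0)
ratio = raw / packed  # lower loss => shorter code => higher ratio
```

The point of the sketch is the direction of the dependency: any reduction in cross-entropy loss translates mechanically into a shorter code, which is why compression quality tracks benchmark performance so tightly.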
Then comes the part that should reframe how anyone thinks about what these models are actually learning. A language model trained on text compresses ImageNet image patches to 43.4% of raw size, beating PNG at 58.5%. It compresses audio to 16.4%, beating FLAC at 30.3%.³ No images appeared in the training data in that format. No audio either. The model learned the statistical structure of the world, not just linguistic patterns. And compression quality predicts performance across 30 models and 12 benchmarks with a Pearson correlation of approximately -0.95.³ Intelligence, to the extent it appears in these systems, is efficient world-modeling.
This is the foundation for the reservoir intuition: if the model is a compressor, its outputs are bounded by what was compressed into it. You cannot get more ore from a mine than was deposited there.
Except the analogy breaks down almost immediately — and the exact point of breakdown is instructive.
The compression framing implies that a model's knowledge is an inert archive: static, finite, fully characterized by its training data. But the cross-modal compression results reveal something subtler. The model has not stored a catalog of facts; it has internalized generative structure — the latent regularities that make text, images, and audio what they are. That structural knowledge is not retrieved; it is applied. The mine metaphor fails not because models conjure ore from nothing, but because the "ore" is a set of operations, not a set of objects. When an LLM answers a question it has never literally seen, it is not retrieving — it is extrapolating from compressed structure. Whether that extrapolation is reliable is the whole question.
The Information Ceiling Is Real — But Not Where You Think
The Data Processing Inequality — a cornerstone of information theory — states that post-processing cannot increase Shannon information. An LLM, as a deterministic function of weights and prompt, cannot output more information than those inputs jointly contain. This seems to settle the reservoir-versus-reactor debate definitively, in favor of the reservoir.
It doesn't.
The "input" to an LLM system is not just the prompt. It is: the weights (compressed from terabytes of human knowledge); the prompt (which can be arbitrarily information-rich); sampling randomness (temperature and top-p introduce genuine stochasticity); tool outputs (code execution results, database queries, search results); and feedback from previous outputs in agentic loops. Each is a distinct information source.
The Data Processing Inequality applies to each individual processing step. But a system with multiple information sources, external tools, and feedback loops is not a single processing step — it is an information-processing network with potentially rich dynamics. The question "can you get out more than you put in?" applies very differently at the component level versus the system level.
The right analogy here is not a mine but a catalyst. A catalyst does not add energy to a chemical reaction; thermodynamics forbids it. Yet catalysts are enormously valuable precisely because they lower activation barriers and reshape which reactions are kinetically accessible. An LLM system does not violate the Data Processing Inequality — it does not create information from nothing. What it does is make previously inaccessible combinations of existing information reachable at a cost that was previously prohibitive. A CPU does not "contain" Linux. An engine does not "contain" a car. The components do not determine the system's behavior; the architecture does. This is why the choice of scaffolding, orchestration, and tool access is not cosmetic — it determines which regions of the information space the system can actually reach.
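The network-versus-single-step distinction is easiest to see in code. Below is a minimal agent loop with a stub model and a stub tool (all names and the `CALL` protocol are hypothetical, for illustration only): each pass through the loop injects information from a different source, and no single step violates the Data Processing Inequality.

```python
def run_agent(model, tools, goal, max_steps=10):
    context = [goal]                          # source 1: the prompt
    for _ in range(max_steps):
        action = model(context)               # source 2: the weights
        if action.startswith("CALL "):
            name, arg = action[5:].split(" ", 1)
            context.append(tools[name](arg))  # source 3: tool output,
        else:                                 # fed back in as source 4
            return action
    return None

# Stubs for illustration: the 'model' requests one lookup, then answers.
def stub_model(ctx):
    return "CALL lookup capital:France" if len(ctx) == 1 else f"Answer: {ctx[-1]}"

tools = {"lookup": lambda q: {"capital:France": "Paris"}[q]}
answer = run_agent(stub_model, tools, "What is the capital of France?")
```

The final answer draws on information that was never in the prompt and never in the stub model: it entered through the tool call and the feedback loop. That is the catalyst pattern in miniature.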
Where the ceiling actually binds in practice depends on what you're asking the system to do. For factual retrieval, it's the weights — a better model means better retrieval, and scaffolding adds little. For pattern application within the training distribution, it's again the weights, and this works brilliantly until it doesn't. For multi-step reasoning, the binding constraint is inference-time compute combined with verification. For system-level goals pursued across time, the dominant variables are external state, tools, and orchestration quality.
The reservoir intuition is correct for some tasks. The reactor intuition is correct for others. The critical mistake is applying one model to all cases.
Transformers Have a Computational Class — and Chain-of-Thought Changes It
Here is a result from theoretical computer science that deserves far more attention than it receives.
Fixed-depth transformers without chain-of-thought reasoning belong to the circuit complexity class TC⁰ — roughly, problems solvable by constant-depth circuits with threshold gates.⁴ This is a hard mathematical result. Without extended reasoning steps, transformers provably cannot solve parity, graph connectivity, permutation composition, or circuit evaluation. These are not exotic edge cases. They are fundamental computational primitives that appear constantly in real software engineering and reasoning tasks.
The escape hatch is chain-of-thought. With T steps of reasoning, transformers can solve any problem solvable by boolean circuits of size T.⁵ With polynomially many reasoning steps, they achieve the power of class P. With unbounded autoregressive decoding, they are Turing complete.
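Parity, one of the problems provably outside TC⁰ for fixed-depth transformers, makes the escape hatch concrete. The sketch below is not a transformer, only the computational shape of the argument: with one intermediate "token" per input bit, the problem decomposes into T trivial sequential steps.

```python
def parity_with_scratchpad(bits):
    """Each emitted step folds one more bit into a running state --
    the chain-of-thought analogue of T extra serial computations."""
    trace, state = [], 0
    for b in bits:
        state ^= b          # one trivial step per reasoning token
        trace.append(state)
    return state, trace

result, steps = parity_with_scratchpad([1, 0, 1, 1])
```

Asked for the answer in a single shot, a constant-depth circuit cannot do this for arbitrary input lengths; allowed to emit the `trace`, the same problem becomes a chain of one-bit updates.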
This reframes what inference-time compute scaling actually is. When systems use extended reasoning — chains of thought, scratchpads, iterative refinement — they are not just "thinking harder." They are literally moving into a higher computational complexity class. The capability expansion is formal, not intuitive. Problems that are provably impossible for a bare transformer become tractable with sufficient reasoning steps.
The architectural choice of whether to allow extended reasoning is therefore not a stylistic one — it is a decision about what class of problems the system can solve at all. A product that strips chain-of-thought for latency reasons has not just gotten slower; it has become categorically less capable. Builders who treat reasoning steps as an optional feature are, in effect, choosing a smaller computer.
The caveat is real: o3 at 172x standard compute yields only roughly 12% higher performance on many benchmarks.⁵ Diminishing returns apply. More thinking helps, but not without bound. There appears to be a ceiling on novel, out-of-distribution reasoning that inference-time compute alone cannot push through. ARC-AGI-2 — designed specifically to require genuine novel abstraction — sits at roughly 37.6% accuracy for frontier models even with substantial compute.⁵ Pure LLMs score zero.
The reasoning ceiling is higher than skeptics expected. It is lower than optimists claim. And its location is not uniform — it varies by task structure, with verifiable domains (code, formal math) showing far higher ceilings than open-ended ones (novel analogy, scientific hypothesis generation).
Hallucination Is Permanent — Here Is the Proof
Two independent mathematical arguments establish that hallucination — the generation of plausible but false outputs — is not a bug. It is a structural property of any scalable language model, permanently.
The computability argument uses diagonalization, the same technique Turing used to prove the undecidability of the Halting Problem.⁶ No computably enumerable set of models can answer all computable queries without failure. Undecidable queries guarantee that for any model, infinitely many inputs exist on which it must fail.
The capacity argument is more intuitive. A finite-parameter model commits irreducible error on incompressible or long-tail facts. A 2025 paper identifies an "impossibility triad" — computational undecidability, statistical sample insufficiency, and finite information capacity — that jointly guarantee irreducible hallucination rates for open-ended queries.⁷
What this means in practice is that hallucination is not uniformly distributed across query types. It concentrates at the boundaries of the training distribution, in long-tail facts with sparse training signal, in queries that require integrating information across incompatible sources, and in any query that is formally undecidable. A model that performs flawlessly on well-represented knowledge domains will still hallucinate on thin-tail facts — not because it is poorly trained, but because finite capacity cannot represent infinite particulars.
The useful analogy is error-correction codes, not zero-defect manufacturing. Error-correction codes do not eliminate bit errors. They suppress them to statistically negligible rates for specific distributions while guaranteeing the error floor is never exactly zero. Hallucination can be made negligible for constrained distributions. It cannot be made zero for open-ended queries. Ever.
The correct design response is not trying harder to prevent hallucination — it is building for detection, containment, and recovery. Verification over generation. Redundancy over elimination. This is not pessimism; it is physics. More specifically: the value of a verification step grows as the task approaches the distribution boundary. A system that verifies outputs on well-trodden facts is wasting compute. A system that verifies outputs on novel synthesis, long-tail retrieval, or multi-source integration is doing the only thing that actually helps.
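The detection-containment-recovery stance reduces to a small control loop. A sketch under assumed interfaces (`generate` and `verify` here are hypothetical callables, not a real API): no candidate is emitted unverified, and an exhausted retry budget routes to a fallback rather than shipping a possible hallucination.

```python
def contained_generate(generate, verify, max_attempts=3, fallback=None):
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):     # detection
            return candidate      # only verified output escapes
    return fallback               # containment: escalate, don't emit

# Stubs: a flaky generator that is wrong twice, then right.
answers = iter(["Paris is in Spain", "Paris is in Italy", "Paris is in France"])
out = contained_generate(lambda: next(answers), lambda c: c.endswith("France"))
```

The structural point: the error floor of `generate` is never driven to zero; the system's reliability comes from where `verify` sits, not from how good the generator is.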
Creativity Is Recombination — For LLMs and Humans Alike
Margaret Boden's three-tier framework distinguishes combinational creativity (familiar ideas in unfamiliar combinations), exploratory creativity (finding new possibilities within existing conceptual spaces), and transformational creativity (restructuring the conceptual space itself).⁸
Current evidence: LLMs achieve combinational and exploratory creativity. They do not achieve transformational creativity. A 2025 study establishes an information-theoretic novelty-utility tradeoff for combinational creativity that does not improve with model size — suggesting it is a characteristic of the architecture, not a scaling problem.⁸ [CITATION NEEDED — see Notes]
The fixation problem is underappreciated. One hundred LLM instances cover less of the idea space than one hundred different humans each generating one idea, because every LLM instance shares identical weights. Each human mind occupies a distinct region of the knowledge space; LLMs share one region and sample from it. This matters most for brainstorming and exploration: high temperatures and varied prompts increase sampling diversity within a single conceptual region but do not change which region is being sampled. For systems designed to harness collective intelligence, prompt diversity is not enough. You need architectural diversity — genuinely different models, models fine-tuned on different corpora, or hybrid teams that include human contributors who occupy distinct regions of the idea space by construction.
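The fixation claim can be simulated directly. In the toy model below (entirely illustrative), the "idea space" is the integers 0–999; one hundred clones of a single model all sample from the same 50-wide region, while one hundred distinct "minds" each own a disjoint 10-wide region by construction.

```python
import random

def coverage(samplers, seed=0):
    """Count unique ideas produced by one draw from each sampler."""
    rng = random.Random(seed)
    return len({sample(rng) for sample in samplers})

one_model  = [lambda r: r.randrange(50)] * 100           # shared weights
many_minds = [lambda r, lo=i * 10: lo + r.randrange(10)  # disjoint regions
              for i in range(100)]

clones_cov = coverage(one_model)    # at most 50, with collisions
minds_cov  = coverage(many_minds)   # exactly 100: regions are disjoint
```

Raising the "temperature" of the clones widens their spread within the shared region but cannot reach the other regions; only changing which samplers are in the pool does that.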
The philosophical parallel cuts in both directions: the Data Processing Inequality applies to human brains too. No human has ever produced output that was not, at some level, a recombination of sensory inputs. The "just recombination" critique of LLMs, applied consistently, would undermine most of what we call human creativity. Whether recombination at sufficient complexity constitutes genuine novelty is not answerable by information theory — it requires a theory of meaning, not just bits.
What information theory does tell us: the distinction between elaborate recombination and genuine creativity may be less clean than either camp wants it to be. The transformational creativity gap is real and architecturally significant. But the implied corollary — that combinational and exploratory creativity are therefore trivial — is not supported by the evidence. The practical output of LLM-powered systems in drug discovery, materials science, and software engineering suggests that combinational creativity at scale, reliably deployed, is neither trivial nor already captured by prior tools. It is a new thing, even if it is not the most extreme thing imaginable.
What Production Systems Actually Show
The empirical record from production deployments is the most honest test of the theoretical framework.
In software engineering, top agents now solve more than 80% of real-world production bugs on SWE-bench Verified.¹⁰ The contamination-resistant SWE-bench Pro drops this to roughly 46% — a more honest measure of genuine capability.¹⁰ These are not trivial tasks: they require reasoning across thousands of files and holding coherent context through complex codebases. The performance cannot be reduced to pattern-matching on training data.
The most striking finding from coding agent research: human-AI collaboration achieves a 31.11% pass rate on hard problems, against 0.67% for LLM alone and 18.89% for human alone.¹⁰ [CITATION NEEDED — see Notes] Some problems are solvable only through the combination. This is not a model that has been prompted more cleverly. It is a capability that neither component possesses independently — a result the Data Processing Inequality does not forbid, because the human and the model are distinct information sources whose joint information exceeds either alone. The human brings out-of-distribution intuitions the model cannot replicate; the model brings exhaustive combinatorial search the human cannot sustain. The combination is not just more efficient. It is categorically different.
Scientific discovery shows the same pattern. AlphaProof achieved International Math Olympiad silver-medal performance with formally verified proofs.¹⁰ [CITATION NEEDED — see Notes] AlphaEvolve improved on Strassen's 56-year-old matrix multiplication algorithm — in 20% of tested problems, it found solutions surpassing all known human results.¹⁰ [CITATION NEEDED — see Notes] An autonomous laboratory ran 355 experiments in 17 days, synthesizing novel materials with a 71% success rate.¹⁰ [CITATION NEEDED — see Notes]
Every one of these systems is a compound system: search plus LLM plus verification plus external tools. None are bare models. The production evidence does not support the reservoir model for system-level tasks, and it does not support the naive reactor model that ignores binding constraints. It supports the conditional reactor.
The failure data matters equally. Multi-agent frameworks across popular implementations show 41–86.7% failure rates in production conditions.¹⁰ [CITATION NEEDED — see Notes] Error accumulation follows a staircase pattern — long coherent plateaus punctuated by sharp drops at specific decision junctions, which account for roughly 5–10% of tokens.¹⁰ [CITATION NEEDED — see Notes] Simpler architectures outperform complex ones under production stress.¹⁰ [CITATION NEEDED — see Notes]
The staircase pattern is diagnostically important and underexplored. It reveals that the failure mode of complex agentic systems is not gradual quality degradation across the entire task — it is catastrophic collapse at specific decision points where the agent must commit to a path, make a high-consequence inference, or integrate conflicting context. This is the opposite of how most builders diagnose production failures: they look at aggregate reliability metrics, not junction-level failure rates. The correct architectural response is to identify those junctions — often recognizable by large context windows being compressed, conflicting retrieved information, or ambiguous instructions — checkpoint them explicitly, and verify them with particular care. Not shorten the chain. Reinforce the weak links.
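A minimal sketch of that response (the step format, `execute`, and `verify` are assumptions for illustration): verification effort concentrates at marked junctions, with rollback to the last good checkpoint instead of blind continuation past a bad commit.

```python
def run_plan(steps, execute, verify):
    """Checkpoint-and-verify only at decision junctions, where the
    staircase pattern says failures concentrate."""
    state, last_good = {}, {}
    for step in steps:
        state = execute(step, state)
        if step.get("junction"):
            if verify(state):
                last_good = dict(state)   # commit the checkpoint
            else:
                state = dict(last_good)   # roll back the bad commit
    return state

# Stubs: step 2 corrupts state at a junction and is rolled back.
steps = [{"set": ("a", 1), "junction": True},
         {"set": ("a", "corrupt"), "junction": True},
         {"set": ("b", 2)}]
execute = lambda s, st: {**st, s["set"][0]: s["set"][1]}
verify = lambda st: all(isinstance(v, int) for v in st.values())
final = run_plan(steps, execute, verify)
```

Note what the sketch does not do: it does not shorten the chain or verify every step. Verification is spent where the junction-level failure data says it pays.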
The Conditional Reactor: What the Evidence Actually Supports
The evidence points to a specific model: an LLM is a conditional reactor — a computational engine whose fuel is compressed training data, whose power output depends on scaffolding and tools, and whose ceiling is set by the interaction of all three.
The task type determines which model governs:
- Factual retrieval: reservoir. The binding constraint is what's in the weights. Better models matter; scaffolding adds little.
- Pattern application: reservoir. Works brilliantly within the training distribution; fails at the boundaries.
- Multi-step reasoning: bounded reactor. Chain-of-thought expands capability; diminishing returns apply at the complexity ceiling.
- Novel abstraction: neither. Current architectures cannot restructure their own conceptual spaces. The ARC-AGI evidence is clear.
- System-level goals pursued across time: scaffolded reactor. External state, tools, feedback loops, and multi-agent coordination are the dominant variables.
The architectural implication: the model is not the system. Whether adding agents, persistent memory, and tool use creates genuine capability beyond the model is not answered by the Data Processing Inequality alone — it is answered by analyzing the system as an information network with multiple sources, not as a single processing step.
Consider what statelessness versus statefulness actually changes. A bare LLM is stateless — no memory between calls, no persistent identity, no ability to learn from its own actions. Adding persistent state — databases, version-controlled code, durable workflow execution, institutional memory — is a categorical change in computational class, analogous to the difference between a calculator and a computer. The calculator answers questions. The computer pursues goals across time, accumulates knowledge, and corrects its own errors. The difference is not one of degree; it is one of kind.
This is the overlooked asymmetry in debates about whether scaffolding "really" adds capability. A single tool call does not transcend the model. But a system that accumulates institutional memory, learns from its own failure history, and adapts behavior based on durable feedback loops is not the same computational entity as a bare model — in the same sense that a person who remembers their past mistakes is not the same cognitive entity as a person with complete amnesia. Temporal continuity is not a performance optimization. It is a different class of system.
A 2023 study demonstrated a 15.3x speedup in agent capability through persistent skill accumulation — capabilities that compound over time without any model upgrades.⁹ This is not extraction efficiency. It is temporal continuity transforming one-shot intelligence into sustained capability.
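The pattern behind that result is small enough to sketch. A persistent skill library (file name and schema are illustrative, not the paper's implementation): skills written in one episode survive process restarts and are recallable in the next, which is exactly what a stateless model call cannot do.

```python
import json, pathlib, tempfile, os

class SkillLibrary:
    """Institutional memory for an agent: a durable name -> recipe map."""
    def __init__(self, path):
        self.path = pathlib.Path(path)
        self.skills = (json.loads(self.path.read_text())
                       if self.path.exists() else {})

    def add(self, name, recipe):
        self.skills[name] = recipe
        self.path.write_text(json.dumps(self.skills))  # survives restarts

    def recall(self, name):
        return self.skills.get(name)

# One episode writes a skill; a fresh instance (a 'new session') recalls it.
path = os.path.join(tempfile.mkdtemp(), "skills.json")
SkillLibrary(path).add("craft_pickaxe", "plank + stick")
recalled = SkillLibrary(path).recall("craft_pickaxe")
```

Each new session starts where the last one ended rather than from zero, which is why the capability compounds without any model upgrade.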
The Constraint Map: What Is Immutable, What Is Solvable
Every LLM limitation is either fundamental — established by mathematical proof or physical law — or an engineering constraint that will relax with time. Misclassifying in either direction is expensive. Treating a fundamental limit as solvable wastes resources indefinitely. Treating a solvable constraint as fundamental leaves value untouched.
The immutable constraints:
Hallucination is irreducible. Computability theory proves it.⁷ Design for containment, not prevention.
Information cannot be created from nothing. The Data Processing Inequality is real.¹ System design must maximize the information available at each step, not hope to conjure it.
Some problems are inherently hard. P ≠ NP is widely considered true. LLMs produce heuristic approximations on NP-hard problems, not optimal solutions. Crucially, verification is exponentially cheaper than generation on these problems — a fact that compound AI architectures can exploit. For NP-hard problems, the right architecture is not a more powerful generator but a generator-verifier pair, where the generator proposes and the verifier selects. The value of the system concentrates in the verifier, not the generator.
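Subset-sum makes the asymmetry tangible. The sketch below (illustrative throughout) pairs a cheap stochastic generator with a linear-time verifier; the verifier, not the generator, is what makes the system's output trustworthy.

```python
import random

def verify(nums, idxs, target):
    """Linear-time check -- the cheap, trustworthy half of the pair."""
    return sum(nums[i] for i in idxs) == target

def propose_and_select(nums, target, budget=10_000, seed=0):
    """Stochastic generator proposes subsets; the verifier selects."""
    rng = random.Random(seed)
    for _ in range(budget):
        cand = [i for i in range(len(nums)) if rng.random() < 0.5]
        if verify(nums, cand, target):
            return cand
    return None

solution = propose_and_select([3, 5, 8, 13, 21], 16)
```

Swapping in a smarter generator changes how fast a candidate is found; only the verifier determines whether what is returned is actually correct.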
Self-prediction is impossible. The Halting Problem applies to LLMs in agentic loops with self-referential goals.⁶ Undecidability enters.
Perfect alignment is impossible. Rice's Theorem: non-trivial semantic properties of programs are undecidable. Any behavior to which the model assigns nonzero probability can be elicited with probability approaching one under sufficiently long adversarial prompting. Alignment is a gradient, not a binary state. [CITATION NEEDED — see Notes]
The novelty-utility tradeoff. More utility constraints reduce achievable novelty. This does not improve with model size. A fundamental ceiling on being simultaneously novel and useful. [CITATION NEEDED — see Notes]
The thermodynamic floor. Landauer's principle sets a physical minimum of approximately 3 × 10⁻²¹ joules per bit erasure at room temperature.¹¹ Current hardware operates 10,000x–1,000,000x above this floor. Enormous efficiency gains remain possible; the floor exists regardless.
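The floor itself is one line of arithmetic: Landauer's bound is kT ln 2 per erased bit.

```python
import math

K_B = 1.380649e-23                    # Boltzmann constant, J/K

def landauer_bound_joules(temp_k=300.0):
    """Minimum energy to erase one bit at temperature T: kT ln 2."""
    return K_B * temp_k * math.log(2)

floor = landauer_bound_joules()       # ~2.9e-21 J at room temperature
```

Dividing a chip's measured joules-per-bit by this number gives its distance from the floor, which is how the 10,000x–1,000,000x figure is obtained.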
The engineering constraints — context window, inference cost, reasoning depth, training data availability, agentic reliability — are all real today and all improving. Context windows have expanded from thousands to millions of tokens. Inference cost for GPT-3.5-equivalent queries fell 280-fold in 18 months. [CITATION NEEDED — see Notes] The binding constraint today is agentic reliability and reasoning depth on novel problems. By 2028, the binding constraints will likely be training data exhaustion and verification at scale.
Five Scaling Axes, Not One
The single-axis "bigger model equals better" framing is finished. There are now five independent axes along which LLM capability improves, each with different trajectories and different implications.
Pre-training scale is hitting diminishing returns. Training data exhaustion is projected for 2027–2028, and one of the architects of modern deep learning has publicly stated that pre-training as we know it will end. [CITATION NEEDED — see Notes]
Data quality and curation is currently the primary driver of efficiency gains. The "Densing Law" — capability density doubling approximately every 3.3 months — captures this.¹² In concrete terms: achieving more than 60% on MMLU required a 540-billion-parameter model in 2022. By 2024, a 3.8-billion-parameter model reached the same threshold — a 142-fold reduction.¹² The intelligence that costs a dollar today will cost less than a cent in a few years.
This trajectory has an underappreciated consequence for competitive strategy. If capability density doubles every 3.3 months, any product whose core value proposition is "we use a more capable model" has a half-life measured in quarters. The capability advantage that felt substantial six months ago is routine today and commoditized tomorrow. GPT-4-class capability, which represented a meaningful moat in 2023, is now available in open-weight models runnable on consumer hardware. The rate of commoditization is faster than most builders have internalized, and the correct response is not to race along the capability axis but to invest in the dimensions that commoditize more slowly.
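As a sanity check on that half-life, the Densing Law compounds like any doubling process:

```python
def density_multiplier(months, doubling_months=3.3):
    """Capability per parameter after `months` under the Densing Law."""
    return 2 ** (months / doubling_months)

two_years = density_multiplier(24)   # ~155x over 2022-2024, consistent
                                     # with the observed 142-fold MMLU
                                     # parameter reduction
```

The same arithmetic run forward is what makes "we use a more capable model" a depreciating asset: two quarters is roughly two doublings.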
Post-training via reinforcement learning has emerged as a dominant new axis. DeepSeek-R1 demonstrated that reasoning capability can emerge from pure reinforcement learning without human-generated reasoning traces — an unexpected result suggesting the frontier is moving on this axis faster than most anticipated. [CITATION NEEDED — see Notes] The broader implication: capability is not fully determined at training time. Systems that can be trained online — on the outputs of their own tasks, in their own operational environment — have a compounding advantage that model upgrades alone cannot replicate.
Inference-time compute scaling provides clear gains in verifiable domains, with the diminishing returns caveat already established.⁵
Architectural efficiency continues steadily. Mixture-of-experts architectures reduce active compute by roughly 70%. [CITATION NEEDED — see Notes] Attention optimizations compress memory requirements dramatically. These do not change the ceiling; they change the cost to reach it.
The strategic implication: capability density is compounding, which means today's expensive frontier capability becomes tomorrow's commodity. Whatever is hard to automate today because the model is too expensive becomes economically trivial within 12–18 months. The relevant question is not "can the model do this?" but "what will it cost when it can do this reliably, and what will the competitive landscape look like at that point?"
Conditional Projections: The If/Then Framework
What follows is not prediction. It is conditional logic — if specific forces continue at their current trajectories, these consequences follow. Each projection carries the conditions that would invalidate it.
Near term (12–24 months): If the Densing Law continues,¹² today's frontier capability approaches commodity pricing for standard tasks. If inference-time compute scaling matures as expected,⁵ reliable multi-step reasoning for verifiable domains — code, math, structured data — becomes standard. If agentic reliability improves at current rates, 20–30 step autonomous workflows become reliable for narrow, well-defined tasks. Confidence is high on inference cost collapse; moderate on reasoning and reliability improvements. The invalidating conditions: hardware bottlenecks, regulatory intervention, or a plateau in data quality improvements.
Medium term (2–4 years): If the training data wall is cleared via synthetic data and multimodal expansion, frontier models achieve significantly deeper reasoning. If architectural innovation produces genuine System 2 reasoning — causal modeling, novel abstraction, genuine planning — the ARC-AGI ceiling breaks and the capability envelope expands dramatically. If verification capabilities improve proportionally to generation, autonomous software development reaches a qualitatively different level. Confidence is moderate throughout — each depends on specific breakthroughs that are plausible but not guaranteed. The invalidating conditions: evidence that synthetic data loops produce capability saturation rather than growth, or that the ARC-AGI ceiling proves architectural rather than a scaling problem.
Longer term: Domain-general competence in most knowledge work becomes possible if all engineering constraints resolve and scaling continues. Whether the reasoning ceiling is architectural — requiring a fundamentally different approach — or merely difficult remains genuinely open. François Chollet's thesis that fluid intelligence requires something architecturally distinct from current transformers may prove correct.⁴ It has not been refuted. The honest answer is that the medium-term projections depend on a small number of specific bets, any one of which could fail — and if the architectural ceiling hypothesis is correct, the trajectory looks fundamentally different from current extrapolations.
What remains impossible regardless of timeline: Zero hallucination.⁷ Perfect alignment. Optimal solutions to NP-hard problems (unless P = NP). Reliable self-prediction in agentic loops.⁶ Free computation. These are established by proof or physical law. No architectural innovation changes them.
Strategic Implications for Builders
If the conditional reactor model is correct — and the evidence strongly suggests it is — several things follow for anyone building seriously on LLMs.
The scaffolding is the product. The compound system — persistent state, durable workflow execution, multi-agent coordination, institutional memory, feedback loops — provides a categorical capability upgrade over bare models. This is where durable innovation lives, because it is what commoditizing models cannot absorb. When a new model generation ships, you swap in the component. The system persists.
Models are commoditizing; systems are not. The Densing Law¹² ensures that today's frontier model becomes tomorrow's cheap component. The orchestration layer, accumulated institutional memory, governance framework, and economic incentive structure do not commoditize at the same rate. This asymmetry in commoditization speed is the core strategic fact of the current moment.
Design for verification, not generation. As generation becomes cheap, value concentrates in the selection function: choosing what to generate, verifying correctness, and recovering from errors. This is not a minor process improvement; it is a reorientation of where the product lives. A system where the expensive, differentiated layer is the generator is building on sand — generation is what commoditizes first and fastest. A system where the expensive, differentiated layer is the verifier is building on more durable ground.
Hallucination is a design parameter. Accept it as permanent.⁷ Build systems that detect, contain, and recover from it — redundant agents, verification steps, human approval gates for high-stakes outputs. The corollary: domains where hallucination is tolerable (ideation, drafting, exploration) have different optimal system architectures than domains where it is not (legal, medical, financial). Applying a low-verification architecture to a high-stakes domain is not a product decision; it is a safety decision.
Decision junctions matter more than sequence length. Error accumulation follows the staircase pattern — reliability degrades at specific decision points, not uniformly across a task. The architectural response is to identify those junctions, checkpoint them, and verify them with particular care. Shortening chains treats the symptom; reinforcing decision points treats the cause.
Data is the durable moat. As models commoditize and architectural patterns diffuse, the unique data generated by operating a system — institutional memory, agent performance histories, research outputs, economic signals — becomes the most defensible asset. It cannot be replicated by switching to a new model. Every production deployment that does not systematically capture and structure its own operational data is destroying a moat it could be accumulating. The systems that treat data accumulation as a first-class architectural concern will compound their advantage; those that treat it as a logging afterthought will not.
The Answer
Return to the original question: reservoir or reactor?
The model is a conditional reactor. Its fuel is the compressed training data.³ Its power output depends critically on the scaffolding, tools, and feedback loops surrounding it. Its ceiling is set by the interaction of all three.
For factual retrieval and pattern application within distribution, the reservoir intuition is correct — what the model contains is the binding constraint. For multi-step reasoning with verification,⁵ for system-level goals pursued across time, for capabilities that compound through persistent state and institutional memory,⁹ the reactor intuition is correct. The system can genuinely produce more than the model alone contains — not by violating information theory, but by introducing genuinely new information sources, by moving computation into higher complexity classes through extended reasoning, and by allowing temporal continuity to transform one-shot performance into compounding capability.
The crucial point is that neither the training data nor the architecture nor the scaffolding alone sets the ceiling. The relevant question for any specific use case is: which of these is currently binding, and how does that change over time?
Building without this understanding does not merely produce imprecision. It produces the wrong architectural choices, the wrong investments, the wrong timeline assumptions. A physicist does not build a reactor without understanding nuclear physics. The intelligence substrate has its own physics — information theory, computational complexity, empirical regularities like the Densing Law¹² and the staircase failure pattern. These are not abstractions. They are the load-bearing structure underneath every serious AI system.
The builders who understand that structure will make consistently better decisions than those who don't. Not because they can predict the future, but because they know which constraints are permanent and which are dissolving — and they build accordingly.
Sources
- 1. Shannon, C.E. — "A Mathematical Theory of Communication," Bell System Technical Journal — 1948
- 2. Hoffmann, J., et al. — "Training Compute-Optimal Large Language Models" (Chinchilla), DeepMind / arXiv — 2022
- 3. Delétang, G., et al. — "Language Modeling Is Compression," ICLR — 2024
- 4. Merrill, W., and Sabharwal, A. — "The Expressive Power of Transformers with Chain of Thought," arXiv — 2023
- 5. Strobl, L., et al. — "What Formal Languages Can Transformers Express? A Survey," Transactions of the Association for Computational Linguistics — 2024
- 6. Turing, A.M. — "On Computable Numbers, with an Application to the Entscheidungsproblem," Proceedings of the London Mathematical Society — 1936
- 7. Xu, Y., et al. — "Hallucination is Inevitable: An Innate Limitation of Large Language Models," arXiv — 2025
- 8. Boden, M. — "The Creative Mind: Myths and Mechanisms," Routledge — 2004
- 9. Wang, G., et al. — "Voyager: An Open-Ended Embodied Agent with Large Language Models," arXiv — 2023
- 11. Landauer, R. — "Irreversibility and Heat Generation in the Computing Process," IBM Journal of Research and Development — 1961
- 12. Ma, Y., et al. — "The Densing Law of LLMs," arXiv — 2024