
Your Next ‘Large’ Language Model Might Not Be Large After All


Since the earliest days of AI, researchers have placed their faith in scale: the belief that general intelligence is an emergent property of size. Keep adding parameters, keep training on gargantuan corpora, and human-like reasoning will eventually manifest itself.

But we soon discovered that even this brute-force approach has its shortcomings. Evidence suggests that many of our frontier models are severely undertrained and carry inflated parameter counts (Hoffmann et al., 2022)3, which indicates that we may be spending compute in the wrong places after all.

The Hidden Flaws of the AI Giants

We made the most powerful AI systems ever built think in a slow, awkward, foreign language: English. To solve a problem, they must “reason out loud” in a word-by-word, step-by-step process, producing a long trail of “tokens”, many of them irrelevant to the actual solution.

Then there is the well-established industry practice of “the-bigger-the-better”, which has produced models with billions of parameters and training sets with trillions of tokens. The sheer size of such models suggests that they are not really reasoning; they are simply being the best possible imitators. Instead of finding an original, novel solution to a particular problem, they lean on the fact that they were shown something similar during training.

Lastly, and perhaps most critically, these models are limited to a “one-size-fits-all” way of thinking. When faced with a very hard problem, a model cannot choose to spend additional processing time on a particularly difficult part of it. Of course, a model that works longer on a harder problem generates more CoT tokens (Wei et al., 2022)4, but this doesn’t really replicate human reasoning, in which deep stages of pondering happen without any tangible verbal dialogue.

Hierarchical Reasoning Models

Introducing Hierarchical Reasoning Models (HRMs) (Wang et al., 2025)1: instead of the clumsy “think out loud” approach, they reason silently and fluently within their native latent space—a rich, high-dimensional world of numbers. This is far closer to our own human intuition, where deep thoughts often precede the words we use to describe them.

The heart of this new architecture is beautifully simple yet dynamic: a patient, high-level H-module sets the overall strategy, while a fast, low-level L-module sees that strategy through to the end. Both modules are implemented as simple stacks of transformer blocks (Vaswani et al., 2017)2.

How HRM Thinks: A Look Inside

HRM breaks the act of “thinking” down into a dynamic, two-speed system. To understand how it solves a complex problem like a 30×30 maze, let’s walk through the entire journey from input to answer.

(Source: Author)
Overall Architecture of the HRM
(Note: all H-module instances share one set of weights, all L-module instances share another, and both process information in a recurrent manner)

1. The Setup: Embedding and Initializations

  • Flatten and Embed: As the name suggests, the input (for example, a Sudoku grid or a maze) is flattened into a one-dimensional stream of patches/tokens and fed through an embedding layer, which converts the human-interpretable puzzle into embedding vectors the machine can work with.
  • Initialize Memory: Two recurrent states are now instantiated: a High-Level state (zH), which acts as a supervisor dictating the overarching direction of thought and reasoning, and a Low-Level state (zL) responsible for executing the reasoning in the set direction (a minimal sketch of this setup follows below).
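To make this setup concrete, here is a minimal PyTorch-style sketch. It is not the authors’ code: the grid size, vocabulary size, hidden width, and the random initialization of the two states are all illustrative assumptions.

```python
import torch
import torch.nn as nn

GRID, VOCAB, D_MODEL = 30, 16, 256               # assumed sizes for a 30x30 maze

embed = nn.Embedding(VOCAB, D_MODEL)             # maps each cell symbol to a vector

maze = torch.randint(0, VOCAB, (1, GRID, GRID))  # stand-in for an encoded maze grid
x = embed(maze.flatten(1))                       # flatten to (1, 900), embed to (1, 900, 256)

# One persistent memory state per module, initialized randomly for the first pass.
z_H = torch.randn(1, GRID * GRID, D_MODEL)       # high-level "planner" state
z_L = torch.randn(1, GRID * GRID, D_MODEL)       # low-level "worker" state
```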

2. The Core Engine: Real Reasoning Starts Here

At its core, HRM is a nested loop, and a single pass through it is termed a “segment”. Each segment contains several H-module and L-module cycles.

  • Step A: Setting the Plan
    The High-Level (H) module begins by establishing a plan. Its memory state (zH) is initialized randomly for the first pass and then held constant for a set number of steps. In our maze example, this initial plan might be very abstract/general, like “explore paths that move downwards and to the right.”
  • Step B: Executing the Plan
    With the High-Level module’s plan as a fixed guide, the Low-Level (L) module begins a series of recurrent computations. For a set number of timesteps (T), it iteratively updates its own hidden state (zL) from three inputs:
    • Its own work from the previous step (zL_previous).
    • The fixed plan from the High-Level Module (zH).
    • The original problem (the embedded maze).
    While keeping the overarching strategy in mind, the L-module explores numerous paths, hits dead ends, backtracks, and repeats until it reaches a conclusion, which is then handed back to the High-Level module.
  • Step C: Changing the Plan Accordingly
    Once the L-module is done with its recurrent working cycles, its final memory state (zL_final), which represents the outcome of its computation, is fed to the H-module for refinement. The H-module modifies its own plans and devises a new strategy for the L-module to follow in the next iteration. For example: “The downward path is an eventual dead end. The new plan is to now explore paths leading right.”
  • Step D: Reset and Repeat
    The L-module receives this updated plan from its “supervisor” and begins the next cycle of recurrent, intensive work. This repeats for “N” H-module cycles, each consisting of “T” L-module sub-steps (a simplified sketch of the full loop follows below).
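Below is a simplified, hedged sketch of one such segment. The Block class is a toy stand-in for the actual stacked transformer blocks, and the way it mixes its inputs (by simple summation) is purely an illustrative assumption; only the nesting of the loops reflects the mechanism described above.

```python
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for an HRM module (the paper uses stacks of transformer blocks)."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, *states):
        # Crudely mix all incoming states by summation, then transform.
        return self.net(sum(states))

def segment(z_H, z_L, x, f_H, f_L, N=2, T=4):
    """One segment: N high-level cycles, each holding the plan fixed for T low-level steps."""
    for _ in range(N):
        for _ in range(T):
            # Step B: the L-module updates from its previous work, the fixed plan, and the problem.
            z_L = f_L(z_L, z_H, x)
        # Step C: the H-module refines its plan from the L-module's final state.
        z_H = f_H(z_H, z_L)
    return z_H, z_L
```

With x, z_H, and z_L from the setup sketch, a single segment is then just z_H, z_L = segment(z_H, z_L, x, f_H, f_L), where f_H = Block(256) and f_L = Block(256).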

3. The “Exit” Button: Deciding When to Stop

A single pass through the engine (a “segment”) might not be enough for a more nuanced or harder problem. This is where HRM’s most ingenious feature comes in: Adaptive Computation Time (ACT) (Graves, 2016)6.

After each full segment of thought (N×T cycles), the model generates a tentative answer, and the H-module’s final state is fed into a simple linear head, which decides: “Am I confident enough to stop, or should I think more?”

  • If the model determines that it is confident enough in its answer, it halts and presents it as the final solution.
  • If not, it decides to “ponder” further: it takes the final memory states of the L- and H-modules and uses them as the initialization for an entirely new segment, which continues the thinking process (see the sketch below).
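Here is a hedged sketch of that outer loop, building on the segment() helper and the tensors from the earlier sketches. The output head, the Q-head, the mean-pooling of the H-state, and the greedy halt-versus-continue comparison are illustrative assumptions rather than the paper’s exact mechanics.

```python
import torch
import torch.nn as nn

def reason(x, z_H, z_L, f_H, f_L, out_head, q_head, M_max=8):
    """Run segments until the Q-head prefers halting (assumes batch size 1)."""
    for m in range(M_max):
        z_H, z_L = segment(z_H, z_L, x, f_H, f_L)   # one full segment of thought
        y_hat = out_head(z_H)                       # tentative answer read off the H-state
        q = torch.sigmoid(q_head(z_H.mean(dim=1)))  # (1, 2): [Q_halt, Q_continue]
        if q[0, 0] > q[0, 1] or m == M_max - 1:     # confident enough, or budget exhausted
            return y_hat, m + 1                     # answer plus number of segments used

out_head = nn.Linear(256, 16)   # projects each position back to cell symbols
q_head = nn.Linear(256, 2)      # scores "halt" vs. "continue"
```

Calling reason(x, z_H, z_L, f_H, f_L, out_head, q_head) then keeps carrying the two memory states into new segments until the Q-head decides the answer is good enough.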

Implementation of ACT:

The model learns when to stop through a Q-learning paradigm.

  • The Q-Head: This is a simple linear layer that makes the call to either continue reasoning or stop. It takes the final memory state of the H-module at the end of a segment and outputs two scores: Qhalt and Qcontinue.
  • The ‘Halt’ Value (Qhalt): This score represents the model’s confidence that it should stop now. During training, the model learns to make this score predict the immediate, final reward. The target it’s trained to match is simple: 1 if the predicted answer is correct, and 0 if it’s wrong.
Ghalt = 1 if ŷm = y, else 0

Ghalt: the target reward for stopping the reasoning process
ŷm: the model’s predicted answer at segment m (e.g., its proposed maze solution)
y: the ground truth against the model’s prediction (e.g., the actual maze solution)
m: the current segment iteration number
  • The ‘Continue’ Value (Qcontinue): This represents the estimated reward the model would receive if it kept thinking for another segment instead of stopping right now. Its target is the maximum of the two Q-scores estimated for the immediate next segment:
Gcontinue = max(Qhalt(m+1), Qcontinue(m+1))

Gcontinue: the target reward for continuing to reason
Qhalt(m+1), Qcontinue(m+1): the Q-head’s predicted scores for the next segment
m: the current segment iteration number
  • The Dual-Loss System: After each segment of thought, the model’s total loss comprises two different objectives:
    • Task Loss: The standard penalty for getting the answer wrong (sequence-to-sequence cross-entropy).
    • Q-Learning Loss: The ACT penalty for making a poor stopping decision (binary cross-entropy between the Q-head’s outputs and their targets).
Lmtotal = TaskLoss(ŷm, y) + BinaryCrossEntropy(Qm, Gm)

Lmtotal: the total loss for the model at segment m
ŷm: the model’s predicted answer for the task (e.g., its proposed maze solution)
y: the ground truth against the model’s prediction (e.g., the actual maze solution)
Qm: the Q-Head’s predicted halt/continue scores
Gm: the Q-Head’s targets for those scores
  • This enables the model to learn both objectives simultaneously: how to solve the given problem, and how to recognize when it has been solved. A minimal sketch of these targets and the combined loss follows below.
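The sketch below mirrors the targets and the dual loss as described above; it is not the authors’ code. It assumes the Q-head’s two outputs have already been squashed into [0, 1] and ignores edge cases such as reaching the final segment of the budget.

```python
import torch
import torch.nn.functional as F

def act_targets(correct, q_halt_next, q_cont_next):
    """Targets for the Q-head, following the definitions above."""
    g_halt = correct.float()                          # 1 if ŷm matches y, else 0
    g_cont = torch.maximum(q_halt_next, q_cont_next)  # best Q-score of the next segment
    return g_halt, g_cont

def total_loss(logits, y, q_pred, q_target):
    """Task loss (sequence cross-entropy) plus ACT loss (binary cross-entropy)."""
    task_loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
    act_loss = F.binary_cross_entropy(q_pred, q_target.detach())  # no gradient through targets
    return task_loss + act_loss
```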

Putting It to the Test: Results

Sudoku and Maze Benchmarks

When benchmarked against several state-of-the-art reasoning models, HRM performs significantly better on complex reasoning tasks involving Sudoku puzzles and 30×30 mazes. Both tasks demand extensive logical deduction, the ability to backtrack, and spatial planning. As shown below, every other model that relies on Chain-of-Thought prompting failed to produce even a single valid solution. These findings support the idea that letting models reason in a more expressive latent space works better than making them talk to themselves via CoT.

(Source: Adapted from Wang et al., 20251, Figure 1)
X-axis: Accuracy of the models on the respective benchmarks

Architecture Over Scale: A Paradigm of Efficiency

The model pulls this off while also being remarkably parameter- and data-efficient. It manages its top-tier performance with 27 million parameters, trained from scratch on roughly 1,000 datapoints per task, without any expensive pre-training on web-scale datasets or brittle prompt-engineering tactics. This further supports the hypothesis that the model internalises general patterns and reasons far more efficiently than the standard CoT-based approach.

Abstract Reasoning and Fluid Intelligence: The ARC-AGI Challenge

The Abstraction and Reasoning Corpus (ARC) (Chollet, 2019)5 is a widely accepted benchmark for fluid intelligence: it requires models to infer abstract rules from only a few visual examples. HRM, with just 27 million parameters, outperforms most mainstream reasoning models. Despite its small size, it scored 40.3% on ARC-AGI-1, while much larger models with tremendous compute at their disposal, such as o3-mini and Claude 3.7, managed subpar scores of 34.5% and 21.2% respectively.

(Source: Adapted from Wang et al., 20251, Figure 1)
X-axis: Accuracy of the models on the respective benchmarks

Unlocking True Computational Depth

Performance of vanilla transformer architectures quickly plateaus when given more compute: simply adding more layers yields diminishing returns on complex reasoning. In contrast, HRM’s accuracy scales almost linearly with additional computational steps. The paper presents this as direct evidence that the architecture is not a fixed-depth system; it has an intrinsic ability to turn extra compute into deeper reasoning on complex tasks, a capability the underlying structure of a standard Transformer lacks.

(Source: Adapted from Wang et al., 20251, Figure 2)
X-axis: Accuracy of the models on the Sudoku-Extreme Full dataset

Intelligent Efficiency: Solving Problems with Less Effort

The Adaptive Computation Time (ACT) mechanism allows the model to dynamically allocate its computational resources based on problem difficulty. An HRM equipped with ACT achieves the same top-tier accuracy as a model hard-coded to use a high number of steps, but it does so with significantly fewer resources on average. It learns to conserve compute by solving easy problems quickly while dedicating more “ponder time” only when necessary, demonstrating an intelligent efficiency that moves beyond brute-force computation.

(Source: Adapted from Wang et al., 20251, Figure 5)

These two graphs must be analysed together to understand the efficiency of the ACT mechanism. The X-axis on both charts represents the computational budget: for the “Fixed M” model, it is the exact number of steps it must perform, while for the “ACT” model, it is the maximum allowed number of steps (Mmax). The Y-axis on Figure (a) shows the average number of steps actually used, while the Y-axis on Figure (b) shows the final accuracy.

The “Fixed M” model’s accuracy (black line, Fig. b) peaks when its budget is 8, but this comes at a fixed cost of using exactly 8 steps for every problem (black line, Fig. a). The “ACT” model (blue line, Fig. b) achieves a nearly identical peak accuracy when its maximum budget is 8. However, Fig. (a) shows that to achieve this, it only uses an average of about 1.5 steps. The conclusion is clear: the ACT model learns to accomplish the same top-tier performance while using less than a quarter of the computational resources, intelligently stopping early on problems it has already solved.

References

[1] Wang, Guan, et al. “Hierarchical Reasoning Model.” arXiv preprint arXiv:2506.21734 (2025).
[2] Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017).
[3] Hoffmann, Jordan, et al. “Training Compute-Optimal Large Language Models.” arXiv preprint arXiv:2203.15556 (2022).
[4] Wei, Jason, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (2022): 24824–24837.
[5] Chollet, François. “On the Measure of Intelligence.” arXiv preprint arXiv:1911.01547 (2019).
[6] Graves, Alex. “Adaptive Computation Time for Recurrent Neural Networks.” arXiv preprint arXiv:1603.08983 (2016).

