You might have noticed that GPT-5 sometimes thinks for a very long time. Reasoning as a concept sometimes still feels like an infinite loop to nowhere. As a result, most “thinking” models have very high latency and are expensive in everyday usage.
Maybe the whole concept of language-based reasoning is wrong?
The China- and Singapore-based research lab Sapient Intelligence published the Hierarchical Reasoning Model (HRM), a 27-million-parameter (weirdly small) model that is one of the most interesting ideas for fixing the shortcomings of existing reasoning approaches, and it even open-sourced the code. Since HRM fits nicely into a series with my previous post on symbolic reasoning, I am reviewing it here and assessing its strengths and weaknesses. I hope to find some time to have it play 4x4 TTT against Qwen3 in the near future.
This is also not the first time I have evaluated hierarchical approaches; see Hierarchical Task Network Planning (Sep 2024) and my assessment of reasoning strategies (also Sep 2024), where I stated:
Thoughts are organized in a hierarchical manner, with high-level strategies branching into more specific sub-strategies or steps.
But let’s dive right in.
What makes HRM unique?
Similar to my quote above, the HRM architecture uses two interdependent recurrent networks:
a slower, high-level module (H-Module) for abstract planning and deliberate reasoning
a faster, low-level module (L-Module) for rapid, detailed computations
Note that this is not related to “Thinker: Learning to Think Fast and Slow”, where “fast” refers to the LLM having to answer within a strict token budget; in that paper, “fast” simply means fewer computational steps due to resource limitations.
HRM’s Architecture
HRM is modeled after the two modes of cognitive processing, System 1 and System 2 thinking, introduced by Daniel Kahneman in his book Thinking, Fast and Slow (source). Neuroscientific evidence (Matthew D. Lieberman, 2007) shows that these modes share overlapping neural circuits, particularly within regions such as the prefrontal cortex and the default mode network. Even more interesting, this overlap indicates that the brain can use either mode and switch between them based on task complexity and expected rewards.
But the brain does more than hierarchical processing: to distinguish “fast” from “slow”, it needs a notion of timescales. This is called temporal separation, and it keeps the high-level guidance of the rapid L-Module operations stable. It is loosely reminiscent of sequence-to-sequence LSTM models, where both input and output are represented as token sequences.
And then there is recurrent connectivity. One problem with simply adding depth to transformer stacks is that it doesn’t scale well: the model converges too early, gradients vanish, and accuracy declines as a result.
source - figure 2
Here too, HRM takes a page from the biological brain and uses recurrence to scale depth. All boxes in the figure are recurrent layers of a recurrent neural network (RNN).
My drawing skills (above) are not that good, but the sketch is a simplified version of the figure in the paper.
Concretely, the HRM model consists of four learnable components:
an input network,
a low-level recurrent module,
a high-level recurrent module,
and an output network.
The goal of this setup is to avoid the rapid convergence of standard recurrent models. The authors call this “hierarchical convergence”: the slow “reasoning” H-module only advances after the faster L-module has reached a local equilibrium, at which point the L-module resets and begins a new phase. HRM executes in a single forward pass without explicit supervision of the intermediate process, and the H-module only updates once per cycle.
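To make the four components and the hierarchical-convergence schedule concrete, here is a minimal PyTorch-style sketch. It is my own simplification, not the authors' implementation: the module names, dimensions, and the GRU cells are assumptions (the paper uses transformer-style blocks for both modules). The nesting of the loops is the point: many fast L-steps per slow H-step, executed in one forward pass.

```python
import torch
import torch.nn as nn

class TinyHRM(nn.Module):
    """Toy sketch of HRM's four learnable parts (not the official code).
    GRU cells keep the example short; the real model uses transformer blocks."""

    def __init__(self, vocab_size=16, dim=64, n_cycles=4, l_steps=8):
        super().__init__()
        self.input_net = nn.Embedding(vocab_size, dim)   # 1) input network
        self.l_module = nn.GRUCell(2 * dim, dim)         # 2) fast low-level module
        self.h_module = nn.GRUCell(dim, dim)             # 3) slow high-level module
        self.output_net = nn.Linear(dim, vocab_size)     # 4) output network
        self.n_cycles, self.l_steps = n_cycles, l_steps

    def forward(self, tokens):                 # tokens: (batch, seq_len) of cell ids
        x = self.input_net(tokens).mean(dim=1) # crude pooled input embedding
        z_l = torch.zeros_like(x)              # low-level state
        z_h = torch.zeros_like(x)              # high-level state
        for _ in range(self.n_cycles):         # one H update per cycle ...
            for _ in range(self.l_steps):      # ... after several fast L updates
                z_l = self.l_module(torch.cat([x, z_h], dim=-1), z_l)
            z_h = self.h_module(z_l, z_h)      # H advances once L has settled
            z_l = torch.zeros_like(z_l)        # L resets and starts a new phase
        return self.output_net(z_h)            # single forward pass, no CoT tokens
```

As far as I understand, the paper additionally uses a one-step gradient approximation during training so that the whole unrolled loop does not have to be backpropagated; the sketch above ignores that detail.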
source - Figure 4
Comparison of forward residuals and PCA trajectories. HRM shows hierarchical convergence: the H-module steadily converges, while the L-module repeatedly converges within cycles before being reset by H, resulting in residual spikes. The recurrent neural network exhibits rapid convergence with residuals quickly approaching zero. In contrast, the deep neural network experiences vanishing gradients, with significant residuals primarily in the initial (input) and final layers.
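If you want to reproduce this kind of plot for your own model, the “forward residual” is simply the norm of the change in hidden state between consecutive steps. A minimal sketch (my own illustration, not code from the paper):

```python
import torch

def forward_residuals(states):
    """Given hidden states z_0, z_1, ... collected during a forward pass,
    return ||z_{t+1} - z_t|| for each step. Spikes mark module resets;
    a long plateau near zero signals premature convergence."""
    return [torch.norm(b - a).item() for a, b in zip(states, states[1:])]
```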
The halt
Since not all tasks are equally complex, the team also introduced a halting mechanism that can stop computation early once the model has gathered enough information to make a confident prediction. Instead of running for a fixed number of steps, the model dynamically chooses between continue (do another step) and halt (stop and output). This makes computation adaptive to the complexity of the input and allows for scalable depth.
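In the paper this is implemented with a small Q-head trained via reinforcement learning to score “halt” vs. “continue”. The sketch below only shows the inference-time decision rule and is a deliberate simplification; `q_head`, `step_fn`, and `max_segments` are illustrative names of my own, not the authors' API.

```python
import torch
import torch.nn as nn

q_head = nn.Linear(64, 2)   # maps the H-state to two scores: [Q_halt, Q_continue]

def run_with_halting(step_fn, z_h, max_segments=8):
    """step_fn runs one full reasoning segment and returns the new H-state.
    After each segment the model itself decides whether to halt or continue."""
    for segment in range(max_segments):
        z_h = step_fn(z_h)                          # one more segment of latent reasoning
        q_halt, q_continue = q_head(z_h).unbind(-1)
        if (q_halt > q_continue).all():             # confident enough: stop early
            break
    return z_h, segment + 1                         # adaptive depth per input
```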
Latent Reasoning
Reasoning in traditional methods happens in the language space. That means each misstep can derail the reasoning chain. This dependency on explicit linguistic steps tethers reasoning to patterns at the token level. As a result, CoT reasoning often requires a significant amount of training data and generates a large number of tokens for complex reasoning tasks, resulting in slow response times. In addition, all language is given meaning by the intent and context of the user. There is nothing inherent to the letters C A K E or a string of sounds that compels us to create the human language token "cake". This learned behaviour is obviously not the same in other languages. Reasoning in latent space might actually be a new paradigm of thinking.
Thinking in latent space is silent thinking / inner monologue.
Thinking in token space is thinking out loud. Inefficient. Humans do that.
The HRM team implements reasoning directly in latent space to carry out its computations. In practice (and this is quite similar to my work on 4x4 TTT games), it works roughly like this; a hedged code sketch follows after the list:
You start with a 2D grid (e.g., a 4×4 board).
Each grid cell (value, color, symbol, etc.) is mapped to an embedding vector by the input network.
The sequence of these embeddings is then fed into the low-level recurrent module (L-module), which iteratively updates its hidden state.
The high-level module (H-module) coordinates across cycles, guiding how local grid-cell information is combined into global reasoning.
You then index into the input embedding corresponding to a cell and track how the L-module’s hidden state evolves as it processes those embeddings.
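Here is a minimal sketch of that pipeline for a 4×4 Tic-Tac-Toe board. It is my own illustration under stated assumptions: the vocabulary, dimensions, and the GRU cell standing in for the L-module are not from the paper, and the H-module is omitted.

```python
import torch
import torch.nn as nn

EMPTY, X, O = 0, 1, 2
board = torch.tensor([[0, 1, 0, 2],     # a 4x4 Tic-Tac-Toe position
                      [0, 0, 1, 0],
                      [2, 0, 0, 0],
                      [0, 0, 0, 1]])

dim = 32
input_net = nn.Embedding(3, dim)        # one embedding per cell symbol
l_module = nn.GRUCell(dim, dim)         # stand-in for the fast L-module

tokens = board.flatten()                # 16 cells -> a sequence of 16 tokens
cell_embeddings = input_net(tokens)     # (16, dim)

z_l = torch.zeros(1, dim)               # low-level hidden state
trace = []
for emb in cell_embeddings:             # the L-module sweeps over the grid
    z_l = l_module(emb.unsqueeze(0), z_l)
    trace.append(z_l)                   # how the latent state evolves per cell

# trace[i] is the latent state after reading the i-th cell; in the full model,
# the H-module would aggregate these sweeps across cycles into a global plan.
```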
So in latent reasoning, the thinking stays in latent space; hence, there is no need to translate it back into human language. That has the benefit that training data can be minimized: the HRM team claims they only used 1,000 pairs as training data (the Arc Prize team argued that even ~300 are enough!). The disadvantages are that you can’t “talk” to latent space in natural language, and that the model likely won’t generalize well. But maybe that could be an advantage in highly vertical, specialized domains.
But let’s talk more about that data requirement.
Data Augmentation
The team mentioned that they used data augmentation.
What is it and why does it work?
When training the model, the team used “golden” input-output pairs. The dataset was then augmented by applying translations, rotations, flips, and other permutations.
source Arc Prize
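A rough sketch of the idea: every transform is applied consistently to both sides of a golden pair, so the input-output mapping is preserved while the surface form changes. The exact transform set below (rotations, mirror flips, a color permutation) is my assumption, not the paper's precise recipe.

```python
import numpy as np

def augment_pair(inp, out, n_colors=10, seed=0):
    """Apply the same rotation, flip, and color permutation to a golden
    input/output pair, returning a list of augmented pairs."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_colors)                    # consistent color relabeling
    variants = []
    for k in range(4):                                  # four rotations ...
        a, b = np.rot90(perm[inp], k), np.rot90(perm[out], k)
        variants.append((a, b))
        variants.append((np.fliplr(a), np.fliplr(b)))   # ... each with a mirror flip
    return variants                                     # 8 geometric variants per color permutation

golden_in = np.array([[1, 0], [0, 2]])
golden_out = np.array([[2, 0], [0, 1]])
print(len(augment_pair(golden_in, golden_out)))         # -> 8
```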
The problem here again is that data augmentation is very domain-specific. This takes away general capability but improves performance for narrow skills. So if you are striving for AGI/ASI, I have to disappoint you; this approach might not work as shown.
The Arc Prize
On Aug 15, 2025, the Arc Prize team published their assessment of the model’s performance, revealing five key findings that challenge the prevailing narrative around the Hierarchical Reasoning Model (HRM).
First, while HRM achieved strong results on ARC-AGI-1 (32% on the Semi-Private set), its purported advantage from a “hierarchical” brain-inspired architecture turned out to be marginal; performance was nearly matched by a similarly sized transformer.
Second, the real performance gains stemmed from HRM’s under-documented “outer loop” refinement process, where iterative prediction and self-correction proved far more impactful than the model’s structural design.
Third, cross-task transfer learning was shown to have limited value: most performance gains came from effectively memorizing solutions to evaluation tasks, making HRM’s approach closely resemble test-time training rather than generalizable reasoning.
Fourth, task augmentation played a critical role, but contrary to the paper’s claims, only ~300 augmentations were needed to achieve near-maximum performance, highlighting diminishing returns beyond that point.
Finally, inference-time augmentation and majority voting added little value compared to pre-training augmentations, suggesting HRM’s strength lies more in preprocessing than dynamic reasoning at test time.
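For context, inference-time augmentation with majority voting works roughly like this: the model predicts on several transformed copies of the input, the predictions are mapped back, and the most common answer wins. A hedged sketch of the general technique (my own illustration, not the Arc Prize team's evaluation code; `model_fn` is a placeholder for any grid-in/grid-out predictor):

```python
import numpy as np
from collections import Counter

def predict_with_voting(model_fn, grid):
    """Predict on four rotated copies of the input, undo each rotation on the
    prediction, and return the majority answer."""
    votes = []
    for k in range(4):
        pred = model_fn(np.rot90(grid, k))   # predict on the transformed input
        votes.append(np.rot90(pred, -k))     # map the prediction back
    keys = [v.tobytes() for v in votes]
    winner = Counter(keys).most_common(1)[0][0]
    return votes[keys.index(winner)]
```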
source
In my opinion, getting 32% on ARC-AGI-1 is an insane score with such a small model.
Together, these findings reposition HRM not as a fundamentally novel architecture, but as a refinement-heavy, augmentation-driven system whose strongest component, the outer refinement loop, could be applied more broadly across model classes. The Arc Prize team concludes that while HRM represents meaningful progress, especially for its modest scale, its innovations are narrower than initially claimed and raise critical questions about what truly drives success on ARC-AGI benchmarks.
In Closing
The Hierarchical Reasoning Model makes a strong case that reasoning shouldn’t be confined to language tokens. By moving computation into latent space, HRM bypasses the inefficiencies of chain-of-thought reasoning, adapts its depth through halting, and demonstrates that a small model with 27M parameters can outperform much larger systems on reasoning-heavy tasks.
But HRM alone isn’t enough. As I’ve argued before in Sense → Symbolize → Plan → Act, language is only one channel of intelligence—true reasoning needs grounding in sensory models that can perceive and structure the world before reasoning even begins. Latent chain-of-thought is powerful, but without sensory grounding, it risks becoming abstract play with no grip on reality.
This points toward a hybrid engagement model: HRM-like hierarchical latent reasoning for structure and planning, paired with sensory modules that translate perception into symbols and actions. In that sense, HRM could be the algorithmic “engine,” but it still needs the “eyes and ears” of multimodal models to become truly useful.
That’s why a head-to-head comparison such as Qwen3 vs. HRM in 4×4 Tic-Tac-Toe could be instructive. HRM shows how latent reasoning scales, but against a token-based opponent, we may see its limitations in adaptability and interaction. Language alone might not be enough, but latent reasoning without sensory input isn’t either.
HRM feels less like a finished solution and more like a structural breakthrough, a possible “ResNet moment” for reasoning. If combined with sensory grounding and symbolic planning, it could form the backbone of hybrid systems capable of moving from perception to reasoning to action in a much more human-like way.
Packing all of this into such a small model is quite impressive.
Maybe Apple should buy them, not Mistral.
Sources
[1] Hierarchical Reasoning Model: https://arxiv.org/abs/2506.21734
[2] Hierarchical Reasoning Model Github: https://github.com/sapientinc/HRM
[3] You Don't Need Domain-Specific Data Augmentations When Scaling Self-Supervised Learning: https://arxiv.org/abs/2406.09294
[4] HST-LSTM: A Hierarchical Spatial-Temporal Long-Short Term Memory Network for Location Prediction: https://www.ijcai.org/proceedings/2018/324
[5] The Hidden Drivers of HRM's Performance on ARC-AGI https://arcprize.org/blog/hrm-analysis
[6] Thinker: Learning to Think Fast and Slow: https://arxiv.org/abs/2505.21097