We introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction.
CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy.
This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K.
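To make the interface concrete, here is a minimal sketch of that idea in PyTorch. This is not the paper's architecture: the sizes, the linear encoder/decoder, and the chunk length K are all invented for illustration; only the shape of the interface (K tokens in, one vector out, K tokens back) follows the abstract.

```python
import torch
import torch.nn as nn

# Invented sizes, purely illustrative.
K, VOCAB, D_TOK, D_VEC = 4, 32000, 256, 512

class ChunkAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_TOK)
        self.enc = nn.Linear(K * D_TOK, D_VEC)   # K tokens -> 1 vector
        self.dec = nn.Linear(D_VEC, K * VOCAB)   # 1 vector -> K sets of logits

    def encode(self, tokens):                    # tokens: (B, K) int ids
        x = self.embed(tokens).flatten(1)        # (B, K * D_TOK)
        return self.enc(x)                       # (B, D_VEC)

    def decode(self, z):                         # z: (B, D_VEC)
        return self.dec(z).view(-1, K, VOCAB)    # (B, K, VOCAB)

ae = ChunkAutoencoder()
chunk = torch.randint(0, VOCAB, (2, K))
z = ae.encode(chunk)        # one continuous vector per K-token chunk
logits = ae.decode(z)       # reconstruct all K tokens from that vector

# The language model itself then predicts the next *vector*, not the next
# token, so generating T tokens takes T/K autoregressive steps.
```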
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies.
This biologically inspired method beats large language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI, despite using small models (27M parameters) trained on small datasets (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal.
We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers.
With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
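As a rough illustration of the recursive idea (not TRM's actual architecture: the dimensions, step count, and update rule below are made up), a single tiny block reused at every iteration to refine a latent state and a running answer looks like this:

```python
import torch
import torch.nn as nn

D, STEPS = 128, 16   # illustrative latent width and recursion depth

class TinyRecursiveNet(nn.Module):
    def __init__(self):
        super().__init__()
        # "only 2 layers": one small block, reused at every step
        self.block = nn.Sequential(
            nn.Linear(3 * D, D), nn.ReLU(), nn.Linear(D, D)
        )

    def forward(self, x):                 # x: (B, D) embedded puzzle input
        z = torch.zeros_like(x)           # latent reasoning state
        y = torch.zeros_like(x)           # current answer estimate
        for _ in range(STEPS):            # same weights on every iteration
            z = self.block(torch.cat([x, y, z], dim=-1))
            y = y + z                     # refine the answer with the new state
        return y

net = TinyRecursiveNet()
print(net(torch.randn(2, D)).shape)       # torch.Size([2, 128])
```

The parameter count stays tiny because depth comes from iterating the same block, not from stacking new layers.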
"With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters."
Well, that's pretty compelling when taken in isolation. I wonder what the catch is?
It won't be any good at factual questions, for a start; it will be reliant on an external memory. Everything would have to be reasoned from first principles, without knowledge.
My gut feeling is that this will limit its capability, because creativity and intelligence involve connecting disparate things, and to do that you need to know them first.
Though philosophers have tried, you can't unravel the mysteries of the universe through reasoning alone. You need observations, facts.
What I could see it being good for is a dedicated reasoning module.
Basic English is about 2000 words. So a small-scale LLM capable of reasoning in Basic English, which transforms a problem in normal English into Basic English by automatically inlining the relevant word/phrase definitions from a dictionary, could easily beat a large LLM (by being more consistent).
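A toy sketch of that rewriting step (the word list and dictionary here are tiny stand-ins, not a real Basic English vocabulary):

```python
# Stand-in controlled vocabulary and dictionary, purely for illustration.
BASIC_WORDS = {"a", "the", "is", "very", "small", "living",
               "thing", "that", "can", "make", "you", "ill"}
DEFINITIONS = {"microbe": "very small living thing that can make you ill"}

def to_basic_english(sentence: str) -> str:
    out = []
    for word in sentence.lower().split():
        if word in BASIC_WORDS:
            out.append(word)
        else:
            # inline the definition so a small model can reason over it
            out.append(DEFINITIONS.get(word, word))
    return " ".join(out)

print(to_basic_english("a microbe is very small"))
# -> "a very small living thing that can make you ill is very small"
```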
I think this is where all reasoning problems of LLMs will end up. We will use an LM to transform a problem in informal English (human language) into a formal logical language (possibly fuzzy and modal), from that possibly into an even simpler logic, then solve the problem in the logical domain using traditional reasoning approaches, and convert the answer back to informal English. That way, you won't need to run a large model during the reasoning. Larger models will only be useful as fuzzy K-V stores (attention mechanism) to help drive heuristics during the reasoning search.
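The middle of that pipeline is already practical with an off-the-shelf solver. Here is a minimal sketch using the real z3-solver Python bindings (`pip install z3-solver`), with the LM-translation steps at both ends done by hand:

```python
from z3 import Bool, Implies, Not, Solver, sat

# Informal English, translated (here, by hand) into formal logic:
# "If it rains the grass is wet; the grass is not wet."
rain, wet = Bool("rain"), Bool("wet")
s = Solver()
s.add(Implies(rain, wet), Not(wet))

# Query: is "it is raining" consistent with the facts?
s.push()
s.add(rain)
verdict = "it cannot be raining" if s.check() != sat else "it may be raining"
s.pop()

# Convert the answer back to informal English.
print(verdict)   # -> "it cannot be raining"
```

No large model runs during the actual reasoning; the solver does all of it.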
I suspect the biggest obstacle to AGI is philosophical: we don't really have a good grasp or formalization of human/fuzzy/modal epistemology. Even if you look at the formalization of mathematics, it's mostly about proofs, but we lack an understanding of what makes, e.g., an interesting mathematical problem, or how to even express in formal logic that something is a problem, that experiments suggest something, that one model has an advantage over another in some respect, that there is a certain cost associated with testing a hypothesis, etc. Once we figure out what we actually want from epistemology, I am sure the algorithm required will be greatly reduced.
Take the knowledge the average human has about integrating visual information with the texture of an object. Nearly every adult can take a quick glance around a room and have a good idea what it would feel like to run their fingers along any surface in it, or their lips, or even their tongue, and be able to describe the experience. We have this knowledge because, as infants and toddlers, everything we encountered was picked up, pulled toward our mouths, and touched by our hands. An AGI inside a computer cannot have that experience today, so it will lack the foundations of intelligence that humans have built up by interacting with the real world.
At some point it will become possible to either collect that data or simulate an experience sufficiently accurately to mimic the development a human child goes through. Until that happens, true AGI will be out of reach as it will have deficiencies the average human does not.
That said, a lot of people will try to get to that point using other means, and they'll probably get pretty close, albeit with really weird hallucinations in the corner cases.
We'll need a memory system, an executive function/reasoning system, as well as some sort of sense integration: auditory, visual, text in the case of LLMs, and probably symbolic.
A good avenue of research would be to see if you could glue OpenCyc to this for external "knowledge".
If we could somehow weave a reasoning tool directly into the inference process, without having to use the context for it, that'd be something. Perhaps compile it to weights and pretend this part is pretrained…? No idea if it's feasible, but it'd definitely be a breakthrough if AI had access to z3 in its hidden layers.
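Nothing like solver access inside hidden layers exists today; the nearest runnable approximation is a solver call woven into the decoding loop. A toy sketch, assuming a hypothetical `<solve>...</solve>` span that the model emits and that gets replaced by z3's answer before generation continues:

```python
import re
from z3 import Int, Solver, sat

def expand_solve_spans(text: str) -> str:
    """Replace hypothetical <solve>...</solve> spans with z3's answer."""
    def run(match: re.Match) -> str:
        x = Int("x")
        s = Solver()
        # Extremely naive constraint parsing, purely for illustration:
        # each constraint looks like "x <op> <int>".
        for c in match.group(1).split(","):
            _, op, rhs = c.split()
            s.add({"<": x < int(rhs), ">": x > int(rhs),
                   "==": x == int(rhs)}[op])
        return str(s.model()[x]) if s.check() == sat else "unsat"
    return re.sub(r"<solve>(.*?)</solve>", run, text)

print(expand_solve_spans("x is <solve>x > 3, x < 5</solve>."))  # "x is 4."
```

That is still tool use through text, not through weights, but it shows how little glue a solver-in-the-loop actually needs.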
The "catch" is that TRM is a very small model and a relatively narrow architecture, which shows that the ARC-AGI benchmark doesn't actually test for AGI. (Which the ARC guys kind of admitted themselves by releasing a "-2" version and working on a "-3".)
I'm doing something like this, summarizing HN posts, because most of the time when there are hundreds or thousands of comments, it's not possible to read everything and I feel like I'm missing something.
So far, I quite enjoy having a summary with bullet points.
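For reference, the fetching side of such a setup can use the real Algolia HN API; the summarization call itself is left as a stub for whatever model you point it at:

```python
import requests  # pip install requests

def fetch_comments(item_id: int) -> list[str]:
    """Fetch all comment texts of an HN item via the Algolia HN API."""
    item = requests.get(f"https://hn.algolia.com/api/v1/items/{item_id}").json()
    texts = []
    def walk(node):
        if node.get("text"):
            texts.append(node["text"])
        for child in node.get("children", []):
            walk(child)
    walk(item)
    return texts

def summarize(comments: list[str]) -> str:
    prompt = ("Summarize these comments as bullet points:\n"
              + "\n---\n".join(comments))
    ...  # send `prompt` to your LLM of choice and return its reply
```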