Today, I’m going to convince you that Microsoft’s new AI is not like any other model. By the end of this piece, I will have changed your intuitions on how AI is created and run and, importantly, on the consequences for the incumbents and their go-to-market strategies.
The message is clear:
Edge AI, running powerful AIs at home, is becoming real in 2025.
Essentially, they’ve built a state-of-the-art maths model by taking inferior small language models (SLMs) and combining them into a beast that surpasses models like OpenAI’s o1-preview, Anthropic’s Claude 3.5 Sonnet, or the recent DeepSeek V3 for a fraction of a fraction of the cost.

But wait, what does Apple have to do with all this?
Reasoner Models, A Complex and Exciting Future
As we’ve discussed multiple times, the advent of reasoner models, essentially AIs that think for longer by generating a chain of thought (a concatenation of thoughts that solves a problem step by step), has introduced a new scaling paradigm that fundamentally changes how AIs are trained and run.
Let Models Think!
The reason we do this is to allow AIs to approach problem-solving similarly to humans by doing two things:
- Multi-step reasoning: Instead of hoping to get the solution in one shot, humans approach complex problems step by step, breaking them down into smaller parts and solving each one separately.
- Search: Intelligence is a game of intuition + search. In other words, solving complex problems requires having the correct intuitions on how the problem could be solved and exploring those candidate paths. Intuition turns a problem with infinite possible solutions into one that is solvable. Put another way, with good intuition, search eventually leads us to the solution.
Makes sense, right?
And when things sound this intuitive (pun intended), they generally turn out to be right. Thus, giving Large Language Models (LLMs) time to think has invariably taken their capabilities to a whole new level.
But what components do these models have, and what are the results of this change?
The Results Speak for Themselves
In practice, having a Large Reasoner Model (LRM) involves three components (a minimal sketch of how they fit together follows this list):
- LLM/SLM: Working as an intuition machine, it generates thoughts. It must be smart enough to see the problem and, while it probably can’t provide a straightforward answer, have ‘intuition’ on how it should be approached.
- Verifier: A surrogate model (or models) that scores every newly generated thought.
- Search algorithm: Combined with the verifier, it drives the search process, allowing the system to select, expand, refine, and backtrack through all the possible solution paths.
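To make this less abstract, here is a minimal, hypothetical sketch of how the three components might fit together. The functions are stand-ins, not rStar-Math’s actual implementation:

```python
import random

# Hypothetical stand-ins for the three LRM components. In a real system, the
# generator would be a language model, the verifier a process reward/preference
# model, and the search something like beam search or MCTS.

def generate_candidate_steps(partial_solution, n=4):
    """Generator (LLM/SLM): propose n candidate next 'thoughts'."""
    return [f"thought-{len(partial_solution)}-{i}" for i in range(n)]

def score_step(partial_solution, step):
    """Verifier (PRM/PPM): score how promising a candidate step looks."""
    return random.random()  # placeholder score in [0, 1]

def is_final(depth, max_depth=5):
    """Stop condition: in practice, detect a terminal answer."""
    return depth >= max_depth

def solve(problem):
    """Search: a simple greedy, step-wise search guided by the verifier."""
    solution = [problem]
    for depth in range(10):
        candidates = generate_candidate_steps(solution)
        best = max(candidates, key=lambda s: score_step(solution, s))
        solution.append(best)
        if is_final(depth):
            break
    return solution

print(solve("What is 4 + 5?"))
```

In a real system, the generator and verifier would be language models, and the search could be best-of-N sampling, beam search, or Monte Carlo Tree Search rather than the simple greedy loop above.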
To comprehend the importance of LRMs, the study below by Hugging Face shows that Llama 3.2 1B/3B, which are far more limited as LLMs than their 3.1 8B and 70B counterparts, can exceed the performance of their bigger, smarter siblings when given enough test-time inference (the model working on the task for longer, even simply attempting the task a certain number of times).
That’s the power of letting LLMs think about a problem for longer, search for possible alternatives, and eventually converge.
Now, Microsoft has published a method that turns most of the challenges faced while training such models into a thing of the past while proving that, with reasoner models, SLMs will play a key role.
But to understand how remarkable this is, what are the challenges in training these ground-breaking models?
Verifiers, Data & Cost
Training a good reasoner model is easier said than done. The obvious question is: how do we train a model to generate high-quality ‘intermediate thoughts’ and verify them?
The Verifier problem
First, we need to acknowledge the existence of two verifier types (also known as reward models):
- Outcome-supervised Reward Model (ORM). A model that focuses on scoring the final outcome, not the process.
- Process-Supervised Reward Model (PRM). A reward model that also scores all the in-between steps.
As you may guess, the verifiers we are talking about today are PRMs, which behave as shown below: the model can judge the outcome of the whole process but also check whether the intermediate steps are accurate.

But how do we make a model good at evaluating another model’s responses? There are several methods, but most of them involve a combination of two elements:
- LLM-as-a-judge: Use an LLM as a verifier, which provides our PRM with a default layer of language understanding, crucial to interpreting the thoughts of the generator model.
- Monte Carlo/Q-values: These models not only need to know whether a given thought is accurate but, more importantly, must judge whether it provides value toward achieving the goal. The generator saying that goats are animals is an accurate thought, but a useless one for a mathematical goal. This requires Q-value training, which I won’t cover in detail for the sake of length (a rough sketch of the Monte Carlo idea follows this list).
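To make the Monte Carlo idea concrete, here is a toy sketch of how a step’s Q-value might be estimated: roll out several completions from that step and count how many reach the correct answer. The functions are hypothetical stand-ins, not code from any specific paper:

```python
import random

# Hypothetical stand-ins: in practice, rollout_from would ask the generator to
# complete the solution from a given intermediate step, and the answer check is
# easy in verifiable domains like maths or code.

def rollout_from(problem, partial_steps):
    """Complete the solution from this point; returns a final answer (stub)."""
    return random.choice(["9", "83923,23"])

def estimate_step_q_value(problem, partial_steps, ground_truth, n_rollouts=16):
    """Monte Carlo Q-value estimate: the fraction of rollouts continuing from
    this step that end in the correct answer."""
    hits = sum(rollout_from(problem, partial_steps) == ground_truth
               for _ in range(n_rollouts))
    return hits / n_rollouts

q = estimate_step_q_value("What is 4 + 5?", ["Add the two numbers."], "9")
print(f"Estimated Q-value for this step: {q:.2f}")
```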
Sadly, PRMs are still very new, so we don’t have full consensus on the best way to train them. But choosing how to train the verifier, which, as we’ll see later, can be done in unique new ways, isn’t the only concern; we also have a massive data problem with this new model type.
The Data Issue
We’ve all been there.
We try to solve a maths problem in a test, only to arrive at one of two very different outcomes. Sometimes the intermediate steps make total sense, yet you end up with a nonsensical answer, like ‘83923,23’ when asked what ‘4 + 5’ is. Other times, your approach is chaotic and nonsensical, yet you arrive at the correct response, ‘9’.
Luck plays a role in all this, too. But even when the input, the approach, and the output are all sound, how do we score which thoughts are better than others?
In AI, having a good scoring signal is all that matters, but producing one is tremendously complex when the importance of a thought toward achieving the desired goal is unclear.
Sure, in verifiable domains like maths or coding, you can run checks to see whether that thought or computation is accurate.
But this entails another problem: how do we obtain data where every single thought in a chain of thought over complex tasks is laid out and, importantly, every step is scored? I’ll save you the hassle; that data does not exist and thus has to be built from scratch.
I won’t go too deep here because I did so recently, but training LRM systems (the generator plus the verifier) is a massive endeavor that requires investing billions just to assemble the training dataset.
Besides paying human experts’ hourly bills, we have to perform data augmentation (sketched in code after this list):
- Take a challenging {input, output} dataset like AIME 2024, which tests whether a human can compete at Maths Olympiad level.
- Using a more robust model than the one being trained (this will evolve into self-training later), generate multiple solution paths for each problem. In other words, ask the brighter model to try to solve the problem several times and keep the sound reasoning traces. This is also a humongous human annotation effort, as humans need to score every intermediate thought.
- Keep the best traces and use them to train the generator.
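Here is a toy sketch of that augmentation loop, with hypothetical stubs standing in for the stronger teacher model and for the (costly) verification step:

```python
import random

# Hypothetical sketch of the augmentation loop. sample_reasoning_trace stands in
# for the stronger "teacher" model; the answer check replaces (part of) the
# human annotation effort.

def sample_reasoning_trace(problem):
    """Ask the teacher model for a step-by-step attempt (stub)."""
    answer = random.choice(["9", "42"])
    return [f"a reasoning step about: {problem}"], answer

def build_training_set(dataset, attempts_per_problem=8):
    """Keep only the traces whose final answer matches the reference output."""
    kept = []
    for example in dataset:
        for _ in range(attempts_per_problem):
            steps, answer = sample_reasoning_trace(example["input"])
            if answer == example["output"]:
                kept.append({"input": example["input"],
                             "steps": steps,
                             "answer": answer})
    return kept

toy_dataset = [{"input": "What is 4 + 5?", "output": "9"}]
print(len(build_training_set(toy_dataset)), "traces kept")
```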
In case you were wondering, yes, all I have described requires extremely deep — and wide — pockets. Billions, with a strong b.
Thus, while these models are extremely powerful, training them is a pain and, importantly, an expensive one. But Microsoft has proven that this does not need to be the case anymore.
rStar-Math, A New Beginning?
Cutting to the chase, Microsoft has presented rStar-Math, which, besides introducing several training breakthroughs, sets a new standard for how powerful reasoner models really are and how small language models hold great promise in the future.
The Great Dark Horses of Reasoning Training
Summarizing the issues we saw earlier, reasoning training (turning an LLM into an LRM) has six significant problems:
- High-quality data is scarce and needs to be built from scratch.
- Building these datasets requires running powerful AI models for ages to generate the reasoning traces, and extensive human annotation to check which traces are correct.
- Only then can the generator be trained on the selected traces, adding yet another training stage.
- Training PRMs is a complex, not fully understood problem.
- And having to train two separate models means costs rise even more.
- Models are only as good as their training data, so improving them means rerunning the entire pipeline, raising costs even more.

So, finally, how is Microsoft tackling all of this? With three innovations: code-driven CoT, Process Preference Models, and self-evolving curriculum learning.

1. Code-driven Chain-of-Thought
As discussed, the hardest thing about creating a new dataset to train your model is that it’s a monumental human annotation effort.
Humans are required because, while we still use AIs to generate the reasoning traces (i.e., we ask an AI to explain how it would solve the problem), these AIs make many mistakes and, importantly, sometimes simply memorize the solution and bullshit their way to the correct answer, just like a kid taking a test who knows the answer by heart but doesn’t know the process.
Consequently, once the training set is built, humans step in and reject the invalid traces, which number in the millions (or more).
To avoid this, Microsoft takes a clever approach: use code as the base of the chain of thought. Simply put, as code can be verified as correct or incorrect, the model thinks by writing code by default and adds the natural-language reasoning as comments to that code:

This clever modification turns a chain of thought that would otherwise need human review into one that can be automatically verified by a code interpreter/compiler (in this case an interpreter, as it’s Python code).
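As a purely illustrative example (mine, not taken from the paper), a code-driven chain of thought might look something like this, with the reasoning living in comments and an assertion the interpreter can check automatically:

```python
# Problem: what is the sum of the first 10 positive integers?

# Step 1: the first 10 positive integers are 1 through 10.
numbers = list(range(1, 11))

# Step 2: sum them directly.
direct_result = sum(numbers)

# Step 3: cross-check with the closed-form formula n*(n+1)/2.
n = 10
formula_result = n * (n + 1) // 2

# Automatic verification: an interpreter can run this trace and reject it if
# the two independent computations disagree; no human review needed.
assert direct_result == formula_result == 55
print(direct_result)
```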
2. Introducing Process Preference Models
With verifiers, the main issue is determining which intermediate steps should be scored well and which ones should not.
Moreover, when choosing between two perfectly valid thoughts, how do we rank them? Which thought is better?
Microsoft acknowledges this, so they decided to switch the training objective. Instead of the model having to assign a score to every thought relative to others, they relaxed the objective to one of comparison, turning the PRM into what they call a Process Preference Model (PPM).
In layman’s terms, the model no longer learns to assign a quality score to a thought directly (although it still ends up doing so); instead, it learns to compare two thoughts and decide which one is better, a much simpler task that indirectly teaches the model to assign quality scores.
Without an example, this is hard to visualize, so let’s do just that.
Question: “What is the derivative of x²?”
1. Response A: “The derivative of x² is 2x, since differentiation reduces the exponent by one.”
2. Response B: “The derivative of x² is 2x, which is obtained by applying the power rule where the exponent is brought down as a coefficient, and the exponent is reduced by one.”
In this case, why is it easier to train a Process Preference Model (PPM)?
Both responses are correct, but Response B is clearly better because it explains the reasoning behind the answer. For a PPM, which trains on preferences between outputs, the distinction is easy: the richer explanation in Response B aligns with the goal of improving response quality, without the model having to truly understand why it is better.
For a PRM, which evaluates step-wise correctness, it’s harder to quantify which step is better (‘better’ referring to how likely that step is to lead to the correct answer), as both responses reach the correct final answer. Pinpointing the precise reason that makes Response B superior involves more nuanced reasoning from the model, making training more challenging.
In a nutshell, by deciding between two responses (performing a comparison instead of making tough absolute judgments of quality), the PPM indirectly also learns to assign high quality scores to the best thoughts; we are simply making the learning process easier for the model while requiring less human effort.
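For the technically inclined, a preference objective like this is typically implemented as a pairwise ranking loss. Here is a minimal sketch of the general idea (a Bradley-Terry-style loss, not necessarily the exact rStar-Math objective):

```python
import math

# Minimal sketch of a pairwise preference objective (Bradley-Terry style).
# An illustration of the general idea, not necessarily the exact rStar-Math loss.

def preference_loss(score_preferred, score_rejected):
    """The model only needs to rank the preferred step above the rejected one:
    minimising -log(sigmoid(s_preferred - s_rejected)) widens the margin."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Scores the PPM might assign to Response B (preferred) and Response A (rejected).
print(preference_loss(score_preferred=1.3, score_rejected=0.4))  # small loss: correct ranking
print(preference_loss(score_preferred=0.2, score_rejected=1.1))  # large loss: wrong ranking
```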
3. Self-evolving Learning Recipe
Last but not least, we have the most exciting part: self-improvement. Microsoft introduces an iterative approach that trains both the generator and the verifier as follows:
- Round 1: They train the generator and PPM using a set of math training data.
- For Round 2, they use the models from Round 1, which are now smarter, together with new, more complicated math problems, to generate a new training dataset on which the Round 1 models are trained again, evolving into smarter models.
- Round 3 repeats the process in Round 2.
And so on, leading to a system that gets smarter with each round.
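Conceptually, the self-evolution recipe boils down to a loop like the one below; the training and trace-generation functions are stand-ins for the real machinery:

```python
# Hypothetical sketch of the self-evolving recipe. train and
# generate_verified_traces are stand-ins for the real training and
# search-plus-verification machinery.

def generate_verified_traces(generator, verifier, problems):
    """Use the current generator + verifier to produce code-verified traces (stub)."""
    return [{"problem": p, "trace": "..."} for p in problems]

def train(generator, verifier, traces):
    """Fine-tune both models on the new traces; return the updated pair (stub)."""
    return generator, verifier

def self_evolve(generator, verifier, rounds_of_problems):
    for round_idx, problems in enumerate(rounds_of_problems, start=1):
        # Each round uses progressively harder problems than the last.
        traces = generate_verified_traces(generator, verifier, problems)
        generator, verifier = train(generator, verifier, traces)
        print(f"Round {round_idx}: trained on {len(traces)} verified traces")
    return generator, verifier

self_evolve(generator=None, verifier=None,
            rounds_of_problems=[["easy problems"], ["harder problems"], ["hardest problems"]])
```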

This is extremely powerful because we have essentially created an iterative self-improvement cycle that, theoretically at least, could take language models to superhuman levels, even Artificial Super Intelligence (ASI), something already achieved in narrow domains with AlphaGo and AlphaZero, AIs that are unbeatable by humans at games like Go and chess.
And you may wonder: how can a weaker model generate training data to train itself and become smarter?
This isn’t obvious, but my best bet is that, since researchers deliberately introduce harder-yet-solvable math problems between rounds for the model to tackle, the resulting dataset isn’t the result of the model becoming smarter in an emergent manner. Instead, we are simply pushing the model harder by making it work on more difficult tasks and, in the cases where it manages to solve them, capturing those solutions and using them for further training.
Naturally, that means the model is still limited by the complexity of the human-driven training pipeline, so achieving superhuman AIs with this method seems optimistic, to say the least.
Note: Microsoft’s researchers did not frame it this way in the paper; this speculation is entirely mine.
Finally, what does all this have to do with Apple?
LRMs Are A Golden Opportunity for Hardware Companies. And for You.
As the field of LRMs progresses, and especially as we witness how inference-time compute pushes Small Language Models (SLMs) to LLM-level intelligence, I am starting to realize that this is a massive opportunity for consumer hardware, and especially for the companies behind that hardware, like Apple, Microsoft, or, more recently, NVIDIA.
While running a GPT-4o-sized model at home is an impossible endeavor, as you would need hundreds of gigabytes of RAM to store the model (RAM is cheap, but cheap memory chips without good memory bandwidth, at least 500 GB/s, are a losing battle from the very beginning), Llama 3.2 1B can run on your iPhone, let alone your laptop.
These models occupy a small amount of RAM and can be run for ages (while keeping an eye on the cache) to elevate their performance to the level of models you couldn’t even dream of running at home.
If we focus on rStar-Math, this system weighs around 28 gigabytes (Qwen2.5 7B models as generator and verifier), meaning that a powerful consumer-grade MacBook Pro M4 Max or NVIDIA’s upcoming DIGITS personal supercomputer can run this state-of-the-art maths model at home seamlessly.
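For the curious, the 28 GB figure checks out with simple back-of-the-envelope arithmetic, assuming 16-bit weights and ignoring the KV cache and runtime overhead:

```python
# Back-of-the-envelope: two Qwen2.5-7B-class models (generator + verifier) in
# 16-bit precision, ignoring KV cache and runtime overhead.

params_per_model = 7e9   # roughly 7 billion parameters each
bytes_per_param = 2      # FP16/BF16
n_models = 2             # generator + PPM verifier

total_gb = n_models * params_per_model * bytes_per_param / 1e9
print(f"Approximate weight memory: {total_gb:.0f} GB")  # ~28 GB
```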