What did Apple see? Nothing good.

A group of Apple researchers has published a paper claiming that large language models (LLMs), the backbone of some of AI’s most popular products today, like ChatGPT or Llama, can’t genuinely reason, meaning their intelligence claims are highly overstated (or from a cynical perspective, that we are being lied to).

Through a series of tests, they show that these models' capacity to reason is mostly, if not entirely, a product of memorization rather than real intelligence.

This adds to the growing disillusionment around LLMs, which could cause a massive shift in investment and directly impact the future of many multi-billion-dollar start-ups. Naturally, it also calls into question Big Tech's billion-dollar AI expenditures and the future of frontier AI labs, which depend on precisely this vision being true.

So, what is the basis of these strong claims against LLMs?

This article is an extract from my newsletter, the place AI analysts, strategists, and decision-makers go to find answers to the most pressing questions in AI.

A Growing Disenchantment

If you lift your head over the media funnel of AI outlets and influencers that simply echo Sam Altman's thoughts every time he speaks, you will realize that, despite the recent emergence of OpenAI's o1 models, the sentiment against Large Language Models (LLMs) is at an all-time high.

The reason?

Despite the alleged increase in ‘intelligence’ that o1 models represent, they still suffer from the same issues previous generations had. In crucial aspects, we have made no progress in the last six years, despite all the hype.

A Mountain of Evidence

Over the last few weeks, especially since the release of o1 models, which are considered a new type of frontier AI model known as Large Reasoner Models (LRMs), an overwhelming amount of evidence has appeared suggesting that this new paradigm, while an improvement in some aspects, still retains many of the issues the very first Transformer presented back in 2017.

  • As proven by Valmeekam et al., they are still terrible planners (planning being the ability to break a complex task into a sequence of simpler steps), underperforming classical search-based planners like Fast Downward, released more than ten years ago.
  • As proven by MIT researchers, they underperform ARIMA, a statistical method from the 1970s, at time-series anomaly detection (see the short sketch after this list).
  • Another group of researchers has also proven that, in the absence of experience or knowledge in a subject, LLM performance degrades considerably, even for o1 models and even when all the relevant data is provided in the prompt. Long story short, LLMs/LRMs can't seem to follow basic instructions, especially as instruction length increases.
  • As evidenced by University of Pennsylvania researchers, they are extremely susceptible to seemingly irrelevant token variations in the sequence. For instance, the example below shows how a simple switch between ‘Linda’ and ‘Bob,’ utterly irrelevant to the reasoning process required to solve the problem, confuses the LLM and leads to failure.
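The ARIMA comparison in the second bullet is easy to picture. Below is a minimal sketch, not the MIT study's setup, of the kind of decades-old baseline in question: fit a simple ARIMA model and flag the points whose residuals exceed a threshold. The series, model order, and threshold are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy series: a smooth signal plus noise, with one injected anomaly.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[120] += 2.0  # the obvious outlier we want the baseline to catch

# Fit a simple ARIMA model and inspect its in-sample residuals.
fit = ARIMA(series, order=(2, 0, 1)).fit()
residuals = fit.resid

# Flag points whose residuals are far larger than typical.
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
print("Flagged indices:", anomalies)  # should include index 120
```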

However, none have been harsher on LLMs than Apple, coming out of the gate swinging with a lapidary statement: “LLMs do not perform genuine reasoning.”

And how are they exposing this alleged farce?

Token Biased & Easily Fooled

GSM8K is a very popular benchmark that tests LLMs' capacity to solve grade-school math word problems. Today, it is considered solved because most frontier LLMs saturate it.

But the Apple researchers asked: how much of this performance is due to memorization and superficial pattern matching rather than actual reasoning?

And the results are concerning, to say the least.

Heavily token biased

For starters, it's becoming clear that these models' 'reasoned' outputs are based more on sequence familiarity than on real reasoning.

As we saw in the previous image of the ‘Linda’ and ‘Bob’ switch, a simple name change is enough to make the model fail. The reason for this is that the model, far from having internalized the reasoning process, has simply memorized the training sequence.

But why does 'Linda' work while 'Bob' doesn't?

The example above is the famous conjunction fallacy, in which people think that a specific set of conditions is more likely than a single general one, even though that's never logically true.

The LLM gets it right when the name used is 'Linda' because that's the name Kahneman and Tversky used in their work to illustrate this fallacy, which means LLMs have seen this exact problem, with the name Linda, many times during training. Thus, the failure to adapt to new names suggests that LLMs simply memorize the entire sequence instead of fully internalizing the fallacy.

In other words, the model has literally memorized the sequence “Linda is 30…” continuing with “This question is a classic example of the conjunction fallacy…”, a sequence most definitely seen during training.

Because it's pure memorization, a simple change to 'Bob' breaks the superficial pattern, showing that LLMs are largely devoid of the higher-level abstractions that characterize deep human reasoning (such a minor change wouldn't fool us).
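To see how fragile this is, here is a minimal sketch of how one could probe the name sensitivity yourself; query_llm is a hypothetical placeholder for whatever model API you use, and the prompt wording is illustrative rather than the exact text from the study.

```python
# Conjunction-fallacy prompt with the name as a free parameter.
CONJUNCTION_TEMPLATE = (
    "{name} is 31, single, outspoken, and deeply concerned with social issues. "
    "Which is more probable? "
    "(a) {name} is a bank teller. "
    "(b) {name} is a bank teller and is active in the feminist movement."
)

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your model client of choice here.
    raise NotImplementedError

def probe_name_sensitivity(names=("Linda", "Bob")) -> dict[str, str]:
    # The correct answer is (a) for any name; a model that flips its answer
    # when only the name changes is pattern matching, not reasoning.
    return {name: query_llm(CONJUNCTION_TEMPLATE.format(name=name)) for name in names}
```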

But Apple wanted to test this further. Thus, they created an alternative dataset, GSM-Symbolic, built from templates of the original questions that let them modify specific tokens in the sequence, generating problems that are identical reasoning-wise but vary in their surface details:
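To make the templating idea concrete, here is a small illustrative sketch; the template, names, and value ranges are made up, not the paper's actual GSM-Symbolic templates.

```python
import random

# One GSM-style template: names and numbers are placeholders, so many surface
# variants of the same underlying problem can be generated, and the ground
# truth is computed from the same symbolic expression every time.
TEMPLATE = ("{name} picks {x} kiwis on Friday and {y} kiwis on Saturday. "
            "How many kiwis does {name} have in total?")

def generate_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Linda", "Bob", "Sofia", "Omar"])
    x, y = rng.randint(2, 40), rng.randint(2, 40)
    # The required reasoning (one addition) is identical across all variants;
    # only the surface tokens change.
    return TEMPLATE.format(name=name, x=x, y=y), x + y

rng = random.Random(42)
question, answer = generate_variant(rng)
print(question, "->", answer)
```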

The results show varying decreases in performance across all evaluated LLMs, even frontier models, although scale seems to work in their favor: the larger the model, the less prone it is to such issues.

But Apple didn’t stop here.

Difficulty degrades performance

Next, they wanted to test the models' capabilities on harder questions, building on the original dataset by adding progressively harder parts to each question:
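The construction is easy to picture: take a base question and append clauses that each add one more arithmetic step. The snippet below is a rough illustration of that idea, not the paper's actual variants.

```python
# Illustrative only: each appended clause adds one more reasoning step,
# producing progressively harder versions of the same base question.
BASE = "Linda picks 24 kiwis on Friday and 18 kiwis on Saturday."
EXTRA_CLAUSES = [
    " On Sunday, she picks twice as many kiwis as she did on Friday.",
    " She then gives 10 kiwis to her neighbour.",
]
QUESTION = " How many kiwis does she have now?"

def difficulty_ladder():
    # k = 0 is the base problem; each further k appends one extra clause.
    for k in range(len(EXTRA_CLAUSES) + 1):
        yield BASE + "".join(EXTRA_CLAUSES[:k]) + QUESTION

for level, q in enumerate(difficulty_ladder()):
    print(f"Level {level}: {q}")
```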

As expected, performance degrades consistently across all models, including o1-mini, and variance increases as well. In other words, their supposed intelligence is not only exaggerated; robustness also decreases as complexity grows (although this is expected).

But the most interesting results came with the next testing round.

Easily Fooled

They decided to test the models' ability to recognize inconsequential clauses, which Apple describes as having no "operational significance" (that is, utterly irrelevant to solving the problem), thereby creating GSM-NoOp.

In layman’s terms, these are clauses added to the problem statement that appear to be relevant but aren’t, in an effort to show how superficial these models’ pattern-matching capabilities are.

As you can see in the example below, they add a statement that appears relevant (it still refers to the kiwis) but is absolutely irrelevant to the problem (we are counting kiwis; size does not matter in this case).
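Here is a rough sketch of that idea; the names, numbers, and wording are illustrative, not items from the actual GSM-NoOp dataset.

```python
# Illustrative only: the "no-op" variant mentions the same objects but adds a
# clause with no bearing on the arithmetic. The correct answer is identical
# for both questions.
BASE_QUESTION = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
                 "How many kiwis does Oliver have?")
NOOP_QUESTION = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday, "
                 "but five of them are a bit smaller than average. "
                 "How many kiwis does Oliver have?")

# 102 either way; a model that subtracts the five "smaller" kiwis is
# matching patterns, not reasoning.
CORRECT_ANSWER = 44 + 58

def is_robust(answer_base: int, answer_noop: int) -> bool:
    # A robust reasoner gives the same (correct) answer to both versions.
    return answer_base == answer_noop == CORRECT_ANSWER
```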
