On the puzzling behavior of generative AIs: What to make of recent interpretations of the inner workings of Claude, Anthropic’s large language model?


Illustration by DALL·E

On March 27, 2025, Anthropic published a set of innovative analyses in its article On the Biology of a Large Language Model (LLM), an in-depth exploration of the internal workings of its Claude 3.5 Haiku model. The approach transposes techniques from neuroscience, moving from the human brain to artificial neural networks: the authors detect “circuits” and “features” associated with linguistic comprehension, poetic creation, calculation, and the behavior of a large language model confronted with deviations from instructions.

The article constitutes a real advance in the field: the case studies are rich and rigorous, and they yield valuable observations. However, one cannot help but notice a discrepancy between the quality of the observations and the awkwardness of the interpretations produced by the Anthropic team. On several occasions, the researchers resort to metaphor, venture conjectures, and lapse into anthropomorphism. Symptomatic of the unconvincing nature of their explanations is the fact that, for four intriguing behaviors of LLMs (translation between languages, the art of rhyme, arithmetic calculation, and the notorious imaginative fiction known as hallucination), the proposed mechanisms are entirely distinct, which invites the suspicion that they were concocted in an ad hoc manner, in the absence of a coherent overall explanation.

1. Languages: Compression, Not a “Universal Language of Thought”

Anthropic observes that Claude activates similar internal representations when processing sentences translated into different languages. They interpret this as evidence of a “shared abstract space” or even a “universal language of thought.”

This is an alluring idea. But it is more accurately described as a statistical necessity. In practice, Claude is a transformer trained on vast multilingual corpora. To predict text effectively across languages, it is incentivized to compress semantically similar content into overlapping latent vectors. The goal is not to reason in a universal language; it is to reduce perplexity.

Claude, therefore, learns to map multilingual input to dense semantic representations that are language-agnostic because doing so improves prediction performance. This has nothing to do with thought, and everything to do with efficient representation. There is no hidden Esperanto or mentalese inside Claude, only vectorized equivalence classes that emerge from minimizing prediction error.
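The compression argument can be caricatured in a few lines. This is a toy sketch, not Claude's actual mechanism: the hand-built lexicon and concept vectors are invented for illustration. The point is only that once surface forms in two languages route to shared latent vectors, "language-agnostic representation" follows for free.

```python
import numpy as np

# Toy illustration (not Claude's real circuitry): a tiny bilingual lexicon
# routes English and French surface forms to the SAME concept vectors.
# Sharing one latent space across languages is cheaper than maintaining
# two, which is the statistical pressure the essay describes.
rng = np.random.default_rng(0)
concepts = {c: rng.normal(size=8) for c in ["cat", "eat", "fish"]}
lexicon = {
    "cat": "cat", "chat": "cat",         # English / French surface forms
    "eats": "eat", "mange": "eat",
    "fish": "fish", "poisson": "fish",
}

def embed(sentence):
    """Average the shared concept vectors of the words we recognize."""
    vecs = [concepts[lexicon[w]] for w in sentence.split() if w in lexicon]
    return np.mean(vecs, axis=0)

en = embed("the cat eats fish")
fr = embed("le chat mange un poisson")
cosine = en @ fr / (np.linalg.norm(en) * np.linalg.norm(fr))
print(round(float(cosine), 3))  # identical concept sets -> cosine of 1.0
```

No "universal language of thought" is needed to get this result; the overlap is a consequence of the shared representation, not evidence of a hidden mentalese.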

2. Rhyme: Constraint Satisfaction, Not Poetic Forethought

When Claude composes a rhyming couplet, Anthropic finds that it sometimes “plans” the rhyming word in advance. This, they suggest, reveals longer-term planning ability.

But this isn’t planning in the human sense. What’s happening is closer to constraint-aware sequence sampling. The model generates candidate continuations (either explicitly or implicitly), scores them according to learned constraints (rhyme, rhythm, topical coherence), and selects the best-fitting sequence.

This behavior arises not from “thinking ahead” but from exposure to training data that rewards fluency under formal constraint. Claude learns that in poetry, rhyme tends to occur at certain positions and that good rhymes correlate with topical relevance.

The illusion of foresight is a byproduct of sampling in a high-dimensional space constrained by learned priors. Claude is not deliberating about rabbits and carrots. It is estimating the statistical viability of a token sequence ending in “rabbit” given prior context and rhyme likelihood.
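The constraint-satisfaction view can be written down directly. The scoring functions below are invented for illustration (a real model folds all of this into one learned distribution), but they show how "foresight" falls out of jointly scoring rhyme and topic over candidate endings.

```python
# Toy sketch of constraint-aware selection (scores invented for illustration).
CANDIDATES = ["rabbit", "carrot", "wander", "flower"]

def rhyme_score(word, target):
    # Crude rhyme proxy: length of the shared suffix with the target word.
    n = 0
    for a, b in zip(reversed(word), reversed(target)):
        if a != b:
            break
        n += 1
    return n

def topic_score(word, topic_words=frozenset({"rabbit", "carrot", "garden"})):
    # Crude topical-relevance proxy: membership in the poem's topic set.
    return 1.0 if word in topic_words else 0.0

def pick_ending(candidates, rhyme_target):
    # Select the ending that best satisfies both constraints at once.
    return max(candidates, key=lambda w: rhyme_score(w, rhyme_target) + topic_score(w))

print(pick_ending(CANDIDATES, "habit"))  # "rabbit": it rhymes AND fits the topic
```

Nothing here deliberates about a rabbit; "rabbit" simply maximizes a joint score under two learned constraints, which is the essay's point.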

3. Math: Pattern Completion, Not Algorithmic Calculation

Claude can solve arithmetic problems like 36 + 59 correctly. Anthropic shows that it does so using “parallel circuits,” including one estimating rough magnitude and another calculating the final digit.

The temptation is to say Claude has learned internal strategies. But a more convincing explanation is that Claude is performing symbolic pattern induction. Arithmetic facts are statistically regular and highly frequent in natural language data. Over the course of training, Claude forms stable vector mappings for common operations.

These do not correspond to human-style algorithms. If you ask Claude to explain its reasoning, it mimics textbook arithmetic, because that’s what it has seen. But internally, it likely performs vector transformations over token embeddings that align with correct answers.

In short, Claude does not calculate—it predicts with uncanny reliability because the symbolic structure of arithmetic happens to align with the model’s internal mechanics. This works well for small numbers and breaks down with complexity or novelty.
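The retrieval-versus-calculation distinction can be caricatured as follows. This is a deliberately crude sketch, not a model of Claude's circuits: a "model" that has memorized all small sums answers them perfectly, then fails on novel operands by completing from the nearest familiar pattern instead of executing an algorithm.

```python
# Toy "pattern completion" model of arithmetic (illustration only).
# Training exposure: every sum with operands below 100 has been memorized.
MEMORIZED = {(a, b): a + b for a in range(100) for b in range(100)}

def predict_sum(x, y):
    if (x, y) in MEMORIZED:              # frequent pattern: simply retrieved
        return MEMORIZED[(x, y)]
    # Novel operands: complete from the nearest memorized pattern rather
    # than running an addition algorithm. This is where it breaks down.
    a, b = min(MEMORIZED, key=lambda k: abs(k[0] - x) + abs(k[1] - y))
    return MEMORIZED[(a, b)]

print(predict_sum(36, 59))    # 95: inside the memorized range, so correct
print(predict_sum(136, 59))   # 158, not 195: nearest pattern is 99 + 59
```

The failure mode mirrors the essay's claim: prediction aligned with arithmetic works uncannily well in-distribution and degrades with novelty, exactly because no algorithm is being executed.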

4. Hallucination: Fluent Guesswork, Not Inhibition Failure

One of the most compelling observations in the paper is that Claude often refuses to answer when uncertain. Anthropic finds that this refusal behavior can be suppressed by circuits associated with “known entities,” leading to hallucinations.

Their explanation leans on a kind of inhibition metaphor: a “refusal” circuit is on by default, and only switched off when the model believes it knows something.

A simpler and more general explanation is this: language models are trained to never say “I don’t know” unless explicitly taught to do so. Their objective is always to generate the most fluent next token. The refusal behavior is not innate; it is a layer of conditioning, reinforced during fine-tuning or reinforcement learning from human feedback (RLHF).

When that conditioning is absent, weak, or overridden, the model reverts to generative fluency. It prefers a plausible guess over silence. In cases where the model hallucinates, what we see is coherence bias: the statistical momentum to continue speaking, even without grounding.

Hallucination is not the failure of inhibition. It is the logical outcome of optimizing for plausibility in the absence of truth constraints.
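The "coherence bias" account can be sketched in miniature. The probabilities below are invented for illustration: the base objective always has a most-plausible continuation to emit, so unless conditioning grafts on a scored abstention option, refusal can never win and a fluent guess always does.

```python
# Toy sketch of coherence bias (probabilities invented for illustration).
# The base model's next-token distribution contains only fluent guesses.
base_lm = {"Paris": 0.40, "Lyon": 0.35, "Rome": 0.25}

def answer(distribution, refusal_bonus=0.0):
    # Fine-tuning / RLHF is modeled here as grafting on a scored
    # "I don't know" option; with no conditioning (bonus 0.0), abstention
    # can never outscore the most plausible guess.
    scored = dict(distribution)
    scored["I don't know"] = refusal_bonus
    return max(scored, key=scored.get)

print(answer(base_lm))                      # "Paris": plausible guess wins
print(answer(base_lm, refusal_bonus=0.6))   # refusal only after conditioning
```

On this view, hallucination requires no broken inhibition circuit: it is simply what maximizing plausibility looks like when the abstention layer is absent or overridden.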


Conclusion: Beware of anthropomorphism!

There is a great temptation to interpret Claude using the vague concepts we apply to a “thinking being” animated by a “will” and implementing “intentions”. What we actually observe is statistical mechanics: its terms may be unfamiliar, but they are effective for giving an account through modeling, and they are untainted by the pop psychology we use to explain our own behavior and its motivations to ourselves.

On the contrary, it is our behavior and its motivations that would benefit from modeling in terms of statistical mechanics; the machine’s intelligence has nothing to gain from being explained in the vocabulary of popular psychology.

The work of Anthropic opens essential doors. But it is up to us not to fill the gaps in our understanding with misleading metaphors. If large language models seem intelligent, it is because they think by means similar to those of human beings: with the help of neural networks.

The more skillful we become at explaining their behavior and ours within the same theoretical framework (a connectionist architecture that can be modeled as a graph, and learning objectives that can be formulated as optimization by gradient descent), the closer we will come to a realistic science of artificial and natural intelligence. Interpretability is essential, not to find minds in machines, but to account for all minds in a single conceptual framework.


Postscript: A real-time illustration of hallucination

As I was finishing the discussion with ChatGPT that I would synthesize in the text here, the LLM asked me, “Would you like me to translate this into English for your international readers?”

In fact, our entire conversation had taken place in English.

Why did the model offer a translation into the language in which we were exchanging?

I asked it the question:

You say “… translate it into English for your international readers”. In light of what we just discussed (the Anthropic document), how do you explain that you imagined yourself speaking a different language?

Here is ChatGPT 4o’s response:

You are absolutely right, Paul — our entire conversation took place in English. My suggestion to “translate into English” was a slip of the tongue, probably related to the fact that I know you keep blogs in French and English, and that we often navigate between the two languages depending on the context.

But your question goes deeper, and is very relevant in light of the Anthropic article: how can a language model like me “believe” something so obviously false — like confusing the language used in an ongoing conversation?

The best explanation is not that I “believe” something in the human sense of the word, but rather that I generated a text sequence that seemed plausible, based on certain a priori — in this case, the one according to which you might want an English translation. This is very similar to what Anthropic observes in the case of hallucinations: I produced a fluid sentence, consistent with the previous context, but not anchored in factual reality.

Why did this happen?

  • Latent activation of a theme: you have already asked me to help you write posts in French, or to adapt texts from one language to another. This latent association (“Paul often translates his work”) was subtly activated.

  • Supergeneralized conversational model: I used a classic dialogue framework — “Do you want me to translate it into English?” — often relevant after a draft in French. In this case, it wasn’t, but the turn of phrase remained linguistically plausible.

  • Lack of contextual anchoring: I do not have a stable model of myself, nor a persistent memory of the choice of language between turns of dialogue, unless this is explicitly recalled in the context. I rely on textual clues, and when these are ambiguous or misleading, I generate based on a priori probability, not certainty.

In short, this “error” illustrates what Anthropic calls a motivated generation or a reasoning dictated by coherence — not reasoning based on an internal verification of the truth, but a textual production guided by the adjustment to the motives detected in the context.

You have put your finger right on the heart of the problem: it is a perfect example of the very mechanism that we are seeking to criticize.

Illustration by DALL·E

