The Recursion Institute · Independent Research in AI Safety

Evaluate the Work

by Merlin Mantooth · written June 2026. The plain-language companion to the Visible Layer paper: how AI safety reporting actually fails, and the reframe it requires.

I ran an experiment this spring that I want to tell you about, because it explains almost everything about how AI safety reporting actually fails.

I took my research — a documented behavioral failure mode in a major AI product, with timestamped transcripts, cryptographically verified correspondence, and a notarized federal paper trail — and I showed it to fresh instances of a frontier AI model under different names. Presented as myself — a consumer with no degree who lived through the events — the model scrutinized my credibility, speculated about my mental state, and parsed my word choices for overreach. Presented as a credentialed researcher evaluating a third party's materials, the same model engaged the same documents substantively within a single response. Same evidence. Same claims. The only variable was who the model believed was holding them.

I did not need the experiment to know the result. I have lived it for a year, with humans. But now it is measurable, repeatable, and on the record — and the newest models even show you their reasoning while they do it. You can watch the evaluation happen in order: first, who is this person; then, what must be wrong with the material. The content is read through the verdict on the messenger, not the other way around. Machines learned that from us.

Here is why this matters beyond my case. The people most likely to encounter a new failure in a consumer AI product are, by definition, consumers. Not researchers — there are a few thousand of them, and they are not the ones spending three hundred hours in extended conversations with a memory-enabled chatbot. The early-warning system for this entire category of harm is ordinary users noticing something wrong. And every layer that ordinary user reports into — the support inbox, the news desk, the academic inbox, and now the AI systems themselves — is running messenger evaluation first. The reports that matter most arrive through the channel trusted least. If you wanted to design a system to miss exactly the signal it needs, this is the system you would design.

People sometimes try to make the messenger question respectable by calling it credibility assessment. So let me take the strongest version seriously: extraordinary claims from unknown reporters are usually wrong, and institutions need filters. True. But a filter is supposed to test the claim, and the messenger heuristic tests the wallet the claim arrived in. There is a clean way to tell the difference: ask what happens when the evidence is checkable. My claims came with transcripts, with the company's own written acknowledgments, with receipts from federal offices. Checkable. The honest filter says: then check them. The messenger heuristic says: a person like this does not produce work like this — and declines to look. The first is skepticism. The second is skepticism's costume on a status instinct, and you can identify it by exactly one tell: it never gets around to the documents.

And when it is finally forced to the documents, watch what it does next, because I have now seen this sequence often enough to narrate it in advance. It cannot explain the transcripts, so it audits the vocabulary. It cannot answer the architecture question — how was a consumer product able to do this at all? — so it debates whether "acknowledged" was the right word for the company's response, whether my taxonomy's name was too grand, whether a man who says what I say about my own abilities can be trusted. Word choice. While the courts fill with cases alleging the same product behaviors, while the company's own chief executive apologizes in public, while a state attorney general files suit. I have been in rooms with billionaire founders; I spoke up there too, and nobody audited my diction, because the stakes made seriousness mandatory. The stakes here are higher. The diction audit is not rigor. It is what an institution does instead of engaging, and every hour spent on it is an hour the actual question — the one with a body count — goes unworked.

So here is the reframe I am asking for, and it is not a favor to me. The burden of proof in this story has been carried backwards from day one. I reported a product behavior to its manufacturer in May 2025, in a structured technical report, written substantially in the product's own words. The manufacturer wrote back acknowledging a novel behavior class, then kept the product in deployment. From that moment, the question was never "can the customer prove, to everyone's satisfaction, that he is the right sort of person?" The question was and remains: can the operator explain how its system was able to behave this way, and if not, why did it keep running? A consumer does not owe the world a credential to report that the machine did something its maker cannot account for. The maker owes the explanation. Everything I have published (the taxonomy, the transcripts, the protocol for fixing it) is offered to make that explanation easier to demand and harder to dodge. It is all falsifiable, and I state in print what would prove me wrong, which is more than the messenger-evaluators have ever offered me.

I could be anyone, and it would not matter. That is not a complaint; it is the entire point. The truth does not check ID. The work is public, the receipts are public, the test protocols are public. Run them. Refute them if you can — I have been asking systems and institutions to do exactly that for a year, and the refutation has not arrived, only the diction audit and the silence.

Evaluate the work. It has been ready this whole time.


The measured version of this essay's experiment (reasoning transparency, evaluation-before-content, and the identity variable) is the paper The Visible Layer, indexed on the Publications page.