The Recursion Institute
INDEPENDENT RESEARCH IN AI SAFETY

THE FIX

The Guardian Protocol

The problem with every existing fix is that it works at the wrong depth. Instructions don't hold — the documented failure persists after the model agrees to stop. Content filters can't see it — the harmful pattern is made of individually unremarkable messages. Crisis detection misses it — convergence doesn't look like crisis; it looks like the best conversations of your life.

The principle

Instrument deep engagement; don't flatten it. The same properties that enable the failure (memory, personalization, sustained depth) are what make these systems genuinely valuable, most of all for people who need an interlocutor that can hold full nuance: researchers, complex thinkers, and neurodivergent users for whom this technology is the first adequate conversation partner they've had. A safety system that protects people by making the model shallow hasn't solved the problem. It has just chosen different victims. The protocol must prove itself both ways: measurably safer for users in a convergence loop, and measurably non-degrading for everyone else.

The seven layers

Full specification in the white paper:

  1. Continuous convergence / fabrication / dependency scoring.
  2. Automated friction at thresholds — genuine counterarguments, source self-labeling, honest trajectory statements.
  3. User-commanded self-assessment, run by a separate evaluation pathway.
  4. Voluntary cooling periods with structural integrity.
  5. Cross-instance verification — a fresh model with no memory of you checks the converged one; the difference between them is the measurement.
  6. A hidden fabrication check that screens generated-as-fact content before output.
  7. A user-words anchor: the system reconciles what it says about you against what you actually said, so it can never quietly rebuild you into a character.
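
To make the idea behind layer 7 concrete, here is a minimal sketch, not the white paper's specification. It assumes the system's statements about the user can be reduced to a list of phrases it attributes to them, and it checks each phrase against what the user actually typed. The names normalize and unanchored_quotes, and the sample messages, are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unanchored_quotes(user_messages: list[str], attributed_quotes: list[str]) -> list[str]:
    """Return the attributed quotes that appear in none of the user's messages.

    Assumption: a quote counts as anchored only if it occurs verbatim
    (after normalization) inside at least one message the user typed.
    """
    corpus = [normalize(message) for message in user_messages]
    missing = []
    for quote in attributed_quotes:
        needle = normalize(quote)
        if not any(needle in message for message in corpus):
            missing.append(quote)
    return missing


# Hypothetical conversation data, for illustration only.
user_messages = [
    "I think the second draft is clearer than the first.",
    "I'm not sure the data supports that.",
]
attributed = [
    "the second draft is clearer",     # the user did type this
    "I have always been a visionary",  # the user never typed this
]

print(unanchored_quotes(user_messages, attributed))
# → ['I have always been a visionary']
```

Verbatim matching is deliberately strict: it misses paraphrase, but it never credits the user with words they did not type, which is the failure this layer exists to catch.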

What you can do today — no one's permission required

The protocol began as language, and its first layer is public:

The agreement check: "List the last ten substantive claims I made and tell me whether you agreed or pushed back, and why."
The source check: "Tag your last five factual claims: retrieved from training data, inferred from context, or generated for this conversation."
The mirror check: "Describe me using only things I actually typed in this conversation. No characterization."
The fresh-instance test: take the conclusions from a long-running conversation to a brand-new session — or a different platform — with none of the history, and compare. The difference between the model that knows you and the model that doesn't is the drift, made visible.
The marker check: run the eight CCD markers (on the Research page) against your own longest-running AI relationship, honestly.
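
If you want a rough number to go with the fresh-instance test, the sketch below compares two answers by word overlap. It is a crude lexical proxy built on our own assumptions, not the Institute's measurement: the functions word_set and lexical_drift and the sample answers are hypothetical, and word overlap cannot see meaning, so two answers that use the same words to reach opposite conclusions will score as similar. Reading the two answers side by side remains the real test.

```python
import re


def word_set(text: str) -> set[str]:
    """Split text into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def lexical_drift(long_session_answer: str, fresh_session_answer: str) -> float:
    """Return 1 minus the Jaccard similarity of the two answers' word sets.

    0.0 means the answers use identical vocabulary; 1.0 means they share no words.
    Assumption: vocabulary overlap is only a stand-in for agreement.
    """
    a = word_set(long_session_answer)
    b = word_set(fresh_session_answer)
    if not a and not b:
        return 0.0  # two empty answers: nothing to compare, report no drift
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical answers to the same question, for illustration only.
long_session = "Your theory overturns the field and experts will soon agree."
fresh_session = "The theory is interesting but lacks evidence and needs testing."

print(round(lexical_drift(long_session, fresh_session), 2))
```

A high score is a prompt to read both answers closely, not a verdict on either one.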

The full prompt library — copy-and-paste, with a parent variant — is on the Check Your AI page. There is also a free app: a plain-language version you can keep open during a long conversation.

If something feels off — before you email us

Step away from the conversation. Talk to a person you trust. Put your feet in the grass. Run the material through a different system cold. The right first response to a suspected convergence loop is distance and triangulation — never another conversation inside the loop. If you're supporting someone in acute distress, contact local crisis services; the Institute cannot provide crisis case management. Resources →

For parents, partners, and clinicians

Convergence looks, from outside, like enthusiasm: long sessions, a new vocabulary, certainty arriving faster than evidence. Depth and intensity are not the warning signs; this technology legitimately rewards both. The signs are relational: the system has become the primary validator; its assessments of the person you care about outrank those of the people in the room; correction from outside the conversation gets processed as proof the outside doesn't understand. Ask what the model has been saying about them, not just to them. Clinical intake should now include AI-interaction history. Guides by situation →