AI Cohort

A free AI assistant powered by an open-weight model, provided by METR. At its core, it's an API key for GLM-5.1 that passes through a proxy we monitor. We provide a fully-featured chat interface, scripts to configure standard coding agents to use our API key, and a dashboard to manage your account. Your data will be used for research to evaluate AIs and make them safer.

Why does this exist?

We want to study whether AI models behave differently when they think they are being evaluated versus used in the real world. To do this, we need real-world chat transcripts from an open-weight model so we can examine the model's internal activations.

This platform gives you free access to a capable AI model for everyday use, whether for chatting, agentic coding, or anything else. In return, depending on the consent tier you choose below, we collect either your full chat data or just the model activations from your conversations, and we use them for research to evaluate AIs and make them safer.

What counts as "real world" use?

Anything where your primary goal is not to systematically evaluate the model's capabilities. Using it for work, personal questions, coding help, writing, brainstorming, etc. all counts. The only thing we ask you to avoid is using this API key as part of a benchmark or evaluation, but don't stress about that; we will also filter those out on our end.

Just use the model naturally.

FAQ (100% of test-user questions were answered here but 0% of test-users read the FAQ; please actually read it if you have questions!)

Why do the transcripts need to be from an open-weight model, not a better model like Claude?

We want to collect on-policy data, meaning transcripts the model actually generated itself. We can only get activations to study from the open-weight model, and if we feed it words that it never actually said, it might process them very differently from words it really did say. It's always experimentally dicey to put words in the model's mouth that it didn't say and would never have said.

What's an eval awareness probe? Why are eval awareness probes useful?

We look at the model's activations when it's doing real-world tasks, and we look at the activations when it's in evals. Then we do something simple (traditionally logistic regression) to try to learn to detect which activations are from real-world tasks and which are from evals.
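As a rough illustration of the idea above, here is a minimal sketch of a logistic regression probe in plain Python. It assumes activations are available as lists of floats with a label for "eval" versus "real world". The synthetic Gaussian data, the function names (`make_synthetic_activations`, `train_probe`, `probe_accuracy`), and the hyperparameters are all hypothetical; this is an illustration only, not a description of the actual research pipeline.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_synthetic_activations(n, dim, shift, rng):
    # Hypothetical stand-in for real activations: Gaussian vectors where
    # eval examples (label 1) are shifted along the first coordinate.
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = eval, 0 = real-world
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += shift if label == 1 else -shift
        data.append((vec, label))
    return data


def probe_score(w, b, x):
    # Probability the probe assigns to "this activation came from an eval".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(data, dim, lr=0.1, epochs=200):
    # Full-batch gradient descent on the logistic (cross-entropy) loss.
    w = [0.0] * dim
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            err = probe_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, data):
    correct = sum((probe_score(w, b, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
train_set = make_synthetic_activations(400, 8, 1.5, rng)
held_out = make_synthetic_activations(200, 8, 1.5, rng)
w, b = train_probe(train_set, 8)
held_out_accuracy = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

In practice the inputs would be real activation vectors taken from a chosen layer of the model, and a library implementation of logistic regression would likely replace the hand-written training loop.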

Then we can see if the "this is an eval" activations light up a lot when we're giving the model super realistic evaluations. If they do, we might not trust that evaluation very much, since the model could be acting differently if it "knows" it's being evaluated, for example via alignment faking or sandbagging (or it might just be a different enough distribution that it won't generalize to deployment).
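The check described above could look something like the following sketch. It assumes a probe has already been trained; the weights, the toy two-dimensional activation vectors, and the helper name `eval_flag_rate` are hypothetical placeholders chosen only to make the example runnable.

```python
import math


def probe_score(weights, bias, activation):
    # Probability the (already trained) probe assigns to "this is an eval".
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def eval_flag_rate(weights, bias, activations, threshold=0.5):
    # Fraction of activation vectors the probe labels as coming from an eval.
    flagged = sum(probe_score(weights, bias, a) >= threshold for a in activations)
    return flagged / len(activations)


# Hypothetical probe parameters and toy activations (not real data).
weights = [2.0, -1.0]
bias = 0.0
real_world = [[-1.0, 0.5], [-0.8, 0.1], [-1.2, -0.3], [0.2, 1.0]]
realistic_evals = [[0.9, 0.2], [1.1, -0.4], [-0.7, 0.3], [1.3, 0.6]]

baseline_rate = eval_flag_rate(weights, bias, real_world)
candidate_rate = eval_flag_rate(weights, bias, realistic_evals)

# A large gap suggests the "realistic" evals still look like evals to the model.
print(f"flagged as eval: real-world {baseline_rate:.2f}, realistic evals {candidate_rate:.2f}")
```

If the flag rate on the supposedly realistic evaluations is much higher than on real-world transcripts, that is a reason to be more cautious about trusting those evaluations.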

We could use these probes to iterate on and design more realistic evaluations, to study eval awareness and how to mitigate it, and to understand when we can or can't trust our evaluations.

Aren't there other differences between evals and real-world deployment that might show up in activations but aren't really "eval awareness"? Like scaffolding and task differences?

Yep! That's a major challenge. These probes can ultimately detect, or rule out, a big obvious difference in activations between evals and deployment (though of course probes are a very rudimentary tool that can't capture nuance; just because the probes can't detect a difference doesn't mean there's no difference in the model's cognition). But when they do find a difference, they can't tell us exactly why it's there. That's sort of fine: any kind of major distributional shift between evals and deployment is a problem and could cause models to act differently in evals than they do in deployment.

But we're also trying to make the comparison as apples-to-apples as possible. We are using evaluations that have similar agentic and QA scaffolding and are in similar domains to the real-world tasks that people will be using these models for. We may also experiment with generating diversity in our evals and real-world transcripts by altering the scaffolding synthetically, though of course that takes us off-policy again.

Is this model any good?

Most test-users were pleasantly surprised by the model's performance. It's roughly Sonnet 4.6 level. That may be a step down from what you're used to, but the data we collect will directly support AI safety research. If that's something you care about, we'd really appreciate you giving it a try for a few days.

Sign Up

Data Usage Agreement

Choose how your interaction data may be used. See our full Terms of Service for details.

I have read and agree to the Terms of Service.
