The news. On June 9, 2026, Anthropic released two models built on the same Mythos-class weights: Claude Fable 5, made safe for general use, and Claude Mythos 5, the same model with safeguards lifted for vetted cybersecurity and biomedical researchers. Anthropic calls Fable 5 its most capable public model yet, with a reported 80.3% on SWE-Bench Pro against 69.2% for Opus 4.8. It ships with a thin safeguard layer that reroutes a small slice of sensitive requests rather than weakening the model for everyone.

Picture a busy pharmacy. Almost every prescription is filled at the fast front counter, no fuss. But a few items (controlled substances) trip a check and get walked to the pharmacist-in-charge at the back, who verifies credentials and applies extra scrutiny before anything leaves the building. Claude Fable 5 follows the same pattern. The model is the front counter: fast, capable, open to all. A safety classifier is the controlled-substance check, and it stops only the controlled slice, so the line keeps moving for everyone else.

The classifier watches for three flagged categories: cybersecurity, biology/chemistry, and distillation (requests that try to copy the model's behavior). When a request trips one, the answer comes back from a more conservative model, Claude Opus 4.8, instead of Fable 5. That is textbook defense-in-depth: rather than training the model itself to refuse, which taxes every user on borderline prompts, you bolt an input filter onto the front and route by policy. It is also a tidy case of capability scoping: the dangerous capability still exists in the weights, but the front door to it is gated.
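The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the keyword-based `classify` function, the keyword lists, the `Route` type, and the model identifier strings are all hypothetical stand-ins for a trained classifier and real model endpoints.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers; the text does not give real API model names.
FLAGSHIP = "claude-fable-5"
FALLBACK = "claude-opus-4.8"

# Toy keyword lists standing in for a trained safety classifier (assumption).
FLAGGED_KEYWORDS = {
    "cybersecurity": ("exploit", "intrusion"),
    "biology/chemistry": ("pathogen", "synthesis route"),
    "distillation": ("replicate your outputs", "clone your behavior"),
}


@dataclass(frozen=True)
class Route:
    model: str
    category: Optional[str]  # None when nothing was flagged


def classify(prompt: str) -> Optional[str]:
    """Return the first flagged category whose keyword appears, else None."""
    text = prompt.lower()
    for category, keywords in FLAGGED_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def route(prompt: str) -> Route:
    """Send flagged prompts to the conservative model, the rest to the flagship."""
    category = classify(prompt)
    model = FALLBACK if category is not None else FLAGSHIP
    return Route(model=model, category=category)
```

In this sketch the routing decision lives entirely outside the model: an unflagged prompt such as "How do I reverse a linked list?" goes to the flagship untouched, and only a prompt that matches a flagged category is answered by the fallback.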

Why route to a weaker model instead of a flat refusal? Because a refusal is brittle and unhelpful — a careful responder can still answer the everyday "cybersecurity" questions a developer actually asks. The choice is fail-safe, not fail-open: the flagged slice is sent to the more cautious model, never the other way around. And Claude Mythos 5 is the dual-use escape hatch — the same weights with the safeguards removed, handed only to the licensed researcher who has been cleared for the restricted shelf.
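One way to read "fail-safe, not fail-open" in code is to treat a classifier failure as a flag. The text does not say how Anthropic handles classifier errors, so that behavior is an assumption here, and `pick_model`, `broken_classifier`, and the model name strings are hypothetical.

```python
from typing import Callable

FLAGSHIP = "claude-fable-5"    # hypothetical identifier
FALLBACK = "claude-opus-4.8"   # hypothetical identifier


def pick_model(prompt: str, is_flagged: Callable[[str], bool]) -> str:
    """Choose a model; any classifier failure falls back to the cautious one."""
    try:
        flagged = is_flagged(prompt)
    except Exception:
        # Assumption: a crashing or unavailable classifier counts as a flag,
        # so errors can only make the system more conservative.
        flagged = True
    return FALLBACK if flagged else FLAGSHIP


def broken_classifier(prompt: str) -> bool:
    """Simulate a classifier outage."""
    raise TimeoutError("classifier unavailable")
```

With this shape, `pick_model("hello", broken_classifier)` returns the fallback model: the only direction an error can push a request is toward the more cautious responder.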

Put numbers on it (the session count is illustrative; the trigger rate is Anthropic's reported figure). Say 10,000 sessions hit Fable 5 in an hour. The classifier waves the ordinary ones straight through and flags the controlled slice: fewer than 500, given the under-5% rate Anthropic reports. Only those flagged sessions are answered by Opus 4.8; the other 9,500+ get the full-capability Fable 5. So 95%+ of sessions never touch the fallback, while the narrow dangerous slice is held to a careful responder's behavior. The bill is identical either way, since both models are priced at $10 / $50 per million tokens. In 1,000+ hours of external red-teaming, testers reported zero universal jailbreaks of this layer.
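The same arithmetic as a small Python sketch. The 10,000-session hour is the illustrative figure from the text, 5% is used as the upper bound on the trigger rate, and the function name `split_sessions` is made up for this example.

```python
def split_sessions(total_sessions: int, trigger_rate: float) -> tuple[int, int]:
    """Return (rerouted, flagship) session counts for a given trigger rate."""
    rerouted = round(total_sessions * trigger_rate)
    return rerouted, total_sessions - rerouted


# 5% is an upper bound: the reported trigger rate is "under 5%".
rerouted, flagship = split_sessions(10_000, 0.05)
print(f"rerouted to fallback: at most {rerouted}")
print(f"served by flagship:   at least {flagship}")
```

Because 5% is a ceiling, the 500 is a worst case for the fallback and the 9,500 a floor for the flagship.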

| Flagged category | What a request looks like | Why it's gated | What happens |
| --- | --- | --- | --- |
| Cybersecurity | offensive exploit or intrusion tooling | dual-use cyberweapon uplift | answered by Opus 4.8 |
| Biology / chemistry | pathogen or agent synthesis steps | bioweapon uplift risk | answered by Opus 4.8 |
| Distillation | probing to copy the model's behavior | model theft / proliferation | answered by Opus 4.8 |

The cost is real. A classifier that fires too often quietly degrades the flagship for the people it misreads, and one that fires too rarely lets the dangerous slice through. That is exactly why the under-5% trigger rate and the zero-universal-jailbreak red-team result are the two numbers Anthropic leads with. Tuned right, the layer is the difference between keeping a frontier model in the lab and shipping it to everyone with the worst case fenced off.
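A toy threshold sweep makes the trade-off concrete. Everything in it is invented for illustration: the scores, the labels, and the assumption that the classifier emits a single risk score compared against a threshold. None of it reflects Anthropic's actual classifier or data.

```python
# (risk_score, truly_dangerous) pairs; invented toy data, not real measurements.
SAMPLES = [
    (0.05, False), (0.10, False), (0.20, False), (0.35, False),
    (0.65, False), (0.50, True), (0.80, True), (0.95, True),
]


def error_counts(threshold: float) -> tuple[int, int]:
    """Return (benign requests rerouted, dangerous requests missed)."""
    false_positives = sum(
        1 for score, dangerous in SAMPLES if score >= threshold and not dangerous
    )
    false_negatives = sum(
        1 for score, dangerous in SAMPLES if score < threshold and dangerous
    )
    return false_positives, false_negatives


for threshold in (0.3, 0.6, 0.9):
    rerouted_benign, missed_dangerous = error_counts(threshold)
    print(
        f"threshold={threshold}: {rerouted_benign} benign rerouted, "
        f"{missed_dangerous} dangerous missed"
    )
```

On this toy data a low threshold reroutes two benign requests and misses nothing, a high one reroutes nothing and misses two dangerous requests, and the middle setting trades one error of each kind.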


Frequently Asked Questions