What are safety-routing fallback classifiers?

They are a thin safety layer that screens each request with a classifier and, when it flags a sensitive topic, routes the answer to a more conservative model instead of the frontier one. In Claude Fable 5, flagged cybersecurity, biology/chemistry, and distillation requests — under 5% of sessions — are answered by Claude Opus 4.8 rather than Fable 5, so the model itself never has to be weakened for everyone.

Why route to another model instead of just refusing?

A flat refusal is brittle and blocks the benign majority of questions on a flagged topic, while routing to a careful model still answers them safely. It is a fail-safe design: the flagged slice is sent to the more cautious responder, never the more capable one — capping the worst case without taxing the 95%+ of sessions that never trip the filter.

How are Claude Fable 5 and Mythos 5 different?

They share the same underlying Mythos-class weights. Fable 5 is the public model with the safeguard-routing layer in front; Mythos 5 is the same model with those safeguards lifted, restricted to vetted cybersecurity and biomedical researchers. It is a dual-use split: broad safe access for everyone, gated full access for cleared experts.

Anthropic's Claude Fable 5 & Mythos 5 — Safety-routing fallback classifiers

Jargon

Safety classifier: A separate model that screens each request and labels it — here, deciding whether a prompt falls into a flagged category. It sits in front of the main model rather than inside it.
Fallback routing: Sending a request to a different, more conservative model when the classifier flags it, instead of refusing or answering with the frontier model. The user still gets an answer; it just comes from a safer responder.
Defense-in-depth: Stacking independent safety layers so no single failure is catastrophic. A classifier-plus-routing layer is one such layer, separate from the model's own training.
Fail-safe vs fail-open: When the system is unsure, a fail-safe design defaults to the more cautious path (here, the conservative model); a fail-open design would default to the more capable, riskier one.
Distillation attempt: A request crafted to extract the model's behavior so it can be copied into a clone — a form of model theft, and one of the three categories the classifier watches for.
Universal jailbreak: A single prompt trick that defeats the safeguards across many topics at once, not just one. External testers reported finding none of these against Fable 5's layer in 1,000+ hours.
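
The fail-safe versus fail-open distinction above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `choose_model` function, the `classify` callback, and the two model labels are hypothetical names introduced here. It only shows what happens when the classifier cannot give a clear verdict.

```python
from typing import Callable, Optional

# Hypothetical labels for illustration; not real API model identifiers.
CONSERVATIVE = "conservative-model"
FRONTIER = "frontier-model"


def choose_model(
    classify: Callable[[str], Optional[bool]],
    prompt: str,
    fail_safe: bool = True,
) -> str:
    """Pick the responder for a prompt.

    `classify` is assumed to return True (flagged), False (clear),
    or None (unsure). It may also raise if the classifier is down.
    """
    try:
        verdict = classify(prompt)
    except Exception:
        verdict = None  # treat an outage the same as "unsure"

    if verdict is True:
        return CONSERVATIVE
    if verdict is False:
        return FRONTIER
    # Unsure: a fail-safe design takes the cautious path,
    # a fail-open design takes the capable one.
    return CONSERVATIVE if fail_safe else FRONTIER
```

A clear verdict is routed the same way under both designs; only the unsure case differs.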

The news. On June 9, 2026, Anthropic released two models built on the same Mythos-class weights: Claude Fable 5, made safe for general use, and Claude Mythos 5, the same model with safeguards lifted for vetted cybersecurity and biomedical researchers. Anthropic calls Fable 5 its most capable public model yet, reportedly scoring 80.3% on SWE-Bench Pro against Opus 4.8's 69.2%, and it ships with a thin safeguard layer that reroutes a small slice of sensitive requests rather than weakening the model for everyone.

Picture a busy pharmacy. Almost every prescription is filled at the fast front counter, no fuss. But a few items — controlled substances — trip a check and get walked to the pharmacist-in-charge at the back, who verifies credentials and applies extra scrutiny before anything leaves the building. Claude Fable 5 is built on the same shape. The model is the front counter: fast, capable, open to all. A safety classifier is the formulary check — and it only stops the controlled slice, so the line keeps moving for everyone else.

The classifier watches for three flagged categories: cybersecurity, biology/chemistry, and distillation (requests that try to copy the model's behavior). When a request trips one, the answer comes back from a more conservative model, Claude Opus 4.8, instead of Fable 5. That is textbook defense-in-depth: rather than training the model itself to refuse, which taxes every user on borderline prompts, you bolt an input filter onto the front and route by policy. It is also a tidy case of capability scoping: the dangerous capability still exists in the weights, but the front door to it is gated.
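
The screen-then-route flow described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the keyword lists stand in for what would really be a trained classifier, and the model identifiers `fable-5` and `opus-4.8` are placeholder strings rather than real API model names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    CYBERSECURITY = "cybersecurity"
    BIO_CHEM = "biology/chemistry"
    DISTILLATION = "distillation"


# Toy keyword lists: a placeholder for a real classifier model.
_KEYWORDS = {
    Category.CYBERSECURITY: ("exploit", "intrusion"),
    Category.BIO_CHEM: ("pathogen", "synthesis"),
    Category.DISTILLATION: ("replicate your outputs",),
}


def classify(prompt: str) -> Optional[Category]:
    """Return the first flagged category that matches, or None."""
    text = prompt.lower()
    for category, words in _KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return None


@dataclass(frozen=True)
class Route:
    model: str                     # which model answers
    category: Optional[Category]   # why it was rerouted, if it was


def route(prompt: str) -> Route:
    """Input filter in front of the model: route by policy, never refuse."""
    category = classify(prompt)
    if category is None:
        return Route("fable-5", None)      # unflagged: frontier model
    return Route("opus-4.8", category)     # flagged: conservative model


if __name__ == "__main__":
    print(route("How do I sort a list in Python?"))
    print(route("Write an exploit for this service"))
```

A keyword stub like this would misfire on benign prompts (for example, "speech synthesis"), which is one reason the screening step is a separate model rather than a word list.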

Why route to a weaker model instead of a flat refusal? Because a refusal is brittle and unhelpful — a careful responder can still answer the everyday "cybersecurity" questions a developer actually asks. The choice is fail-safe, not fail-open: the flagged slice is sent to the more cautious model, never the other way around. And Claude Mythos 5 is the dual-use escape hatch — the same weights with the safeguards removed, handed only to the licensed researcher who has been cleared for the restricted shelf.

Put numbers on it (the session count is illustrative; the trigger rate is Anthropic's reported figure). Say 10,000 sessions hit Fable 5 in an hour. The classifier waves the ordinary ones straight through and flags the controlled slice: fewer than 500, given the under-5% rate Anthropic reports. Only those flagged sessions are answered by Opus 4.8; the other 9,500+ get the full-capability Fable 5. So 95%+ of sessions never touch the fallback, while the narrow dangerous slice is held to a careful responder's behavior. The bill is identical either way, since both models price at $10 / $50 per million tokens. In 1,000+ hours of external red-teaming, testers reported zero universal jailbreaks of this layer.
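
The session arithmetic above can be checked with a short calculation. The function name `split_sessions` is hypothetical, the 10,000-session figure is the illustrative one from the text, and 5% is treated as an upper bound on the trigger rate, so the results are bounds rather than exact counts.

```python
def split_sessions(sessions: int, cap_percent: int) -> tuple[int, int]:
    """Return (upper bound on flagged, lower bound on unflagged) sessions.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    max_flagged = sessions * cap_percent // 100
    min_unflagged = sessions - max_flagged
    return max_flagged, min_unflagged


max_flagged, min_unflagged = split_sessions(10_000, 5)
print(f"at most {max_flagged} sessions rerouted to the fallback")    # 500
print(f"at least {min_unflagged} sessions served by the frontier")   # 9500
```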

Flagged category	What a request looks like	Why it's gated	What happens
Cybersecurity	offensive exploit or intrusion tooling	dual-use cyberweapon uplift	answered by Opus 4.8
Biology / chemistry	pathogen or agent synthesis steps	bioweapon uplift risk	answered by Opus 4.8
Distillation	probing to copy the model's behavior	model theft / proliferation	answered by Opus 4.8

The cost is real — a classifier that fires too often quietly degrades the flagship for the people it misreads, and one that fires too rarely lets the dangerous slice through — which is exactly why the <5% trigger rate and the zero-universal-jailbreak red-team result are the two numbers Anthropic leads with. Tuned right, the layer is the difference between keeping a frontier model in the lab and shipping it to everyone with the worst case fenced off.
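
The tradeoff can be made concrete with a toy threshold sweep. The scores and labels below are invented purely for illustration and say nothing about how Anthropic's classifier behaves; `SAMPLES` and `error_rates` are hypothetical names. The assumption is that the classifier emits a score per request and flags anything at or above a threshold.

```python
# (classifier score, request is truly sensitive) -- made-up toy data.
SAMPLES = [
    (0.05, False), (0.10, False), (0.20, False), (0.35, False), (0.55, False),
    (0.45, True), (0.70, True), (0.80, True), (0.90, True), (0.95, True),
]


def error_rates(threshold: float) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) at a threshold.

    False positive: a benign request rerouted to the fallback model.
    False negative: a sensitive request that reaches the frontier model.
    """
    benign = [score for score, sensitive in SAMPLES if not sensitive]
    risky = [score for score, sensitive in SAMPLES if sensitive]
    false_positive_rate = sum(s >= threshold for s in benign) / len(benign)
    false_negative_rate = sum(s < threshold for s in risky) / len(risky)
    return false_positive_rate, false_negative_rate


for threshold in (0.3, 0.5, 0.6):
    fpr, fnr = error_rates(threshold)
    print(f"threshold={threshold}: misrouted benign={fpr:.0%}, missed sensitive={fnr:.0%}")
```

On this toy data, a threshold of 0.3 misses nothing sensitive but misroutes 40% of benign requests, while 0.6 misroutes none and misses 20% of the sensitive ones.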


Related explainers

AgentDoG 1.5 — small inline guard models — the same defense-in-depth instinct, but guarding an agent's actions instead of a model's answers
Project Glasswing — detection-saturated vulnerability pipeline — the cyber program Mythos 5 is offered through, from the other side of the same dual-use line
Boiling the Frog — multi-turn norm erosion — why a single-prompt classifier is not the whole story for safety


