The news. On May 21, 2026, IBM researchers Asaf Yehudai, Lilach Eden, and Michal Shmueli-Scheuer posted Agentic CLEAR to arXiv: an automatic evaluation framework that sits above an agent's observability layer and emits textual insights at three granularities: system (patterns across runs), trace (one full execution), and node (one decision point). Unlike taxonomy-driven evals that force errors into preset categories, CLEAR generates qualitative descriptions that align with human-annotated errors and predict task success rate, validated across 4 benchmarks and 7 agentic settings over tens of thousands of LLM calls.

Picture an aviation safety analyst. They never fly the plane and they never bolt on new sensors. Every plane already carries a flight recorder, and the analyst's whole job is to read what those recorders captured. That is exactly the seat Agentic CLEAR takes. Your agent's observability tooling is already recording every step (each tool call, each decision, each retry) into a trace, and CLEAR reads that existing trace rather than re-running the agent. The recording was always there; almost nobody reads it.

Now picture the analyst working at three zoom levels. They pull up one cockpit decision to ask whether that single call was sound — that is the node. They replay one whole flight, takeoff to landing, to see how the run unfolded — that is the trace. And they step back over the fleet's flight history to find the pattern that repeats across many flights — that is the system level. The same recorder data, read at three different distances, answers three different questions. A single pass/fail score usually stands at just one distance, and a blurry one: it tells you the flight landed, never which decision nearly ended it.
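The three distances can be pictured as three views over the same recorded data. The following is a minimal sketch, assuming a hypothetical trace format: the `Node` and `Trace` classes, their fields, and the three view functions are illustrative names, not CLEAR's actual schema or any observability tool's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One decision point or tool call (hypothetical fields)."""
    step: int
    action: str
    ok: bool


@dataclass
class Trace:
    """One full run, stored as a list of nodes."""
    run_id: str
    nodes: list[Node] = field(default_factory=list)


def node_view(trace: Trace, step: int) -> Node:
    """Node level: pull up a single decision from one run."""
    return next(n for n in trace.nodes if n.step == step)


def trace_view(trace: Trace) -> list[str]:
    """Trace level: replay one run in step order."""
    ordered = sorted(trace.nodes, key=lambda n: n.step)
    return [f"{n.step}:{n.action}:{'ok' if n.ok else 'fail'}" for n in ordered]


def system_view(traces: list[Trace]) -> dict[str, int]:
    """System level: count failed actions across every recorded run."""
    counts: dict[str, int] = {}
    for trace in traces:
        for node in trace.nodes:
            if not node.ok:
                counts[node.action] = counts.get(node.action, 0) + 1
    return counts
```

Nothing new is recorded for any of the three views; each one only slices data the trace already holds, which is the point of the flight-recorder analogy.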

What lets CLEAR write a real finding instead of a checkbox is hierarchy-aware prompting: the LLM judge is asked a different question at each level — "what went wrong in this step?", "how did this run unfold?", "what fails across the whole history?" — and answers each in plain language. Because the output is free text and not a category label, a failure mode the team has never seen before still gets described. A taxonomy-driven eval mostly counts the errors someone already thought to list; CLEAR can describe the one nobody did.
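One way to picture hierarchy-aware prompting is a judge that swaps its question by level and returns free text. This is a sketch under stated assumptions: the three questions are the ones quoted above, but the surrounding prompt wording, the `build_prompt` and `judge` helpers, and the `llm` callable are hypothetical and are not taken from the paper.

```python
from typing import Callable

# The three questions are quoted from the text above.
# Everything else in the prompt template is an assumption for illustration.
LEVEL_QUESTIONS = {
    "node": "What went wrong in this step?",
    "trace": "How did this run unfold?",
    "system": "What fails across the whole history?",
}


def build_prompt(level: str, evidence: str) -> str:
    """Build a level-specific prompt that asks for a plain-language answer."""
    if level not in LEVEL_QUESTIONS:
        raise ValueError(f"unknown level: {level!r}")
    return (
        f"You are reviewing an agent at the {level} level.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {LEVEL_QUESTIONS[level]}\n"
        "Answer in plain language; do not pick from a fixed category list."
    )


def judge(level: str, evidence: str, llm: Callable[[str], str]) -> str:
    """Send the level-specific prompt to any text-in, text-out model."""
    return llm(build_prompt(level, evidence))
```

Because `judge` returns whatever text the model writes rather than a label from an enum, a failure mode with no existing category can still come back as a description.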

Why one number hides the run

Here is why granularity earns its keep. Take an illustrative batch of 20 runs of one agent. A pass/fail eval reports 14 of 20 passed, prints a tidy 70% success rate, and a team ships on it. Now read the same traces with CLEAR. At the node level, the judge notices that 9 of those 14 "passes" only succeeded because a retry quietly masked the same shaky tool call. At the trace level, it sees that brittle call drive plan drift in 6 of the runs. At the system level, it names the single dominant pattern: the agent is one flaky tool away from collapse. Only 14 − 9 = 5 runs passed cleanly, and 5 of 20 is 25%. The honest read is not 70%; it is a fragile 70% that is really a 25% you can trust. (The 20/14/9/6 counts and the 25% read are illustrative; the 4 benchmarks, 7 agent settings, and "predicts task success rate" come from the paper.) The single number averaged the fragility away; the three zoom levels are what made it visible.
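The arithmetic behind the two readings is short enough to write out. This sketch uses only the illustrative counts from the paragraph above; the `success_rates` function and its parameter names are hypothetical.

```python
def success_rates(total_runs: int, passed: int, retry_masked: int) -> tuple[float, float]:
    """Return (headline rate, trusted rate).

    headline: every pass counts, including ones rescued by a retry.
    trusted:  only passes that did not depend on a masking retry.
    """
    clean_passes = passed - retry_masked
    return passed / total_runs, clean_passes / total_runs


headline, trusted = success_rates(total_runs=20, passed=14, retry_masked=9)
print(f"headline: {headline:.0%}")  # headline: 70%
print(f"trusted:  {trusted:.0%}")   # trusted:  25%
```

The pass/fail eval only ever sees the first number; the second needs the node-level observation that 9 passes were rescued by a retry.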

| Level | What it reads | The question it answers | What a single score misses |
| --- | --- | --- | --- |
| node | one decision or tool call | was this one step sound? | a recovered-but-shaky call hidden inside a "pass" |
| trace | one full run, start to finish | how did this run unfold? | plan drift, premature stopping, compounding errors |
| system | runs across the agent's history | what fails across the whole agent? | the dominant failure pattern no single run reveals |
