The news. On May 21, 2026, IBM researchers Asaf Yehudai, Lilach Eden, and Michal Shmueli-Scheuer posted Agentic CLEAR to arXiv. It is an automatic evaluation framework that sits above an agent's observability layer and emits textual insights at three granularities: system (patterns across runs), trace (one full execution), and node (one decision point). Unlike taxonomy-driven evals that force errors into preset categories, CLEAR generates qualitative descriptions that align with human-annotated errors and predict task success rate. The authors validate this across 4 benchmarks and 7 agentic settings, over tens of thousands of LLM calls.
Picture an aviation safety analyst. They never fly the plane and they never bolt on new sensors: every plane already carries a flight recorder, and the analyst's whole job is to read what it captured. That is the seat Agentic CLEAR takes. Your agent's observability tooling already records every step (each tool call, each decision, each retry) into a trace. CLEAR reads that existing trace rather than re-running the agent. The recording was always there; almost nobody reads it.
Now picture the analyst working at three zoom levels. They pull up one cockpit decision to ask whether that single call was sound — that is the node. They replay one whole flight, takeoff to landing, to see how the run unfolded — that is the trace. And they step back over the fleet's flight history to find the pattern that repeats across many flights — that is the system level. The same recorder data, read at three different distances, answers three different questions. A single pass/fail score usually stands at just one distance, and a blurry one: it tells you the flight landed, never which decision nearly ended it.
What lets CLEAR write a real finding instead of a checkbox is hierarchy-aware prompting: the LLM judge is asked a different question at each level — "what went wrong in this step?", "how did this run unfold?", "what fails across the whole history?" — and answers each in plain language. Because the output is free text and not a category label, a failure mode the team has never seen before still gets described. A taxonomy-driven eval mostly counts the errors someone already thought to list; CLEAR can describe the one nobody did.
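To make the idea of asking a different question at each level concrete, here is a minimal sketch. It is not the paper's implementation: the prompt wording, the `build_prompt` and `review` helpers, and the `stub_judge` stand-in are all hypothetical, and the three questions simply reuse the ones quoted above.

```python
from typing import Callable

# Hypothetical question table, reusing the three questions quoted above.
# These are illustrative, not the prompts used in the paper.
LEVEL_QUESTIONS = {
    "node": "What went wrong in this step?",
    "trace": "How did this run unfold?",
    "system": "What fails across the whole history?",
}


def build_prompt(level: str, evidence: str) -> str:
    """Pair the level-specific question with the evidence for that level."""
    if level not in LEVEL_QUESTIONS:
        raise ValueError(f"unknown level: {level!r}")
    return (
        f"You are reviewing an agent at the {level} level.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {LEVEL_QUESTIONS[level]}\n"
        "Answer in plain language. Do not pick from a fixed list of categories."
    )


def review(level: str, evidence: str, judge: Callable[[str], str]) -> str:
    """Send the prompt to any judge callable and return its free-text finding."""
    return judge(build_prompt(level, evidence))


def stub_judge(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return f"(free-text finding for a prompt of {len(prompt)} characters)"


finding = review("node", "tool=search status=error retry=1 status=ok", stub_judge)
print(finding)
```

The point of the design is that only the question and the evidence change between levels; the instruction to answer in free text, rather than choose a label, stays constant.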
Why one number hides the run
Here is why granularity earns its keep. Take an illustrative batch of 20 runs of one agent. A pass/fail eval reports 14 of 20 passed, prints a tidy 70% success rate, and a team ships on it. Now read the same traces with CLEAR. At the node level, the judge notices that 9 of those 14 "passes" only succeeded because a retry quietly masked the same shaky tool call. At the trace level, it sees that brittle call drive plan drift in 6 of the runs. At the system level, it names the single dominant pattern: the agent is one flaky tool away from collapse. Strip out the 9 retry-masked passes and only 5 of the 20 runs passed cleanly. The honest read is not 70%; it is a fragile 70% that is really a 25% you can trust. (The 20/14/9/6 counts and the 25% read are illustrative; the 4 benchmarks, 7 agentic settings, and "predicts task success rate" come from the paper.) The single number averaged the fragility away; the three zoom levels are what made it visible.
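The arithmetic behind the two percentages can be checked in a few lines. The counts are the illustrative ones from the paragraph above, not results from the paper, and the variable names are made up for this sketch.

```python
# Illustrative counts from the example above (not results from the paper).
total_runs = 20
passed = 14
passes_masked_by_retry = 9  # passes that only succeeded because a retry hid a shaky tool call

reported_rate = passed / total_runs             # what the pass/fail eval prints
clean_passes = passed - passes_masked_by_retry  # passes that needed no masking retry
trusted_rate = clean_passes / total_runs        # the share of runs you can trust

print(f"reported: {reported_rate:.0%}")  # reported: 70%
print(f"clean passes: {clean_passes}")   # clean passes: 5
print(f"trusted: {trusted_rate:.0%}")    # trusted: 25%
```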
| Level | What it reads | The question it answers | What a single score misses |
|---|---|---|---|
| node | one decision or tool call | was this one step sound? | a recovered-but-shaky call hidden inside a "pass" |
| trace | one full run, start to finish | how did this run unfold? | plan drift, premature stopping, compounding errors |
| system | runs across the agent's history | what fails across the whole agent? | the dominant failure pattern no single run reveals |
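The "What it reads" column can be pictured as three slices of the same recorded data. The sketch below assumes a hypothetical `Node` record and made-up sample steps; a real observability layer would supply its own schema. The system view here is a plain count, a stand-in for aggregation and not CLEAR's method, which produces textual insights from an LLM judge.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    """One recorded decision point (hypothetical schema for illustration)."""
    run_id: str
    step: int
    tool: str
    ok: bool
    retried: bool = False


# Made-up recorded steps across three runs.
records = [
    Node("run-1", 0, "search", ok=True),
    Node("run-1", 1, "fetch", ok=True, retried=True),
    Node("run-2", 0, "search", ok=True),
    Node("run-2", 1, "fetch", ok=False),
    Node("run-3", 0, "fetch", ok=True, retried=True),
]


def node_view(records, run_id, step):
    """Node level: one decision or tool call."""
    return next(n for n in records if n.run_id == run_id and n.step == step)


def trace_view(records, run_id):
    """Trace level: one full run, in step order."""
    return sorted((n for n in records if n.run_id == run_id), key=lambda n: n.step)


def system_view(records):
    """System level: across runs, count steps per tool that failed or needed a retry."""
    return Counter(n.tool for n in records if n.retried or not n.ok)


print(node_view(records, "run-1", 1).retried)  # True
print(len(trace_view(records, "run-2")))       # 2
print(system_view(records).most_common(1))     # [('fetch', 3)]
```

Note that `run-1` would count as a pass on a final-score eval, yet its `fetch` step shows up in the system-level tally alongside the outright failure in `run-2`.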
Related explainers
This explainer covers a single concept, so its closest neighbors are other explainers on how to evaluate an agent's process, not just its final score:
- FutureSim — harness-level agent eval — the level above: FutureSim scores the whole agent harness end-to-end, and CLEAR extends that downward to the trace and the single node
- TelBench — span-level error localization — the sibling granularity move: pinpointing which span of a run failed, the same "zoom in past the final grade" instinct
- Agent leaderboards mislead under distribution shift (IBM) — the contrast: predictive validity asks whether a ranking is valid; CLEAR asks what a single run was actually doing