What is Agentic CLEAR?

Agentic CLEAR is an automatic evaluation framework from IBM that sits above an agent's observability layer and emits textual findings at three granularities: system (patterns across runs), trace (one full execution), and node (one decision or tool call). It reads the trace data your agent already logs and uses an LLM judge with hierarchy-aware prompting to write a plain-language analysis at each level, rather than collapsing a whole run to a single pass/fail score.

What is system/trace/node eval granularity?

It is the idea that an agent can be evaluated at three zoom levels instead of one. Node is a single decision or tool call; trace is one whole run from start to finish; system is the pattern across the agent's run history. The same recorded trace data, read at three distances, answers three different questions — which a single aggregate score usually cannot, because it stands at just one blurry distance.

How is CLEAR different from taxonomy-driven agent evals?

Taxonomy-driven evals sort each error into a fixed list of preset categories, which is fast but blind to any failure mode the list never anticipated. CLEAR is taxonomy-free: its LLM judge writes a free-text qualitative description, so a brand-new failure mode is captured in words without anyone editing a category list. IBM reports these descriptions align with human-annotated errors and predict task success rate across 4 benchmarks and 7 agentic settings.

Agentic CLEAR grades agents at three zoom levels (IBM) — System/trace/node eval granularity

TL;DR

What is it: A new IBM paper introduces Agentic CLEAR, an automatic evaluation framework that sits above an agent's observability layer and writes textual findings at three granularities — system (across runs), trace (one full execution), and node (one decision or tool call).
Why it’s needed: A pass/fail score tells you that a run worked, never where it nearly broke or why. CLEAR reads the trace data your agent already logs and turns it into a diagnosis at the zoom level you need — the core skill of Observability and Evals & Diagnostics.
vs previous: Where taxonomy-driven evals force every error into a preset category (and miss any failure mode the taxonomy never anticipated), CLEAR emits free-text qualitative descriptions, so a brand-new failure mode is described in words with no taxonomy edit.

Jargon

Observability trace: The recorded log of one agent run: every span, tool call, and decision, captured by the instrumentation you already run. CLEAR is built to consume this existing data rather than require new instrumentation.
Eval granularity (system / trace / node): Node = one decision or tool call. Trace = one whole run, start to finish. System = patterns across the agent's run history. CLEAR emits a separate analysis at each level instead of collapsing everything to one number.
LLM judge: Using a language model to grade or analyze another model's output. CLEAR's judge reads trace data and writes the finding; the trick is the prompting (below), which asks a different question at each granularity.
Hierarchy-aware prompting: The judge is prompted differently per level: node prompts ask "what went wrong in this one step?", trace prompts ask "how did this run unfold?", system prompts ask "what fails across the whole history?" — so each level answers its own question.
Taxonomy-driven vs taxonomy-free: Taxonomy-driven evals sort each error into a fixed list of categories — fast, but blind to anything the list never anticipated. Taxonomy-free means CLEAR writes a free-text description, so a never-seen failure mode is captured without editing a category list.
Task success rate: The fraction of runs that complete the task correctly — the outcome metric teams ultimately care about. A CLEAR result is validated partly by how well its qualitative findings predict this number.
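
To make the hierarchy-aware prompting entry concrete, here is a minimal Python sketch. It is not the paper's implementation: the questions reuse the three quoted in the entry above, while `LEVEL_QUESTIONS`, `build_prompt`, `analyze`, and `echo_judge` are hypothetical names, and the judge is a stand-in function rather than a real LLM client.

```python
from typing import Callable

# Hypothetical per-level questions, reusing the ones quoted in the jargon entry.
LEVEL_QUESTIONS = {
    "node": "What went wrong in this one step?",
    "trace": "How did this run unfold?",
    "system": "What fails across the whole history?",
}


def build_prompt(level: str, evidence: str) -> str:
    """Pair the level-specific question with the trace data for that level."""
    if level not in LEVEL_QUESTIONS:
        raise ValueError(f"unknown level: {level!r}")
    return (
        f"You are reviewing an agent at the {level} level.\n"
        f"Question: {LEVEL_QUESTIONS[level]}\n"
        f"Evidence:\n{evidence}\n"
        "Answer in plain language."
    )


def analyze(level: str, evidence: str, judge: Callable[[str], str]) -> str:
    """Send the level-specific prompt to any judge callable and return its text."""
    return judge(build_prompt(level, evidence))


def echo_judge(prompt: str) -> str:
    """Stand-in judge for demonstration: returns the question line it was asked."""
    return prompt.splitlines()[1]


if __name__ == "__main__":
    print(analyze("node", "search_tool returned HTTP 500, then retried", echo_judge))
```

Passing the judge in as a plain callable keeps the sketch independent of any particular model provider; a real judge would replace `echo_judge` with a function that calls an LLM and returns its written finding.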

The news. On May 21, 2026, IBM researchers Asaf Yehudai, Lilach Eden, and Michal Shmueli-Scheuer posted Agentic CLEAR to arXiv. It is an automatic evaluation framework that sits above an agent's observability layer and emits textual insights at three granularities: system (patterns across runs), trace (one full execution), and node (one decision or tool call). Unlike taxonomy-driven evals that force errors into preset categories, CLEAR generates qualitative descriptions that align with human-annotated errors and predict task success rate, validated across 4 benchmarks and 7 agentic settings over tens of thousands of LLM calls.

Picture an aviation safety analyst. They never fly the plane and they never bolt on new sensors. Every plane already carries a flight recorder, and the analyst's whole job is to read what those recorders already captured. That is exactly the seat Agentic CLEAR takes. Your agent's observability tooling is already recording every step (each tool call, each decision, each retry) into a trace. CLEAR works from that existing trace rather than re-running the agent. The recording was always there; almost nobody reads it.

Now picture the analyst working at three zoom levels. They pull up one cockpit decision to ask whether that single call was sound — that is the node. They replay one whole flight, takeoff to landing, to see how the run unfolded — that is the trace. And they step back over the fleet's flight history to find the pattern that repeats across many flights — that is the system level. The same recorder data, read at three different distances, answers three different questions. A single pass/fail score usually stands at just one distance, and a blurry one: it tells you the flight landed, never which decision nearly ended it.
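
One way to picture the three distances in code is as three views over the same records. This is a minimal sketch under assumed names: the `Node` and `Trace` dataclasses, their fields, and the sample `history` are hypothetical and do not reflect CLEAR's or any observability tool's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One decision or tool call inside a run (hypothetical schema)."""
    tool: str
    ok: bool


@dataclass
class Trace:
    """One whole run, start to finish: an ordered list of nodes."""
    run_id: str
    nodes: list[Node] = field(default_factory=list)


def node_view(trace: Trace, index: int) -> Node:
    """Node level: look at a single step."""
    return trace.nodes[index]


def trace_view(trace: Trace) -> list[str]:
    """Trace level: the ordered story of one run."""
    return [f"{n.tool}:{'ok' if n.ok else 'fail'}" for n in trace.nodes]


def system_view(history: list[Trace]) -> Counter:
    """System level: count failing tools across every run in the history."""
    return Counter(n.tool for t in history for n in t.nodes if not n.ok)


# Made-up sample data: two runs that both hit a failing search call.
history = [
    Trace("run-1", [Node("search", False), Node("search", True), Node("write", True)]),
    Trace("run-2", [Node("search", False), Node("write", True)]),
]

print(node_view(history[0], 0))             # one step
print(trace_view(history[0]))               # one run
print(system_view(history).most_common(1))  # pattern across runs
```

Nothing is re-recorded between the three calls; only the slice of `history` being read changes, which is the point of the zoom-level framing.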

What lets CLEAR write a real finding instead of a checkbox is hierarchy-aware prompting: the LLM judge is asked a different question at each level — "what went wrong in this step?", "how did this run unfold?", "what fails across the whole history?" — and answers each in plain language. Because the output is free text and not a category label, a failure mode the team has never seen before still gets described. A taxonomy-driven eval mostly counts the errors someone already thought to list; CLEAR can describe the one nobody did.
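
The contrast between a category label and a free-text finding can be sketched in a few lines of Python. Everything here is illustrative: the category names, the keyword rules, and the `describe` stand-in are hypothetical, and in CLEAR the description is written by an LLM judge rather than a string template.

```python
from enum import Enum


class ErrorCategory(Enum):
    """A fixed, hypothetical taxonomy decided ahead of time."""
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"
    UNCATEGORIZED = "uncategorized"


# Keyword rules standing in for a taxonomy-driven classifier.
RULES = {
    "http 500": ErrorCategory.TOOL_ERROR,
    "timed out": ErrorCategory.TIMEOUT,
}


def classify(event: str) -> ErrorCategory:
    """Taxonomy-driven: return the first matching preset category."""
    lowered = event.lower()
    for keyword, category in RULES.items():
        if keyword in lowered:
            return category
    return ErrorCategory.UNCATEGORIZED


def describe(event: str) -> str:
    """Taxonomy-free stand-in: keep the observation as free text.

    A real system would ask an LLM judge to write this description.
    """
    return f"Observed: {event}"


novel = "agent re-planned three times and abandoned the original goal"
print(classify("search_tool returned HTTP 500"))
print(classify(novel))   # falls through to the catch-all bucket
print(describe(novel))   # the failure is still captured in words
```

The unanticipated failure lands in the catch-all bucket under the fixed taxonomy, while the free-text path keeps what actually happened without anyone editing the category list.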

Why one number hides the run

Here is why granularity earns its keep. Take an illustrative batch of 20 runs of one agent. A pass/fail eval reports 14 of 20 passed, prints a tidy 70% success rate, and a team ships on it. Now read the same traces with CLEAR. At the node level, the judge notices that 9 of those 14 "passes" only succeeded because a retry quietly masked the same shaky tool call. At the trace level, it sees that brittle call drive plan drift in 6 of the runs. At the system level, it names the single dominant pattern: the agent is one flaky tool away from collapse. The honest read is not 70%. Only 5 of the 20 runs passed without leaning on that retry, so it is a fragile 70% that is really a 25% you can trust. (The 20/14/9 counts and the 25% read are illustrative; the 4 benchmarks, 7 agentic settings, and "predicts task success rate" come from the paper.) The single number averaged the fragility away; the three zoom levels are what made it visible.
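
The arithmetic behind that read can be checked directly. The counts are the illustrative ones from the paragraph above, not results from the paper, and the variable names are chosen here for the sketch.

```python
# Illustrative counts from the example batch, not figures from the paper.
total_runs = 20
passed = 14
passes_masked_by_retry = 9  # passes that only succeeded because of a retry

reported_rate = passed / total_runs
clean_passes = passed - passes_masked_by_retry
trusted_rate = clean_passes / total_runs

print(f"reported: {reported_rate:.0%}")  # reported: 70%
print(f"clean passes: {clean_passes}")   # clean passes: 5
print(f"trusted: {trusted_rate:.0%}")    # trusted: 25%
```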

Level	What it reads	The question it answers	What a single score misses
`node`	one decision or tool call	was this one step sound?	a recovered-but-shaky call hidden inside a "pass"
`trace`	one full run, start to finish	how did this run unfold?	plan drift, premature stopping, compounding errors
`system`	runs across the agent's history	what fails across the whole agent?	the dominant failure pattern no single run reveals


Related explainers

This explainer covers a single concept from one news item, so the closest neighbors are other explainers on how to evaluate an agent's process, not just its final score:

FutureSim — harness-level agent eval — the level above: FutureSim scores the whole agent harness end-to-end, and CLEAR extends that downward to the trace and the single node
TelBench — span-level error localization — the sibling granularity move: pinpointing which span of a run failed, the same "zoom in past the final grade" instinct
Agent leaderboards mislead under distribution shift (IBM) — the contrast: predictive validity asks whether a ranking is valid; CLEAR asks what a single run was actually doing


