What is streaming reasoning?

Streaming reasoning lets a language model reason while its input is still arriving, under partial observation, instead of waiting for the whole prompt. AdaSR (arXiv 2606.14694) pairs an on-the-fly streaming-reasoning phase with a final deep-reasoning pass once the stream completes, so the answer can land sooner.

Why does AdaSR matter?

Conventional reasoning models read the entire prompt before they think, which wastes the time the input takes to arrive in streaming settings like live transcripts, sensor feeds, or an agent watching a tool run. AdaSR overlaps reasoning with input arrival and learns how much to think and when, reporting a better balance of accuracy, compute, and latency than a supervised fine-tuning baseline.

AdaSR teaches LLMs to reason mid-stream — Streaming reasoning

Q: How is HRPO different from standard RL fine-tuning?

Standard RL fine-tuning spreads one sequence-level advantage over every token, so it cannot tell a good live guess from a good final answer. HRPO (Hierarchical Relative Policy Optimization) decomposes the objective into a streaming-reasoning phase and a deep-reasoning phase and assigns credit per phase, while an adaptive-thinking reward prices in latency.

Jargon

Streaming reasoning: Reasoning that happens while the input is still arriving, under partial observation, instead of waiting for the whole prompt. AdaSR pairs it with a final deep-reasoning pass once the stream completes.
Read-then-think: The conventional paradigm: read the entire prompt, then reason. It is the baseline AdaSR departs from — see AI Agents → Planning & Reflection.
HRPO: Hierarchical Relative Policy Optimization. The RL method that trains AdaSR by splitting policy optimization into a streaming phase and a deep phase for finer-grained credit, rather than spreading one advantage over every token.
Advantage assignment: How RL decides which tokens deserve credit or blame for the outcome. Standard recipes apply one sequence-level advantage to all tokens; HRPO assigns it per phase.
Adaptive-thinking reward: A reward term that prices in latency — it encourages the model to allocate just enough reasoning compute, and when, rather than always thinking the maximum amount.
GRPO: Group Relative Policy Optimization, the relative-advantage RL family HRPO descends from. See VPO vs GRPO for how relative advantages work.

The news. On June 12, 2026, researchers released AdaSR — Adaptive Streaming Reasoning, a way to train LLMs that reason while the input is still streaming in and then run a final deliberation once the stream completes. Its training method, HRPO (Hierarchical Relative Policy Optimization), decomposes the RL objective into a streaming-reasoning phase and a deep-reasoning phase so credit is assigned per phase instead of as one flat sequence-level advantage. Three rewards — format, accuracy, and adaptive-thinking (latency-aware) — steer it. The authors report a better balance of accuracy, compute, and streaming latency than a supervised fine-tuning baseline, without publishing headline numbers. Read the paper →

Picture a simultaneous interpreter at a live speech. A bad interpreter does the obvious thing: wait for the speaker to finish a whole sentence, then translate it — accurate, but always a sentence behind. A skilled one starts interpreting on the fly, rendering each clause as it lands and revising on the rare word that flips the meaning. AdaSR is teaching the model to be the second interpreter: reason as the words arrive, not after the speaker stops. That breaks the usual read-then-think habit, where the model sits idle through the entire prompt and only starts reasoning once the last token lands.

But interpreting live is a different skill from composing the final, polished translation — and AdaSR trains the two separately. Its method, HRPO, splits reinforcement learning into a streaming-reasoning phase (think under partial observation) and a deep-reasoning phase (deliberate once the input is complete), then assigns RL credit to each phase on its own. That fine-grained advantage assignment is the core trick: a flat, sequence-level reward can't tell a good live guess from a good final answer, so it can't coach either skill well. On top of it sits a reward that prices in how much to think and when — the adaptive-thinking term — alongside the usual format and accuracy rewards.

Approach	When it reasons	RL training signal	Answer latency
read-then-think	only after the full prompt lands	one sequence-level advantage over all tokens	late — input-time plus all reasoning
AdaSR	while the input streams, then a short final pass	split per phase (HRPO) + latency-aware reward	sooner — reasoning overlaps the wait

Put the clock on it. Say a user dictates a 200-token question into a live transcript, and answering needs about 80 tokens of reasoning. Read-then-think waits for all 200 input tokens to land, then spends 80 — so the time-to-first-answer is input-time plus 80. AdaSR overlaps the work: it spends maybe 50 of those reasoning tokens while the input is still streaming, and only ~30 of final deliberation after the last word. The answer lands sooner by exactly the reasoning it managed to pre-compute during the wait — here ~50 tokens of overlap (the 200 / 80 / 50 / 30 split is illustrative; the paper reports a better accuracy–compute–latency balance than an SFT baseline but gives no headline numbers).

Goes deeper in: AI Agents → Planning & Reflection → When to Spend More Tokens

Related explainers

StreamMA — streaming inter-agent reasoning — the other "streaming reasoning": AdaSR is one model thinking as its input arrives; StreamMA is agents streaming reasoning to each other
LongTraceRL — rubric process reward — like HRPO, it pushes RL credit below the sequence level, rewarding the steps of a trace rather than only the final answer
VPO vs GRPO — the relative-policy-optimization family HRPO descends from, and how relative advantages are computed

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based