The news. On June 12, 2026, researchers released AdaSR — Adaptive Streaming Reasoning, a way to train LLMs that reason while the input is still streaming in and then run a final deliberation once the stream completes. Its training method, HRPO (Hierarchical Relative Policy Optimization), decomposes the RL objective into a streaming-reasoning phase and a deep-reasoning phase so credit is assigned per phase instead of as one flat sequence-level advantage. Three rewards — format, accuracy, and adaptive-thinking (latency-aware) — steer it. The authors report a better balance of accuracy, compute, and streaming latency than a supervised fine-tuning baseline, without publishing headline numbers. Read the paper →
Picture a simultaneous interpreter at a live speech. A bad interpreter does the obvious thing: wait for the speaker to finish a whole sentence, then translate it — accurate, but always a sentence behind. A skilled one starts interpreting on the fly, rendering each clause as it lands and revising on the rare word that flips the meaning. AdaSR is teaching the model to be the second interpreter: reason as the words arrive, not after the speaker stops. That breaks the usual read-then-think habit, where the model sits idle through the entire prompt and only starts reasoning once the last token lands.
But interpreting live is a different skill from composing the final, polished translation — and AdaSR trains the two separately. Its method, HRPO, splits reinforcement learning into a streaming-reasoning phase (think under partial observation) and a deep-reasoning phase (deliberate once the input is complete), then assigns RL credit to each phase on its own. That fine-grained advantage assignment is the core trick: a flat, sequence-level reward can't tell a good live guess from a good final answer, so it can't coach either skill well. On top of it sits a reward that prices in how much to think and when — the adaptive-thinking term — alongside the usual format and accuracy rewards.
| Approach | When it reasons | RL training signal | Answer latency |
|---|---|---|---|
| read-then-think | only after the full prompt lands | one sequence-level advantage over all tokens | late — input-time plus all reasoning |
| AdaSR | while the input streams, then a short final pass | split per phase (HRPO) + latency-aware reward | sooner — reasoning overlaps the wait |
Put the clock on it. Say a user dictates a 200-token question into a live transcript, and answering needs about 80 tokens of reasoning. Read-then-think waits for all 200 input tokens to land, then spends 80 — so the time-to-first-answer is input-time plus 80. AdaSR overlaps the work: it spends maybe 50 of those reasoning tokens while the input is still streaming, and only ~30 of final deliberation after the last word. The answer lands sooner by exactly the reasoning it managed to pre-compute during the wait — here ~50 tokens of overlap (the 200 / 80 / 50 / 30 split is illustrative; the paper reports a better accuracy–compute–latency balance than an SFT baseline but gives no headline numbers).
Goes deeper in: AI Agents → Planning & Reflection → When to Spend More Tokens
Related explainers
- StreamMA — streaming inter-agent reasoning — the other "streaming reasoning": AdaSR is one model thinking as its input arrives; StreamMA is agents streaming reasoning to each other
- LongTraceRL — rubric process reward — like HRPO, it pushes RL credit below the sequence level, rewarding the steps of a trace rather than only the final answer
- VPO vs GRPO — the relative-policy-optimization family HRPO descends from, and how relative advantages are computed