Akgül et al. (May 7, 2026) argue that reinforcement learning for LLM reasoning is sparse policy selection rather than capability learning. Their token-level analysis across multiple model families shows that RL meaningfully shifts probability mass at only 1–3% of positions, and that the promoted tokens always lie within the base model's top-5 alternatives. The paper introduces ReasonMaxxer, a contrastive-loss method that targets entropy-gated decision points and requires no online generation. It matches or exceeds full RL across three model families, six scales, and six math reasoning benchmarks while using only hundreds of examples and minutes of single-GPU compute.

arXiv:2605.06241: RL for LLM Reasoning Touches Only 1–3% of Token Positions; RL-Free ReasonMaxxer Matches Full RL Training at 1,000× Lower Cost
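
To make the token-level claim concrete, here is a minimal sketch of how such a comparison could be set up on toy data. It is not the paper's code. The function names, the use of total variation distance as the measure of "shifted probability mass", and both thresholds are assumptions chosen for illustration; the summary above does not state the authors' exact metric or gating rule, and this sketch does not implement ReasonMaxxer's contrastive loss.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def top_k_ids(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def analyze_positions(base, tuned, shift_threshold=0.2, entropy_gate=0.5, k=5):
    """Compare per-position distributions from a base and an RL-tuned model.

    Hypothetical helper: both thresholds are illustrative assumptions.
    Returns the positions where the distribution shifted, the positions
    where the base model is uncertain (entropy-gated), and, for each
    shifted position, whether the tuned model's top token was already
    among the base model's top-k candidates.
    """
    shifted, gated, promoted_in_top_k = [], [], []
    for pos, (p_base, p_tuned) in enumerate(zip(base, tuned)):
        if entropy(p_base) >= entropy_gate:
            gated.append(pos)
        if total_variation(p_base, p_tuned) >= shift_threshold:
            shifted.append(pos)
            promoted = max(range(len(p_tuned)), key=lambda i: p_tuned[i])
            promoted_in_top_k.append(promoted in top_k_ids(p_base, k))
    return {
        "shifted": shifted,
        "gated": gated,
        "promoted_in_top_k": promoted_in_top_k,
        "shifted_fraction": len(shifted) / len(base),
    }

# Toy example: 4 positions over a 6-token vocabulary (made-up numbers).
confident = [0.97, 0.01, 0.01, 0.005, 0.005, 0.0]
base = [
    confident,
    [0.40, 0.35, 0.10, 0.10, 0.05, 0.0],   # uncertain "decision point"
    [0.95, 0.02, 0.01, 0.01, 0.01, 0.0],
    confident,
]
tuned = [
    confident,
    [0.10, 0.80, 0.05, 0.03, 0.02, 0.0],   # tuned model promotes token 1
    [0.95, 0.02, 0.01, 0.01, 0.01, 0.0],
    confident,
]

result = analyze_positions(base, tuned)
print(result)
```

In this toy setup the only position that shifts is also the only high-entropy one, and the promoted token was already the base model's second choice. That is the pattern the summary describes, but real measurements would need actual model logits over full vocabularies.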
