Apple's ParaRNN enables parallel training of large nonlinear RNNs with 665x speedup

2026-04-26 13:07

Apple researchers presented ParaRNN as an oral paper at ICLR 2026, removing a decades-old barrier: nonlinear RNNs (LSTM, GRU) must be trained sequentially over time steps, which has made them impractical to scale alongside transformers. The method casts the entire sequence of recurrence relations as a single system of nonlinear equations and solves it with Newton iterations combined with custom parallel reductions, achieving up to 665× faster training while reproducing the same hidden-state evolution as sequential application. With this approach, the team trained the first 7-billion-parameter classical RNNs to reach perplexity competitive with transformers and Mamba2; the code is open-sourced. Because RNNs have lower inference cost than attention-based architectures, ParaRNN opens a practical path to scaling them as an efficient alternative.
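
The core mechanism lends itself to a short illustration: write the whole length-T recurrence as one nonlinear system F(H) = 0, solve it with Newton's method, and note that each Newton step reduces to a linear recurrence whose terms compose associatively, so a parallel reduction can evaluate it across all time steps at once. The sketch below reconstructs that idea in plain NumPy and is not ParaRNN's released code; the tanh cell, the weight matrices W and U, and the problem sizes are illustrative assumptions, and the Newton linear solve is written as a loop where a real implementation would use a custom parallel scan.

```python
# Illustrative sketch (not Apple's released code): solving a nonlinear RNN
# forward pass with Newton's method over the whole sequence, as described
# above.  The tanh cell, weight names, and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_x = 64, 8, 4                       # sequence length, hidden and input sizes
W = rng.normal(scale=0.3, size=(d_h, d_h))   # recurrent weights
U = rng.normal(scale=0.3, size=(d_h, d_x))   # input weights
X = rng.normal(size=(T, d_x))                # input sequence
h0 = np.zeros(d_h)                           # initial hidden state


def cell(h_prev, x):
    """One nonlinear recurrence step: h_t = tanh(W h_{t-1} + U x_t)."""
    return np.tanh(W @ h_prev + U @ x)


def sequential_rnn():
    """Reference: the usual O(T) sequential forward pass."""
    H, h = np.zeros((T, d_h)), h0
    for t in range(T):
        h = cell(h, X[t])
        H[t] = h
    return H


def newton_rnn(max_iter=T, tol=1e-10):
    """Treat all T equations F_t(H) = h_t - cell(h_{t-1}, x_t) = 0 as one
    nonlinear system and solve it with Newton's method.  Each Newton step is
    the linear recurrence dH_t = A_t dH_{t-1} - F_t, which is associative and
    therefore solvable by a parallel scan (written as a loop here for clarity)."""
    H = np.zeros((T, d_h))                    # guess for every hidden state at once
    for _ in range(max_iter):
        H_prev = np.vstack([h0, H[:-1]])      # h_{t-1} for every t
        Z = H_prev @ W.T + X @ U.T            # all pre-activations in parallel
        F = H - np.tanh(Z)                    # residual of every equation
        if np.max(np.abs(F)) < tol:
            break
        # Jacobian blocks A_t = d cell / d h_{t-1} = diag(1 - tanh^2(z_t)) W
        A = (1.0 - np.tanh(Z) ** 2)[:, :, None] * W[None, :, :]
        # Block-bidiagonal Newton solve.  Pairs (A_t, -F_t) compose
        # associatively, so a Blelloch-style parallel reduction could replace
        # this loop -- that is the source of the parallel speedup.
        dH, prev = np.zeros_like(H), np.zeros(d_h)
        for t in range(T):
            prev = A[t] @ prev - F[t]
            dH[t] = prev
        H = H + dH
    return H


# Newton and sequential results agree up to the solver tolerance.
print(np.max(np.abs(sequential_rnn() - newton_rnn())))
```

In this toy setting Newton converges in a handful of iterations, and the sequential bottleneck moves entirely into the linear solve inside each iteration, which is exactly the part a parallel reduction can accelerate.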
