Google DeepMind publishes Decoupled DiLoCo: distributed training on 0.84 Gbps inter-datacenter bandwidth

2026-04-24 07:04

On April 23, Google DeepMind published Decoupled DiLoCo, a distributed training architecture that partitions large runs across isolated compute islands that communicate asynchronously rather than synchronizing every step across all chips. The approach cuts cross-datacenter bandwidth from 198 Gbps to 0.84 Gbps and sustains 88% useful training progress in high-failure environments, versus 27% for conventional synchronous training, as validated by training a 12-billion-parameter model across four US regions. The accompanying paper demonstrates that the method can mix hardware generations (TPU v6e and v5p) in a single run with zero global downtime.
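
The name builds on DeepMind's earlier DiLoCo recipe, in which each island runs many cheap local optimizer steps and only a parameter delta (a "pseudo-gradient") crosses datacenters once per round, which is what collapses the bandwidth requirement. Below is a minimal, illustrative Python sketch of that inner/outer loop on a toy quadratic objective; the island count, step counts, learning rates, and the plain-momentum outer optimizer are assumptions for illustration only, and the asynchronous, failure-tolerant exchange that distinguishes the Decoupled variant is not modeled here.

```python
import numpy as np

# Illustrative hyperparameters -- not values from the paper.
NUM_ISLANDS = 4      # isolated compute islands, e.g. four regions
INNER_STEPS = 500    # local steps between cross-island exchanges
INNER_LR = 0.1
OUTER_LR = 0.7
OUTER_MOMENTUM = 0.9
DIM = 8

rng = np.random.default_rng(0)
target = rng.normal(size=DIM)  # toy objective: recover `target`

def local_grad(params, rng):
    # Gradient of 0.5 * ||params - target||^2 on a noisy "minibatch".
    return (params - target) + 0.1 * rng.normal(size=DIM)

# Every island starts each round from the same global parameters.
global_params = np.zeros(DIM)
outer_velocity = np.zeros(DIM)

for round_idx in range(20):
    deltas = []
    for island in range(NUM_ISLANDS):
        # Inner loop: many cheap local SGD steps, zero cross-island traffic.
        params = global_params.copy()
        for _ in range(INNER_STEPS):
            params -= INNER_LR * local_grad(params, rng)
        # Only this parameter delta (the pseudo-gradient) crosses
        # datacenters, once per round instead of once per step.
        deltas.append(global_params - params)

    # Outer step: average the pseudo-gradients and apply them with
    # momentum (plain momentum here; DiLoCo itself uses Nesterov).
    pseudo_grad = np.mean(deltas, axis=0)
    outer_velocity = OUTER_MOMENTUM * outer_velocity + pseudo_grad
    global_params -= OUTER_LR * outer_velocity

print("distance to optimum:", np.linalg.norm(global_params - target))
```

Because the inner loop transmits nothing, inter-datacenter traffic scales with rounds rather than steps: one model-sized exchange every INNER_STEPS steps instead of one every step, which is the lever behind the 198 Gbps to 0.84 Gbps reduction the article reports.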
