Engineering Blog

Efficient MoE Pre-training at Scale on 1K AMD GPUs with TorchTitan