
Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

Eduardo Alvarez
2025-10-21 · 3 min read

<img alt="" class="webfeedsFeaturedVisual wp-post-image" height="432" src="https://developer-blogs.nvidia.com/wp-content/uploads/2025/10/image4-2-768x432-jpg.webp" style="display: block; margin-bottom...


Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs, is now the foundation of scalable, state-of-the-art deployments. The highest-performing models increasingly adopt mixture-of-experts (MoE) architectures, which are more efficient than dense models because they activate only a subset of trained…
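To make the sparse-activation idea concrete, here is a minimal sketch (not from the article) of top-k mixture-of-experts routing, showing how each token runs through only a few of the available experts. All sizes, names, and the gating scheme are illustrative assumptions; in a wide expert-parallel deployment the experts would be sharded across GPUs rather than held in one list.

```python
# Minimal top-k MoE routing sketch; names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

num_experts, top_k = 8, 2          # route each token to 2 of 8 experts
d_model, d_ff = 16, 64
tokens = torch.randn(4, d_model)   # a tiny batch of 4 token embeddings

# One feed-forward "expert" per slot; only the routed experts run per token.
experts = [torch.nn.Sequential(
    torch.nn.Linear(d_model, d_ff), torch.nn.ReLU(),
    torch.nn.Linear(d_ff, d_model)) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)   # learned gating network

logits = router(tokens)                          # (tokens, experts)
weights, indices = torch.topk(logits, top_k, dim=-1)
weights = F.softmax(weights, dim=-1)             # normalize over chosen experts

output = torch.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for slot in range(top_k):
        e = indices[t, slot].item()
        # Only the top-k experts contribute compute for this token; under
        # expert parallelism these experts would live on different GPUs.
        output[t] += weights[t, slot] * experts[e](tokens[t])

print(output.shape)  # torch.Size([4, 16])
```

Because only `top_k` of the `num_experts` feed-forward blocks execute per token, total parameter count grows with the number of experts while per-token compute stays roughly constant, which is the efficiency advantage the article attributes to MoE over dense models.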


Source: NVIDIA Technical Blog