Why Topology-Aware Scheduling Matters for GB200 NVL72
Modern AI models, especially trillion-parameter LLMs and mixture-of-experts architectures, demand massive, low-latency GPU communication. The NVIDIA GB200 NVL72 delivers exascale compute in a single rack, with 72 Blackwell GPUs interconnected by NVLink at an aggregate 130 TB/s. That bandwidth goes unused, however, if your job scheduler scatters a workload across disjoint network domains.
In shared clusters, naive scheduling (e.g., Slurm with no topology plugin configured) treats all nodes as equal, often fragmenting jobs across leaf switches. This leads to severe performance degradation: a job that could fully utilize NVLink bandwidth ends up bottlenecked by slower inter-switch links. The solution is topology-aware scheduling, which aligns job allocations with the physical network hierarchy.
NVIDIA and SchedMD collaborated to introduce the topology/block plugin in Slurm 23.11, purpose-built for rack-scale systems like GB200 NVL72 and GB300 NVL72. This post dives into how it works, how to configure segment sizes, and what simulation results reveal about real-world occupancy trade-offs.

Understanding the Block Topology Plugin
The topology/block plugin replaces the older topology/tree plugin's best-effort approach with a deterministic block allocation strategy. It groups nodes into NVLink domains, where each domain is a set of nodes that can communicate entirely over NVLink without traversing the slower inter-rack network. On GB200 NVL72, one domain is one rack: 18 nodes with 4 GPUs each.
Key Parameters
- TopologyPlugin: Set to topology/block in slurm.conf to enable block scheduling.
- BlockName: Defines a domain and the nodes it contains in topology.conf (e.g., block0 for domain 0).
- BlockSizes: Declares the base block size, in nodes, in topology.conf (18 for one NVL72 rack).
- Segment size: The number of nodes allocated contiguously within a domain, requested per job with the --segment option.
Example Configuration
# slurm.conf
TopologyPlugin=topology/block
# topology.conf (check the man page for your Slurm release before deploying)
BlockName=block0 Nodes=node[01-18]
BlockName=block1 Nodes=node[19-36]
BlockName=block2 Nodes=node[37-54]
BlockName=block3 Nodes=node[55-72]
BlockSizes=18
When a job requests 32 GPUs (8 nodes at 4 GPUs per node), Slurm's block scheduler attempts to allocate all 8 nodes within the same NVLink domain (e.g., block0). If too few nodes are free there, it considers other domains, but only after exhausting in-domain options.
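The in-domain placement described above can be illustrated with a small Python sketch. It is not Slurm's actual algorithm: the best-fit rule, the function names, and the domain names (`domain0`, `domain1`, ...) are assumptions made for illustration, and the node layout simply mirrors the 72-node example.

```python
from typing import Dict, List, Optional

NODES_PER_DOMAIN = 18  # one GB200 NVL72 rack: 18 nodes x 4 GPUs = 72 GPUs
GPUS_PER_NODE = 4


def build_domains(num_domains: int) -> Dict[str, List[str]]:
    """Map hypothetical domain names to node names (node01..node18, node19..node36, ...)."""
    domains: Dict[str, List[str]] = {}
    for d in range(num_domains):
        first = d * NODES_PER_DOMAIN + 1
        domains[f"domain{d}"] = [
            f"node{n:02d}" for n in range(first, first + NODES_PER_DOMAIN)
        ]
    return domains


def place_in_single_domain(
    free: Dict[str, List[str]], gpus: int
) -> Optional[List[str]]:
    """Place a job entirely inside one domain, or return None if no domain fits.

    Assumed policy: best fit, i.e. pick the domain with the fewest free
    nodes that can still hold the whole job. `free` is updated in place.
    """
    nodes_needed = -(-gpus // GPUS_PER_NODE)  # ceiling division
    candidates = [
        (len(nodes), name)
        for name, nodes in free.items()
        if len(nodes) >= nodes_needed
    ]
    if not candidates:
        return None
    _, best = min(candidates)
    chosen = free[best][:nodes_needed]
    free[best] = free[best][nodes_needed:]
    return chosen


free = build_domains(4)
free["domain0"] = free["domain0"][:6]  # pretend only 6 nodes are free in domain0
job_nodes = place_in_single_domain(free, gpus=32)
print(job_nodes)
```

Here the 32-GPU job needs 8 nodes, so `domain0` (6 free nodes) is skipped and the job lands on node19 through node26 in `domain1`, never straddling two NVLink domains.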
Segment Sizing Rules of Thumb
| Job Size (GPUs) | Recommended Segment Size (Nodes) | Example Workload |
|---|---|---|
| 128+ | 16 | MoE training |
| 32–64 | 4 | Large dense LLM |
| <32 | 1 | Fine-tuning, inference |
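Important: Choose segment sizes that are powers of two for optimal alignment with NVLink topology. Note that an 18-node domain holds one 16-node segment with 2 nodes left over, so those 2 nodes remain available for small jobs. Non-power-of-two sizes (e.g., 12 nodes) may still work, but test your specific workload's efficiency.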
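As a rough aid, the table and the domain arithmetic can be encoded in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the thresholds are copied from the table, and job sizes the table does not cover (65 to 127 GPUs) return `None` rather than a guess.

```python
from typing import Optional, Tuple

NODES_PER_DOMAIN = 18  # nodes in one GB200 NVL72 NVLink domain


def recommended_segment_size(gpus: int) -> Optional[int]:
    """Look up the rule-of-thumb segment size (in nodes) for a job size in GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be positive")
    if gpus >= 128:
        return 16
    if 32 <= gpus <= 64:
        return 4
    if gpus < 32:
        return 1
    return None  # 65-127 GPUs: not covered by the table


def segments_per_domain(
    segment_size: int, nodes_per_domain: int = NODES_PER_DOMAIN
) -> Tuple[int, int]:
    """Return (whole segments that fit in a domain, nodes left over)."""
    if segment_size < 1:
        raise ValueError("segment_size must be positive")
    return divmod(nodes_per_domain, segment_size)


print(recommended_segment_size(256))  # 16
print(segments_per_domain(16))        # (1, 2)
print(segments_per_domain(12))        # (1, 6)
```

The leftover count is the useful number when comparing candidates: a 12-node segment strands 6 nodes per domain for other work, while a 16-node segment strands 2.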

Simulation Results: Occupancy vs. Fragmentation
We built a standalone Slurm simulator (time-accelerated, VM-based) to evaluate scheduling strategies on a 5,000-node GB200 NVL72 cluster running 15,000 jobs over 7 days, with a 2.5% node failure rate.
Key Findings
- Fragmentation is minimal with topology/block. Small jobs (1–18 nodes) tend to pack into the last two nodes of each domain, leaving larger contiguous blocks free for big jobs.
- Occupancy penalty is ~1% compared to a theoretical noTopo baseline (which ignores topology constraints). This means you can achieve near-optimal utilization without sacrificing performance.
- Large jobs benefit most. Jobs with ≥32 nodes using segment size 16 saw up to 2.6× training throughput improvement (per MLPerf) versus naive placement.
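For readers who want to reproduce the occupancy comparison on their own traces, the metric can be sketched as follows. The definitions are assumptions (occupancy as allocated node-hours over available node-hours, penalty as a difference in percentage points), and the node-hour figures are made-up placeholders chosen only to show the arithmetic; they are not results from the simulation above.

```python
def occupancy(allocated_node_hours: float, available_node_hours: float) -> float:
    """Fraction of available node-hours that were allocated to jobs."""
    if available_node_hours <= 0:
        raise ValueError("available_node_hours must be positive")
    return allocated_node_hours / available_node_hours


def occupancy_penalty(baseline: float, constrained: float) -> float:
    """Occupancy lost by adding topology constraints, as a fraction of capacity."""
    return baseline - constrained


# Capacity of a 5,000-node cluster over 7 days, ignoring failures.
available = 5_000 * 7 * 24  # 840,000 node-hours

# Hypothetical allocated node-hours, for illustration only.
no_topo = occupancy(798_000, available)
block = occupancy(789_600, available)

penalty = occupancy_penalty(no_topo, block)
print(f"{penalty:.2%}")  # 1.00%
```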
Recommended Policy: Large_Perf_Custom
- Jobs ≥32 nodes → segment size 16
- Jobs <32 nodes → segment size 2
- Monitor fragmentation weekly; adjust segment sizes if >10% of domains are >50% fragmented
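The policy and its monitoring rule could be expressed along these lines. The post does not define how fragmentation is measured, so the metric here is an assumption: a domain's fragmentation is the share of its free nodes that cannot form a whole segment. All function names are hypothetical.

```python
from typing import Dict


def large_perf_custom_segment(job_nodes: int) -> int:
    """Segment size under the Large_Perf_Custom policy."""
    if job_nodes < 1:
        raise ValueError("job_nodes must be positive")
    return 16 if job_nodes >= 32 else 2


def domain_fragmentation(free_nodes: int, segment_size: int) -> float:
    """Assumed metric: fraction of free nodes that cannot form a whole segment."""
    if free_nodes == 0:
        return 0.0  # a fully busy domain has nothing stranded
    stranded = free_nodes % segment_size
    return stranded / free_nodes


def needs_retuning(
    free_by_domain: Dict[str, int],
    segment_size: int,
    domain_share: float = 0.10,
    frag_threshold: float = 0.50,
) -> bool:
    """True if more than `domain_share` of domains exceed `frag_threshold`."""
    if not free_by_domain:
        return False
    fragmented = sum(
        1
        for free in free_by_domain.values()
        if domain_fragmentation(free, segment_size) > frag_threshold
    )
    return fragmented / len(free_by_domain) > domain_share


# Ten domains: two have 10 free nodes (useless for 16-node segments), eight are idle.
snapshot = {f"domain{i}": 18 for i in range(10)}
snapshot["domain0"] = 10
snapshot["domain1"] = 10
print(needs_retuning(snapshot, segment_size=16))  # True
```

With a different fragmentation definition the thresholds may need to change; the point is only that the weekly check reduces to a per-domain metric plus a count.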
Limitations & Caveats
- Not a silver bullet: If your cluster runs many tiny jobs (single-node), the block plugin may over-constrain and increase queue time. In such cases, consider a hybrid policy that relaxes topology constraints for jobs under 4 GPUs.
- Segment size tuning requires experimentation: The optimal segment size depends on model parallelism strategy (FSDP, TP, PP, EP). Use the simulator to validate before production rollout.
- Node failures can break domain continuity: The block plugin does not automatically rebalance after failures. You must manually drain or replace failed nodes to maintain domain integrity.
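Two of the caveats above lend themselves to simple checks, sketched below. The helper names are hypothetical, and in practice the list of down nodes would come from your cluster's own monitoring rather than a hard-coded list. The first function encodes the hybrid rule of relaxing topology for jobs under 4 GPUs; the second flags domains that have lost too many nodes to host a full segment.

```python
from typing import Dict, Iterable, List


def topology_constrained(gpus: int) -> bool:
    """Hybrid policy: only jobs with at least 4 GPUs get topology constraints."""
    return gpus >= 4


def healthy_nodes(
    domains: Dict[str, List[str]], down: Iterable[str]
) -> Dict[str, List[str]]:
    """Return each domain's nodes with the failed nodes removed."""
    down_set = set(down)
    return {
        name: [node for node in nodes if node not in down_set]
        for name, nodes in domains.items()
    }


def domains_too_small(
    domains: Dict[str, List[str]], down: Iterable[str], segment_size: int
) -> List[str]:
    """Names of domains with fewer healthy nodes than one segment needs."""
    return sorted(
        name
        for name, nodes in healthy_nodes(domains, down).items()
        if len(nodes) < segment_size
    )


domains = {
    "domain0": [f"node{n:02d}" for n in range(1, 19)],
    "domain1": [f"node{n:02d}" for n in range(19, 37)],
}
down = ["node03", "node07", "node11", "node20"]
print(domains_too_small(domains, down, segment_size=16))  # ['domain0']
```

In this example `domain0` drops to 15 healthy nodes and can no longer host a 16-node segment, while `domain1` still can with 17. Because an 18-node domain has only 2 spare nodes beyond a 16-node segment, a third failure in one rack is the point at which repair becomes urgent.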
Next Steps
- Upgrade to Slurm 23.11+ and enable topology/block.
- Run the simulation framework (available on NVIDIA's GitHub) with your workload trace.
- Start with conservative segment sizes (power of two) and iterate based on monitoring data.
For a deeper dive into segment scheduling algorithms, see the original NVIDIA blog post.

Conclusion
Topology-aware scheduling is no longer optional for exascale AI clusters. NVIDIA's GB200 NVL72, combined with Slurm's block topology plugin, delivers both high performance and high utilization—provided you configure segment sizes wisely. Our simulations show that the occupancy penalty of topology constraints can be reduced to ~1%, making it a clear win for any serious AI infrastructure.
Start today: review your Slurm configuration, model your workload with the simulator, and deploy block scheduling. Your GPU hours—and your researchers—will thank you.