Splitting a Model Across Machines

Distributed training systems exist because a modern model no longer fits, in memory or in time, on one accelerator. The problem splits along three axes. Memory pressure asks where parameters, gradients, and optimizer states live; FSDP shards all three across devices and gathers them only when a layer runs. Compute pressure asks how to cut one forward pass across many chips; tensor parallelism splits individual matrix multiplies, pipeline parallelism splits the layer stack into stages. Gradient checkpointing trades compute for memory, recomputing activations instead of storing them. Each technique targets a different bottleneck, and real runs combine them.

Tensor parallelism beats pipeline parallelism when the interconnect is fast and the bubble is the dominant tax, which in practice means inside a single node. The two pay for scale in different currencies. Tensor parallelism splits each matrix multiply, so it issues an all-reduce of full activations twice per layer, every forward and backward; the volume is high and the frequency is per-layer, which only NVLink-class bandwidth absorbs cheaply. Pipeline parallelism sends activations point-to-point only at stage boundaries, a fraction of the traffic, so it tolerates slow inter-node links. Its cost is the idle bubble between stages, not bandwidth.

Three-dimensional parallelism composes the three splits as orthogonal axes of a device mesh, assigning every accelerator a coordinate of tensor rank, pipeline rank, and data rank. The total chip count is the product of the three degrees. The placement follows the bandwidth hierarchy: tensor parallelism sits innermost, inside a node where NVLink feeds its per-layer all-reduces; pipeline parallelism spans nodes, since its boundary activations are cheap; data parallelism wraps the outside, replicating whole pipelines and reconciling them with one gradient all-reduce per step, the rarest collective. Each axis maps its heaviest traffic onto the fastest affordable link.

Megatron-LM and DeepSpeed expose the three degrees as plain integers and then build the plumbing underneath. In Megatron, flags like tensor-model-parallel-size and pipeline-model-parallel-size set two axes; the data-parallel degree is whatever world size remains. From those numbers it constructs one process group per axis and assigns each rank its coordinates. The catch is that the axes are not equal in effort. Data parallelism, including DeepSpeed’s ZeRO sharding, is largely a wrapper around the model. Tensor and pipeline parallelism demand surgery: hand-written column- and row-parallel layers, and the layer stack carved into stages with a 1F1B schedule.

At the largest scales the binding constraint stops being memory or bandwidth and becomes failure itself. A run spanning thousands of accelerators sees a fault every few hours, and synchronous training is unforgiving: every collective is a barrier, so one dead or straggling rank stalls the entire job. The defense is mundane but relentless. Checkpoint the sharded state often, detect faults with health checks, swap in spare nodes, and restart from the last save. The same synchrony that makes the gradient math exact is what makes the machine brittle, and the edifice rests on how fast it can be rebuilt.