What the Loss Curve Hides

Training dynamics is the study of how a network’s behavior changes over the course of optimization, not just where it lands. A loss curve looks like steady decay, but that smoothness hides structure: long plateaus, sudden drops, and regimes where the model reorganizes what it has already learned. Grokking is the sharpest example, where validation accuracy stays at chance long after the training loss bottoms out, then snaps to perfect generalization thousands of steps later. Phase transitions name these abrupt shifts, moments when a capability appears almost discontinuously. The training trajectory, not the final weights, carries the explanation.

Whether emergent capabilities are genuine phase transitions depends heavily on the ruler used to measure them. The original claim, from Wei and collaborators, was that certain abilities like multi-step arithmetic appear suddenly past a scale threshold, invisible in smaller models. Schaeffer and colleagues then pushed back: the sharpness is often manufactured by the metric. Exact-match accuracy is all-or-nothing: a multi-token answer scores only if every token is right, so a model improving smoothly in log-likelihood looks flat until it crosses a hidden bar, then spikes. Swap to a continuous score and the cliff becomes a ramp. The capability grows steadily; the discontinuity lives in the evaluation.
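The metric effect can be reproduced with a few lines of arithmetic. The sketch below is a toy, not a reconstruction of either paper's experiments: it assumes a hypothetical logistic curve for per-token accuracy as a function of log10 parameter count (the function `per_token_accuracy` and its midpoint of 9.0 are invented for illustration), and it assumes tokens are correct independently, so exact match on an answer of length `answer_len` is the per-token accuracy raised to that power.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth scaling curve (an assumption, not fitted to data)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match(p_token: float, answer_len: int) -> float:
    """Chance that every token is right, assuming independent tokens."""
    return p_token ** answer_len


def mean_token_log_likelihood(p_token: float) -> float:
    """A continuous score: average log-probability per answer token."""
    return math.log(p_token)


ANSWER_LEN = 10  # hypothetical answer length in tokens
scales = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

rows = []
for s in scales:
    p = per_token_accuracy(s)
    rows.append((s, p, exact_match(p, ANSWER_LEN), mean_token_log_likelihood(p)))

print("log10(params)  token_acc  exact_match  token_loglik")
for s, p, em, ll in rows:
    print(f"{s:13.1f}  {p:9.4f}  {em:11.6f}  {ll:12.4f}")
```

Under these assumptions the token-level columns change at every scale, while the exact-match column stays near zero for the smaller models and rises only at the largest ones. The underlying improvement is the same in all three columns; only the scoring rule differs.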

Mechanistic interpretability offers a sharper test than a benchmark, because it watches the internal algorithm form rather than scoring the output. Two findings anchor the case. Olsson and colleagues traced in-context learning to induction heads that snap into place during a narrow window of training, leaving a visible bump in the loss. Nanda and colleagues reverse-engineered a grokking network and found it builds a Fourier-based algorithm for modular addition. But that second result cuts both ways: the circuit assembles gradually while test accuracy stays flat, and only afterward does accuracy jump. Interpretability relocates the transition more than it abolishes it.

A progress measure is a quantity that moves smoothly while the headline metric sits flat, exposing hidden movement toward a solution. In the grokking study, the measures were built from the model’s own Fourier machinery: a restricted loss keeping only the key frequencies falls steadily long before test accuracy moves. The honest catch is that those measures were reverse-engineered after the circuit was understood, so they explain rather than predict. Defining one in advance is harder. Developmental interpretability, drawing on singular learning theory, tries to detect stage transitions from the local learning coefficient alone, without naming the circuit first.
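The phrase "keeping only the key frequencies" describes a projection in Fourier space, and a toy version of that one step may make it concrete. The sketch below is not the grokking study's procedure, which works on a trained network's internals; it applies the same idea to a plain list of numbers. The helper `restrict_to_frequencies`, the modulus 13, and the choice of frequencies 3 and 5 are all assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft above."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def restrict_to_frequencies(x, key_freqs):
    """Zero every Fourier component except the key frequencies.

    Each frequency k is kept together with its mirror -k so that a
    real input yields a real output.
    """
    n = len(x)
    keep = set()
    for k in key_freqs:
        keep.add(k % n)
        keep.add((-k) % n)
    spectrum = dft(x)
    kept = [spectrum[k] if k in keep else 0j for k in range(n)]
    return [value.real for value in idft(kept)]


def mse(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


p = 13  # hypothetical modulus
target = [math.cos(2 * math.pi * 3 * t / p) for t in range(p)]
# Signal = the "key" component plus an unrelated component at frequency 5.
signal = [target[t] + 0.5 * math.cos(2 * math.pi * 5 * t / p) for t in range(p)]
restricted = restrict_to_frequencies(signal, [3])

print("error of full signal vs target:      ", mse(signal, target))
print("error of restricted signal vs target:", mse(restricted, target))
```

In this toy the restricted signal matches the target almost exactly while the full signal does not, which is the sense in which a restricted quantity can reveal structure that the unrestricted one obscures. Note that the key frequency had to be known beforehand, mirroring the catch described above.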

A model-agnostic progress measure can work, but reliability across tasks is probably too much to ask of one scalar. Weight norm is the best candidate: Liu, Michaud, and Tegmark showed grokking coincides with the norm drifting toward a critical scale, and weight decay hastens the jump. Sparsity tells a similar story when the solution is sparse. Yet each scalar tracks one kind of structure, and a new task can hide its progress where the gauge never looks. That is the lasting lesson: the trajectory is partly legible, never as smooth as the loss curve that hides it.
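Weight norm is also the cheapest of these gauges to log. The sketch below shows only the bookkeeping and the direction of the weight-decay effect; it is not the cited authors' setup. It assumes a toy with no task gradient at all, so each update is pure decay, and the names `steps_to_reach`, the starting weights, and the target norm of 1.0 are hypothetical stand-ins for a real model and a real critical scale.

```python
import math


def l2_norm(weights):
    """Euclidean norm of a flat list of parameters."""
    return math.sqrt(sum(w * w for w in weights))


def steps_to_reach(weights, target_norm, lr, weight_decay, max_steps=100_000):
    """Count decay-only updates until the norm falls to target_norm.

    Each step applies w <- w - lr * weight_decay * w, the weight-decay
    part of an SGD update with the task gradient assumed to be zero.
    Returns None if the target is not reached within max_steps.
    """
    w = list(weights)
    for step in range(max_steps):
        if l2_norm(w) <= target_norm:
            return step
        w = [wi - lr * weight_decay * wi for wi in w]
    return None


initial = [3.0, 4.0]   # hypothetical large-norm initialization
target_norm = 1.0      # hypothetical critical scale

slow = steps_to_reach(initial, target_norm, lr=0.1, weight_decay=0.1)
fast = steps_to_reach(initial, target_norm, lr=0.1, weight_decay=1.0)

print("initial norm:", l2_norm(initial))
print("steps to target with weak decay:  ", slow)
print("steps to target with strong decay:", fast)
```

In a real training run the same `l2_norm` call would be applied to the model's parameters at each logging step and plotted beside train and test metrics. The toy omits everything that makes the real dynamics interesting, which is also why a norm trace alone says nothing about tasks whose progress shows up elsewhere.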