4 · Direct logit attribution
Which components actually push the action logit up? Our model has no LayerNorm and no unembed bias, so each logit is a linear function of the final residual stream, which is itself the sum of what every component writes into it:
logit[a] = Σ_component ( write_component · W_U[:, a] )
That makes the decomposition exact — the bars below literally add up to the true logit.
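As a rough sanity check of that claim, here is a minimal NumPy sketch. It uses random stand-in vectors rather than the real model's activations, and the component names, `d_model`, and the inclusion of an `embed` term are assumptions made for illustration. It only demonstrates that, with no LayerNorm and no unembed bias, the per-component projections through `W_U` sum to the full logit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_actions = 16, 6  # hypothetical sizes, not the real model's

# Hypothetical component names: an embedding, four attention heads, one MLP.
components = ["embed", "head_0", "head_1", "head_2", "head_3", "mlp"]

# Random stand-ins for what each component writes into the residual
# stream at the final position (NOT real activations).
writes = {name: rng.normal(size=d_model) for name in components}
W_U = rng.normal(size=(d_model, n_actions))  # unembedding, no bias

# Final residual stream = sum of all component writes.
resid_final = sum(writes.values())
logits = resid_final @ W_U

# Direct logit attribution: project each write through W_U separately.
attribution = {name: w @ W_U for name, w in writes.items()}

# Because the map from residual stream to logits is linear, the
# per-component attributions sum back to the true logits.
total = sum(attribution.values())
exact = np.allclose(total, logits)
print("attributions sum to logits:", exact)
```

With a LayerNorm before the unembed, the same check would generally fail unless the normalisation scale were frozen and folded into each component's write.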
Start with the predicted action and the default Relative to: <runner-up> view.
The story has two tiers. Flip Relative to → raw logit: the MLP dwarfs everything, contributing ~+146 to NORTH while the four attention heads add only ~+2 each. But the MLP is not deciding which direction to move. It hands a similarly large score to every movement direction (SOUTH, EAST and WEST all land between ~+80 and ~+140) and strongly suppresses the object actions (PICKUP ≈ −241, DROP ≈ −56). The MLP's job is the coarse cut: “this is a movement command, not PICKUP or DROP.”
Now switch Relative to → WEST (the runner-up). The MLP's large common-mode term cancels, and the attention heads, which looked negligible in the raw view, now carry a visible share of the fine “which direction” decision: they supply roughly a third of NORTH's margin over WEST, and the MLP supplies the rest. This is the general lesson of DLA: a component's contribution to a raw logit can be dominated by a large term shared across many output tokens (here, actions); to see what drives a decision, attribute the logit difference instead.
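The cancellation can be sketched with a few lines of NumPy. The numbers below are invented toy values chosen to make the arithmetic easy to follow; they are not the model's measured contributions, and the `attribute` helper and component names are hypothetical. The point is only that subtracting a baseline action's column removes whatever a component adds equally to both actions.

```python
import numpy as np

actions = ["NORTH", "SOUTH", "EAST", "WEST", "PICKUP", "DROP"]

# Toy per-component logit contributions (one entry per action).
# Invented for illustration; NOT the real model's numbers.
contrib = {
    "head_0": np.array([0.25, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "head_1": np.array([0.25, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "head_2": np.array([0.25, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "head_3": np.array([0.25, 0.0, 0.0, 0.0, 0.0, 0.0]),
    # Large common-mode term on movement actions, small edge for NORTH.
    "mlp": np.array([102.0, 100.0, 100.0, 100.0, -200.0, -50.0]),
}

def attribute(contrib, target, baseline=None):
    """Per-component contribution to logit[target], or to
    logit[target] - logit[baseline] when a baseline is given."""
    t = actions.index(target)
    out = {}
    for name, c in contrib.items():
        value = c[t]
        if baseline is not None:
            value = value - c[actions.index(baseline)]
        out[name] = float(value)
    return out

def head_share(attr):
    """Fraction of the total that comes from the attention heads."""
    heads = sum(v for k, v in attr.items() if k.startswith("head"))
    return heads / sum(attr.values())

raw = attribute(contrib, "NORTH")                    # raw logit view
diff = attribute(contrib, "NORTH", baseline="WEST")  # logit difference view

head_share_raw = head_share(raw)
head_share_diff = head_share(diff)
print(f"head share of raw NORTH logit:   {head_share_raw:.3f}")
print(f"head share of NORTH-WEST margin: {head_share_diff:.3f}")
```

In this toy setup the heads account for under 1% of the raw NORTH logit but a third of the NORTH-over-WEST margin, because the MLP's shared 100-point term appears in both columns and drops out of the difference.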