7 · Activation patching: the causal story

Attention patterns (lesson 3) and direct logit attribution, or DLA (lesson 4), are correlational: they tell us what a component attends to or writes, not whether the model would change its mind without it. To make a causal claim we intervene: run two inputs that differ in exactly one word, then copy a single activation from one run into the other and watch the decision move.

The setup is a minimal pair:

clean — go north → NORTH
corrupt — go south → SOUTH

These differ only at the second word, so any causal effect we find must trace back to the direction word. Our metric is the logit difference at the final position, logit[NORTH] − logit[SOUTH]: large and positive on the clean run, negative on the corrupt run. For each residual-stream checkpoint and token position we overwrite the corrupt run's activation with the clean one, let the rest of the forward pass run as usual, and measure recovery:
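As a minimal sketch of the metric, assuming the final position's logits arrive as a plain list and that a vocab dictionary maps labels to indices (both hypothetical, as are the numbers and the extra EAST label):

```python
def logit_diff(final_logits, vocab, pos_label="NORTH", neg_label="SOUTH"):
    """Return logit[pos_label] - logit[neg_label] from the final position's logits."""
    return final_logits[vocab[pos_label]] - final_logits[vocab[neg_label]]


# Hypothetical label vocabulary and made-up logits, for illustration only.
vocab = {"NORTH": 0, "SOUTH": 1, "EAST": 2}
clean_logits = [5.0, -3.0, 0.5]    # "go north": NORTH is favoured
corrupt_logits = [-3.0, 5.0, 0.5]  # "go south": SOUTH is favoured

clean_metric = logit_diff(clean_logits, vocab)      # positive on the clean run
corrupt_metric = logit_diff(corrupt_logits, vocab)  # negative on the corrupt run
print(clean_metric, corrupt_metric)
```

Using a difference of two logits rather than a single logit cancels any shift that raises or lowers both answers equally, so the metric tracks only the NORTH-versus-SOUTH decision.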

recovery = (patched_metric − corrupt_metric) / (clean_metric − corrupt_metric)

1.0 means that patching that single spot, on its own, flips the decision all the way back to NORTH; 0.0 means that patching there leaves the corrupt decision unchanged.
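The recovery formula can be sketched as a small helper. The function name and the metric values below are illustrative assumptions, not numbers from the lesson's model:

```python
def recovery(patched_metric, clean_metric, corrupt_metric):
    """Fraction of the clean-vs-corrupt gap that a patch restores."""
    gap = clean_metric - corrupt_metric
    if gap == 0:
        # The two runs already agree, so there is nothing to recover.
        raise ValueError("clean and corrupt metrics are equal; recovery is undefined")
    return (patched_metric - corrupt_metric) / gap


# Made-up logit differences for illustration.
clean_metric, corrupt_metric = 4.0, -4.0

print(recovery(4.0, clean_metric, corrupt_metric))   # patch restores the clean metric
print(recovery(-4.0, clean_metric, corrupt_metric))  # patch changes nothing
print(recovery(2.0, clean_metric, corrupt_metric))   # patch restores part of the gap
```

The three calls give 1.0, 0.0 and 0.75. Nothing clamps the result, so a patch that overshoots the clean metric would score above 1.0 and one that pushes further toward SOUTH would score below 0.0.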

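[Interactive patching grid: rows are residual-stream checkpoints, columns are token positions, each cell shows recovery]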

Read the grid top-to-bottom — that is the order activations are computed. At the embedding, the entire decision sits on the direction word north: patching it alone fully recovers NORTH (1.00) while every other position does nothing (0.00) — unsurprising, since that token is the only thing that differs. After layer 0's attention the signal has begun to move: north drops to 0.77 while the final <sep> readout position picks up 0.38. By layer 1 it has relocated completely — patching <sep> alone recovers the whole decision (1.00) and the input word no longer matters (0.00).

Two details make the mechanism concrete. Each MLP row is identical to the attention row directly above it — the tell that attention does the moving (it mixes across positions) while the position-wise MLPs do not. And the destination is exactly the <sep> slot whose attention we stared at in lesson 3. So this is the causal confirmation of that pattern: the circuit reads the direction word, and attention ferries that information up to the readout position where the unembedding (lesson 4) can turn it into a logit.
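To see why a position-wise MLP row has to match the attention row above it, here is a toy patching sweep in plain Python. It is not the lesson's model: the one-dimensional embeddings, the single hard-coded attention step that copies from position 1 into the final position, the tanh MLP, the toy unembedding, and the names run, CHECKPOINTS and grid are all assumptions made for illustration.

```python
import math

# Hypothetical 1-D "direction" embeddings: only the direction word is nonzero.
EMBED = {"go": 0.0, "north": 1.0, "south": -1.0, "<sep>": 0.0}
CHECKPOINTS = ["embed", "attn_out", "mlp_out"]
CLEAN = ["go", "north", "<sep>"]
CORRUPT = ["go", "south", "<sep>"]


def run(tokens, patch=None, cache=None):
    """Run the toy model and return logit[NORTH] - logit[SOUTH] at the final position.

    patch: optional (checkpoint, position, value) that overwrites one residual entry.
    cache: optional dict that receives a copy of the residual stream at each checkpoint.
    """
    def checkpoint(name, resid):
        resid = list(resid)
        if patch is not None and patch[0] == name:
            resid[patch[1]] = patch[2]
        if cache is not None:
            cache[name] = list(resid)
        return resid

    resid = checkpoint("embed", [EMBED[t] for t in tokens])
    # Toy attention: only the final position reads, and it copies from position 1.
    resid = checkpoint("attn_out", resid[:-1] + [resid[-1] + resid[1]])
    # Position-wise MLP: each position is updated from its own value only.
    resid = checkpoint("mlp_out", [x + math.tanh(x) for x in resid])
    # Toy unembedding at the final position: logit[NORTH] = 2x, logit[SOUTH] = -2x.
    x = resid[-1]
    return 2.0 * x - (-2.0 * x)


# Cache the clean activations, then patch them one at a time into the corrupt run.
clean_cache = {}
clean_metric = run(CLEAN, cache=clean_cache)
corrupt_metric = run(CORRUPT)

grid = {}
for name in CHECKPOINTS:
    for pos in range(len(CLEAN)):
        patched = run(CORRUPT, patch=(name, pos, clean_cache[name][pos]))
        grid[name, pos] = (patched - corrupt_metric) / (clean_metric - corrupt_metric)

for name in CHECKPOINTS:
    print(f"{name:>8}", [grid[name, pos] for pos in range(len(CLEAN))])
```

In this toy the embed row puts all the recovery on the direction word, and the attn_out and mlp_out rows both put it on the final position. The MLP cannot move anything between positions, so its row repeats the attention row. With only one layer the toy is all-or-nothing and does not reproduce the partial values in the lesson's grid.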

This is the causal capstone: attention told us where the model looks, DLA told us which component writes the answer, and patching confirms which activations the decision actually depends on — and watches the discriminative signal migrate from the input word to the readout position as it flows up the stack.