8 · Synthesis: across both axes
Seven lessons in, we can state the whole circuit in one breath and then point at the evidence for each clause.
The circuit. The model reads the single word that disambiguates the command, attention
copies that word to the final <sep> position, and the unembedding turns the residual
vector there into an action logit — with the MLP making the coarse "movement vs object" cut
and the attention heads supplying the fine "which direction." Every distinct phrasing of an
action lands on essentially the same readout vector, so the surface words stop mattering
once they reach the prediction slot.
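Because the unembedding is linear, the logit at the readout slot splits into one term per component, which is what lets the coarse cut and the fine cut be measured separately. The sketch below is a toy with made-up numbers, not the lesson's model: the component outputs, the `unembed` rows, and the two actions are all hypothetical, and it assumes no layer norm sits between the residual stream and the unembedding. It uses plain Python lists to stay dependency-free.

```python
# Toy direct logit attribution. The residual vector at <sep> is a sum of
# component outputs and the unembedding is linear, so each component's share
# of a logit can be read off on its own. All numbers here are invented.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vec_add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]

# Hypothetical component outputs written into the residual stream at <sep>.
components = {
    "embed": [0.1, 0.0, 0.0],
    "attn":  [0.0, 0.4, -0.4],
    "mlp":   [2.0, 0.1, 0.1],
}

# Hypothetical unembedding rows, one per action.
unembed = {
    "north": [1.0, 1.0, 0.0],
    "south": [1.0, 0.0, 1.0],
}

residual = vec_add(*components.values())

def logit(action):
    """Full logit for one action, read from the summed residual."""
    return dot(residual, unembed[action])

def attribution(action):
    """Per-component contribution to a single raw logit."""
    return {name: dot(out, unembed[action]) for name, out in components.items()}

def logit_diff_attribution(a, b):
    """Per-component contribution to logit(a) - logit(b)."""
    direction = [x - y for x, y in zip(unembed[a], unembed[b])]
    return {name: dot(out, direction) for name, out in components.items()}

raw = attribution("north")
diff = logit_diff_attribution("north", "south")
print("raw logit shares:      ", raw)
print("logit-difference shares:", diff)
```

In this toy the MLP term is the largest share of the raw "north" logit, yet it cancels in the north-minus-south difference, where the attention term is the only one that matters. That is the same two-tier pattern in miniature.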
Each clause is something we actually saw, from a different angle:
- Attention (lesson 3): the <sep> row attends almost entirely to the discriminative word ("upward" in "head upward"), ignoring the shared filler.
- Direct logit attribution (lesson 4): the computation is two-tier. The MLP dominates the raw logits (the coarse movement-vs-object cut), while the small attention-head contributions carry the actual direction decision, visible only once you attribute the logit difference.
- Clustering (lesson 5): synonymous phrases collapse to nearly identical residual vectors (intra-action cosine ≈ 1.00). The abstraction is real: "go north", "head upward", and "travel north" are the same thing inside the model.
- Patching (lesson 7): the causal confirmation. Patching one activation at a time shows the discriminative signal start on the input word and migrate, via attention, to the <sep> readout slot; the position-wise MLPs never move it across positions.
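The patching view can be illustrated with a very small stand-in. The sketch below is not the course's hand-rolled transformer or its hook/cache: the 3-d embeddings, the fixed attention scoring rule, the `hook` and `run` helpers, and the "head downward" corrupted prompt are all hypothetical, and the toy has no MLP at all. It only shows the mechanics of caching a clean run, overwriting one activation in a corrupted run, and measuring how much of the logit difference comes back.

```python
import math

# Hypothetical embeddings: dims 0 and 1 carry "up" and "down", dim 2 is filler.
EMBED = {
    "head":     [0.0, 0.0, 1.0],
    "upward":   [1.0, 0.0, 0.0],
    "downward": [0.0, 1.0, 0.0],
    "<sep>":    [0.0, 0.0, 0.0],
}

def hook(stage, resid, patch, cache):
    """Record activations for `stage`, then optionally overwrite one position."""
    if cache is not None:
        cache[stage] = [list(x) for x in resid]
    if patch is not None and patch[0] == stage:
        _, pos, vector = patch
        resid = [list(x) for x in resid]
        resid[pos] = list(vector)
    return resid

def run(tokens, patch=None, cache=None):
    """Return logit(up) - logit(down), read from the final position."""
    resid = hook("embed", [list(EMBED[t]) for t in tokens], patch, cache)
    # Toy attention: the final position scores every token by how
    # direction-like it is, then adds the weighted mix to its own residual.
    scores = [5.0 * (x[0] + x[1]) for x in resid]
    total = sum(math.exp(s) for s in scores)
    weights = [math.exp(s) / total for s in scores]
    mixed = [sum(w * x[d] for w, x in zip(weights, resid)) for d in range(3)]
    resid = [list(x) for x in resid]
    resid[-1] = [a + b for a, b in zip(resid[-1], mixed)]
    resid = hook("post_attn", resid, patch, cache)
    return resid[-1][0] - resid[-1][1]

clean = ["head", "upward", "<sep>"]
corrupt = ["head", "downward", "<sep>"]

clean_cache = {}
clean_diff = run(clean, cache=clean_cache)
corrupt_diff = run(corrupt)

# Patch one (stage, position) at a time and record the fraction of the
# clean logit difference that is restored.
recovery = {}
for stage in ("embed", "post_attn"):
    for pos, token in enumerate(corrupt):
        patched = run(corrupt, patch=(stage, pos, clean_cache[stage][pos]))
        recovery[(stage, pos)] = (patched - corrupt_diff) / (clean_diff - corrupt_diff)

for key, value in recovery.items():
    print(key, round(value, 3))
```

Only two cells restore the clean behaviour in this toy: the discriminative word before attention, and the <sep> slot after attention. Patching the word's own position after attention does nothing, because by then the readout no longer looks there.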
That is the data axis. The model axis (lesson 6) gave the most instructive, and most deflating, result: on a task this easy, depth barely matters. A 1-layer model already reaches 100% accuracy and fully clusters the synonyms; adding a second layer yields essentially the same final separation (0.63 vs 0.65). Depth changes only where the abstraction forms (the 2-layer model defers all of the action separation to its last layer), not whether it forms.
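A clustering check of this kind can be sketched by comparing cosine similarity within an action to cosine similarity across actions. Everything below is illustrative: the readout vectors are invented, "go south" and "head downward" are hypothetical phrases added to give a second action, and the `gap` summary is one possible way to express separation, not necessarily the metric behind the 0.63 and 0.65 figures.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# (phrase, action label, made-up residual vector at the readout slot)
readouts = [
    ("go north",      "north", [1.00, 0.02, 0.00]),
    ("head upward",   "north", [0.98, 0.00, 0.03]),
    ("travel north",  "north", [1.01, 0.01, 0.01]),
    ("go south",      "south", [0.30, 1.00, 0.00]),
    ("head downward", "south", [0.28, 0.97, 0.02]),
]

# Sort every pair of phrases into "same action" or "different action".
intra, inter = [], []
for (_, act_a, vec_a), (_, act_b, vec_b) in combinations(readouts, 2):
    (intra if act_a == act_b else inter).append(cosine(vec_a, vec_b))

mean_intra = sum(intra) / len(intra)
mean_inter = sum(inter) / len(inter)
gap = mean_intra - mean_inter  # hypothetical separation summary

print(f"intra-action cosine: {mean_intra:.3f}")
print(f"inter-action cosine: {mean_inter:.3f}")
print(f"gap:                 {gap:.3f}")
```

Running the same comparison on the residual stream after each layer is one way to see where, rather than whether, the synonyms collapse.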
[Experiment grid]
What this shows, and what it doesn't. We got a clean, fully reverse-engineered circuit, and four independent views — attention, DLA, clustering, patching — that agree on the same story. That agreement is the real payoff: each method is fallible alone, but they triangulate. The honest caveat is that the headline question, "does depth enable abstraction?", is under-stressed here: the task is so easy that the abstraction already appears at the minimum capacity, so the depth axis has nothing to bite on. The empty cells in the grid above are the to-do list, not an oversight.
The next dial to turn is difficulty, on either axis: many more synonyms per action, or compositional commands ("go north then pick up") that force the model to compose representations instead of mapping one discriminative word. That is where depth should finally begin to matter — and where these same four views become a genuine test of the abstraction a deeper model can build that a shallow one cannot. The machinery assembled here — the hand-rolled transformer, our own hook/cache, and the four analyses — is exactly what that investigation would reuse.