2 · Baseline: a 1-layer model, no synonyms
Before any interpretability, we confirm the simplest model solves the simplest task: a 1-layer transformer on the richness-0 dataset. With one phrase per action, the model can essentially memorize a lookup table.
First, the model itself. It is a hand-rolled decoder-only transformer — deliberately minimal so every component is legible. Click through the pieces; the labels here (the residual stream and the hook-point names) are exactly what the attention, attribution, and patching lessons reach into later.
Training is full-batch (every step sees all six examples) with cross-entropy on the single
action token at the final <sep> position. Toggle between val loss and val accuracy:
the loss falls to ~0 and accuracy snaps to 100% within a few hundred steps — with one phrase
per action there is nothing to generalize, just a table to memorize.
Loading training curves…
That clean, fast convergence is the baseline. Everything interesting comes from asking how the model represents the mapping internally — which is where the next lessons go.