2 · Baseline: a 1-layer model, no synonyms

Before any interpretability, we confirm the simplest model solves the simplest task: a 1-layer transformer on the richness-0 dataset. With one phrase per action, the model can essentially memorize a lookup table.

First, the model itself. It is a hand-rolled decoder-only transformer, deliberately minimal so that every component is legible. The labels used here (the residual stream and the hook-point names) are exactly what the attention, attribution, and patching lessons reach into later.

input sequence (n_ctx = 4)

↓

residual stream · d_model 64

↓

logits over vocab → argmax at <sep> = predicted action

No LayerNorm, no unembed bias. Every component just adds to the residual stream, so the final logits are an exact linear sum of those writes — which is what lets direct logit attribution and activation patching read the stream cleanly.
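To make the "exact linear sum" claim concrete, here is a minimal NumPy sketch. It does not use the lesson's trained weights: the component writes and the unembedding `W_U` are random stand-ins, and `d_vocab = 12` is an assumed placeholder, since the vocabulary size is not stated here. It only illustrates that, without LayerNorm or an unembed bias, per-component logit contributions add up to the full logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 64, 12  # d_vocab is an assumed placeholder

# Hypothetical writes into the residual stream at the final <sep> position.
# In the real model these would come from the embedding, attention, and MLP.
writes = {
    "embed": rng.normal(size=d_model),
    "attn": rng.normal(size=d_model),
    "mlp": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix, no bias

# The final residual stream is just the sum of every component's write.
resid_final = sum(writes.values())
logits = resid_final @ W_U

# Unembedding is linear, so each component gets its own logit contribution.
contrib = {name: w @ W_U for name, w in writes.items()}

# The per-component contributions reconstruct the logits exactly
# (up to floating-point error).
assert np.allclose(logits, sum(contrib.values()))
```

With LayerNorm in the path, the same decomposition would only hold approximately, because the normalization scale depends on the whole residual vector.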

Decoder-only transformer: 1 layer, d_model 64, 4 heads (d_head 16), d_mlp 256.
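The shapes above can be sketched as a single forward pass in NumPy. This is an illustrative sketch, not the lesson's actual implementation: the parameter names, the learned positional embedding, the ReLU activation, the absence of all biases, and `d_vocab = 12` are assumptions made for the example. Only the dimensions (n_ctx 4, d_model 64, 4 heads of d_head 16, d_mlp 256) and the lack of LayerNorm come from the text.

```python
import numpy as np

n_ctx, d_model, n_heads, d_head, d_mlp = 4, 64, 4, 16, 256
d_vocab = 12  # assumed placeholder; not given in the text

rng = np.random.default_rng(0)

def init(*shape):
    """Small random weights, standing in for trained parameters."""
    return rng.normal(scale=0.02, size=shape)

params = {
    "W_E": init(d_vocab, d_model),           # token embedding
    "W_pos": init(n_ctx, d_model),           # positional embedding (assumed learned)
    "W_Q": init(n_heads, d_model, d_head),
    "W_K": init(n_heads, d_model, d_head),
    "W_V": init(n_heads, d_model, d_head),
    "W_O": init(n_heads, d_head, d_model),
    "W_in": init(d_model, d_mlp),
    "W_out": init(d_mlp, d_model),
    "W_U": init(d_model, d_vocab),           # unembedding, no bias
}

def forward(tokens, p):
    n = len(tokens)
    resid = p["W_E"][tokens] + p["W_pos"][:n]              # (n, d_model)

    # Attention: each head reads the stream and writes back into it.
    q = np.einsum("pd,hde->hpe", resid, p["W_Q"])
    k = np.einsum("pd,hde->hpe", resid, p["W_K"])
    v = np.einsum("pd,hde->hpe", resid, p["W_V"])
    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(d_head)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)     # causal mask
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    pattern = np.exp(scores)
    pattern = pattern / pattern.sum(axis=-1, keepdims=True)
    z = np.einsum("hqk,hke->hqe", pattern, v)
    resid = resid + np.einsum("hqe,hed->qd", z, p["W_O"])

    # MLP: also a pure additive write (ReLU is an assumption).
    resid = resid + np.maximum(resid @ p["W_in"], 0.0) @ p["W_out"]

    return resid @ p["W_U"]                                # (n, d_vocab)

tokens = np.array([1, 2, 3, 0])  # hypothetical token ids ending in <sep>
logits = forward(tokens, params)
predicted_action = int(logits[-1].argmax())  # read the prediction at <sep>
```

Because nothing normalizes the residual stream, the attention and MLP lines are plain additions, which is the property the later attribution and patching lessons rely on.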

Training is full-batch (every step sees all six examples) with cross-entropy on the single action token at the final <sep> position. The validation loss falls to ~0 and validation accuracy snaps to 100% within a few hundred steps: with one phrase per action there is nothing to generalize, just a table to memorize.
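As a rough sketch of the objective just described, the function below computes cross-entropy and accuracy from the logits at the final position only. The function name, the array layout `(batch, n_ctx, d_vocab)`, the vocabulary size of 12, and the target ids are all hypothetical choices for illustration; only the batch of six and the use of the final <sep> position come from the text.

```python
import numpy as np

def final_position_loss(logits, targets):
    """Cross-entropy and accuracy at the final position.

    logits:  (batch, n_ctx, d_vocab) array of model outputs
    targets: (batch,) array of correct action token ids
    """
    final = logits[:, -1, :]                              # keep only the <sep> position
    final = final - final.max(axis=-1, keepdims=True)     # numerical stability
    log_probs = final - np.log(np.exp(final).sum(axis=-1, keepdims=True))
    loss = -log_probs[np.arange(len(targets)), targets].mean()
    acc = (final.argmax(axis=-1) == targets).mean()
    return loss, acc

batch, n_ctx, d_vocab = 6, 4, 12       # d_vocab is an assumed placeholder
targets = np.arange(batch)             # hypothetical action ids, one per example

# An untrained model with uniform logits scores log(d_vocab).
uniform_loss, _ = final_position_loss(np.zeros((batch, n_ctx, d_vocab)), targets)

# A model that has memorized the table puts nearly all mass on the target.
memorized = np.zeros((batch, n_ctx, d_vocab))
memorized[np.arange(batch), -1, targets] = 20.0
memorized_loss, memorized_acc = final_position_loss(memorized, targets)
```

Full-batch training means this loss is evaluated over all six examples at every step, so the gradient has no sampling noise.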


That clean, fast convergence is the baseline. Everything interesting comes from asking how the model represents the mapping internally — which is where the next lessons go.