1 · The task
We build a tiny world where short natural-language commands map to a small set of
actions: NORTH, SOUTH, EAST, WEST, PICKUP, DROP. The model reads a command and must
predict the single action token. That is the whole task — no world state, no multi-step
plans — which is exactly what makes it small enough to reverse-engineer completely.
The grammar
Every command is exactly two words, so each example is a fixed-length sequence of four tokens:
<bos> w1 w2 <sep>
<bos> opens the sequence, the two words carry the command, and the model predicts the
action at the final <sep> position. Fixing the length to four removes padding entirely and
keeps the attention and logit-attribution views easy to read later: there is always one
prediction position fed by exactly two content words.
A handful of examples, with the tokenized sequence and the action the model must predict:
| command | tokenized sequence | target action |
|---|---|---|
| go north | <bos> go north <sep> | NORTH |
| head upward | <bos> head upward <sep> | NORTH |
| move east | <bos> move east <sep> | EAST |
| grab item | <bos> grab item <sep> | PICKUP |
| put down | <bos> put down <sep> | DROP |
Notice that go north and head upward share no words yet map to the same action — that collision is the whole point, and the dial below controls how much of it the model sees.
One prediction, not autoregression
The model is a decoder-only transformer — the same causal architecture GPT-style language
models use — but we use it differently. A language model is trained to predict the next
token at every position (given <bos>, predict w1; given <bos> w1, predict w2; …).
Our model is trained to do exactly one thing: read the whole command <bos> w1 w2 <sep>
in a single forward pass and predict the action only at the final <sep> position. The
loss is applied to that one position; the action token is the label, never fed back into
the input.
So there is no step where the model consumes … <sep> NORTH and keeps going — the input
is always exactly four tokens, and the answer is read off the last one. It behaves more like
a classifier ("which of the six actions?") than a text generator. The transformer
does still emit an output at every position (it produces one output vector per input
token, whatever the mask), but only the final one is supervised or meaningful; the
predictions at <bos>, w1, and w2 are untrained byproducts we ignore. Under the causal
mask, the final position is also the only one that can attend to both content words.
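To make "the loss is applied to that one position" concrete, here is a minimal sketch in plain Python. The function name `final_position_loss` and the toy logits are illustrative assumptions, not the curriculum's actual training code; the point is only that rows for <bos>, w1, and w2 never enter the loss.

```python
import math

def final_position_loss(logits, target_id):
    """Cross-entropy at the last sequence position only (hypothetical helper).

    logits: per-position score lists, shape [seq_len][vocab_size].
    target_id: vocabulary id of the correct action token.
    """
    last = logits[-1]                      # scores at the <sep> position
    m = max(last)                          # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in last))
    return log_z - last[target_id]         # equals -log softmax(last)[target_id]

# Toy input: 4 positions, a made-up vocabulary of 5 ids.
logits = [
    [9.0, 0.0, 0.0, 0.0, 0.0],   # <bos>: ignored by the loss
    [0.0, 9.0, 0.0, 0.0, 0.0],   # w1:    ignored by the loss
    [0.0, 0.0, 9.0, 0.0, 0.0],   # w2:    ignored by the loss
    [0.0, 0.0, 0.0, 2.0, 0.0],   # <sep>: the only supervised position
]
loss = final_position_loss(logits, target_id=3)
```

Changing any of the first three rows leaves `loss` unchanged, which is the sense in which those positions are unsupervised.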
The dial: synonymy
The knob we turn is how many ways the same action can be phrased. Each action has a canonical phrasing plus a set of synonyms:
| action | canonical | some synonyms |
|---|---|---|
| NORTH | go north | move north, head upward, travel north, … |
| EAST | go east | move east, head rightward, travel east, … |
| PICKUP | pick up | grab item, take object, lift item, … |
| DROP | put down | drop item, release object, leave item, … |
Richness sets how many of these the model sees: richness 0 gives one phrase per action (six commands total — pure memorization, a lookup table), and each level adds another synonym, up to six phrasings per action. The interpretability question rides on this dial: do synonymous phrases converge to the same internal representation, and at what model depth? That is the thread the rest of the curriculum pulls on.
One deliberate trap: the discriminative word sits in different positions across actions.
Directions are disambiguated by the second word (go north vs go south share "go"),
while PICKUP and DROP are disambiguated by the first (grab item vs drop item share "item").
So the model cannot cheat by always reading a fixed slot — it has to attend to the word that
actually carries the meaning. We will watch it do exactly that in lesson 3.
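The trap can be checked mechanically. The sketch below uses only the four phrases named above and asks whether a lookup on a single fixed slot would be enough; the helper names `differing_slots` and `slot_is_sufficient` are hypothetical, introduced here for illustration.

```python
# Phrases and actions taken from the examples in this lesson.
examples = {
    "go north": "NORTH",
    "go south": "SOUTH",
    "grab item": "PICKUP",
    "drop item": "DROP",
}

def differing_slots(a, b):
    """Word positions (0 = w1, 1 = w2) at which two commands differ."""
    return [i for i, (x, y) in enumerate(zip(a.split(), b.split())) if x != y]

def slot_is_sufficient(examples, slot):
    """True if the word in `slot` alone determines the action for every example."""
    seen = {}
    for phrase, action in examples.items():
        word = phrase.split()[slot]
        # setdefault records the first action seen for this word and returns it.
        if seen.setdefault(word, action) != action:
            return False
    return True

direction_slots = differing_slots("go north", "go south")   # second word differs
object_slots = differing_slots("grab item", "drop item")    # first word differs
fixed_slot_ok = [slot_is_sufficient(examples, s) for s in (0, 1)]
```

Neither slot is sufficient on its own: reading only w1 confuses go north with go south, and reading only w2 confuses grab item with drop item.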
Tokenization
Tokenization is word-level and tiny. The vocabulary is, in order: the two special tokens
<bos> and <sep>, then every command word (sorted), then the six action tokens
themselves. Because the action tokens live in the same vocabulary as the words, the model's
unembedding can emit an action directly at the <sep> position — there is no separate
classification head to reason about.
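As a rough illustration of that layout, the sketch below builds a vocabulary from the five example commands in the grammar table and encodes one of them. The function names and the ordering of the six action tokens (taken from the order they are listed in the task description) are assumptions; the real code in interp/data/ may differ.

```python
SPECIALS = ["<bos>", "<sep>"]
# Assumed order: the order the actions are listed in the task description.
ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST", "PICKUP", "DROP"]

def build_vocab(commands):
    """Specials first, then sorted command words, then the action tokens."""
    words = sorted({w for c in commands for w in c.split()})
    tokens = SPECIALS + words + ACTIONS
    return {tok: i for i, tok in enumerate(tokens)}

def encode(command, vocab):
    """Encode a two-word command as <bos> w1 w2 <sep>."""
    w1, w2 = command.split()
    return [vocab["<bos>"], vocab[w1], vocab[w2], vocab["<sep>"]]

commands = ["go north", "head upward", "move east", "grab item", "put down"]
vocab = build_vocab(commands)
ids = encode("go north", vocab)   # [0, 4, 9, 1] for this five-command vocabulary
```

Note that the word `north` and the action token `NORTH` are separate vocabulary entries: the first appears only in inputs, the second only as a target.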
How the dataset is generated
The data layer (interp/data/) turns a richness level into train/validation splits in four
steps:
- Pick the phrasings. The richness level selects how many phrasings each action gets, from one (richness 0) up to six (richness 5).
- Build the vocabulary. Collect every command word, sort it, and lay the vocab out as [<bos>, <sep>] + sorted words + the six action tokens, so every run at the same richness gets the same, stable token ids.
- Enumerate the universe. Form every (phrasing → action) pair at this richness and encode each as <bos> w1 w2 <sep>, paired with its action token as the target at the final position. At richness 0 the universe is just six examples; at richness 3 it is twenty-four.
- Sample the splits. Draw the training and validation examples from that universe with a fixed random seed, so the whole dataset is reproducible.
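The steps above can be sketched roughly as follows. This is not the code in interp/data/: the function names, the split sizes, and sampling with replacement are all assumptions. It also covers only the four actions and four phrasings each that the synonym table spells out, so its universe has 4 × (richness + 1) examples rather than the full 6 × (richness + 1), and it keeps examples as strings rather than token ids.

```python
import random

# Only the phrasings listed in the synonym table; SOUTH and WEST are not shown there.
PHRASINGS = {
    "NORTH": ["go north", "move north", "head upward", "travel north"],
    "EAST": ["go east", "move east", "head rightward", "travel east"],
    "PICKUP": ["pick up", "grab item", "take object", "lift item"],
    "DROP": ["put down", "drop item", "release object", "leave item"],
}

def enumerate_universe(richness):
    """Every (sequence, action) pair at this richness, in a fixed order."""
    if not 0 <= richness <= 3:
        raise ValueError("this sketch only lists four phrasings per action")
    n = richness + 1                      # richness 0 means one phrasing per action
    return [
        (f"<bos> {phrase} <sep>", action)
        for action, phrases in PHRASINGS.items()
        for phrase in phrases[:n]         # assumed: synonyms are added in table order
    ]

def sample_splits(universe, n_train, n_val, seed=0):
    """Draw both splits from the same universe with a seeded generator."""
    rng = random.Random(seed)             # fixed seed makes the draw reproducible
    train = rng.choices(universe, k=n_train)   # assumed: sampling with replacement
    val = rng.choices(universe, k=n_val)
    return train, val

universe = enumerate_universe(richness=1)
train, val = sample_splits(universe, n_train=32, n_val=8, seed=0)
```

Calling `sample_splits` again with the same seed returns the same two lists, which is the reproducibility property the last step describes.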
Because the universe is small and completely enumerable, the train and validation splits are drawn from the same set. That is deliberate: validation here measures how well the model fit the mapping, not held-out generalization — which is what we want, since every later lesson is about the learned representations, not a generalization gap. Training is then full-batch (each step sees the entire training set), which the next lesson shows converging almost immediately on the easiest setting.
With the task fixed, the next lesson trains the simplest possible model on the simplest setting (one phrase per action) and confirms it learns the lookup table.