1 · The task

We build a tiny world where short natural-language commands map to a small set of actions: NORTH, SOUTH, EAST, WEST, PICKUP, DROP. The model reads a command and must predict the single action token. That is the whole task — no world state, no multi-step plans — which is exactly what makes it small enough to reverse-engineer completely.

The grammar

Every command is exactly two words, so each example is a fixed-length sequence of four tokens:

<bos>   w1   w2   <sep>

<bos> opens the sequence, the two words carry the command, and the model predicts the action at the final <sep> position. Fixing the length to four removes padding entirely and keeps the attention and logit-attribution views easy to read later: there is always one prediction position fed by exactly two content words.

A handful of examples, with the tokenized sequence and the action the model must predict:

command        tokenized sequence           target action
go north       <bos> go north <sep>         NORTH
head upward    <bos> head upward <sep>      NORTH
move east      <bos> move east <sep>        EAST
grab item      <bos> grab item <sep>        PICKUP
put down       <bos> put down <sep>         DROP

Notice that go north and head upward share no words yet map to the same action — that collision is the whole point, and the dial below controls how much of it the model sees.

One prediction, not autoregression

The model is a decoder-only transformer — the same causal architecture GPT-style language models use — but we use it differently. A language model is trained to predict the next token at every position (given <bos>, predict w1; given <bos> w1, predict w2; …). Our model is trained to do exactly one thing: read the whole command <bos> w1 w2 <sep> in a single forward pass and predict the action only at the final <sep> position. The loss is applied to that one position; the action token is the label, never fed back into the input.

So there is no step where the model consumes … <sep> NORTH and keeps going: the input is always exactly four tokens, and the answer is read off the last one. It behaves more like a classifier ("which of the six actions?") than a text generator. The transformer does still emit an output at every position (every position produces its own logits; the causal mask only restricts which earlier positions each one can attend to), but only the final one is supervised or meaningful; the predictions at <bos>, w1, and w2 are untrained byproducts we ignore.
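To make "the loss is applied to that one position" concrete, here is a minimal sketch in plain Python. It is not the curriculum's training code; the function name final_position_loss and the toy logits are made up for illustration. It computes the cross-entropy from the last position's logits only, so whatever the model emits at <bos>, w1, and w2 has no effect on the loss.

```python
import math

def final_position_loss(logits, target_id):
    """Cross-entropy at the last sequence position only.

    logits: per-position logit vectors, shape [seq_len][vocab_size]
    target_id: vocabulary index of the correct action token
    """
    last = logits[-1]                       # logits at the <sep> position
    m = max(last)                           # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in last))
    return log_z - last[target_id]          # equals -log softmax(last)[target_id]

# Toy example: 4 positions, a 3-token vocabulary (hypothetical numbers).
logits = [
    [9.0, 0.0, 0.0],   # <bos>  ignored by the loss
    [0.0, 9.0, 0.0],   # w1     ignored by the loss
    [0.0, 0.0, 9.0],   # w2     ignored by the loss
    [2.0, 0.0, 0.0],   # <sep>  the only supervised position
]
loss = final_position_loss(logits, target_id=0)
```

Changing any of the first three rows leaves loss unchanged, which is the sense in which those outputs are unsupervised byproducts.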

The dial: synonymy

The knob we turn is how many ways the same action can be phrased. Each action has a canonical phrasing plus a set of synonyms:

action    canonical    some synonyms
NORTH     go north     move north, head upward, travel north, …
EAST      go east      move east, head rightward, travel east, …
PICKUP    pick up      grab item, take object, lift item, …
DROP      put down     drop item, release object, leave item, …

Richness sets how many of these the model sees: richness 0 gives one phrase per action (six commands total — pure memorization, a lookup table), and each level adds another synonym, up to six phrasings per action. The interpretability question rides on this dial: do synonymous phrases converge to the same internal representation, and at what model depth? That is the thread the rest of the curriculum pulls on.
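The arithmetic behind the dial is simple enough to write down. The sketch below assumes, as the text states, that richness r gives r + 1 phrasings per action across all six actions; the helper names are illustrative and not taken from the codebase.

```python
ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST", "PICKUP", "DROP"]

def phrasings_per_action(richness: int) -> int:
    """Richness 0 is the canonical phrase alone; each level adds one synonym."""
    if not 0 <= richness <= 5:
        raise ValueError("richness must be between 0 and 5")
    return richness + 1

def universe_size(richness: int) -> int:
    """Total number of distinct commands at this richness level."""
    return len(ACTIONS) * phrasings_per_action(richness)

# Number of distinct commands at every richness level.
sizes = {r: universe_size(r) for r in range(6)}
```

Here sizes[0] is 6 (the lookup-table setting) and sizes[5] is 36, the largest universe the dial can produce.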

One deliberate trap: the discriminative word sits in different positions across actions. Directions are disambiguated by the second word (go north vs go south share "go"), while PICKUP and DROP are disambiguated by the first (grab item vs drop item share "item"). So the model cannot cheat by always reading a fixed slot; it has to attend to the word that actually carries the meaning. We will watch it do exactly that in lesson 3.
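The trap can be checked mechanically. The sketch below uses only phrases quoted in this lesson (a partial list, not the full dataset) and a hypothetical helper, ambiguous_words, that reports which words in a given slot appear under more than one action. If either slot alone were sufficient, its report would be empty.

```python
from collections import defaultdict

# Phrases quoted in this lesson; a partial list, not the full dataset.
PHRASES = {
    "go north": "NORTH",
    "go south": "SOUTH",
    "head upward": "NORTH",
    "head rightward": "EAST",
    "grab item": "PICKUP",
    "drop item": "DROP",
}

def ambiguous_words(phrases, slot):
    """Words at `slot` (0 = first, 1 = second) that occur under several actions."""
    seen = defaultdict(set)
    for phrase, action in phrases.items():
        seen[phrase.split()[slot]].add(action)
    return {word: sorted(actions) for word, actions in seen.items() if len(actions) > 1}

first_slot = ambiguous_words(PHRASES, 0)    # "go" and "head" are ambiguous here
second_slot = ambiguous_words(PHRASES, 1)   # "item" is ambiguous here
```

Both reports are non-empty, so a model that reads only one fixed slot must get some commands wrong.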

Tokenization

Tokenization is word-level and tiny. The vocabulary is, in order: the two special tokens <bos> and <sep>, then every command word (sorted), then the six action tokens themselves. Because the action tokens live in the same vocabulary as the words, the model's unembedding can emit an action directly at the <sep> position — there is no separate classification head to reason about.
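A minimal sketch of that layout follows. The function names build_vocab and encode are hypothetical, and the order of the six action tokens is assumed to follow the order in which the lesson lists them; the text only fixes that they come last.

```python
SPECIALS = ["<bos>", "<sep>"]
# Assumed order: the order in which the lesson lists the actions.
ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST", "PICKUP", "DROP"]

def build_vocab(phrases):
    """Lay out ids as specials, then sorted command words, then action tokens."""
    words = sorted({w for p in phrases for w in p.split()})
    tokens = SPECIALS + words + ACTIONS
    return {tok: i for i, tok in enumerate(tokens)}

def encode(phrase, vocab):
    """Encode a two-word command as <bos> w1 w2 <sep>."""
    w1, w2 = phrase.split()
    return [vocab["<bos>"], vocab[w1], vocab[w2], vocab["<sep>"]]

# Four canonical phrasings taken from the lesson's synonym table.
phrases = ["go north", "go east", "pick up", "put down"]
vocab = build_vocab(phrases)
ids = encode("go north", vocab)
```

Because the words are sorted before ids are assigned, the same set of phrases always yields the same ids, and the target for "go north" is simply vocab["NORTH"], an ordinary entry in the same table.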

How the dataset is generated

The data layer (interp/data/) turns a richness level into train/validation splits in four steps:

  1. Pick the phrasings. The richness level selects how many phrasings each action gets, from one (richness 0) up to six (richness 5).
  2. Build the vocabulary. Collect every command word, sort it, and lay the vocab out as [<bos>, <sep>] + sorted words + the six action tokens — so every run at the same richness gets the same, stable token ids.
  3. Enumerate the universe. Form every (phrasing → action) pair at this richness and encode each as <bos> w1 w2 <sep>, paired with its action token as the target at the final position. At richness 0 the universe is just six examples; at richness 3 it is twenty-four.
  4. Sample the splits. Draw the training and validation examples from that universe with a fixed random seed, so the whole dataset is reproducible.
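The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the code in interp/data/: make_dataset is a hypothetical name, the phrase table holds only the four actions and four phrasings the lesson lists (so it supports richness 0 to 3 and yields 16 examples at richness 3 rather than 24), synonyms are assumed to be added in table order, and sampling with replacement is assumed.

```python
import random

ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST", "PICKUP", "DROP"]

# Canonical phrase first, then synonyms in the order the lesson lists them.
# Partial table: SOUTH and WEST phrasings are not given in the lesson.
PHRASINGS = {
    "NORTH": ["go north", "move north", "head upward", "travel north"],
    "EAST": ["go east", "move east", "head rightward", "travel east"],
    "PICKUP": ["pick up", "grab item", "take object", "lift item"],
    "DROP": ["put down", "drop item", "release object", "leave item"],
}

def make_dataset(richness, n_train, n_val, seed=0):
    if not 0 <= richness <= 3:
        raise ValueError("this partial table only supports richness 0 to 3")
    # Step 1: pick the phrasings.
    chosen = {a: ps[: richness + 1] for a, ps in PHRASINGS.items()}
    # Step 2: build the vocabulary with stable ids.
    words = sorted({w for ps in chosen.values() for p in ps for w in p.split()})
    vocab = {t: i for i, t in enumerate(["<bos>", "<sep>"] + words + ACTIONS)}
    # Step 3: enumerate every (sequence, target) pair.
    universe = [
        ([vocab["<bos>"], *(vocab[w] for w in p.split()), vocab["<sep>"]], vocab[a])
        for a, ps in chosen.items()
        for p in ps
    ]
    # Step 4: sample both splits from the same universe with a seeded RNG
    # (sampling with replacement is an assumption).
    rng = random.Random(seed)
    train = [rng.choice(universe) for _ in range(n_train)]
    val = [rng.choice(universe) for _ in range(n_val)]
    return vocab, universe, train, val

vocab, universe, train, val = make_dataset(richness=3, n_train=32, n_val=8, seed=0)
```

Calling make_dataset again with the same arguments returns identical splits, which is the reproducibility property step 4 is after.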

Because the universe is small and completely enumerable, the train and validation splits are drawn from the same set. That is deliberate: validation here measures how well the model fit the mapping, not held-out generalization — which is what we want, since every later lesson is about the learned representations, not a generalization gap. Training is then full-batch (each step sees the entire training set), which the next lesson shows converging almost immediately on the easiest setting.

With the task fixed, the next lesson trains the simplest possible model on the simplest setting (one phrase per action) and confirms it learns the lookup table.