6 · Turning the model dial: depth
Does abstraction depend on capacity? We hold the synonym-rich data fixed and add depth.
The catch, and the reason this lesson looks at representations rather than accuracy: on this toy task even the 1-layer model already reaches 100% accuracy. Both models, trained on the same synonym-rich (richness-3) data, converge to a val accuracy of 1.0; their loss curves differ only during the early transient. Toggle to val accuracy to see both curves pin to 1.0:
[Interactive chart: training curves (loss / val accuracy), 1-layer vs 2-layer]
Accuracy saturates, so it tells us nothing about how the model gets there. The clustering view is what lets us see the difference.
Here is the 2-layer model on the same data as lesson 5. Step the Depth selector through layer 1 → layer 2 and watch what happens.
[Interactive view: clustering for the 2-layer model, with Depth selector]
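The result is more interesting than "deeper clusters harder." The table reports cosine similarity between final-position representations, for pairs within the same action (intra) and across different actions (inter); separation is intra minus inter: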
| model | depth | intra-action | inter-action | separation |
|---|---|---|---|---|
| 1-layer | layer 1 | 1.00 | 0.37 | 0.63 |
| 2-layer | layer 1 | 1.00 | 1.00 | 0.00 |
| 2-layer | layer 2 | 1.00 | 0.35 | 0.65 |
Two things stand out. First, synonyms collapse to essentially the same direction (intra ≈ 1.00) in both models, so in the 1-layer model a single layer already abstracts away the surface phrasing. Second, the final separation is nearly identical (0.63 vs 0.65); the extra layer does not buy tighter clusters.
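The intra, inter, and separation columns can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code behind the lesson: the function name `separation_stats` is hypothetical, and it assumes the reported numbers are means over all pairs of examples (the lesson does not say whether pairs or per-action centroids were used). It also assumes every action has at least two examples, so the intra-action set is non-empty.

```python
import numpy as np

def separation_stats(vectors, action_ids):
    """Mean intra-action and inter-action cosine similarity, and their gap.

    vectors:    (n_examples, d_model) final-position activations at one layer.
    action_ids: (n_examples,) label of the action each example encodes.
    Assumption: averages are taken over all pairs of distinct examples.
    """
    vectors = np.asarray(vectors, dtype=float)
    action_ids = np.asarray(action_ids)

    # Normalise each row so the Gram matrix holds pairwise cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T

    same_action = action_ids[:, None] == action_ids[None, :]
    not_self = ~np.eye(len(action_ids), dtype=bool)  # drop each vector's pair with itself

    intra = sims[same_action & not_self].mean()
    inter = sims[~same_action].mean()
    return intra, inter, intra - inter

# Toy check: two actions, two "synonyms" each, with orthogonal action directions.
toy_vectors = [[1, 0], [2, 0], [0, 1], [0, 3]]
toy_actions = [0, 0, 1, 1]
intra, inter, separation = separation_stats(toy_vectors, toy_actions)
# intra = 1.0, inter = 0.0, separation = 1.0
```

Running the same function on the activations captured after each layer gives one table row per (model, depth) pair. A row like the 2-layer model's layer 1 corresponds to the case where every vector points the same way, so intra and inter are both about 1 and the separation is about 0.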
What depth changes is where the work happens. The 2-layer model leaves the prediction position almost untouched after layer 1 (all pairs are still ≈1.00 similar, so separation is 0.00), then does all of the action separation in layer 2. The 1-layer model has no choice but to do it immediately. So on a task this easy, depth doesn't sharpen the abstraction; it only shifts the layer at which it forms. To make depth genuinely matter for abstraction we'd need a harder dial: many more synonyms, or compositional commands. That's the next axis to turn.