6 · Turning the model dial: depth

Does abstraction depend on capacity? We hold the synonym-rich data fixed and add depth.

The catch — and the reason this lesson is representational, not accuracy-driven: on this toy task even the 1-layer model already reaches 100% accuracy. Both depths, trained on the same synonym-rich (richness-3) data, converge to val accuracy 1.0; the loss curves differ only in the transient. Toggle to val accuracy to see them both pin to 1.0:

[Interactive chart: training loss and val accuracy for the 1-layer and 2-layer models]

Accuracy saturates, so it tells us nothing about how the model gets there. The clustering view is what lets us see the difference.

Here is the 2-layer model on the same data as lesson 5. Step the Depth selector from layer 1 to layer 2 and watch what happens.

[Interactive view: clustering for the 2-layer model, with a Depth selector (layer 1, layer 2)]

The result is more interesting than "deeper clusters harder." Measuring cosine similarity at the final position, within an action vs across actions:

model    depth    intra-action  inter-action  separation
1-layer  layer 1  1.00          0.37          0.63
2-layer  layer 1  1.00          1.00          0.00
2-layer  layer 2  1.00          0.35          0.65
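
For readers who want to reproduce numbers like these, here is a minimal sketch of the measurement. It assumes separation is simply intra minus inter (which matches every row of the table) and that each value is a mean over pairs of commands; the lesson does not spell out the averaging. The function name, the toy vectors, and the "open"/"close" labels are made up for illustration.

```python
import numpy as np

def cluster_separation(reps, actions):
    """Mean intra-action and inter-action cosine similarity, and their gap.

    reps:    (n, d) array, one final-position hidden vector per command
    actions: length-n sequence of action labels
    Assumption: separation = intra - inter, averaged over command pairs.
    """
    reps = np.asarray(reps, dtype=float)
    actions = np.asarray(actions)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = unit @ unit.T
    same = actions[:, None] == actions[None, :]
    off_diag = ~np.eye(len(actions), dtype=bool)  # ignore self-pairs
    intra = sim[same & off_diag].mean()  # synonyms of the same action
    inter = sim[~same].mean()            # commands from different actions
    return intra, inter, intra - inter

# Toy check with hypothetical vectors: two actions, two synonyms each.
reps = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
actions = ["open", "open", "close", "close"]
intra, inter, sep = cluster_separation(reps, actions)
print(f"intra={intra:.2f} inter={inter:.2f} separation={sep:.2f}")
# prints: intra=1.00 inter=0.00 separation=1.00
```

Running this once per layer, on the hidden state at the final position, would give one row of the table per layer.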

Two things stand out. First, synonyms collapse to essentially the same vector (intra ≈ 1.00) in both models — one layer already fully abstracts away the surface phrasing. Second, the final separation is basically identical (0.63 vs 0.65); the extra layer does not buy tighter clusters.

What depth changes is where the work happens. The 2-layer model leaves the prediction position almost untouched after layer 1 (all commands are still ≈1.00 similar, regardless of action), then does all the action-separation in layer 2. The 1-layer model has no choice but to do it immediately. So on a task this easy, depth doesn't sharpen the abstraction; it just shifts which layer the abstraction forms in. To make depth genuinely matter for abstraction we'd need a harder dial: many more synonyms, or compositional commands. That's the next axis to turn.
