Core Questions

1. What is the object that should be identified?

Is the goal:

  • a latent code z that helps predict y,
  • a latent strategy label that corresponds to a human-interpretable reasoning mode,
  • or a globally consistent decomposition where the same latent index means the same thing across tasks?

2. What counts as identifiability?

Possible distinctions:

  • task-wise identifiability: for a fixed input x, can we recover the latent partition relevant to that single problem?
  • problem-wise identifiability: do we recover the right strategy classes across a whole family of inputs sharing the same task?
  • globally consistent identifiability: does a single latent index carry the same strategy semantics across the whole data distribution?
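The gap between the last two notions can be made concrete with a small diagnostic. A minimal sketch, with illustrative toy labels (the two "tasks" and their latent assignments are assumptions, not data from the experiments): score latent labels against ground truth under the best relabeling, first per task, then with one global relabeling. A latent can be perfect per task yet inconsistent globally.

```python
from itertools import permutations

def best_alignment_accuracy(latent, truth, n_labels):
    """Accuracy of latent labels under the best single relabeling of indices."""
    best = 0.0
    for perm in permutations(range(n_labels)):
        acc = sum(perm[z] == t for z, t in zip(latent, truth)) / len(truth)
        best = max(best, acc)
    return best

# Two toy tasks with ground-truth strategies 0/1; on task B the
# latent indices happen to be swapped relative to task A.
truth_a, latent_a = [0, 0, 1, 1], [0, 0, 1, 1]
truth_b, latent_b = [0, 1, 1, 0], [1, 0, 0, 1]

per_task = (best_alignment_accuracy(latent_a, truth_a, 2)
            + best_alignment_accuracy(latent_b, truth_b, 2)) / 2
global_acc = best_alignment_accuracy(latent_a + latent_b,
                                     truth_a + truth_b, 2)
print(per_task)    # 1.0: each task is identifiable up to relabeling
print(global_acc)  # 0.5: the same index does not mean the same thing globally
```

The brute-force permutation search is fine for a handful of labels; for larger strategy sets the same alignment can be computed with a Hungarian matching on the confusion matrix.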

3. When is a “useful latent” not enough?

A useful latent can still fail to be identifiable if:

  • it is only predictive on the training support,
  • it changes meaning across examples,
  • it encodes surface-level regularities instead of strategies,
  • or it relies on arbitrary relabelings that are not stable across inputs.
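The "surface-level regularities" failure mode can be illustrated with a deliberately bad latent. A hypothetical sketch (the prompts, labels, and length threshold are all invented for illustration): a code that tracks prompt length happens to predict y on the training support, then breaks under a surface shift, even though nothing about the underlying strategies changed.

```python
def surface_latent(prompt):
    """A 'latent' that encodes a surface regularity, not a strategy."""
    return int(len(prompt) > 10)

# On the training support, prompt length correlates with the label...
train = [("add 2 3", 0), ("multiply 12 34", 1)]
# ...but under a harmless surface reformatting the correlation breaks.
shift = [("2+3", 0), ("12*34", 1)]

train_acc = sum(surface_latent(p) == y for p, y in train) / len(train)
shift_acc = sum(surface_latent(p) == y for p, y in shift) / len(shift)
print(train_acc)  # 1.0: "useful" on the training support
print(shift_acc)  # 0.5: meaningless once the surface form changes
```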

4. What structure makes strategy semantics recoverable?

Candidate ingredients:

  • a small finite set of strategies,
  • strategy-specific trajectory structure that persists long enough to be observed,
  • enough variation across examples to separate strategy from prompt surface form,
  • and supervision or verification signals that make semantics observable.
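These ingredients can be wired into a toy generative model, which is roughly the shape a synthetic setting needs. A minimal sketch in which every name and distribution is an assumption for illustration: a small finite strategy set, a strategy-specific step pattern that persists across the trajectory, and surface form that varies independently of the strategy.

```python
import random

STRATEGIES = ["decompose", "guess-and-check"]  # small finite set

def sample_example(rng):
    z = rng.randrange(len(STRATEGIES))          # ground-truth strategy
    surface = rng.choice(["short", "long"])     # varies independently of z
    # Strategy-specific step pattern persisting long enough to be observed.
    step = "sub" if z == 0 else "try"
    trajectory = [f"{step}-{i}" for i in range(4)]
    return {"strategy": z, "surface": surface, "trajectory": trajectory}

rng = random.Random(0)
ex = sample_example(rng)
print(ex["trajectory"])  # every step carries the strategy's signature
```

Because surface form and strategy are sampled independently, no surface feature can stand in for the strategy, which is exactly the separation the third ingredient asks for.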

5. How should the theory connect to exp2?

The synthetic setting is useful because it can support:

  • direct access to ground-truth strategies,
  • comparisons of posterior/router predictions to known labels,
  • and controlled tests of whether local usefulness implies global consistency.
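The second diagnostic above can be sketched directly. A hypothetical example (the router posteriors and labels are made-up numbers, not exp2 outputs): tabulate the router's argmax latent index against known strategy labels; a (near-)permutation confusion matrix is evidence of globally consistent semantics, while a row spread across columns flags an unstable index.

```python
import numpy as np

def strategy_confusion(router_probs, true_labels, n_strategies):
    """Rows: true strategy; columns: router's argmax latent index."""
    pred = np.argmax(router_probs, axis=1)
    cm = np.zeros((n_strategies, n_strategies), dtype=int)
    for t, p in zip(true_labels, pred):
        cm[t, p] += 1
    return cm

# Toy posteriors over 2 latent indices for 4 examples.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
labels = np.array([0, 0, 1, 1])
cm = strategy_confusion(probs, labels, 2)
print(cm)  # [[2 0] [0 2]]: a permutation matrix, i.e. consistent semantics
```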

The theory should therefore end in empirical diagnostics, not just abstract existence claims.

Next: Metrics and definitions
