Core Questions
1. What is the object that should be identified?
Is the goal:
- a latent code z that helps predict y,
- a latent strategy label that corresponds to a human-interpretable reasoning mode,
- or a globally consistent decomposition where the same latent index means the same thing across tasks?
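One way to make these three candidate objects concrete is a toy generative model. The sketch below is an illustrative assumption, not a model fixed by these notes: y is produced by one of K strategy functions f_z(x), and the comments mark where each candidate object differs.

```python
import numpy as np

# Hypothetical toy model (names and functions are assumptions for
# illustration): y is produced by one of K strategy functions f_z(x).
rng = np.random.default_rng(0)
K = 2
f = [lambda x: x + 1, lambda x: 2 * x]   # strategy 0, strategy 1

x = rng.integers(2, 10, size=8)
z = rng.integers(0, K, size=8)           # ground-truth strategy labels
y = np.array([f[zi](xi) for zi, xi in zip(z, x)])

# (a) "a latent code that helps predict y": any feature of (x, y) with
#     predictive value, with no commitment to semantics.
code = (y - x > 1).astype(int)
# (b) "a latent strategy label": code qualifies only if it equals z up
#     to a fixed relabeling of {0, ..., K-1}.
# (c) "a globally consistent decomposition": that same relabeling must
#     hold across every task, not just this batch.
```

Here the code happens to coincide with z on this data, which is exactly the distinction at stake: predictive value alone does not certify strategy semantics.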
2. What counts as identifiability?
Possible distinctions:
- task-wise identifiability: for a fixed x, can we recover the relevant latent partition for that problem?
- problem-wise identifiability: do we recover the right strategy classes for a family of inputs sharing the same task?
- globally consistent identifiability: does one latent index correspond to one strategy semantics across the whole data distribution?
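The gap between the task-wise and globally consistent notions can be scored directly. In the sketch below (setup and names are mine, for illustration), the task-wise score lets each task choose its own label permutation, while the global score forces one shared permutation across all tasks:

```python
import numpy as np
from itertools import permutations

def perm_accuracy(pred, true, perm):
    """Accuracy of predicted labels after relabeling via perm."""
    return float(np.mean(np.array([perm[p] for p in pred]) == true))

def taskwise_score(tasks, K):
    """Each task independently uses its best label permutation."""
    return float(np.mean([
        max(perm_accuracy(p, t, pi) for pi in permutations(range(K)))
        for p, t in tasks]))

def global_score(tasks, K):
    """One shared permutation must work for every task."""
    return max(
        float(np.mean([perm_accuracy(p, t, pi) for p, t in tasks]))
        for pi in permutations(range(K)))

# Two tasks whose labels are internally correct but inconsistently indexed:
task_a = ([0, 0, 1, 1], [0, 0, 1, 1])   # identity relabeling works
task_b = ([1, 1, 0, 0], [0, 0, 1, 1])   # only a swap works
tasks = [task_a, task_b]
# taskwise_score(tasks, 2) -> 1.0; global_score(tasks, 2) -> 0.5
```

A task-wise score of 1.0 alongside a global score of 0.5 is precisely the signature of labels that are recoverable per task but carry no consistent meaning across tasks.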
3. When is “useful latent” not enough?
A useful latent can still fail to be identifiable if:
- it is only predictive on the training support,
- it changes meaning across examples,
- it encodes surface-level regularities instead of strategies,
- or it relies on arbitrary relabelings that are not stable across inputs.
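The first two failure modes can be reproduced in a few lines. The sketch below (a constructed example, not from the notes) builds a latent that encodes a surface cue, here standing in for something like prompt length, which is perfectly coupled to the strategy on the training support and decoupled off-support:

```python
import numpy as np

# Hypothetical failure mode: a "useful" latent that is only predictive
# on the training support because it encodes a surface cue that happens
# to correlate with the true strategy there.
rng = np.random.default_rng(1)

def sample(n, coupled):
    z = rng.integers(0, 2, size=n)                       # true strategy
    cue = z.copy() if coupled else rng.integers(0, 2, size=n)
    return z, cue

z_tr, cue_tr = sample(1000, coupled=True)    # training support
z_te, cue_te = sample(1000, coupled=False)   # off-support
train_acc = float(np.mean(cue_tr == z_tr))   # 1.0: cue looks identifying
test_acc = float(np.mean(cue_te == z_te))    # ~0.5: it was spurious
```

On the training support the cue is indistinguishable from a recovered strategy label; only off-support evaluation exposes that no strategy semantics were identified.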
4. What structure makes strategy semantics recoverable?
Candidate ingredients:
- a small finite set of strategies,
- strategy-specific trajectory structure that persists long enough to be observed,
- enough variation across examples to separate strategy from prompt surface form,
- and supervision or verification signals that make semantics observable.
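The "persists long enough to be observed" ingredient has a minimal illustration. In the sketch below (rules, start states, and lengths are assumptions), each strategy is a per-step update rule; two strategies can agree on a short trajectory prefix and separate only once more steps are observed:

```python
# Hypothetical sketch: each strategy is a deterministic per-step update
# rule, and a trajectory identifies its strategy only once it is long
# enough to rule out the alternatives.
RULES = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 1]

def trajectory(x0, z, T):
    """Roll out T steps of strategy z from start state x0."""
    states = [x0]
    for _ in range(T):
        states.append(RULES[z](states[-1]))
    return states

def recover_strategy(traj):
    """Return the first rule consistent with every observed transition."""
    for z, rule in enumerate(RULES):
        if all(rule(a) == b for a, b in zip(traj, traj[1:])):
            return z
    return None

# From x0 = 1, "add one" and "double" coincide for one step (1 -> 2),
# so a length-1 trajectory of the doubling strategy is misidentified:
# recover_strategy(trajectory(1, 1, 1)) -> 0
# One more observed step separates them:
# recover_strategy(trajectory(1, 1, 2)) -> 1
```

The same construction gives a knob for controlled experiments: how long must strategy-specific structure persist before the small finite strategy set becomes recoverable?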
5. How should the theory connect to exp2?
The synthetic setting is useful because it can support:
- direct access to ground-truth strategies,
- comparisons of posterior/router predictions to known labels,
- and controlled tests of whether local usefulness implies global consistency.
The theory should therefore end in empirical diagnostics, not just abstract existence claims.
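One concrete diagnostic of this kind: score router or posterior predictions against the ground-truth strategy labels with a permutation-invariant agreement measure. The sketch below implements the adjusted Rand index from scratch (the function name and example data are mine, not from the notes); it is 1.0 for identical partitions under any relabeling and near 0.0 at chance:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two labelings, invariant to
    relabeling: 1.0 for identical partitions, ~0.0 at chance."""
    n = len(a)
    pair = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sa = sum(comb(c, 2) for c in Counter(a).values())
    sb = sum(comb(c, 2) for c in Counter(b).values())
    expected = sa * sb / comb(n, 2)
    max_index = (sa + sb) / 2
    if max_index == expected:            # degenerate partitions
        return 0.0
    return (pair - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
router = [2, 2, 2, 0, 0, 0]              # same partition, new indices
# adjusted_rand_index(truth, router) -> 1.0
```

Because the index is permutation-invariant, it certifies partition recovery per evaluation; pairing it with a shared-permutation check across tasks is what would separate local usefulness from global consistency.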