Core Questions

1. What is the object that should be identified?

Is the goal:

  • a latent code z that helps predict y,
  • a latent strategy label that corresponds to a human-interpretable reasoning mode,
  • or a globally consistent decomposition where the same latent index means the same thing across tasks?

2. What counts as identifiability?

Possible distinctions:

  • task-wise identifiability: for a fixed input x, can we recover the latent partition relevant to that single problem?
  • problem-wise identifiability: do we recover the right strategy classes across a whole family of inputs sharing the same task?
  • globally consistent identifiability: does a single latent index carry the same strategy semantics across the whole data distribution?
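The gap between the last two notions can be made concrete with a small diagnostic. A minimal sketch, with illustrative toy labels (the two "tasks" and their latent assignments are assumptions, not data from the experiments): score latent labels against ground truth under the best relabeling, first per task, then with one global relabeling. A latent can be perfect per task yet inconsistent globally.

```python
from itertools import permutations

def best_alignment_accuracy(latent, truth, n_labels):
    """Accuracy of latent labels under the best single relabeling of indices."""
    best = 0.0
    for perm in permutations(range(n_labels)):
        acc = sum(perm[z] == t for z, t in zip(latent, truth)) / len(truth)
        best = max(best, acc)
    return best

# Two toy tasks with ground-truth strategies 0/1; on task B the
# latent indices happen to be swapped relative to task A.
truth_a, latent_a = [0, 0, 1, 1], [0, 0, 1, 1]
truth_b, latent_b = [0, 1, 1, 0], [1, 0, 0, 1]

per_task = (best_alignment_accuracy(latent_a, truth_a, 2)
            + best_alignment_accuracy(latent_b, truth_b, 2)) / 2
global_acc = best_alignment_accuracy(latent_a + latent_b,
                                     truth_a + truth_b, 2)
print(per_task)    # 1.0: each task is identifiable up to relabeling
print(global_acc)  # 0.5: the same index does not mean the same thing globally
```

The brute-force permutation search is fine for a handful of labels; for larger strategy sets the same alignment can be computed with a Hungarian matching on the confusion matrix.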

3. When is a “useful latent” not enough?

A useful latent can still fail to be identifiable if:

  • it is only predictive on the training support,
  • it changes meaning across examples,
  • it encodes surface-level regularities instead of strategies,
  • or it relies on arbitrary relabelings that are not stable across inputs.
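The "surface-level regularities" failure mode can be illustrated with a deliberately bad latent. A hypothetical sketch (the prompts, labels, and length threshold are all invented for illustration): a code that tracks prompt length happens to predict y on the training support, then breaks under a surface shift, even though nothing about the underlying strategies changed.

```python
def surface_latent(prompt):
    """A 'latent' that encodes a surface regularity, not a strategy."""
    return int(len(prompt) > 10)

# On the training support, prompt length correlates with the label...
train = [("add 2 3", 0), ("multiply 12 34", 1)]
# ...but under a harmless surface reformatting the correlation breaks.
shift = [("2+3", 0), ("12*34", 1)]

train_acc = sum(surface_latent(p) == y for p, y in train) / len(train)
shift_acc = sum(surface_latent(p) == y for p, y in shift) / len(shift)
print(train_acc)  # 1.0: "useful" on the training support
print(shift_acc)  # 0.5: meaningless once the surface form changes
```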

4. What structure makes strategy semantics recoverable?

Candidate ingredients:

  • a small finite set of strategies,
  • strategy-specific trajectory structure that persists long enough to be observed,
  • enough variation across examples to separate strategy from prompt surface form,
  • and supervision or verification signals that make semantics observable.
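These ingredients can be wired into a toy generative model, which is roughly the shape a synthetic setting needs. A minimal sketch in which every name and distribution is an assumption for illustration: a small finite strategy set, a strategy-specific step pattern that persists across the trajectory, and surface form that varies independently of the strategy.

```python
import random

STRATEGIES = ["decompose", "guess-and-check"]  # small finite set

def sample_example(rng):
    z = rng.randrange(len(STRATEGIES))          # ground-truth strategy
    surface = rng.choice(["short", "long"])     # varies independently of z
    # Strategy-specific step pattern persisting long enough to be observed.
    step = "sub" if z == 0 else "try"
    trajectory = [f"{step}-{i}" for i in range(4)]
    return {"strategy": z, "surface": surface, "trajectory": trajectory}

rng = random.Random(0)
ex = sample_example(rng)
print(ex["trajectory"])  # every step carries the strategy's signature
```

Because surface form and strategy are sampled independently, no surface feature can stand in for the strategy, which is exactly the separation the third ingredient asks for.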

5. How should the theory connect to exp2?

The synthetic setting is useful because it can support:

  • direct access to ground-truth strategies,
  • comparisons of posterior/router predictions to known labels,
  • and controlled tests of whether local usefulness implies global consistency.
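The second diagnostic above can be sketched directly. A hypothetical example (the router posteriors and labels are made-up numbers, not exp2 outputs): tabulate the router's argmax latent index against known strategy labels; a (near-)permutation confusion matrix is evidence of globally consistent semantics, while a row spread across columns flags an unstable index.

```python
import numpy as np

def strategy_confusion(router_probs, true_labels, n_strategies):
    """Rows: true strategy; columns: router's argmax latent index."""
    pred = np.argmax(router_probs, axis=1)
    cm = np.zeros((n_strategies, n_strategies), dtype=int)
    for t, p in zip(true_labels, pred):
        cm[t, p] += 1
    return cm

# Toy posteriors over 2 latent indices for 4 examples.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
labels = np.array([0, 0, 1, 1])
cm = strategy_confusion(probs, labels, 2)
print(cm)  # [[2 0] [0 2]]: a permutation matrix, i.e. consistent semantics
```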

The theory should therefore end in empirical diagnostics, not just abstract existence claims.

Next: Metrics and definitions
