Metrics and Definitions

This file should eventually contain the formal backbone of the thread.

Candidate Definitions

Task-wise identifiability

For a fixed input x, the latent factorization is task-wise identifiable if the induced latent partition separates the strategies relevant to that x, up to a permutation of the latent labels.

Informally: within one problem instance, different strategies should map to distinct latent values in a way that is recoverable from the model.

Problem-wise identifiability

For a task family with a shared strategy set, the latent factorization is problem-wise identifiable if the same latent partition recovers the same strategy set across multiple x from that family, again up to permutation.

Globally consistent identifiability

The latent factorization is globally consistent if one latent index corresponds to one strategy semantics across the full support of the data distribution, possibly modulo a single global permutation of labels.

This is the strongest notion and the one most aligned with the phrase “z means the same strategy across examples.”

Candidate Metrics

  • Posterior-label alignment: compare q(z|x,y) to ground-truth strategy labels when they exist.
  • Router-label alignment: compare p(z|x) to strategy labels or strategy frequencies on tasks where a routing target is meaningful.
  • Permutation-invariant matching: measure the best latent-to-strategy matching score after allowing label permutations.
  • Cross-example consistency: measure whether a latent index maps to the same strategy across many x.
  • Semantic stability: check whether latent semantics remain fixed under changes in prompt wording, task instance, or data split.
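As a rough illustration of permutation-invariant matching, the sketch below brute-forces every relabeling of the latent indices and keeps the one with the highest agreement against strategy labels. The function name `best_permutation_accuracy` and the toy data are hypothetical, and the sketch assumes hard latent assignments and as many latent values as strategies.

```python
from itertools import permutations

def best_permutation_accuracy(latents, strategies, num_labels):
    """Return (best accuracy, best latent -> strategy mapping) over all relabelings.

    Assumption: latents and strategies are integer labels in range(num_labels).
    """
    best_acc, best_map = -1.0, None
    for perm in permutations(range(num_labels)):
        # perm[k] is the strategy label assigned to latent index k
        hits = sum(1 for z, s in zip(latents, strategies) if perm[z] == s)
        acc = hits / len(latents)
        if acc > best_acc:
            best_acc, best_map = acc, perm
    return best_acc, best_map

# Hypothetical toy data: the latent indices are a pure relabeling of the strategies.
latents = [2, 2, 0, 1, 0, 1]
strategies = [0, 0, 1, 2, 1, 2]

raw_acc = sum(z == s for z, s in zip(latents, strategies)) / len(latents)
acc, mapping = best_permutation_accuracy(latents, strategies, num_labels=3)
print(raw_acc, acc, mapping)
```

Brute force costs K! evaluations, so it is only practical for small latent spaces. For larger K, a linear assignment solver over the latent-by-strategy confusion matrix (for example `scipy.optimize.linear_sum_assignment`) would be the usual substitute.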

Controlled-generation coverage metric

For controlled evaluation on one input x_i, define the set-valued strategy compatibility relation

$$B_i[k,s] = \mathbf{1}[s \in M_i(k)],$$

where M_i(k) is the set of ground-truth strategies compatible with the generated output under forced latent z_k.

This yields the local coverage score, where S is the strategy set under evaluation and [K] = {1, ..., K} indexes the latent values:

$$c_{\mathrm{loc}}(x_i) = \frac{1}{|S|} \max_{\phi:S \to [K]\ \text{injective}} \sum_{s \in S} B_i[\phi(s), s].$$

This metric is intentionally weaker than semantic identifiability:

  • it measures whether the latent states cover the attainable strategy behaviors for that input,
  • it does not require unique strategy attribution,
  • and it does not require one latent index to mean the same strategy across inputs.

This is therefore a local coverage metric, not a global semantic-alignment metric.
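A minimal sketch of the local coverage score for a single input, assuming the compatibility relation is already available as a K-by-|S| 0/1 matrix. The function name `local_coverage` and both example matrices are hypothetical. The sketch enumerates injective maps directly, and raising an error when K < |S| is a choice made here, since no injective map exists in that case.

```python
from itertools import permutations

def local_coverage(B, num_strategies):
    """Local coverage for one input.

    B[k][s] is 1 if strategy s is compatible with the output generated
    under forced latent z_k, else 0.
    """
    K = len(B)
    if K < num_strategies:
        # Assumption of this sketch: the score is undefined without an injective map.
        raise ValueError("need at least as many latent values as strategies")
    best = 0
    for phi in permutations(range(K), num_strategies):
        # phi[s] is the latent index assigned to strategy s;
        # entries are distinct, so phi is injective.
        covered = sum(B[phi[s]][s] for s in range(num_strategies))
        best = max(best, covered)
    return best / num_strategies

# Latent 0 is compatible with both strategies, latent 1 only with strategy 0.
B_covering = [[1, 1], [1, 0], [0, 0]]
# Every latent collapses onto strategy 0, so strategy 1 is never covered.
B_collapsed = [[1, 0], [1, 0], [1, 0]]

print(local_coverage(B_covering, 2), local_coverage(B_collapsed, 2))
```

The first matrix scores 1.0 even though latent 0 is compatible with both strategies, which reflects the point that coverage does not require unique strategy attribution.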

Failure Cases to Distinguish

  • Predictive but unstable latents: good for reconstruction, bad for semantics.
  • Globally consistent but task-agnostic latents: stable labels that do not actually separate strategy variation.
  • Locally identifiable but globally inconsistent latents: each task is recoverable on its own, but the latent labels do not align across tasks.
  • Overcompressed latent spaces: a single latent may encode multiple strategies without a clean semantic interpretation.
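The locally identifiable but globally inconsistent case can be illustrated by comparing per-task matching against matching under a single shared permutation. This is only a sketch on hypothetical toy data. The helper `best_acc` and the task names are assumptions, and hard latent assignments are assumed.

```python
from itertools import permutations

def best_acc(latents, strategies, num_labels):
    """Best agreement between latents and strategies over all latent relabelings."""
    return max(
        sum(perm[z] == s for z, s in zip(latents, strategies)) / len(latents)
        for perm in permutations(range(num_labels))
    )

# Hypothetical data: (latent indices, strategy labels) per task.
# task_b uses the same two strategies but with the latent labels swapped.
tasks = {
    "task_a": ([0, 1, 0, 1], [0, 1, 0, 1]),
    "task_b": ([1, 0, 1, 0], [0, 1, 0, 1]),
}

# Local view: each task may pick its own permutation.
local_scores = {name: best_acc(z, s, 2) for name, (z, s) in tasks.items()}

# Global view: one permutation must serve the pooled examples.
pooled_z = [z for zs, _ in tasks.values() for z in zs]
pooled_s = [s for _, ss in tasks.values() for s in ss]
global_score = best_acc(pooled_z, pooled_s, 2)

print(local_scores, global_score)
```

Here each task is perfectly recoverable on its own, while the pooled score drops to chance, so the gap between the two numbers is one possible signal for this failure case.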

Empirical Quantities to Eventually Log

  • strategy classification accuracy,
  • permutation-invariant alignment score,
  • cross-task latent agreement,
  • posterior entropy vs alignment,
  • router entropy vs alignment,
  • and stability under different prompt templates.
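One possible shape for a per-example log record covering the entropy and alignment quantities is sketched below. The field names, the `log_record` helper, and the example distributions are hypothetical. The latent-to-strategy mapping is assumed to come from a separate permutation-matching step.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def log_record(posterior, router, strategy_label, latent_to_strategy):
    """Build one log entry from q(z|x,y), p(z|x), and a ground-truth strategy label."""
    # Hard-assign the example to the most probable posterior latent.
    z_hat = max(range(len(posterior)), key=posterior.__getitem__)
    return {
        "posterior_entropy": entropy(posterior),
        "router_entropy": entropy(router),
        "aligned": latent_to_strategy[z_hat] == strategy_label,
    }

# Hypothetical example: a confident posterior and a uniform router over 3 latents.
rec = log_record(
    posterior=[0.9, 0.1, 0.0],
    router=[1 / 3, 1 / 3, 1 / 3],
    strategy_label=1,
    latent_to_strategy=(1, 2, 0),  # assumed output of a matching step
)
print(rec)
```

Logging entropy next to the alignment flag makes it possible to check afterwards whether low-entropy predictions are also the well-aligned ones.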

Next

Next: Toy models
