Metrics and Definitions

This file should eventually contain the formal backbone of the thread.

Candidate Definitions

Task-wise identifiability

For a fixed input x, the latent factorization is task-wise identifiable if the induced latent partition separates the strategies relevant to that x, up to a permutation of the latent labels.

Informally: within one problem instance, different strategies should map to distinct latent values in a way that is recoverable from the model.

Problem-wise identifiability

For a task family with a shared strategy set, the latent factorization is problem-wise identifiable if the same latent partition recovers the same strategy set across multiple x from that family, again up to permutation.

Globally consistent identifiability

The latent factorization is globally consistent if each latent index carries the same strategy semantics across the full support of the data distribution, up to a single global permutation of labels.

This is the strongest notion and the one most aligned with the phrase “z means the same strategy across examples.”

Candidate Metrics

Posterior-label alignment: compare q(z|x,y) to ground-truth strategy labels when they exist.
Router-label alignment: compare p(z|x) to strategy labels or strategy frequencies on tasks where a routing target is meaningful.
Permutation-invariant matching: measure the best latent-to-strategy matching score after allowing label permutations.
Cross-example consistency: measure whether a latent index maps to the same strategy across many x.
Semantic stability: check whether latent semantics remain fixed under changes in prompt wording, task instance, or data split.
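
A minimal sketch of permutation-invariant matching and cross-example consistency, assuming hard latent assignments (for example the argmax of q(z|x,y)) and integer strategy labels. The names `contingency`, `best_matching_score`, and `local_vs_pooled` are hypothetical, and the brute-force search is only practical for a handful of latents; for larger K a linear assignment solver such as `scipy.optimize.linear_sum_assignment` could replace the loop.

```python
from itertools import permutations


def contingency(latents, strategies, num_latents, num_strategies):
    """Count how often latent index k co-occurs with strategy label s."""
    table = [[0] * num_strategies for _ in range(num_latents)]
    for k, s in zip(latents, strategies):
        table[k][s] += 1
    return table


def best_matching_score(table):
    """Best fraction of examples explained by an injective strategy -> latent map.

    Brute force over all injective assignments (assumption: small K).
    Returns (score, assignment), where assignment[s] is the latent matched to s.
    """
    num_latents, num_strategies = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    if total == 0 or num_latents < num_strategies:
        raise ValueError("need at least one example and K >= |S|")
    best, best_map = -1, None
    for assignment in permutations(range(num_latents), num_strategies):
        matched = sum(table[assignment[s]][s] for s in range(num_strategies))
        if matched > best:
            best, best_map = matched, assignment
    return best / total, best_map


def local_vs_pooled(tables):
    """Compare per-input matching (one permutation per x) with pooled matching
    (a single permutation shared by all x)."""
    local = sum(best_matching_score(t)[0] for t in tables) / len(tables)
    num_latents, num_strategies = len(tables[0]), len(tables[0][0])
    pooled = [[sum(t[k][s] for t in tables) for s in range(num_strategies)]
              for k in range(num_latents)]
    return local, best_matching_score(pooled)[0]


# Two inputs whose latent labels are swapped relative to each other.
table_a = contingency([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], 2, 2)
table_b = contingency([1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], 2, 2)
print(local_vs_pooled([table_a, table_b]))  # (1.0, 0.5)
```

In this toy case each input is perfectly recoverable on its own, but no single permutation aligns both, so the pooled score drops to 0.5. A gap between the two numbers is one possible signal of latents that are locally identifiable but not globally consistent.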

Controlled-generation coverage metric

For controlled evaluation on one input x_i, encode the set-valued strategy compatibility relation M_i as the binary matrix

B_i[k,s] = \mathbf{1}[s \in M_i(k)],

where M_i(k) is the set of ground-truth strategies compatible with the output generated under the forced latent z_k, S is the set of ground-truth strategies, and [K] = \{1, \dots, K\} indexes the latent values.

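This yields the local coverage score, the largest fraction of strategies that can be assigned to distinct compatible latents (an injective \phi exists only when K \geq |S|):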

c_{\mathrm{loc}}(x_i) = \frac{1}{|S|} \max_{\phi:S \to [K]\ \text{injective}} \sum_{s \in S} B_i[\phi(s), s].

This metric is intentionally weaker than semantic identifiability:

it measures whether the latent states cover the attainable strategy behaviors for that input,
it does not require unique strategy attribution,
and it does not require one latent index to mean the same strategy across inputs.

This is therefore a local coverage metric, not a global semantic-alignment metric.
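
The local coverage score can be sketched as follows, assuming strategies are numbered 0..|S|-1 and that `compat[k]` holds the set M_i(k) for forced latent z_k. The function name `local_coverage` is hypothetical, the search over injective maps is brute force (reasonable only for small K), and the sketch raises an error when K < |S|, since the maximum is then taken over an empty set.

```python
from itertools import permutations


def local_coverage(compat, num_strategies):
    """Local coverage c_loc(x_i) for one input.

    compat[k] is the set M_i(k) of strategy ids compatible with the output
    generated under forced latent z_k (assumed representation).
    """
    num_latents = len(compat)
    if num_latents < num_strategies:
        raise ValueError("no injective map S -> [K] exists when K < |S|")
    # B[k][s] = 1 if strategy s is compatible with latent k, else 0.
    B = [[1 if s in compat[k] else 0 for s in range(num_strategies)]
         for k in range(num_latents)]
    # phi[s] is the latent assigned to strategy s; permutations() yields
    # exactly the injective maps from strategies to latents.
    best = max(
        sum(B[phi[s]][s] for s in range(num_strategies))
        for phi in permutations(range(num_latents), num_strategies)
    )
    return best / num_strategies


# Latents 0 and 2 are each compatible with strategies 0 and 1;
# strategy 2 is never produced.
print(local_coverage([{0, 1}, {0}, {0, 1}], 3))  # 2 of 3 strategies covered
```

The example scores 2/3 even though latents 0 and 2 are ambiguous between two strategies, which illustrates that coverage does not require unique strategy attribution.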

Failure Cases to Distinguish

Predictive but unstable latents: the latents help reconstruction but carry no stable semantics.
Globally consistent but task-agnostic latents: stable labels that do not actually separate strategy variation.
Locally identifiable but globally inconsistent latents: each task is recoverable on its own, but the latent labels do not align across tasks.
Overcompressed latent spaces: a single latent may encode multiple strategies without a clean semantic interpretation.

Empirical Quantities to Eventually Log

strategy classification accuracy,
permutation-invariant alignment score,
cross-task latent agreement,
posterior entropy vs alignment,
router entropy vs alignment,
and stability under different prompt templates.
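
One possible way to pair the entropy quantities with an alignment score per example is sketched below. The record layout and the names `entropy` and `log_record` are hypothetical, and the alignment value is assumed to come from whichever matching metric is adopted.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def log_record(posterior, router, alignment):
    """Bundle the per-example quantities so entropy can be plotted against alignment.

    posterior: q(z|x,y) as a list of probabilities
    router:    p(z|x) as a list of probabilities
    alignment: a precomputed alignment score for this example (assumed given)
    """
    return {
        "posterior_entropy": entropy(posterior),
        "router_entropy": entropy(router),
        "alignment": alignment,
    }


record = log_record(posterior=[1.0, 0.0], router=[0.5, 0.5], alignment=1.0)
print(record)
```

Here a confident posterior (entropy 0) sits next to an uncertain router (entropy log 2), the kind of contrast the entropy-vs-alignment comparisons are meant to surface.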

Next: Toy models