Metrics and Definitions
This file should eventually contain the formal backbone of the thread.
Candidate Definitions
Task-wise identifiability
For a fixed input x, the latent factorization is task-wise identifiable if the
induced latent partition separates the strategies relevant to that x, up to a
permutation of the latent labels.
Informally: within one problem instance, different strategies should map to distinct latent values in a way that is recoverable from the model.
Problem-wise identifiability
For a task family with a shared strategy set, the latent factorization is
problem-wise identifiable if the same latent partition recovers the same
strategy set across multiple x from that family, again up to permutation.
Global-consistent identifiability
The latent factorization is globally consistent if one latent index corresponds to one strategy semantics across the full support of the data distribution, possibly modulo a single global permutation of labels.
This is the strongest notion and the one most aligned with the phrase “z
means the same strategy across examples.”
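One rough way to separate this notion from the weaker ones: compare the alignment achievable with a single shared permutation of latent labels against the alignment achievable when each example gets its own best permutation. The gap is near zero only if the labels are globally consistent. A minimal sketch, assuming hard latent assignments and strategy labels are available per example (function names are illustrative, and brute force over permutations only works for small K):

```python
from itertools import permutations

def best_perm_acc(z, s, K):
    # Best agreement between latent ids z and strategy labels s
    # over all K! relabelings of the latent indices.
    return max(sum(p[zi] == si for zi, si in zip(z, s))
               for p in permutations(range(K))) / len(z)

def global_consistency_gap(examples, K):
    # examples: list of (latent_ids, strategy_labels), one pair per input x.
    all_z = [zi for z, _ in examples for zi in z]
    all_s = [si for _, s in examples for si in s]
    # One permutation shared across all examples (global consistency).
    global_acc = best_perm_acc(all_z, all_s, K)
    # A fresh permutation per example (local identifiability only).
    local_acc = sum(best_perm_acc(z, s, K) for z, s in examples) / len(examples)
    return local_acc - global_acc  # ~0 iff labels align across examples
```

A large gap flags the "locally identifiable but globally inconsistent" failure case listed below: each example is recoverable on its own, but no single relabeling works everywhere.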
Candidate Metrics
- Posterior-label alignment: compare q(z|x,y) to ground-truth strategy labels when they exist.
- Router-label alignment: compare p(z|x) to strategy labels or strategy frequencies on tasks where a routing target is meaningful.
- Permutation-invariant matching: measure the best latent-to-strategy matching score after allowing label permutations.
- Cross-example consistency: measure whether a latent index maps to the same strategy across many x.
- Semantic stability: check whether latent semantics remain fixed under changes in prompt wording, task instance, or data split.
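The permutation-invariant matching score can be sketched as a confusion matrix whose trace is maximized over latent-to-strategy bijections. A minimal sketch, assuming hard latent assignments and small K (brute force here; scipy.optimize.linear_sum_assignment would scale better):

```python
from itertools import permutations

def perm_invariant_alignment(latent_ids, strategy_ids, K):
    # confusion[k][s] = count of examples assigned latent k with true strategy s
    confusion = [[0] * K for _ in range(K)]
    for k, s in zip(latent_ids, strategy_ids):
        confusion[k][s] += 1
    # Best latent -> strategy bijection, scored by matched counts.
    best = max(sum(confusion[k][p[k]] for k in range(K))
               for p in permutations(range(K)))
    return best / len(latent_ids)
```

The same matrix, restricted per task family and compared across families, gives one concrete version of the cross-example consistency metric.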
Controlled-generation coverage metric
For controlled evaluation on one input x_i, define the set-valued strategy compatibility relation

    M_i(k) = { s ∈ S(x_i) : the output generated under forced latent z_k is compatible with strategy s },

where S(x_i) is the set of ground-truth strategies attainable on x_i, and M_i(k) is the subset of those strategies compatible with the generated output under forced latent z_k.

This yields the local coverage score

    Cov(x_i) = | ⋃_k M_i(k) | / | S(x_i) |.
This metric is intentionally weaker than semantic identifiability:
- it measures whether the latent states cover the attainable strategy behaviors for that input,
- it does not require unique strategy attribution,
- and it does not require one latent index to mean the same strategy across inputs.
This is therefore a local coverage metric, not a global semantic-alignment metric.
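Given the compatibility sets M_i(k) for one input, the local coverage score reduces to a set union against the attainable strategies for that input. A minimal sketch (the argument names are illustrative):

```python
def local_coverage(compat_sets, attainable):
    # compat_sets: dict mapping latent index k -> set of strategies
    #   compatible with the output generated under forced latent z_k.
    # attainable: set of ground-truth strategies attainable on this input.
    covered = set().union(*compat_sets.values()) & attainable
    return len(covered) / len(attainable)
```

Note that duplicate coverage is free here: several latents mapping to the same strategy do not lower the score, which is exactly why this metric does not imply unique attribution.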
Failure Cases to Distinguish
- Predictive but unstable latents: good for reconstruction, bad for semantics.
- Globally consistent but task-agnostic latents: stable labels that do not actually separate strategy variation.
- Locally identifiable but globally inconsistent latents: each task is recoverable on its own, but the latent labels do not align across tasks.
- Overcompressed latent spaces: a single latent may encode multiple strategies without a clean semantic interpretation.
Empirical Quantities to Eventually Log
- strategy classification accuracy,
- permutation-invariant alignment score,
- cross-task latent agreement,
- posterior entropy vs alignment,
- router entropy vs alignment,
- and stability under different prompt templates.
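The entropy quantities in this list can be logged with a small helper, assuming the router and posterior are available as categorical probability vectors (summarize is an illustrative logging helper, not part of the thread's codebase):

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of one categorical distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def summarize(router_probs, posterior_probs, alignment):
    # One row of the eventual log: mean entropies paired with an
    # alignment score computed on the same split.
    return {
        "router_entropy": sum(entropy(p) for p in router_probs) / len(router_probs),
        "posterior_entropy": sum(entropy(p) for p in posterior_probs) / len(posterior_probs),
        "alignment": alignment,
    }
```

Tracking entropy next to alignment is what distinguishes latents that are confidently wrong from latents that are merely uncertain.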