# Formalizing Strategy Identification with Expressive Generator Classes

This note formulates the problem of identifying or learning a latent "strategy" factorization of an observed conditional law $p(y \mid x)$ when the strategy-conditioned generator class is highly expressive. The main point is that once the generator can represent the marginalized law directly, the classical latent-variable question of unique identification is no longer the right target. The problem becomes one of selecting a useful factorization from an often very large equivalence class.

## Setup

Let $\mathcal{X}$, $\mathcal{Z}$, and $\mathcal{Y}$ be spaces of inputs, latent strategies, and outputs, respectively. Fix a reference measure $\mu$ on $\mathcal{Z}$ so that we can write latent marginalizations as integrals with respect to $z$.

We take as primitive:

- a *router class*
  $$
  \mathcal{R} \subset \mathcal{P}(\mathcal{Z} \mid \mathcal{X}),
  $$
  where each $r \in \mathcal{R}$ is a conditional distribution $r(z \mid x)$ from inputs to latent modes or strategies;
- a *strategy-conditioned generator class*
  $$
  \mathcal{G} \subset \mathcal{P}(\mathcal{Y} \mid \mathcal{X} \times \mathcal{Z}),
  $$
  where each $g \in \mathcal{G}$ is a conditional distribution $g(y \mid x, z)$ over outputs given an input-strategy pair.

> [!definition]
> Define the mixing operator
> $$
> M(r, g)(y \mid x) := \int g(y \mid x, z)\, r(z \mid x)\, d\mu(z).
> $$
> The induced input-output conditional model class is
> $$
> \mathcal{M}(\mathcal{R}, \mathcal{G}) := \{p \in \mathcal{P}(\mathcal{Y} \mid \mathcal{X}) : p = M(r, g) \text{ for some } (r, g) \in \mathcal{R} \times \mathcal{G}\}.
> $$
> Thus $p \in \mathcal{M}(\mathcal{R}, \mathcal{G})$ if and only if it admits a latent
> factorization through some router-generator pair.

> [!definition]
> For a fixed conditional law $p \in \mathcal{P}(\mathcal{Y} \mid \mathcal{X})$, define its
> *factorization set* by
> $$
> \mathcal{F}_{\mathcal{R}, \mathcal{G}}(p) := \{(r, g) \in \mathcal{R} \times \mathcal{G} : M(r, g) = p\}.
> $$
> When $\mathcal{R}$ and $\mathcal{G}$ are understood, write simply $\mathcal{F}(p)$.

The ambient question is: for a given marginalized conditional law $p(y \mid x)$, what structure, if any, should be recovered from the set of its latent factorizations $\mathcal{F}(p)$?

> [!quote]
> This setup contains several classical models as special cases. In *mixtures
> of regressions*, the router is usually independent of $x$, i.e. $r(z \mid x) = r(z)$,
> while the experts belong to a structured parametric class. Hennig
> studies identifiability of clusterwise linear regression under this kind of
> restriction @hennig2000. In *mixture-of-experts* models, both the router and
> the experts depend on $x$; Jiang and Tanner analyze approximation and maximum
> likelihood estimation for hierarchical mixtures-of-experts in exponential
> family regression models @jiang1999.
> In *semiparametric/nonparametric
> conditional mixtures*, some or all of the gating and expert maps are allowed
> to vary nonparametrically in $x$; Wang, Yao, and Huang give identifiability
> results for nonparametric and semiparametric mixtures of generalized linear
> models under smoothness and nondegeneracy conditions @wang2014.
>
> These papers illustrate the classical route to identifiability: sharply
> restrict the router and expert families. The present note instead asks what
> remains meaningful when $\mathcal{G}$ is too expressive for such structural
> uniqueness to hold in general.

## Three questions: existence, identification, selection

The factorization formalism separates three logically distinct questions.

> [!definition]
> Let $p \in \mathcal{P}(\mathcal{Y} \mid \mathcal{X})$.
>
> 1. *Existence:* is $\mathcal{F}(p)$ nonempty?
> 2. *Identification:* is $\mathcal{F}(p)$ a singleton, or a singleton modulo a
>    natural equivalence relation on latent representations?
> 3. *Selection:* when $\mathcal{F}(p)$ is large, which factorization should be
>    preferred as the informative or canonical one?

In the expressive-generator regime, these questions behave very differently. Existence is typically easy, exact identification usually fails, and the main problem becomes selection.

> [!definition]
> Suppose there exists $g_{p} \in \mathcal{G}$ such that
> $$
> g_{p}(y \mid x, z) = p(y \mid x) \quad \text{for all } x \in \mathcal{X},\ z \in \mathcal{Z}.
> $$
> Then for every router $r \in \mathcal{R}$,
> $$
> M(r, g_{p})(y \mid x) = \int p(y \mid x)\, r(z \mid x)\, d\mu(z) = p(y \mid x),
> $$
> so $(r, g_{p}) \in \mathcal{F}(p)$.

> [!success]
> If $\mathcal{G}$ contains a $z$-constant realization of the target conditional law
> $p(y \mid x)$ and $\mathcal{R}$ contains at least two distinct routers, then
> $\mathcal{F}(p)$ contains at least two distinct factorizations. Hence the latent
> factorization of $p$ cannot be identified from single-view observations
> $(X, Y)$ by uniqueness of factorization alone.

This trivial factorization is not merely a pathological corner case. It is the canonical obstruction created by expressive generators: the generator can absorb the entire observable law, leaving the router unconstrained.

> [!quote]
> Locatello et al. study unsupervised disentanglement in flexible latent
> variable models of the form $z \sim P(z)$, $x \sim P(x \mid z)$ and show that, in
> the absence of inductive biases on the model class and data distribution,
> disentangled latent structure is not identifiable from observations alone
> @locatello2019. Their impossibility statement is closely analogous to the
> trivial factorization above: many latent representations induce the same
> observable law.
>
> A related but more constructive literature asks how identifiability can be
> recovered by adding side information. Hyvärinen, Sasaki, and Turner consider
> nonlinear ICA with an auxiliary observed variable $u$, where the latent
> source law is conditionally independent across coordinates given $u$ and each
> coordinate follows a conditionally exponential-family model. Under
> invertibility and variability conditions, they prove identifiability of the
> nonlinear ICA model and consistency of a contrastive learning method
> @hyvarinen2019. Khemakhem et al. extend this perspective to flexible
> latent-variable models and VAEs: with a decoder $x = f(z, \epsilon)$ and a
> conditionally factorized exponential-family prior $p(z \mid u)$, they show that
> the true joint distribution over observed and latent variables becomes
> identifiable up to a controlled equivalence relation, whereas the ordinary
> unsupervised VAE is generally unidentifiable @khemakhem2020.
>
> The common lesson is structural. Once the generator is expressive, latent
> structure is not identified from the marginal law alone; one needs either a
> selection criterion or extra observation structure, such as auxiliary
> variables, time, or repeated views.

## Identification requires quotienting by equivalence

Even in favorable cases, exact equality of latent factorizations is too strong. At minimum, one should quotient by relabelings of a discrete latent space. For general measurable latent spaces, the natural notion is pushforward under an invertible measurable reparameterization.

> [!definition]
> Let $T : \mathcal{Z} \to \mathcal{Z}$ be invertible and measurable. For a factorization
> $(r, g)$, define the transformed pair $(r^T, g^T)$ by
> $$
> r^T(\cdot \mid x) := T_{\#}\, r(\cdot \mid x),
> $$
> i.e. the pushforward of the conditional law $r(\cdot \mid x)$ under $T$, and
> $$
> g^T(y \mid x, z) := g(y \mid x, T^{-1}(z)).
> $$
> Then
> $$
> M(r^T, g^T) = M(r, g).
> $$
> We write $(r, g) \sim (r', g')$ if there exists such a $T$ with
> $(r', g') = (r^T, g^T)$.

When $\mathcal{Z}$ is finite, this reduces to permutation equivalence:

$$
r^\sigma(z \mid x) = r(\sigma^{-1}(z) \mid x), \quad g^\sigma(y \mid x, z) = g(y \mid x, \sigma^{-1}(z)).
$$
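
As a concrete finite sketch of these definitions, the following minimal Python example (all distributions here are randomly generated toys, an illustrative assumption) implements the mixing operator $M(r, g)$, exhibits the trivial $z$-constant factorization, and numerically checks the permutation-equivalence identity:

```python
import random

random.seed(0)

# Toy finite spaces (illustrative assumptions): inputs X, strategies Z, outputs Y.
X, Z, Y = range(2), range(3), range(4)

def random_dist(n):
    """A random probability vector of length n."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

# Router r[x][z] = r(z | x); generator g[x][z][y] = g(y | x, z).
r = [random_dist(len(Z)) for _ in X]
g = [[random_dist(len(Y)) for _ in Z] for _ in X]

def M(r, g):
    """Mixing operator: M(r, g)(y | x) = sum_z g(y | x, z) r(z | x)."""
    return [[sum(g[x][z][y] * r[x][z] for z in Z) for y in Y] for x in X]

def close(p, q, tol=1e-12):
    return all(abs(a - b) < tol
               for rp, rq in zip(p, q) for a, b in zip(rp, rq))

p = M(r, g)

# Trivial factorization: the z-constant generator g_p(y | x, z) = p(y | x)
# reproduces p under *any* router, so F(p) is not a singleton.
g_p = [[p[x][:] for _ in Z] for x in X]
r2 = [random_dist(len(Z)) for _ in X]  # a second, distinct router
assert close(M(r2, g_p), p)

# Permutation equivalence: r^sigma(z | x) = r(sigma^{-1}(z) | x) and
# g^sigma(y | x, z) = g(y | x, sigma^{-1}(z)) leave M(r, g) unchanged.
sigma = [2, 0, 1]                  # a relabeling of Z
inv = [sigma.index(z) for z in Z]  # sigma^{-1}
r_sig = [[r[x][inv[z]] for z in Z] for x in X]
g_sig = [[g[x][inv[z]] for z in Z] for x in X]
assert close(M(r_sig, g_sig), M(r, g))
```

Both assertions hold for structural reasons: the router weights sum to one, so a $z$-constant generator integrates to $p$ itself, and relabeling $z$ merely reindexes the sum defining $M(r, g)$.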

> [!definition]
> For a target law $p$, the *equivalence-class factorization set* is
> $$
> \mathcal{F}(p) / \sim.
> $$
> We say that:
>
> - $p$ is *exactly identifiable* if $\mathcal{F}(p)$ is a singleton;
> - $p$ is *identifiable up to latent reparameterization* if
>   $\mathcal{F}(p) / \sim$ is a singleton;
> - $p$ is *set identified* if $\mathcal{F}(p) / \sim$ contains multiple elements but
>   is still a meaningful constrained set.

This is the right level of generality for the present problem. Even if one can only hope to select a factorization, the selection target should usually be an equivalence class rather than a specific latent coordinate system.

## Formalizing expressiveness of the generator class

Here, we formalize what it might mean for the generator class to be "highly expressive" in a way that captures the trivial factorization phenomenon.

### Exact absorption or mixture-closure style richness

The first meaning is that the generator class can represent or "absorb" the marginalized conditional law $p(y \mid x)$ directly, without using the latent variable $z$.

> [!definition]
> Let $p \in \mathcal{P}(\mathcal{Y} \mid \mathcal{X})$. We say that the generator class $\mathcal{G}$ is *$p$-absorbing*
> if there exists $g_{p} \in \mathcal{G}$ such that
> $$
> g_{p}(y \mid x, z) = p(y \mid x) \quad \text{for all } x, z.
> $$
> More generally, for a target class $\mathcal{P}_*$, we say that $\mathcal{G}$ is
> *$\mathcal{P}_*$-absorbing* if it is $p$-absorbing for every $p \in \mathcal{P}_*$.

This is the strongest and cleanest formalization of the degeneracy. It makes non-identifiability immediate, because the latent variable can be ignored by the generator without affecting the observable law.

### Approximation richness

A weaker but practically relevant notion is universal approximation.

> [!definition]
> Let $d$ be a metric or divergence on conditional laws.
> We say that $\mathcal{G}$ is
> *approximation-rich* for a target class $\mathcal{P}_*$ if for every
> $p \in \mathcal{P}_*$ and every $\epsilon > 0$, there exist $g \in \mathcal{G}$ and some
> latent value $z_0 \in \mathcal{Z}$ such that
> $$
> \sup_{x} d(g(\cdot \mid x, z_0), p(\cdot \mid x)) \leq \epsilon.
> $$

Approximation richness is weaker than exact absorption but leads to the same conceptual issue: near-trivial factorizations can fit the data almost perfectly, so uniqueness of latent structure is unstable. This notion is closer to modern neural conditional generators, including language models, where exact equality is rarely the most natural requirement but highly accurate approximation is enough to make the latent factorization problem ill-posed.

## Selection by optimization

If uniqueness of factorization fails, as it does when the generator class is very rich, we must define a selection criterion that picks out a preferred factorization from the set of all factorizations. This provides a framework for studying strategy identification in the expressive-generator regime.

> [!definition]
> Let $J((r, g); p)$ be a real-valued criterion defined on the set of factorizations $\mathcal{F}(p)$ of $p(y \mid x)$. The *criterion-selected factorization set* is
> $$
> \mathrm{Sel}_J(p) := \operatorname*{arg\,max}_{(r, g) \in \mathcal{F}(p)} J((r, g); p).
> $$
> We then call $\mathrm{Sel}_J(p)$ the identified set induced by the criterion $J$.

This can be posed as an optimization problem that balances multiple desiderata. For example, one might want to maximize the informativeness of the latent variable about the output, while penalizing complexity of the factorization and rewarding separation among the strategy-conditioned generators. This leads to a composite criterion of the form

$$
J((r, g); p) = \mathrm{informativeness}((r, g); p) - \lambda\, \mathrm{complexity}((r, g)) + \gamma\, \mathrm{separation}((r, g); p),
$$

where $\lambda, \gamma \geq 0$ are tradeoff parameters. The role of this decomposition is conceptual: the latent should encode something relevant, should not be gratuitously complicated, and should induce meaningfully distinct strategy-conditioned generators. Next, we discuss candidate terms for each of these desiderata.

> [!remark]
> One could try to require that $(r, g) \sim (r', g') \implies J((r, g); p) = J((r', g'); p)$
> and formulate the objective directly over equivalence classes. For the present
> problem, however, this is too strong. In practical settings the latent space
> $\mathcal{Z}$ typically comes with concrete geometry through parameterization, finite
> precision, and gradient-based optimization, and the criterion may legitimately
> depend on that geometry. In particular, separation objectives may depend on a
> metric $d$ on $\mathcal{Z}$, so they are not invariant under arbitrary invertible
> reparameterizations of the latent space. Thus the quotienting formalism remains
> useful for discussing identification, but we do not build full
> reparameterization invariance into the selection criterion itself.

### Candidate informativeness terms

The most direct candidate is conditional mutual information:

$$
I(Z; Y \mid X).
$$

This measures how much information the latent variable contains about the output beyond what is already encoded in the input. This quantity is zero whenever $g(y \mid x, z)$ is independent of $z$ almost surely, as in the trivial factorization, so it discourages complete collapse of the latent variable into a useless annotation.

### Candidate complexity terms

Complexity penalties prevent the criterion from over-fragmenting the latent. We want to maintain *abstractness* of the strategy variable, making it encode a high-level concept rather than a detailed low-level encoding of the input or output. Examples of complexity terms include:

- support size or latent dimension, e.g. $|\mathrm{supp}(Z)|$ or $\dim(\mathcal{Z})$ (this can be enforced through the choice of latent space $\mathcal{Z}$ rather than by an explicit term in the criterion);
- $I(X; Z)$, penalizing over-detailed encodings of the input into the router.

The choice depends on what one wants a strategy variable to mean.

### Candidate separation terms

The intuition to be formalized here is that the latent strategy variable should induce meaningfully distinct conditional generators, and that larger latent separation should correspond to larger output-level separation. For the present note, we allow the criterion to depend on the explicit geometry of the latent space rather than insisting on invariance under arbitrary invertible reparameterizations. Let $d$ be a metric or nonnegative dissimilarity on $\mathcal{Z}$, and let $D$ be a divergence or discrepancy between conditional output laws. A natural geometry-aware separation functional is

$$
\mathrm{Sep}_{d, D}(r, g) := \mathbb{E}_X \left[ \frac{\int\!\!\int d(z, z')\, D(g(\cdot \mid X, z), g(\cdot \mid X, z'))\, r(dz \mid X)\, r(dz' \mid X)}{\int\!\!\int d(z, z')\, r(dz \mid X)\, r(dz' \mid X)} \right],
$$

whenever the denominator is positive, with the obvious convention (e.g. $\mathrm{Sep}_{d, D} := 0$) when the router degenerates to a point mass and the denominator vanishes. This objective computes a distance-weighted average divergence between strategy-conditioned generators. It emphasizes pairs of latent values that are far apart under the chosen latent metric, while normalizing by the corresponding average latent distance under the router. Thus it does not merely reward generic pairwise diversity; it rewards diversity aligned with latent-space geometry. When $\mathcal{Z}$ is finite, a canonical choice is the discrete metric

$$
d(z, z') = 1 \text{ if } z \neq z' \quad \text{and} \quad d(z, z) = 0,
$$

so that $\mathrm{Sep}_{d, D}$ reduces to a normalized off-diagonal average divergence between experts. In that case, the term rewards distinct behavior for different latent values. Potential choices for $D$ include KL divergence, Jensen--Shannon divergence, total variation, Wasserstein distance, or other integral probability metrics. The choice of $d$ and $D$ determines what notion of strategy distinctness is being encouraged.

Note that this objective is geometry-dependent: it depends on the chosen latent metric $d$ and can therefore change under latent reparameterizations. This is in contrast to the previously defined equivalence classes under invertible reparameterizations. In practical settings such as those involving neural models, this is desirable, since the actual geometry induced by parameterization and optimization can matter for what counts as a meaningful strategy variable.

> [!remark]
> Informativeness alone tends to favor ever finer latent refinements. Complexity
> alone tends to collapse the latent. Separation alone can reward arbitrary
> expert splitting with little statistical meaning. The point of the composite
> criterion is to explicitly encode a balance between relevance, parsimony, and
> distinctness of the conditional generators.

> [!quote]
> Tishby, Pereira, and Bialek formulate representation learning as the problem
> of finding a compressed code $\tilde{X}$ of $X$ that preserves as much
> information as possible about a relevant variable $Y$ @tishby2000. Their
> information bottleneck objective is
> $$
> \min_{p(\tilde{x} \mid x)} I(X; \tilde{X}) - \beta I(\tilde{X}; Y),
> $$
> equivalently maximizing relevance subject to a compression penalty.
> The corresponding variational principle yields self-consistent equations and
> a Blahut--Arimoto-type re-estimation algorithm @tishby2000.
>
> In the present setting, this suggests taking the latent strategy variable $Z$
> not as a uniquely identifiable hidden cause, but as a criterion-defined
> compressed representation chosen for relevance to $Y$ beyond $X$. Replacing
> exact latent identification by a bottleneck-style variational problem is one
> principled way to formalize the selection problem when $\mathcal{G}$ is too rich
> for structural uniqueness to be plausible.
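
To make the candidate criterion terms concrete, here is a minimal finite sketch. The specific distributions, the discrete metric for $d$, and the use of total variation for $D$ are all illustrative assumptions; the point is that the informativeness term $I(Z; Y \mid X)$ and the separation term $\mathrm{Sep}_{d, D}$ are both strictly positive for genuinely distinct experts and both vanish on the trivial factorization:

```python
import math

# Toy model (illustrative assumptions): a single input value, two strategies,
# two outputs; the router is uniform and the two experts disagree sharply.
r = [0.5, 0.5]                # r(z | x) for the single x
g = [[0.9, 0.1], [0.2, 0.8]]  # g[z][y] = g(y | x, z)
p = [sum(r[z] * g[z][y] for z in range(2)) for y in range(2)]  # p = M(r, g)

def cond_mutual_info(r, g, p):
    """I(Z; Y | X) = sum_{z,y} r(z) g(y|z) log(g(y|z)/p(y)), for a single x."""
    return sum(r[z] * g[z][y] * math.log(g[z][y] / p[y])
               for z in range(len(r)) for y in range(len(p))
               if g[z][y] > 0)

def tv(q1, q2):
    """Total variation distance between two finite output laws (one choice of D)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(q1, q2))

def separation(r, g, d, D):
    """Sep_{d,D}: distance-weighted average divergence between experts,
    normalized by the average latent distance under the router."""
    pairs = [(z, z2) for z in range(len(r)) for z2 in range(len(r))]
    num = sum(d(z, z2) * D(g[z], g[z2]) * r[z] * r[z2] for z, z2 in pairs)
    den = sum(d(z, z2) * r[z] * r[z2] for z, z2 in pairs)
    return num / den if den > 0 else 0.0  # point-mass router convention

def discrete(z, z2):
    """Discrete metric on a finite latent space."""
    return 0.0 if z == z2 else 1.0

# The trivial factorization g_p(y | x, z) = p(y | x) scores zero on both
# terms, while the separated experts score strictly positively.
g_triv = [p, p]
assert cond_mutual_info(r, g_triv, p) == 0.0
assert separation(r, g_triv, discrete, tv) == 0.0
assert cond_mutual_info(r, g, p) > 0.0
assert separation(r, g, discrete, tv) > 0.0
```

With a uniform two-point router and the discrete metric, $\mathrm{Sep}_{d, D}$ reduces exactly to $D$ between the two experts, here $\mathrm{TV} = 0.7$; both terms vanish on the trivial factorization, as the remark above anticipates.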
## Next

[[theory/strategy_identification/concepts/03_toy_model_mixture_deterministic_distributions|Next: Toy model, mixture over deterministic conditional distributions]]