
Router Marginal KL


This note records how to interpret the continuous-latent router marginal KL diagnostic.

Definition

The router marginal KL is not a per-example posterior-versus-prior term. It is a batch-level diagnostic on the router prior itself.

For the continuous model, the batch of prior Gaussians is moment-matched to one aggregate diagonal Gaussian, and that aggregate is compared to the standard normal:

$$K_{\mathrm{router\ marg}} = \mathrm{KL}\!\left(\hat p_{\mathrm{batch}}(z)\,\|\,\mathcal{N}(0, I)\right).$$

See losses.py and continuous_cvae.py.
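The moment-matching step above can be sketched in NumPy. This is a minimal sketch, not the actual implementation in losses.py; the function name and the (batch, latent_dim) array layout are assumptions. The aggregate variance follows the law of total variance for the batch mixture.

```python
import numpy as np

def router_marginal_kl(mu, log_var):
    """Moment-match a batch of diagonal Gaussian priors to one aggregate
    diagonal Gaussian, then return KL(aggregate || N(0, I)).

    mu, log_var: arrays of shape (batch, latent_dim) holding the
    per-example router-prior means and log-variances. (Hypothetical
    signature; see losses.py for the real one.)
    """
    var = np.exp(log_var)
    # Mixture moments over the batch (law of total variance).
    mu_agg = mu.mean(axis=0)
    var_agg = (var + mu ** 2).mean(axis=0) - mu_agg ** 2
    # Closed-form KL between diagonal Gaussians, with N(0, I) as the target.
    kl_per_dim = 0.5 * (var_agg + mu_agg ** 2 - 1.0 - np.log(var_agg))
    return kl_per_dim.sum()
```

When every prior in the batch is already N(0, I), the aggregate is too, and the diagnostic is zero.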

Objective Placement

The continuous objective may include this term as:

$$\mathcal{J} = \cdots + \lambda_{\mathrm{rmkl}} K_{\mathrm{router\ marg}} + \cdots$$

If router_marginal_kl_to_prior_weight = 0, then this is diagnostic-only for that run. A large raw value may still be informative, but it is not something the optimizer was asked to minimize.
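This gating can be sketched as follows. The function is hypothetical; only the config key router_marginal_kl_to_prior_weight comes from the text, and the real wiring in the training code may differ. The term is always computed (so it can be logged as a diagnostic) but only enters the loss when its weight is nonzero.

```python
def continuous_objective_term(base_loss, router_marginal_kl,
                              router_marginal_kl_to_prior_weight=0.0):
    """Add the router marginal KL to the objective only when its weight
    is nonzero; with weight 0 the term is diagnostic-only."""
    if router_marginal_kl_to_prior_weight > 0.0:
        return base_loss + router_marginal_kl_to_prior_weight * router_marginal_kl
    return base_loss
```

With weight 0, a large logged value of the term never influenced the gradients of that run.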

What It Means

Small router marginal KL means the batch-aggregate prior usage looks close to N(0, I).

It does not mean:

  • the posterior matches the prior on each example
  • the latent is informative
  • the latent is strategy-coded

This term is about global latent occupancy, not semantic alignment.
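A small numerical example makes the first non-implication concrete: a batch whose per-example priors are all far from N(0, I) can still moment-match to exactly N(0, I). The values below are illustrative, not taken from any run.

```python
import numpy as np

def kl_diag_to_std_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var), axis=-1)

# Two clusters of prior means at +/-0.8 with per-example variance 0.36:
# each example's Gaussian is clearly not N(0, 1), but the batch mixture
# has mean 0 and total variance 0.36 + 0.8**2 = 1.0, so the moment-matched
# aggregate is N(0, 1) and the router marginal KL vanishes.
mu = np.array([[0.8], [-0.8], [0.8], [-0.8]])
var = np.full_like(mu, 0.36)

per_example_kl = kl_diag_to_std_normal(mu, var)        # each well above 0
mu_agg = mu.mean(axis=0)
var_agg = (var + mu ** 2).mean(axis=0) - mu_agg ** 2
aggregate_kl = kl_diag_to_std_normal(mu_agg, var_agg)  # essentially 0
```

So a near-zero router marginal KL is compatible with every individual prior sitting well away from the reference normal.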

Why It Matters

This diagnostic can reveal whether the router prior is drifting into a globally distorted or off-center latent geometry even when per-example prior KL looks reasonable.

At the same time, matching the aggregate prior usage to N(0, I) is only a geometric constraint. The model can still satisfy it while:

  • mixing multiple strategies in the same region
  • encoding nuisance factors instead of strategy
  • relying on the backbone more than on z

So router marginal KL is best treated as a geometric regularizer, not as a direct proxy for strategy disentanglement.

Practical Reading Rule

When router marginal KL looks large, the first question is whether it was active in the objective. If not, the right reading is usually “the model was not asked to keep global prior usage close to the reference normal,” not “training is broken.”
