Router Marginal KL
Back to CVAE objective and loss terms
This note records how to interpret the continuous-latent router marginal KL diagnostic.
Definition
The router marginal KL is not a per-example posterior-versus-prior term. It is a batch-level diagnostic on the router prior itself.
For the continuous model, the batch of B prior Gaussians N(μ_i, diag(σ_i²)) is moment-matched to one aggregate diagonal Gaussian, and that aggregate is compared to the standard normal:

KL( N(μ_agg, diag(σ_agg²)) || N(0, I) )

where μ_agg = (1/B) Σ_i μ_i and σ_agg² = (1/B) Σ_i (σ_i² + μ_i²) − μ_agg².
See losses.py and continuous_cvae.py.
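A minimal sketch of this computation, assuming PyTorch tensors of per-example prior means and log-variances; the actual implementation in losses.py may differ in shapes and naming:

```python
import torch

def router_marginal_kl(prior_mu: torch.Tensor, prior_logvar: torch.Tensor) -> torch.Tensor:
    """Batch-level diagnostic: moment-match the batch of router-prior
    Gaussians into one aggregate diagonal Gaussian, then take its KL
    to N(0, I). Shapes: [batch, latent_dim]."""
    prior_var = prior_logvar.exp()
    # Moment matching: mean of means, law-of-total-variance for the spread.
    agg_mu = prior_mu.mean(dim=0)                                   # [latent_dim]
    agg_var = (prior_var + prior_mu.pow(2)).mean(dim=0) - agg_mu.pow(2)
    agg_var = agg_var.clamp_min(1e-8)                               # numerical floor
    # Closed-form KL( N(agg_mu, diag(agg_var)) || N(0, I) ), summed over dims.
    return 0.5 * (agg_var + agg_mu.pow(2) - 1.0 - agg_var.log()).sum()
```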
Objective Placement
The continuous objective may include this term as:

total objective = (base CVAE objective) + router_marginal_kl_to_prior_weight · (router marginal KL)

If router_marginal_kl_to_prior_weight = 0, the term is diagnostic-only for that run. A large raw value may still be informative, but it is not something the optimizer was asked to minimize.
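As a sketch of how that gating might look, reusing the router_marginal_kl function above; the function and argument names here are illustrative, with `weight` playing the role of router_marginal_kl_to_prior_weight:

```python
def objective_with_marginal_kl(base_loss, prior_mu, prior_logvar, weight, metrics):
    """Hypothetical wiring: compute the term always, optimize it only
    when its weight is nonzero."""
    marginal_kl = router_marginal_kl(prior_mu, prior_logvar)
    metrics["router_marginal_kl"] = marginal_kl.item()   # always logged
    if weight > 0:
        return base_loss + weight * marginal_kl          # active regularizer
    return base_loss                                     # diagnostic-only run
```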
What It Means
Small router marginal KL means the batch-aggregate prior usage looks close to N(0, I).
It does not mean:
- the posterior matches the prior on each example
- the latent is informative
- the latent is strategy-coded
This term is about global latent occupancy, not semantic alignment.
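A toy illustration of the gap, reusing the router_marginal_kl sketch above: two tight prior clusters at ±0.8 aggregate to exactly N(0, 1), so the marginal KL is near zero even though every individual prior is far from standard normal:

```python
import math
import torch

# Two tight prior clusters at +/-0.8 with variance 0.36. The aggregate mean
# is 0 and the aggregate variance is 0.36 + 0.64 = 1, so the marginal KL is
# ~0 even though each individual prior is far from N(0, 1).
mu = torch.tensor([[0.8], [-0.8]]).repeat(64, 1)        # [128, 1]
logvar = torch.full_like(mu, math.log(0.36))

per_example_kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(dim=1)
print(per_example_kl.mean())            # ~0.51: individually far from N(0, 1)
print(router_marginal_kl(mu, logvar))   # ~0: aggregate matches N(0, 1)
```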
Why It Matters
This diagnostic can reveal whether the router prior is drifting into a globally distorted or off-center latent geometry even when per-example prior KL looks reasonable.
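The converse case, again with the sketch above: if every per-example prior sits at N(3, 1) and the posterior tracks its prior, the usual per-example KL(posterior || prior) is near zero, yet the aggregate prior usage is badly off-center and the marginal KL flags it:

```python
import torch

# Every per-example prior sits at N(3, 1). A posterior that tracks its prior
# gives ~0 per-example KL(posterior || prior), but the aggregate prior usage
# is far off-center and the marginal KL catches it.
mu = torch.full((128, 1), 3.0)
logvar = torch.zeros_like(mu)           # unit variance

print(router_marginal_kl(mu, logvar))   # 0.5 * 3**2 = 4.5
```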
At the same time, matching the aggregate prior usage to N(0, I) is only a geometric constraint. The model can still satisfy it while:
- mixing multiple strategies in the same region
- encoding nuisance factors instead of strategy
- relying on the backbone more than on z
So router marginal KL is best treated as a geometric regularizer, not as a direct proxy for strategy disentanglement.
Practical Reading Rule
When router marginal KL looks large, the first question is whether it was active in the objective. If not, the right reading is usually “the model was not asked to keep global prior usage close to the reference normal,” not “training is broken.”