Posterior Variance
This note records how to interpret the continuous-latent posterior-variance diagnostic and auxiliary loss.
Definition
The posterior-variance term is the batch mean of the per-dimension posterior variance:

V_{\mathrm{post}} = \frac{1}{B D} \sum_{b=1}^{B} \sum_{d=1}^{D} \sigma_{b,d}^{2}

where \sigma_{b,d}^{2} = \exp(\mathrm{logvar}_{b,d}) is the predicted variance of latent dimension d for example b. In code it is simply exp(logvar).mean(). See losses.py.
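A minimal sketch of this computation, assuming `logvar` is a `[batch, latent_dim]` array of per-dimension log-variances (NumPy is used here for illustration; the actual code in losses.py operates on tensors):

```python
import numpy as np

def posterior_variance(logvar: np.ndarray) -> float:
    """Batch mean of per-dimension posterior variance: exp(logvar).mean()."""
    return float(np.exp(logvar).mean())

# logvar = 0 everywhere, so every per-dimension variance is exp(0) = 1
logvar = np.zeros((8, 16))  # [batch, latent_dim]
print(posterior_variance(logvar))  # → 1.0
```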
Objective Placement
The continuous objective may include this term as:

\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \lambda_{\mathrm{post}} \, V_{\mathrm{post}}

where \lambda_{\mathrm{post}} is posterior_variance_weight. If posterior_variance_weight = 0, the term is diagnostic-only for that run: large raw values may still be informative, but they are not directly penalized.
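A hedged sketch of this placement, with `continuous_loss` as a hypothetical helper and the other loss terms passed in as plain numbers (the real objective lives in the project's training code):

```python
import numpy as np

def continuous_loss(recon_loss, prior_kl, logvar, posterior_variance_weight=0.0):
    """Base objective plus an optional posterior-variance penalty.

    With posterior_variance_weight = 0, V_post is still computed and can
    be logged, but it contributes nothing to the total.
    """
    v_post = float(np.exp(logvar).mean())
    total = recon_loss + prior_kl + posterior_variance_weight * v_post
    return total, v_post

total, v_post = continuous_loss(1.0, 0.1, np.zeros((4, 8)),
                                posterior_variance_weight=0.0)
# v_post = 1.0, but total = 1.1: diagnostic-only at zero weight
```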
What It Means
Small posterior variance means the posterior samples stay near their means. That can make sampled z more stable and easier to decode.
It does not mean:
- the posterior mean is semantically useful
- the prior matches the posterior
- the router can recover the same structure at test time
This term measures absolute posterior sharpness, not semantic structure.
Why It Can Be Large While Prior KL Is Small
The prior KL and posterior-variance terms constrain different relationships.
If both q(z | x, y) and p(z | x) become broad in a coordinated way, then:
- K_{\mathrm{prior}} = \mathrm{KL}(q(z | x, y)\,\|\,p(z | x)) can stay small, because the two distributions match each other
- V_{\mathrm{post}} can still be large, because both distributions are broad in absolute terms
That is why posterior-variance regularization is not redundant with KL regularization.
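A small numeric illustration of this non-redundancy, using the closed-form KL between diagonal Gaussians (`kl_diag_gauss` is an illustrative helper, not project code): when q and p are made identical and broad, the prior KL is exactly zero while the mean variance stays large.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians."""
    return float(np.sum(0.5 * (np.log(var_p / var_q)
                               + (var_q + (mu_q - mu_p) ** 2) / var_p
                               - 1.0)))

d = 16
mu = np.zeros(d)
broad = np.full(d, 4.0)  # variance 4 in every dimension

print(kl_diag_gauss(mu, broad, mu, broad))  # → 0.0 (the distributions match)
print(broad.mean())                         # → 4.0 (V_post is still large)
```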
How To Read The Scale
Because this metric is a mean variance, its raw scale is directly interpretable in latent units:
- V_post ≈ 1 means an average latent variance of about 1
- V_post ≈ 4 means an average latent standard deviation of about 2
So a rise in posterior variance is not automatically an instability signal. In many runs it simply means the model is choosing a broad latent because absolute sharpness is cheap under the current objective.
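The variance-to-scale conversion above is just a square root; a tiny worked example (`latent_std` is an illustrative name):

```python
import numpy as np

def latent_std(v_post: float) -> float:
    """Average latent standard deviation implied by a mean variance V_post."""
    return float(np.sqrt(v_post))

print(latent_std(1.0))  # → 1.0
print(latent_std(4.0))  # → 2.0
```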
Practical Reading Rule
When posterior variance looks large, the first questions are:
- Is the term active in the objective?
- Is the concern semantic alignment, sampled-latent stability, or prior matching?
Large posterior variance under a zero weight usually means "the model was free to keep the latent broad," not "optimization diverged."