Posterior Variance
This note records how to interpret the continuous-latent posterior-variance diagnostic and auxiliary loss.
Definition
The posterior-variance term is the batch mean of the per-dimension posterior variance:

V_{\mathrm{post}} = \frac{1}{B D} \sum_{b=1}^{B} \sum_{d=1}^{D} \sigma_{b,d}^{2}

where \sigma_{b,d}^{2} = \exp(\mathrm{logvar}_{b,d}) is the predicted variance of latent dimension d for example b. In code it is simply exp(logvar).mean(). See losses.py.
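A minimal sketch of this computation, assuming `logvar` is a `[batch, latent_dim]` array of per-dimension log-variances (NumPy is used here for illustration; the actual code in losses.py operates on tensors):

```python
import numpy as np

def posterior_variance(logvar: np.ndarray) -> float:
    """Batch mean of per-dimension posterior variance: exp(logvar).mean()."""
    return float(np.exp(logvar).mean())

# logvar = 0 everywhere, so every per-dimension variance is exp(0) = 1
logvar = np.zeros((8, 16))  # [batch, latent_dim]
print(posterior_variance(logvar))  # → 1.0
```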
Objective Placement
The continuous objective may include this term as:

\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \lambda_{\mathrm{post}} \, V_{\mathrm{post}}

where \lambda_{\mathrm{post}} is posterior_variance_weight. If posterior_variance_weight = 0, the term is diagnostic-only for that run: large raw values may still be informative, but they are not directly penalized.
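A hedged sketch of this placement, with `continuous_loss` as a hypothetical helper and the other loss terms passed in as plain numbers (the real objective lives in the project's training code):

```python
import numpy as np

def continuous_loss(recon_loss, prior_kl, logvar, posterior_variance_weight=0.0):
    """Base objective plus an optional posterior-variance penalty.

    With posterior_variance_weight = 0, V_post is still computed and can
    be logged, but it contributes nothing to the total.
    """
    v_post = float(np.exp(logvar).mean())
    total = recon_loss + prior_kl + posterior_variance_weight * v_post
    return total, v_post

total, v_post = continuous_loss(1.0, 0.1, np.zeros((4, 8)),
                                posterior_variance_weight=0.0)
# v_post = 1.0, but total = 1.1: diagnostic-only at zero weight
```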
What It Means
Small posterior variance means the posterior samples stay near their means. That can make sampled z more stable and easier to decode.
It does not mean:
- the posterior mean is semantically useful
- the prior matches the posterior
- the router can recover the same structure at test time
This term measures absolute posterior sharpness, not semantic structure.
Why It Can Be Large While Prior KL Is Small
The prior KL and posterior-variance terms constrain different relationships.
If both q(z | x, y) and p(z | x) become broad in a coordinated way, then:
- K_{\mathrm{prior}} = \mathrm{KL}(q(z | x, y)\,\|\,p(z | x)) can stay small, because the two distributions match each other
- V_{\mathrm{post}} can still be large, because both distributions are broad in absolute terms
That is why posterior-variance regularization is not redundant with KL regularization.
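A small numeric illustration of this non-redundancy, using the closed-form KL between diagonal Gaussians (`kl_diag_gauss` is an illustrative helper, not project code): when q and p are made identical and broad, the prior KL is exactly zero while the mean variance stays large.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians."""
    return float(np.sum(0.5 * (np.log(var_p / var_q)
                               + (var_q + (mu_q - mu_p) ** 2) / var_p
                               - 1.0)))

d = 16
mu = np.zeros(d)
broad = np.full(d, 4.0)  # variance 4 in every dimension

print(kl_diag_gauss(mu, broad, mu, broad))  # → 0.0 (the distributions match)
print(broad.mean())                         # → 4.0 (V_post is still large)
```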
How To Read The Scale
Because this metric is a mean variance, its raw scale is directly interpretable in latent units:
- V_post ≈ 1 means an average latent variance of about 1
- V_post ≈ 4 means an average latent standard deviation of about 2
So a rise in posterior variance is not automatically an instability signal. In many runs it simply means the model is choosing a broad latent because absolute sharpness is cheap under the current objective.
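The variance-to-scale conversion above is just a square root; a tiny worked example (`latent_std` is an illustrative name):

```python
import numpy as np

def latent_std(v_post: float) -> float:
    """Average latent standard deviation implied by a mean variance V_post."""
    return float(np.sqrt(v_post))

print(latent_std(1.0))  # → 1.0
print(latent_std(4.0))  # → 2.0
```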
Practical Reading Rule
When posterior variance looks large, the first questions are:
- Is the term active in the objective?
- Is the concern semantic alignment, sampled-latent stability, or prior matching?
Large posterior variance under a zero weight usually means "the model was free to keep the latent broad," not "optimization diverged."