Token-Weighted Excess Reconstruction Loss for Discrete CVAE Disentanglement

This note identifies a gradient-scaling failure mode in the standard conditional ELBO when applied to teacher-forced autoregressive models, and derives a token-weighted reconstruction loss that corrects it.

The setting is the shared-parameter discrete CVAE described in the companion note *CVAE Post-Training Methodology for Latent Strategy Disentanglement*.

Motivation: Teacher-Forcing Dilution

Setup and notation

Consider a discrete CVAE with $K$ latent values trained under teacher forcing. For one example $(x, y)$ with $y = (y_1, \dots, y_T)$, define:

  • per-token z-conditioned NLL: $\ell_t(z) = -\log p_\phi(y_t \mid x, z, y_{<t})$,
  • per-token z-free baseline NLL: $b_t = -\log p_\phi(y_t \mid x, y_{<t})$.

Both use the same backbone $\phi$; the baseline simply omits the `<z> z_k </z>` section from the input.

The current reconstruction loss averages uniformly over all supervised tokens $\mathcal{S}$ with $|\mathcal{S}| = T$:

$$R(z) = \frac{1}{T} \sum_{t \in \mathcal{S}} \ell_t(z).$$

Two-regime structure under teacher forcing

Under teacher forcing, the solution sequence has two regimes:

  • Strategy-sensitive prefix ($t \le t_0$). The teacher-forced history $y_{<t}$ does not yet determine the strategy. The base model has non-trivial uncertainty: $b_t > 0$. The z-conditioned model can reduce this uncertainty when $z$ matches the strategy used in $y$, so $\ell_t(z)$ varies across latent values.

  • Strategy-determined suffix ($t > t_0$). The partial solution $y_{<t}$ reveals the strategy. Continuation is nearly deterministic: $b_t \approx 0$ and $\ell_t(z) \approx 0$ for all $z$, because the teacher-forced prefix renders $z$ redundant.

The boundary $t_0$ is not a sharp cutoff but a useful abstraction. For synthetic-sequence tasks, it corresponds roughly to the first few solution tokens where the procedural strategy is manifested.

The dilution

Under the two-regime model, the token-averaged reconstruction becomes:

$$R(z) = \frac{1}{T} \sum_{t=1}^{t_0} \ell_t(z) + \underbrace{\frac{1}{T} \sum_{t=t_0+1}^{T} \ell_t(z)}_{\approx\, 0}.$$

The variation of $R(z)$ across latent values:

$$R(z) - R(z') = \frac{1}{T} \sum_{t=1}^{t_0} \left[\ell_t(z) - \ell_t(z')\right] = \frac{t_0}{T}\,\delta_{zz'},$$

where $\delta_{zz'} = \frac{1}{t_0} \sum_{t \le t_0} [\ell_t(z) - \ell_t(z')]$ is the mean per-informative-token difference. The factor $t_0/T$ is the dilution: the $T - t_0$ easy tokens contribute zero to the numerator but inflate the denominator.
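The dilution factor can be checked numerically. A minimal sketch with synthetic per-token losses (all values below are illustrative, chosen only to realize the two-regime structure):

```python
import numpy as np

T, t0 = 200, 5                     # sequence length, strategy-sensitive prefix

# Synthetic per-token NLLs for two latent values: they differ only on the
# first t0 tokens; the suffix is near-deterministic (~0) under both.
ell_z = np.zeros(T)
ell_zp = np.zeros(T)
ell_z[:t0] = 1.0                   # matching z: residual uncertainty
ell_zp[:t0] = 2.0                  # mismatched z': higher NLL on the prefix

delta = (ell_z[:t0] - ell_zp[:t0]).mean()   # mean per-informative-token gap
gap = ell_z.mean() - ell_zp.mean()          # variation of the uniform average R(z)

print(gap, (t0 / T) * delta)       # identical: the gap is diluted by t0/T
```

With $t_0 = 5$ and $T = 200$, a per-token gap of $1$ nat shrinks to a loss gap of $0.025$, which is the signal the posterior gradient has to work with.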

Gradient Analysis

Posterior parameters $\xi$

The reconstruction gradient on the posterior parameters $\xi$ of $q_\xi(z \mid x, y)$ is:

$$\frac{\partial \mathcal{L}_\mathrm{recon}}{\partial \xi} = \sum_{z} \frac{\partial q_\xi}{\partial \xi} \cdot R(z).$$

Because $\sum_{z} \partial q_\xi / \partial \xi = 0$ (the probabilities sum to one), any component of $R(z)$ that is constant across $z$ cancels. The effective gradient magnitude is proportional to the variation:

$$\left\lVert \frac{\partial \mathcal{L}_\mathrm{recon}}{\partial \xi} \right\rVert \sim \max_{z \neq z'} |R(z) - R(z')| = \frac{t_0}{T}\, |\delta|.$$

The KL gradient on $\xi$ is $O(\beta)$, independent of $T$. The reconstruction signal dominates when:

$$\frac{t_0}{T}\, |\delta| > \beta \cdot C$$

for a constant $C$ set by the KL curvature. When $t_0 / T \ll 1$ and $\beta = O(1)$, the KL wins and the posterior collapses to the prior.

Why a scalar baseline does not help

Subtracting a stop-gradiented scalar $C = \operatorname{sg}\!\left(\frac{1}{T} \sum_{t} b_t\right)$ from $R(z)$ gives $R(z) - C$. Since $C$ is constant across $z$, it cancels in $\sum_{z} (\partial q / \partial \xi) \cdot [R(z) - C]$ by the sum-to-one constraint. The gradient is identical to the unmodified loss. More generally, any additive term that does not vary with $z$ is a no-op for the posterior gradient.
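The cancellation is easy to verify with the analytic softmax gradient. A small sketch (the logits, losses, and baseline value are all made up for illustration):

```python
import numpy as np

def posterior_grad(logits, R):
    """Gradient of sum_z q(z) * R(z) w.r.t. softmax posterior logits.
    Uses d/d xi_j E_q[R] = q_j * (R_j - E_q[R])."""
    q = np.exp(logits - logits.max())
    q /= q.sum()
    return q * (R - q @ R)

logits = np.array([0.3, -0.1, 0.5, 0.0])
R = np.array([1.2, 0.9, 1.1, 1.4])

g = posterior_grad(logits, R)
g_shifted = posterior_grad(logits, R - 0.7)   # subtract a z-constant "baseline"
print(np.allclose(g, g_shifted))              # True: the constant cancels
```

The shift by $0.7$ subtracts from both $R_j$ and $\mathbb{E}_q[R]$, so the gradient is bitwise unchanged.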

z-dependent parameters $\psi$

The z-token embeddings $\psi$ receive gradient only from reconstruction (the KL does not involve $\psi$):

$$\frac{\partial \mathcal{L}_\mathrm{recon}}{\partial \psi} = \sum_{z} q(z) \cdot \frac{1}{T} \sum_{t} \frac{\partial \ell_t(z)}{\partial \psi}.$$

For early tokens, $\partial \ell_t / \partial \psi$ is nonzero because the z-token influences predictions before the teacher-forced prefix dominates. For late tokens, it is near zero. The effective gradient magnitude is:

$$\left\lVert \frac{\partial \mathcal{L}_\mathrm{recon}}{\partial \psi} \right\rVert \sim \frac{t_0}{T} \cdot \lVert \nabla_{\psi} \rVert_{\mathrm{early}}.$$

The dilution also degrades the signal-to-noise ratio: the $T - t_0$ late tokens contribute small residual gradients in random directions, mixing with the $t_0$ meaningful gradient contributions from early tokens.

Token-Weighted Reconstruction Loss

Definition

Define per-token weights as the stop-gradiented z-free baseline NLL:

$$w_t = \operatorname{sg}(b_t) = \operatorname{sg}\!\left(-\log p_\phi(y_t \mid x, y_{<t})\right).$$

Replace the uniform token average with a weighted average:

$$R^w(z) = \frac{\sum_{t \in \mathcal{S}} w_t \cdot \ell_t(z)}{\sum_{t \in \mathcal{S}} w_t + \epsilon},$$

where $\epsilon > 0$ prevents division by zero.

The full weighted reconstruction loss under the posterior:

$$\mathcal{L}_\mathrm{recon}^w = \sum_{z=1}^{K} q_\xi(z \mid x, y) \cdot R^w(z).$$

The total objective:

$$\mathcal{J} = \mathcal{L}_\mathrm{recon}^w + \beta \cdot \mathrm{KL}\!\left(q_\xi(z \mid x, y),\, p_\phi(z \mid x)\right) + \text{auxiliary terms}.$$
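As a sanity check, the weighted objective can be assembled end to end on synthetic tensors. Everything below (shapes, the uniform posterior and prior, $\beta$, and the random losses) is illustrative, not the repository's training code:

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 4, 12

ell = rng.uniform(0.0, 2.0, size=(K, T))   # per-token NLL under each latent value
b = rng.uniform(0.0, 1.5, size=T)          # z-free baseline NLL (already detached)
eps = 1e-8

w = b                                      # w_t = sg(b_t)
Rw = (ell * w).sum(axis=1) / (w.sum() + eps)   # R^w(z), one value per z

q = np.full(K, 1.0 / K)                    # toy posterior q_xi(z | x, y)
p = np.full(K, 1.0 / K)                    # toy prior p_phi(z | x)
beta = 0.1

recon = q @ Rw                             # E_q[R^w(z)]
kl = (q * np.log(q / p)).sum()             # zero here: q equals the prior
J = recon + beta * kl
print(J)
```

With a uniform posterior the KL term vanishes and $\mathcal{J}$ reduces to the expected weighted reconstruction, which is the collapsed regime the gradient analysis above is trying to escape.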

Intuition

Tokens where the base model is uncertain ($b_t$ large) are precisely the positions where z-conditioning has room to reduce NLL: the strategy has not yet been revealed by the teacher-forced prefix. Tokens where the base model is already confident ($b_t \approx 0$) are positions where $z$ is redundant. Weighting by $b_t$ focuses the reconstruction objective on the z-informative prefix without requiring task-specific knowledge of $t_0$.

Gradient improvement

Since $w_t \approx 0$ for $t > t_0$, the weighted reconstruction concentrates on the strategy-sensitive prefix:

$$R^w(z) \approx \frac{\sum_{t=1}^{t_0} b_t \cdot \ell_t(z)}{\sum_{t=1}^{t_0} b_t}.$$

The variation across latent values:

$$R^w(z) - R^w(z') = \frac{\sum_{t=1}^{t_0} b_t \cdot [\ell_t(z) - \ell_t(z')]}{\sum_{t=1}^{t_0} b_t} = O(|\delta|),$$

with no $t_0/T$ dilution factor. The posterior gradient is amplified by $T/t_0$ relative to the uniform-average loss. The z-parameter gradient is similarly amplified: the denominator is $\sum_{t \le t_0} b_t$ instead of $T$, and late-token noise gradients are suppressed.
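The $T/t_0$ amplification can be observed directly in the same two-regime toy setup used for the dilution (synthetic values, for illustration only):

```python
import numpy as np

T, t0 = 200, 5
ell_z, ell_zp = np.zeros(T), np.zeros(T)
ell_z[:t0], ell_zp[:t0] = 1.0, 2.0         # NLL gap only on the informative prefix

b = np.zeros(T)
b[:t0] = 1.0                               # base model uncertain only early on
w = b                                      # w_t = sg(b_t)

uniform_gap = abs(ell_z.mean() - ell_zp.mean())
weighted_gap = abs((w @ ell_z) / w.sum() - (w @ ell_zp) / w.sum())

print(weighted_gap / uniform_gap)          # T / t0 = 40 in this toy setup
```

The uniform gap is $t_0/T \cdot |\delta| = 0.025$ while the weighted gap recovers the full $|\delta| = 1$, an amplification of exactly $T/t_0$ when the suffix losses and weights are exactly zero.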

The reconstruction gradient on $\xi$ now dominates the KL when:

$$|\delta| > \beta \cdot C,$$

which is $T/t_0$ times easier to satisfy than the uniform-average condition.

Comparison With Scalar Baselines

It is important to distinguish token-level weighting from scalar baseline subtraction. Several candidate formulations fail to change the optimization:

Scalar stop-gradient baseline (no-op)

$$\mathcal{L}^{\mathrm{exc,\ scalar}} = \sum_{z} q(z) \cdot \left[R(z) - \operatorname{sg}(\bar{b})\right] = \mathcal{L}_\mathrm{recon} - \mathrm{const}.$$

All gradients are identical to the standard ELBO because the subtracted term is constant with respect to all parameters.

Per-token stop-gradient baseline, uniform average (no-op for $\xi$ and $\psi$)

$$\mathcal{L}^{\mathrm{exc,\ per\text{-}token}} = \sum_{z} q(z) \cdot \frac{1}{T} \sum_{t} \left[\ell_t(z) - \operatorname{sg}(b_t)\right].$$

For $\xi$: the $\operatorname{sg}(b_t)$ terms factor out by $\sum_{z} \partial q / \partial \xi = 0$. Gradient unchanged.

For $\psi$: the $\operatorname{sg}(b_t)$ terms do not depend on $\psi$. Gradient unchanged.

For the backbone $\phi$ (without stop-gradient on $b_t$): the gradients partially cancel, which removes reconstruction pressure on $\phi$ without creating z-usage pressure. This is harmful: the backbone quality degrades with no compensating benefit.

Token weighting (this proposal)

The critical difference is that weighting changes the denominator of the average, not the numerator. This alters the variation of Rw(z)R^w (z) across latent values, which is the quantity that drives the posterior gradient. Subtracting a baseline — whether scalar or per-token — does not change this variation.

Conditions For Effectiveness

When it works

  • Nonzero initial z-sensitivity. At initialization, z-token embeddings are random, producing small random differences in $\ell_t(z)$ across $z$ at early positions. Under the standard loss, this signal is diluted by $t_0/T$ and may be too weak to overcome the KL pressure. Under the weighted loss, the signal is preserved at $O(|\delta|)$, potentially sufficient for the optimizer to bootstrap z-usage.

  • Non-trivial base-model uncertainty at early positions. The synthetic-sequence tasks have genuine strategy ambiguity: different strategies produce different early solution tokens for the same input $x$. The base model (initialized from the entangled checkpoint) should have $b_t > 0$ at these positions.

  • Self-adapting behavior. Early in training, the base model may be uncertain at many positions, making $w_t$ roughly uniform. As the backbone improves and late tokens become easy, the weights concentrate on the strategy-sensitive prefix. The reweighting strengthens exactly when dilution becomes the binding constraint.

When it may not suffice

  • Complete z-insensitivity. If the decoder produces $\ell_t(z) = \ell_t(z')$ for all $z, z'$ at every position (including early tokens), the weighted variation is still zero. Reweighting amplifies existing z-variation but cannot create it from nothing. This limitation can be addressed by pairing the weighted loss with an architectural change that ensures nonzero $\partial \ell_t / \partial \psi$ (for example, z-conditioned layer normalization or a multi-token z-prefix).

  • Degenerate weights. If $b_t \approx 0$ for all $t$ (including early tokens), the denominator $\sum_t w_t \approx 0$ and the loss degenerates. This would indicate that the task has no token-level strategy ambiguity, in which case disentanglement via reconstruction alone may be impossible.

  • Single-token z-conditioning. The z-token is a single token in the input sequence. Even with correct gradient scaling, its influence on early-position predictions is mediated by attention and may remain small. Reweighting is necessary for correct gradient balance but may not be sufficient to overcome an architectural bottleneck.

Implementation

Computing the baseline $b_t$

The z-free baseline requires one additional forward pass per batch through the z-free input template:

<bos> <x> {x} </x> <y> {y} </y> <eos>

This is the original entangled-model format. The per-token NLL $b_t$ is computed under teacher forcing at supervised positions and detached before use as weights.
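A minimal sketch of the per-token NLL computation from decoder logits, in numpy for illustration. The function name, shapes, and toy values are assumptions; in practice this would use the training framework's cross-entropy and its detach mechanism:

```python
import numpy as np

def per_token_nll(logits, targets):
    """Per-token NLL -log p(y_t | ...) from decoder logits.
    logits: (T, V) unnormalized scores at supervised positions;
    targets: (T,) gold token ids. Illustrative shapes, not repo API."""
    logits = logits - logits.max(axis=-1, keepdims=True)            # stable
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets]

# Toy check: a confident correct prediction gives near-zero b_t,
# a uniform prediction over 3 tokens gives b_t = log 3.
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])
b = per_token_nll(logits, targets)   # detach before using as weights w_t
print(b)
```

The first position (peaked, correct) yields $b_t \approx 0$ and contributes almost no weight; the second (uniform over three tokens) yields $b_t = \log 3$ and dominates, which is exactly the concentration behavior the weighting relies on.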

Two implementation policies are possible:

  • Frozen policy. Compute $b_t$ from a frozen reference model initialized from the entangled checkpoint. This keeps the weighting signal stable while the disentangled model is trained.

  • Dynamic policy. Compute $b_t$ from the current training backbone at each step (still stop-gradiented before use), so weights evolve with the model.

Current implementation status in this repository: the frozen policy is implemented; the dynamic policy is intentionally deferred.

Modifying `masked_token_nll_per_example`

The current implementation computes:

denom = active.sum(dim=1).clamp_min(1.0)
return (token_loss * active).sum(dim=1) / denom

The weighted variant replaces the uniform `active` mask with `active * weights`, where `weights` is the detached per-token baseline NLL:

weighted_active = active * weights.detach()
denom = weighted_active.sum(dim=1).clamp_min(eps)
return (token_loss * weighted_active).sum(dim=1) / denom

Cost

One additional forward pass per batch (z-free baseline). For the exact discrete CVAE with $K$ latent values, the training loop already performs $K + 1$ forward passes per batch ($K$ latent-conditioned and $1$ posterior). The baseline adds one more, increasing cost by $1/(K+1)$.

Fallback behavior

When the denominator $\sum_t w_t$ is below a threshold (base model is confident everywhere), the loss should fall back to the standard uniform-average reconstruction to avoid numerical instability. A simple implementation:

$$R^{\mathrm{eff}}(z) = \begin{cases} R^w(z) & \text{if } \sum_{t \in \mathcal{S}} w_t > \tau, \\ R(z) & \text{otherwise}, \end{cases}$$

where $\tau$ is a small positive threshold (for example, $10^{-4}$).
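A self-contained numpy sketch of the weighted average with this fallback (function name, shapes, and the toy inputs are illustrative; the repository code operates on framework tensors):

```python
import numpy as np

def weighted_or_uniform_recon(token_loss, active, weights, tau=1e-4, eps=1e-8):
    """Per-example reconstruction: baseline-weighted average over supervised
    tokens, falling back to the uniform average when the summed weights are
    below tau. Shapes (B, T); weights assumed already detached."""
    w = active * weights
    wsum = w.sum(axis=1)
    weighted = (token_loss * w).sum(axis=1) / np.maximum(wsum, eps)
    uniform = (token_loss * active).sum(axis=1) / np.maximum(active.sum(axis=1), 1.0)
    return np.where(wsum > tau, weighted, uniform)

tl = np.array([[1.0, 2.0, 3.0, 4.0]])      # toy per-token losses
act = np.ones((1, 4))
print(weighted_or_uniform_recon(tl, act, np.array([[1.0, 0.0, 0.0, 0.0]])))  # [1.]
print(weighted_or_uniform_recon(tl, act, np.zeros((1, 4))))                  # [2.5]
```

With all weight on the first token the loss is that token's NLL alone; with degenerate all-zero weights the function silently reverts to the uniform average rather than dividing by $\epsilon$.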
