
ss-disentangled-direct-pretrained-pv-rmkl-sweep-v1

Summary

Initial partial report for the follow-up direct-pretrained sweep over posterior-variance and router-marginal-KL weights. This note should be read as an update to the earlier ss-disentangled-direct-pretrained-v1 interpretation, using the new partial task x pv x rmkl collection to test whether explicitly controlling those previously explosive diagnostics improves latent-strategy alignment.

Experiment Metadata

  • Collection: ss-disentangled-direct-pretrained-pv-rmkl-sweep-v1
  • Run date: collection created 2026-04-14T01:21:16, last updated 2026-04-14T15:43:46
  • Collection definition: .slurmkit/collections/ss-disentangled-direct-pretrained-pv-rmkl-sweep-v1.yaml
  • Source run directory: .runs/synthetic_sequences/disentangled/synthetic-pretrained-disentangled-direct-pv-rmkl-sweep-v1
  • Collection commit ID: generated jobs are pinned to commit 017e5e4
  • Planned design: 6 tasks x 2 pv settings x 2 rmkl settings x 1 seed = 24 jobs (enumerated in the sketch after this list)
  • Observed completion at analysis time: 19/24 complete, 3/24 failed, 2/24 still running
  • Known failure mode: the 3 failed cells are sorting_algorithms OOM failures, not modeling divergences
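
For concreteness, the planned grid can be enumerated directly. The sketch below is illustrative, not the actual .slurmkit collection definition; the six task names are the ones discussed in this report, and the cell labels follow the pvX_rmklY convention used in the interpretation sections.

```python
from itertools import product

# The six synthetic-sequence tasks covered by this sweep, as referenced
# throughout this report.
TASKS = [
    "multidigit_addition",
    "list_summation",
    "grid_pathfinding",
    "linear_equation_solving",
    "base_conversion",
    "sorting_algorithms",
]
PV_WEIGHTS = [0.0, 0.1]    # posterior_variance_weight
RMKL_WEIGHTS = [0.0, 0.1]  # router_marginal_kl_to_prior_weight
SEEDS = [314159]

# 6 tasks x 2 pv x 2 rmkl x 1 seed = 24 jobs.
jobs = [
    {"task": t, "seed": s, "cell": f"pv{pv}_rmkl{rmkl}"}
    for t, pv, rmkl, s in product(TASKS, PV_WEIGHTS, RMKL_WEIGHTS, SEEDS)
]
assert len(jobs) == 24
```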

Design

  • Backbone: pretrained Qwen2.5-0.5B
  • Training mode: direct disentangling without prior mixture fine-tuning
  • Adaptation: lora
  • Latent setup: continuous, latent dim 8, last-token pooling
  • Shared settings: beta=0.1, baseline-normalized reconstruction enabled, router_support_weight=0.0, token_weighted_reconstruction_weight=0.0, inter_latent_divergence_weight=0.0
  • Sweep dimensions (see the hedged loss-composition sketch after this list):
    • posterior_variance_weight in {0.0, 0.1}
    • router_marginal_kl_to_prior_weight in {0.0, 0.1}
  • Training budget: 1 epoch, max_sequences_per_epoch=200000, seed 314159
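
How the two swept weights enter the objective is not spelled out in this note, so the following is a minimal sketch under stated assumptions: a diagonal-Gaussian posterior over the 8-dim latent, a standard-normal prior, a PV term that penalizes the magnitude of the posterior variance, and an RMKL term computed against a moment-matched Gaussian fit to the batch-marginal posterior. All names are hypothetical, and the actual implementation may define either term differently.

```python
import torch

def sweep_objective(recon_loss, mu, logvar,
                    beta=0.1,
                    posterior_variance_weight=0.0,
                    router_marginal_kl_to_prior_weight=0.0):
    """Hypothetical composition of the swept objective.

    mu, logvar: [batch, latent_dim] parameters of a diagonal-Gaussian
    posterior over the latent (latent_dim = 8 in this sweep).
    """
    var = logvar.exp()

    # Standard per-sample KL(q(z|x) || N(0, I)), weighted by beta.
    kl_to_prior = 0.5 * (var + mu.pow(2) - 1.0 - logvar).sum(dim=-1).mean()

    # Assumed PV diagnostic: total posterior variance per sample.
    posterior_variance_term = var.sum(dim=-1).mean()

    # Assumed RMKL diagnostic: KL between a moment-matched Gaussian fit
    # to the batch-marginal posterior and the N(0, I) prior.
    marg_mu = mu.mean(dim=0)
    marg_var = var.mean(dim=0) + mu.var(dim=0, unbiased=False)
    router_marginal_kl = 0.5 * (
        marg_var + marg_mu.pow(2) - 1.0 - marg_var.log()
    ).sum()

    return (recon_loss
            + beta * kl_to_prior
            + posterior_variance_weight * posterior_variance_term
            + router_marginal_kl_to_prior_weight * router_marginal_kl)
```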

Key Figures

  • Validation accuracy and alignment dynamics by task and setting: validation accuracy stays high across most completed cells, but router and posterior probe behavior varies by task and setting.
  • Objective terms by task and setting: normalized reconstruction and KL-to-prior remain small across settings. Because the objective definition changes with the added PV and RMKL terms, total loss should be interpreted within-setting rather than naively across settings.
  • Weighted auxiliary PV and RMKL terms by task and setting: the new regularized settings do control the targeted auxiliary quantities, often by orders of magnitude relative to the unregularized baseline.
  • Standardized objective versus auxiliary dynamics by task and setting: the dynamics confirm that the optimizer responds strongly to the added regularizers, but the resulting representation changes are not cleanly mirrored by improved router probe accuracy.
  • Completed-cell final metrics by task and setting: final task accuracy remains near-saturated for most completed cells, while router probe differences across settings are modest and task-dependent.

Updated Interpretation

Relative to the earlier ss-disentangled-direct-pretrained-v1 report, the new partial sweep weakens the strongest version of the hypothesis that large posterior-variance and router-marginal-KL diagnostics were themselves the main reason the latent failed to align with reasoning strategies.

The main reason is simple: the new loss terms do what they were designed to do, but latent-strategy alignment still does not improve reliably. Adding router_marginal_kl_to_prior_weight=0.1 drives the corresponding diagnostic down dramatically on completed cells, from a mean of about 3805 in the pv0.0_rmkl0.0 baseline cells to about 1.0 in pv0.0_rmkl0.1, while keeping final accuracy essentially unchanged. However, mean router probe accuracy does not improve with that change; if anything it is slightly lower (0.449 -> 0.441 on the completed cells).

The same pattern largely holds for posterior-variance regularization. Turning on posterior_variance_weight=0.1 cuts the posterior-variance term sharply, from a mean of about 730 in the unregularized completed cells to about 75 in pv0.1_rmkl0.0. But this does not produce a broad router-side alignment gain. Mean router probe falls to about 0.429, while mean posterior-mu probe rises to about 0.489. That is a useful nuance: posterior-side structure may become somewhat cleaner under PV regularization, but the router-sampled latent still does not become clearly more strategy-aligned. So the bottleneck may be the router or the router-posterior gap rather than posterior geometry alone.
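
The router-probe and posterior-mu-probe numbers are linear-decodability measures over different latent views. A minimal sketch of such a probe, assuming per-sequence latents and strategy labels are available as arrays (all names hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(latents: np.ndarray, strategy_labels: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe decoding the
    reasoning-strategy label from one latent view (e.g. router-sampled
    z for the router probe, posterior means for the mu probe)."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, latents, strategy_labels, cv=5).mean()

# Hypothetical usage; the difference between the two numbers is the
# router-posterior gap discussed above:
#   probe_accuracy(router_sampled_z, labels) vs probe_accuracy(posterior_mu, labels)
```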

The combined pv0.1_rmkl0.1 setting is also instructive. It keeps both auxiliary terms moderate and preserves very high final accuracy, but still does not produce a decisive global gain in router probe. Its mean router probe (0.436) remains below the unregularized baseline mean. This suggests that merely adding these two regularizers is not sufficient to make z behave like a clean strategy variable in the direct-pretrained regime.
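
For reference, the setting-level means quoted in the last three paragraphs are plain aggregations over completed cells; a minimal sketch, assuming per-run final metrics have been gathered into a table with hypothetical column names:

```python
import pandas as pd

# Hypothetical table of completed-run final metrics; the filename and
# column names are illustrative, not the actual logging schema.
runs = pd.read_csv("completed_cell_final_metrics.csv")

# Mean per (pv, rmkl) setting across completed cells, e.g. the quoted
# rmkl_term drop from ~3805 (pv0.0_rmkl0.0) to ~1.0 (pv0.0_rmkl0.1).
cell_means = (
    runs.groupby(["pv", "rmkl"])[
        ["final_acc", "router_probe_acc", "posterior_mu_probe_acc",
         "pv_term", "rmkl_term"]
    ].mean()
)
print(cell_means)
```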

The task-level picture is mixed rather than uniformly negative. multidigit_addition does improve under pv0.1_rmkl0.1, reaching the best router probe among its completed settings (0.511 vs 0.485 in the earlier unregularized run). But list_summation and grid_pathfinding look best in the unregularized setting, linear_equation_solving is nearly flat across settings, and the most informative cases are exactly the ones still missing: baseline base_conversion is still running, and sorting_algorithms has only the pv0.0_rmkl0.1 cell completed because the other three OOMed. So there is not yet enough evidence to claim a reliable positive effect from these regularizers.

The new sweep does, however, strengthen two of the earlier hypotheses. First, objective mismatch remains a strong explanation: explicitly reducing these auxiliary diagnostics does not reliably translate into better strategy decodability, so those diagnostics are not sufficient proxies for the target property. Second, backbone bypass / weak reliance on z still looks plausible: the pretrained backbone continues to deliver near-perfect task performance across many settings, which leaves the optimization free to satisfy auxiliary constraints without making router-sampled z the main carrier of reasoning-strategy information.

At the same time, the sweep softens one earlier story. The prior report suggested that uncontrolled latent noise might be a central reason the sampled latent fails to map to strategy. The new results only support that weakly. Penalizing posterior variance appears to help posterior-side linear decodability somewhat, but it does not close the router-side alignment gap. So uncontrolled variance may be part of the picture, but it no longer looks like the dominant explanation on its own.

Working Update

  • Stronger than before:
    • Objective mismatch between the active loss and the desired “strategy-coded latent” behavior.
    • Backbone bypass or residual use of pretrained internal circuitry instead of forcing semantic use of z.
  • Weaker than before:
    • “Exploding auxiliary diagnostics are the main cause of weak latent-strategy alignment.”
  • Still open:
    • Posterior-side regularization may help some tasks or some latent views, but the current partial collection does not show a robust router-side gain.
    • The most ambiguous cases are still the incomplete ones: baseline base_conversion and most of the sorting_algorithms grid.

Practical Takeaway

The partial sweep provides evidence that these regularizers are mechanistically active but not yet semantically sufficient. In other words: the model can be made to control posterior variance and router marginal KL, but that alone does not reliably make the latent line up with reasoning strategies. The current best update is therefore not “the earlier diagnosis was wrong,” but rather “the earlier diagnosis was incomplete: those exploding terms were real, but controlling them is not enough to solve the representation problem.”
