ss-disentangled-direct-pretrained-v1
Summary
Brief note for the initial direct-pretrained disentangling collection using a pretrained Qwen2.5-0.5B backbone. This collection was meant as a small run-level diagnostic check before broader sweeps, with emphasis on loss dynamics rather than hyperparameter effects.
Experiment Metadata
- Collection: `ss-disentangled-direct-pretrained-v1`
- Run date: collection created 2026-04-13T03:12:15, last updated 2026-04-13T14:22:27
- Collection definition: `.slurmkit/collections/ss-disentangled-direct-pretrained-v1.yaml`
- Source run directory: `.runs/synthetic_sequences/disentangled/synthetic-pretrained-disentangled-direct-v1`
- Collection commit ID: all 5 completed runs record git commit `c2de547`
- Planned design: 6 tasks x 1 setting x 1 seed = 6 jobs
- Observed completion: 5/6 complete, with `sorting_algorithms` incomplete and excluded from quantitative summaries
Design
- Backbone: pretrained `Qwen2.5-0.5B`
- Training mode: direct disentangling without prior mixture fine-tuning
- Adaptation: `lora`
- Latent setup: `continuous`, latent dim 8, last-token pooling
- Optimization: `beta=0.1`, baseline-normalized reconstruction enabled
- Diagnostic-only loss terms: `posterior_variance_weight=0.0`, `router_marginal_kl_to_prior_weight=0.0`, `router_support_weight=0.0`, `token_weighted_reconstruction_weight=0.0`, `inter_latent_divergence_weight=0.0`
- Training budget: 1 epoch, `max_sequences_per_epoch=200000`, seed `314159`
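The active objective in this collection is small enough to write down. The helper below is a minimal sketch, not the actual training code (the function name and argument layout are assumptions): only baseline-normalized reconstruction and the beta-weighted KL-to-prior contribute gradients, while every diagnostic term carries weight 0.0.

```python
def collection_loss(recon_loss: float,
                    baseline_recon: float,
                    kl_to_prior: float,
                    beta: float = 0.1) -> float:
    """Sketch of the optimized objective for this collection.

    Baseline-normalized reconstruction (assumed form of the
    "baseline-normalized reconstruction enabled" setting) plus
    beta * KL(posterior || prior).
    """
    normalized_recon = recon_loss / baseline_recon
    # Posterior variance, router marginal KL, router support,
    # token-weighted reconstruction, and inter-latent divergence are
    # all weighted 0.0 here: logged as diagnostics, never optimized.
    return normalized_recon + beta * kl_to_prior
```

With `beta=0.1`, a raw reconstruction loss of 2.0 against a baseline of 4.0 and a KL of 1.0 yields an objective of 0.6.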
Key Figures
Caption: Validation final-answer accuracy stays very high while router-sampled and posterior-mean probe accuracy remain much lower and often degrade over training.
Caption: The optimized objective terms, normalized reconstruction and KL-to-prior, are both low and generally stable by the end of training.
Caption: Posterior-variance and router-marginal-KL diagnostics are much larger in scale and can grow strongly, but they are not active terms in the optimized objective for this collection.
Caption: Standardized trends highlight the mismatch: objective terms shrink or stabilize while the unweighted diagnostics often remain large or worsen.
Caption: Final answer accuracy is nearly saturated on all completed tasks, but router probe accuracy remains only moderate and varies across tasks.
Interpretation
The central pattern is a strong decoupling between task success and latent-strategy alignment. Final sampled-test accuracy is essentially saturated across the 5 completed tasks (0.9994-1.0), but final router-sampled latent probe accuracy remains only moderate (0.36-0.49) and posterior-mean probe accuracy is similarly limited (0.39-0.54).
The loss decomposition helps explain why. The actual optimized objective in this collection is just normalized reconstruction plus 0.1 * KL-to-prior, and that objective behaves well. The terms that look explosive, posterior variance and router marginal KL to the prior, were logged only as diagnostics because their weights were zero. So the training run is not being asked to control those quantities, and it does not.
This suggests a simple mechanism-level explanation for the weak strategy mapping: the pretrained backbone can already solve these tasks well enough that the easiest solution is to preserve accuracy without forcing z to become a clean strategy variable. In that regime, z can drift toward a noisy, weakly structured, or nuisance-capturing code while the generator still produces correct answers. The fact that probe accuracy often declines over training reinforces that interpretation: optimization appears to improve the objective while eroding an initially more strategy-separable latent geometry.
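As a rough illustration of what "probe accuracy" measures here, a toy nearest-centroid probe over latents can be sketched as follows. This is illustrative only; the actual probes are presumably trained linear classifiers, and this version just shows the shape of the measurement: how decodable the strategy label is from `z`.

```python
from collections import defaultdict

def probe_accuracy(latents, labels):
    """Toy nearest-centroid probe: classify each latent by its closest
    per-class centroid and report accuracy against the strategy labels.
    A crude stand-in for the linear probes reported above."""
    dim = len(latents[0])
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for z, y in zip(latents, labels):
        counts[y] += 1
        for i, v in enumerate(z):
            sums[y][i] += v
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}
    correct = 0
    for z, y in zip(latents, labels):
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(z, centroids[c])))
        correct += pred == y
    return correct / len(labels)
```

On cleanly clustered latents this returns 1.0; a drifting, nuisance-capturing `z` of the kind hypothesized above would push it toward chance.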
Working Hypotheses
- Objective mismatch: nothing in the active loss directly rewards linearly decodable strategy structure.
- Sampling noise in the latent: with zero posterior-variance penalty, sampled `z` may become much less strategy-clean than posterior means.
- Backbone bypass: direct disentangling from a pretrained model may let the decoder solve the task mostly through pretrained internal circuitry rather than through `z`.
- Wrong latent factor: the latent may be carrying real information, but not the coarse reasoning strategy labels used for evaluation.
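The sampling-noise hypothesis can be made concrete with the standard reparameterization, z = mu + sigma * eps. The sketch below assumes a diagonal Gaussian posterior (consistent with the continuous latent setup, but otherwise an assumption about the implementation):

```python
import math
import random

def sample_z(mu, log_var, rng=random):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, 1).

    With posterior_variance_weight=0.0, nothing in the objective keeps
    sigma small, so sampled z can be far noisier than the posterior
    mean mu. That would explain router-sampled probe accuracy lagging
    posterior-mean probe accuracy in this collection."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

When the log-variance is very negative, a sample collapses onto the mean; as it grows unchecked, samples smear out and any strategy structure present in `mu` becomes harder to decode from `z`.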