ss-disentangled-direct-pretrained-v1
Summary
Brief note for the initial direct-pretrained disentangling collection using a pretrained Qwen2.5-0.5B backbone. This collection was meant as a small run-level diagnostic check before broader sweeps, with emphasis on loss dynamics rather than hyperparameter effects.
Experiment Metadata
- Collection: `ss-disentangled-direct-pretrained-v1`
- Run date: collection created 2026-04-13T03:12:15, last updated 2026-04-13T14:22:27
- Collection definition: `.slurmkit/collections/ss-disentangled-direct-pretrained-v1.yaml`
- Source run directory: `.runs/synthetic_sequences/disentangled/synthetic-pretrained-disentangled-direct-v1`
- Collection commit ID: all 5 completed runs record git commit `c2de547`
- Planned design: 6 tasks x 1 setting x 1 seed = 6 jobs
- Observed completion: 5/6 complete, with `sorting_algorithms` incomplete and excluded from quantitative summaries
Design
- Backbone: pretrained `Qwen2.5-0.5B`
- Training mode: direct disentangling without prior mixture fine-tuning
- Adaptation: `lora`
- Latent setup: `continuous`, latent dim 8, last-token pooling
- Optimization: `beta=0.1`, baseline-normalized reconstruction enabled
- Diagnostic-only loss terms: `posterior_variance_weight=0.0`, `router_marginal_kl_to_prior_weight=0.0`, `router_support_weight=0.0`, `token_weighted_reconstruction_weight=0.0`, `inter_latent_divergence_weight=0.0`
- Training budget: 1 epoch, `max_sequences_per_epoch=200000`, seed `314159`
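The active objective in this collection is small enough to write down. The helper below is a minimal sketch, not the actual training code (the function name and argument layout are assumptions): only baseline-normalized reconstruction and the beta-weighted KL-to-prior contribute gradients, while every diagnostic term carries weight 0.0.

```python
def collection_loss(recon_loss: float,
                    baseline_recon: float,
                    kl_to_prior: float,
                    beta: float = 0.1) -> float:
    """Sketch of the optimized objective for this collection.

    Baseline-normalized reconstruction (assumed form of the
    "baseline-normalized reconstruction enabled" setting) plus
    beta * KL(posterior || prior).
    """
    normalized_recon = recon_loss / baseline_recon
    # Posterior variance, router marginal KL, router support,
    # token-weighted reconstruction, and inter-latent divergence are
    # all weighted 0.0 here: logged as diagnostics, never optimized.
    return normalized_recon + beta * kl_to_prior
```

With `beta=0.1`, a raw reconstruction loss of 2.0 against a baseline of 4.0 and a KL of 1.0 yields an objective of 0.6.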
Key Figures
Caption: Validation final-answer accuracy stays very high while router-sampled and posterior-mean probe accuracy remain much lower and often degrade over training.
Caption: The optimized objective terms, normalized reconstruction and KL-to-prior, are both low and generally stable by the end of training.
Caption: Posterior-variance and router-marginal-KL diagnostics are much larger in scale and can grow strongly, but they are not active terms in the optimized objective for this collection.
Caption: Standardized trends highlight the mismatch: objective terms shrink or stabilize while the unweighted diagnostics often remain large or worsen.
Caption: Final answer accuracy is nearly saturated on all completed tasks, but router probe accuracy remains only moderate and varies across tasks.
Interpretation
The central pattern is a strong decoupling between task success and latent-strategy alignment. Final sampled-test accuracy is essentially saturated across the 5 completed tasks (0.9994-1.0), but final router-sampled latent probe accuracy remains only moderate (0.36-0.49) and posterior-mean probe accuracy is similarly limited (0.39-0.54).
The loss decomposition helps explain why. The actual optimized objective in this collection is just normalized reconstruction plus 0.1 * KL-to-prior, and that objective behaves well. The terms that look explosive, posterior variance and router marginal KL to the prior, were logged only as diagnostics because their weights were zero. So the training run is not being asked to control those quantities, and it does not.
This suggests a simple mechanism-level explanation for the weak strategy mapping: the pretrained backbone can already solve these tasks well enough that the easiest solution is to preserve accuracy without forcing z to become a clean strategy variable. In that regime, z can drift toward a noisy, weakly structured, or nuisance-capturing code while the generator still produces correct answers. The fact that probe accuracy often declines over training reinforces that interpretation: optimization appears to improve the objective while eroding an initially more strategy-separable latent geometry.
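As a rough illustration of what "probe accuracy" measures here, a toy nearest-centroid probe over latents can be sketched as follows. This is illustrative only; the actual probes are presumably trained linear classifiers, and this version just shows the shape of the measurement: how decodable the strategy label is from `z`.

```python
from collections import defaultdict

def probe_accuracy(latents, labels):
    """Toy nearest-centroid probe: classify each latent by its closest
    per-class centroid and report accuracy against the strategy labels.
    A crude stand-in for the linear probes reported above."""
    dim = len(latents[0])
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for z, y in zip(latents, labels):
        counts[y] += 1
        for i, v in enumerate(z):
            sums[y][i] += v
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}
    correct = 0
    for z, y in zip(latents, labels):
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(z, centroids[c])))
        correct += pred == y
    return correct / len(labels)
```

On cleanly clustered latents this returns 1.0; a drifting, nuisance-capturing `z` of the kind hypothesized above would push it toward chance.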
Working Hypotheses
- Objective mismatch: nothing in the active loss directly rewards linearly decodable strategy structure.
- Sampling noise in the latent: with zero posterior-variance penalty, sampled `z` may become much less strategy-clean than posterior means.
- Backbone bypass: direct disentangling from a pretrained model may let the decoder solve the task mostly through pretrained internal circuitry rather than through `z`.
- Wrong latent factor: the latent may be carrying real information, but not the coarse reasoning strategy labels used for evaluation.
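The sampling-noise hypothesis can be made concrete with the standard reparameterization, z = mu + sigma * eps. The sketch below assumes a diagonal Gaussian posterior (consistent with the continuous latent setup, but otherwise an assumption about the implementation):

```python
import math
import random

def sample_z(mu, log_var, rng=random):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, 1).

    With posterior_variance_weight=0.0, nothing in the objective keeps
    sigma small, so sampled z can be far noisier than the posterior
    mean mu. That would explain router-sampled probe accuracy lagging
    posterior-mean probe accuracy in this collection."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

When the log-variance is very negative, a sample collapses onto the mean; as it grows unchecked, samples smear out and any strategy structure present in `mu` becomes harder to decode from `z`.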