
synthetic-disentangled-Mar19

Summary

This collection is a sparse 21-setting loss sweep over three synthetic tasks:

  • sorting_algorithms
  • grid_pathfinding
  • multidigit_addition

Planned cardinality was 63 jobs. The collection completed 62/63 jobs; the missing run is sorting_algorithms + beta0p1_base.

The main conclusion is that the sweep sharply separates "high task accuracy" from "clean latent-strategy alignment". Across the collection, controlled_test_final/mean_overall_final_answer_accuracy remains in a narrow 0.9853 to 1.0 range, while controlled_test_final/alignment_one_to_one ranges from 0.1624 to 0.6930. This means the sweep is informative primarily about disentanglement, not about whether the model can solve the task at all.
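The report does not spell out how alignment_one_to_one is computed. One plausible reading, sketched here purely for illustration (the metric's actual definition lives in the analysis code, not in this report), is the fraction of examples explained by the best one-to-one assignment between latent codes and ground-truth strategies:

```python
from itertools import permutations

def alignment_one_to_one(latent_ids, strategy_ids, n_latents, n_strategies):
    """Fraction of examples covered by the best one-to-one
    latent -> strategy assignment. Illustrative reconstruction only;
    assumes n_latents >= n_strategies and brute-forces the matching,
    which is fine for the handful of strategies in these tasks.
    """
    counts = [[0] * n_strategies for _ in range(n_latents)]
    for z, s in zip(latent_ids, strategy_ids):
        counts[z][s] += 1
    best = max(
        sum(counts[z][s] for s, z in enumerate(perm))
        for perm in permutations(range(n_latents), n_strategies)
    )
    return best / len(latent_ids)
```

Under this reading, a value of 1.0 means every latent maps cleanly to exactly one strategy, while values near chance indicate the latents carry no stable strategy semantics.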

At the block level, the strongest pattern is that B_twr_only is the only group that changes the best achievable alignment on grid_pathfinding and sorting_algorithms. By contrast, all non-token-weighted blocks remain tightly clustered on those tasks, even when they use Mar18-inspired composite losses.

The strongest positive result is the token-weighted reconstruction setting beta0p1_twr2, but only on some tasks. The strongest negative result is that the occupancy-oriented auxiliaries (posterior_entropy, router marginal KL, router support, and their composites) do not materially improve strategy alignment on their own.

Best Results

grid_pathfinding

The clear standout is:

  • run: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19/task-grid_pathfinding_entangled-ms-m-opt-muon_adapt-full_opt-muon_sched-constant_lr-0.001_beta-0.1_bs-128_s314159-j1392169
  • loss setting: beta0p1_twr2
  • metrics:
    • alignment_one_to_one = 0.6930
    • local_strategy_coverage_mean = 1.0000
    • local_strategy_full_coverage_rate = 1.0000
    • mean_overall_final_answer_accuracy = 1.0000
    • loss/val/reconstruction_loss = 0.000006

This is qualitatively different from the rest of the collection. The best non-token-weighted runs on the same task remain near:

  • alignment_one_to_one ~= 0.224
  • local_strategy_coverage_mean ~= 0.443
  • local_strategy_full_coverage_rate ~= 0.0
  • mean_overall_final_answer_accuracy = 1.0

Representative non-token-weighted comparison:

  • run: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19/task-grid_pathfinding_entangled-ms-m-opt-muon_adapt-full_opt-muon_sched-constant_lr-0.001_beta-0.5_bs-128_s314159-j1392164
  • loss setting: beta0p5_base
  • metrics:
    • alignment_one_to_one = 0.224233
    • local_strategy_coverage_mean = 0.442500
    • local_strategy_full_coverage_rate = 0.0
    • mean_overall_final_answer_accuracy = 1.0000
    • loss/val/reconstruction_loss = 0.027564

So the additive token-weighted term does not give a small incremental gain here; it pushes the system into a different regime.
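The exact weighting behind the twr settings is defined in the sweep config, not in this report. As a hedged sketch of what an additive token-weighted reconstruction term could look like, the following assumes `weights` marks strategy-discriminative token positions (a hypothetical choice; twr1/twr2 may weight differently):

```python
import numpy as np

def token_weighted_reconstruction(logits, targets, weights, twr_coeff=2.0):
    """Additive token-weighted reconstruction loss (illustrative sketch).

    logits:  (seq_len, vocab) unnormalized scores
    targets: (seq_len,) integer token ids
    weights: (seq_len,) nonnegative weights, e.g. 1.0 on
             strategy-discriminative tokens and 0.0 elsewhere
    """
    # Stable log-softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    per_token_nll = -logp[np.arange(len(targets)), targets]
    plain = per_token_nll.mean()
    weighted = (per_token_nll * weights).sum() / max(weights.sum(), 1.0)
    # Additive form: the standard term is kept and the weighted
    # term is stacked on top, scaled by twr_coeff.
    return plain + twr_coeff * weighted
```

The point of the additive form is that discriminative tokens stop being diluted by the bulk of easy teacher-forced tokens, which is consistent with the regime change seen above.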

sorting_algorithms

The best run is:

  • run: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19/task-sorting_algorithms_entangled-ms-m-opt-muon_adapt-full_opt-muon_sched-constant_lr-0.001_beta-0.1_bs-128_s314159-j1392148
  • loss setting: beta0p1_twr2
  • metrics:
    • alignment_one_to_one = 0.27066
    • local_strategy_coverage_mean = 0.39780
    • mean_overall_final_answer_accuracy = 0.99304
    • loss/val/reconstruction_loss = 0.014604

The best non-token-weighted runs on this task remain near:

  • alignment_one_to_one ~= 0.1754
  • local_strategy_coverage_mean ~= 0.223
  • mean_overall_final_answer_accuracy ~= 0.9997
  • loss/val/reconstruction_loss ~= 0.0237

Representative non-token-weighted comparison:

  • run: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19/task-sorting_algorithms_entangled-ms-m-opt-muon_adapt-full_opt-muon_sched-constant_lr-0.001_beta-0.1_bs-128_s314159-j1392158
  • loss setting: beta0p1_pe0p5_rmk0p5_rsw0p1
  • metrics:
    • alignment_one_to_one = 0.175460
    • local_strategy_coverage_mean = 0.223340
    • mean_overall_final_answer_accuracy = 0.999720
    • loss/val/reconstruction_loss = 0.023731

So beta0p1_twr2 substantially improves alignment and even improves the plain reconstruction term, but it trades away a small amount of answer accuracy.

multidigit_addition

This task does not benefit from token-weighted reconstruction. The best run is:

  • run: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19/task-multidigit_addition_entangled-ms-m-opt-muon_adapt-full_opt-muon_sched-constant_lr-0.001_beta-0.1_bs-128_s314159-j1392198
  • loss setting: beta0p1_rsw1
  • metrics:
    • alignment_one_to_one = 0.332367
    • local_strategy_coverage_mean = 0.332400
    • mean_overall_final_answer_accuracy = 0.996600
    • loss/val/reconstruction_loss = 0.021614

By contrast, the beta0p1_twr2 run on this task reaches only:

  • alignment_one_to_one = 0.3280
  • local_strategy_coverage_mean = 0.3281
  • mean_overall_final_answer_accuracy = 0.991267
  • loss/val/reconstruction_loss = 0.023170

The best token-weighted run on this task is actually beta0p1_twr1 (...j1392189), but it is still not better than beta0p1_rsw1 on the main disentanglement metrics:

  • beta0p1_twr1: alignment_one_to_one = 0.331233, local_strategy_coverage_mean = 0.331267, mean_overall_final_answer_accuracy = 0.997533, loss/val/reconstruction_loss = 0.021937
  • beta0p1_rsw1: alignment_one_to_one = 0.332367, local_strategy_coverage_mean = 0.332400, mean_overall_final_answer_accuracy = 0.996600, loss/val/reconstruction_loss = 0.021614

So multidigit_addition behaves as a negative control for the token-weighted intervention in its current additive form.

Conclusions

1. Accuracy is not the bottleneck in this collection

Task accuracy stays high almost everywhere, while disentanglement varies widely. This sharpens the earlier diagnosis: the method's main failure mode is not failure to generate valid solutions, but failure to assign stable global strategy semantics to the latents.

2. Token-weighted reconstruction is the only term that materially changes disentanglement

The best block on grid_pathfinding and sorting_algorithms is B_twr_only. On grid_pathfinding, it is dramatically better than every other block; on sorting_algorithms, it is clearly better than every non-token-weighted block.

This is the strongest evidence so far that teacher-forcing dilution is a real part of the problem.

3. Occupancy-oriented auxiliaries do not identify strategy semantics

The best non-token-weighted block maxima remain tightly clustered:

  • grid_pathfinding: all non-B_twr_only blocks stay near 0.223 to 0.226 alignment.
  • sorting_algorithms: all non-B_twr_only blocks stay near 0.175.
  • multidigit_addition: all blocks stay near 0.331 to 0.332.

That is strong evidence that posterior entropy, router marginal KL, router support, and the Mar18-style composites mainly shape occupancy or usage, not strategy semantics.
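To make the "occupancy, not semantics" point concrete: under their usual textbook definitions (the exact forms used in this sweep may differ), both the posterior-entropy and router-marginal-KL terms depend only on the distribution over latents, never on which strategy a latent encodes. A minimal sketch:

```python
import numpy as np

def occupancy_auxiliaries(posteriors, eps=1e-9):
    """Two occupancy-style terms named in this report, sketched under
    standard definitions (illustrative; the sweep's exact forms may differ).

    posteriors: (batch, n_latents) rows of q(z|x) per example.
    """
    p = np.clip(posteriors, eps, 1.0)
    # Mean per-example posterior entropy: low => confident assignments.
    posterior_entropy = -(p * np.log(p)).sum(axis=-1).mean()
    # KL(marginal || uniform): low => balanced latent usage.
    marginal = p.mean(axis=0)
    k = p.shape[-1]
    router_marginal_kl = (marginal * np.log(marginal * k)).sum()
    return posterior_entropy, router_marginal_kl
```

Any relabeling of which latent means which strategy leaves both terms unchanged, which is exactly why they can shape usage without identifying semantics.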

4. Full local strategy recovery is still absent almost everywhere

controlled_test_final/local_strategy_full_coverage_rate is positive only for grid_pathfinding, and only one run achieves a nontrivial value:

  • beta0p1_twr2 reaches 1.0

The other positive values are tiny numerical traces (0.0001 or 0.0002). So the broad prior concern remains valid: clean local and global factorization is not the default outcome of the current method.

5. Total loss is not the right cross-setting selection metric

Because the objective changes across settings, loss/val/total_loss is not a clean model-quality measure. In particular, token-weighted runs can have higher total loss while still having better reconstruction and better alignment.

The more stable cross-setting comparison is:

  • alignment metrics,
  • local coverage metrics,
  • answer accuracy,
  • and plain loss/val/reconstruction_loss.
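A selection rule along those lines can be sketched as follows, assuming each run is a dict keyed by the metric names used in this report (the key names here are shortened for readability):

```python
def select_runs(runs, min_accuracy=0.99):
    """Rank runs by alignment among those that keep answer accuracy
    high, instead of comparing loss/val/total_loss across settings
    with different objectives.
    """
    eligible = [
        r for r in runs
        if r["mean_overall_final_answer_accuracy"] >= min_accuracy
    ]
    # Tie-break on local coverage, then on plain reconstruction loss.
    return sorted(
        eligible,
        key=lambda r: (
            -r["alignment_one_to_one"],
            -r["local_strategy_coverage_mean"],
            r["reconstruction_loss"],
        ),
    )
```

Applied to the grid_pathfinding numbers above, this rule would rank beta0p1_twr2 first even though its total loss is not directly comparable to the baseline's.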

Recommendations

Near-term

  1. Promote token-weighted reconstruction to the mainline diagnosis path, but only on tasks where it has already shown leverage: grid_pathfinding and sorting_algorithms.
  2. Run a focused follow-up around beta0p1_twr2:
    • replacement of standard reconstruction by the weighted term,
    • additive vs replacement comparison,
    • weighted-heavy curriculum.
  3. Keep multidigit_addition in the follow-up set as a negative control. It is useful precisely because the current token-weighted intervention does not help there.
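The three follow-up variants in item 2 can be expressed as one switch over how the plain and weighted reconstruction terms are combined. This is a sketch under our own naming (none of these mode names or the linear ramp come from the sweep config):

```python
def combined_reconstruction(plain, weighted, mode="additive",
                            step=0, ramp_steps=1000, coeff=2.0):
    """Combine plain and token-weighted reconstruction terms.

    additive:    keep both terms (the Mar19 form).
    replacement: drop the plain term entirely.
    curriculum:  weighted-heavy schedule that linearly shifts mass
                 from the plain term to the weighted term.
    """
    if mode == "additive":
        return plain + coeff * weighted
    if mode == "replacement":
        return weighted
    if mode == "curriculum":
        t = min(step / ramp_steps, 1.0)
        return (1 - t) * plain + t * coeff * weighted
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the three variants behind one switch makes the follow-up directly comparable run-to-run, since everything else in the objective stays fixed.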

Medium-term

  1. Stop spending broad sweep budget on occupancy-only tuning until a stronger counterexample appears. Mar19 suggests those terms do not create semantics on their own.
  2. Add label-aware posterior/router diagnostics so future runs can say whether the failure is in posterior inference, router prediction, or generation.
  3. Run a small groundtruth-data vs model-generated supervision comparison on the promising token-weighted settings before over-generalizing the Mar19 conclusions.

Concrete next experiments

  1. grid_pathfinding: beta0p1_twr2 with additive vs replacement weighted reconstruction.
  2. sorting_algorithms: beta0p1_twr2 and beta0p1_twr1, checking whether the alignment gain is worth the small answer-accuracy drop.
  3. multidigit_addition: compare beta0p1_rsw1 against a replacement-style weighted reconstruction to test whether the issue is task mismatch or under-applied weighting.
  4. Recover the missing sorting_algorithms + beta0p1_base run so future comparisons on that task have a true beta-matched baseline.

References

  • collection: path:.slurmkit/collections/synthetic-disentangled-Mar19.yaml
  • raw runs: path:.runs/synthetic_sequences/disentangled/synthetic-disentangled-Mar19
  • notebook: path:experiments/synthetic_sequences/disentangled/analysis/notebooks/03_mar19_analysis.ipynb
  • committed aggregates: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/synthetic-disentangled-Mar19