synthetic-pretrained-disentangled-from-entangled-v1 Results

Final Snapshot

From <experiments/synthetic_sequences/disentangled/analysis/aggregates/synthetic-pretrained-disentangled-from-entangled-v1/data/final_snapshot.csv>:

completed runs: 5
failed runs: 1
mean sampled test final-answer accuracy: 1.0000
mean router-sampled probe accuracy: 0.5001
mean posterior-mu probe accuracy: 0.6491

Per-task final metrics (one row per completed run):

Task	Final accuracy	Router probe accuracy	Posterior-mu probe accuracy	Normalized reconstruction	KL prior	Posterior variance	Router marginal KL	Total loss
`list_summation`	`1.0000`	`0.3906`	`0.4318`	`0.9652`	`0.0058`	`409.50`	`2433.84`	`0.9658`
`grid_pathfinding`	`1.0000`	`0.4351`	`0.4669`	`1.0599`	`0.0047`	`452.51`	`3237.50`	`1.0603`
`linear_equation_solving`	`1.0000`	`0.4287`	`0.7169`	`0.9719`	`0.0019`	`1592.67`	`14631.87`	`0.9721`
`base_conversion`	`1.0000`	`0.7690`	`1.0000`	`0.0288`	`1.4288`	`387.87`	`2761.59`	`0.1717`
`multidigit_addition`	`1.0000`	`0.4772`	`0.6300`	`1.0241`	`2.6638`	`2236.91`	`20304.72`	`1.2905`
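
The snapshot means can be cross-checked against the per-task rows. The sketch below transcribes the table values by hand and recomputes the two probe means. It also tests one assumption that the source does not state: that the total loss is the normalized reconstruction term plus the KL prior term scaled by a fixed weight. The weight `KL_WEIGHT = 0.1` is inferred from the table, not taken from the experiment configuration, and should be treated as hypothetical.

```python
from statistics import mean

# Values transcribed from the per-task table:
# task -> (router probe, posterior probe, normalized recon, KL prior, total loss)
ROWS = {
    "list_summation":          (0.3906, 0.4318, 0.9652, 0.0058, 0.9658),
    "grid_pathfinding":        (0.4351, 0.4669, 1.0599, 0.0047, 1.0603),
    "linear_equation_solving": (0.4287, 0.7169, 0.9719, 0.0019, 0.9721),
    "base_conversion":         (0.7690, 1.0000, 0.0288, 1.4288, 0.1717),
    "multidigit_addition":     (0.4772, 0.6300, 1.0241, 2.6638, 1.2905),
}


def column_mean(index):
    """Unweighted mean of one column across tasks, rounded to 4 decimals."""
    return round(mean(row[index] for row in ROWS.values()), 4)


router_mean = column_mean(0)
posterior_mean = column_mean(1)

# ASSUMPTION (inferred from the table, not stated in the source):
# total loss = normalized recon + KL_WEIGHT * KL prior
KL_WEIGHT = 0.1
max_gap = max(
    abs(recon + KL_WEIGHT * kl - total)
    for (_, _, recon, kl, total) in ROWS.values()
)

print(f"router probe mean:    {router_mean:.4f}")
print(f"posterior probe mean: {posterior_mean:.4f}")
print(f"largest |recon + {KL_WEIGHT}*KL - total|: {max_gap:.5f}")
```

If the assumed weight is right, the largest gap should be no larger than the rounding error of the four-decimal table entries. A larger gap would indicate a different weighting or additional objective terms.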

Figures

Validation accuracy and alignment dynamics by task. Caption: Validation final-answer accuracy, router probe accuracy, and posterior-mu probe accuracy by task.

Optimized objective terms by task. Caption: Baseline-normalized reconstruction, KL prior, and total objective by task.

Unweighted diagnostics by task. Caption: Posterior-variance and router-marginal-KL diagnostics by task.

Standardized comparison of objective versus unweighted diagnostics. Caption: Standardized trajectories comparing optimized objective terms against unweighted diagnostics.
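
The source does not say how the trajectories are standardized. One common reading is a per-series z-score, which puts quantities on very different scales (for example KL prior near 1 and posterior variance in the hundreds) on a shared axis. The sketch below shows that reading only. The function name `standardize` and the toy series are hypothetical placeholders, not values from the runs.

```python
from statistics import mean, pstdev


def standardize(series):
    """Z-score a series: subtract its mean, divide by its population std.

    A constant series has zero spread, so it maps to all zeros
    instead of dividing by zero.
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Toy placeholder trajectories (NOT results from the experiment).
toy_kl_prior = [2.0, 1.0, 0.5, 0.5]
toy_posterior_variance = [100.0, 400.0, 900.0, 1600.0]

z_kl = standardize(toy_kl_prior)
z_var = standardize(toy_posterior_variance)

for step, (a, b) in enumerate(zip(z_kl, z_var)):
    print(f"step {step}: kl_prior z={a:+.3f}  posterior_variance z={b:+.3f}")
```

After standardization each series has mean 0 and unit spread, so only the shape of the trajectory is compared, not its magnitude.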

Final accuracy versus router probe across tasks. Caption: Final answer accuracy versus router probe across completed tasks.

Normalized reconstruction versus router probe across tasks. Caption: Final baseline-normalized reconstruction loss versus router probe across completed tasks, colored by posterior probe accuracy.
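
The two scatter plots can be summarized numerically with a Pearson correlation over the per-task final values. The sketch below is an illustration, not part of the original analysis. It uses values transcribed from the per-task table and a hypothetical helper `pearson`. With only five tasks, any coefficient is descriptive at best. Because final accuracy is 1.0000 for every task, it has no variance and its correlation with router probe is undefined, which the helper reports as `None`.

```python
from math import sqrt
from statistics import mean

# Per-task final values transcribed from the table, in table order.
final_accuracy = [1.0000, 1.0000, 1.0000, 1.0000, 1.0000]
router_probe = [0.3906, 0.4351, 0.4287, 0.7690, 0.4772]
norm_recon = [0.9652, 1.0599, 0.9719, 0.0288, 1.0241]


def pearson(xs, ys):
    """Pearson correlation, or None when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


r_acc = pearson(final_accuracy, router_probe)
r_recon = pearson(norm_recon, router_probe)

print("accuracy vs router probe:", r_acc)
print("normalized recon vs router probe:", r_recon)
```

The reconstruction coefficient is driven largely by `base_conversion`, the only task with both a low normalized reconstruction loss and a high router probe accuracy, so it should not be read as a trend across tasks.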