synthetic-pretrained-disentangled-from-entangled-v2 Results

Warning

This collection has a run-artifact integrity issue with two parts.

First, local run-directory names do not encode beta_schedule or beta_warmup_steps, so some completed beta0p1_const_pv0 and beta0p1_lin30_pv0 jobs resolve to the same run directory and overwrite one another locally.

Second, run_manifest.json stores the resolved warmup as an absolute step count rather than the original fractional setting, which initially masked several warmup runs during local analysis.

For this collection, treat W&B as the source of truth for collection-level analysis. The local figures below are useful for partial inspection, but the collection should be superseded by a new run with corrected run naming.

Collection State

Current partial-collection snapshot from:

experiments/synthetic_sequences/disentangled/analysis/notebooks/13_apr15_ss_from_entangled_v2_analysis.ipynb
experiments/synthetic_sequences/disentangled/analysis/aggregates/synthetic-pretrained-disentangled-from-entangled-v2/

Current status counts:

total planned cells: 30
COMPLETED_READY: 17
COMPLETED_COLLIDED: 4
RUNNING: 4
FAILED: 5

The two completed states together account for 21 Slurm-completed cells (17 + 4). Of those, 17 have locally analyzable artifact bundles, and 4 are marked completed-but-collided because the surviving local run directory cannot be trusted to represent the intended loss setting.

The quantitative summaries below use only the 17 COMPLETED_READY cells. The 4 COMPLETED_COLLIDED cells are intentionally excluded.
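The bookkeeping in this section can be checked with a few lines of Python. This is only a sanity-check sketch: the counts are copied from the status list above, and the 6 x 5 design size is the one given in the status-figure caption.

```python
# Status counts copied from this report.
status_counts = {
    "COMPLETED_READY": 17,
    "COMPLETED_COLLIDED": 4,
    "RUNNING": 4,
    "FAILED": 5,
}

planned_cells = 6 * 5  # 6 tasks x 5 loss settings

# Cells that finished in the scheduler, whether or not their artifacts survived.
completed = status_counts["COMPLETED_READY"] + status_counts["COMPLETED_COLLIDED"]
# Cells that feed the quantitative summaries.
analyzable = status_counts["COMPLETED_READY"]
# Every planned cell should be in exactly one state.
accounted_for = sum(status_counts.values())

print(completed, analyzable, accounted_for == planned_cells)  # 21 17 True
```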

Current Quantitative Snapshot

From data/final_snapshot.csv:

analyzable completed runs: 17
mean sampled test final-answer accuracy: 0.9991
mean router-sampled probe accuracy: 0.5697
mean posterior-mu probe accuracy: 0.7268
mean baseline-normalized reconstruction loss: 0.6294
mean KL prior loss: 2.3367
mean posterior-variance diagnostic: 499.60
mean total loss: 0.9426
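The collection-level means can be reproduced from the per-cell values. The sketch below does this for the two probe metrics, using values copied by hand from the per-cell table in this report rather than read from data/final_snapshot.csv, because the CSV column names are not shown here.

```python
from statistics import mean

# Router-sampled probe accuracy for the 17 analyzable cells, in table order.
router_probe = [
    0.8248, 0.7558, 0.8878,          # base_conversion
    0.4709, 0.4316, 0.4417, 0.4338,  # grid_pathfinding
    0.8115, 0.4292, 0.7931, 0.4340,  # linear_equation_solving
    0.5926, 0.3860, 0.3986, 0.3857,  # list_summation
    0.4762, 0.7309,                  # multidigit_addition
]

# Posterior-mu probe accuracy for the same cells, in the same order.
posterior_probe = [
    1.0000, 1.0000, 1.0000,
    0.4998, 0.5063, 0.4682, 0.4976,
    1.0000, 0.8242, 0.9169, 0.8251,
    0.6967, 0.5022, 0.4062, 0.4433,
    0.7938, 0.9759,
]

router_mean = round(mean(router_probe), 4)
posterior_mean = round(mean(posterior_probe), 4)

print(len(router_probe), router_mean, posterior_mean)  # 17 0.5697 0.7268
```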

Per-cell final metrics:

Task	Loss setting	Final accuracy	Router probe	Posterior probe	Norm recon	KL prior	Posterior variance	Total loss
`base_conversion`	`beta0p02_const_pv0`	`0.9998`	`0.8248`	`1.0000`	`0.0452`	`1.4907`	`409.47`	`0.0750`
`base_conversion`	`beta0p05_const_pv0`	`1.0000`	`0.7558`	`1.0000`	`0.0186`	`1.4898`	`386.25`	`0.0931`
`base_conversion`	`beta0p1_lin30_pv0`	`1.0000`	`0.8878`	`1.0000`	`0.0302`	`1.6307`	`4.22`	`0.1933`
`grid_pathfinding`	`beta0p02_const_pv0`	`1.0000`	`0.4709`	`0.4998`	`1.0603`	`0.0048`	`453.35`	`1.0604`
`grid_pathfinding`	`beta0p05_const_pv0`	`1.0000`	`0.4316`	`0.5063`	`1.0604`	`0.0046`	`440.54`	`1.0606`
`grid_pathfinding`	`beta0p05_lin30_pv0p01`	`0.9989`	`0.4417`	`0.4682`	`1.0599`	`5.8260`	`378.86`	`5.1398`
`grid_pathfinding`	`beta0p1_lin30_pv0`	`0.9998`	`0.4338`	`0.4976`	`1.1010`	`3.1025`	`1594.63`	`1.4113`
`linear_equation_solving`	`beta0p02_const_pv0`	`1.0000`	`0.8115`	`1.0000`	`0.0049`	`1.5411`	`2.35`	`0.0357`
`linear_equation_solving`	`beta0p05_const_pv0`	`1.0000`	`0.4292`	`0.8242`	`0.9715`	`0.0049`	`1987.78`	`0.9718`
`linear_equation_solving`	`beta0p05_lin30_pv0p01`	`1.0000`	`0.7931`	`0.9169`	`0.0231`	`1.5096`	`0.24`	`0.1010`
`linear_equation_solving`	`beta0p1_const_pv0`	`1.0000`	`0.4340`	`0.8251`	`0.9711`	`0.0129`	`1599.67`	`0.9724`
`list_summation`	`beta0p02_const_pv0`	`1.0000`	`0.5926`	`0.6967`	`0.4410`	`0.9434`	`9.84`	`0.4599`
`list_summation`	`beta0p05_const_pv0`	`1.0000`	`0.3860`	`0.5022`	`0.9658`	`0.0046`	`414.59`	`0.9661`
`list_summation`	`beta0p05_lin30_pv0p01`	`1.0000`	`0.3986`	`0.4062`	`0.9650`	`0.0657`	`0.27`	`0.9710`
`list_summation`	`beta0p1_lin30_pv0`	`1.0000`	`0.3857`	`0.4433`	`0.9659`	`0.0039`	`87.06`	`0.9663`
`multidigit_addition`	`beta0p02_const_pv0`	`0.9854`	`0.4762`	`0.7938`	`1.0136`	`19.8183`	`721.92`	`1.4100`
`multidigit_addition`	`beta0p05_lin30_pv0p01`	`1.0000`	`0.7309`	`0.9759`	`0.0023`	`2.2702`	`2.09`	`0.1367`
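The totals in the table appear numerically consistent with a weighted sum of the other columns: normalized reconstruction, plus beta times KL prior, plus the pv weight times posterior variance. This decomposition is inferred from the numbers and is not stated by the training code or this report. The sketch below checks it on a few rows, assuming a label such as beta0p05_lin30_pv0p01 means beta = 0.05 and pv weight = 0.01. `parse_label` and `residual` are hypothetical helpers.

```python
import re

# Assumed label layout: beta<value>_<schedule>_pv<value>, with "p" as decimal point.
LABEL = re.compile(r"beta(?P<beta>[0-9p]+)_(?P<schedule>const|lin\d+)_pv(?P<pv>[0-9p]+)$")


def parse_label(label):
    """Split a loss-setting label into (beta, schedule, pv_weight)."""
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognized loss-setting label: {label}")

    def to_float(s):
        return float(s.replace("p", "."))

    return to_float(m["beta"]), m["schedule"], to_float(m["pv"])


# (task, label, norm_recon, kl_prior, posterior_variance, total_loss), copied from the table.
rows = [
    ("base_conversion", "beta0p02_const_pv0", 0.0452, 1.4907, 409.47, 0.0750),
    ("base_conversion", "beta0p1_lin30_pv0", 0.0302, 1.6307, 4.22, 0.1933),
    ("grid_pathfinding", "beta0p05_lin30_pv0p01", 1.0599, 5.8260, 378.86, 5.1398),
    ("linear_equation_solving", "beta0p05_lin30_pv0p01", 0.0231, 1.5096, 0.24, 0.1010),
    ("multidigit_addition", "beta0p02_const_pv0", 1.0136, 19.8183, 721.92, 1.4100),
]


def residual(row):
    """Reported total minus the total reconstructed from the other columns."""
    _, label, recon, kl, variance, total = row
    beta, _, pv = parse_label(label)
    return total - (recon + beta * kl + pv * variance)


# Largest mismatch across the sampled rows; rounding in the table limits precision.
max_abs_residual = max(abs(residual(r)) for r in rows)
print(max_abs_residual < 1e-3)
```

If the inferred decomposition holds, the large total for grid_pathfinding under beta0p05_lin30_pv0p01 is driven almost entirely by the posterior-variance term (0.01 x 378.86).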

Figures

Collection status by task and loss setting. Caption: Partial-collection status over the 6 x 5 task-by-loss-setting design, now distinguishing completed-ready cells from completed-but-collided cells. In the current snapshot, 21 cells are completed in Slurm, 17 are locally analyzable, and 4 are excluded because the surviving local run directory is ambiguous.

Final metric heatmaps by task and loss setting. Caption: Final-state heatmaps for router probe, posterior probe, baseline-normalized reconstruction, and KL prior over the currently analyzable cells.

Alignment dynamics by task and loss setting. Caption: Validation final-answer accuracy, router-sampled probe accuracy, and posterior-mu probe accuracy over training.

Objective-term dynamics by task and loss setting. Caption: Baseline-normalized reconstruction, KL prior, and total objective over training.

Auxiliary diagnostic dynamics by task and loss setting. Caption: Posterior-variance and router-marginal-KL diagnostics over training.

Final normalized reconstruction versus router probe. Caption: Final baseline-normalized reconstruction loss versus final router-sampled probe accuracy across the currently analyzable cells.

Final router probe versus posterior probe. Caption: Final router-sampled probe accuracy versus final posterior-mu probe accuracy across the currently analyzable cells.

Delta versus v1 heatmaps. Caption: Per-task deltas relative to synthetic-pretrained-disentangled-from-entangled-v1 for router probe, posterior probe, and baseline-normalized reconstruction, using the currently analyzable cells.