synthetic-pretrained-disentangled-from-entangled-v2 Results
This collection has a run-artifact integrity issue.
The local run-dir naming does not encode beta_schedule or beta_warmup_steps, so some completed beta0p1_const_pv0 and beta0p1_lin30_pv0 jobs collide into the same run directory and overwrite one another locally. In addition, run_manifest.json stores resolved warmup as an absolute step count rather than the original fractional setting, which initially masked several warmup runs during local analysis.
For this collection, W&B should be treated as the source of truth for collection-level analysis. The local figures below are useful for partial inspection, but the collection should be superseded by a new run with corrected run naming.
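One way a corrected naming scheme could avoid the collision is to put every loss-setting field that distinguishes two jobs into the directory name, and to keep the warmup as a fraction rather than a resolved step count. The sketch below is illustrative only: `LossSetting`, `run_dir_name`, `warmup_fraction`, the `warm` tag, and the `total_steps` argument are hypothetical names, not part of the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossSetting:
    """Hypothetical container for the fields that distinguish two jobs."""
    beta: float
    beta_schedule: str            # e.g. "const" or "linear"
    beta_warmup_fraction: float   # fraction of training, not absolute steps
    pv_weight: float


def _tag(x: float) -> str:
    """Format a float for use in a path component, e.g. 0.05 -> '0p05'."""
    return f"{x:g}".replace(".", "p")


def run_dir_name(task: str, s: LossSetting) -> str:
    """Build a directory name that encodes the schedule and the warmup."""
    return (
        f"{task}__beta{_tag(s.beta)}_{s.beta_schedule}"
        f"_warm{_tag(s.beta_warmup_fraction)}_pv{_tag(s.pv_weight)}"
    )


def warmup_fraction(warmup_steps: int, total_steps: int) -> float:
    """Recover the fractional warmup from a resolved absolute step count."""
    return warmup_steps / total_steps


const_run = run_dir_name("base_conversion", LossSetting(0.1, "const", 0.0, 0.0))
warm_run = run_dir_name("base_conversion", LossSetting(0.1, "linear", 0.3, 0.0))
print(const_run)
print(warm_run)
```

With names built this way, a constant-beta job and a warmup job at the same beta resolve to different directories, so neither can overwrite the other.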
Collection State
Current partial-collection snapshot from:
- experiments/synthetic_sequences/disentangled/analysis/notebooks/13_apr15_ss_from_entangled_v2_analysis.ipynb
- experiments/synthetic_sequences/disentangled/analysis/aggregates/synthetic-pretrained-disentangled-from-entangled-v2/
Current status table:
- total planned cells: 30
- COMPLETED_READY: 17
- COMPLETED_COLLIDED: 4
- RUNNING: 4
- FAILED: 5
This corresponds to 21 Slurm-completed cells in total. Of those, 17 have locally analyzable artifact bundles and 4 are marked as completed-but-collided because the surviving local run directory cannot be trusted to represent the intended loss setting.
The quantitative summaries below use the 17 COMPLETED_READY cells with locally analyzable artifact bundles. The 4 COMPLETED_COLLIDED cells are intentionally excluded.
Current Quantitative Snapshot
From data/final_snapshot.csv:
- analyzable completed runs: 17
- mean sampled test final-answer accuracy: 0.9991
- mean router-sampled probe accuracy: 0.5697
- mean posterior-mu probe accuracy: 0.7268
- mean baseline-normalized reconstruction loss: 0.6294
- mean KL prior loss: 2.3367
- mean posterior-variance diagnostic: 499.60
- mean total loss: 0.9426
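A minimal sketch of how summary means of this kind could be recomputed while excluding collided cells. The column names, the presence of a `status` column, and the helper `column_means` are assumptions, since the actual header of data/final_snapshot.csv is not shown here. The two base_conversion rows reuse values from the per-cell table; the `placeholder_task` row is made up purely to demonstrate the exclusion.

```python
import csv
import io
from statistics import mean

# Illustrative stand-in for data/final_snapshot.csv (assumed column names).
# The placeholder_task row is synthetic and exists only to show the filter.
SAMPLE_CSV = """\
task,loss_setting,status,router_probe_acc,posterior_probe_acc
base_conversion,beta0p02_const_pv0,COMPLETED_READY,0.8248,1.0000
base_conversion,beta0p05_const_pv0,COMPLETED_READY,0.7558,1.0000
placeholder_task,beta0p1_const_pv0,COMPLETED_COLLIDED,0.5000,0.5000
"""


def column_means(csv_text, metrics, keep_status="COMPLETED_READY"):
    """Average each metric over rows whose status equals keep_status."""
    rows = [
        row
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["status"] == keep_status
    ]
    means = {m: mean(float(row[m]) for row in rows) for m in metrics}
    return means, len(rows)


means, n_rows = column_means(SAMPLE_CSV, ["router_probe_acc", "posterior_probe_acc"])
print(n_rows, means)
```

Filtering on status before averaging keeps the summary tied to the same set of cells that the per-cell table reports.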
Per-cell final metrics:
| Task | Loss setting | Final accuracy | Router probe | Posterior probe | Norm recon | KL prior | Posterior variance | Total loss |
|---|---|---|---|---|---|---|---|---|
| base_conversion | beta0p02_const_pv0 | 0.9998 | 0.8248 | 1.0000 | 0.0452 | 1.4907 | 409.47 | 0.0750 |
| base_conversion | beta0p05_const_pv0 | 1.0000 | 0.7558 | 1.0000 | 0.0186 | 1.4898 | 386.25 | 0.0931 |
| base_conversion | beta0p1_lin30_pv0 | 1.0000 | 0.8878 | 1.0000 | 0.0302 | 1.6307 | 4.22 | 0.1933 |
| grid_pathfinding | beta0p02_const_pv0 | 1.0000 | 0.4709 | 0.4998 | 1.0603 | 0.0048 | 453.35 | 1.0604 |
| grid_pathfinding | beta0p05_const_pv0 | 1.0000 | 0.4316 | 0.5063 | 1.0604 | 0.0046 | 440.54 | 1.0606 |
| grid_pathfinding | beta0p05_lin30_pv0p01 | 0.9989 | 0.4417 | 0.4682 | 1.0599 | 5.8260 | 378.86 | 5.1398 |
| grid_pathfinding | beta0p1_lin30_pv0 | 0.9998 | 0.4338 | 0.4976 | 1.1010 | 3.1025 | 1594.63 | 1.4113 |
| linear_equation_solving | beta0p02_const_pv0 | 1.0000 | 0.8115 | 1.0000 | 0.0049 | 1.5411 | 2.35 | 0.0357 |
| linear_equation_solving | beta0p05_const_pv0 | 1.0000 | 0.4292 | 0.8242 | 0.9715 | 0.0049 | 1987.78 | 0.9718 |
| linear_equation_solving | beta0p05_lin30_pv0p01 | 1.0000 | 0.7931 | 0.9169 | 0.0231 | 1.5096 | 0.24 | 0.1010 |
| linear_equation_solving | beta0p1_const_pv0 | 1.0000 | 0.4340 | 0.8251 | 0.9711 | 0.0129 | 1599.67 | 0.9724 |
| list_summation | beta0p02_const_pv0 | 1.0000 | 0.5926 | 0.6967 | 0.4410 | 0.9434 | 9.84 | 0.4599 |
| list_summation | beta0p05_const_pv0 | 1.0000 | 0.3860 | 0.5022 | 0.9658 | 0.0046 | 414.59 | 0.9661 |
| list_summation | beta0p05_lin30_pv0p01 | 1.0000 | 0.3986 | 0.4062 | 0.9650 | 0.0657 | 0.27 | 0.9710 |
| list_summation | beta0p1_lin30_pv0 | 1.0000 | 0.3857 | 0.4433 | 0.9659 | 0.0039 | 87.06 | 0.9663 |
| multidigit_addition | beta0p02_const_pv0 | 0.9854 | 0.4762 | 0.7938 | 1.0136 | 19.8183 | 721.92 | 1.4100 |
| multidigit_addition | beta0p05_lin30_pv0p01 | 1.0000 | 0.7309 | 0.9759 | 0.0023 | 2.2702 | 2.09 | 0.1367 |
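The tabulated totals appear numerically consistent with total = norm recon + beta x KL prior + pv x posterior variance, where beta and pv are read off the loss-setting label (for example, `beta0p05_lin30_pv0p01` read as beta = 0.05 and pv = 0.01). This is an inference from the table values, not a definition taken from the training code, and the label parsing is likewise an assumption. A minimal sketch of the check, with hypothetical helper names:

```python
import re

# Assumed label layout: beta<value>_<schedule>_pv<value>, with "p" as the decimal point.
_SETTING = re.compile(r"beta(?P<beta>[0-9p]+)_(?P<schedule>[a-z0-9]+)_pv(?P<pv>[0-9p]+)")


def parse_setting(label):
    """Split a loss-setting label into (beta, schedule, pv_weight)."""
    match = _SETTING.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized loss-setting label: {label!r}")
    beta = float(match["beta"].replace("p", "."))
    pv_weight = float(match["pv"].replace("p", "."))
    return beta, match["schedule"], pv_weight


def implied_total(label, norm_recon, kl_prior, posterior_variance):
    """Total loss implied by the assumed decomposition."""
    beta, _, pv_weight = parse_setting(label)
    return norm_recon + beta * kl_prior + pv_weight * posterior_variance


# Two rows copied from the table: (label, norm recon, KL prior, variance, reported total).
rows = [
    ("beta0p05_lin30_pv0p01", 1.0599, 5.8260, 378.86, 5.1398),  # grid_pathfinding
    ("beta0p02_const_pv0", 0.0452, 1.4907, 409.47, 0.0750),     # base_conversion
]
for label, recon, kl, var, reported in rows:
    gap = abs(implied_total(label, recon, kl, var) - reported)
    print(label, f"gap={gap:.5f}")
```

If the decomposition holds, a large total such as the 5.1398 in the grid_pathfinding row is driven mostly by the posterior-variance term rather than by reconstruction or KL.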
Figures
Caption: Partial-collection status over the 6 x 5 task-by-loss-setting design, now distinguishing completed-ready cells from completed-but-collided cells. In the current snapshot, 21 cells are completed in Slurm, 17 are locally analyzable, and 4 are excluded because the surviving local run directory is ambiguous.
Caption: Final-state heatmaps for router probe, posterior probe, baseline-normalized reconstruction, and KL prior over the currently analyzable cells.
Caption: Validation final-answer accuracy, router-sampled probe accuracy, and posterior-mu probe accuracy over training.
Caption: Baseline-normalized reconstruction, KL prior, and total objective over training.
Caption: Posterior-variance and router-marginal-KL diagnostics over training.
Caption: Final baseline-normalized reconstruction loss versus final router-sampled probe accuracy across the currently analyzable cells.
Caption: Final router-sampled probe accuracy versus final posterior-mu probe accuracy across the currently analyzable cells.
Caption: Per-task deltas relative to synthetic-pretrained-disentangled-from-entangled-v1 for router probe, posterior probe, and baseline-normalized reconstruction, using the currently analyzable cells.