
synthetic-pretrained-disentangled-from-entangled-v2 Results


Warning

This collection has a run-artifact integrity issue. The local run-dir naming does not encode beta_schedule or beta_warmup_steps, so some completed beta0p1_const_pv0 and beta0p1_lin30_pv0 jobs collide into the same run directory and overwrite one another locally. In addition, run_manifest.json stores resolved warmup as an absolute step count rather than the original fractional setting, which initially masked several warmup runs during local analysis. For this collection, W&B should be treated as the source of truth for collection-level analysis. The local figures below are useful for partial inspection, but the collection should be superseded by a new run with corrected run naming.
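One way to avoid this class of collision is to derive the run-directory name from every loss hyperparameter that distinguishes two cells, including the schedule and the warmup setting. The sketch below is a minimal illustration, not the project's actual naming code: `LossSetting`, `legacy_name`, `full_name`, and `find_collisions` are hypothetical, and the legacy format is only an assumed stand-in for a scheme that omits the schedule and warmup. Reading `lin30` as a linear warmup over 30% of training is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LossSetting:
    """Hypothetical container for the loss hyperparameters of one cell."""
    beta: float
    beta_schedule: str       # "const" or "lin" (assumed values)
    beta_warmup_frac: float  # fraction of training, e.g. 0.3 (assumed meaning of "lin30")
    pv_weight: float


def _tag(x: float) -> str:
    # 0.1 -> "0p1", 0.05 -> "0p05", 0.0 -> "0"
    return f"{x:g}".replace(".", "p")


def legacy_name(task: str, s: LossSetting) -> str:
    # Assumed collision-prone form: schedule and warmup are not encoded.
    return f"{task}_beta{_tag(s.beta)}_pv{_tag(s.pv_weight)}"


def full_name(task: str, s: LossSetting) -> str:
    # Encode the schedule, and the warmup as a fraction rather than a step count.
    if s.beta_schedule == "const":
        sched = "const"
    else:
        sched = f"{s.beta_schedule}{round(s.beta_warmup_frac * 100)}"
    return f"{task}_beta{_tag(s.beta)}_{sched}_pv{_tag(s.pv_weight)}"


def find_collisions(names):
    """Return {name: [indices]} for every name that is used more than once."""
    seen = defaultdict(list)
    for i, name in enumerate(names):
        seen[name].append(i)
    return {n: idx for n, idx in seen.items() if len(idx) > 1}


settings = [
    LossSetting(0.1, "const", 0.0, 0.0),
    LossSetting(0.1, "lin", 0.3, 0.0),
]
legacy = [legacy_name("list_summation", s) for s in settings]
full = [full_name("list_summation", s) for s in settings]

print(find_collisions(legacy))  # both settings map to one directory name
print(find_collisions(full))    # no collisions once the schedule is encoded
```

Running a check like `find_collisions` over all planned cells before submission would flag an ambiguous naming scheme before any run overwrites another.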

Collection State

Current partial-collection snapshot from:

  • experiments/synthetic_sequences/disentangled/analysis/notebooks/13_apr15_ss_from_entangled_v2_analysis.ipynb
  • experiments/synthetic_sequences/disentangled/analysis/aggregates/synthetic-pretrained-disentangled-from-entangled-v2/

Current status table:

  • total planned cells: 30
  • COMPLETED_READY: 17
  • COMPLETED_COLLIDED: 4
  • RUNNING: 4
  • FAILED: 5

This corresponds to 21 slurm-completed cells in total. Of those, 17 have locally analyzable artifact bundles and 4 are marked as completed-but-collided because the surviving local run directory cannot be trusted to represent the intended loss setting.

The quantitative summaries below use the 17 COMPLETED_READY cells with locally analyzable artifact bundles. The 4 COMPLETED_COLLIDED cells are intentionally excluded.
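The bookkeeping above can be checked with a few lines. This is only a sketch of the arithmetic: the status labels and counts are copied from the status table, while the variable names are illustrative.

```python
# Counts copied from the status table in this report.
status_counts = {
    "COMPLETED_READY": 17,
    "COMPLETED_COLLIDED": 4,
    "RUNNING": 4,
    "FAILED": 5,
}

# Every planned cell should carry exactly one status.
total_planned = sum(status_counts.values())

# Slurm-completed cells include the collided ones; only the ready ones are analyzed.
slurm_completed = status_counts["COMPLETED_READY"] + status_counts["COMPLETED_COLLIDED"]
analyzable = status_counts["COMPLETED_READY"]

print(total_planned, slurm_completed, analyzable)  # 30 21 17
```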

Current Quantitative Snapshot

From data/final_snapshot.csv:

  • analyzable completed runs: 17
  • mean sampled test final-answer accuracy: 0.9991
  • mean router-sampled probe accuracy: 0.5697
  • mean posterior-mu probe accuracy: 0.7268
  • mean baseline-normalized reconstruction loss: 0.6294
  • mean KL prior loss: 2.3367
  • mean posterior-variance diagnostic: 499.60
  • mean total loss: 0.9426

Per-cell final metrics:

| Task | Loss setting | Final accuracy | Router probe | Posterior probe | Norm recon | KL prior | Posterior variance | Total loss |
|---|---|---|---|---|---|---|---|---|
| base_conversion | beta0p02_const_pv0 | 0.9998 | 0.8248 | 1.0000 | 0.0452 | 1.4907 | 409.47 | 0.0750 |
| base_conversion | beta0p05_const_pv0 | 1.0000 | 0.7558 | 1.0000 | 0.0186 | 1.4898 | 386.25 | 0.0931 |
| base_conversion | beta0p1_lin30_pv0 | 1.0000 | 0.8878 | 1.0000 | 0.0302 | 1.6307 | 4.22 | 0.1933 |
| grid_pathfinding | beta0p02_const_pv0 | 1.0000 | 0.4709 | 0.4998 | 1.0603 | 0.0048 | 453.35 | 1.0604 |
| grid_pathfinding | beta0p05_const_pv0 | 1.0000 | 0.4316 | 0.5063 | 1.0604 | 0.0046 | 440.54 | 1.0606 |
| grid_pathfinding | beta0p05_lin30_pv0p01 | 0.9989 | 0.4417 | 0.4682 | 1.0599 | 5.8260 | 378.86 | 5.1398 |
| grid_pathfinding | beta0p1_lin30_pv0 | 0.9998 | 0.4338 | 0.4976 | 1.1010 | 3.1025 | 1594.63 | 1.4113 |
| linear_equation_solving | beta0p02_const_pv0 | 1.0000 | 0.8115 | 1.0000 | 0.0049 | 1.5411 | 2.35 | 0.0357 |
| linear_equation_solving | beta0p05_const_pv0 | 1.0000 | 0.4292 | 0.8242 | 0.9715 | 0.0049 | 1987.78 | 0.9718 |
| linear_equation_solving | beta0p05_lin30_pv0p01 | 1.0000 | 0.7931 | 0.9169 | 0.0231 | 1.5096 | 0.24 | 0.1010 |
| linear_equation_solving | beta0p1_const_pv0 | 1.0000 | 0.4340 | 0.8251 | 0.9711 | 0.0129 | 1599.67 | 0.9724 |
| list_summation | beta0p02_const_pv0 | 1.0000 | 0.5926 | 0.6967 | 0.4410 | 0.9434 | 9.84 | 0.4599 |
| list_summation | beta0p05_const_pv0 | 1.0000 | 0.3860 | 0.5022 | 0.9658 | 0.0046 | 414.59 | 0.9661 |
| list_summation | beta0p05_lin30_pv0p01 | 1.0000 | 0.3986 | 0.4062 | 0.9650 | 0.0657 | 0.27 | 0.9710 |
| list_summation | beta0p1_lin30_pv0 | 1.0000 | 0.3857 | 0.4433 | 0.9659 | 0.0039 | 87.06 | 0.9663 |
| multidigit_addition | beta0p02_const_pv0 | 0.9854 | 0.4762 | 0.7938 | 1.0136 | 19.8183 | 721.92 | 1.4100 |
| multidigit_addition | beta0p05_lin30_pv0p01 | 1.0000 | 0.7309 | 0.9759 | 0.0023 | 2.2702 | 2.09 | 0.1367 |
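
As a consistency check on the per-cell table, the reported total loss appears to equal the normalized reconstruction loss, plus the KL prior weighted by beta, plus the posterior-variance diagnostic weighted by pv, with both weights read off the loss-setting label. That reading of the labels (for example `beta0p05` as 0.05 and `pv0p01` as 0.01) and the decomposition itself are inferences from the tabulated numbers, not definitions stated in this report. The check uses final values only, so it says nothing about the objective during warmup. The sketch below parses the labels, tests the relation on the 17 table rows, and recomputes two of the summary means. `parse_weights` and `max_decomposition_gap` are hypothetical helpers.

```python
import re

# Rows copied from the per-cell table (task column omitted):
# (loss setting, final acc, router probe, norm recon, KL prior, posterior variance, total loss)
ROWS = [
    ("beta0p02_const_pv0", 0.9998, 0.8248, 0.0452, 1.4907, 409.47, 0.0750),
    ("beta0p05_const_pv0", 1.0000, 0.7558, 0.0186, 1.4898, 386.25, 0.0931),
    ("beta0p1_lin30_pv0", 1.0000, 0.8878, 0.0302, 1.6307, 4.22, 0.1933),
    ("beta0p02_const_pv0", 1.0000, 0.4709, 1.0603, 0.0048, 453.35, 1.0604),
    ("beta0p05_const_pv0", 1.0000, 0.4316, 1.0604, 0.0046, 440.54, 1.0606),
    ("beta0p05_lin30_pv0p01", 0.9989, 0.4417, 1.0599, 5.8260, 378.86, 5.1398),
    ("beta0p1_lin30_pv0", 0.9998, 0.4338, 1.1010, 3.1025, 1594.63, 1.4113),
    ("beta0p02_const_pv0", 1.0000, 0.8115, 0.0049, 1.5411, 2.35, 0.0357),
    ("beta0p05_const_pv0", 1.0000, 0.4292, 0.9715, 0.0049, 1987.78, 0.9718),
    ("beta0p05_lin30_pv0p01", 1.0000, 0.7931, 0.0231, 1.5096, 0.24, 0.1010),
    ("beta0p1_const_pv0", 1.0000, 0.4340, 0.9711, 0.0129, 1599.67, 0.9724),
    ("beta0p02_const_pv0", 1.0000, 0.5926, 0.4410, 0.9434, 9.84, 0.4599),
    ("beta0p05_const_pv0", 1.0000, 0.3860, 0.9658, 0.0046, 414.59, 0.9661),
    ("beta0p05_lin30_pv0p01", 1.0000, 0.3986, 0.9650, 0.0657, 0.27, 0.9710),
    ("beta0p1_lin30_pv0", 1.0000, 0.3857, 0.9659, 0.0039, 87.06, 0.9663),
    ("beta0p02_const_pv0", 0.9854, 0.4762, 1.0136, 19.8183, 721.92, 1.4100),
    ("beta0p05_lin30_pv0p01", 1.0000, 0.7309, 0.0023, 2.2702, 2.09, 0.1367),
]


def parse_weights(setting: str) -> tuple[float, float]:
    """Read (beta, pv) from a label like 'beta0p05_lin30_pv0p01' (assumed encoding)."""
    m = re.fullmatch(r"beta(\d+(?:p\d+)?)_[a-z0-9]+_pv(\d+(?:p\d+)?)", setting)
    if m is None:
        raise ValueError(f"unrecognized loss setting: {setting}")
    beta, pv = (float(g.replace("p", ".")) for g in m.groups())
    return beta, pv


def max_decomposition_gap(rows) -> float:
    """Largest |recon + beta*KL + pv*variance - total| over the given rows."""
    gaps = []
    for setting, _acc, _router, recon, kl, pv_var, total in rows:
        beta, pv = parse_weights(setting)
        gaps.append(abs(recon + beta * kl + pv * pv_var - total))
    return max(gaps)


mean_acc = sum(r[1] for r in ROWS) / len(ROWS)
mean_router = sum(r[2] for r in ROWS) / len(ROWS)

print(f"max decomposition gap: {max_decomposition_gap(ROWS):.5f}")
print(f"mean final accuracy: {mean_acc:.4f}, mean router probe: {mean_router:.4f}")
```

Under this reading the gap stays within the rounding of the table entries for every row, and the recomputed means agree with the snapshot values of 0.9991 and 0.5697.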

Figures

Collection status by task and loss setting. Caption: Partial-collection status over the 6 x 5 task-by-loss-setting design, now distinguishing completed-ready cells from completed-but-collided cells. In the current snapshot, 21 cells are completed in slurm, 17 are locally analyzable, and 4 are excluded because the surviving local run directory is ambiguous.

Final metric heatmaps by task and loss setting. Caption: Final-state heatmaps for router probe, posterior probe, baseline-normalized reconstruction, and KL prior over the currently analyzable cells.

Alignment dynamics by task and loss setting. Caption: Validation final-answer accuracy, router-sampled probe accuracy, and posterior-mu probe accuracy over training.

Objective-term dynamics by task and loss setting. Caption: Baseline-normalized reconstruction, KL prior, and total objective over training.

Auxiliary diagnostic dynamics by task and loss setting. Caption: Posterior-variance and router-marginal-KL diagnostics over training.

Final normalized reconstruction versus router probe. Caption: Final baseline-normalized reconstruction loss versus final router-sampled probe accuracy across the currently analyzable cells.

Final router probe versus posterior probe. Caption: Final router-sampled probe accuracy versus final posterior-mu probe accuracy across the currently analyzable cells.

Delta versus v1 heatmaps. Caption: Per-task deltas relative to synthetic-pretrained-disentangled-from-entangled-v1 for router probe, posterior probe, and baseline-normalized reconstruction, using the currently analyzable cells.
