continuous-cvae-Mar20
Summary
This report covers the collection:
path: .slurmkit/collections/exp2_continuous_cvae-Mar20-full.yaml
Collection design: 6 tasks x 8 curated loss settings x 2 adaptation modes (full,
reinit) x 1 seed = 96 expected runs.
Observed completion:
- complete runs with run_manifest.json + metrics.json: 94/96
- missing settings:
  - base_conversion + full + beta0p5_base
  - base_conversion + reinit + beta0p5_base
Main high-level finding: Mar20 maintains very high final task quality while
strategy-control diagnostics vary substantially. In particular,
sampled_test_final/overall_final_answer_accuracy stays high (mean 0.9870, min
0.9094, max 1.0), while sampled_test_final/sampled_z/linear_probe_accuracy
spans a wider range (mean 0.4452, min 0.3660, max 0.5237).
Global Metrics Snapshot
Across the 94 completed runs:
- sampled_test_final/overall_final_answer_accuracy: 0.9094 to 1.0 (mean 0.9870)
- sampled_test_final/parse_success_rate: 0.9917 to 1.0 (mean 0.9995)
- sampled_test_final/sampled_z/linear_probe_accuracy: 0.3660 to 0.5237 (mean 0.4452)
- continuous_diag_test_final/posterior_mu/linear_probe_accuracy: 0.3737 to 0.9991 (mean 0.7038)
- sampled_test_final/unique_valid_strategy_rate: 0.6572 to 0.9980 (mean 0.9056)
- sampled_test_final/invalid_strategy_rate: 0.0 to 0.0556 (mean 0.0077)
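The min/mean/max snapshot above can be reproduced with a small helper over the per-run metric dicts. This is a minimal sketch, not the analysis notebook's code; the metric keys are assumed to match the flat metrics.json schema quoted in the snapshot, and the values here are toy numbers:

```python
import statistics

# Hypothetical flat metric dicts, one per completed run (keys assumed
# to match the collection's metrics.json schema; values are illustrative).
runs = [
    {"sampled_test_final/overall_final_answer_accuracy": 0.99,
     "sampled_test_final/sampled_z/linear_probe_accuracy": 0.45},
    {"sampled_test_final/overall_final_answer_accuracy": 0.97,
     "sampled_test_final/sampled_z/linear_probe_accuracy": 0.40},
]

def snapshot(runs, key):
    """Min/mean/max of one metric across runs that report it."""
    vals = [r[key] for r in runs if key in r]
    return min(vals), statistics.mean(vals), max(vals)

lo, avg, hi = snapshot(runs, "sampled_test_final/sampled_z/linear_probe_accuracy")
print(f"{lo:.4f} to {hi:.4f} (mean {avg:.4f})")
```

The same helper applied over the 94 real run dicts yields each line of the snapshot.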
By adaptation mode (means):
full generally outperforms reinit on this collection:
- sampled-z probe: 0.4486 vs 0.4417
- posterior-mu probe: 0.7576 vs 0.6500
- final answer accuracy: 0.9925 vs 0.9815
- invalid strategy rate: 0.0045 vs 0.0108
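The by-mode means can be computed with a plain group-by pass; a minimal sketch, assuming per-run records tagged with their adaptation mode (the numbers below are toy values, not the collection's):

```python
from collections import defaultdict
from statistics import mean

# Toy (adaptation_mode, sampled-z probe accuracy) rows; values illustrative only.
rows = [("full", 0.46), ("full", 0.44), ("reinit", 0.45), ("reinit", 0.43)]

# Group probe values by adaptation mode, then average each group.
by_mode = defaultdict(list)
for mode, probe in rows:
    by_mode[mode].append(probe)

means = {mode: round(mean(vals), 4) for mode, vals in by_mode.items()}
print(means)
```

Repeating this for each diagnostic metric gives the full/reinit comparison above.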
Key Figures
Figure 1 — Coverage / Completeness
Caption: This heatmap shows completed run counts per (task_name, adaptation_mode). Most cells have all 8 expected settings; only
base_conversion has 7 per mode. Conclusion: collection-level conclusions
are broadly reliable, with one missing setting pair concentrated in a single
task.
Figure 2 — Probe vs Final Accuracy
Caption: Each point is one run, faceted by task, plotting
sampled-z probe accuracy against final answer accuracy. Accuracy remains high in
most runs while probe varies visibly; correlation is modest (r ~= 0.302).
Conclusion: solving quality and strategy decodability are coupled only
weakly-to-moderately in Mar20.
Figure 3 — Posterior Probe vs Sampled Probe
Caption: This plots posterior-mu probe accuracy against sampled-z probe
accuracy per run. The relationship is moderate (r ~= 0.549), but many runs
show a positive gap (posterior minus sampled: mean +0.259, max +0.562).
Conclusion: strategies are often more linearly recoverable from posterior
means than from sampled latents used at generation time.
Figure 4 — Invalid Strategy Rate vs Probe
Caption: Scatter of invalid strategy rate against sampled-z probe accuracy.
Trend is negative (r ~= -0.353). Conclusion: better latent strategy
control tends to coincide with fewer invalid strategy attributions, though this
is not a strict monotonic relation.
Figure 5 — Heatmap: Sampled-z Probe by Setting
Caption: Task-by-loss-setting heatmaps for sampled-z probe, split by
adaptation mode. Highest probes cluster in multidigit_addition and
sorting_algorithms; list_summation is consistently lower. Conclusion:
probe quality is strongly task-dependent; no single setting dominates every task.
Figure 6 — Heatmap: Posterior-mu Probe by Setting
Caption: Same grid view for posterior-mu probe accuracy. Several settings
(e.g., beta0p1_twr1, beta0p1_rmk0p5) produce strong posterior probe values,
often higher than sampled-z probe for the same run family. Conclusion:
posterior representations can be highly separable even when sampled generation
latents are less semantically crisp.
Figure 7 — Heatmap: Unique-Valid Strategy Rate
Caption: Task-by-setting unique-valid rates show clear task effects:
grid_pathfinding remains the hardest (low unique-valid), while
linear_equation_solving and multidigit_addition are strong. Conclusion:
latent strategy control quality is heterogeneous across tasks, not just across
loss settings.
Figure 8 — Dynamics: Sampled-z Probe (Top Runs)
Caption: For top-ranked runs (guardrailed), sampled-z probe over eval
progress often peaks early and softens later; average start-to-end delta is
negative (mean delta ~= -0.136). Conclusion: generation-time strategy
separation can decay during late training even in otherwise strong runs.
Figure 9 — Dynamics: Posterior Probe (Top Runs)
Caption: Posterior-mu probe over time for the same top-run set usually
improves through training (mean delta ~= +0.198). Conclusion: posterior
inference space tends to become more linearly separable over training,
highlighting a posterior-vs-sampled gap.
Figure 10 — Dynamics: Sampled Accuracy (Top Runs)
Caption: Sampled validation accuracy rises strongly over training in top
runs (mean delta ~= +0.400). Conclusion: optimization continues to improve
answer correctness even when sampled-z probe may flatten or decline.
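The start-to-end deltas quoted for Figures 8-10 are just last-minus-first over each run's eval trace, averaged across the top-run set. A minimal sketch with toy traces (shapes chosen to echo the early-peak-then-soften pattern, not real data):

```python
# Sampled-z probe traces over eval checkpoints for two hypothetical top runs.
traces = {
    "run_a": [0.45, 0.52, 0.47, 0.40],  # peaks early, softens later
    "run_b": [0.44, 0.50, 0.45, 0.36],
}

# Start-to-end delta per run, then the mean across the top-run set.
deltas = {name: trace[-1] - trace[0] for name, trace in traces.items()}
mean_delta = sum(deltas.values()) / len(deltas)
print(f"mean start-to-end delta {mean_delta:+.3f}")
```

Applying the same reduction to posterior-probe and sampled-accuracy traces gives the +0.198 and +0.400 deltas in Figures 9 and 10.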
Best-Run Analysis (Task/Adaptation Highlights)
Using sampled_test_final/sampled_z/linear_probe_accuracy as the ranking axis,
best runs per (task, adaptation_mode) include:
- multidigit_addition + full + beta0p5_base: probe 0.5237, accuracy 0.9984, unique-valid 0.9980 (strongest overall sampled-z probe in Mar20).
- sorting_algorithms + full + beta0p1_pe0p5: probe 0.4785, accuracy 1.0, unique-valid 0.8684.
- base_conversion + full + beta0p1_pe0p5: probe 0.4638, accuracy 0.9836, unique-valid 0.9804.
- linear_equation_solving + full + beta0p1_pe0p5: probe 0.4533, accuracy 0.9932, unique-valid 0.9912.
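Selecting the best run per (task, adaptation_mode) is a single max-by-key pass over the run records. A minimal sketch, with the `probe` field standing in for sampled_test_final/sampled_z/linear_probe_accuracy and illustrative records only:

```python
# Toy run records; "probe" is the ranking axis used in this report.
runs = [
    {"task": "multidigit_addition", "mode": "full", "probe": 0.52, "acc": 0.998},
    {"task": "multidigit_addition", "mode": "full", "probe": 0.47, "acc": 1.000},
    {"task": "sorting_algorithms", "mode": "full", "probe": 0.48, "acc": 1.000},
]

# Keep the highest-probe run for each (task, adaptation_mode) cell.
best = {}
for r in runs:
    key = (r["task"], r["mode"])
    if key not in best or r["probe"] > best[key]["probe"]:
        best[key] = r
```

Note that ranking on probe alone can surface runs with lower accuracy, which is exactly the tradeoff discussed below.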
Notable tradeoff examples:
- grid_pathfinding has reasonable probe scores but much lower unique-valid rates (around 0.66-0.67 in top-probe runs), indicating persistent ambiguity in strategy attribution despite high answer accuracy.
- Some reinit top-probe entries (e.g., base_conversion + beta0p1_rmk0p5) show weaker final accuracy (0.9633) than corresponding full-mode leaders.
Takeaways
- Mar20 continuous CVAE reliably reaches high task correctness, but strategy controllability remains the main variance axis.
- full adaptation is consistently stronger than reinit on this collection across probe quality, posterior separability, accuracy, and invalid-rate.
- There is a recurring posterior-vs-sampled gap: posterior-mu probes can be very high while sampled-z probes are materially lower.
- Task identity is a major driver: multidigit_addition and sorting_algorithms are easiest for sampled-z strategy decodability; grid_pathfinding and list_summation remain harder.
- The composite setting beta0p1_pe0p5_rmk0p5_rsw0p1 is the weakest block in this sweep by global means (lower probe and lower answer accuracy), so it is a low-priority region for additional budget.
Recommended Follow-Ups
- Target the posterior-vs-sampled gap explicitly:
- keep posterior probe high while improving sampled-z probe stability late in training.
- Run focused follow-up sweeps on stronger settings (beta1_base, beta0p1_twr1, beta0p1_pe0p5, beta0p1_base) with more seeds.
- Add task-specific diagnostics for grid_pathfinding strategy ambiguity (especially unique-valid vs invalid attribution behavior).
- Recover the two missing base_conversion + beta0p5_base runs for a complete rectangular comparison set.
References
- collection: .slurmkit/collections/exp2_continuous_cvae-Mar20-full.yaml
- raw runs: .runs/synthetic_sequences/disentangled/exp2_continuous_cvae-Mar20-full
- notebook: experiments/synthetic_sequences/disentangled/analysis/notebooks/04_mar20_continuous_cvae_analysis.ipynb
- committed aggregates: experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_continuous_cvae-Mar20-full