continuous-cvae-Mar20
Summary
This report covers the collection:
path: .slurmkit/collections/exp2_continuous_cvae-Mar20-full.yaml
Collection design: 6 tasks x 8 curated loss settings x 2 adaptation modes (full,
reinit) x 1 seed = 96 expected runs.
Observed completion:
- complete runs with run_manifest.json + metrics.json: 94/96
- missing settings:
  - base_conversion + full + beta0p5_base
  - base_conversion + reinit + beta0p5_base
Main high-level finding: Mar20 maintains very high final task quality while
strategy-control diagnostics vary substantially. In particular,
sampled_test_final/overall_final_answer_accuracy stays high (mean 0.9870, min
0.9094, max 1.0), while sampled_test_final/sampled_z/linear_probe_accuracy
spans a wider range (mean 0.4452, min 0.3660, max 0.5237).
Global Metrics Snapshot
Across the 94 completed runs:
- sampled_test_final/overall_final_answer_accuracy: 0.9094 to 1.0 (mean 0.9870)
- sampled_test_final/parse_success_rate: 0.9917 to 1.0 (mean 0.9995)
- sampled_test_final/sampled_z/linear_probe_accuracy: 0.3660 to 0.5237 (mean 0.4452)
- continuous_diag_test_final/posterior_mu/linear_probe_accuracy: 0.3737 to 0.9991 (mean 0.7038)
- sampled_test_final/unique_valid_strategy_rate: 0.6572 to 0.9980 (mean 0.9056)
- sampled_test_final/invalid_strategy_rate: 0.0 to 0.0556 (mean 0.0077)
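The min/mean/max snapshot above can be reproduced with a small helper over the per-run metric dicts. This is a minimal sketch, not the analysis notebook's code; the metric keys are assumed to match the flat metrics.json schema quoted in the snapshot, and the values here are toy numbers:

```python
import statistics

# Hypothetical flat metric dicts, one per completed run (keys assumed
# to match the collection's metrics.json schema; values are illustrative).
runs = [
    {"sampled_test_final/overall_final_answer_accuracy": 0.99,
     "sampled_test_final/sampled_z/linear_probe_accuracy": 0.45},
    {"sampled_test_final/overall_final_answer_accuracy": 0.97,
     "sampled_test_final/sampled_z/linear_probe_accuracy": 0.40},
]

def snapshot(runs, key):
    """Min/mean/max of one metric across runs that report it."""
    vals = [r[key] for r in runs if key in r]
    return min(vals), statistics.mean(vals), max(vals)

lo, avg, hi = snapshot(runs, "sampled_test_final/sampled_z/linear_probe_accuracy")
print(f"{lo:.4f} to {hi:.4f} (mean {avg:.4f})")
```

The same helper applied over the 94 real run dicts yields each line of the snapshot.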
By adaptation mode (means):
full generally outperforms reinit on this collection:
- sampled-z probe: 0.4486 vs 0.4417
- posterior-mu probe: 0.7576 vs 0.6500
- final answer accuracy: 0.9925 vs 0.9815
- invalid strategy rate: 0.0045 vs 0.0108
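The by-mode means can be computed with a plain group-by pass; a minimal sketch, assuming per-run records tagged with their adaptation mode (the numbers below are toy values, not the collection's):

```python
from collections import defaultdict
from statistics import mean

# Toy (adaptation_mode, sampled-z probe accuracy) rows; values illustrative only.
rows = [("full", 0.46), ("full", 0.44), ("reinit", 0.45), ("reinit", 0.43)]

# Group probe values by adaptation mode, then average each group.
by_mode = defaultdict(list)
for mode, probe in rows:
    by_mode[mode].append(probe)

means = {mode: round(mean(vals), 4) for mode, vals in by_mode.items()}
print(means)
```

Repeating this for each diagnostic metric gives the full/reinit comparison above.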
Key Figures
Figure 1 — Coverage / Completeness
Caption: This heatmap shows completed run counts per (task_name, adaptation_mode). Most cells have all 8 expected settings; only
base_conversion has 7 per mode. Conclusion: collection-level conclusions
are broadly reliable, with one missing setting pair concentrated in a single
task.
Figure 2 — Probe vs Final Accuracy
Caption: Each point is one run, faceted by task, plotting
sampled-z probe accuracy against final answer accuracy. Accuracy remains high in
most runs while probe varies visibly; correlation is modest (r ~= 0.302).
Conclusion: solving quality and strategy decodability are coupled only
weakly-to-moderately in Mar20.
Figure 3 — Posterior Probe vs Sampled Probe
Caption: This plots posterior-mu probe accuracy against sampled-z probe
accuracy per run. The relationship is moderate (r ~= 0.549), but many runs
show a positive gap (posterior minus sampled: mean +0.259, max +0.562).
Conclusion: strategies are often more linearly recoverable from posterior
means than from sampled latents used at generation time.
Figure 4 — Invalid Strategy Rate vs Probe
Caption: Scatter of invalid strategy rate against sampled-z probe accuracy.
Trend is negative (r ~= -0.353). Conclusion: better latent strategy
control tends to coincide with fewer invalid strategy attributions, though this
is not a strict monotonic relation.
Figure 5 — Heatmap: Sampled-z Probe by Setting
Caption: Task-by-loss-setting heatmaps for sampled-z probe, split by
adaptation mode. Highest probes cluster in multidigit_addition and
sorting_algorithms; list_summation is consistently lower. Conclusion:
probe quality is strongly task-dependent; no single setting dominates every task.
Figure 6 — Heatmap: Posterior-mu Probe by Setting
Caption: Same grid view for posterior-mu probe accuracy. Several settings
(e.g., beta0p1_twr1, beta0p1_rmk0p5) produce strong posterior probe values,
often higher than sampled-z probe for the same run family. Conclusion:
posterior representations can be highly separable even when sampled generation
latents are less semantically crisp.
Figure 7 — Heatmap: Unique-Valid Strategy Rate
Caption: Task-by-setting unique-valid rates show clear task effects:
grid_pathfinding remains the hardest (low unique-valid), while
linear_equation_solving and multidigit_addition are strong. Conclusion:
latent strategy control quality is heterogeneous across tasks, not just across
loss settings.
Figure 8 — Dynamics: Sampled-z Probe (Top Runs)
Caption: For top-ranked runs (guardrailed), sampled-z probe over eval
progress often peaks early and softens later; average start-to-end delta is
negative (mean delta ~= -0.136). Conclusion: generation-time strategy
separation can decay during late training even in otherwise strong runs.
Figure 9 — Dynamics: Posterior Probe (Top Runs)
Caption: Posterior-mu probe over time for the same top-run set usually
improves through training (mean delta ~= +0.198). Conclusion: posterior
inference space tends to become more linearly separable over training,
highlighting a posterior-vs-sampled gap.
Figure 10 — Dynamics: Sampled Accuracy (Top Runs)
Caption: Sampled validation accuracy rises strongly over training in top
runs (mean delta ~= +0.400). Conclusion: optimization continues to improve
answer correctness even when sampled-z probe may flatten or decline.
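The start-to-end deltas quoted for Figures 8-10 are just last-minus-first over each run's eval trace, averaged across the top-run set. A minimal sketch with toy traces (shapes chosen to echo the early-peak-then-soften pattern, not real data):

```python
# Sampled-z probe traces over eval checkpoints for two hypothetical top runs.
traces = {
    "run_a": [0.45, 0.52, 0.47, 0.40],  # peaks early, softens later
    "run_b": [0.44, 0.50, 0.45, 0.36],
}

# Start-to-end delta per run, then the mean across the top-run set.
deltas = {name: trace[-1] - trace[0] for name, trace in traces.items()}
mean_delta = sum(deltas.values()) / len(deltas)
print(f"mean start-to-end delta {mean_delta:+.3f}")
```

Applying the same reduction to posterior-probe and sampled-accuracy traces gives the +0.198 and +0.400 deltas in Figures 9 and 10.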
Best-Run Analysis (Task/Adaptation Highlights)
Using sampled_test_final/sampled_z/linear_probe_accuracy as the ranking axis,
best runs per (task, adaptation_mode) include:
- multidigit_addition + full + beta0p5_base: probe 0.5237, accuracy 0.9984, unique-valid 0.9980 (strongest overall sampled-z probe in Mar20).
- sorting_algorithms + full + beta0p1_pe0p5: probe 0.4785, accuracy 1.0, unique-valid 0.8684.
- base_conversion + full + beta0p1_pe0p5: probe 0.4638, accuracy 0.9836, unique-valid 0.9804.
- linear_equation_solving + full + beta0p1_pe0p5: probe 0.4533, accuracy 0.9932, unique-valid 0.9912.
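Selecting the best run per (task, adaptation_mode) is a single max-by-key pass over the run records. A minimal sketch, with the `probe` field standing in for sampled_test_final/sampled_z/linear_probe_accuracy and illustrative records only:

```python
# Toy run records; "probe" is the ranking axis used in this report.
runs = [
    {"task": "multidigit_addition", "mode": "full", "probe": 0.52, "acc": 0.998},
    {"task": "multidigit_addition", "mode": "full", "probe": 0.47, "acc": 1.000},
    {"task": "sorting_algorithms", "mode": "full", "probe": 0.48, "acc": 1.000},
]

# Keep the highest-probe run for each (task, adaptation_mode) cell.
best = {}
for r in runs:
    key = (r["task"], r["mode"])
    if key not in best or r["probe"] > best[key]["probe"]:
        best[key] = r
```

Note that ranking on probe alone can surface runs with lower accuracy, which is exactly the tradeoff discussed below.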
Notable tradeoff examples:
- grid_pathfinding has reasonable probe scores but much lower unique-valid rates (around 0.66-0.67 in top-probe runs), indicating persistent ambiguity in strategy attribution despite high answer accuracy.
- Some reinit top-probe entries (e.g., base_conversion + beta0p1_rmk0p5) show weaker final accuracy (0.9633) than corresponding full-mode leaders.
Takeaways
- Mar20 continuous CVAE reliably reaches high task correctness, but strategy controllability remains the main variance axis.
- full adaptation is consistently stronger than reinit on this collection across probe quality, posterior separability, accuracy, and invalid-rate.
- There is a recurring posterior-vs-sampled gap: posterior-mu probes can be very high while sampled-z probes are materially lower.
- Task identity is a major driver: multidigit_addition and sorting_algorithms are easiest for sampled-z strategy decodability; grid_pathfinding and list_summation remain harder.
- The composite setting beta0p1_pe0p5_rmk0p5_rsw0p1 is the weakest block in this sweep by global means (lower probe and lower answer accuracy), so it is a low-priority region for additional budget.
Recommended Follow-Ups
- Target the posterior-vs-sampled gap explicitly:
- keep posterior probe high while improving sampled-z probe stability late in training.
- Run focused follow-up sweeps on stronger settings (beta1_base, beta0p1_twr1, beta0p1_pe0p5, beta0p1_base) with more seeds.
- Add task-specific diagnostics for grid_pathfinding strategy ambiguity (especially unique-valid vs invalid attribution behavior).
- Recover the two missing base_conversion + beta0p5_base runs for a complete rectangular comparison set.
References
- collection: .slurmkit/collections/exp2_continuous_cvae-Mar20-full.yaml
- raw runs: .runs/synthetic_sequences/disentangled/exp2_continuous_cvae-Mar20-full
- notebook: experiments/synthetic_sequences/disentangled/analysis/notebooks/04_mar20_continuous_cvae_analysis.ipynb
- committed aggregates: experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_continuous_cvae-Mar20-full