exp2_discrete_cvae-Apr3-7loss (extended - Apr 7 update)
Collection Context
- Collection: path:.runs/synthetic_sequences/disentangled/exp2_discrete_cvae-Apr3-7loss
- Original planned grid: 6 tasks × 6 loss settings × 3 adaptation modes = 108 runs
- Extended grid (Apr 6 followup): + 3 new loss settings (beta0p03_norm_twr2_jsd0, beta0p1_norm_twr2_jsd0_lin40, beta0p1_norm_twr3_jsd0) → ~162 planned total
- Observed completed runs in local artifacts: 144
  - Note: beta0p1_norm_twr2_jsd0_lin40 runs are absent from the executive summary statistics, suggesting those jobs did not complete in time for this analysis.
- Primary decision metrics:
  - alignment: controlled_test_final/alignment_one_to_one
  - accuracy gate: sampled_test_final/final_answer_accuracy > 0.95
Choose interventions that improve latent-strategy alignment while maintaining strong answer utility.
For this report, an intervention "works" when it improves mean alignment and keeps mean accuracy above 0.95.
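The mean-based rule stated above can be written as a small predicate. This is a minimal sketch: the name `intervention_works` and its arguments are hypothetical and do not come from the analysis notebook, and the inputs are the mean values reported later in this document.

```python
ACCURACY_GATE = 0.95  # threshold on sampled_test_final/final_answer_accuracy


def intervention_works(base_alignment, treat_alignment, treat_accuracy,
                       gate=ACCURACY_GATE):
    """Return True when mean alignment improves and mean accuracy clears the gate."""
    improves_alignment = treat_alignment > base_alignment
    passes_gate = treat_accuracy > gate
    return improves_alignment and passes_gate


# Reported means for the normalization contrast (base -> norm).
print(intervention_works(0.2856, 0.4154, 0.9933))
# Reported means for the TWR contrast (twr0 -> twr2).
print(intervention_works(0.4154, 0.4004, 0.9949))
```

The predicate covers only the comparison of means. The verdicts in this report also weigh the matched-pair confidence interval, which is why a contrast with a higher mean can still be labeled inconclusive.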
Executive Answers
1) Does Token-Weighted Reconstruction (TWR) work?
Answer: Inconclusive — finding reversed from prior analysis. With the extended dataset (n=18 pairs, up from 15), the matched-pair alignment delta for twr0 → twr2 is now slightly negative and the confidence interval spans zero. TWR at weight 2 does not show a reliable alignment benefit in the matched contrast.
- Metric key: controlled_test_final/alignment_one_to_one
  - mean alignment: 0.4154 → 0.4004 (−0.0150 absolute)
  - matched-pair delta mean: −0.0059, 95% bootstrap CI [−0.1067, +0.1096], n=18 pairs
- Metric key: sampled_test_final/final_answer_accuracy
  - mean accuracy: 0.9933 → 0.9949
  - matched-pair delta mean: +0.0019, 95% bootstrap CI [+0.0005, +0.0036]
  - treatment setting passes gate (0.9949 > 0.95)
- Note: The per-task winner results (§5) show beta0p1_norm_twr3_jsd0 (not twr2) as the dominant setting. The twr2 contrast likely does not capture the optimal TWR level.
Caption: TWR contrast (beta0p1_norm_twr0_jsd0 vs beta0p1_norm_twr2_jsd0) with alignment and accuracy deltas. The CI spans zero; see also §5 for the stronger twr3 result.
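A matched-pair delta with a bootstrap confidence interval, as reported above, can be computed along the following lines. This is a sketch that assumes a percentile bootstrap over pairs. The report does not state which bootstrap variant, resample count, or seed the notebook uses, so `paired_bootstrap_ci`, `n_boot`, and `seed` are illustrative assumptions, and the sample values are invented for demonstration rather than taken from the experiment.

```python
import random
import statistics


def paired_bootstrap_ci(control, treatment, n_boot=10_000, level=0.95, seed=0):
    """Mean matched-pair delta with a percentile bootstrap CI.

    control[i] and treatment[i] must come from the same matched cell
    (for example the same task and adaptation mode).
    """
    if len(control) != len(treatment) or not control:
        raise ValueError("need two non-empty sequences of equal length")
    deltas = [t - c for c, t in zip(control, treatment)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resample pairs with replacement and record the mean delta each time.
    boot_means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo = boot_means[int(alpha * n_boot)]
    hi = boot_means[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return statistics.fmean(deltas), lo, hi


# Illustrative values only (not results from this experiment).
control = [0.40, 0.35, 0.50, 0.45, 0.30, 0.55]
treatment = [0.42, 0.30, 0.52, 0.40, 0.36, 0.53]
mean_delta, lo, hi = paired_bootstrap_ci(control, treatment, n_boot=2000)
spans_zero = lo <= 0.0 <= hi
print(f"delta={mean_delta:+.4f}, CI=[{lo:+.4f}, {hi:+.4f}], spans zero: {spans_zero}")
```

A contrast is read as inconclusive when `spans_zero` is true, which is the situation described for the twr0 → twr2 alignment delta.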
2) Does JSD work?
Answer: Marginally positive but inconclusive. The extended dataset shows a positive alignment delta for twr2_jsd0 → twr2_jsd0p5, but the confidence interval spans zero. The direction reversed from the prior analysis (which showed a negative effect with fewer pairs).
- Metric key: controlled_test_final/alignment_one_to_one
  - mean alignment: 0.4004 → 0.4405 (+0.0402 absolute)
  - matched-pair delta mean: +0.0208, 95% bootstrap CI [−0.1022, +0.1412], n=17 pairs
- Metric key: sampled_test_final/final_answer_accuracy
  - mean accuracy: 0.9949 → 0.9948 (negligible change)
  - matched-pair delta mean: −0.0009, 95% bootstrap CI [−0.0030, +0.0013]
  - treatment setting passes gate (0.9948 > 0.95)
Caption: JSD contrast (beta0p1_norm_twr2_jsd0 vs beta0p1_norm_twr2_jsd0p5) showing a positive but wide-CI alignment delta.
3) Does baseline-normalization of reconstruction loss work?
Answer: Yes, robustly. Normalization provides the largest and most clearly significant alignment uplift among the predefined contrasts, consistent with the prior analysis.
- Metric key: controlled_test_final/alignment_one_to_one
  - mean alignment: 0.2856 → 0.4154 (+0.1298 absolute)
  - matched-pair delta mean: +0.1298, 95% bootstrap CI [+0.0387, +0.2375], n=18 pairs
- Metric key: sampled_test_final/final_answer_accuracy
  - mean accuracy: 0.9917 → 0.9933
  - matched-pair delta mean: +0.0016, 95% bootstrap CI [+0.0001, +0.0035]
  - treatment setting passes gate (0.9933 > 0.95)
Caption: Normalization contrast (beta0p1_base_twr0_jsd0 vs beta0p1_norm_twr0_jsd0) — the strongest and most reliable alignment improvement among tested interventions.
4) Which combination works best? Which minimal combination is sufficient?
Answer: Best tested (from predefined ladder): beta0p1_norm_twr2_jsd0p5. However, the newly-tested beta0p1_norm_twr3_jsd0 (not in the ladder) achieves higher alignment overall (0.4821 vs 0.4405); see §5.
- Ladder tested: base_twr0_jsd0 → norm_twr0_jsd0 → norm_twr2_jsd0 → norm_twr2_jsd0p5 (all with beta=0.1)
- Best alignment mean (controlled_test_final/alignment_one_to_one): 0.4405 at beta0p1_norm_twr2_jsd0p5
- Best-setting accuracy mean (sampled_test_final/final_answer_accuracy): 0.9948 (passes gate)
- 95% of best alignment target: 0.4185
- First setting in ladder meeting target + accuracy gate: beta0p1_norm_twr2_jsd0p5 (same as best)
Caption: Beta=0.1 combination ladder with alignment and accuracy. beta0p1_norm_twr2_jsd0p5 is best in the predefined ladder; beta0p1_norm_twr3_jsd0 exceeds it but was not included in the ladder contrast.
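The "minimal sufficient" selection on the ladder can be reproduced from the reported means. The sketch below is an assumption about how that selection works (best gate-passing setting, a target at 95% of its alignment, then the first ladder step that reaches the target); `minimal_sufficient` is a hypothetical helper, not a function from the notebook.

```python
ACCURACY_GATE = 0.95
TARGET_FRACTION = 0.95  # "95% of best alignment"

# (setting, mean alignment, mean accuracy) in ladder order, from the reported means.
ladder = [
    ("beta0p1_base_twr0_jsd0", 0.2856, 0.9917),
    ("beta0p1_norm_twr0_jsd0", 0.4154, 0.9933),
    ("beta0p1_norm_twr2_jsd0", 0.4004, 0.9949),
    ("beta0p1_norm_twr2_jsd0p5", 0.4405, 0.9948),
]


def minimal_sufficient(ladder, target_fraction=TARGET_FRACTION, gate=ACCURACY_GATE):
    """Return (best setting, alignment target, first setting meeting target and gate)."""
    passing = [row for row in ladder if row[2] > gate]
    best = max(passing, key=lambda row: row[1])
    target = target_fraction * best[1]
    # Ladder order is preserved, so the first match is the minimal combination.
    first = next(row for row in passing if row[1] >= target)
    return best[0], target, first[0]


best, target, first = minimal_sufficient(ladder)
print(best, round(target, 4), first)
```

With these inputs the target is 0.95 × 0.4405 ≈ 0.4185. beta0p1_norm_twr0_jsd0 falls just short at 0.4154, so the first qualifying step is also the best one.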
5) One setting for all tasks, or task-specific winners?
Answer: Task-specific. No single setting is top-alignment on every task. The newly-tested beta0p1_norm_twr3_jsd0 is the strongest default, winning 3/6 tasks.
- Per-task winners under accuracy gate (>0.95):
  - beta0p1_norm_twr3_jsd0: 3/6 tasks (grid_pathfinding, linear_equation_solving, sorting_algorithms)
  - beta0p1_norm_twr2_jsd0p5: 1/6 tasks (list_summation)
  - beta0p1_norm_twr2_jsd0: 1/6 tasks (multidigit_addition)
  - beta0p1_norm_twr0_jsd0: 1/6 tasks (base_conversion)
- Therefore: no universal winner (single_universal_setting = false).
Caption: Per-task winner map under accuracy gate. beta0p1_norm_twr3_jsd0 (new from Apr 6 followup) is the most consistent performer.
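Per-task winner selection under the accuracy gate can be sketched as follows. The per-task values behind the winner map are not listed in this report, so the rows below use placeholder task and setting names with invented numbers, and `per_task_winners` is a hypothetical helper. Only the name `single_universal_setting` is taken from the text above, and its definition here is an assumption.

```python
from collections import Counter

ACCURACY_GATE = 0.95

# Placeholder rows: (task, setting, mean alignment, mean accuracy).
# These numbers are invented for illustration and are not experiment results.
rows = [
    ("task_a", "setting_x", 0.50, 0.99),
    ("task_a", "setting_y", 0.60, 0.90),  # highest alignment but fails the gate
    ("task_b", "setting_x", 0.30, 0.98),
    ("task_b", "setting_y", 0.45, 0.97),
]


def per_task_winners(rows, gate=ACCURACY_GATE):
    """Map each task to the gate-passing setting with the highest alignment."""
    winners = {}
    for task, setting, alignment, accuracy in rows:
        if accuracy <= gate:
            continue  # excluded by the accuracy gate
        if task not in winners or alignment > winners[task][1]:
            winners[task] = (setting, alignment)
    return {task: setting for task, (setting, _) in winners.items()}


winners = per_task_winners(rows)
win_counts = Counter(winners.values())
# Assumed definition: one setting wins every task that has a gate-passing run.
single_universal_setting = len(win_counts) == 1
print(winners, dict(win_counts), single_universal_setting)
```

Applying the gate before ranking matters: in the placeholder data, setting_y has the highest alignment on task_a but is excluded because its accuracy is below 0.95.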
6) Promising untried settings from observed effects
Answer: The Apr 6 followup sweep has now validated beta0p1_norm_twr3_jsd0, which is the highest-performing tested setting overall (mean alignment 0.4821). Top untested candidates from effect projection (updated for the expanded dataset) are:
- beta0p1_norm_twr2_jsd1 (projected alignment 0.4807, projected accuracy 0.9947)
- beta0p1_norm_twr2_jsd0p25 (0.4205, 0.9948)
- beta0p1_norm_twr2_jsd0p1 (0.4084, 0.9948)
- beta0p1_norm_twr1_jsd0 (0.4079, 0.9941)
Note: beta0p03_norm_twr2_jsd0 (lower beta, followup) achieved mean alignment 0.3667, below the 0.4004 of the main twr2 baseline (beta0p1_norm_twr2_jsd0); the lower beta appears insufficient at this TWR level.
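The projected alignment values for the untested candidates agree, to within rounding of the reported means, with linear interpolation or extrapolation between adjacent tested settings. The sketch below reproduces them under that assumption. The report does not state the projection method, so `project_linear` is a hypothetical helper and the linear form is an inference from the numbers.

```python
def project_linear(x0, y0, x1, y1, x):
    """Linearly interpolate or extrapolate y at x from points (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Reported mean alignment at beta0p1_norm_twr2 with jsd=0 and jsd=0.5.
JSD0, JSD0P5 = 0.4004, 0.4405
# Reported mean alignment at beta0p1_norm_jsd0 with twr=0 and twr=2.
TWR0, TWR2 = 0.4154, 0.4004

projections = {
    "beta0p1_norm_twr2_jsd1": project_linear(0.0, JSD0, 0.5, JSD0P5, 1.0),
    "beta0p1_norm_twr2_jsd0p25": project_linear(0.0, JSD0, 0.5, JSD0P5, 0.25),
    "beta0p1_norm_twr2_jsd0p1": project_linear(0.0, JSD0, 0.5, JSD0P5, 0.1),
    "beta0p1_norm_twr1_jsd0": project_linear(0.0, TWR0, 2.0, TWR2, 1.0),
}
for name, value in projections.items():
    print(f"{name}: {value:.4f}")
```

From the rounded inputs, the jsd1 projection comes out at 0.4806 against the listed 0.4807; the other three match to four decimals. The same linear rule applied to twr=3 would give about 0.3929, well below the observed 0.4821 for beta0p1_norm_twr3_jsd0, so these projections are better read as a rough priority ordering than as forecasts.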
Caption: Candidate-priority chart from observed effects. beta0p1_norm_twr3_jsd0 is now marked as tested and is the top actual performer.
Caption: Reconstruction-vs-alignment tradeoff diagnostic by task and adaptation mode.
Cross-Task Robustness
Accuracy remains far above 0.95 for all competitive settings, so the binding constraint is alignment consistency across tasks. beta0p1_norm_twr3_jsd0 is now the strongest default, winning 3/6 tasks and achieving the highest overall mean alignment (0.4821). Task-specific exceptions persist: base_conversion favors beta0p1_norm_twr0_jsd0, list_summation favors beta0p1_norm_twr2_jsd0p5, and multidigit_addition favors beta0p1_norm_twr2_jsd0. The TWR=2 matched contrast no longer shows a reliable advantage over TWR=0; the prior positive finding does not replicate once n increases from 15 to 18 pairs.
Final Recommendation
- Best-performing tested setting (overall mean alignment): beta0p1_norm_twr3_jsd0 (0.4821)
- Best setting in the predefined ladder: beta0p1_norm_twr2_jsd0p5 (0.4405)
- Interventions verdict (updated):
  - normalization: works (robust, unchanged)
  - TWR (twr2): inconclusive (matched-pair CI spans zero with extended n; twr3 outperforms)
  - JSD (0.5): inconclusive (positive direction but CI spans zero)
  - beta0p1_norm_twr3_jsd0: strong positive finding from followup; new recommended default
Next sweep proposal:
- Prioritize higher JSD variants with twr3: beta0p1_norm_twr3_jsd0p25, beta0p1_norm_twr3_jsd0p5
- High-projection untested setting: beta0p1_norm_twr2_jsd1
- Confirm base_conversion exception: rerun with wider adaptation coverage
References
- Notebook: path:experiments/synthetic_sequences/disentangled/analysis/notebooks/08_apr4_discrete_cvae_7loss_analysis.ipynb
- Aggregate dir: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/
- Summary data: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/summary.csv
- Executive stats: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/executive_summary_stats.json
- Prior analysis (6-setting baseline): path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260403/