`exp2_discrete_cvae-Apr3-7loss` (extended - Apr 7 update)

Collection Context

Collection: path:.runs/synthetic_sequences/disentangled/exp2_discrete_cvae-Apr3-7loss
Original planned grid: 6 tasks × 6 loss settings × 3 adaptation modes = 108 runs
Extended grid (Apr 6 followup): + 3 new loss settings (beta0p03_norm_twr2_jsd0, beta0p1_norm_twr2_jsd0_lin40, beta0p1_norm_twr3_jsd0) → 6 tasks × 9 loss settings × 3 adaptation modes = 162 planned total
Observed completed runs in local artifacts: 144
- Note: beta0p1_norm_twr2_jsd0_lin40 runs are absent from the executive summary statistics, suggesting those jobs did not complete in time for this analysis.
Primary decision metrics:
- alignment: controlled_test_final/alignment_one_to_one
- accuracy gate: sampled_test_final/final_answer_accuracy > 0.95
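The planned-run counts above follow from a full cross of tasks, loss settings, and adaptation modes. A minimal sketch of that arithmetic is below; the task names are taken from the per-task winner list in this report, while the adaptation-mode labels are hypothetical placeholders because the report does not name them.

```python
from itertools import product

# Task names as listed in the per-task winners section of this report.
TASKS = [
    "grid_pathfinding", "linear_equation_solving", "sorting_algorithms",
    "list_summation", "multidigit_addition", "base_conversion",
]
# Hypothetical labels: the report only says there are 3 adaptation modes.
ADAPTATION_MODES = ["mode_a", "mode_b", "mode_c"]


def planned_runs(n_loss_settings: int) -> int:
    """Count runs in a full tasks x loss settings x adaptation modes grid."""
    settings = range(n_loss_settings)
    return sum(1 for _ in product(TASKS, settings, ADAPTATION_MODES))


original = planned_runs(6)      # original grid
extended = planned_runs(6 + 3)  # after the Apr 6 followup
observed = 144                  # completed runs reported above
shortfall = extended - observed
print(original, extended, shortfall)  # 108 162 18
```

The 18-run shortfall equals the size of one full setting (6 tasks × 3 modes). That is consistent with one setting being absent, but the count alone does not show which runs are missing.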

Goal + Work Criterion

Choose interventions that improve latent-strategy alignment while maintaining strong answer utility. For this report, an intervention "works" when it improves mean alignment and keeps mean accuracy above 0.95.
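The work criterion can be written as a small predicate. This is a sketch of the rule as stated in the paragraph above, using only means; the function name is an assumption made for illustration, and the verdicts later in this report also weigh the bootstrap confidence intervals, which this check ignores.

```python
ACCURACY_GATE = 0.95  # accuracy gate from the Collection Context


def intervention_works(align_control: float, align_treatment: float,
                       acc_treatment: float) -> bool:
    """True when mean alignment improves and mean accuracy stays above the gate."""
    return align_treatment > align_control and acc_treatment > ACCURACY_GATE


# Means reported for the normalization contrast (base -> norm).
normalization_ok = intervention_works(0.2856, 0.4154, 0.9933)
# Means reported for the TWR contrast (twr0 -> twr2).
twr2_ok = intervention_works(0.4154, 0.4004, 0.9949)
print(normalization_ok, twr2_ok)  # True False
```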

Executive Answers

1) Does Token-Weighted Reconstruction (TWR) work?

Answer: Inconclusive — finding reversed from prior analysis. With the extended dataset (n=18 pairs, up from 15), the matched-pair alignment delta for twr0 → twr2 is now slightly negative and the confidence interval spans zero. TWR at weight 2 does not show a reliable alignment benefit in the matched contrast.

Metric key: controlled_test_final/alignment_one_to_one
- mean alignment: 0.4154 → 0.4004 (−0.0150 absolute)
- matched-pair delta mean: −0.0059, 95% bootstrap CI [−0.1067, +0.1096], n=18 pairs
Metric key: sampled_test_final/final_answer_accuracy
- mean accuracy: 0.9933 → 0.9949
- matched-pair delta mean: +0.0019, 95% bootstrap CI [+0.0005, +0.0036]
- treatment setting passes gate (0.9949 > 0.95)
Note: The per-task winner results (§5) show beta0p1_norm_twr3_jsd0 — not twr2 — as the dominant setting. The twr2 contrast likely does not capture the optimal TWR level.
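For readers unfamiliar with the matched-pair statistic, the sketch below shows one common way to compute a bootstrap CI for a mean paired delta. It assumes a percentile bootstrap; the report does not say which bootstrap variant was used, and the deltas here are synthetic values chosen for illustration, not the experiment's data.

```python
import random
from statistics import mean


def paired_bootstrap_ci(deltas, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(deltas)
    # Resample pairs with replacement and record each resample's mean.
    boot_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(deltas), boot_means[lo_idx], boot_means[hi_idx]


# Synthetic per-pair alignment deltas (treatment - control), 18 pairs.
synthetic_deltas = [
    0.12, -0.08, 0.03, -0.15, 0.20, -0.02, 0.05, -0.11, 0.01,
    -0.04, 0.09, -0.07, 0.02, -0.13, 0.16, -0.03, 0.06, -0.09,
]
point, lower, upper = paired_bootstrap_ci(synthetic_deltas)
spans_zero = lower < 0.0 < upper
print(f"delta={point:+.4f}, CI=[{lower:+.4f}, {upper:+.4f}], spans zero: {spans_zero}")
```

When the interval spans zero, as in the TWR contrast above, the sign of the point estimate should not be read as evidence of an effect in either direction.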

Caption: TWR contrast (beta0p1_norm_twr0_jsd0 vs beta0p1_norm_twr2_jsd0) with alignment and accuracy deltas. The alignment CI spans zero; see also §5 for the stronger twr3 result.

2) Does JSD work?

Answer: Marginally positive but inconclusive. The extended dataset shows a positive alignment delta for twr2_jsd0 → twr2_jsd0p5, but the confidence interval spans zero. The direction reversed from the prior analysis (which showed a negative effect with fewer pairs).

Metric key: controlled_test_final/alignment_one_to_one
- mean alignment: 0.4004 → 0.4405 (+0.0402 absolute)
- matched-pair delta mean: +0.0208, 95% bootstrap CI [−0.1022, +0.1412], n=17 pairs
Metric key: sampled_test_final/final_answer_accuracy
- mean accuracy: 0.9949 → 0.9948 (negligible change)
- matched-pair delta mean: −0.0009, 95% bootstrap CI [−0.0030, +0.0013]
- treatment setting passes gate (0.9948 > 0.95)
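The JSD contrast has n=17 pairs rather than 18. One plausible reading, assuming pairs are matched on (task, adaptation mode) with 6 × 3 = 18 possible cells, is that one cell is missing from one arm. The report does not state the matching key, so the sketch below is illustrative: the function name, mode labels, and values are hypothetical.

```python
def matched_deltas(control: dict, treatment: dict) -> dict:
    """Return treatment - control for keys present in both arms."""
    shared = control.keys() & treatment.keys()
    return {key: treatment[key] - control[key] for key in sorted(shared)}


# Toy alignment values keyed by (task, adaptation_mode); not real results.
control = {
    ("list_summation", "mode_a"): 0.40,
    ("list_summation", "mode_b"): 0.35,
    ("base_conversion", "mode_a"): 0.50,
}
treatment = {
    ("list_summation", "mode_a"): 0.45,
    # ("list_summation", "mode_b") is missing, so that pair is dropped.
    ("base_conversion", "mode_a"): 0.48,
}
pair_deltas = matched_deltas(control, treatment)
print(len(pair_deltas), pair_deltas)
```

Dropping unmatched cells is also one way the matched-pair delta mean can differ from the simple difference of the two setting means, since the two statistics are then computed over different sets of runs.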

Caption: JSD contrast (beta0p1_norm_twr2_jsd0 vs beta0p1_norm_twr2_jsd0p5) showing a positive but wide-CI alignment delta.

3) Does baseline-normalization of reconstruction loss work?

Answer: Yes, robustly. Normalization provides the largest and most clearly significant alignment uplift among the predefined contrasts, consistent with the prior analysis.

Metric key: controlled_test_final/alignment_one_to_one
- mean alignment: 0.2856 → 0.4154 (+0.1298 absolute)
- matched-pair delta mean: +0.1298, 95% bootstrap CI [+0.0387, +0.2375], n=18 pairs
Metric key: sampled_test_final/final_answer_accuracy
- mean accuracy: 0.9917 → 0.9933
- matched-pair delta mean: +0.0016, 95% bootstrap CI [+0.0001, +0.0035]
- treatment setting passes gate (0.9933 > 0.95)

Caption: Normalization contrast (beta0p1_base_twr0_jsd0 vs beta0p1_norm_twr0_jsd0), the strongest and most reliable alignment improvement among tested interventions.

4) Which combination works best? Which minimal combination is sufficient?

Answer: The best setting in the predefined ladder is beta0p1_norm_twr2_jsd0p5, and it is also the minimal sufficient one: no earlier rung reaches 95% of its alignment. However, the newly tested beta0p1_norm_twr3_jsd0 (not in the ladder) achieves higher overall alignment (0.4821 vs 0.4405); see §5.

Ladder tested: base_twr0_jsd0 → norm_twr0_jsd0 → norm_twr2_jsd0 → norm_twr2_jsd0p5 (all with beta=0.1)
Best alignment mean (controlled_test_final/alignment_one_to_one): 0.4405 at beta0p1_norm_twr2_jsd0p5
Best-setting accuracy mean (sampled_test_final/final_answer_accuracy): 0.9948 (passes gate)
95% of best alignment target: 0.4185
First setting in ladder meeting target + accuracy gate: beta0p1_norm_twr2_jsd0p5 (same as best)
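The minimal-sufficient selection can be reproduced from the mean values reported in sections 1 to 3. The sketch below walks the ladder in order and returns the first setting that reaches 95% of the best alignment while passing the accuracy gate; the function name is an assumption made for illustration.

```python
ACCURACY_GATE = 0.95

# (setting, mean alignment, mean accuracy) as reported in this document,
# listed in ladder order.
LADDER = [
    ("beta0p1_base_twr0_jsd0", 0.2856, 0.9917),
    ("beta0p1_norm_twr0_jsd0", 0.4154, 0.9933),
    ("beta0p1_norm_twr2_jsd0", 0.4004, 0.9949),
    ("beta0p1_norm_twr2_jsd0p5", 0.4405, 0.9948),
]


def minimal_sufficient(ladder, fraction=0.95, gate=ACCURACY_GATE):
    """Return (first setting meeting the target and gate, alignment target)."""
    best_alignment = max(alignment for _, alignment, _ in ladder)
    target = fraction * best_alignment
    for name, alignment, accuracy in ladder:
        if alignment >= target and accuracy > gate:
            return name, target
    return None, target


first_ok, target = minimal_sufficient(LADDER)
print(first_ok, round(target, 4))  # beta0p1_norm_twr2_jsd0p5 0.4185
```

Note that beta0p1_norm_twr0_jsd0 misses the target only narrowly (0.4154 against 0.4185), so the minimal-sufficient answer is sensitive to the 95% threshold.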

Caption: Beta=0.1 combination ladder with alignment and accuracy. beta0p1_norm_twr2_jsd0p5 is best in the predefined ladder; beta0p1_norm_twr3_jsd0 exceeds it but was not included in the ladder contrast.

5) One setting for all tasks, or task-specific winners?

Answer: Task-specific. No single setting is top-alignment on every task. The newly-tested beta0p1_norm_twr3_jsd0 is the strongest default, winning 3/6 tasks.

Per-task winners under accuracy gate (>0.95):
- beta0p1_norm_twr3_jsd0: 3/6 tasks (grid_pathfinding, linear_equation_solving, sorting_algorithms)
- beta0p1_norm_twr2_jsd0p5: 1/6 tasks (list_summation)
- beta0p1_norm_twr2_jsd0: 1/6 tasks (multidigit_addition)
- beta0p1_norm_twr0_jsd0: 1/6 tasks (base_conversion)
Therefore: no universal winner (single_universal_setting = false).
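The winner tally and the single_universal_setting flag can be derived directly from the per-task winner list above. A minimal sketch, where the dictionary simply restates that list:

```python
from collections import Counter

# Per-task winners under the accuracy gate, as listed above.
WINNERS = {
    "grid_pathfinding": "beta0p1_norm_twr3_jsd0",
    "linear_equation_solving": "beta0p1_norm_twr3_jsd0",
    "sorting_algorithms": "beta0p1_norm_twr3_jsd0",
    "list_summation": "beta0p1_norm_twr2_jsd0p5",
    "multidigit_addition": "beta0p1_norm_twr2_jsd0",
    "base_conversion": "beta0p1_norm_twr0_jsd0",
}

counts = Counter(WINNERS.values())
# A universal winner exists only if every task picks the same setting.
single_universal_setting = len(counts) == 1
top_setting, top_wins = counts.most_common(1)[0]
print(top_setting, f"{top_wins}/{len(WINNERS)}", single_universal_setting)
# beta0p1_norm_twr3_jsd0 3/6 False
```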

Caption: Per-task winner map under accuracy gate. beta0p1_norm_twr3_jsd0 (new from Apr 6 followup) is the most consistent performer.

6) Promising untried settings from observed effects

Answer: The Apr 6 followup sweep has now validated beta0p1_norm_twr3_jsd0, which is the highest-performing tested setting overall (mean alignment 0.4821). Top untested candidates from effect projection (updated for the expanded dataset) are:

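- beta0p1_norm_twr2_jsd1 (projected alignment 0.4807, projected accuracy 0.9947)
- beta0p1_norm_twr2_jsd0p25 (projected alignment 0.4205, projected accuracy 0.9948)
- beta0p1_norm_twr2_jsd0p1 (projected alignment 0.4084, projected accuracy 0.9948)
- beta0p1_norm_twr1_jsd0 (projected alignment 0.4079, projected accuracy 0.9941)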
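The projected alignment values listed above are numerically consistent with linear interpolation or extrapolation between the observed setting means. This is an inference from the numbers, not a documented method, so the sketch below should be read as an assumption about how the projection may have been computed. It uses the rounded means from sections 1 and 2 and agrees with the listed projections to within about 1e-4.

```python
def linear_projection(x0, y0, x1, y1, x_new):
    """Project y at x_new along the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x_new - x0)


# Observed mean alignment at jsd=0 and jsd=0.5 (both with norm, twr2).
JSD_ANCHORS = (0.0, 0.4004, 0.5, 0.4405)
# Observed mean alignment at twr=0 and twr=2 (both with norm, jsd0).
TWR_ANCHORS = (0.0, 0.4154, 2.0, 0.4004)

projections = {
    "beta0p1_norm_twr2_jsd1": linear_projection(*JSD_ANCHORS, 1.0),
    "beta0p1_norm_twr2_jsd0p25": linear_projection(*JSD_ANCHORS, 0.25),
    "beta0p1_norm_twr2_jsd0p1": linear_projection(*JSD_ANCHORS, 0.1),
    "beta0p1_norm_twr1_jsd0": linear_projection(*TWR_ANCHORS, 1.0),
}
for name, value in projections.items():
    print(f"{name}: {value:.5f}")
```

If the projection is indeed linear, the jsd1 candidate is a twofold extrapolation beyond the tested range from a contrast whose CI spans zero, so its projected alignment carries at least as much uncertainty as the JSD contrast itself.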

Note: beta0p03_norm_twr2_jsd0 (lower beta, followup) achieved mean alignment 0.3667, below the 0.4004 of its beta=0.1 counterpart beta0p1_norm_twr2_jsd0. Lowering beta does not appear to help at this TWR level.

Caption: Candidate-priority chart from observed effects. beta0p1_norm_twr3_jsd0 is now marked as tested and is the top actual performer.

Caption: Reconstruction-vs-alignment tradeoff diagnostic by task and adaptation mode.

Cross-Task Robustness

Accuracy remains far above 0.95 for all competitive settings, so the binding constraint is alignment consistency across tasks. beta0p1_norm_twr3_jsd0 is now the strongest default, winning 3/6 tasks and achieving the highest overall mean alignment (0.4821). Task-specific exceptions persist: base_conversion favors beta0p1_norm_twr0_jsd0, list_summation favors beta0p1_norm_twr2_jsd0p5, and multidigit_addition favors beta0p1_norm_twr2_jsd0. The TWR=2 matched contrast no longer shows a reliable advantage over TWR=0; the prior positive finding does not replicate with the larger sample (n=18 pairs, up from 15).

Final Recommendation

Best-performing tested setting (overall mean alignment): beta0p1_norm_twr3_jsd0 (0.4821)
Best setting in the predefined ladder: beta0p1_norm_twr2_jsd0p5 (0.4405)
Interventions verdict (updated):
- normalization: works (robust, unchanged)
- TWR (twr2): inconclusive (matched-pair CI spans zero with extended n; twr3 outperforms)
- JSD (0.5): inconclusive (positive direction but CI spans zero)
- beta0p1_norm_twr3_jsd0: strong positive finding from followup — new recommended default

Next sweep proposal:

Prioritize nonzero-JSD variants with twr3: beta0p1_norm_twr3_jsd0p25, beta0p1_norm_twr3_jsd0p5
High-projection untested setting: beta0p1_norm_twr2_jsd1
Confirm base_conversion exception: rerun with wider adaptation coverage

References

Notebook: path:experiments/synthetic_sequences/disentangled/analysis/notebooks/08_apr4_discrete_cvae_7loss_analysis.ipynb
Aggregate dir: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/
Summary data: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/summary.csv
Executive stats: path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260407/executive_summary_stats.json
Prior analysis (6-setting baseline): path:experiments/synthetic_sequences/disentangled/analysis/aggregates/exp2_discrete_cvae-Apr3-7loss_20260403/
