
Exp1 Gaussian Mixture: Larger Beta Sweep (Concise Report)

Artifact source: path:experiments/gaussian_mixture/demos/artifacts/exp1_gaussian_mixture_high_density-larger-beta-sweep

Methodology

Notation

  • $x \in \mathbb{R}^d$ is an observed sample and $z \in \{1, \ldots, K\}$ is a latent mode.
  • $\mathbb{E}$ denotes expectation, $\mathrm{KL}$ denotes KL divergence, and $\mathrm{MSE}$ denotes mean squared error.
  • $\mathcal{N}$ denotes a Gaussian distribution and $\mathrm{Unif}$ denotes a uniform categorical prior.

Synthetic data is sampled from a labeled Gaussian mixture: $z \sim \mathrm{Cat}(\pi)$, $x \mid z = k \sim \mathcal{N}(\mu_k, \sigma^2 I_d)$, $x \in \mathbb{R}^d$. The larger sweep uses $K=3$, $d=2$, $N=240$, and $\beta \in \{0.10, 0.25, 0.50, 1.00, 2.00, 5.00\}$.
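The data-generating process above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's actual generator: the placement of the component means is assumed (here, on a circle whose radius is the separation parameter), and the helper name `sample_gaussian_mixture` is hypothetical.

```python
import numpy as np

def sample_gaussian_mixture(n=240, k=3, d=2, std=0.7, separation=4.0, seed=17):
    """Sample (x, z) from a uniform K-component isotropic Gaussian mixture."""
    rng = np.random.default_rng(seed)
    # Assumed mean placement: K means spaced evenly on a circle (d = 2).
    angles = 2 * np.pi * np.arange(k) / k
    means = separation * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    z = rng.integers(0, k, size=n)                 # z ~ Cat(uniform)
    x = means[z] + std * rng.standard_normal((n, d))  # x | z ~ N(mu_z, std^2 I)
    return x, z

x, z = sample_gaussian_mixture()
print(x.shape, z.shape)  # (240, 2) (240,)
```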

Model definitions (priors + objectives)

Both variants optimize $L(x) = L_\mathrm{recon}(x) + \beta L_\mathrm{kl}(x)$.

  • Discrete latent VAE

    • Posterior: $q_\phi(z = k \mid x) = \operatorname{softmax}(f_\phi(x))_k$, $k \in \{1, \ldots, K\}$.
    • Prior: uniform categorical $p(z) = \mathrm{Unif}([K])$.
    • Implementation note: training uses exact softmax-weighted expectation over categories (no Gumbel-softmax sampling).
    • Objective: $L_\mathrm{disc}(x) = \sum_{k=1}^{K} q_\phi(z = k \mid x)\, \mathrm{MSE}(x, \hat{x}_k) + \beta\, \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$.
    • Reported history/summary metrics: loss, recon, kl, nmi, clustering_accuracy.
  • Continuous latent VAE

    • Posterior: $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))$.
    • Prior: standard Gaussian $p(z) = \mathcal{N}(0, I)$.
    • Objective: $L_\mathrm{cont}(x) = \mathbb{E}_{q_\phi(z \mid x)}[\mathrm{MSE}(x, \hat{x}(z))] + \beta\, \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$.
    • Reported history/summary metrics: loss, recon, kl, linear_probe_accuracy.
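Both KL terms above have simple closed forms, which a sketch makes concrete: for the discrete variant, $\mathrm{KL}(q \,\|\, \mathrm{Unif}([K])) = \log K - H(q)$; for the continuous variant, the standard diagonal-Gaussian-to-$\mathcal{N}(0, I)$ formula. This is an assumed NumPy rendering, not the repo's training code.

```python
import numpy as np

def kl_categorical_uniform(q, eps=1e-12):
    """KL(q(z|x) || Unif([K])) = sum_k q_k (log q_k + log K), per sample."""
    k = q.shape[-1]
    return np.sum(q * (np.log(q + eps) + np.log(k)), axis=-1)

def kl_diag_gaussian_standard(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form, per sample."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

q = np.array([[1/3, 1/3, 1/3], [0.9, 0.05, 0.05]])
print(kl_categorical_uniform(q))  # first entry ~0: a uniform posterior pays no KL
```

A uniform posterior incurs (near-)zero KL, which is why small $\beta$ leaves the discrete posterior free to commit to a mode while large $\beta$ pushes it back toward uniform.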

Metric definitions

  • recon: mean squared reconstruction error.
  • kl: KL divergence term in the objective.
  • nmi: normalized mutual information between predicted and true modes (discrete).
  • clustering_accuracy: permutation-invariant label alignment accuracy $\max_{\pi \in S_K} \frac{1}{N} \sum_{i} \mathbf{1}[\pi(\hat{z}_i) = z_i]$.
  • linear_probe_accuracy: holdout linear classifier accuracy from continuous latent means to true modes.
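The clustering_accuracy metric above can be sketched directly from its definition: search over all $K!$ relabelings and keep the best match. Exhaustive search is fine at $K=3$; for larger $K$ the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`) would scale better. The function name here is illustrative, not the repo's.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(z_hat, z_true, k):
    """max over permutations pi of mean(pi(z_hat) == z_true)."""
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in z_hat])
        best = max(best, float(np.mean(mapped == z_true)))
    return best

z_true = np.array([0, 0, 1, 1, 2, 2])
z_hat = np.array([2, 2, 0, 0, 1, 1])  # same clusters, permuted labels
print(clustering_accuracy(z_hat, z_true, 3))  # → 1.0
```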

Experimental Setup

| Category | Parameter | Value |
| --- | --- | --- |
| Dataset | Components ($K$) | 3 |
| Dataset | Dimension ($d$) | 2 |
| Dataset | Samples per sweep point ($N$) | 240 |
| Dataset | Component std | 0.7 |
| Dataset | Component separation | 4.0 |
| Dataset | Mixture weights | None (uniform default) |
| Training | Epochs | 20 |
| Training | Batch size | 64 |
| Training | Hidden dim | 32 |
| Training | Latent dim (continuous) | 2 |
| Training | Optimizer | Adam |
| Training | Learning rate | $10^{-3}$ |
| Sweep | $\beta$ grid | $\{0.10, 0.25, 0.50, 1.00, 2.00, 5.00\}$ |
| Reproducibility | Seed | 17 |
| Reproducibility | Device | cpu |
| Outputs | Artifact root | path:experiments/gaussian_mixture/demos/artifacts/exp1_gaussian_mixture_high_density-larger-beta-sweep |

Quantitative results

Discrete latent summary

| $\beta$ | NMI | Acc | Recon | KL final |
| --- | --- | --- | --- | --- |
| 0.10 | 0.762 | 0.729 | 6.488 | 1.020 |
| 0.25 | 0.762 | 0.729 | 6.548 | 0.895 |
| 0.50 | 0.735 | 0.717 | 6.679 | 0.670 |
| 1.00 | 0.720 | 0.704 | 7.103 | 0.359 |
| 2.00 | 0.720 | 0.704 | 7.788 | 0.035 |
| 5.00 | 0.690 | 0.700 | 7.901 | 0.001 |

Continuous latent summary

| $\beta$ | Probe Acc | Recon | KL final | Mean radius |
| --- | --- | --- | --- | --- |
| 0.10 | 1.000 | 3.272 | 5.125 | 2.935 |
| 0.25 | 1.000 | 3.521 | 3.482 | 2.451 |
| 0.50 | 1.000 | 3.909 | 2.477 | 2.062 |
| 1.00 | 1.000 | 4.895 | 1.322 | 1.533 |
| 2.00 | 1.000 | 6.505 | 0.421 | 0.875 |
| 5.00 | 0.979 | 7.827 | 0.047 | 0.271 |

Figures

Figure: Discrete training history metrics across beta.

Figure: Continuous training history metrics across beta.

Figure: Discrete metric heatmap and confusion matrices.

Figure: Reconstruction vs KL tradeoff across latent type and beta.

Key takeaways

  • Discrete mode recovery (NMI/accuracy) is strongest at lower beta, while stronger KL pressure at higher beta increases reconstruction cost.
  • Continuous latents remain highly linearly predictive of the true modes across the sweep (probe accuracy ~1.0, except for a slight drop to 0.979 at beta = 5.0).
  • Larger beta compresses the continuous latent geometry (mean radius drops from 2.935 to 0.271), consistent with stronger prior matching.