Abstract Structure

  • A (large) language model defines a distribution $p_\theta(y \mid x)$ over response sequences $y$ given a prompt or task $x$.
  • This distribution is an entangled mixture over high-level strategies.
  • For example, given the prompt “Prove [example of simple math statement]”, the language model may follow a number of possible strategies, including [], [], or []. Under each strategy, the model can generate different instantiations that differ in low-level details such as phrasing or format.
  • It is useful to distinguish between high-level, abstract strategies and low-level instantiations of a given strategy.
  • It is useful to think of such abstract strategies as analogous to distributional “modes”: regions of response space that are geometrically connected in the sense that they share abstract high-level features.
  • It is in this sense that the language model $p_\theta(y \mid x)$ is a mixture over multiple distributional modes.
  • In this work, we study the problem of learning a disentangled factorization of language models that reveals their latent abstract modes and enables different types of direct interventions, which ultimately has applications to interpretability, safety, steerability, and active exploration.
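The mixture-over-modes view above can be written as $p_\theta(y \mid x) = \sum_z p(z \mid x)\, p(y \mid z, x)$, where $z$ indexes a latent high-level strategy. The following is a minimal toy sketch of that factorization; the strategy names, weights, and instantiations are illustrative placeholders, not taken from the text.

```python
# Toy sketch of a response distribution as a mixture over latent
# high-level strategies ("modes"):
#   p(y | x) = sum_z p(z | x) * p(y | z, x)
# All names and numbers below are hypothetical, for illustration only.

STRATEGY_WEIGHTS = {          # p(z | x) for a fixed prompt x
    "induction": 0.5,
    "contradiction": 0.3,
    "direct": 0.2,
}

INSTANTIATIONS = {            # support of p(y | z, x), assumed uniform
    "induction": ["Base case: ...", "We induct on n: ..."],
    "contradiction": ["Suppose not: ...", "Assume the negation: ..."],
    "direct": ["By definition, ..."],
}

def marginal(y: str) -> float:
    """Marginal probability p(y | x) of a response y under the mixture,
    assuming p(y | z, x) is uniform over each strategy's instantiations."""
    total = 0.0
    for z, weight in STRATEGY_WEIGHTS.items():
        responses = INSTANTIATIONS[z]
        if y in responses:
            total += weight * (1.0 / len(responses))
    return total
```

The point of the sketch is that $p_\theta(y \mid x)$ entangles the two levels: sampling a response mixes the choice of strategy $z$ with the choice of instantiation, and a disentangled factorization would expose $z$ explicitly so it can be inspected or intervened on.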