Abstract Structure
- A (large) language model defines a distribution over response sequences given a prompt or task.
- This distribution is an entangled mixture over high-level strategies.
- For example, given the prompt: “Prove [example of simple math statement]”, the language model may follow a number of possible strategies, including [], [], or []. Under each strategy, the model can generate different instantiations that differ in low-level details such as phrasing or formatting.
- It is useful to distinguish between high-level, abstract strategies and low-level instantiations of a given strategy.
- It is useful to think of such abstract strategies as analogous to distributional “modes”: regions of response space that are geometrically connected in the sense that they share abstract high-level features.
- It is in this sense that the language model is a mixture over multiple distributional modes.
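One way to make the mixture view concrete is as a latent-variable decomposition. The notation below is introduced here for illustration (the latent mode variable $z$ is an assumption, not defined above):

```latex
% Sketch: z indexes the abstract strategy (mode), y is the response, x the prompt.
% The model's response distribution is an entangled mixture over modes:
p(y \mid x) \;=\; \sum_{z} p(z \mid x)\, p(y \mid z, x)
```

Here $p(z \mid x)$ is the model's implicit weighting over high-level strategies and $p(y \mid z, x)$ generates low-level instantiations within a strategy; the factorization is latent in the model rather than directly observable.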
- In this work, we study the problem of learning a disentangled factorization of language models that reveals their latent abstract modes and enables different types of direct interventions, which ultimately have applications to interpretability, safety, steerability, and active exploration.
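A toy sketch of what a mode-level intervention could look like, using canned strings in place of a real language model (all names and distributions here are hypothetical, for illustration only):

```python
import random

# Hypothetical toy "language model": a mixture over two high-level
# proof strategies (modes), each with low-level instantiations.
modes = {
    "induction": ["Proof by induction: base case...", "We induct on n..."],
    "contradiction": ["Suppose not; then...", "Assume the negation..."],
}
mode_weights = {"induction": 0.7, "contradiction": 0.3}

def sample_response(rng, fixed_mode=None):
    """Sample (mode, response). Passing `fixed_mode` is a direct
    intervention on the latent mode; low-level variation remains."""
    if fixed_mode is None:
        z = rng.choices(list(mode_weights), weights=mode_weights.values())[0]
    else:
        z = fixed_mode
    return z, rng.choice(modes[z])

rng = random.Random(0)
print(sample_response(rng))                              # entangled sampling
print(sample_response(rng, fixed_mode="contradiction"))  # mode intervention
```

In a real language model the mode variable is not exposed, which is precisely why a learned disentangled factorization is needed before such interventions become possible.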