Abstract Structure

  • A (large) language model defines a distribution $p_\theta(y \mid x)$ over response sequences $y$ given a prompt or task $x$.
  • This distribution is an entangled mixture over high-level strategies.
  • For example, given the prompt “Prove [example of simple math statement]”, the language model may follow a number of possible strategies, including [], [], or []. Under each strategy, the model can generate different instantiations that differ in low-level details such as phrasing or format.
  • It is useful to distinguish between high-level, abstract strategies and low-level instantiations of a given strategy.
  • It is useful to think of such abstract strategies as analogous to distributional “modes”: regions of response space that are geometrically connected in the sense that they share abstract high-level features.
  • It is in this sense that the language model $p_\theta(y \mid x)$ is a mixture over multiple distributional modes.
  • In this work, we study the problem of learning a disentangled factorization of language models that reveals their latent abstract modes and enables different types of direct interventions, which ultimately has applications to interpretability, safety, steerability, and active exploration.
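The mixture-over-modes view above can be written as $p_\theta(y \mid x) = \sum_z p(z \mid x)\, p(y \mid z, x)$, where $z$ indexes a latent high-level strategy. The following is a minimal toy sketch of that factorization; the strategy names, weights, and instantiations are illustrative placeholders, not taken from the text.

```python
# Toy sketch of a response distribution as a mixture over latent
# high-level strategies ("modes"):
#   p(y | x) = sum_z p(z | x) * p(y | z, x)
# All names and numbers below are hypothetical, for illustration only.

STRATEGY_WEIGHTS = {          # p(z | x) for a fixed prompt x
    "induction": 0.5,
    "contradiction": 0.3,
    "direct": 0.2,
}

INSTANTIATIONS = {            # support of p(y | z, x), assumed uniform
    "induction": ["Base case: ...", "We induct on n: ..."],
    "contradiction": ["Suppose not: ...", "Assume the negation: ..."],
    "direct": ["By definition, ..."],
}

def marginal(y: str) -> float:
    """Marginal probability p(y | x) of a response y under the mixture,
    assuming p(y | z, x) is uniform over each strategy's instantiations."""
    total = 0.0
    for z, weight in STRATEGY_WEIGHTS.items():
        responses = INSTANTIATIONS[z]
        if y in responses:
            total += weight * (1.0 / len(responses))
    return total
```

The point of the sketch is that $p_\theta(y \mid x)$ entangles the two levels: sampling a response mixes the choice of strategy $z$ with the choice of instantiation, and a disentangled factorization would expose $z$ explicitly so it can be inspected or intervened on.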