In-context learning (ICL), a key capability of modern large language models (LLMs), allows transformers to adapt their behavior based on examples provided in the input prompt. Few-shot prompting, in which several task examples are included in the prompt, elicits the desired behavior without any update to the model's weights. But how do transformers achieve this adaptation? This article explores potential mechanisms behind ICL.
The core question of ICL is: given example pairs (x, y) in the prompt, can attention mechanisms learn an algorithm that maps a new query x to its output y?
The standard softmax attention formula is

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where the rows of $Q$, $K$, and $V$ are the query, key, and value vectors and $d_k$ is the key dimension.
Introducing an inverse temperature parameter $c$ modifies how attention is allocated. For a single query $q$, the weight placed on the $i$-th key-value pair becomes

$$\alpha_i = \frac{\exp\!\left(c\, q^\top k_i\right)}{\sum_j \exp\!\left(c\, q^\top k_j\right)}, \qquad \hat{y} = \sum_i \alpha_i v_i.$$
As $c \to \infty$, the weights collapse to a one-hot vector on the most similar key, so attention reduces to a nearest neighbor lookup over the demonstrations. With finite $c$, attention resembles Gaussian kernel smoothing: for normalized vectors, the dot product $q^\top k_i$ differs from $-\tfrac{1}{2}\lVert q - k_i\rVert^2$ only by a constant, so the weights are Gaussian in the distance between query and key. This suggests ICL might implement a nearest neighbor (or, more generally, kernel regression) algorithm on the input-output pairs in the prompt.
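To make this concrete, here is a minimal NumPy sketch (not from the original article) of temperature-scaled attention over key-value demonstration pairs; the names `attention_predict`, `keys`, and `values` are illustrative choices. As `c` grows, the output approaches an explicit nearest neighbor lookup.

```python
import numpy as np

def attention_predict(q, keys, values, c=1.0):
    """Softmax attention over (key, value) demonstration pairs with inverse temperature c."""
    scores = c * (keys @ q)                   # scaled similarity of the query to each key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # weighted average of the stored values

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))                # 8 demonstration inputs in R^4
values = rng.normal(size=8)                   # their scalar outputs
q = rng.normal(size=4)                        # a new query

nearest = values[np.argmax(keys @ q)]               # explicit nearest neighbor answer
print(attention_predict(q, keys, values, c=1.0))    # kernel-smoothing-like average
print(attention_predict(q, keys, values, c=100.0))  # nearly identical to `nearest`
print(nearest)
```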
Understanding how transformers learn algorithms (like nearest neighbor) opens doors for AutoML. Hollmann et al. trained a transformer (TabPFN) on a large collection of synthetic datasets so that, given a new tabular dataset in the prompt, it produces predictions in a single forward pass, effectively collapsing the usual model-selection and hyperparameter-tuning pipeline.
Anthropic's 2022 research (Olsson et al.) suggests "induction heads" as another mechanism. These pairs of attention heads work together to copy and complete patterns seen earlier in the context: given "...A, B ... A", they attend back to the earlier occurrence of A and predict that "B" follows again.
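The following toy Python function (a purely illustrative stand-in, not Anthropic's circuit-level mechanism) mimics this copy-and-complete behavior: it finds the most recent earlier occurrence of the final token and predicts whatever followed it.

```python
def induction_completion(tokens):
    """Toy induction-head behavior: locate the previous occurrence of the final
    token and predict the token that followed it last time."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the earlier context right to left
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so there is no pattern to complete

print(induction_completion(["A", "B", "C", "D", "A"]))  # prints "B"
```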
Recent studies (Garg et al. 2022; von Oswald et al. 2023) link transformers' ICL to gradient descent. Linear attention, which omits the softmax operation,

$$\hat{y} = \sum_i \left(q^\top k_i\right) v_i,$$
resembles preconditioned gradient descent (PGD), whose update rule is

$$w_{t+1} = w_t - \eta\, P\, \nabla_w L(w_t).$$

For the in-context least-squares loss $L(w) = \tfrac{1}{2}\sum_i (w^\top x_i - y_i)^2$ with $w_0 = 0$, a single PGD step yields the prediction $\hat{y} = w_1^\top x_q = \eta \sum_i y_i \left(x_i^\top P\, x_q\right)$, which has exactly the linear attention form with keys $x_i$, values $y_i$, and query $P x_q$.
One layer of linear attention performs one PGD step.
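As a sanity check, here is a small NumPy sketch (my own, not code from the cited papers) verifying this correspondence for the special case $P = I$: linear attention over the demonstration pairs gives the same prediction as one gradient step from $w = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 16
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ rng.normal(size=d)             # in-context targets y_i from a random linear map
x_q = rng.normal(size=d)               # query input
eta = 0.1                              # learning rate / attention scaling

# One gradient descent step on L(w) = 0.5 * sum_i (w^T x_i - y_i)^2, starting from w = 0
grad_at_zero = -(X.T @ y)              # gradient of L at w = 0 is -sum_i y_i x_i
w_1 = -eta * grad_at_zero              # w_1 = eta * sum_i y_i x_i
gd_prediction = w_1 @ x_q

# Linear attention with keys x_i, values y_i, query x_q (no softmax)
attn_prediction = eta * np.sum((X @ x_q) * y)

print(gd_prediction, attn_prediction)  # the two numbers agree up to floating-point error
```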
Attention mechanisms can implement learning algorithms, enabling ICL by learning from demonstration pairs. While the interplay of multiple attention layers and MLPs is complex, research sheds light on ICL's mechanics. This article offers a high-level overview of these insights.
Further Reading:

- Olsson et al. (2022), "In-context Learning and Induction Heads," Anthropic.
- Garg et al. (2022), "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes."
- von Oswald et al. (2023), "Transformers Learn In-Context by Gradient Descent."
- Hollmann et al. (2023), "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second."
This article is inspired by Fall 2024 graduate coursework at the University of Michigan. Any errors are solely the author's.