In-context learning (ICL), a key capability of modern large language models (LLMs), allows transformers to adapt their behavior based on examples provided in the input prompt. Few-shot prompting, in which several task examples are included in the prompt, elicits the desired behavior without any update to the model's weights. But how do transformers achieve this adaptation? This article explores potential mechanisms behind ICL.
The core question of ICL is: given example pairs (x, y) in the prompt, can attention mechanisms learn an algorithm that maps a new query x to its output y?
The standard softmax attention formula is

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where the rows of $Q$, $K$, and $V$ are the query, key, and value vectors and $d_k$ is the key dimension.
Introducing an inverse temperature parameter $c$ modifies how attention is allocated. For a single query $q$, the weight placed on the $i$-th key-value pair becomes

$$\alpha_i = \frac{\exp\!\left(c\, q^\top k_i\right)}{\sum_j \exp\!\left(c\, q^\top k_j\right)}, \qquad \hat{y} = \sum_i \alpha_i v_i.$$
As $c \to \infty$, the weights collapse to a one-hot vector on the most similar key, so attention reduces to a nearest neighbor lookup over the demonstrations. With finite $c$, attention resembles Gaussian kernel smoothing: for normalized vectors, the dot product $q^\top k_i$ differs from $-\tfrac{1}{2}\lVert q - k_i\rVert^2$ only by a constant, so the weights are Gaussian in the distance between query and key. This suggests ICL might implement a nearest neighbor (or, more generally, kernel regression) algorithm on the input-output pairs in the prompt.
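To make this concrete, here is a minimal NumPy sketch (not from the original article) of temperature-scaled attention over key-value demonstration pairs; the names `attention_predict`, `keys`, and `values` are illustrative choices. As `c` grows, the output approaches an explicit nearest neighbor lookup.

```python
import numpy as np

def attention_predict(q, keys, values, c=1.0):
    """Softmax attention over (key, value) demonstration pairs with inverse temperature c."""
    scores = c * (keys @ q)                   # scaled similarity of the query to each key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # weighted average of the stored values

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))                # 8 demonstration inputs in R^4
values = rng.normal(size=8)                   # their scalar outputs
q = rng.normal(size=4)                        # a new query

nearest = values[np.argmax(keys @ q)]               # explicit nearest neighbor answer
print(attention_predict(q, keys, values, c=1.0))    # kernel-smoothing-like average
print(attention_predict(q, keys, values, c=100.0))  # nearly identical to `nearest`
print(nearest)
```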
Understanding how transformers learn algorithms (like nearest neighbor) opens doors for AutoML. Hollmann et al. trained a transformer (TabPFN) on a large collection of synthetic datasets so that, given a new tabular dataset in the prompt, it produces predictions in a single forward pass, effectively collapsing the usual model-selection and hyperparameter-tuning pipeline.
Anthropic's 2022 research (Olsson et al.) suggests "induction heads" as another mechanism. These pairs of attention heads work together to copy and complete patterns seen earlier in the context: given "...A, B ... A", they attend back to the earlier occurrence of A and predict that "B" follows again.
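The following toy Python function (a purely illustrative stand-in, not Anthropic's circuit-level mechanism) mimics this copy-and-complete behavior: it finds the most recent earlier occurrence of the final token and predicts whatever followed it.

```python
def induction_completion(tokens):
    """Toy induction-head behavior: locate the previous occurrence of the final
    token and predict the token that followed it last time."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the earlier context right to left
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so there is no pattern to complete

print(induction_completion(["A", "B", "C", "D", "A"]))  # prints "B"
```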
Recent studies (Garg et al. 2022; von Oswald et al. 2023) link transformers' ICL to gradient descent. Linear attention, which omits the softmax operation,

$$\hat{y} = \sum_i \left(q^\top k_i\right) v_i,$$
resembles preconditioned gradient descent (PGD), whose update rule is

$$w_{t+1} = w_t - \eta\, P\, \nabla_w L(w_t).$$

For the in-context least-squares loss $L(w) = \tfrac{1}{2}\sum_i (w^\top x_i - y_i)^2$ with $w_0 = 0$, a single PGD step yields the prediction $\hat{y} = w_1^\top x_q = \eta \sum_i y_i \left(x_i^\top P\, x_q\right)$, which has exactly the linear attention form with keys $x_i$, values $y_i$, and query $P x_q$.
One layer of linear attention performs one PGD step.
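As a sanity check, here is a small NumPy sketch (my own, not code from the cited papers) verifying this correspondence for the special case $P = I$: linear attention over the demonstration pairs gives the same prediction as one gradient step from $w = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 16
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ rng.normal(size=d)             # in-context targets y_i from a random linear map
x_q = rng.normal(size=d)               # query input
eta = 0.1                              # learning rate / attention scaling

# One gradient descent step on L(w) = 0.5 * sum_i (w^T x_i - y_i)^2, starting from w = 0
grad_at_zero = -(X.T @ y)              # gradient of L at w = 0 is -sum_i y_i x_i
w_1 = -eta * grad_at_zero              # w_1 = eta * sum_i y_i x_i
gd_prediction = w_1 @ x_q

# Linear attention with keys x_i, values y_i, query x_q (no softmax)
attn_prediction = eta * np.sum((X @ x_q) * y)

print(gd_prediction, attn_prediction)  # the two numbers agree up to floating-point error
```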
Attention mechanisms can implement learning algorithms, enabling ICL by learning from demonstration pairs. While the interplay of multiple attention layers and MLPs is complex, research sheds light on ICL's mechanics. This article offers a high-level overview of these insights.
Further Reading:

- Olsson et al. (2022), "In-context Learning and Induction Heads," Anthropic.
- Garg et al. (2022), "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes."
- von Oswald et al. (2023), "Transformers Learn In-Context by Gradient Descent."
- Hollmann et al. (2023), "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second."
This article is inspired by Fall 2024 graduate coursework at the University of Michigan. Any errors are solely the author's.