The Math Behind In-Context Learning
In-context learning (ICL), a key capability of modern large language models (LLMs), allows transformers to adapt to a task from examples supplied in the input prompt. Few-shot prompting, in which the prompt contains several task examples, demonstrates the desired behavior without updating any model weights. But how do transformers achieve this adaptation? This article explores potential mechanisms behind ICL.
The core question of ICL is: given example pairs (x, y) in the prompt, can attention mechanisms learn an algorithm that maps a new query x to its output y?
Softmax Attention and Nearest Neighbor Search
The softmax attention formula is:
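In its standard scaled dot-product form (notation: Q, K, and V are the query, key, and value matrices, and d_k is the key dimension):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$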
Introducing an inverse temperature parameter, c, modifies the attention allocation:
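Written for a single query vector q attending over keys k_i and values v_i, with c scaling the similarity scores (this placement of c is one common convention; the article's exact notation may differ):

$$\mathrm{attn}(q) \;=\; \sum_i \frac{\exp\!\left(c\, q^{\top} k_i\right)}{\sum_j \exp\!\left(c\, q^{\top} k_j\right)}\, v_i$$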
As c approaches infinity, the attention weights become a one-hot vector that focuses solely on the most similar token, effectively a nearest neighbor search. With finite c, attention resembles Gaussian kernel smoothing. This suggests ICL might implement a nearest neighbor algorithm over the prompt's input-output pairs.
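A minimal NumPy sketch of this limit (the data and variable names are illustrative, not from the article): as c grows, the softmax weights concentrate on the key most similar to the query, so the attention output converges to that key's value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
keys = rng.normal(size=(n, d))      # in-context inputs x_i
values = rng.normal(size=(n, 1))    # in-context outputs y_i
query = rng.normal(size=(d,))       # new input x

def softmax_attention(query, keys, values, c=1.0):
    """Attention output sum_i softmax(c * <q, k_i>) * v_i."""
    logits = c * keys @ query
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values

nearest = values[np.argmax(keys @ query)]   # nearest neighbor by dot-product similarity
for c in (1.0, 10.0, 100.0):
    out = softmax_attention(query, keys, values, c)
    print(f"c={c:6.1f}  attention={out.ravel()}  nearest-neighbor={nearest.ravel()}")
# As c increases, the attention output approaches the nearest neighbor's value.
```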
Implications and Further Research
Understanding how transformers learn algorithms (like nearest neighbor) opens doors for AutoML. Hollmann et al. demonstrated training a transformer on synthetic datasets to learn the entire AutoML pipeline, predicting optimal models and hyperparameters from new data in a single pass.
Anthropic's 2022 research suggests "induction heads" as one such mechanism. These are pairs of attention heads that copy and complete patterns: given a sequence like "...A, B...A", they attend back to the earlier occurrence of "A" and predict "B", the token that followed it.
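This copy-and-complete behavior can be written down as a tiny algorithm. The sketch below is illustrative Python, not an actual attention head: it predicts the next token by finding the most recent earlier occurrence of the current token and copying what followed it.

```python
def induction_prediction(tokens):
    """Copy-and-complete: predict the token that followed the most recent
    earlier occurrence of the current (last) token, as an induction head
    is thought to do via attention."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence to copy from

print(induction_prediction(list("xAByzA")))  # -> 'B'
```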
Recent studies (Garg et al. 2022; von Oswald et al. 2023) link transformers' ICL to gradient descent. Linear attention omits the softmax operation:
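Using the same Q, K, V notation as above, one common way to write linear attention (some formulations keep the 1/√d_k scaling, others drop it) is:

$$\mathrm{LinAttention}(Q, K, V) = \left(Q K^{\top}\right) V$$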
This update resembles one step of preconditioned gradient descent (PGD):
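A PGD step on a loss L(W), with step size η and preconditioning matrix P, is

$$W \leftarrow W - \eta\, P\, \nabla_W L(W),$$

and in the in-context linear-regression setting studied in these papers the loss is the least-squares objective over the prompt's example pairs,

$$L(W) = \tfrac{1}{2} \sum_i \left\| W x_i - y_i \right\|^2 .$$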
One layer of linear attention performs one PGD step.
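As a concrete check, here is a small NumPy sketch of the simplest case of this equivalence: one-dimensional outputs, initialization W = 0, and identity preconditioner, so the PGD step is a plain gradient step. The construction follows the spirit of von Oswald et al.; the variable names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 16
X = rng.normal(size=(n, d))        # in-context inputs x_i
w_true = rng.normal(size=(d,))
y = X @ w_true                     # in-context targets y_i
x_q = rng.normal(size=(d,))        # query input
eta = 0.1                          # step size / attention scale

# One gradient step on L(w) = 0.5 * sum_i (w.x_i - y_i)^2, starting from w = 0.
grad_at_zero = -X.T @ y            # gradient of L at w = 0
w_one_step = -eta * grad_at_zero   # w after one step: eta * sum_i y_i * x_i
pred_gd = w_one_step @ x_q

# Linear attention: query x_q, keys x_i, values y_i, no softmax, scaled by eta.
pred_attn = eta * (X @ x_q) @ y    # eta * sum_i <x_q, x_i> * y_i

print(pred_gd, pred_attn)          # identical up to floating-point error
```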
Conclusion
Attention mechanisms can implement learning algorithms, enabling ICL by learning from demonstration pairs. While the interplay of multiple attention layers and MLPs is complex, research sheds light on ICL's mechanics. This article offers a high-level overview of these insights.
Further Reading:
- In-context Learning and Induction Heads (Olsson et al., 2022)
- What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al., 2022)
- Transformers Learn In-Context by Gradient Descent (von Oswald et al., 2023)
- Transformers learn to implement preconditioned gradient descent for in-context learning (Ahn et al., 2023)
Acknowledgment
This article is inspired by Fall 2024 graduate coursework at the University of Michigan. Any errors are solely the author's.