The visual talent of large language models: GPT can also solve visual tasks through contextual learning-AI-php.cn

The visual talent of large language models: GPT can also solve visual tasks through contextual learning

王林

Release： 2023-07-14 15:37:06

forward

1515 people have browsed it

Currently, large language models (LLM) have set off a wave of changes in the field of natural language processing (NLP). We see that LLM has strong emergence capabilities and performs well on complex language understanding tasks, generation tasks and even reasoning tasks. This inspires people to further explore the potential of LLM in another subfield of machine learning - computer vision (CV).

One of the remarkable talents of LLMs is their ability to learn in context. Contextual learning does not update any parameters of the LLM, but it shows amazing results in various NLP tasks. So, can GPT solve visual tasks through contextual learning?

Recently, a paper jointly published by researchers from Google and Carnegie Mellon University (CMU) shows that as long as we can convert images (or other non-verbal modalities) Translated into a language that LLM can understand, this seems feasible.

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

Paper address: https://arxiv.org/abs/2306.17842

This paper reveals the ability of PaLM or GPT in solving visual tasks through contextual learning, and proposes a new method SPAE (Semantic Pyramid AutoEncoder). This new approach enables LLM to perform image generation tasks without any parameter updates. This is also the first successful method to use contextual learning to enable LLM to generate image content.

Let’s first take a look at the experimental effect of LLM on generating image content through context learning.

For example, by providing 50 images of handwriting in a given context, the paper asks PaLM 2 to answer a complex query that requires generating a digital image as output:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Pictures

can also generate realistic realistic images with image context input:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

In addition to generating images, through context learning, PaLM 2 can also perform image description:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning

# #There are also visual Q&A for image-related questions:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Pictures

You can even generate videos with denoising:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

Method Overview

In fact, convert the image into a language that LLM can understand , is a problem that has been studied in the Visual Transformer (ViT) paper. In this paper from Google and CMU, they take it to the next level — using actual words to represent images.

This approach is like building a tower filled with text, capturing the semantics and detail of the image. This text-filled representation allows image descriptions to be easily generated and allows LLMs to answer image-related questions and even reconstruct image pixels.

The visual talent of large language models: GPT can also solve visual tasks through contextual learning

Specifically, this research proposes to use a trained encoder and CLIP model to convert the image into a token space; and then use LLM to generate a suitable lexical tokens; finally using a trained decoder to convert these tokens back to pixel space. This ingenious process converts images into a language that LLM can understand, allowing us to exploit the generative power of LLM in vision tasks.

The visual talent of large language models: GPT can also solve visual tasks through contextual learning

Experiments and results

This study experimentally compared SPAE with SOTA methods Frozen and LQAE, and the results are shown in Table 1 below. SPAEGPT outperforms LQAE on all tasks while using only 2% of tokens.

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

Overall, testing on the mini-ImageNet benchmark shows that the SPAE method outperforms the previous SOTA The method improves performance by 25%.

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

In order to verify the effectiveness of the SPAE design method, this study conducted an ablation experiment. The experimental results are as follows Table 4 and Shown in Figure 10:

The visual talent of large language models: GPT can also solve visual tasks through contextual learning Picture

Feeling Interested readers can read the original text of the paper to learn more about the research content.

The above is the detailed content of The visual talent of large language models: GPT can also solve visual tasks through contextual learning. For more information, please follow other related articles on the PHP Chinese website!