Simplify Vincent diagram prompt, LLM model generates high-quality images-AI-php.cn

The diffusion model has become a mainstream text-to-image generation model, which can guide the generation of high-quality and content-rich images through text prompts

If the input prompts are too Simplicity, existing models have limitations in semantic understanding and common sense reasoning, which will lead to a significant decrease in the quality of the generated images

Lin Liang’s team from the HCP Laboratory of Sun Yat-sen University proposed a A simple yet effective fine-tuning method called SUR-adapter is designed to improve the model's ability to understand narrative cues. This method is a semantic understanding and inference adapter for pre-trained diffusion models and is parameter-efficient

Simplify Vincent diagram prompt, LLM model generates high-quality images

Please click the link below View the paper: https://arxiv.org/abs/2305.05189

Open source address: https://github.com/Qrange-group/SUR-adapter

To achieve this goal, the researchers first collected and annotated a dataset called SURD. This dataset contains over 57,000 multi-modal samples, each containing a simple narrative prompt, a complex keyword-based prompt, and a high-quality image

Researchers align the semantic representation of narrative prompts with complex prompts and transfer the knowledge of large language models (LLM) to SUR adapters through knowledge distillation to be able to obtain strong semantic understanding and reasoning capabilities to build high-quality texts Semantic representations for text-to-image generation. Then, they aligned the semantic representation of the narrative prompts with the complex prompts and transferred the knowledge of the large language model (LLM) to the SUR adapter through knowledge distillation to be able to obtain strong semantic understanding and reasoning capabilities to build high-quality textual semantic representations. For text-to-image generation

Simplify Vincent diagram prompt, LLM model generates high-quality images

We experimented by integrating multiple LLMs and pre-trained diffusion models and found that this method can effectively make the diffusion model Understand and reason about concise natural language descriptions without degrading image quality

This approach can make text-to-image diffusion models easier to use, provide a better user experience, and further Advancing the development of user-friendly text-to-image generative models and bridging the semantic gap between simple narrative prompts and keyword-based prompts

Background Introduction

Currently, the text-to-image pre-training model represented by stable diffusion has become one of the most important basic models in the field of artificial intelligence generated content, playing an important role in tasks such as image editing, video generation, and 3D object generation. Important role

At present, the semantic ability of these pre-trained diffusion models mainly depends on the text encoder (such as CLIP), and its semantic understanding ability directly affects the generation effect of the diffusion model

This article first tests the image-text matching accuracy of Stable diffusion by constructing common question categories in the visual question answering task (VQA), such as "counting", "color" and "action". We will manually count and conduct testing

The following are examples of constructing various prompts, see the table below for details

Simplify Vincent diagram prompt, LLM model generates high-quality images

According to the results shown in the table below, the article reveals that the current Vincent graph pre-trained diffusion model has serious semantic understanding problems. The image-text matching accuracy of a large number of questions is less than 50%, and even in some questions, the accuracy is only 0%

Simplify Vincent diagram prompt, LLM model generates high-quality images

In order to obtain matching text To generate conditioned images, we need to find ways to enhance the semantic capabilities of our encoder in the pre-trained diffusion model

Overview of methods

Rewritten content: 1. Data preprocessing

First, we can start from the commonly used diffusion model online websites lexica.art, civitai.com and stablediffusionweb Get a large number of image-text pairs. Then, we need to clean and filter these data to obtain more than 57,000 high-quality triplet data (including complex prompts, simple prompts, and pictures) and form it into a SURD dataset

Simplify Vincent diagram prompt, LLM model generates high-quality images

As shown in the figure below, complex prompts refer to the text prompt conditions required by the diffusion model when generating images. Usually these prompts have complex formats and descriptions. A simple prompt is a text description of an image generated through BLIP. It uses a language format that is consistent with human description

Generally speaking, a simple prompt that is consistent with normal human language description is difficult for the diffusion model to generate Images that are semantically adequate, and complex cues (what users jokingly call the diffusion model's "mantra") can achieve satisfactory results

#What needs to be rewritten is :2. Semantic distillation of large language models

This article introduces a method of using the Adapter of the Transformer structure to distill the semantic features of large language models in specific hidden layers. And by linearly combining the large language model information guided by the Adapter with the semantic features output by the original text encoder, the final semantic features are obtained

The large language model uses LLaMA of different sizes model, and the parameters of the UNet part of the diffusion model are frozen during the entire training process

Simplify Vincent diagram prompt, LLM model generates high-quality images

The content that needs to be rewritten is :3. Image quality restoration

In order to keep the original meaning unchanged, the content needs to be rewritten into Chinese: Since the structure of this article introduces learnable modules in the pre-training large model inference process, it destroys the original image generation quality of the pre-training model to a certain extent. Therefore, it is necessary to bring the quality of image generation back to the generation quality level of the original pre-training model

Simplify Vincent diagram prompt, LLM model generates high-quality images

This paper uses triples in the SURD dataset and introduces the corresponding quality loss function during the training process to restore the quality of image generation. Specifically, this article hopes that the semantic features obtained through the new module can be as aligned as possible with the semantic features of complex cues

The following figure shows the effect of SUR-adapter on the pre-trained diffusion model fine-tuning framework. The right side is the network structure of the Adapter

Simplify Vincent diagram prompt, LLM model generates high-quality images

Experimental results

##For SUR-adapter Performance, this article analyzes the performance from two aspects: semantic matching and image quality

On the one hand, according to the following table, SUR-adapter can effectively solve common semantics in the Vincentian graph diffusion model Mismatch problem,applicable to different experimental settings. Under different categories of semantic criteria, the accuracy has also been improved to a certain extent

On the other hand, this paper uses common image quality evaluation indicators such as BRISQUE to compare the original pretrain diffusion model and After using the SUR-adapter diffusion model to perform statistical tests on the quality of the images generated, we can find that there is no significant difference between the two.

We also conducted a human preference questionnaire test

Through the above analysis, it can be concluded that the proposed method can Alleviating the inherent image-text mismatch problem from pre-trained text to image while maintaining the quality of image generation

Simplify Vincent diagram prompt, LLM model generates high-quality images

We can also demonstrate qualitatively through the following image generated examples. For more detailed analysis and details, please refer to this article and the open source warehouse

The content that needs to be rewritten is:

Simplify Vincent diagram prompt, LLM model generates high-quality images

Introduction to HCP Laboratory

Professor Lin Liang founded Sun Yat-sen University’s Human-Machine-Object Intelligent Fusion Laboratory (HCP Lab) in 2010. In recent years, the laboratory has achieved rich academic results in the fields of multimodal content understanding, causal and cognitive reasoning, and embodied intelligence. The laboratory has won many domestic and foreign science and technology awards and best paper awards, and is committed to developing product-level artificial intelligence technology and platforms

The above is the detailed content of Simplify Vincent diagram prompt, LLM model generates high-quality images. For more information, please follow other related articles on the PHP Chinese website!