DynRefer surpasses CVPR 2024 methods with multiple SOTA results on region-level multi-modal recognition tasks

To achieve high-precision region-level multi-modal understanding, this paper proposes a dynamic resolution scheme that simulates the human visual cognitive system.

The authors of this article are from the LAMP Laboratory of the University of Chinese Academy of Sciences. The first author, Zhao Yuzhong, is a doctoral student in the 2023 cohort at the University of Chinese Academy of Sciences, and the co-author Liu Feng is a direct-entry doctoral student in the 2020 cohort at the same university. Their main research directions are vision-language models and visual object perception.

Introduction

DynRefer significantly improves region-level multi-modal recognition by simulating the human visual cognitive process. By introducing a dynamic resolution mechanism modeled on the human eye, DynRefer can perform region recognition, region attribute detection, and region-level captioning with a single model, and it achieves SOTA performance on all of these tasks. On the region-level captioning task of the RefCOCOg dataset it reaches 115.7 CIDEr, significantly higher than CVPR 2024 methods such as RegionGPT, GLaMM, Osprey, and Alpha-CLIP.

Paper title: DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution
Paper link: https://arxiv.org/abs/2405.16071
Paper code: https://github.com/callsys/DynRefer

Motivation

Region-level multi-modal tasks aim to convert specified image regions into language descriptions consistent with human preferences. Humans have a resolution-adaptive ability when performing such tasks: the region of interest is perceived at high resolution, and the unattended surroundings at low resolution. However, current region-level multi-modal large language models often adopt a fixed-resolution encoding scheme, that is, they encode the entire image and then extract region features through RoI-Align. This approach lacks the resolution-adaptive capability of the human visual cognitive system, and it encodes the region of interest with low efficiency and limited capacity. To achieve high-precision region-level multi-modal understanding, we propose a dynamic resolution scheme that simulates the human visual cognitive system, as shown in Figure 1.

Figure 1: Comparison of traditional regional multi-modal methods (left) and the DynRefer method (right).

Method

1. Simulating a dynamic-resolution image (multi-view construction).

Since the mainstream pre-trained vision-language model (CLIP) can only accept uniform-resolution input, we simulate a dynamic-resolution image by constructing multiple uniform-resolution views. The simulated image has high resolution in the referred region and low resolution elsewhere. The process is shown in Figure 2. The original image $x$ is cropped and resized into multiple candidate views. The cropping region is calculated as $b + t\,(B - b)$, where $t \in [0, 1]$. Here $b$ represents the bounding box of the referred region, $B$ represents the size of the entire image (as a box covering the whole image), and $t$ represents the interpolation coefficient. During training, we randomly select $n$ views from the candidate views to simulate the images produced by gaze and rapid eye movements. These $n$ views correspond to the interpolation coefficients $t_1, t_2, \dots, t_n$. We always retain the view containing only the referred region (i.e., $t_1 = 0$). This view has been shown experimentally to help preserve regional details, which is crucial for all region-level multi-modal tasks.

Figure 2: DynRefer training (top) and inference (bottom).
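The multi-view construction step can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: it assumes the crop box is a linear interpolation between the region box and the full-image box, and the names `crop_box` and `sample_coefficients`, as well as any grid of candidate coefficients passed in, are hypothetical.

```python
import random

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def crop_box(region: Box, image_size: tuple[float, float], t: float) -> Box:
    """Interpolate between the region box (t=0) and the full image (t=1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    width, height = image_size
    full = (0.0, 0.0, width, height)
    # Each coordinate moves linearly from the region box toward the image box.
    return tuple(r + t * (f - r) for r, f in zip(region, full))


def sample_coefficients(candidates: list[float], n: int, rng: random.Random) -> list[float]:
    """Pick n coefficients: always keep t=0 (region-only view), sample the rest."""
    others = [t for t in candidates if t != 0.0]
    return [0.0] + sorted(rng.sample(others, n - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    for t in sample_coefficients([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], 3, rng):
        print(t, crop_box((10, 20, 50, 60), (100, 80), t))
```

Each returned box would then be cropped from the image and resized to the encoder's input resolution, so that a small `t` spends all pixels on the region and a large `t` spends them on context.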

2. Stochastic multi-view embedding.

The process is shown in Figure 3. The sampled $n$ views are encoded into spatial features by a frozen CLIP and then processed by the RoI-Align module to obtain region embeddings, i.e., $\{r_i\}_{i=1,\dots,n}$. This is shown on the left side of Figure 3. These region embeddings are not spatially aligned, owing to spatial errors introduced by cropping, resizing, and RoI-Align. Inspired by the deformable convolution operation, we propose an alignment module that reduces this bias by aligning $\{r_i\}_{i=2,\dots,n}$ to $r_1$, where $r_1$ is the region embedding of the view containing only the referred region. Each region embedding $r_i$ is first concatenated with $r_1$, and a 2D offset map is then computed through a convolutional layer. The spatial features of $r_i$ are then resampled according to the 2D offsets. Finally, the aligned region embeddings are concatenated along the channel dimension and fused through linear layers. The output is further compressed by a visual resampling module, i.e., a Q-former, which extracts a regional representation of the referred region of the original image $x$ ($x_v$ in Figure 3).

Figure 3: DynRefer network structure.
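The resampling step of the alignment module can be illustrated with a small pure-Python sketch. It is not the DynRefer implementation: in the module described above the offsets are predicted by a convolutional layer over the concatenated embeddings, whereas here the offset map is simply passed in, the feature map has a single channel, and the names `bilinear_sample` and `resample_with_offsets` are hypothetical. A real implementation would use tensor operations on multi-channel features.

```python
def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2D grid at (y, x); coordinates are clamped to the grid."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def resample_with_offsets(feat, offsets):
    """Resample each cell of feat at its own position plus a (dy, dx) offset."""
    return [
        [
            bilinear_sample(feat, i + offsets[i][j][0], j + offsets[i][j][1])
            for j in range(len(feat[0]))
        ]
        for i in range(len(feat))
    ]


if __name__ == "__main__":
    feat = [[0.0, 1.0], [2.0, 3.0]]
    shift_right = [[(0.0, 0.5)] * 2 for _ in range(2)]
    print(resample_with_offsets(feat, shift_right))
```

With an all-zero offset map the features are returned unchanged, which is the behavior one would expect when a view is already aligned with the region-only view.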

3. Vision-language alignment.

The region representation $x_v$ computed by the stochastic multi-view embedding module is decoded by three decoders, shown in Figure 3 (right), each supervised by one of three multi-modal tasks:

i) Image region tagging. We employ a lightweight query-based recognition decoder $D_{tag}$ for region tagging, shown in Figure 3 (right). Tagging is performed by computing the confidence of each predefined tag, using the tag as the query and $x_v$ as the key and value. We parse tags from the ground-truth captions to supervise the recognition decoder.

ii) Region-text contrastive learning. Like the region tagging decoder, the decoder $D_{rtc}$ is defined as a query-based recognition decoder. It computes similarity scores between captions and region features and is supervised with the SigLIP loss.

iii) Language modeling. We use a pre-trained large language model $D_{llm}$ to convert the region representation $x_v$ into a language description.
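For the region-text contrastive task, the pairwise sigmoid loss popularized by SigLIP can be sketched as follows. This is a sketch under stated assumptions, not DynRefer's training code: the features are assumed to be L2-normalized vectors, region `i` is assumed to match caption `i` within a batch, and the function name and the default `temperature` and `bias` values are illustrative choices.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid_pairwise_loss(region_feats, text_feats, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all (region, caption) pairs in a batch.

    Pairs with the same index are positives; all other pairs are negatives.
    """
    n = len(region_feats)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(region_feats[i], text_feats[j]) + bias
            # -log(sigmoid(label * logit)), written as a stable softplus.
            z = -label * logit
            total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    print(sigmoid_pairwise_loss(regions, regions))        # matched pairs
    print(sigmoid_pairwise_loss(regions, regions[::-1]))  # swapped pairs
```

Unlike a softmax-based contrastive loss, every pair is scored independently, so the loss does not require normalizing over the whole batch.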

Figure 4: Performance of the dual-view (n=2) DynRefer model on region-level multi-modal tasks under different interpolation coefficients $t = [t_1, t_2]$. View one is fixed ($t_1 = 0$); view two is randomly selected or fixed.

4. During inference, the trained DynRefer model performs multi-modal tasks on images with dynamic resolution. By adjusting the interpolation coefficients $t_1, t_2, \dots, t_n$ of the sampled $n$ views, we can obtain a region representation with dynamic-resolution characteristics. To evaluate the behavior at different dynamic resolutions, we trained a dual-view (n=2) DynRefer model and evaluated it on four multi-modal tasks. As the curves in Figure 4 show, attribute detection achieves better results with views that carry no contextual information (small $t_2$). This can be explained by the fact that such tasks often require detailed regional information. For the region-level captioning and dense captioning tasks, a context-rich view (larger $t_2$) is required to fully understand the referred region. Note that views with too much context ($t_2$ close to 1) degrade performance on all tasks because they introduce too much region-irrelevant information.

When the task type is known, we can sample appropriate views based on the task's characteristics. When the task type is unknown, we first construct a set of candidate views under different interpolation coefficients $t$. From the candidate set, $n$ views are sampled via a greedy search algorithm. In the objective function of the search, $t_i$ represents the interpolation coefficient of the $i$-th view, pHASH(·) represents the perceptual image hash function applied to that view, and $\oplus$ represents the XOR operation. To compare the information in views from a global perspective, pHASH(·) converts each view from the spatial domain to the frequency domain and then encodes it into a hash code. One term of the objective reduces the weight of context-rich views to avoid introducing too much redundant information.
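The greedy view search can be sketched as below. The scoring rule here is an assumption for illustration only and is not the objective from the paper: each candidate is scored by its summed Hamming distance (XOR followed by a bit count) to the hash codes of the views already chosen, divided by `1 + t` as a placeholder for down-weighting context-rich views. The hash codes are assumed to be precomputed integers, and the names `hamming` and `greedy_select` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes (XOR, then bit count)."""
    return bin(a ^ b).count("1")


def greedy_select(hashes: dict[float, int], n: int, anchor: float = 0.0) -> list[float]:
    """Greedily pick n interpolation coefficients, starting from the anchor view.

    hashes maps each candidate coefficient t to the hash code of its view.
    The 1 / (1 + t) weighting is a placeholder, not the paper's formula.
    """
    if anchor not in hashes or n > len(hashes):
        raise ValueError("anchor must be a candidate and n must not exceed the candidates")
    chosen = [anchor]
    while len(chosen) < n:
        remaining = [t for t in hashes if t not in chosen]
        best = max(
            remaining,
            key=lambda t: sum(hamming(hashes[t], hashes[s]) for s in chosen) / (1.0 + t),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    codes = {0.0: 0b0000, 0.3: 0b0111, 0.6: 0b0011, 1.0: 0b1111}
    print(greedy_select(codes, 3))
```

Starting from the region-only view keeps the selection consistent with training, where that view is always retained.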

Experiment

Region-level Captioning

In the region-level captioning task, DynRefer uses a smaller model (4.2B vs. 7B) and, on both the METEOR and CIDEr metrics on the RefCOCOg and VG datasets, significantly surpasses many CVPR 2024 methods, such as RegionGPT, GLaMM, Alpha-CLIP, and Osprey, demonstrating DynRefer's large performance advantage.

Dense Captioning

In the dense captioning task on the VG1.2 dataset, DynRefer improves mAP by 7.1% over the previous SOTA method GRiT.

Open Vocabulary Attribute Detection

In the open-vocabulary attribute detection task, DynRefer also achieves SOTA performance.

Open Vocabulary Region Recognition

In the region recognition task, DynRefer improves mAP by 15% and accuracy by 8.8% over RegionGPT (CVPR 2024), and its mAP is 15.7% higher than that of ASM (ICLR 2024).

Ablation experiment

Lines 1-6: Random dynamic multi-view is better than a fixed view.
Lines 6-10: Selecting views by maximizing information is better than selecting views at random.
Lines 10-13: Multi-task training learns better region representations.

Visualization

The following pictures show inference results from DynRefer. A single DynRefer model can output region captions, tags, attributes, and categories at the same time.
