CLIP as RNN accepted to CVPR: it can segment countless concepts without training | Oxford University & Google Research

PHPz

Released: 2024-06-09 12:53:28

By calling CLIP in a loop, the method can segment countless concepts without any additional training.

It handles any phrase, including movie characters, landmarks, brands, and general categories.

This new work from a joint team at Oxford University and Google Research has been accepted to CVPR 2024, and the code has been open-sourced.

The team proposes a new technique called CLIP as RNN (CaR for short), which addresses several key problems in open-vocabulary image segmentation:

No training data required: Traditional methods need large amounts of mask annotations or image-text datasets for fine-tuning, whereas CaR works without any additional training data.
Limits on open vocabulary: After fine-tuning, pre-trained vision-language models (VLMs) lose much of their ability to handle open vocabularies. CaR preserves the broad vocabulary space of the VLM.
Text queries for concepts not in the image: Without fine-tuning, VLMs struggle to produce accurate masks when a text query refers to a concept that does not appear in the image. CaR improves segmentation quality step by step through an iterative process.

Inspired by RNNs: calling CLIP in a loop

To understand how CaR works, it helps to first review the recurrent neural network (RNN).

An RNN introduces a hidden state, which acts like a "memory" that stores information from past time steps. Every time step shares the same set of weights, which makes the model well suited to sequence data.
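
The two ideas above, a hidden state carried forward and one set of weights reused at every step, can be illustrated with a minimal sketch. This is a generic Elman-style recurrence written in plain Python for illustration only; the names `rnn_step`, `run_rnn`, `W_h`, `W_x`, and `b` are assumptions made here and do not come from the CaR paper.

```python
import math

def rnn_step(h, x, W_h, W_x, b):
    """One recurrent step: h_new = tanh(W_h @ h + W_x @ x + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_h[i][j] * h[j] for j in range(len(h)))  # memory of past steps
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))  # current input
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, h0, W_h, W_x, b):
    """Apply the same weights at every time step, carrying the hidden state forward."""
    h = h0
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x, b)
    return h
```

Note that `W_h`, `W_x`, and `b` never change inside `run_rnn`; only the hidden state `h` does. CaR borrows this pattern, with frozen CLIP weights in place of the RNN weights.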

Inspired by RNNs, CaR is also designed as a recurrent framework. It consists of two parts:

Mask proposal generator: generates a mask for each text query with the help of CLIP.
Mask classifier: uses a CLIP model to score how well each generated mask matches its text query. Queries with a low matching score are eliminated.

As the iterations continue, the set of text queries becomes more and more accurate, and the quality of the masks keeps improving.

Finally, when the query set no longer changes, the final segmentation result is output.
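
The loop described above can be sketched as follows. This is only an illustrative outline, not the authors' implementation: `car_loop`, `propose_masks`, `score_mask`, `threshold`, and `max_iters` are hypothetical names, and the two callables stand in for the CLIP-based mask proposal generator and mask classifier.

```python
from typing import Callable, Dict, List

Mask = List[List[float]]  # placeholder type for a per-pixel mask

def car_loop(
    image,
    queries: List[str],
    propose_masks: Callable[[object, List[str]], Dict[str, Mask]],
    score_mask: Callable[[object, Mask, str], float],
    threshold: float = 0.5,   # assumed value, not taken from the paper
    max_iters: int = 10,      # assumed safety cap on the number of iterations
) -> Dict[str, Mask]:
    """Repeat propose -> classify -> filter until the query set stops changing."""
    current = list(queries)
    for _ in range(max_iters):
        # Stage 1: one mask proposal per remaining text query.
        masks = propose_masks(image, current) if current else {}
        # Stage 2: keep only queries whose mask matches the text well enough.
        kept = [q for q in current if score_mask(image, masks[q], q) >= threshold]
        if kept == current:
            # Query set is unchanged: these masks are the final result.
            return {q: masks[q] for q in kept}
        current = kept
    # Iteration budget exhausted: return proposals for the surviving queries.
    return propose_masks(image, current) if current else {}
```

The query list plays the role of the hidden state: it is the only thing that changes between iterations, while the two models behind `propose_masks` and `score_mask` stay fixed.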

This recurrent framework is designed to retain as much of the "knowledge" from CLIP pre-training as possible.

CLIP sees a huge number of concepts during pre-training, covering everything from celebrities and landmarks to anime characters. Fine-tuning on a segmentation dataset is bound to shrink that vocabulary significantly.

For example, the SAM ("Segment Anything") model can recognize a bottle of Coca-Cola but not a bottle of Pepsi-Cola.

However, using CLIP directly for segmentation does not give satisfactory results.

This is because CLIP's pre-training objective was not designed for dense prediction. In particular, when a text query refers to something that is not in the image, CLIP can easily generate wrong masks.

CaR solves this problem through RNN-style iteration. By repeatedly evaluating and filtering the queries while refining the masks, it achieves high-quality open-vocabulary segmentation.

Next, following the team's own explanation, here are the details of the CaR framework.

CaR technical details

Recurrent framework: CaR adopts a novel recurrent framework that iteratively refines the correspondence between text queries and the image.
Two-stage segmenter: consists of a mask proposal generator and a mask classifier, both built on a pre-trained CLIP model whose weights stay unchanged throughout the iterations.
Mask proposal generation: Grad-CAM is used to generate mask proposals from the similarity scores between image and text features.
Visual prompts: visual prompts such as red circles and background blur are applied to focus the model on specific regions of the image.
Threshold function: a similarity threshold is set so that only mask proposals closely aligned with their text query are kept.
Post-processing: masks are refined with a dense conditional random field (CRF) and, optionally, SAM.
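
Two of the components listed above, the background-blur visual prompt and the threshold function, are simple enough to sketch in plain Python. This is a toy illustration under stated assumptions: the image is a small grayscale grid, the blur is a 3x3 mean filter, and the names `box_blur`, `blur_background`, and `filter_by_threshold` are hypothetical. The article does not give the blur or threshold settings CaR actually uses.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale image: rows of pixel intensities
Mask = List[List[int]]     # 1 = inside the mask proposal, 0 = background

def box_blur(image: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                image[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out

def blur_background(image: Image, mask: Mask) -> Image:
    """Visual prompt: keep masked pixels sharp and blur everything else."""
    blurred = box_blur(image)
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]

def filter_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Threshold function: keep queries whose mask-text similarity is high enough."""
    return [query for query, score in scores.items() if score >= threshold]
```

In a pipeline of this shape, the prompted image from `blur_background` would be passed to the mask classifier, and the resulting similarity scores would go through `filter_by_threshold` to decide which queries survive to the next iteration.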

With these techniques, CaR achieves significant performance improvements on multiple standard datasets: it surpasses previous zero-shot methods and is competitive with models that have been fine-tuned on large amounts of data. As the authors' results table shows, although CaR requires no additional training or fine-tuning, it outperforms previous methods fine-tuned on additional data on eight different zero-shot semantic segmentation metrics.

The authors also tested CaR on zero-shot referring segmentation, where it again outperformed previous zero-shot methods.

To sum up, CaR (CLIP as RNN) is an innovative recurrent framework, inspired by RNNs, that can effectively perform zero-shot semantic and referring image segmentation without additional training data. It significantly improves segmentation quality by preserving the broad vocabulary space of pre-trained vision-language models and by using an iterative process to continuously refine the alignment between text queries and mask proposals.

CaR's strengths are its ability to handle complex text queries without fine-tuning and its extensibility to video, which together mark a breakthrough for open-vocabulary image segmentation.

Paper link: https://arxiv.org/abs/2312.07661
Project homepage: https://torrvision.com/clip_as_rnn/
