
The Bengio team proposes a new multi-modal benchmark, targeting the weaknesses of Claude 3.5 and GPT-4o

王林
Release: 2024-06-29 00:06:53
The AIxiv column is where this site publishes academic and technical content. Over the past few years it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, helping to promote academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The author of this article, Zhang Tianyu, is a doctoral student at the Mila AI Institute in Canada, supervised by Turing Award winner Professor Yoshua Bengio. His doctoral work focuses on multi-modality, GFlowNets, multi-agent reinforcement learning, and applications of AI to climate change. He has published papers at top machine-learning conferences such as ICML, ICLR, and ICASSP; a representative work is Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (CLAP).

To reach the ultimate goal of artificial general intelligence (AGI), a model must first be able to complete tasks that humans find easy. For this reason, one key guideline in large-model development is how to make machines think and reason like humans; technologies such as attention mechanisms and Chain-of-Thought were inspired by exactly this idea.

However, many people may not realize that cognitive tasks which are very simple for humans often hide very complex reasoning processes. As an example, try to fill in the occluded text gaps based on the image below:

[Image: example VCR task with a partially occluded Chinese caption]

(Correct answer: Machine learning researchers from around the world are excited about the new GPU. Its cutting-edge features make large-scale experiments more efficient and cheaper, even though it is as big as a stove.)

For most native Chinese speakers this task is not difficult, and you can probably produce the answer within a few seconds. Yet inferring the complete text from the exposed fragments involves a rather complex reasoning process: contemporary neuroscience research shows that recovering partially occluded objects requires heavy involvement of the prefrontal cortex, the region responsible for high-level decision-making.

We know that current vision-language models (VLMs) can already perform object recognition and text recognition very accurately. But when the occluded content is text, when the model's optical character recognition (OCR) fails, and when the only key information left is a few pixels of the occluded characters, can the model emulate the human reasoning process and complete the task?

To answer this, the team of Turing Award winner Yoshua Bengio proposed a new visual question answering task: Visual Caption Restoration (VCR). It uses this task to probe the reasoning capabilities of vision-language models: how far are current VLMs from human-level cognition?


  • Paper title: VCR: Visual Caption Restoration
  • Paper link: arxiv.org/abs/2406.06462
  • Code repository: github.com/tianyu-z/VCR (includes the data generation code as well as code for model evaluation and pre-training)
  • Hugging Face link: huggingface.co/vcr-org

VCR dataset introduction

To build the VCR task, the researchers designed a pipeline that generates VCR composite images from image-text pairs. In this pipeline, the visibility of the text in the image is controlled by the size of the white rectangles that cover it, which in turn controls the difficulty of the task.
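As a rough illustration of such a pipeline (this is not the authors' released code; the caption layout, font handling, and the single `hide_ratio` knob below are simplifying assumptions), one can render the caption beneath the image and cover a band of each text line with a white rectangle:

```python
# Minimal sketch of a VCR-style image generator (illustrative only; not the paper's code).
# Assumptions: the caption is rendered below the image in a single column, and
# "difficulty" is just the fraction of each text line hidden by a white rectangle.
from PIL import Image, ImageDraw, ImageFont

def make_vcr_image(image_path, caption, hide_ratio=0.6, line_height=28, out_path="vcr_sample.png"):
    img = Image.open(image_path).convert("RGB")
    font = ImageFont.load_default()

    # Naive line wrapping: fixed number of characters per line.
    chars_per_line = max(1, img.width // 14)
    lines = [caption[i:i + chars_per_line] for i in range(0, len(caption), chars_per_line)]

    # Canvas = original image on top, white caption area below.
    canvas = Image.new("RGB", (img.width, img.height + line_height * len(lines) + 10), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)

    for i, line in enumerate(lines):
        y = img.height + 5 + i * line_height
        draw.text((5, y), line, fill="black", font=font)
        # Cover the middle band of the text line with a white rectangle;
        # hide_ratio controls how much of the glyph height is hidden (task difficulty).
        hidden = int(line_height * hide_ratio)
        top = y + (line_height - hidden) // 2
        draw.rectangle([0, top, canvas.width, top + hidden], fill="white")

    canvas.save(out_path)
    return out_path
```

Increasing `hide_ratio` leaves fewer visible pixels at the top and bottom of each glyph, which is essentially how the task moves from the Easy to the Hard setting; the repository's generation code additionally handles fonts, line breaking, and per-language details.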

Using this pipeline, the researchers built the VCR-wiki dataset from Wikipedia's main image-caption pairs, in both English and Chinese. Each language comes in two difficulty levels, "Easy" and "Hard", where:

  • "Easy" difficulty VCR task can make the OCR model invalid ;
  • "Difficulty" VCR task only retain 1-2 top and bottom for each occluded text The height of pixels, but still allows users of the corresponding language to complete the task.

For each language and difficulty, the test set and the validation set each contain 5,000 samples; the remaining samples form the training set.
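For readers who want to browse the data, it is hosted under the Hugging Face organization linked above. A minimal loading sketch follows; the repository id, split name, and column names are assumptions and should be checked against huggingface.co/vcr-org:

```python
# Illustrative only: the dataset id, split name, and column names are assumptions;
# check https://huggingface.co/vcr-org for the actual repositories.
from datasets import load_dataset

ds = load_dataset("vcr-org/VCR-wiki-en-easy-test", split="test")
print(ds)             # number of rows and column names
sample = ds[0]
print(sample.keys())  # expected to include the stacked/occluded images and the ground-truth caption
```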

[Image: sample items from the VCR-wiki dataset]

The example at the beginning of the article poses only a small challenge for humans; it does not show the upper limit of human performance on this task, nor the thinking and skills humans use when solving it. A sample VCR task at "Hard" difficulty is shown below. Readers may want to try filling in the occluded text themselves before reading the answer.

[Image: "Hard" difficulty Chinese VCR example]

(Correct answer: The Almagest (the "Great Treatise"), a treatise on mathematics and astronomy compiled by Ptolemy in ancient Greece around 140 AD, which described the complex motion paths of the stars and planets. Until the Middle Ages and the early Renaissance, the geocentric model proposed in the book was adopted in the Islamic world and in Europe...)

How do humans complete partially occluded text?


There is a concept in education and cognitive science called meta-cognition. When designing AI, we humans, acting as teachers, can monitor our own thinking processes and use them as a reference to help the models, acting as students, learn more efficiently. Thinking about "how humans complete VCR tasks" can therefore be instructive for model design.

The picture below shows one of the author's own problem-solving approaches to the VCR task, for reference. It looks like many steps, but in essence it is simply gathering information from different regions of the image and repeatedly verifying it to increase confidence in the answer.

[Image: one possible human reasoning process for solving a VCR example]

On first seeing the picture, one has only a vague guess; reading further and gathering new information gradually verifies that guess. Even after reading, while filling in the blanks, one keeps comparing different pieces of information to confirm the answer. Whenever a hypothesis is inconsistent with other information, it is discarded and a new hypothesis is tried.

Human evaluation results

How good are humans at the VCR task? The paper reports the accuracy of native or fluent speakers of English and Chinese on the Easy and Hard settings.

If errors involving dates, place names, and personal names are counted, the average human accuracy on Chinese Easy is about 98.58%, and on Chinese Hard about 91.84%. Excluding such errors, humans score nearly full marks on Chinese Easy and reach 96.63% on Chinese Hard. The VCR task is clearly very easy for humans.

Existing model results

The author tested an "all-star lineup": Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, GPT-4 Turbo, Qwen-VL-Max, Reka Core, and some of the best-performing open-source models available today.

The following figure shows the performance of each model on the Easy setting of VCR-Wiki Chinese:

[Figure: model accuracy on VCR-Wiki Chinese, Easy setting]

The metrics in the red box report the accuracy with which a model restores the occluded text when it is given both the visual image (VI) and the text embedded in the image (TEI) as context. The metrics in the blue box report the accuracy when the model is given only the text embedded in the image (TEI), without the visual image (VI).
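A rough sketch of what these two settings mean in practice is shown below; `query_vlm` stands in for whatever VLM API is being evaluated, the field names are hypothetical, and the prompt wording is ours rather than the paper's:

```python
# Sketch of the two evaluation conditions (VI + TEI vs. TEI only).
# query_vlm is a hypothetical wrapper around the vision-language model under test;
# the sample field names are illustrative, not the dataset's actual column names.
PROMPT = ("The caption in this picture is partially covered by white rectangles. "
          "Restore the covered text and output only the restored caption.")

def evaluate_sample(sample, query_vlm):
    # VI + TEI: the model sees the full stacked image (photo plus embedded caption).
    answer_vi_tei = query_vlm(images=[sample["image_vi_tei"]], prompt=PROMPT)

    # TEI only: the model sees only the cropped caption region, without the photo.
    answer_tei_only = query_vlm(images=[sample["image_tei_only"]], prompt=PROMPT)

    return answer_vi_tei, answer_tei_only
```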

[Figure: model accuracy on VCR-Wiki Chinese]

The results show that:

  • the vast majority of current models cannot perform this task;
  • the vast majority of models do not make good use of the image information: adding the visual image (VI) does not improve their accuracy.

On the Chinese Hard setting, the models run into even greater trouble. The best performer is GPT-4o, yet its accuracy is only 2.2%; apart from CogVLM2-Chinese and Qwen-VL-Max, most models score close to 0%.

In other words, under the hard setting the existing models can rarely answer correctly at all, let alone approach human performance.

English VCR evaluation results

The author also tested today's best open-source and closed-source vision-language models on the English VCR-Wiki. Before showing the results, here are two examples of the English VCR-Wiki task:

English Easy example:

[Image: English VCR example, Easy setting]

(Correct answer: Since the United States Post Office issued its first stamp in 1847, over 4,000 stamps have been issued and over 800 people featured. Many of these people...)

English Hard example:

[Image: English VCR example, Hard setting]

(Correct answer: Lincoln is the luxury vehicle division of American automobile manufacturer Ford. Marketed among the top luxury vehicle brands in the United States, for...)

The test results on the English VCR-Wiki reported in the paper are as follows:

[Figure: model results on English VCR-Wiki]

Overall, the models perform better in English than in Chinese under both the easy and the hard settings. This contradicts the common intuition that partially occluded Chinese should be easier to complete because of its distinctive modular character structure. One possible explanation is that English enjoys a large advantage over Chinese in both the quantity and the quality of pre-training data.

Among the models tested, GPT-4o performs best among the closed-source models, and CogVLM2 performs best among the open-source models.

An interesting phenomenon is that adding the image clearly helps CogVLM2 (a 20.3% improvement under the hard setting), whereas for GPT-4o the result actually drops. A similar phenomenon appears in the Chinese tests. The author attributes this to the model architecture; for details, readers are referred to the CogVLM papers and code.

In addition, closed-source models generally achieve better results than open-source models, which may be due to better training strategies or larger parameter counts. Even so, the models still face a serious challenge under the "Hard" setting. Open-source models can partially handle the "Easy" setting, but under the hard setting most of them fail at a task that is very easy for humans.

Related tasks

VQA

Visual question answering (VQA) asks a model to answer natural-language questions about an image. Because there is no single standard answer, evaluating VQA is highly challenging. Traditional VQA methods focus on direct queries about elements visible in the image and do not involve the complex relationship between the text embedded in the image and the overall image context.

In some VQA benchmarks where embedded text carries most of the information in the image, the model's visual module may not even need to be aligned with its language module: the image is fed to an OCR-style visual module, the OCR module outputs the characters in the image, and those characters are passed as context to the language module. The VQA task thereby degenerates into a QA task that needs no image information; the vision-language alignment that was meant to be compared across VLMs is ignored, while OCR ability is rewarded instead.
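To make this degenerate pipeline concrete, a minimal sketch is given below; pytesseract stands in for the OCR module and `ask_llm` is a hypothetical text-only language model call. Note that the image itself never reaches the language model, so no vision-language alignment is exercised.

```python
# A VQA "solver" that never aligns vision and language:
# OCR extracts the embedded text, and a text-only LM answers from that string alone.
import pytesseract
from PIL import Image

def degenerate_vqa(image_path, question, ask_llm):
    text_in_image = pytesseract.image_to_string(Image.open(image_path))
    # The pixels are discarded after OCR; only the extracted string reaches the LM.
    return ask_llm(f"Context extracted from the image: {text_in_image}\nQuestion: {question}")
```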

OCR

The optical character recognition (OCR) task takes an image containing characters as input and outputs the string of characters it contains, without needing to understand their context.

Models pre-trained on OCR can extract embedded text from input images even when that text is incomplete or blurred. However, as the text becomes more blurred or more heavily occluded, recovering the original text from the visible parts alone becomes difficult, and OCR methods are of limited use in this situation.

In short: VQA has no standard answer, so evaluating the quality of a model's responses remains an open problem; and OCR can be done without context, so it cannot test whether a model has really learned to exploit contextual information.

Why VCR is irreplaceable

Visual Caption Restoration (VCR) builds a bridge between VQA and OCR:

  • The unique challenge of VCR is that it requires the model to align visual and textual information precisely, in sharp contrast to the plain text-extraction task of OCR. OCR is mainly concerned with recognizing visible characters, without understanding their contextual role in the image; VCR instead requires the model to jointly use the available partial, pixel-level text cues and the visual context to accurately reconstruct the occluded content. This tests not only the model's ability to process embedded text and visual elements, but also its ability to stay internally consistent, much like the human cognitive process of understanding and responding through contextual and visual clues.
  • Unlike VQA, each VCR question has a unique answer, so evaluation can be done by accuracy and the metric is unambiguous (a minimal scoring sketch follows after this list).
  • By adjusting the proportion of text that is covered, the difficulty of the task can be controlled, providing a rich testing environment.

Like OCR, the VCR task can also serve as a training task for VLMs. The authors have released the transform code, which can generate VCR task images from any given image-text pair.
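Because each blank has a unique ground truth, scoring can be as simple as exact match over the restored spans. The sketch below is illustrative; the paper may additionally use softer string-overlap scores, for which a token-level Jaccard similarity is shown as one possible choice.

```python
# Illustrative scoring for VCR-style restoration: exact match over occluded spans,
# plus a token-level Jaccard similarity as a softer complementary score.
def exact_match_accuracy(predictions, references):
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

def token_jaccard(pred, ref):
    a, b = set(pred.split()), set(ref.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0
```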

Summary

The Visual Caption Restoration (VCR) task proposed in this paper uses a seemingly simple caption-restoration problem to expose the gap in reasoning ability between existing vision-language models and humans on higher-level cognitive tasks. The authors believe this task can inspire more effective VLM training, evaluation, and inference methods, and further narrow the gap between multi-modal models and human cognition.


Source: jiqizhixin.com