The authors of this article are from the National University of Singapore (NUS), Nanyang Technological University (NTU), and Harbin Institute of Technology (HIT). Hao Fei's research focuses on multimodal learning and multimodal large language models. Shengqiong Wu is a doctoral student at NUS whose main research direction is multimodal large language models. Wei Ji's main research directions are multimodal learning and multimodal content generation. Professor Hanwang Zhang's research interests include computer vision and causal inference. Professor Meishan Zhang's research interests include code intelligence, natural language processing, and multimodal generation and understanding. Professors Mong-Li Lee and Wynne Hsu work on social media analysis, collaborative machine learning, and related topics.
Recently, researchers from the National University of Singapore, Nanyang Technological University, and Harbin Institute of Technology jointly proposed a new video reasoning framework: Video-of-Thought (VoT), the first video-oriented chain-of-thought framework from the large-model reasoning community. VoT allows video multimodal large language models (MLLMs) to greatly improve their understanding and reasoning performance on complex videos. This work has been accepted as an Oral paper at ICML 2024.
- Paper link: https://openreview.net/pdf?id=fO31YAyNbI
- Project link: http://haofei.vip/VoT/
**A leap from perception to cognition**

Compared with understanding and reasoning over static images, reasoning over videos is considerably more complicated and difficult, because videos naturally carry challenging dynamic temporal characteristics and contain far more redundant visual content. Past video understanding research mostly focused on shallow perception of videos, such as video action recognition, dynamics recognition, and video caption generation. These methods, however, still fall significantly short in deep understanding and reasoning over complex videos. Compared with shallow video perception, complex video reasoning requires not only an understanding of the spatiotemporal characteristics of the video, but also a deep grasp of the high-order commonsense behind the pixels. VoT was proposed to address exactly this problem.

For humans, understanding videos is almost second nature. So how do we humans actually reason about videos? Consider the following case. The video shows a car colliding at high speed with a red tanker truck on the highway, and the corresponding question is: "What will happen to this red oil tanker truck?" Given this question and video, we first identify the target of interest based on the question: the red oil tanker truck. Then we watch the video carefully, tracking the semantics of the target object's actions. Next, we perform deeper, high-level reasoning, possibly drawing on commonsense knowledge. Finally, we give the reasoned answer: "It may catch fire or even explode."
**Dual ability: combining perception and cognition**

Drawing inspiration from the human cognitive pattern described above, the research team pointed out that complex video reasoning requires two key capabilities: the perceptual ability of pixel-level understanding and the cognitive ability of semantic understanding. Most importantly, video reasoning may not be an instant, one-step process, but a multi-hop process from low-level perception to high-level cognition.

Perception: accurate content perception requires a detailed pixel-level understanding of video motion. This process may require deep integration of the given video content with fine-grained object grounding.
However, most existing video understanding methods stop at instance-level analysis and lack fine-grained control and accurate object-level recognition or tracking, let alone in-depth video understanding.

Cognition: in-depth reasoning requires cognitive ability, allowing the model to provide reasonable explanations and even causal imagination. This level demands a certain amount of commonsense knowledge about the world, for example, understanding that "jumping from a high place may cause fractures" or that "colliding with an oil tanker may cause an explosion."

**A new reasoning framework: the birth of Video-of-Thought**

To achieve this goal, the research team proposed a new reasoning framework, Video-of-Thought. This chain of thought decomposes a complex video reasoning problem into a series of sub-problems, from low-level visual perception up to high-level commonsense cognition. To support the fine-grained video perception described above, the authors also propose a Spatial-Temporal Scene Graph (STSG) representation that assists the reasoning process and produces fine-grained perceptual intermediate results, enabling precise understanding of spatial and temporal features.
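To make the STSG idea concrete, below is a minimal Python sketch of what a spatio-temporal scene graph data structure could look like. All class and field names here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """An object detected in one frame (names are illustrative)."""
    object_id: int   # stable identity across frames; enables tracking
    label: str       # e.g. "tanker_truck"
    bbox: tuple      # (x1, y1, x2, y2) pixel coordinates

@dataclass
class SceneEdge:
    """A spatial or semantic relation between two nodes in a frame."""
    subject_id: int
    relation: str    # e.g. "collides_with"
    object_id: int

@dataclass
class FrameGraph:
    """Scene graph of a single video frame."""
    timestamp: float
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

@dataclass
class STSG:
    """Spatial-Temporal Scene Graph: per-frame graphs linked by object_id,
    so an object's trajectory is its time-ordered sequence of nodes."""
    frames: list = field(default_factory=list)

    def trajectory(self, object_id: int) -> list:
        """Collect (timestamp, bbox) pairs for one tracked object."""
        return [
            (f.timestamp, n.bbox)
            for f in self.frames
            for n in f.nodes
            if n.object_id == object_id
        ]
```

The key design point in this sketch is that a shared `object_id` links the same object across frames, so a target's trajectory falls out as the time-ordered sequence of its per-frame nodes.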
Combined with a video multimodal large model, the authors finally propose a new video MLLM, MotionEpic.
Experimental results show that the proposed inference framework significantly improves model performance on various types of video QA, surpassing all current traditional video MLLM and CoT methods.

**A. The Video-of-Thought (VoT) reasoning framework**

The VoT reasoning framework consists of five steps.

Step-1: Task definition and target identification. Given an input video and a question, VoT first identifies all possible targets involved in the question. This step ensures the system clearly understands which objects need to be analyzed and which task is being solved.
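As a hedged illustration, Step-1 can be thought of as a single grounding query to the video MLLM. The `video_llm` callable and the prompt wording below are assumptions made for the sketch, not the paper's exact interface or prompts.

```python
def identify_targets(video_llm, video, question: str) -> list:
    """Step-1 (illustrative): ask the model which objects the question
    is about, so the later steps know what to track."""
    prompt = (
        "Given the video and the question below, list the target "
        "object(s) that must be analyzed to answer it, separated by commas.\n"
        f"Question: {question}"
    )
    answer = video_llm(video=video, prompt=prompt)
    # e.g. "the red tanker truck" -> ["the red tanker truck"]
    return [t.strip() for t in answer.split(",") if t.strip()]
```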
Step-2: Object tracking. Next, VoT analyzes the video content, tracks the behavior trajectory of the targets involved in the question, and outputs a perception-level spatiotemporal scene graph (STSG). The generated STSG of the target trajectory serves as the perceptual evidence for the next step of behavior analysis.

Step-3: Behavior analysis. In this step, VoT prompts the model to bring in potentially relevant commonsense knowledge while integrating the target tracking results in the STSG, so that the model can connect video pixel observations to the real world and achieve a deeper understanding of the video.
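Continuing the sketch, Steps 2 and 3 chain together: the perception-level tracking output becomes the evidence that conditions the cognition-level analysis. As before, `video_llm` and the prompt texts are illustrative assumptions rather than the paper's actual prompts.

```python
def track_and_analyze(video_llm, video, target: str):
    """Steps 2-3 (illustrative): ground the target's trajectory, then
    reason over that trajectory with commonsense knowledge."""
    # Step-2: perception-level grounding. The real system emits a
    # structured STSG; here we simply request a textual trajectory.
    tracking = video_llm(
        video=video,
        prompt=(f"Track '{target}' across the video and describe its "
                "trajectory and interactions frame by frame."),
    )
    # Step-3: cognition-level analysis conditioned on the trajectory.
    analysis = video_llm(
        video=video,
        prompt=(f"Given this observed trajectory:\n{tracking}\n"
                "Analyze the target's behavior, bringing in relevant "
                "commonsense knowledge about the real world."),
    )
    return tracking, analysis
```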
Step-4: Question answering via ranking. Having deeply understood the target behavior in the video, VoT now answers the original question. First, the system unifies all QA problems into a multiple-choice format, i.e., the final answer is selected from several provided candidate answers. Furthermore, inspired by the way humans answer multiple-choice questions, the system uses a ranking mechanism to determine the final answer: for each candidate answer, VoT prompts the model to rate its likelihood from 1 to 10 based on commonsense knowledge and to give a corresponding reason. The highest-ranked candidate becomes the final answer.
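A minimal sketch of such a ranking loop, under the same assumed `video_llm` interface, might look like this; the 1-to-10 score parsing is deliberately crude and only meant to convey the mechanism.

```python
import re

def rank_answers(video_llm, video, question: str,
                 candidates: list, evidence: str) -> str:
    """Step-4 (illustrative): score every candidate answer from 1 to 10
    and return the highest-ranked one."""
    best_answer, best_score = None, -1
    for cand in candidates:
        reply = video_llm(
            video=video,
            prompt=(f"Question: {question}\nCandidate answer: {cand}\n"
                    f"Observed evidence: {evidence}\n"
                    "Rate the likelihood of this answer from 1 to 10 "
                    "and give a reason. Start your reply with the number."),
        )
        match = re.search(r"\d+", reply)       # crude score extraction
        score = int(match.group()) if match else 0
        if score > best_score:
            best_answer, best_score = cand, score
    return best_answer
```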
Step-5: Answer verification. Since complex video tasks usually involve intricate questions and answers, and the overall reasoning process contains multiple stages, it is important to verify the answer given in the previous step. The system's basic verification idea is: assuming answer A is correct, retrospectively evaluate whether that answer contradicts the input question and the video content from two aspects (a minimal sketch follows the list):
- Perception verification: check, from the perceptual perspective, whether the pixel-level grounding information is consistent with the facts presented in the video.
- Cognition verification: prompt the model, from the cognitive perspective, to determine whether the commonsense knowledge implied by the answer contradicts the main observations inferred in the third reasoning step (behavior analysis).
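Here is the promised sketch of the two-sided verification, again under the assumed `video_llm` interface; real verification prompts would be richer than these yes/no checks.

```python
def verify_answer(video_llm, video, question: str,
                  answer: str, analysis: str) -> bool:
    """Step-5 (illustrative): retrospectively check the chosen answer
    from both a perception and a cognition perspective."""
    # Perception verification: does the answer match what is shown?
    perception_ok = "yes" in video_llm(
        video=video,
        prompt=(f"Assume the answer to '{question}' is '{answer}'. "
                "Is this consistent with the objects and events actually "
                "shown in the video? Answer yes or no."),
    ).lower()
    # Cognition verification: does the answer's commonsense hold up
    # against the Step-3 behavior analysis?
    cognition_ok = "yes" in video_llm(
        video=video,
        prompt=(f"Assume the answer is '{answer}'. Given the earlier "
                f"behavior analysis:\n{analysis}\n"
                "Is the commonsense implied by the answer consistent "
                "with it? Answer yes or no."),
    ).lower()
    return perception_ok and cognition_ok
```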
This final check ensures that VoT outputs the most reliable result. Together, the five steps of the VoT reasoning framework, from task definition and target identification to final answer verification, comprehensively improve the accuracy and reliability of video understanding and reasoning, providing a powerful solution for complex video tasks.

**1. Main experimental comparison**

The authors first evaluated on several complex VideoQA datasets. The experimental results show that VoT consistently outperforms the SoTA baseline models on all test sets, even surpassing traditional CoT. Notably, compared with traditional CoT, VoT's performance gain is larger and more pronounced. In addition, the improvement on the two complex video QA tasks is more evident than on the comparatively simple ones (e.g., MSR-VTT and ActivityNet). This is mainly because the latter datasets lean toward perceptual reasoning (e.g., describing what is in the video) rather than cognitive reasoning (e.g., explaining, anticipating).
**3. Detailed analysis of reasoning ability**
First, the authors conducted a human evaluation. As shown in the table in the upper part of Figure 7, MotionEpic with the VoT inference framework achieves remarkably strong results, even comparable to human performance. The authors then summarized the six most common error categories and analyzed the differences among them. As shown in the lower part of the figure, MotionEpic (using VoT) significantly reduces the error rate of Video-LLaVA (using CoT), especially in action semantics and commonsense understanding.
**4. Visual analysis of the reasoning process**

Finally, the authors intuitively demonstrate VoT's superiority through a case study. As shown in Figure 8, the video shows a complex scene of "a trainer leading a puppy through a race across various obstacles", and the given question is abstract and complex, requiring commonsense rather than something directly perceivable from the video itself. Among the compared systems, only this one gives the correct answer. Specifically, at the content-perception level, VoT ensures accurate and robust understanding through STSG-based video grounding, preventing hallucination: it correctly interprets the animal as a dog and then infers from commonsense that the scene involves a trainer training a dog. At the cognitive level, it then analyzes each option to determine the best answer. Further verification confirms that the result is consistent with both the video content and commonsense facts. Overall, through problem decomposition, the whole reasoning process improves accuracy at each step while keeping the result explainable.
The authors also provide more visual analyses: