Easily understand 4K HD images! This large multi-modal model automatically analyzes the content of web posters, making it very convenient for workers.

Apr 23, 2024 am 08:04 AM

A large model that automatically analyzes the content of PDFs, web pages, posters, and Excel charts could hardly be more convenient for working people.

The InternLM-XComposer2-4KHD (abbreviated as IXC2-4KHD) model proposed by Shanghai AI Lab, the Chinese University of Hong Kong and other research institutions makes this a reality.

Whereas other large multi-modal models cap input resolution at no more than 1500x1500, this work raises the maximum input image size to 4K (3840x1600) and beyond, and supports arbitrary aspect ratios and dynamic resolutions ranging from 336 pixels up to 4K.

Three days after its release, the model topped the Hugging Face trending list for visual question answering models.

Easy 4K image understanding

Let's look at some results first.

The researchers gave the model a screenshot of the paper "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions" (resolution 2550x3300) and asked which model has the highest performance on MMBench.

It should be noted that this information is not mentioned in the text part of the input screenshot, but only appears in a rather complicated radar chart. Faced with such a tricky question, IXC2-4KHD successfully understood the information in the radar chart and answered the question correctly.

Given an image with a more extreme aspect ratio (816x5133), IXC2-4KHD readily recognizes that the image consists of 7 parts and accurately describes the text in each part.

The researchers then comprehensively tested IXC2-4KHD on 16 multi-modal large model benchmarks, 5 of which (DocVQA, ChartQA, InfographicVQA, TextVQA, OCRBench) focus on high-resolution image understanding.

With only 7B parameters, IXC2-4KHD matched or even surpassed GPT-4V and Gemini-Pro on 10 of these benchmarks, showing that it is not limited to high-resolution image understanding but is also versatile across tasks and scenarios.

△With only 7B parameters, IXC2-4KHD performs comparably to GPT-4V and Gemini-Pro.

How to achieve 4K dynamic resolution?

In order to achieve the goal of 4K dynamic resolution, IXC2-4KHD includes three main designs:

(1) Dynamic resolution training:

△4K resolution image processing strategy

In the IXC2-4KHD framework, the input image is randomly enlarged to an intermediate size whose area lies between the input area and the maximum area (no more than 55x336x336, equivalent to 3840x1617 resolution).

The image is then automatically cut into multiple 336x336 tiles, and visual features are extracted from each tile separately. This dynamic resolution training strategy lets the model adapt to visual input of any resolution, and also compensates for the shortage of high-resolution training data.
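The tiling step can be illustrated with a small sketch. This is not the IXC2-4KHD implementation: it assumes the image keeps its aspect ratio when enlarged and that the grid chosen is the largest one whose tile count stays within the budget. The names `choose_grid` and `tile_boxes` are hypothetical.

```python
import math

TILE = 336  # tile side in pixels, as stated in the article


def choose_grid(width, height, max_tiles):
    """Pick the largest (cols, rows) tile grid within max_tiles.

    Assumption (not confirmed by the article): the aspect ratio is kept
    and the shorter side is rounded up to a whole number of tiles.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    # Work along the longer side and swap back at the end.
    swapped = width < height
    long_side, short_side = (height, width) if swapped else (width, height)
    ratio = long_side / short_side  # always >= 1
    n_long = 1
    # Grow the grid along the long side while the total stays in budget.
    while (n_long + 1) * math.ceil((n_long + 1) / ratio) <= max_tiles:
        n_long += 1
    n_short = math.ceil(n_long / ratio)
    return (n_short, n_long) if swapped else (n_long, n_short)


def tile_boxes(cols, rows, tile=TILE):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(3840, 1617, max_tiles=55)
print(cols, rows, len(tile_boxes(cols, rows)))
```

For a 3840x1617 input with a 55-tile budget, this sketch picks an 11x5 grid, which uses exactly 55 tiles. During training, the budget passed as `max_tiles` would itself be sampled at random to obtain the "randomly enlarged" behavior described above; that sampling is left out here.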

Experiments show that as the upper limit of the dynamic resolution increases, the model improves steadily on high-resolution image understanding tasks (InfographicVQA, DocVQA, TextVQA), and performance has still not saturated at 4K resolution, which suggests room for further gains at even higher resolutions.

(2) Add tile layout information:

For the model to adapt to widely varying dynamic resolutions, the researchers found it necessary to provide the tile layout as additional input. They adopted a simple strategy: a special ‘newline’ (‘\n’) token is inserted after each row of tiles to tell the model how the tiles are arranged. Experiments show that adding tile layout information has little effect on dynamic resolution training with relatively small variation (HD9, meaning no more than 9 tiles), but brings significant performance gains for dynamic 4K resolution training.
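A toy example may help show why the row marker matters. The sketch below uses strings in place of visual features, and the `<newline>` placeholder and the function names are hypothetical; in the real model the tokens would be feature vectors rather than strings.

```python
NEWLINE = "<newline>"  # hypothetical placeholder for the special token


def layout_sequence(cols, rows, newline=NEWLINE):
    """Flatten a cols x rows tile grid row by row, ending each row with newline."""
    seq = []
    for r in range(rows):
        seq.extend(f"tile{r * cols + c}" for c in range(cols))
        seq.append(newline)
    return seq


def without_newlines(seq, newline=NEWLINE):
    """Drop the row markers, leaving only the tile tokens."""
    return [token for token in seq if token != newline]


wide = layout_sequence(cols=3, rows=2)  # 3 tiles per row, 2 rows
tall = layout_sequence(cols=2, rows=3)  # 2 tiles per row, 3 rows
print(wide)
print(tall)
# Without the markers, the two layouts flatten to the same sequence.
print(without_newlines(wide) == without_newlines(tall))
```

Both grids contain six tiles, so once flattened they can only be told apart by where the row markers fall.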

(3) Expanding the resolution during the inference phase:

The researchers also found that a model trained with dynamic resolution can be scaled up directly at inference time by raising the maximum number of tiles, which brings additional performance gains. For example, evaluating a model trained with HD9 (up to 9 tiles) directly with HD16 yields an improvement of up to 8% on InfographicVQA.
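To put the tile budgets in perspective, the pixel area each one covers can be worked out directly from the 336x336 tile size. The helper name `pixel_budget` is hypothetical and the snippet is only arithmetic, not part of the model.

```python
TILE = 336  # tile side in pixels, as stated in the article


def pixel_budget(max_tiles, tile=TILE):
    """Total pixels covered by max_tiles square tiles."""
    return max_tiles * tile * tile


for name, max_tiles in [("HD9", 9), ("HD16", 16), ("4K (55 tiles)", 55)]:
    print(f"{name}: {pixel_budget(max_tiles):,} pixels")
```

Moving from HD9 to HD16 raises the pixel budget by a factor of 16/9, roughly 1.78, and the 55-tile budget equals 3840x1617 pixels, matching the maximum size quoted earlier.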

IXC2-4KHD raises the resolution supported by large multi-modal models to the 4K level. The researchers note that the current strategy of supporting larger image inputs by increasing the number of tiles runs into computational cost and GPU memory bottlenecks, so they plan to propose a more efficient strategy to support higher resolutions in the future.

Paper link:

https://arxiv.org/pdf/2404.06512.pdf

Project link:

https://github.com/InternLM/InternLM-XComposer
