What is it like to work with visual prompts?
Just draw a box around one object in the picture, and every object of the same category is marked immediately!
Counting grains of rice is a task that even GPT-4V struggles with. Here, you only need to drag a box around one grain by hand to find them all.
Object detection has a new paradigm!
At the just-concluded IDEA Annual Conference, Shen Xiangyang, founding chairman of the IDEA Research Institute and foreign academician of the National Academy of Engineering, presented the institute's latest research result:
T-Rex, a model based on visual prompts.
The entire interactive process is ready to use out of the box and can be completed in just a few steps.
Previously, Meta’s open-source Segment Anything Model (SAM) ushered in a GPT-3 moment for the computer vision field. However, it was still based on the text prompt paradigm, which makes some complex and rare scenarios hard to handle.
Now such cases can be handled easily by using a picture to search a picture.
In addition, the conference was packed with other results, such as the knowledge-driven large model Think-on-Graph, the developer platform MoonBit, the 2.0 update of the AI research tool ReadPaper, the SPU confidential computing co-processor, and the controllable portrait video generation platform HiveNet.
Finally, Shen Xiangyang also shared the project on which he has spent the most time in the past few years: the low-altitude economy.
"I believe that when the low-altitude economy is relatively mature, there will be 100,000 drones in the sky of Shenzhen every day, and millions of drones taking off every day."
In addition to the basic single-round prompt function, T-Rex also supports three advanced modes:
The first works like a multi-round dialogue: adding further prompts produces more accurate results and avoids missed detections.
The second is suited to scenarios where an ambiguous visual prompt causes false detections.
Cross-image mode uses a prompt drawn on a single reference image to detect objects in other images.
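The idea behind these modes can be illustrated with a minimal sketch. It is not T-Rex's actual method: it assumes each candidate region is already described by a feature vector, and it assumes ambiguity is resolved by also marking counter-examples, which the text above does not spell out. The function names, the 2-D toy features, and the threshold are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def detect(candidates, positives, negatives=(), threshold=0.8):
    """Return indices of candidates that resemble a positive prompt
    more than any negative prompt (hypothetical matching rule)."""
    hits = []
    for i, feat in enumerate(candidates):
        pos = max(cosine(feat, p) for p in positives)
        neg = max((cosine(feat, n) for n in negatives), default=-1.0)
        if pos >= threshold and pos > neg:
            hits.append(i)
    return hits

# Prompt feature taken from a box drawn on a reference image (toy values).
reference_prompt = [1.0, 0.1]
# Candidate region features from a different target image (toy values).
target_candidates = [[0.9, 0.1], [0.1, 1.0], [1.0, 0.2], [0.8, 0.6]]

# One positive prompt only: candidate 3 is a look-alike and slips through.
single_round = detect(target_candidates, [reference_prompt])
# Marking a counter-example removes the false detection.
with_negative = detect(target_candidates, [reference_prompt],
                       negatives=[[0.7, 0.7]])
print(single_round, with_negative)
```

Passing more than one entry in `positives` plays the role of additional prompt rounds, and taking the prompt from one image while the candidates come from another mirrors the cross-image case.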
According to reports, T-Rex is not restricted to predefined categories and can use visual examples to specify detection targets. This solves the problem that some objects are hard to describe fully in words, and it makes prompting more efficient. The effect is especially pronounced for complex components in industrial scenarios.
In addition, because the process is interactive, users can quickly evaluate the detection results at any time and correct errors.
T-Rex consists of three main components: an image encoder, a prompt encoder, and a box decoder.
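How data might flow through these three components can be shown with a toy sketch. Everything in it is an assumption made for illustration: the real encoders and decoder are neural networks, whereas here the "image" is a small grid of intensities, the feature of a cell is simply its intensity, and the function names and tolerance are hypothetical.

```python
def image_encoder(image):
    """Toy stand-in: the feature of each cell is its intensity."""
    return [[float(v) for v in row] for row in image]

def prompt_encoder(features, box):
    """Toy stand-in: average the features inside the user's box.
    box is (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    vals = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(vals) / len(vals)

def box_decoder(features, prompt_embedding, tol=0.5):
    """Toy stand-in: emit a 1x1 box for every cell whose feature
    is close to the prompt embedding."""
    boxes = []
    for r, row in enumerate(features):
        for c, f in enumerate(row):
            if abs(f - prompt_embedding) <= tol:
                boxes.append((r, c, r + 1, c + 1))
    return boxes

# A 3x4 toy image with three bright "grains" (9) and one distractor (5).
image = [
    [0, 9, 0, 0],
    [0, 0, 0, 9],
    [9, 0, 5, 0],
]
features = image_encoder(image)
prompt = prompt_encoder(features, (0, 1, 1, 2))  # user boxes one grain
boxes = box_decoder(features, prompt)
print(len(boxes), boxes)
```

The count of objects is simply the number of boxes returned, which is how a single drawn box can turn into a tally of every similar object.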
This work comes from the Computer Vision and Robotics Research Center of the IDEA Research Institute.
The team’s earlier open-source work includes DINO, the first DETR-style model to rank first on the COCO object detection leaderboard; Grounding DINO, a zero-shot detector that became a hit on GitHub (11K stars so far); and Grounded SAM, which can detect and segment anything. For more technical details, see the link at the end of the article.
In addition, several other research results were shared at the IDEA conference.
One example is the knowledge-driven large model Think-on-Graph. Simply put, it combines a large model with a knowledge graph.
Large models are good at intention understanding and autonomous learning, while knowledge graphs are better at logical chain reasoning because of their structured knowledge storage methods.
Think-on-Graph drives the large model agent to "think" on the knowledge graph, gradually searching for and inferring the best answer (it searches and reasons step by step over the entities associated in the knowledge graph). The large model takes part directly in every reasoning step, so that the model and the knowledge graph compensate for each other's weaknesses.
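The step-by-step search described above can be sketched as a small beam search over a toy knowledge graph. This is only an illustration under stated assumptions, not the Think-on-Graph implementation: `score_relation` is a hypothetical keyword-overlap stand-in for the large model's judgement, the graph is a hand-written dictionary, and the search stops after a fixed depth instead of asking the model whether it already has enough information.

```python
# Toy knowledge graph: head entity -> list of (relation, tail entity).
KG = {
    "Canberra": [("capital_of", "Australia"),
                 ("located_in", "Australian Capital Territory")],
    "Australia": [("currency", "Australian dollar"),
                  ("continent", "Oceania")],
}

def score_relation(question, relation):
    """Hypothetical stand-in for the large model: count relation
    keywords (longer than 3 letters) that appear in the question."""
    words = set(question.lower().replace("?", "").split())
    return sum(1 for tok in relation.split("_")
               if len(tok) > 3 and tok in words)

def think_on_graph(question, start, depth=2, beam_width=1):
    """Expand the best-scoring paths one hop at a time."""
    beams = [(0, [start])]  # (cumulative score, path)
    for _ in range(depth):
        expanded = []
        for score, path in beams:
            for relation, tail in KG.get(path[-1], []):
                expanded.append((score + score_relation(question, relation),
                                 path + [relation, tail]))
        if not expanded:
            break
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]  # prune to the most relevant paths
    return beams[0][1]

question = "What currency is used in the country whose capital is Canberra?"
path = think_on_graph(question, "Canberra")
print(" -> ".join(path))
```

The returned path doubles as an explanation of the answer: each hop names the relation that was followed, which is the kind of traceable reasoning chain a knowledge graph contributes.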
MoonBit is a developer platform powered by Wasm and designed for cloud computing and edge computing.
The platform not only provides a general-purpose programming language design, but also integrates a compiler, a build system, an integrated development environment (IDE), and deployment tools to improve the development experience and efficiency.
The previously released research tool ReadPaper has also been updated to 2.0. New functions such as a reading copilot and a polishing copilot were demonstrated at the conference.
At the end of the conference, Shen Xiangyang released the "White Paper on Low-altitude Economic Development (2.0) - Fully Digital Solution". In its Smart Integrated Lower Airspace System (SILAS), a new concept called the Temporal Spatial Process is proposed.
T-Rex link:
https://trex-counting.github.io/