
More beautiful image generation, minute-level video output: the leapfrog journey of a domestically self-developed DiT architecture

王林
Release: 2024-07-12 18:49:14
In the blink of an eye, 2024 is already half over. An increasingly obvious trend has emerged in AI, and in AIGC especially: the text-to-image track has entered a stage of steady progress and accelerating commercialization, but static images alone can no longer satisfy people's demands on generative AI. The appetite for dynamic video creation has never been higher.
As a result, the text-to-video track remains hot. Ever since OpenAI released Sora at the beginning of the year, video generation models built on the Diffusion Transformer (DiT) architecture have seen explosive growth, and video generation model makers at home and abroad have quietly entered a technology race.

In China, one generative AI startup keeps appearing in the public eye: HiDream.ai, founded in March 2023 and focused on building visual multimodal foundation models and applications. Its self-developed visual multimodal foundation model enables generation and conversion across modalities, supporting text-to-image, text-to-video, image-to-video, and text-to-3D, and the company has launched "Pixeling", a one-stop AI image and video generation platform open to the public.

Try it here: www.hidreamai.com

Since its launch in August 2023, the Zhixiang large model has gone through several rounds of iteration and polishing, optimizing the foundation model to deepen and expand AIGC capabilities such as text-to-image and text-to-video. In video generation in particular, the supported clip length has grown from the initial 4 seconds to 15 seconds, with visibly better image quality.

Now the Zhixiang large model has been upgraded again. Built on its unique, China-native DiT architecture, it delivers more powerful, more stable, and easier-to-use image and video generation, including more aesthetic and artistic image generation, text embedding in images, and minute-level video generation.
None of these new image and video generation capabilities would be possible without the technical accumulation and continuous innovation of HiDream.ai (Zhixiang Future) in multimodal visual generation.

Generation quality keeps improving
A more powerful foundation model is the engine

From the start, the Zhixiang large model has targeted joint modeling of text, images, video, and 3D. Its interactive generation technology enables precise, controllable multimodal content generation and strong prototype capabilities, giving users a better creative experience on its text-to-image and text-to-video AIGC platform.
Compared with version 1.0, this comprehensive upgrade to Zhixiang Large Model 2.0 brings qualitative changes to the underlying architecture, training data, and training strategy, yielding another leap in text, image, video, and 3D multimodal capability and a tangible improvement in the interactive experience.


The upgraded Zhixiang model has been enhanced across the board in image and video generation, injecting stronger momentum into its one-stop multimodal AIGC creation platform.


Text-to-image has evolved again
With a higher level of "pursuit"

For a one-stop AIGC generation platform, text-to-image is both the prerequisite for text-to-video and an important technical moat. HiDream.ai therefore places high expectations on text-to-image, pushing at its own pace toward more diverse functions, more realistic visuals, and a more user-friendly experience.

After a series of targeted adjustments and optimizations, the text-to-image capability of Zhixiang Large Model 2.0 is markedly better than previous versions, as several externally visible results make clear.

First of all, the images generated by Zhixiang Large Model 2.0 are more beautiful and artistic. Today's text-to-image models already do well on the more objective aspects, such as semantic understanding, image structure, and picture detail, but they can fall short on more subjective qualities such as texture, beauty, and artistry. The pursuit of beauty therefore became the focus of this text-to-image upgrade. How well does it work? Consider the following two examples.

The prompt for the first example is "a little girl wearing a huge hat, with many castles, flowers, trees, and birds on the hat, colorful, close-up, details, illustration style".

[Generated image]

The prompt for the second example is "close-up photo of green plant leaves, dark theme, water-drop details, mobile wallpaper".

[Generated image]

Both generated images are eye-catching in composition, tone, and richness of detail, greatly enhancing the overall visual appeal.

Besides making generated images look better, the model's prompt relevance is also stronger, an aspect that draws close attention once image generation reaches a certain stage of maturity.

To improve relevance, the Zhixiang model focuses on strengthening its understanding of complex logic, such as different spatial layouts, positional relationships, object categories, and object counts, all important factors in achieving higher relevance. After this training, the model easily handles image generation tasks involving multiple objects, multi-position layouts, and complex spatial logic, better meeting users' real-world needs.

Let's look at the following three examples, each requiring a deep understanding of different objects and their spatial relationships. The results show that the text-to-image model now handles long and short prompts containing complex logic with ease.

The prompt input for the first example is "There are three baskets filled with fruit on the kitchen table. The middle basket is filled with green apples. The left basket is filled with strawberries. The right basket is filled with Blueberries. Behind the basket is a white dog. The background is a turquoise wall with the colorful text "Pixeling v2".

[Generated image]

The prompt for the second example is "a cat is on the right, a dog is on the left, and a green cube is placed on a blue ball in the middle".

[Generated image]

The prompt for the third example is "On the moon, an astronaut is riding a cow, wearing a pink tutu skirt and holding a blue umbrella. To the right of the cow is a penguin wearing a top hat. The text 'HiDream.AI' is written at the bottom".

[Generated image]

Meanwhile, text embedded in generated images is now rendered more accurately and efficiently, a capability used frequently in posters and marketing copy.

Technically, generating text embedded in an image requires the large model to deeply understand both the visual description and the exact text content in the input prompt, depicting the text accurately while preserving the image's overall beauty and artistry.

In an exclusive interview with this site, Dr. Yao Ting, CTO of HiDream.ai, noted that earlier versions often failed to generate such images at all, and even when they succeeded, the rendered characters and their accuracy fell short. Those problems have now been largely solved: the Zhixiang model can embed long text, up to dozens of characters, into images.

The three examples below, from left to right, show good text embedding; in the rightmost image, more than twenty words plus punctuation marks are embedded accurately.

[Generated image]

It is fair to say the text-to-image capability of the Zhixiang model has achieved industry-leading results, laying a key foundation for video generation.

Video generation has reached the minute level

If the upgraded Zhixiang Large Model 2.0 has advanced steadily in text-to-image, it has leapt forward in text-to-video.

Last December, the Zhixiang model's text-to-video capability broke the 4-second barrier and supported generation beyond 15 seconds. Half a year on, it has improved markedly in duration, naturalness of the picture, and consistency of content and characters, thanks to its mature self-developed DiT architecture.

Compared with U-Net, the DiT architecture is more flexible and can raise the quality of image and video generation; Sora's debut demonstrated this vividly. Diffusion models built on this kind of architecture show a natural tendency toward high-quality images and videos and hold relative advantages in the customizability and controllability of generated content. The DiT architecture adopted by Zhixiang Large Model 2.0 also has some unique features of its own.
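
For context, the sketch below shows the core building block of a generic DiT in PyTorch, following the publicly described adaLN-Zero recipe from the original DiT paper: a Transformer block whose normalization is modulated by a conditioning vector (diffusion timestep plus prompt embedding). It is a minimal illustration of the general technique, not HiDream.ai's proprietary implementation, whose details are not public.

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    # Minimal Diffusion Transformer block with adaLN-Zero conditioning, per the
    # public DiT recipe. Illustrative only; the Zhixiang blocks are
    # self-developed and undocumented.
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress per-block shift/scale/gate from the conditioning vector
        # (timestep + prompt embedding); zero-init so the block starts as identity.
        self.adaLN = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.adaLN.weight)
        nn.init.zeros_(self.adaLN.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) noised latent patches; cond: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.adaLN(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)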

The underlying implementation of the DiT architecture is based on the Transformer. Zhixiang Large Model 2.0 uses fully self-developed modules across the entire Transformer network structure, training data composition, and training strategy, with particular care taken over the network training strategy.

First, the Transformer network adopts an efficient spatio-temporal joint attention mechanism, which not only fits video's characteristics in both the spatial and temporal domains but also solves the hard problem of traditional attention being too slow to keep up during actual training.
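
HiDream.ai has not disclosed its exact attention design. One standard way to make video attention efficient, sketched below under that assumption, is to factorize full spatio-temporal attention into spatial attention within each frame followed by temporal attention across frames, cutting the quadratic cost of attending over all frames and patches at once.

import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    # Illustrative factorized spatio-temporal attention. Full attention over
    # all T*S video tokens costs O((T*S)^2); attending within each frame and
    # then across frames reduces this to O(T*S^2 + S*T^2). One standard
    # formulation, not the (undisclosed) Zhixiang mechanism.
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches_per_frame, dim)
        b, t, s, d = x.shape
        # Spatial attention: each frame attends over its own patches.
        xs = x.reshape(b * t, s, d)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(b, s, t, d).transpose(1, 2)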

Second, generating long shots in AI video tasks places higher demands on the sourcing and screening of training data. The Zhixiang large model therefore supports training on video clips of up to several minutes, even ten minutes, making the direct output of minute-long videos possible. Describing minute-level video content is also difficult, so HiDream.ai independently developed a captioning model that produces detailed, accurate video descriptions.
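
The captioning model itself is unpublished. The sketch below shows one plausible way to describe a minute-long clip: caption short segments independently, then merge the per-segment captions into one detailed description. The caption_segment and merge_captions callables are hypothetical stand-ins, not HiDream.ai APIs.

from typing import Callable, List, Sequence

def caption_long_video(
    frames: Sequence,                             # decoded video frames
    fps: float,
    caption_segment: Callable[[Sequence], str],   # hypothetical per-segment captioner
    merge_captions: Callable[[List[str]], str],   # hypothetical merger, e.g. an LLM
    segment_seconds: float = 10.0,
) -> str:
    # Split a minute-long clip into short segments, caption each
    # independently, and merge the results into one description.
    step = max(1, int(segment_seconds * fps))
    segment_captions = [
        caption_segment(frames[i:i + step])
        for i in range(0, len(frames), step)
    ]
    return merge_captions(segment_captions)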

Finally, on training strategy: because long-shot video data is limited, Zhixiang Large Model 2.0 jointly trains on video and image data using clips of different lengths, dynamically varying the sampling rate for each length to complete long-shot training. Reinforcement learning on user feedback data is also applied during training to further optimize model performance.
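
The article gives no schedule details, so the sketch below illustrates one plausible form of such joint training: bucket samples by frame count (length 1 = still images) and shift the sampling probability toward longer clips as training progresses. All bucket lengths and weights here are invented for illustration.

import random
from typing import Dict, List

def length_schedule(step: int, total_steps: int) -> Dict[int, float]:
    # Invented curriculum: early in training favor still images (1 frame) and
    # short clips; raise the sampling rate of long clips as training progresses.
    progress = step / total_steps
    weights = {1: 1.0, 16: 1.0, 64: 0.2 + progress, 256: 0.05 + 0.5 * progress}
    total = sum(weights.values())
    return {length: w / total for length, w in weights.items()}

def sample_batch(buckets: Dict[int, List], step: int, total_steps: int,
                 batch_size: int = 8) -> List:
    # Draw a batch from a single length bucket so every sample in the batch
    # shares a frame count, which keeps collation simple when images and
    # videos of different lengths are trained jointly.
    schedule = length_schedule(step, total_steps)
    lengths = list(schedule)
    probs = [schedule[length] for length in lengths]
    chosen = random.choices(lengths, weights=probs, k=1)[0]
    pool = buckets[chosen]
    return random.sample(pool, k=min(batch_size, len(pool)))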

In short, the more powerful self-developed DiT architecture provides the technical backing for the further improvement of text-to-video results.

The video duration supported by Zhixiang Large Model 2.0 has now grown from roughly 15 seconds to minutes, a high bar within the industry.

Beyond minute-level duration, variable duration and size are another highlight of this text-to-video upgrade.

Today's video generation models usually produce clips of a fixed length that users cannot choose. HiDream.ai will open duration selection to users, letting them specify a length or letting the model judge dynamically from the prompt: a more complex prompt yields a longer video, a simpler one a shorter video. This adaptive process fits users' creative needs, and the size of the generated video can likewise be customized, which is very user-friendly.

In addition, the overall look and feel of the picture is better: the actions and movements of objects in generated videos are more natural and fluid, details are rendered more faithfully, and 4K ultra-high-definition output is supported.
In just half a year, the upgraded text-to-video capability has been all but reborn compared with previous versions. Still, in Dr. Yao Ting's view, most video generation today, whether from HiDream.ai or its peers, remains at the single-shot stage. Measured against the L1 to L5 levels used in autonomous driving, text-to-video sits roughly at L2. Powered by this upgrade of foundation model capabilities, HiDream.ai will pursue higher-quality multi-shot video generation and has taken a key step toward exploring the L3 stage.


HiDream.ai says the upgraded text-to-video feature will go live in mid-July. It is well worth looking forward to!

Closing thoughts

In less than a year and a half since its founding, HiDream.ai has moved both steadily and fast in visual multimodal generation, whether in the continuous iteration of its foundation model capabilities or in the improved real-world experience of text-to-image and text-to-video, and it has won over a large number of consumer and enterprise users.

We understand that HiDream.ai's consumer product now sees more than a million visits per month, and the total number of AI images and videos generated has passed ten million. A low barrier to entry and practical applications define the Zhixiang model, on which the company has built the first AIGC application platform best suited to the general public.

On the enterprise side, HiDream.ai has reached strategic cooperation agreements with companies including China Mobile, Lenovo, iFLYTEK, Shanghai Film Group, Ciwen Media, Digital China, CCTV.com, Yinxiang Biji, Tiangong Yicai, and Hangzhou Lingban, deepening model application scenarios and extending model capabilities to more industries, including telecom operators, smart devices, film and TV production, e-commerce, culture and tourism promotion, and brand marketing, ultimately unlocking the model's potential and creating value through commercial deployment.

At present, the Zhixiang model serves about 100 leading enterprise customers and has provided AIGC services to more than 30,000 small-business customers.


Before the release of Zhixiang Large Model 2.0, HiDream.ai had already partnered with China Mobile's Migu Group to launch the mass-market AIGC application "AI Yiyu Chengpian" ("one sentence to video"), which not only lets ordinary users create AI video ringtones with zero experience but also helps enterprise customers produce rich brand and marketing video content, giving companies their own branded ringtones and showing the huge potential of video generation in industry scenarios.

The AI ecosystem is another important battleground for large-model vendors. Here HiDream.ai takes an open stance, working with major customers such as Lenovo, iFLYTEK, and Digital China, as well as small development teams and independent developers, to build a broad AI ecosystem that includes video generation and covers users' more diverse needs.

2024 is widely seen as the first year of large-model applications landing at scale, a critical juncture for every vendor. HiDream.ai is digging deep around stronger foundation model capabilities.

On one hand, it is strengthening image, video, and 3D multimodal understanding and generation within a unified framework, for example by continuing to optimize the underlying architecture, algorithms, and data for video generation to win bigger breakthroughs in duration and quality, becoming an indispensable part of the push toward future general artificial intelligence. On the other hand, it is working on user experience, innovative applications, and the industry ecosystem to expand its influence.

HiDream.ai is fully prepared to seize the high ground of the video generation track.


Source: jiqizhixin.com