Generate one image in 10 milliseconds, or 6,000 images in a single minute. What does that even look like? The picture below gives a vivid sense of AI's raw speed.
Even as you keep adding new elements to the prompt behind these anime-girl portraits, each variation on the style still flashes by in an instant.
This astonishing real-time generation speed comes from StreamDiffusion, proposed by researchers from UC Berkeley, the University of Tsukuba in Japan, and other institutions.
The new solution is a diffusion pipeline that enables real-time interactive image generation at over 100 fps.
Paper address: https://arxiv.org/abs/2312.12491
After being open sourced, StreamDiffusion shot straight up the GitHub trending charts, garnering 3.7k stars.
StreamDiffusion innovatively uses a batched denoising strategy in place of sequential denoising, making it roughly 1.5x faster than traditional approaches. Moreover, the residual classifier-free guidance (RCFG) algorithm the authors propose runs up to 2.05x faster than conventional classifier-free guidance.
Most notably, the new method achieves image-to-image generation at 91.07 fps on an RTX 4090.
In the future, StreamDiffusion can deliver fast generation in scenarios such as the metaverse, video game graphics rendering, and live video streaming, meeting the high-throughput demands of these applications.
In particular, real-time image generation offers powerful editing and creative capabilities to people working in game development and video rendering.
Today, across many fields, applications of diffusion models need a pipeline with high throughput and low latency to keep human-computer interaction efficient.
A typical example is using a diffusion model to drive a virtual character (VTuber) that responds fluidly to user input.
To improve throughput and real-time interactivity, current research mainly focuses on reducing the number of denoising iterations, for example from 50 down to just a few, or even a single step.
A common strategy is to distill a multi-step diffusion model into a few steps, or to reformulate the diffusion process using ODEs. Diffusion models have also been quantized to improve efficiency.
In the new paper, the researchers take an orthogonal direction and introduce StreamDiffusion: a real-time diffusion pipeline designed for high-throughput interactive image generation.
Existing model-design work can be integrated with StreamDiffusion, which can also use N-step denoising models, maintaining high throughput while giving users more flexible options.
Real-time image generation | Columns 1 and 2: examples of AI-assisted real-time drawing. Column 3: real-time rendering of a 3D avatar into a 2D illustration. Columns 4 and 5: real-time camera filters.
How is it implemented?
StreamDiffusion is a new diffusion pipeline designed to increase throughput.
It consists of several key parts:
a stream batching strategy, residual classifier-free guidance (RCFG), input-output queues, a Stochastic Similarity Filter, a pre-computation procedure, model acceleration tools, and a tiny autoencoder.
In diffusion models, denoising steps run sequentially, so U-Net processing time grows in proportion to the number of steps.
Yet to generate high-fidelity images, the number of steps has to be increased.
To solve the high-latency problem in interactive diffusion, the researchers propose a method called Stream Batch.
As the figure below shows, instead of waiting for one image to be fully denoised before processing the next input, the new method accepts the next input image after each denoising step.
This forms a denoising batch in which the denoising steps of different images are staggered.
By concatenating these interleaved denoising steps into a batch, researchers can use U-Net to efficiently process batches of consecutive inputs.
An input image encoded at time step t is generated and decoded at time step t + n, where n is the number of denoising steps.
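The interleaving described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `denoise_step` stands in for one step of the U-Net, and a real pipeline would run the inner loop as a single batched forward pass.

```python
from collections import deque

def stream_batch(inputs, n_steps, denoise_step):
    """Interleaved "Stream Batch" denoising sketch.

    Instead of fully denoising one image before accepting the next,
    a new input joins the batch after every denoising step; an image
    that enters at step t is finished and emitted at step t + n_steps.
    """
    buffer = deque()          # [latent, steps_done] pairs, one per in-flight image
    outputs = []
    stream = iter(inputs)
    exhausted = False
    while buffer or not exhausted:
        if not exhausted:
            try:
                buffer.append([next(stream), 0])   # accept the next input image
            except StopIteration:
                exhausted = True
        # In the real pipeline this loop is one batched U-Net call
        # over all in-flight latents at their staggered steps.
        for item in buffer:
            item[0] = denoise_step(item[0], item[1])
            item[1] += 1
        while buffer and buffer[0][1] >= n_steps:  # fully denoised: emit
            outputs.append(buffer.popleft()[0])
    return outputs
```

With a toy `denoise_step` such as `lambda x, t: x + 1` and `n_steps=3`, every input passes through exactly three staggered steps while new inputs keep flowing in behind it.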
Classifier-free guidance (CFG) is an algorithm that strengthens the effect of the original condition by computing a vector difference between the unconditional (or negative-conditional) term and the original conditional term.
This brings benefits such as amplifying the effect of the prompt.
However, to compute the negative-conditional residual noise, each input latent must be paired with a negative-condition embedding and passed through U-Net at every inference step.
To address this, the authors introduce an innovative residual classifier-free guidance (RCFG).
This method uses virtual residual noise to approximate the negative condition, so the negative-condition noise only needs to be computed once at the start of the process, significantly reducing the extra U-Net inference cost of negative-condition embeddings.
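The cost argument can be sketched as follows. Standard CFG blends a conditional and a negative noise prediction at every step (two U-Net passes), whereas RCFG approximates the negative term so the extra pass happens once, or never. The guidance formula below is standard CFG; the call-count function is an illustrative tally, not paper code.

```python
def cfg(eps_cond, eps_neg, scale):
    # Standard classifier-free guidance: push the prediction away from
    # the negative/unconditional term and toward the conditional term.
    return eps_neg + scale * (eps_cond - eps_neg)

def unet_calls(n_steps, mode):
    """Rough U-Net evaluation counts per generated image (sketch).

    'onetime_rcfg' runs the negative condition once up front; 'self_rcfg'
    replaces it entirely with virtual residual noise.
    """
    if mode == "cfg":
        return 2 * n_steps      # conditional + negative pass every step
    if mode == "onetime_rcfg":
        return n_steps + 1      # negative pass computed once, then reused
    if mode == "self_rcfg":
        return n_steps          # virtual residual noise, no extra pass
    raise ValueError(mode)
```

This is consistent with the timing results below: at one denoising step One-time-Negative RCFG costs about the same as CFG, while the gap widens as the step count grows.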
Input and output queue
Converting an input image into a tensor format the pipeline can handle, and converting the decoded tensor back into an output image, both take non-negligible extra processing time.
To avoid adding these image processing times to the neural network inference process, we separate image pre- and post-processing into different threads, thereby enabling parallel processing.
In addition, by using input tensor queues, it is also possible to cope with temporary interruptions in input images due to device failures or communication errors, allowing for smooth streaming.
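This producer/consumer layout can be sketched with Python's standard `queue` and `threading` modules. Here `preprocess`, `infer`, and `postprocess` are placeholder callables, not StreamDiffusion APIs; the point is that image conversion runs in its own threads and never blocks the inference loop.

```python
import queue
import threading

def run_pipeline(raw_frames, preprocess, infer, postprocess):
    """Sketch of the input/output queue design: pre- and post-processing
    run in separate threads so inference in the main thread stays busy."""
    in_q, out_q = queue.Queue(maxsize=8), queue.Queue()
    results = []

    def producer():
        for frame in raw_frames:
            in_q.put(preprocess(frame))   # image -> pipeline tensor format
        in_q.put(None)                    # sentinel: input stream finished

    def consumer():
        while True:
            item = out_q.get()
            if item is None:
                break
            results.append(postprocess(item))  # decoded tensor -> image

    threading.Thread(target=producer, daemon=True).start()
    consumer_t = threading.Thread(target=consumer, daemon=True)
    consumer_t.start()

    while True:                            # inference loop (main thread)
        x = in_q.get()
        if x is None:
            out_q.put(None)
            break
        out_q.put(infer(x))
    consumer_t.join()
    return results
```

Because both queues are FIFO, output order matches input order; a bounded input queue also absorbs brief input stalls, matching the smooth-streaming point above.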
The figure below shows the core diffusion inference pipeline, including the VAE and U-Net.
Introducing denoising batches plus caches for pre-computed prompt embeddings, sampled noise, and scheduler values speeds up the inference pipeline and enables real-time image generation.
The Stochastic Similarity Filter (SSF), designed to save GPU power, dynamically gates the diffusion pipeline, enabling fast and efficient real-time inference.
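A toy sketch of such a similarity-gated skip is shown below. The cosine-similarity measure and the eta threshold follow the description in this article, but the exact mapping from similarity to skip probability is an illustrative assumption, not the paper's formula.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def should_run_pipeline(frame, ref_frame, eta=0.98, rng=random.random):
    """Stochastic Similarity Filter sketch: the more similar the incoming
    frame is to the reference frame, the more likely the expensive
    diffusion pipeline is skipped for this frame (saving GPU power).
    The linear similarity-to-probability ramp here is an assumption.
    """
    sim = cosine_similarity(frame, ref_frame)
    if sim < eta:
        return True                        # clearly new content: always run
    skip_prob = (sim - eta) / (1.0 - eta)  # ramps from 0 at eta up to 1
    return rng() >= skip_prob
```

For a mostly static input stream, nearly identical frames yield a skip probability near 1, which matches the GPU-usage reductions reported in the energy evaluation below.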
The U-Net architecture requires both the input latent variables and a conditional embedding.
Normally, the conditional embedding is derived from the prompt embedding and stays the same across frames.
To exploit this, the researchers pre-compute the prompt embedding and store it in a cache, which is recalled in interactive or streaming mode.
Inside U-Net, the per-frame key and value computations are based on this pre-computed prompt embedding.
The researchers therefore modified U-Net to store these key-value pairs so they can be reused; whenever the input prompt is updated, the key-value pairs are recomputed and updated inside U-Net.
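The caching logic can be sketched as follows. `encode_prompt` and `make_kv` are hypothetical stand-ins for the text encoder and the cross-attention key/value projection; the point is that the expensive encoding runs only when the prompt actually changes.

```python
class PromptCache:
    """Sketch of the pre-computation scheme: the prompt embedding and the
    key/value pairs derived from it are computed once and reused across
    frames, and recomputed only when the prompt changes."""

    def __init__(self, encode_prompt, make_kv):
        self.encode_prompt = encode_prompt  # stand-in for the text encoder
        self.make_kv = make_kv              # stand-in for the K/V projection
        self._prompt = None
        self._kv = None
        self.encoder_calls = 0              # for illustration only

    def get_kv(self, prompt):
        if prompt != self._prompt:          # only on prompt updates
            self._prompt = prompt
            self.encoder_calls += 1
            self._kv = self.make_kv(self.encode_prompt(prompt))
        return self._kv                     # cached across streaming frames
```

In streaming mode the same cached pairs are returned frame after frame, so the per-frame cost reduces to the U-Net and VAE work alone.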
To optimize speed, we configured the system to use a static batch size and a fixed input size (height and width).
This approach ensures that the computation graph and memory allocation are optimized for the specific input size, resulting in faster processing.
However, this also means that images of a different shape (i.e., a different height and width) or a different batch size (including the batch size used for the denoising steps) cannot be processed without preparing the pipeline for those dimensions.
Figure 8 compares the efficiency of batch denoising against the original sequential U-Net loop.
When implementing the batch denoising strategy, the researchers found a significant improvement in processing time: roughly half the time of a traditional U-Net loop with sequential denoising steps.
Even with the neural-module acceleration tool TensorRT applied, the proposed stream batching still significantly improves on the original sequential diffusion pipeline across different numbers of denoising steps.
In addition, the researchers compared the new method against the AutoPipelineForImage2Image pipeline developed by Hugging Face Diffusers.
The average inference-time comparison is shown in Table 1; the new pipeline delivers a dramatic speedup.
With TensorRT, StreamDiffusion achieves a 13x speedup at 10 denoising steps, and up to 59.6x with a single denoising step.
Even without TensorRT, StreamDiffusion is 29.7x faster than AutoPipeline for single-step denoising and 8.3x faster at 10 steps.
Table 2 compares the inference time of the stream diffusion pipeline using RCFG versus conventional CFG.
With single-step denoising, the inference times of One-time-Negative RCFG and traditional CFG are almost identical. However, as the number of denoising steps increases, RCFG's speed advantage over traditional CFG becomes more pronounced.
At five denoising steps, Self-Negative RCFG is 2.05x faster than traditional CFG, and One-time-Negative RCFG is 1.79x faster.
The researchers then carried out a comprehensive assessment of energy consumption; the results appear in Figures 6 and 7.
These figures compare GPU usage when SSF (with the threshold eta set to 0.98) is applied to input videos containing periodically static scenes. The comparison shows that when the input consists mostly of static, highly similar frames, SSF significantly reduces GPU usage.
Ablation study
Qualitative results
Images generated without any form of CFG show weak alignment with the prompt: edits such as changing colors or adding elements not present in the source are not carried out effectively.
In contrast, using CFG or RCFG strengthens the ability to modify the original image, for example changing hair color, adding body patterns, or even adding objects such as glasses. Notably, RCFG amplifies the effect of the prompt more than standard CFG does.
Finally, the quality of the standard text-to-image generation results is shown in Figure 11.
Using the sd-turbo model, high-quality images like those in Figure 11 can be generated in just one step.
When generating images with the researchers' stream diffusion pipeline and the sd-turbo model on a machine with an RTX 4090 GPU, a Core i9-13900K CPU, and Ubuntu 22.04.3 LTS, it is feasible to produce images of this quality at over 100 fps.
Netizens get hands-on, and a wave of anime girls arrives
The project's code has been open sourced and has already collected 3.7k stars on GitHub.
Project address: https://github.com/cumulo-autumn/StreamDiffusion
Many netizens have started generating their own anime waifus.
There are also real-time animations.
10x speed hand-drawn generation.
If you're interested, why not try it yourself?