This year’s ICCV 2023 best student paper award went to Qianqian Wang of Cornell University, who is currently a postdoctoral researcher at the University of California, Berkeley!
In the field of video motion estimation, the authors point out that traditional methods mainly fall into two camps: sparse feature tracking and dense optical flow. While both have proven effective in their respective applications, neither fully captures motion in video: pairwise optical flow cannot capture motion trajectories over long time windows, while sparse tracking does not model the motion of all pixels.
To bridge this gap, many studies have attempted to estimate dense, long-range pixel trajectories in videos. These methods range from simply chaining two-frame optical flow fields to directly predicting the trajectory of each pixel across multiple frames. However, they often consider only limited context when estimating motion and ignore information that is far away in time or space. This shortsightedness can lead to error accumulation over long trajectories and to spatiotemporal inconsistencies in the estimated motion. Although some methods do consider long-term context, they still operate in the 2D domain, which can cause tracking to be lost during occlusion events.
Overall, dense, long-range trajectory estimation in video remains an open problem in the field. It involves three main challenges: 1) maintaining trajectory accuracy over long sequences, 2) tracking the position of points through occlusions, and 3) maintaining spatiotemporal consistency.
In this article, the authors propose a novel video motion estimation method that uses all the information in a video to jointly estimate the complete motion trajectory of every pixel. The method, called "OmniMotion", uses a quasi-3D representation in which a canonical 3D volume is mapped to a local volume at each frame. These mappings act as a flexible relaxation of dynamic multi-view geometry and can model camera and scene motion together. The representation not only guarantees cycle consistency but also keeps track of all pixels through occlusions. The authors optimize this representation per video, yielding a solution for motion over the entire video. Once optimized, the representation can be queried at any continuous coordinate of the video to obtain a motion trajectory spanning the whole video.
The proposed method can: 1) produce globally consistent, complete motion trajectories for all points in the entire video, 2) track points through occlusions, and 3) handle real videos with various combinations of camera and scene motion. On the TAP-Vid video tracking benchmark, the method performs strongly, far surpassing previous methods.
The paper proposes a test-time optimization method for estimating dense, long-range motion from a video sequence. First, let's take an overview of the method.
This method provides a comprehensive and coherent representation of video motion and can effectively handle challenging cases such as occlusion. Now let's look at the details.
Video content is represented by a canonical volume G, which acts as a three-dimensional atlas of the observed scene. Similar to NeRF, a coordinate-based network maps each canonical 3D coordinate (u, v, w) in G to a density σ and a color c. The densities stored in G tell us where surfaces lie in canonical space; combined with the 3D bijections, this lets us track surfaces across frames and reason about occlusion relationships. The colors stored in G allow us to compute a photometric loss during optimization.
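To make this concrete, here is a minimal sketch (in PyTorch) of what such a coordinate-based network could look like. The class name `CanonicalField`, the layer widths, and the sinusoidal positional encoding are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CanonicalField(nn.Module):
    """Illustrative NeRF-style network: maps a canonical 3D coordinate
    (u, v, w) in the volume G to a density sigma and a color c."""

    def __init__(self, num_freqs: int = 6, hidden: int = 256):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 + 3 * 2 * num_freqs  # raw coords + sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def positional_encoding(self, x: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal encoding of each coordinate at several frequencies.
        feats = [x]
        for k in range(self.num_freqs):
            feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, uvw: torch.Tensor):
        out = self.mlp(self.positional_encoding(uvw))
        sigma = torch.relu(out[..., :1])     # non-negative density
        color = torch.sigmoid(out[..., 1:])  # RGB in [0, 1]
        return sigma, color
```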
3.2 3D bijections
This article introduces a set of continuous bijective mappings, denoted T_i, which convert 3D points from the local coordinate system of each frame i to a canonical 3D coordinate system. The canonical coordinate serves as a time-consistent reference or "index" for a scene point or 3D trajectory. The main advantage of using bijective mappings is the cycle consistency they provide for 3D points across different frames, since they all originate from the same canonical point. The mapping from a 3D point x_i in local frame i to the corresponding point in local frame j is:
x_j = T_j⁻¹(T_i(x_i))
To capture complex real-world motion, these bijections are parameterized as invertible neural networks (INNs). Real-NVP was chosen as the model for its simplicity and its analytically invertible structure. Real-NVP builds a bijective mapping out of basic transformations called affine coupling layers: each layer splits the input so that one part stays unchanged while the other part undergoes an affine transformation computed from the unchanged part. The architecture is further conditioned on a per-frame latent code latent_i, so all of the invertible mappings T_i are computed by a single invertible network but with different latent codes.
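As a rough illustration of how such a bijection might be built, the sketch below implements a single Real-NVP-style affine coupling layer conditioned on a per-frame latent code, plus the frame-i-to-frame-j composition through canonical space. In practice several coupling layers with alternating splits would be stacked; the split choice, MLP size, and helper names here are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP-style affine coupling layer for 3D points, conditioned on
    a per-frame latent code. Keeping the first two coordinates fixed and
    transforming the third is an illustrative choice."""

    def __init__(self, latent_dim: int = 32, hidden: int = 128):
        super().__init__()
        # Predict a log-scale and shift for the third coordinate from the
        # two unchanged coordinates plus the frame's latent code.
        self.net = nn.Sequential(
            nn.Linear(2 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # latent must share x's leading (batch) dimensions.
        x12, x3 = x[..., :2], x[..., 2:]
        log_s, t = self.net(torch.cat([x12, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([x12, x3 * torch.exp(log_s) + t], dim=-1)

    def inverse(self, y: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # Exact analytic inverse of forward().
        y12, y3 = y[..., :2], y[..., 2:]
        log_s, t = self.net(torch.cat([y12, latent], dim=-1)).chunk(2, dim=-1)
        return torch.cat([y12, (y3 - t) * torch.exp(-log_s)], dim=-1)


def map_i_to_j(x_i, coupling, latent_i, latent_j):
    """x_j = T_j^{-1}(T_i(x_i)): local frame i -> canonical -> local frame j,
    using one shared invertible network and two per-frame latent codes."""
    canonical = coupling.forward(x_i, latent_i)
    return coupling.inverse(canonical, latent_j)
```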
3.3 Computing frame-to-frame motion
This section describes how to compute the 2D motion of any query pixel in frame i. Intuitively, the query pixel is first "lifted" to 3D by sampling points along its ray; these 3D points are then "mapped" to the target frame j using the bijections T_i and T_j, "rendered" into a single point by alpha compositing the mapped samples, and finally "projected" back to 2D to obtain a putative correspondence.
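The query pipeline described above might look roughly like the following sketch, which reuses the hypothetical `CanonicalField` and `AffineCoupling` helpers from the earlier snippets. The `camera.unproject`/`camera.project` helper and the simple alpha-compositing scheme are assumptions made for readability, not the paper's exact formulation.

```python
import torch

def frame_to_frame_motion(pixel_i, depths, camera, field, coupling,
                          latent_i, latent_j):
    """Sketch: 2D correspondence of a query pixel from frame i to frame j.
    latent_i and latent_j are assumed to be 1D tensors of shape (latent_dim,)."""
    # 1) "Lift": sample K points in 3D along the ray through the query pixel.
    xs_i = torch.stack([camera.unproject(pixel_i, d) for d in depths])   # (K, 3)

    # 2) "Map": frame i -> canonical -> frame j with the bijections T_i, T_j.
    lat_i = latent_i.expand(xs_i.shape[0], -1)
    lat_j = latent_j.expand(xs_i.shape[0], -1)
    canonical = coupling.forward(xs_i, lat_i)
    xs_j = coupling.inverse(canonical, lat_j)                            # (K, 3)

    # 3) "Render": alpha-composite the mapped samples using canonical densities.
    sigma, _ = field(canonical)                                          # (K, 1)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1))                          # (K,)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = (trans * alpha).unsqueeze(-1)                              # (K, 1)
    x_j = (weights * xs_j).sum(dim=0) / weights.sum().clamp(min=1e-8)    # (3,)

    # 4) "Project": back to 2D in frame j to get the putative correspondence.
    return camera.project(x_j)
```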
The ablation experiments on the DAVIS dataset provide valuable insight into the role each component plays in overall system performance. The results clearly show that, although every component contributes, invertibility may be the most important one: when it is removed, performance drops drastically, which underscores the importance of enforcing invertibility in dynamic video analysis. Removing the photometric loss also degrades performance, though not nearly as severely as removing invertibility. The uniform sampling strategy likewise has an effect, but a comparatively small one. Finally, the full approach, which integrates all of these components, achieves the best performance. Overall, these ablations show how the various components interact and what each contributes, underlining the value of an integrated design when building and optimizing video motion estimation methods.
However, like many motion estimation methods, our method has difficulty handling fast and highly non-rigid motion as well as small structures. In these scenarios, pairwise correspondence methods may not provide enough reliable correspondences for our method to compute accurate global motion. Additionally, due to the highly non-convex nature of the underlying optimization problem, we observe that for certain difficult videos the optimization process can be very sensitive to initialization. This can lead to suboptimal local minima, for example incorrect surface ordering or duplicated objects in the canonical space, which are sometimes difficult to correct through optimization.
Finally, our method can be computationally expensive in its current form. First, the flow collection step involves exhaustively computing all pairwise flows, which grows quadratically with sequence length. We believe the scalability of this step can be improved with more efficient matching strategies, such as vocabulary trees or keyframe-based matching, drawing inspiration from the structure-from-motion and SLAM literature. Second, like other methods using neural implicit representations, our method involves a relatively long optimization process. Recent research in this area may help accelerate this process and further extend it to longer sequences.
This paper proposes a new test-time optimization method for estimating complete and globally consistent motion across a video. It introduces a new video motion representation called OmniMotion, which consists of a quasi-3D canonical volume and per-frame local-canonical bijections. OmniMotion can handle ordinary videos with different camera setups and scene dynamics, and produces accurate, smooth long-range motion even through occlusions. It achieves significant improvements over previous state-of-the-art methods, both qualitatively and quantitatively.
Original link: https://mp.weixin.qq.com/s/HOIi5y9j-JwUImhpHPYgkg