This article introduces DetZero, an offline 3D object detection framework. In a comprehensive evaluation on the Waymo Open Dataset (WOD), DetZero generates continuous and complete object trajectory sequences and makes full use of long-term point cloud features to significantly improve the quality of perception results. It ranks first on the WOD 3D object detection leaderboard with 85.15 mAPH (L2). DetZero can also provide high-quality automatic labels for online model training, with results that match or even exceed the level of manual labeling.
Paper: https://arxiv.org/abs/2306.06023
Code: https://github.com/PJLab-ADG/DetZero
Project page: https://superkoma.github.io/detzero-page
To improve data annotation efficiency, we study a new approach. It builds on deep learning and unsupervised learning and generates annotated data automatically. With large amounts of unlabeled data, an autonomous driving perception model can be trained to recognize and detect objects on the road. The approach not only reduces labeling cost but also improves post-processing efficiency. In our experiments we use 3DAL[], Waymo's offline 3D object detection method, as the baseline; the results show that our method improves significantly in both accuracy and efficiency. We believe it will play an important role in future autonomous driving technology.
The optimization model based on motion state classification does not fully exploit the temporal features of an object. For example, the size of a rigid object stays constant over time, so observing it from different viewpoints allows a more accurate size estimate; and the object's trajectory should obey kinematic constraints, which shows up as trajectory smoothness. As shown in Figure (a) below, for dynamic objects the sliding-window optimization mechanism ignores the consistency of object geometry and updates the bounding box using only the temporal point cloud information of a few adjacent frames, so the predicted geometric size deviates. In example (b), aggregating all of the object's point clouds yields dense temporal point cloud features, from which an accurate bounding box size can be predicted for every frame.
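The aggregation idea in example (b) can be illustrated with a minimal sketch. It is not DetZero's implementation: it assumes each frame of a track supplies the object's points in world coordinates together with a box center and a heading about the vertical axis, and the dictionary keys `points`, `center`, and `heading` are hypothetical names chosen for illustration.

```python
import numpy as np

def to_box_frame(points, center, heading):
    """Express world-frame points of shape (N, 3) in the local frame of a box."""
    c, s = np.cos(heading), np.sin(heading)
    # Inverse yaw rotation: maps world coordinates into the box frame.
    rot = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    return (points - center) @ rot.T

def aggregate_track_points(frames):
    """Concatenate one object's per-frame points in its own box frame.

    `frames` is a list of dicts with hypothetical keys:
    "points" (N, 3), "center" (3,), and "heading" (float, radians).
    """
    local = [to_box_frame(f["points"], f["center"], f["heading"]) for f in frames]
    return np.concatenate(local, axis=0)
```

Because the object is rigid, a surface point observed in different frames lands at the same local coordinates after this transform, so the aggregated cloud becomes denser as more frames are added.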
This paper proposes DetZero, a new offline 3D object detection framework with the following characteristics: (1) a multi-frame 3D detector and an offline tracker serve as upstream modules that provide accurate and complete object tracking, with a focus on high recall of object sequences (track-level recall); (2) the downstream module is an attention-based optimization model that uses long-term point cloud features to learn and predict different object attributes, including refined geometric size, smooth trajectory positions, and updated confidence scores.
We use the public CenterPoint[] as the base detector. To provide more detection candidates, we enhance it in three ways: (1) use different combinations of point cloud frames as input to maximize performance; (2) use point cloud density information to fuse raw point features and voxel features in a second-stage module that refines the first-stage bounding box results; (3) apply test-time augmentation (TTA), multi-model ensembling, and related techniques to improve the model's adaptability to complex environments.
The offline tracking module introduces a two-stage association strategy to reduce false matches: boxes are split into high-confidence and low-confidence groups; the high-confidence group is associated first to update existing trajectories, and trajectories that were not updated are then associated with the low-confidence group. A trajectory is also allowed to persist until the end of the sequence, which avoids ID switches. In addition, we run the tracking algorithm in reverse to generate a second set of trajectories, associate the two sets by position similarity, and fuse the successfully matched trajectories with the WBF strategy, which further improves the completeness of the beginning and end of each sequence. Finally, for each distinct object sequence, the corresponding point cloud of every frame is extracted and saved; redundant boxes that were never updated and some shorter sequences are merged directly into the final output without downstream optimization.
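The two-stage association can be sketched for a single frame as follows. This is an illustration under assumptions, not the tracker used in DetZero: the article does not state the matching metric or thresholds, so the greedy center-distance matching, `score_thr=0.5`, and `max_dist=2.0` are placeholder choices, and all function names are hypothetical.

```python
import numpy as np

def greedy_match(track_xy, det_xy, max_dist):
    """Greedily pair tracks and detections by increasing center distance."""
    if len(track_xy) == 0 or len(det_xy) == 0:
        return []
    dist = np.linalg.norm(track_xy[:, None, :] - det_xy[None, :, :], axis=-1)
    pairs, used_t, used_d = [], set(), set()
    for flat in np.argsort(dist, axis=None):
        t, d = (int(i) for i in np.unravel_index(flat, dist.shape))
        if dist[t, d] > max_dist:
            break  # all remaining candidate pairs are even farther apart
        if t in used_t or d in used_d:
            continue
        pairs.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return pairs

def two_stage_associate(track_xy, det_xy, det_score, score_thr=0.5, max_dist=2.0):
    """Return {track index: detection index}, matching high-score boxes first."""
    track_xy = np.asarray(track_xy, dtype=float).reshape(-1, 2)
    det_xy = np.asarray(det_xy, dtype=float).reshape(-1, 2)
    det_score = np.asarray(det_score, dtype=float)
    high = np.flatnonzero(det_score >= score_thr)
    low = np.flatnonzero(det_score < score_thr)

    assignment = {}
    # Stage 1: high-confidence detections update existing tracks.
    for t, d in greedy_match(track_xy, det_xy[high], max_dist):
        assignment[t] = int(high[d])
    # Stage 2: tracks that were not updated are matched to low-confidence detections.
    rest = np.array([t for t in range(len(track_xy)) if t not in assignment], dtype=int)
    for t, d in greedy_match(track_xy[rest], det_xy[low], max_dist):
        assignment[int(rest[t])] = int(low[d])
    return assignment
```

Matching the high-confidence group first means a low-score box cannot take over a track that already has a confident match, while a low-score box can still keep an otherwise unmatched track alive.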
Previous object-centric optimization models ignored the correlations between objects in different motion states, such as the consistency of geometric shape and the consistency of an object's motion state at adjacent moments. Based on these observations, we decompose the traditional bounding box regression task into three modules that predict the geometry, position, and confidence attributes of objects, respectively.
With 85.15 mAPH (L2), DetZero achieves the best result. It shows a significant performance advantage both over methods that process long-term point clouds and over state-of-the-art multi-modal fusion 3D detectors.
Waymo 3D detection leaderboard results. All results use TTA or ensemble techniques; † denotes an offline model, ‡ a point cloud and image fusion model, and * an anonymous submission.
Similarly, thanks to the accuracy of the detection boxes and the completeness of the object tracking sequences, we ranked first on the Waymo 3D tracking leaderboard with 75.05 MOTA (L2).
Waymo 3D tracking leaderboard results; * denotes an anonymous submission.
To better verify the contribution of each proposed module, we ran ablation experiments on the Waymo validation set and adopted stricter IoU thresholds as an evaluation criterion.
Ablation results for Vehicle and Pedestrian on the Waymo validation set, with IoU thresholds at the standard values (0.7 and 0.5) and the strict values (0.8 and 0.6), respectively.
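To make the effect of the stricter thresholds concrete, here is a small sketch. It is a simplification: Waymo evaluates rotated 3D boxes, whereas this example assumes axis-aligned boxes given as (cx, cy, cz, length, width, height), and the function and dictionary names are hypothetical. Only the threshold values come from the caption above.

```python
# IoU thresholds from the ablation setup: (Vehicle, Pedestrian).
IOU_THRESHOLDS = {
    "standard": {"Vehicle": 0.7, "Pedestrian": 0.5},
    "strict": {"Vehicle": 0.8, "Pedestrian": 0.6},
}

def axis_aligned_iou_3d(box_a, box_b):
    """IoU of two unrotated boxes given as (cx, cy, cz, length, width, height)."""
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i] - box_a[i + 3] / 2, box_b[i] - box_b[i + 3] / 2)
        hi = min(box_a[i] + box_a[i + 3] / 2, box_b[i] + box_b[i + 3] / 2)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    return inter / (vol_a + vol_b - inter)

def is_true_positive(pred, gt, category, setting):
    """A prediction counts as a match if its IoU reaches the category threshold."""
    return axis_aligned_iou_3d(pred, gt) >= IOU_THRESHOLDS[setting][category]

# A vehicle box shifted by 0.5 m along its length: accepted under the
# standard threshold but rejected under the strict one.
gt_box = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
pred_box = (0.5, 0.0, 0.0, 4.0, 2.0, 1.5)
print(is_true_positive(pred_box, gt_box, "Vehicle", "standard"))
print(is_true_positive(pred_box, gt_box, "Vehicle", "strict"))
```

In this example the overlap is 10.5 m³ out of a union of 13.5 m³, an IoU of about 0.78, so the stricter setting rewards exactly the kind of size and position refinement the downstream module targets.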
In addition, for the same set of detection results, we cross-combined the trackers and optimization models of 3DAL and DetZero. The results further show that DetZero's tracker and optimizer each perform better, and that the advantage is largest when the two are combined.
Cross-validation experiments with different combinations of upstream and downstream modules. Subscripts 1 and 2 denote 3DAL and DetZero, respectively; the metric is 3D APH.
Our offline tracker pays more attention to the completeness of object sequences. Although the MOTA difference between the two trackers is very small, the difference in Recall@track is one of the reasons for the large gap in final optimization performance.
Comparison of MOTA and Recall@track between the offline tracker (Trk2) and the 3DAL tracker (Trk1).
A comparison with other state-of-the-art trackers further supports this point.
Recall@track is the sequence recall after processing by the tracking algorithm; 3D APH is the final performance after processing by the same optimization model.
To verify whether our optimization model overfits to a specific set of upstream results, we fed it upstream detection and tracking results of varying quality. The results show significant performance improvements, which further indicates that as long as the upstream module recalls more, and more complete, object sequences, our optimizer can effectively exploit their temporal point cloud features.
Generalization verification on the Waymo validation set; the metric is 3D APH.
Following the experimental settings of 3DAL, we report the AP of DetZero on 5 specified sequences. Human performance is measured by the consistency between single-frame re-annotation results and the original ground-truth annotations. Compared with both 3DAL and humans, DetZero shows advantages across the different metrics.
Performance comparison of 3D AP and BEV AP under different IoU thresholds for the Vehicle category
To verify whether high-quality automatic annotations can replace manual annotations for online model training, we ran a semi-supervised learning experiment evaluated on the Waymo validation set. We randomly selected 10% of the training data to train the teacher model (DetZero) and ran inference on the remaining 90% to obtain automatic annotations, which serve as labels for the student model. We chose single-frame CenterPoint as the student model. For the vehicle category, training with 90% automatic labels plus 10% ground-truth labels comes close to training with 100% ground-truth labels; for the pedestrian category, the model trained with automatic labels already outperforms the one trained with ground-truth labels. This shows that automatic labels can be used for online model training.
Semi-supervised experimental results on the Waymo validation set
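The data split behind this experiment can be sketched as follows. This is only an outline of the protocol described above, under assumptions: the article does not say whether the 10% is drawn per frame or per sequence, so the sketch works on generic sample IDs, and `teacher_predict` is a hypothetical callable standing in for DetZero inference.

```python
import random

def split_for_pseudo_labeling(sample_ids, labeled_fraction=0.1, seed=0):
    """Randomly split IDs into a ground-truth part and a pseudo-labeled part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_labeled = round(len(shuffled) * labeled_fraction)
    return shuffled[:n_labeled], shuffled[n_labeled:]

def build_student_labels(labeled_ids, unlabeled_ids, ground_truth, teacher_predict):
    """Combine ground-truth labels with teacher predictions for the rest."""
    labels = {i: ground_truth[i] for i in labeled_ids}
    # The teacher (trained on the labeled part only) annotates everything else.
    labels.update({i: teacher_predict(i) for i in unlabeled_ids})
    return labels
```

The student model is then trained on the combined label set, so its accuracy relative to a student trained on 100% ground truth reflects the quality of the automatic labels.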
Red boxes are the upstream input results and blue boxes are the outputs of the optimization model. The first row shows the upstream input, the second row shows the output of the optimization model, and the objects inside the dashed outlines mark positions that differ noticeably before and after optimization.
Original link: https://mp.weixin.qq.com/s/HklBecJfMOUCC8gclo-t7Q