Autonomous driving systems rely on advanced perception, decision-making, and control technologies. Various sensors (such as cameras, lidar, and radar) sense the surrounding environment, and algorithms and models analyze the data and make decisions in real time. This enables vehicles to recognize road signs, detect and track other vehicles, and predict pedestrian behavior, allowing them to operate safely and adapt to complex traffic environments. The technology is attracting widespread attention and is considered one of the most important development areas in the future of transportation. What makes autonomous driving difficult, however, is making the car understand what is going on around it. This requires 3D object detection algorithms that can accurately perceive and describe objects in the surrounding environment, including their location, shape, size, and category. Such comprehensive environmental awareness helps autonomous driving systems better understand the driving environment and make more precise decisions.
We conducted a comprehensive evaluation of 3D object detection algorithms in autonomous driving, with a particular focus on robustness. Three key factors were identified in the evaluation: environmental variability, sensor noise, and misalignment. These factors strongly affect the performance of detection algorithms under changing real-world conditions.
The evaluation also covers three key areas of performance: accuracy, latency, and robustness.
The paper points out the significant advantages of multi-modal 3D detection methods for safety-oriented perception. By fusing data from different sensors, they provide richer and more diverse perception capabilities, thereby improving the safety of the autonomous driving system.
The above briefly introduces the 3D object detection datasets used in autonomous driving systems, focusing mainly on the advantages and limitations of different sensor modalities as well as the characteristics of public datasets.
First, the table shows three types of sensor data: camera, point cloud, and multimodal (camera and lidar). For each type, the hardware cost, advantages, and limitations are listed. Camera data provides rich color and texture information, but it lacks depth information and is susceptible to lighting and weather effects. LiDAR provides accurate depth information, but it is expensive and carries no color information.
Next, several public datasets are available for 3D object detection in autonomous driving, including KITTI, nuScenes, and Waymo:
- The KITTI dataset contains data released over multiple years, collected with different types of sensors. It provides a large number of frames and annotations, as well as a variety of scenes, including scene counts and categories and different scene types such as day, sunny, night, and rainy.
- The nuScenes dataset also contains data released over multiple years. It uses a variety of sensors and provides a large number of frames and annotations, covering a range of scenarios with different scene counts, categories, and scene types.
- The Waymo dataset is another autonomous driving dataset spanning multiple years. It uses different types of sensors, provides a rich set of frames and annotations, and covers a wide variety of scenarios.
Additionally, research on "clean" autonomous driving datasets is mentioned, and the importance of evaluating model robustness under noisy conditions is emphasized. Some studies focus on camera-only methods under harsh conditions, while other multi-modal datasets target noise issues. For example, the GROUNDED dataset focuses on ground-penetrating radar localization under different weather conditions, while the ApolloScape open dataset includes lidar, camera, and GPS data covering a variety of weather and lighting conditions.
Because collecting large-scale noisy data in the real world is prohibitively expensive, many studies turn to synthetic datasets. For example, ImageNet-C is a benchmark for evaluating image classification models against common corruptions. This research direction was subsequently extended to robustness datasets tailored for 3D object detection in autonomous driving.
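To make the idea of synthetic robustness benchmarks concrete, the following is a minimal sketch (not taken from any of the cited benchmarks) of how corruptions can be applied to a LiDAR point cloud before feeding it to a detector; the corruption types and severity values are illustrative assumptions.

```python
import numpy as np

def corrupt_point_cloud(points, drop_ratio=0.1, jitter_std=0.02, seed=0):
    """Apply two simple synthetic corruptions to an (N, 3) LiDAR point cloud:
    random point dropout (simulating missing returns) and Gaussian jitter
    (simulating range noise). Severity values are illustrative only."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(points)) > drop_ratio                  # randomly drop points
    kept = points[keep]
    return kept + rng.normal(0.0, jitter_std, size=kept.shape)   # add range noise

# Example: run a detector on clean vs. corrupted input and compare the results.
clean = np.random.rand(1000, 3) * 50.0                           # placeholder point cloud
noisy = corrupt_point_cloud(clean, drop_ratio=0.2, jitter_std=0.05)
print(clean.shape, noisy.shape)
```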
Recently, multi-view 3D object detection has improved in both accuracy and robustness, showing clear advantages over the monocular and stereo 3D object detection methods mentioned above. Unlike LiDAR-based 3D object detection, the latest panoramic Bird's Eye View (BEV) methods eliminate the need for high-precision maps and lift detection from 2D to 3D. This progress has driven significant developments in multi-view 3D object detection. In multi-camera 3D object detection, the key challenge is to identify the same object in different images and aggregate object features from multiple viewpoint inputs. A common practice in current methods is to uniformly map the multiple views into BEV space.
Direct conversion from 2D to BEV space poses a significant challenge. LSS was the first to propose a depth-based method that uses 3D space as an intermediary. It first predicts a grid-wise depth distribution over 2D features and then lifts these features into voxel space. This approach offers a path toward more effective transformation from 2D to BEV space. Following LSS, CaDDN adopts a similar depth representation: it compresses voxel-space features into BEV space and performs the final 3D detection there. It is worth noting that CaDDN is a single-view rather than a multi-view 3D object detector, but it has influenced subsequent depth-based research. The main difference between LSS and CaDDN is that CaDDN uses ground-truth depth values to supervise its categorical depth distribution, yielding a superior depth network that extracts 3D information from 2D images more accurately.
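The core "lift" operation described above can be sketched in a few lines. The snippet below is a simplified, hedged illustration of the depth-based idea (predict a categorical depth distribution per pixel and take its outer product with the image features to fill a camera frustum); it is not the authors' implementation, and the tensor shapes are assumptions.

```python
import numpy as np

def lift_features(img_feats, depth_logits):
    """Lift 2D image features into a camera frustum, in the spirit of LSS.

    img_feats:    (C, H, W) per-pixel feature vectors
    depth_logits: (D, H, W) unnormalized scores over D discrete depth bins
    returns:      (D, C, H, W) frustum features = depth probability x feature
    """
    depth_prob = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob /= depth_prob.sum(axis=0, keepdims=True)        # softmax over depth bins
    # Each pixel's feature is spread along its predicted depth distribution.
    return depth_prob[:, None] * img_feats[None]               # broadcasted outer product

frustum = lift_features(np.random.rand(64, 16, 44), np.random.rand(48, 16, 44))
print(frustum.shape)  # (48, 64, 16, 44), ready to be splatted into BEV voxels
```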
Influenced by Transformer technology, query-based multi-view methods retrieve 2D features from 3D space. DETR3D introduces 3D object queries to solve the aggregation problem of multi-view features: it gathers image features from different viewpoints by projecting learned 3D reference points into the 2D images and sampling the corresponding features, producing representations in Bird's Eye View (BEV) space. Unlike depth-based multi-view methods, query-based multi-view methods obtain sparse BEV features by querying back from 3D space into the images, which fundamentally shaped subsequent query-based work. However, because explicit 3D reference points can be inaccurate, PETR adopted an implicit positional encoding to construct the BEV space, influencing later work.
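A hedged sketch of the query mechanism described above is given below: each learned 3D reference point is projected into one camera view and the feature at that pixel is gathered. The projection-matrix convention and the nearest-neighbour sampling are simplifications (DETR3D itself uses bilinear sampling across multiple views and feature levels).

```python
import numpy as np

def sample_query_features(ref_points, img_feats, proj):
    """Project 3D reference points into an image and gather 2D features (DETR3D-style).

    ref_points: (Q, 3) learned 3D reference points in ego coordinates
    img_feats:  (C, H, W) feature map of one camera view
    proj:       (3, 4) camera projection matrix (intrinsics @ extrinsics)
    returns:    (Q, C) one sampled feature per query (zeros if outside the image)
    """
    C, H, W = img_feats.shape
    homo = np.concatenate([ref_points, np.ones((len(ref_points), 1))], axis=1)  # (Q, 4)
    uvw = homo @ proj.T                                    # homogeneous pixel coordinates
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-5, None)     # perspective divide
    u, v = uv[:, 0].round().astype(int), uv[:, 1].round().astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (uvw[:, 2] > 0)
    out = np.zeros((len(ref_points), C))
    out[valid] = img_feats[:, v[valid], u[valid]].T        # nearest-neighbour sampling
    return out
```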
Currently, 3D object detection solutions based on Bird's Eye View (BEV) perception are developing rapidly. Despite the existence of many review articles, a comprehensive review of this field is still lacking. Shanghai AI Lab and SenseTime Research provide an in-depth review of the technology roadmap for BEV solutions. Unlike existing reviews, however, we consider key aspects such as safety-oriented perception in autonomous driving. After analyzing the technology roadmap and current status of camera-based solutions, we discuss them along the basic principles of `accuracy, latency, and robustness', integrating the perspective of safety awareness to guide its practical implementation in autonomous driving.
Voxel-based 3D object detection methods segment and assign sparse point clouds into regular voxels, producing a dense data representation; this process is called voxelization. Compared with view-based methods, voxel-based methods use spatial convolution to effectively perceive 3D spatial information and achieve higher detection accuracy, which is crucial for safety-oriented perception in autonomous driving. However, these methods still face several challenges.
To overcome these challenges, it is necessary to address the limitations of the data representation, improve the network's feature capacity and object localization accuracy, and strengthen the algorithm's understanding of complex scenes. Although optimization strategies vary, they generally target both the data representation and the model structure.
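For concreteness, here is a minimal sketch of the voxelization step described above: points are quantized into a regular grid and the points falling into each occupied voxel are averaged. The grid range and voxel size are illustrative assumptions, not values from any particular detector.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), pc_range=(0, -40, -3, 70.4, 40, 1)):
    """Assign each point of an (N, 3) cloud to a voxel and average points per voxel.
    Returns integer voxel coordinates (M, 3) and the mean xyz per occupied voxel (M, 3)."""
    low = np.array(pc_range[:3])
    size = np.array(voxel_size)
    mask = np.all((points >= low) & (points < np.array(pc_range[3:])), axis=1)
    pts = points[mask]                                          # keep points inside the range
    coords = np.floor((pts - low) / size).astype(int)           # integer voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, pts)                               # sum points per voxel
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    return uniq, sums / counts                                  # mean point per voxel
```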
Point-based 3D object detection methods, benefiting from the success of point cloud deep learning, inherit many of its frameworks and detect 3D objects directly from the raw points without preprocessing. Compared with voxel-based methods, the raw point cloud retains the maximum amount of original information, which benefits fine-grained feature extraction and yields high accuracy. At the same time, the PointNet line of work naturally provides a strong foundation for point-based methods. Point-based 3D object detectors have two basic components: point cloud sampling and feature learning. To date, the performance of point-based methods is still limited by two factors: the number of context points and the context radius used in feature learning. For example, increasing the number of context points yields more detailed 3D information but significantly increases the model's inference time; similarly, reducing the context radius has an analogous effect. Choosing appropriate values for these two factors therefore allows the model to balance accuracy and speed. In addition, because every point in the cloud must be processed, the point cloud sampling step is the main factor limiting the real-time operation of point-based methods. To address these problems, most existing methods optimize around the two basic components of point-based 3D object detectors: 1) point cloud sampling; 2) feature learning.
Farthest Point Sampling (FPS), derived from the PointNet series of work, is a point cloud sampling method widely used in point-based methods. Its goal is to select a representative subset of points from the original point cloud, maximizing the distance between the selected points so as to best cover the spatial distribution of the whole cloud. PointRCNN is a groundbreaking two-stage detector among point-based methods, using PointNet as its backbone network. In the first stage, it generates 3D proposals from the point cloud in a bottom-up manner; in the second stage, the proposals are refined by combining semantic features and local spatial features. However, existing FPS-based methods still face some problems: 1) points irrelevant to detection also participate in the sampling process, adding computational burden; 2) points are unevenly distributed across different parts of an object, leading to suboptimal sampling. To address these issues, subsequent work adopted FPS-like design paradigms with improvements such as segmentation-guided background point filtering, random sampling, feature-space sampling, voxel-based sampling, and ray-grouping-based sampling.
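The following is a minimal numpy sketch of farthest point sampling as described: each new point is the one farthest from the set already selected. It is a generic illustration, not the exact sampler of any specific detector.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Iteratively pick k indices from an (N, 3) cloud so that each new point
    is the one farthest from the points already selected."""
    rng = np.random.default_rng(seed)
    selected = np.empty(k, dtype=int)
    selected[0] = rng.integers(len(points))                      # arbitrary first point
    dist = np.linalg.norm(points - points[selected[0]], axis=1)  # distance to the set
    for i in range(1, k):
        selected[i] = int(dist.argmax())                         # farthest from current set
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i]], axis=1))
    return selected
```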
The feature learning stage of point-based 3D object detection methods aims to extract discriminative feature representations from sparse point cloud data. The neural network used in this stage should have the following properties: 1) permutation invariance, i.e., the point cloud backbone should be insensitive to the order of the input points; 2) local awareness, i.e., the ability to perceive and model local regions and extract local features; 3) context integration, i.e., the ability to combine features from global and local context. Based on these properties, a large number of detectors have been designed to process raw point clouds. Most methods can be grouped by the core operator they use: 1) PointNet-based methods; 2) graph neural network-based methods; 3) Transformer-based methods.
PointNet-based methods mainly rely on set abstraction to downsample the raw points, aggregate local information, and integrate contextual information while maintaining the symmetry invariance of the point set. PointRCNN is the first two-stage work among point-based methods and achieves excellent performance, but it still suffers from high computational cost. Subsequent work addresses this by introducing an additional semantic segmentation task during detection to filter out background points that contribute little to detection.
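A hedged sketch of a single set abstraction step is shown below: points are grouped around sampled centroids with a ball query, transformed by a shared per-point mapping, and max-pooled so the result is order-invariant. The random linear layer stands in for the shared MLP; it is an assumption for illustration, not the real PointNet weights.

```python
import numpy as np

def set_abstraction(points, feats, centroids, radius=1.0, weight=None, seed=0):
    """Group points within `radius` of each centroid and max-pool their features.

    points:    (N, 3) coordinates, feats: (N, C) per-point features
    centroids: (M, 3) sampled centres (e.g. from farthest point sampling)
    returns:   (M, 64) one aggregated feature per centroid
    """
    rng = np.random.default_rng(seed)
    c_in = feats.shape[1] + 3
    w = weight if weight is not None else rng.standard_normal((c_in, 64)) * 0.1
    out = np.zeros((len(centroids), w.shape[1]))
    for i, c in enumerate(centroids):
        mask = np.linalg.norm(points - c, axis=1) < radius        # ball query
        if not mask.any():
            continue
        local = np.concatenate([points[mask] - c, feats[mask]], axis=1)
        out[i] = np.maximum(local @ w, 0.0).max(axis=0)           # shared MLP + max-pool
    return out
```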
Graph neural networks (GNNs) offer adaptive structure, dynamic neighborhoods, the ability to model local and global context relationships, and robustness to irregular sampling. Point-GNN is a pioneering work that designs a single-stage graph neural network to predict the category and shape of objects through an auto-registration mechanism and merging and scoring operations, demonstrating the potential of graph neural networks as a new approach to 3D object detection.
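To illustrate the graph idea in this context, the sketch below builds a k-nearest-neighbour graph over the points and aggregates simple edge messages. It is a generic message-passing step under assumed settings, not Point-GNN's actual update rule.

```python
import numpy as np

def gnn_update(points, feats, k=8):
    """One simplified graph message-passing step on a point cloud:
    connect each point to its k nearest neighbours and max-aggregate
    edge messages built from relative positions and neighbour features."""
    diff = points[:, None, :] - points[None, :, :]               # (N, N, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]                  # k nearest, skipping self
    rel = points[nbrs] - points[:, None, :]                      # relative neighbour offsets
    msg = np.concatenate([rel, feats[nbrs]], axis=-1)            # (N, k, 3 + C) edge messages
    return msg.max(axis=1)                                       # permutation-invariant pooling
```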
In recent years, Transformers have been explored for point cloud analysis and perform well on many tasks. For example, Pointformer introduces local and global attention modules to process 3D point clouds: the local Transformer module models interactions between points in local regions, while the global Transformer learns scene-level context-aware representations. Group-Free directly uses all points in the point cloud to compute the features of each object candidate, where each point's contribution is determined by a learned attention module. These methods demonstrate the potential of Transformer-based approaches for processing unstructured, unordered raw point clouds.
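The attention mechanism these methods rely on can be sketched as plain scaled dot-product self-attention over a set of point features, where every point attends to every other point. The random projection matrices below are placeholders for learned weights; this is an illustrative sketch, not any specific model's module.

```python
import numpy as np

def point_self_attention(feats, d_k=32, seed=0):
    """Scaled dot-product self-attention over a set of point features (N, C).
    Each point's output is a data-dependent weighting of all points."""
    rng = np.random.default_rng(seed)
    c = feats.shape[1]
    wq, wk, wv = (rng.standard_normal((c, d_k)) * 0.1 for _ in range(3))
    q, k_mat, v = feats @ wq, feats @ wk, feats @ wv
    scores = q @ k_mat.T / np.sqrt(d_k)                          # (N, N) attention logits
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                      # softmax over all points
    return attn @ v                                              # (N, d_k) attended features
```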
Point-based 3D object detection methods provide high resolution and preserve the spatial structure of the original data, but they face high computational complexity and are inefficient on sparse data. In contrast, voxel-based methods provide a structured data representation, improve computational efficiency, and make it straightforward to apply conventional convolutional neural networks; however, they often lose fine spatial details due to discretization. To address these problems, point-voxel (PV) based methods were developed. They aim to exploit the fine-grained information-capturing capability of point-based methods together with the computational efficiency of voxel-based methods. By integrating the two, point-voxel methods can process point cloud data in more detail, capturing both global structure and fine geometric details. This is crucial for safety-oriented perception in autonomous driving, because the decision-making accuracy of the system depends on high-precision detection results.
The key goal of the point-voxel method is to achieve feature interaction between voxels and points through point-to-voxel or voxel-to-point conversion. Many works have explored the idea of utilizing point-voxel feature fusion in backbone networks. These methods can be divided into two categories: 1) early fusion; 2) late fusion.
a) Early fusion: Some methods explore new convolution operators to fuse voxel and point features; PVCNN is arguably the first work in this direction. In this approach, the voxel-based branch first converts the points into a low-resolution voxel grid and aggregates neighboring voxel features through convolution. Then, through a process called devoxelization, the voxel-level features are converted back to point-level features and fused with the features from the point-based branch. The point-based branch extracts features for each individual point; because it does not aggregate neighborhood information, it can run at higher speed. SPVCNN later extended PVCNN to object detection. Other methods improve from different perspectives, such as auxiliary tasks or multi-scale feature fusion.
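The devoxelization-and-fuse step described above can be sketched as follows: each point looks up the coarse feature of the voxel it falls in and concatenates it with its own fine-grained feature. This is a hedged simplification (nearest-voxel lookup, whereas PVCNN interpolates), with assumed shapes.

```python
import numpy as np

def devoxelize_and_fuse(points, point_feats, voxel_feats, voxel_size, pc_low):
    """Gather the coarse feature of the voxel containing each point and fuse it
    with the point-wise feature.

    points:      (N, 3) coordinates, point_feats: (N, Cp) fine-grained features
    voxel_feats: (X, Y, Z, Cv) dense voxel feature volume
    returns:     (N, Cp + Cv) fused per-point features
    """
    idx = np.floor((points - np.array(pc_low)) / np.array(voxel_size)).astype(int)
    idx = np.clip(idx, 0, np.array(voxel_feats.shape[:3]) - 1)   # stay inside the grid
    coarse = voxel_feats[idx[:, 0], idx[:, 1], idx[:, 2]]        # (N, Cv) voxel features
    return np.concatenate([point_feats, coarse], axis=1)
```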
b) Late fusion: This line of methods mainly uses a two-stage detection framework. First, preliminary object proposals are generated with a voxel-based approach; then point-level features are used to refine the detection boxes. PV-RCNN, proposed by Shi et al., is a milestone among point-voxel based methods. It uses SECOND as the first-stage detector and proposes a second refinement stage with RoI grid pooling to fuse keypoint features. Subsequent work largely follows this paradigm and focuses on improving the second stage, with notable developments including attention mechanisms, scale-aware pooling, and point density-aware refinement modules.
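Below is a hedged, much-simplified sketch of the second-stage refinement idea: a small lattice of sample points is placed inside each proposal box and nearby keypoint features are pooled at each lattice point. The real RoI grid pooling in PV-RCNN additionally handles box rotation and uses learned set abstraction; the version here uses axis-aligned boxes and simple averaging as assumptions.

```python
import numpy as np

def roi_grid_pool(box_center, box_size, keypoints, kp_feats, grid=3, radius=0.8):
    """Pool keypoint features at a (grid x grid x grid) lattice inside one
    axis-aligned proposal box; each lattice point averages keypoints within `radius`."""
    offsets = (np.stack(np.meshgrid(*[np.arange(grid)] * 3, indexing="ij"), -1)
               .reshape(-1, 3) + 0.5) / grid - 0.5               # normalized lattice offsets
    grid_pts = np.array(box_center) + offsets * np.array(box_size)
    pooled = np.zeros((len(grid_pts), kp_feats.shape[1]))
    for i, g in enumerate(grid_pts):
        mask = np.linalg.norm(keypoints - g, axis=1) < radius    # keypoints near this cell
        if mask.any():
            pooled[i] = kp_feats[mask].mean(axis=0)
    return pooled.reshape(-1)        # flattened feature used to refine the proposal box
```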
Point-voxel based methods combine the computational efficiency of voxel-based methods with the fine-grained information capture of point-based methods. However, constructing point-to-voxel or voxel-to-point relationships and fusing voxel and point features introduces additional computational overhead. As a result, point-voxel based methods can achieve better detection accuracy than voxel-based methods, but at the cost of increased inference time.
4. Multi-modal 3D object detection
4.1 Projection-based 3D object detection
4.2 Non-Projection-based 3D object detection
VirConv, MSMDFusion, and SFD construct a unified space via pseudo point clouds, where projection occurs before feature learning; the problems introduced by direct projection are then resolved by subsequent feature learning. In summary, unified-feature-based 3D object detection methods currently represent highly accurate and robust solutions. Although they involve a projection matrix, the projection does not occur during multi-modal fusion, so they are considered non-projection 3D object detection methods. Unlike projection-based 3D object detection methods, they do not directly tackle projection error; instead, they construct a unified space and consider multiple dimensions of multi-modal 3D object detection to obtain highly robust multi-modal features.
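The pseudo point cloud construction these methods build on can be sketched as back-projecting a dense depth map into 3D with the pinhole camera model, so camera features live in the same space as LiDAR points. The intrinsics and depth values below are placeholder assumptions for illustration.

```python
import numpy as np

def depth_to_pseudo_points(depth, intrinsics):
    """Back-project an (H, W) depth map into an (H*W, 3) pseudo point cloud
    in camera coordinates using the pinhole model x = (u - cx) * z / fx."""
    h, w = depth.shape
    fx, fy, cx, cy = intrinsics
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

pseudo = depth_to_pseudo_points(np.full((4, 6), 10.0), (500.0, 500.0, 3.0, 2.0))
print(pseudo.shape)   # (24, 3): image pixels lifted into 3D camera space
```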
3D object detection plays a vital role in autonomous driving perception. In recent years, this field has developed rapidly and produced a large number of research papers. Based on the diverse data forms generated by sensors, these methods are mainly divided into three types: image-based, point cloud-based and multi-modal. The main evaluation metrics of these methods are high accuracy and low latency. Many reviews summarize these approaches, focusing mainly on the core principles of `high accuracy and low latency', describing their technical trajectories.
However, as autonomous driving technology moves from research breakthroughs to practical applications, existing reviews do not take safety-oriented perception as their core focus and fail to cover the current technical paths related to it. For example, recent multi-modal fusion methods are often tested for robustness during the experimental phase, an aspect that existing reviews have not fully considered.
We therefore re-examine 3D object detection algorithms, focusing on `accuracy, latency, and robustness' as the key aspects, and reclassify the methods covered in previous reviews from a safety-perception perspective. We hope this work provides new insights for future research on 3D object detection, going beyond the pursuit of high accuracy alone.