Table of Contents
0. Foreword && Personal Understanding
1. Dataset
2. Vision-based 3D object detection
  2.1 Monocular 3D object detection
  2.2 Stereo-based 3D object detection
  2.3 Multi-view 3D object detection
    Depth-based multi-view methods
    Query-based multi-view methods
  2.4 Analysis: Accuracy, Latency, Robustness
3. Lidar-based 3D object detection
  3.1 Voxel-based 3D object detection
  3.2 Point-based 3D object detection
    PointNet-based methods
    Methods based on graph neural networks
    Transformer-based methods
  3.3 Point-Voxel based 3D object detection
4. Multi-modal 3D object detection
  4.1 Projection-based 3D object detection
5. Conclusion
Choose camera or lidar? A recent review on achieving robust 3D object detection

Jan 26, 2024, 11:18 AM
0. Foreword && Personal Understanding

Autonomous driving systems rely on advanced perception, decision-making, and control technologies, using various sensors (such as cameras, lidar, and radar) to perceive the surrounding environment and running algorithms and models for real-time analysis and decision-making. This enables vehicles to recognize road signs, detect and track other vehicles, and predict pedestrian behavior, allowing them to operate safely in complex traffic environments. The technology is attracting widespread attention and is considered one of the key development areas in the future of transportation. What makes autonomous driving hard, however, is enabling the car to understand what is going on around it. This requires 3D object detection algorithms that can accurately perceive and describe objects in the surrounding environment, including their location, shape, size, and category. Such comprehensive environmental awareness helps autonomous driving systems better understand the driving environment and make more precise decisions.

We conducted a comprehensive evaluation of 3D object detection algorithms in autonomous driving, focusing mainly on robustness. Three key factors were identified: environmental variability, sensor noise, and misalignment. These factors determine how detection algorithms perform under the changing conditions of the real world.

  1. Environmental variability: detection algorithms need to adapt to different environmental conditions, such as changes in lighting, weather, and seasons.
  2. Sensor noise: algorithms must cope effectively with sensor noise, which may include camera motion blur and other issues.
  3. Misalignment: whether caused by external factors (e.g. uneven road surfaces) or internal ones (e.g. system clock misalignment), calibration errors and other sources of misalignment must be taken into account by the algorithm.
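To make the sensor-noise factor concrete, robustness benchmarks typically corrupt clean data programmatically. Below is a minimal illustrative sketch (not code from the paper) of two common LiDAR corruptions, Gaussian coordinate jitter and random point dropout; the function name and default parameters are our own choices:

```python
import numpy as np

def perturb_point_cloud(points, jitter_std=0.02, drop_ratio=0.1, seed=0):
    """Apply two simple corruption types to an (N, 3) LiDAR point cloud:
    Gaussian coordinate jitter (sensor noise) and random point dropout
    (e.g. beam loss). Returns the corrupted cloud."""
    rng = np.random.default_rng(seed)
    # Randomly keep a subset of points to mimic partial sensor failure.
    keep = rng.random(len(points)) >= drop_ratio
    kept = points[keep]
    # Add zero-mean Gaussian noise to every surviving coordinate.
    return kept + rng.normal(0.0, jitter_std, size=kept.shape)

pts = np.zeros((1000, 3))
out = perturb_point_cloud(pts)
```

A robustness evaluation then reports the detector's metric drop between the clean cloud and such corrupted versions at increasing severities.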

The review also dives into three key areas of performance evaluation: accuracy, latency, and robustness.

  • Accuracy: Studies often focus on accuracy as the key performance indicator, but performance under complex and extreme conditions requires deeper scrutiny to ensure real-world reliability.
  • Latency: Real-time capabilities in autonomous driving are crucial. Delays in detection methods impact the system's ability to make timely decisions, especially in emergency situations.
  • Robustness: Calls for a more comprehensive assessment of the stability of systems under different conditions, as many current assessments may not fully account for the diversity of real-world scenarios.

The paper points out the significant advantages of multi-modal 3D detection methods for safety-aware perception: by fusing data from different sensors, they provide richer and more diverse perception capabilities, thereby improving the safety of the autonomous driving system.

1. Dataset

This section briefly introduces the 3D object detection datasets used in autonomous driving systems, focusing on the advantages and limitations of the different sensor modalities and the characteristics of the public datasets.

First, the table compares three sensor configurations: camera, point cloud (lidar), and multi-modal (camera plus lidar), listing the hardware cost, advantages, and limitations of each. Camera data provides rich color and texture information, but lacks depth information and is susceptible to lighting and weather. LiDAR provides accurate depth information, but is expensive and carries no color information.

Next, several other public datasets for 3D object detection in autonomous driving are covered, including KITTI, nuScenes, and Waymo:
  • The KITTI dataset contains data released over multiple years, collected with different types of sensors. It provides a large number of frames and annotations, along with a variety of scenes (scene counts and categories) and scene types such as day, sunny, night, and rainy.
  • The nuScenes dataset is another important dataset, also with data released over multiple years. It uses a variety of sensors and provides a large number of frames and annotations, covering diverse scenes with different scene counts, categories, and scene types.
  • The Waymo dataset is a further autonomous driving dataset with data from multiple years. It uses different types of sensors, provides a rich number of frames and annotations, and covers a wide range of scenes.

Additionally, research on “clean” autonomous driving datasets is mentioned, and the importance of evaluating model robustness under noisy scenarios is emphasized. Some studies focus on camera single-modality methods under harsh conditions, while other multi-modal datasets focus on noise issues. For example, the GROUNDED dataset focuses on ground-penetrating radar positioning under different weather conditions, while the ApolloScape open dataset includes lidar, camera and GPS data, covering a variety of weather and lighting conditions.

Due to the prohibitive cost of collecting large-scale noisy data in the real world, many studies turn to synthetic datasets. For example, ImageNet-C is the benchmark for evaluating image classification models against common corruptions. This research direction was later extended to robustness datasets tailored for 3D object detection in autonomous driving.
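As an illustration of the synthetic-corruption idea, the sketch below applies ImageNet-C-style Gaussian noise at a chosen severity level. The sigma table here is invented for the example; the real benchmark defines its own per-severity parameters and many more corruption types:

```python
import numpy as np

def gaussian_noise(image, severity=3):
    """ImageNet-C-style Gaussian-noise corruption, sketched. `image` is a
    float array in [0, 1]; `severity` runs from 1 (mild) to 5 (harsh).
    The sigma values below are illustrative, not the benchmark's."""
    sigmas = [0.04, 0.06, 0.08, 0.09, 0.10]
    rng = np.random.default_rng(0)
    noisy = image + rng.normal(0.0, sigmas[severity - 1], size=image.shape)
    # Corrupted images are clipped back to the valid intensity range.
    return np.clip(noisy, 0.0, 1.0)

img = np.full((4, 4, 3), 0.5)
corrupted = gaussian_noise(img, severity=5)
```

Evaluating a fixed, clean-trained model on all severities of such corruptions yields the robustness curves these benchmarks report.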

2. Vision-based 3D object detection

2.1 Monocular 3D object detection

This part discusses the concept of monocular 3D object detection and three main method families: prior-guided monocular 3D object detection, camera-only monocular 3D object detection, and depth-assisted monocular 3D object detection.

Prior-guided monocular 3D object detection
This family of methods utilizes prior knowledge of object shapes and scene geometry hidden in the image to tackle the challenges of monocular 3D object detection. By introducing pre-trained sub-networks or auxiliary tasks, prior knowledge provides additional information or constraints that help localize 3D objects accurately, improving detection accuracy and robustness. Common priors include object shape, geometric consistency, temporal constraints, and segmentation information. For example, the Mono3D algorithm first assumes that 3D objects lie on a fixed ground plane, then uses prior 3D shapes of objects to reconstruct bounding boxes in 3D space.
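The ground-plane prior can be made concrete with a little geometry: given the camera intrinsics and the camera's height above the road, a pixel assumed to lie on the ground back-projects to a unique 3D point. The following is an illustrative reconstruction of that idea, not code from any cited method:

```python
import numpy as np

def ground_plane_backproject(u, v, K, cam_height=1.65):
    """Recover the 3D position (camera coordinates, y pointing down) of a
    pixel known to lie on a flat ground plane, following the Mono3D-style
    assumption that objects rest on the ground. (u, v) is e.g. the bottom
    center of a 2D box; K is the 3x3 intrinsic matrix; cam_height is the
    camera's height above the road (1.65 m matches the KITTI setup)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray direction
    t = cam_height / ray[1]                          # scale so y == height
    return t * ray                                   # 3D point on the plane

# KITTI-like intrinsics, used here purely for illustration.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
p = ground_plane_backproject(640.0, 250.0, K)
```

This shows why the prior is powerful: a single 2D observation plus one scene assumption removes the depth ambiguity entirely.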

Camera-only monocular 3D object detection
This family of methods uses only images captured by a single camera to detect and localize 3D objects. A convolutional neural network (CNN) directly regresses 3D bounding box parameters from images to estimate the size and pose of objects in 3D space. This direct regression can be trained end-to-end, promoting holistic learning and inference of 3D objects. For example, the SMOKE algorithm abandons 2D bounding box regression and predicts each detected object's 3D box by combining single-keypoint estimation with regression of 3D variables.

Depth-assisted monocular 3D object detection
Depth estimation plays a key role in depth-assisted monocular 3D object detection. To achieve more accurate monocular detection, many studies utilize pre-trained auxiliary depth estimation networks. The pipeline starts by converting the monocular image into a depth image using a pre-trained depth estimator such as MonoDepth; two main strategies are then used to process the depth image together with the monocular image. For example, Pseudo-LiDAR detectors use a pretrained depth estimation network to generate a pseudo-LiDAR representation, but a large performance gap remains between pseudo-LiDAR and LiDAR-based detectors due to errors in the image-to-LiDAR conversion.
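The core of the Pseudo-LiDAR conversion, turning a predicted depth map into a point cloud via the inverse intrinsics, can be sketched in a few lines. This is a simplified illustration; real pipelines also transform the points into the LiDAR frame and filter by range:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a dense depth map (H, W) into a pseudo-LiDAR point
    cloud of shape (H*W, 3): each pixel (u, v) with depth z maps to
    z * K^{-1} [u, v, 1]^T in camera coordinates."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]                               # pixel grids
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # (3, H*W)
    pts = np.linalg.inv(K) @ pix * depth.ravel()            # scale rays by depth
    return pts.T

# Toy example: identity intrinsics, constant 5 m depth.
K = np.eye(3)
depth = np.ones((2, 2)) * 5.0
cloud = depth_to_pseudo_lidar(depth, K)
```

Errors in the estimated `depth` propagate directly into the point positions, which is exactly the source of the performance gap mentioned above.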

Through the exploration and application of these methods, monocular 3D object detection has made significant progress in the fields of computer vision and intelligent systems, bringing breakthroughs and opportunities to these fields.

2.2 Stereo-based 3D object detection

This part discusses 3D object detection based on stereo vision. Stereo 3D object detection utilizes a pair of stereo images to identify and localize 3D objects. By exploiting the dual views captured by stereo cameras, these methods excel at obtaining high-precision depth information through stereo matching and calibration, a feature that differentiates them from monocular camera setups. Despite these advantages, stereo vision methods still show a considerable performance gap compared to lidar-based methods. Furthermore, 3D object detection from stereo images is relatively unexplored, with only limited research efforts dedicated to the area.

  1. 2D-detection based methods: The traditional 2D object detection framework can be modified to solve the stereo detection problem. For example, Stereo R-CNN uses an image-based 2D detector to predict 2D proposals, generating left and right regions of interest (RoIs) for the corresponding left and right images. In a second stage, it then estimates 3D object parameters directly from the previously generated RoIs. This paradigm was widely adopted in subsequent work.
  2. Pseudo-LiDAR based methods: The disparity map predicted from the stereo pair can be converted into a depth map and further into pseudo-LiDAR points, so, as in monocular detection, pseudo-LiDAR representations can also be used in stereo 3D object detection. These methods aim to improve disparity estimation in stereo matching for more accurate depth prediction. Wang et al. pioneered the pseudo-LiDAR representation, which is generated from an image and its depth map and requires the model to perform a depth estimation task to assist detection. Subsequent work followed this paradigm and refined it by introducing additional color information to enhance the pseudo point cloud, auxiliary tasks (such as instance segmentation, foreground/background segmentation, and domain adaptation), and coordinate transformation schemes. Notably, PatchNet, proposed by Ma et al., challenges the conventional use of pseudo-LiDAR representations for monocular 3D object detection: by encoding 3D coordinates for each pixel, PatchNet achieves comparable monocular detection results without a pseudo-LiDAR representation. This observation suggests that the power of the pseudo-LiDAR representation comes from the coordinate transformation rather than from the point cloud representation itself.
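The disparity-to-depth conversion underlying these pseudo-LiDAR stereo methods is a one-line formula, sketched here with illustrative KITTI-like numbers:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a stereo disparity map (in pixels) to metric depth via
    z = f * B / d. Larger baselines B give finer depth resolution at
    range, which is why the camera baseline matters for stereo methods."""
    d = np.maximum(disparity, 1e-6)   # guard against division by zero
    return focal_px * baseline_m / d

# KITTI-like setup: f = 721 px, B = 0.54 m; a 10 px disparity is ~39 m away.
depth = disparity_to_depth(np.array([10.0]), 721.0, 0.54)
```

Note the inverse relation: a fixed disparity error causes a depth error that grows quadratically with distance, which is one reason stereo methods degrade for far objects.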

2.3 Multi-view 3D object detection

Recently, multi-view 3D object detection has improved in accuracy and robustness, showing clear advantages over the monocular and stereo methods discussed above. Unlike LiDAR-based 3D object detection, the latest bird's-eye-view (BEV) methods eliminate the need for high-precision maps and lift detection from 2D to 3D, driving significant progress in multi-view 3D object detection. In multi-camera 3D object detection, the key challenge is identifying the same object across different images and aggregating object features from the multiple viewpoint inputs. Current methods commonly map the multiple views into a unified BEV space.

Depth-based Multi-view methods:

Direct transformation from 2D to BEV space poses a significant challenge. LSS was the first to propose a depth-based method that uses 3D space as an intermediary: it predicts a per-pixel categorical depth distribution over the 2D features and then lifts those features into voxel space, offering a practical route from 2D to BEV space. Following LSS, CaDDN adopts a similar depth representation and compresses the voxel-space features into BEV space to perform the final 3D detection. Note that CaDDN is a single-view rather than multi-view 3D object detector, but it influenced subsequent depth-based research. The main difference between LSS and CaDDN is that CaDDN supervises its categorical depth distribution with ground-truth depth values, yielding a superior depth network that extracts 3D information from 2D images more accurately.
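The "lift" step shared by LSS and CaDDN can be illustrated for a single pixel: the predicted categorical depth distribution takes an outer product with the pixel's feature vector, scattering a weighted copy of the feature at every candidate depth along the camera ray. A minimal sketch (our own simplification of the full image-grid operation):

```python
import numpy as np

def lift_features(feat, depth_logits):
    """Single-pixel sketch of the LSS lift: softmax the D depth logits
    into a categorical distribution, then outer-product with the C-dim
    pixel feature to get (D, C) frustum features along the ray."""
    p = np.exp(depth_logits - depth_logits.max())
    p = p / p.sum()                   # softmax over depth bins
    return np.outer(p, feat)          # (D, C) depth-weighted features

# Uniform depth distribution spreads the feature evenly over 4 bins.
frustum = lift_features(np.ones(8), np.zeros(4))
```

Summing the frustum features over the depth axis recovers the original feature, so the depth distribution acts purely as a soft placement of 2D evidence in 3D.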

Query-based Multi-view methods

Influenced by Transformer technology, query-based multi-view methods retrieve 2D features from 3D space. DETR3D introduces 3D object queries to solve the aggregation of multi-view features: learned 3D reference points are projected into the different viewpoint images to sample image features, yielding features in bird's-eye-view (BEV) space. Unlike depth-based multi-view methods, query-based methods obtain sparse BEV features through this reverse query mechanism, which fundamentally shaped subsequent query-based development. However, because explicit 3D reference points can be inaccurate, PETR adopted an implicit positional encoding to construct the BEV space, influencing later work.

2.4 Analysis: Accuracy, Latency, Robustness

Currently, 3D object detection solutions based on bird's-eye-view (BEV) perception are developing rapidly. Despite many review articles, a comprehensive review of this field is still lacking. Shanghai AI Lab and SenseTime Research provide an in-depth review of the technology roadmap for BEV solutions. Unlike existing reviews, however, we consider key aspects such as safety-aware perception for autonomous driving. After analyzing the technology roadmap and current state of camera-based solutions, we frame the discussion around the basic principles of "Accuracy, Latency, Robustness", integrating the safety-awareness perspective to guide its practical implementation in autonomous driving.

  1. Accuracy: Most research articles and reviews focus heavily on accuracy, and rightly so. Although accuracy can be reflected by AP (average precision), AP alone may not give a comprehensive picture, as different methods can differ significantly due to their paradigms. As shown in the figure, we selected 10 representative methods for comparison, and the results show significant metric differences between monocular and stereo 3D object detection: monocular accuracy is currently much lower than stereo accuracy. Stereo 3D object detection uses images captured from two viewpoints of the same scene to obtain depth information; the larger the baseline between the cameras, the wider the range of depth that can be captured. Over time, multi-view (bird's-eye-view perception) 3D object detection has gradually replaced monocular methods, significantly improving mAP. The increase in the number of sensors has a significant impact on mAP.
  2. Latency: In autonomous driving, latency is crucial. It refers to the time the system takes to react to an input signal, covering the entire pipeline from sensor data collection to decision-making and action execution. The latency requirements are very strict, since any delay can lead to serious consequences. The importance of latency shows in real-time responsiveness, safety, user experience, interactivity, and emergency response. In 3D object detection, latency (frames per second, FPS) and accuracy are the key indicators of algorithm performance. As shown in the figure, the plot for monocular and stereo 3D object detection shows average precision (AP) versus FPS at comparable difficulty levels on the KITTI dataset. For practical autonomous driving, 3D object detection algorithms must strike a balance between latency and accuracy: monocular detection is fast but lacks accuracy, while stereo and multi-view methods are accurate but slower. Future research should not only maintain high accuracy but also pay more attention to improving FPS and reducing latency, to meet the dual requirements of real-time responsiveness and safety.
  3. Robustness: Robustness is a key factor in safety-aware perception for autonomous driving and an important topic previously overlooked in comprehensive reviews. It is rarely addressed by current well-curated clean datasets and benchmarks such as KITTI, nuScenes, and Waymo. Recent works such as RoboBEV and Robo3D incorporate robustness considerations into 3D object detection, covering factors such as sensor loss. Their methodology introduces perturbations into 3D object detection datasets to assess robustness, including various noise types (weather changes, sensor failures, motion disturbances, and object-level perturbations), aiming to reveal how different noise sources affect the model. Typically, robustness studies evaluate by adding noise to the validation sets of clean datasets such as KITTI, nuScenes, and Waymo. We additionally highlight the findings in Ref., which use KITTI-C and nuScenes-C as examples for camera-only 3D object detection methods. The table provides an overall comparison showing that camera-only approaches are less robust than lidar-only and multi-modal fusion approaches and are highly susceptible to various types of noise. In KITTI-C, three representative works (SMOKE, PGD, and ImVoxelNet) show consistently lower overall performance and reduced robustness to noise. In nuScenes-C, methods such as DETR3D and BEVFormer show greater robustness than FCOS3D and PGD, indicating that overall robustness increases with the number of sensors. In summary, future camera-only approaches need to consider not only cost and accuracy metrics (mAP, NDS, etc.) but also safety-aware perception and robustness. Our analysis aims to provide valuable insights into the safety of future autonomous driving systems.

3. Lidar-based 3D object detection

3.1 Voxel-based 3D object detection

Voxel-based 3D object detection methods partition the sparse point cloud and assign it to regular voxels, producing a dense data representation, a process known as voxelization. Compared with view-based methods, voxel-based methods use spatial convolutions to effectively perceive 3D spatial information and achieve higher detection accuracy, which is crucial for safety-aware perception in autonomous driving. However, these methods still face the following challenges:

  1. High computational complexity: compared with camera-based methods, voxel-based methods require large amounts of memory and compute, because a huge number of voxels is needed to represent the 3D space.
  2. Loss of spatial information: due to the discretization of voxels, detail and shape information may be lost or blurred during voxelization, and the limited voxel resolution makes small objects hard to detect accurately.
  3. Scale and density inconsistency: voxel-based methods usually must detect on voxel grids of different scales and densities, but because target scale and density vary greatly across scenes, choosing the right scale and density for different targets becomes a challenge.

To overcome these challenges, the limitations of the data representation must be addressed, network feature capacity and target localization accuracy improved, and the algorithm's understanding of complex scenes strengthened. Although optimization strategies vary, they generally aim to optimize both the data representation and the model structure.
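The voxelization step these methods share can be sketched minimally as follows. This is an illustrative toy; production implementations (e.g. in VoxelNet or SECOND) additionally cap the points per voxel, use fixed-size tensors, and run on the GPU:

```python
import numpy as np

def voxelize(points, voxel_size, pc_range_min):
    """Minimal voxelization sketch: assign each point to a voxel index by
    flooring its offset from the range minimum, then group points by
    voxel. Returns a dict mapping voxel index -> list of member points."""
    idx = np.floor((points - pc_range_min) / voxel_size).astype(int)
    voxels = {}
    for p, i in zip(points, map(tuple, idx)):
        voxels.setdefault(i, []).append(p)
    return voxels

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.1, 0.1], [1.5, 0.0, 0.0]])
vox = voxelize(pts, voxel_size=1.0, pc_range_min=np.array([0.0, 0.0, 0.0]))
```

The dict form makes the discretization loss visible: all points falling in one cell are summarized by a single voxel feature, which is exactly the information-loss challenge listed above.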


3.2 Point-based 3D object detection

Point-based 3D object detection methods inherit many deep learning frameworks and detect 3D objects directly from the raw point cloud, without preprocessing. Compared with voxel-based methods, the raw point cloud retains the original information to the greatest extent, which favors fine-grained feature extraction and thus high accuracy. The PointNet series of work provides a strong foundation for point-based methods. So far, however, the performance of point-based methods is still limited by two factors: the number of context points and the context radius used in feature learning. For example, increasing the number of context points yields more detailed 3D information but significantly increases the model's inference time; similarly, shrinking the context radius has the same effect. Choosing appropriate values for these two factors therefore lets the model strike a balance between accuracy and speed. In addition, because a computation must be performed for every point in the cloud, the point cloud sampling process is the main factor limiting real-time operation of point-based methods. To address these problems, existing methods mainly optimize the two basic components of point-based 3D object detectors: 1) point cloud sampling; 2) feature learning.

Farthest Point Sampling (FPS), derived from PointNet++, is a point cloud sampling method widely used in point-based detectors. Its goal is to select a representative subset of points from the original cloud, maximizing the mutual distance between them so as to best cover the spatial distribution of the whole point cloud. PointRCNN is a groundbreaking two-stage detector among point-based methods, using PointNet++ as its backbone: in the first stage it generates 3D proposals from the point cloud in a bottom-up manner; in the second stage the proposals are refined by combining semantic features and local spatial features. However, existing FPS-based methods still face problems: 1) points irrelevant to detection also participate in sampling, adding computational burden; 2) points are unevenly distributed across different parts of an object, leading to suboptimal sampling. To address these issues, subsequent work adopted the FPS design paradigm with improvements such as segmentation-guided background point filtering, random sampling, feature-space sampling, voxel-based sampling, and ray-grouping-based sampling.
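A compact reference implementation of vanilla FPS, written here for illustration, also shows why it is expensive: every added sample requires a distance update over all N points, giving O(kN) cost:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: start from one point and repeatedly add the point
    farthest from the current sample set, giving good spatial coverage
    of the cloud. Returns the indices of the k selected points."""
    chosen = [0]                                # arbitrary starting point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))              # farthest from chosen set
        chosen.append(nxt)
        # Each point keeps its distance to the nearest chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [10.0, 0, 0]])
sel = farthest_point_sampling(pts, 3)
```

On this toy line of points, FPS picks the two extremes before any interior point, illustrating the coverage-first behavior (and why dense clusters can end up under-sampled).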

The feature learning stage of point-based 3D object detection methods aims to extract discriminative feature representations from sparse point cloud data. The neural network used for feature learning should have the following properties: 1) invariance: the point cloud backbone should be insensitive to the order of the input points; 2) local awareness: the ability to perceive and model local regions and extract local features; 3) context integration: the ability to extract features from both global and local context. Based on these properties, a large number of detectors have been designed to process raw point clouds. Most methods can be grouped by their core operator: 1) PointNet-based methods; 2) graph-neural-network-based methods; 3) Transformer-based methods.
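The invariance property (1) is exactly what the PointNet design achieves with a shared per-point transform followed by a symmetric pooling function; a minimal sketch:

```python
import numpy as np

def pointnet_global_feature(points, W):
    """Sketch of the PointNet recipe: a shared per-point transform (here
    a single linear layer W with ReLU) followed by a symmetric max-pool,
    so the output is unchanged under any permutation of the inputs."""
    per_point = np.maximum(points @ W, 0.0)   # shared MLP, applied pointwise
    return per_point.max(axis=0)              # order-invariant aggregation

rng = np.random.default_rng(0)
pts = rng.normal(size=(16, 3))
W = rng.normal(size=(3, 8))
f1 = pointnet_global_feature(pts, W)
f2 = pointnet_global_feature(pts[::-1], W)    # same points, reversed order
```

Because max-pooling is symmetric, `f1` and `f2` are identical, which is the invariance guarantee the backbone needs.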

PointNet-based methods

PointNet-based methods mainly rely on set abstraction to downsample the original points, aggregate local information, and integrate contextual information while preserving the permutation invariance of the point set. Point-RCNN is the first two-stage work among point-based methods and achieves excellent performance, but still suffers from high computational cost. Subsequent work addressed this by introducing an additional semantic segmentation task into the detection pipeline to filter out background points that contribute little to detection.

Methods based on graph neural networks

Graph neural networks (GNNs) offer adaptive structure, dynamic neighborhoods, the ability to model both local and global context, and robustness to irregularly sampled data. Point-GNN is a pioneering work that designs a single-stage graph neural network to predict the category and shape of objects through an auto-registration mechanism together with merging and scoring operations, demonstrating the potential of graph neural networks as a new approach to 3D object detection.

Transformer-based methods

In recent years, Transformers have been explored for point cloud analysis and perform well on many tasks. For example, Pointformer introduces local and global attention modules to process 3D point clouds: the local Transformer module models interactions between points within local regions, while the global Transformer learns scene-level context-aware representations. Group-Free directly uses all points in the cloud to compute the features of each object candidate, with each point's contribution determined by an automatically learned attention module. These methods demonstrate the potential of Transformer-based approaches for processing unstructured, unordered raw point clouds.

3.3 Point-Voxel based 3D object detection

Point-based 3D object detection methods provide high resolution and preserve the spatial structure of the original data, but face high computational complexity and inefficiency on sparse data. In contrast, voxel-based methods provide a structured data representation, improve computational efficiency, and enable conventional convolutional neural network techniques, but often lose fine spatial detail to the discretization process. Point-voxel (PV) methods were developed to resolve this trade-off: they aim to combine the fine-grained information capture of point-based methods with the computational efficiency of voxel-based methods. By integrating the two, point-voxel methods can process point cloud data in more detail, capturing both global structure and fine geometric detail. This is crucial for safety-aware perception in autonomous driving, since the decision accuracy of the system depends on high-precision detection results.

The key goal of point-voxel methods is to achieve feature interaction between voxels and points through point-to-voxel or voxel-to-point conversion. Many works have explored point-voxel feature fusion in backbone networks. These methods fall into two categories: 1) early fusion; 2) late fusion.

a) Early fusion: Some methods explore new convolution operators to fuse voxel and point features, and PVCNN may be the first work in this direction. In this approach, a voxel-based branch first converts points into a low-resolution voxel grid and aggregates neighboring voxel features through convolution. Then, through a process called devoxelization, the voxel-level features are converted back to point-level features and fused with the features from the point-based branch. The point-based branch extracts features for each individual point; since it does not aggregate neighborhood information, it can run at high speed. Later, SPVCNN extended the PVCNN design to object detection. Other methods improve from different angles, such as auxiliary tasks or multi-scale feature fusion.
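The early-fusion idea can be illustrated with a toy version of the PVCNN pattern: pool point features into coarse voxels, scatter the pooled feature back to each member point, and add it to the untouched fine-grained point feature. This is a schematic simplification; the real PVCNN uses convolutions on the voxel grid and trilinear devoxelization:

```python
import numpy as np

def point_voxel_fusion(points, point_feats, voxel_size):
    """Toy early fusion: the voxel branch mean-pools scalar point
    features per coarse voxel; 'devoxelization' (here, nearest voxel)
    scatters each voxel mean back to its points; fusion is addition."""
    idx = [tuple(i) for i in np.floor(points / voxel_size).astype(int)]
    sums, counts = {}, {}
    for i, f in zip(idx, point_feats):        # voxel branch: mean pooling
        sums[i] = sums.get(i, 0.0) + f
        counts[i] = counts.get(i, 0) + 1
    voxel_mean = np.array([sums[i] / counts[i] for i in idx])
    return point_feats + voxel_mean           # fuse coarse + fine features

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [3.0, 3.0, 3.0]])
feats = np.array([1.0, 3.0, 10.0])
fused = point_voxel_fusion(pts, feats, voxel_size=1.0)
```

Each fused feature carries both its own fine detail and its voxel's neighborhood summary, which is the complementarity point-voxel methods exploit.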

b) Late fusion: This line of methods mainly uses a two-stage detection framework. First, preliminary object proposals are generated with a voxel-based method; then, point-level features are used to refine the detection boxes. PV-RCNN, proposed by Shi et al., is a milestone among point-voxel methods: it uses SECOND as the first-stage detector and proposes a second refinement stage with RoI grid pooling to fuse keypoint features. Subsequent work largely follows this paradigm and focuses on improving the second stage; notable developments include attention mechanisms, scale-aware pooling, and point-density-aware refinement modules.

Point-voxel methods combine the computational efficiency of voxel-based methods with the fine-grained information capture of point-based methods. However, building point-to-voxel or voxel-to-point mappings and fusing voxel and point features incurs extra computational overhead. Point-voxel methods can therefore achieve better detection accuracy than voxel-based methods, but at the cost of increased inference time.

4. Multi-modal 3D object detection

4.1 Projection-based 3D object detection

Projection-based 3D object detection methods use a projection matrix during feature fusion to integrate point cloud and image features. The defining trait is projection within the feature fusion step itself, as opposed to projection elsewhere in the pipeline (e.g., for data augmentation). According to the type of projection used at the fusion stage, projection-based methods can be further divided into the following categories:

1. 3D object detection based on point projection: This type of method enhances the representation of raw point cloud data by projecting image information onto the points. A calibration matrix first establishes correspondences between lidar points and image pixels; the point features are then augmented with additional data, in one of two forms: by merging segmentation scores (as in PointPainting) or by using CNN features of the corresponding pixels (as in MVP). PointPainting enhances lidar points by appending segmentation scores but struggles to capture the color and texture details of images; more sophisticated methods such as FusionPainting were developed to address this.
2. 3D object detection based on feature projection: Unlike point-projection methods, this type fuses point cloud features with image features during the point cloud feature extraction stage. A calibration matrix transforms the 3D coordinates of voxels into the pixel coordinate system of the image, allowing the two modalities to be fused effectively. For example, ContFuse fuses multi-scale convolutional feature maps through continuous convolution.
3. 3D object detection based on auto-projection: Many works fuse through direct projection but do not address projection error. Some methods (such as AutoAlignV2) mitigate these errors by learning offsets and neighborhood projections; HMFI and GraphAlign utilize prior knowledge of the projection calibration matrix for image projection and local graph modeling.
4. 3D object detection based on decision projection: This type of method uses the projection matrix to align features within a region of interest (RoI) or on specific results. For example, Graph-RCNN projects graph nodes to positions in the camera image and collects the feature vector at each projected pixel through bilinear interpolation. F-PointNet determines object category and localization through 2D image detection, then obtains the corresponding 3D point cloud via calibrated sensor parameters and 3D transformation matrices.

These methods show how projection can be used to achieve feature fusion in multi-modal 3D object detection, but they still have limitations in handling cross-modal interaction and in accuracy.
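The point-projection recipe shared by these methods (project lidar points into the image with the calibration matrix, then append per-pixel scores to each point) can be sketched as follows. The 3x4 projection matrix and segmentation map below are illustrative stand-ins for real sensor calibration and real network outputs, and the clipping of out-of-image points is a simplification.

```python
import numpy as np

def paint_points(points, seg_scores, P):
    """PointPainting-style decoration: project each lidar point into
    the image with projection matrix P and append the segmentation
    scores of the hit pixel to the point's coordinates."""
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])   # (N, 4) homogeneous
    uvw = homo @ P.T                              # (N, 3) image-plane coords
    uv = uvw[:, :2] / uvw[:, 2:3]                 # perspective divide
    h, w_img, _ = seg_scores.shape
    u = np.clip(uv[:, 0].astype(int), 0, w_img - 1)
    v = np.clip(uv[:, 1].astype(int), 0, h - 1)
    # Append the per-class scores of the corresponding pixel.
    return np.hstack([points, seg_scores[v, u]])

# Illustrative calibration: identity rotation, focal length 100 px.
P = np.array([[100.0, 0.0, 50.0, 0.0],
              [0.0, 100.0, 50.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pts = np.random.rand(50, 3) + [0, 0, 5.0]   # points in front of the camera
scores = np.random.rand(100, 100, 4)        # H x W x num_classes map
painted = paint_points(pts, scores, P)
print(painted.shape)  # (50, 7): xyz + 4 class scores
```

The painted points then feed any lidar-only detector unchanged, which is what makes this family of methods easy to retrofit onto existing pipelines.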

4.2 Non-projection-based 3D object detection

Non-projection-based 3D object detection methods achieve fusion without relying on explicit feature alignment, yielding robust feature representations. They circumvent the limitations of camera-to-lidar projection, which often reduces the semantic density of camera features and limits the effectiveness of techniques such as Focals Conv and PointPainting. Non-projection methods typically adopt a cross-attention mechanism or construct a unified space to address the misalignment inherent in direct feature projection. They fall into two categories: (1) query-learning-based methods and (2) unified-feature-based methods. Query-learning-based methods avoid the need for alignment entirely during fusion. Unified-feature-based methods, in contrast, construct a unified feature space but do not completely avoid projection; the projection usually occurs within a single modality. For example, BEVFusion uses LSS for camera-to-BEV projection, which happens before fusion and shows considerable robustness in scenarios where features are misaligned.

  1. Query-learning-based 3D object detection: Methods such as TransFusion, DeepFusion, DeepInteraction, AutoAlign, CAT-Det, and MixedFusion avoid projection during the feature fusion process. Instead, they align features through a cross-attention mechanism before fusing them: point cloud features typically serve as queries, image features serve as keys and values, and highly robust multi-modal features are obtained through global feature queries. DeepInteraction further introduces multi-modal interaction, in which point cloud and image features act as separate queries to enable deeper feature interaction; comprehensively integrating image features in this way yields more robust multi-modal features than using point cloud queries alone. In general, query-learning-based methods use a Transformer-style structure to perform feature queries for alignment, and the resulting multi-modal features are fed into lidar-based pipelines such as CenterPoint.
  2. Unified-feature-based 3D object detection: Methods such as EA-BEV, BEVFusion, BEVFusion4D, FocalFormer3D, FUTR3D, UniTR, Uni3D, VirConv, MSMDFusion, SFD, CMT, UVTR, and SparseFusion typically unify the heterogeneous modalities through projection before feature fusion. In the BEVFusion family, LSS performs depth estimation to convert front-view features into BEV features, after which the BEV image and BEV point cloud features are fused. CMT and UniTR instead tokenize point clouds and images and construct an implicit unified space through Transformer encoding; CMT uses projection in its positional encoding but avoids any reliance on projection at the feature-learning level. FocalFormer3D, FUTR3D, and UVTR use Transformer queries in a DETR3D-like scheme, building a unified sparse BEV feature space through queries and thereby alleviating the instability caused by direct projection.
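The query-key-value pattern described for query-learning-based fusion (point features as queries, image features as keys and values) reduces to a single cross-attention step, sketched here with NumPy. This is a single head with no learned projection matrices or normalization layers, so it only illustrates the data flow; all shapes are illustrative.

```python
import numpy as np

def cross_attention_fuse(point_feats, image_tokens):
    """Single-head cross-attention: point cloud features act as
    queries, flattened image features as keys and values, so every
    point gathers image context without any geometric projection."""
    d = point_feats.shape[1]
    scores = point_feats @ image_tokens.T / np.sqrt(d)   # (Np, Ni)
    scores -= scores.max(axis=1, keepdims=True)          # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    fused = attn @ image_tokens                          # (Np, d)
    # Residual fusion keeps the original lidar information intact.
    return point_feats + fused

pf = np.random.rand(300, 64)    # 300 point-cloud queries
it = np.random.rand(1000, 64)   # 1000 flattened image tokens
out = cross_attention_fuse(pf, it)
print(out.shape)  # (300, 64)
```

Because alignment is learned through attention weights rather than read from a calibration matrix, small extrinsic errors degrade this fusion gracefully instead of misplacing features outright.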

VirConv, MSMDFusion, and SFD construct a unified space through pseudo point clouds; projection occurs before feature learning, and the problems introduced by direct projection are resolved by the subsequent feature learning. In summary, unified-feature-based 3D object detection methods currently represent highly accurate and robust solutions. Although they involve a projection matrix, the projection does not occur within the multi-modal fusion step, so they are regarded as non-projection-based methods. Unlike auto-projection methods, they do not directly correct projection errors; instead, they construct a unified space and consider multiple dimensions of multi-modal 3D object detection to obtain highly robust multi-modal features.
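The LSS-style camera-to-BEV step that unified-feature methods perform before fusion can also be sketched. This toy version uses a handful of uniform depth bins, unit focal length, and nearest-cell accumulation, whereas real LSS learns the depth distribution and splats with a far finer frustum; every parameter below is illustrative.

```python
import numpy as np

def lift_splat(img_feats, depth_prob, depths, bev_size=32, cell=1.0):
    """Toy LSS-style lift-splat: weight each pixel's feature by its
    depth distribution, place the resulting frustum points in 3D,
    and accumulate ("splat") them into a BEV grid."""
    h, w, c = img_feats.shape
    bev = np.zeros((bev_size, bev_size, c))
    for v in range(h):
        for u in range(w):
            for k in range(depths.shape[0]):
                # Pinhole back-projection with unit focal length.
                x = (u - w / 2) * depths[k]   # lateral offset
                z = depths[k]                 # forward distance
                ix = int(x / cell) + bev_size // 2
                iz = int(z / cell)
                if 0 <= ix < bev_size and 0 <= iz < bev_size:
                    bev[iz, ix] += depth_prob[v, u, k] * img_feats[v, u]
    return bev

feats = np.random.rand(4, 8, 16)        # tiny H x W x C feature map
prob = np.random.rand(4, 8, 5)
prob /= prob.sum(-1, keepdims=True)     # per-pixel depth softmax
bev = lift_splat(feats, prob, np.linspace(2, 20, 5))
print(bev.shape)  # (32, 32, 16)
```

Once camera features live in this BEV grid, fusing them with lidar BEV features is a simple per-cell operation, which is the robustness argument made for the BEVFusion family above.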

5. Conclusion

3D object detection plays a vital role in autonomous driving perception. The field has developed rapidly in recent years and produced a large number of research papers. Based on the data produced by different sensors, these methods fall into three types: image-based, point-cloud-based, and multi-modal. Their main evaluation criteria are high accuracy and low latency. Many reviews summarize these approaches, focusing mainly on the core principles of "high accuracy and low latency" and describing their technical trajectories.

However, as autonomous driving technology moves from breakthroughs to practical applications, existing reviews do not take safety-oriented perception as their core focus and fail to cover the current technical paths related to it. For example, recent multi-modal fusion methods are often evaluated for robustness in their experiments, an aspect that existing reviews have not fully considered.

This survey therefore re-examines 3D object detection algorithms with "accuracy, latency, and robustness" as the key aspects, reclassifying the methods covered by previous reviews with special emphasis on the perspective of safe perception. We hope this work provides new insights for future research on 3D object detection, going beyond the pursuit of high accuracy alone.
