Autonomous vehicles often have varying camera sensor setups, which is inevitable due to restricted placement options for different vehicle types. Training a perception model on one particular setup and evaluating it on a new, different sensor setup reveals the so-called cross-sensor domain gap, typically leading to a degradation in accuracy. In this paper, we investigate the impact of the cross-sensor domain gap on state-of-the-art 3D object detectors. To this end, we introduce CamShift, a dataset inspired by nuScenes and created in CARLA to specifically simulate the domain gap between subcompact vehicles and sport utility vehicles (SUVs). Using CamShift, we demonstrate significant cross-sensor performance degradation, identify robustness dependencies on model architecture, and propose a data-driven solution to mitigate the effect. On the one hand, we show that model architectures based on a dense Bird’s Eye View (BEV) representation with backward projection, such as BEVFormer, are the most robust against varying sensor configurations. On the other hand, we propose a novel data-driven sensor adaptation pipeline based on neural rendering, which can transform entire datasets to match different camera sensor setups. Applying this approach improves performance across all investigated 3D object detectors, mitigating the cross-sensor domain gap by a large margin and reducing the need for new data collection by enabling efficient data reusability across vehicles with different sensor setups.
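As a rough intuition for why a shifted camera rig creates a domain gap, the sketch below projects the same 3D point through two front cameras that differ only in mounting height. All numbers (intrinsics, camera heights) are illustrative assumptions, not CamShift calibration values.

```python
import numpy as np

# Illustrative only: how the same 3D point lands on different pixels for two camera
# rigs that differ in mounting height (SUV vs. subcompact). Made-up numbers.
K = np.array([[1000.0,    0.0, 800.0],
              [   0.0, 1000.0, 450.0],
              [   0.0,    0.0,   1.0]])   # shared pinhole intrinsics

def project(point_world, cam_height):
    """Project a world point (x forward, y left, z up) into a forward-facing camera."""
    x, y, z = point_world
    p_cam = np.array([-y, cam_height - z, x])   # camera frame: x right, y down, z forward
    u, v, w = K @ p_cam
    return u / w, v / w

obstacle = (20.0, 0.0, 0.5)                     # a point 20 m ahead, 0.5 m above ground
print("SUV camera (1.8 m):", project(obstacle, 1.8))
print("SUB camera (1.4 m):", project(obstacle, 1.4))
```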
To systematically evaluate the effectiveness of our neural sensor adaptation pipeline, we introduce a comprehensive sensor adaptation benchmark.
This benchmark is designed to quantify the cross-sensor domain gap and assess the performance of 3D object detectors when models are transferred between different vehicle sensor configurations.
By providing standardized evaluation protocols and datasets, our benchmark enables fair and reproducible comparisons of adaptation strategies across a variety of real and synthetic sensor setups.
Training | Validation: sim-SUV [a] | Validation: sim-SUB [b]
---|---|---
sim-SUV [A] | SUV baseline | cross-sensor domain gap
sim-SUB [B] | cross-sensor domain gap | SUB baseline
nerf-SUV [C] | neural rendering domain gap | n/a
nerf-SUB [D] | n/a | neural sensor adaptation
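For scripting benchmark runs, the training/validation matrix above can be written down directly; a minimal sketch in Python (the split identifiers simply mirror the table and do not refer to any fixed API):

```python
# Benchmark matrix from the table above: (training split, validation split) -> condition.
EXPERIMENTS = {
    ("sim-SUV [A]", "sim-SUV [a]"): "SUV baseline",
    ("sim-SUV [A]", "sim-SUB [b]"): "cross-sensor domain gap",
    ("sim-SUB [B]", "sim-SUV [a]"): "cross-sensor domain gap",
    ("sim-SUB [B]", "sim-SUB [b]"): "SUB baseline",
    ("nerf-SUV [C]", "sim-SUV [a]"): "neural rendering domain gap",
    ("nerf-SUB [D]", "sim-SUB [b]"): "neural sensor adaptation",
}

for (train_split, val_split), condition in EXPERIMENTS.items():
    print(f"train: {train_split:12} -> validate: {val_split:12} | {condition}")
```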
The following images demonstrate how our neural sensor adaptation pipeline mitigates the cross-sensor domain gap, using StreamPETR as an example.
Consider the scenario where training data has been collected for the SUV car line (sim-SUV). We then wish to deploy StreamPETR on a new car line, specifically a subcompact vehicle (sim-SUB).
Comparison of 3D object detection performance (mAP ↑ [%]) across different training and adaptation strategies.

Exp. | Training | Validation | DETR3D | StreamPETR | BEVDet | BEVFormer
---|---|---|---|---|---|---
Bb | sim-SUB | sim-SUB | 46.1 | 52.7 | 41.9 | 61.1
Ab | sim-SUV | sim-SUB | 29.7 (−16.4) | 17.3 (−35.4) | 29.4 (−12.5) | 50.8 (−10.3)
Db | nerf-SUB | sim-SUB | 43.1 (+13.4) | 44.8 (+27.5) | 29.5 (+0.1) | 52.1 (+1.3)
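The parenthesized deltas can be recomputed from the absolute mAP values: experiment Ab is compared against the in-domain baseline Bb, and Db against Ab. A short sketch:

```python
# mAP [%] values from the table above.
MAP = {
    "Bb": {"DETR3D": 46.1, "StreamPETR": 52.7, "BEVDet": 41.9, "BEVFormer": 61.1},
    "Ab": {"DETR3D": 29.7, "StreamPETR": 17.3, "BEVDet": 29.4, "BEVFormer": 50.8},
    "Db": {"DETR3D": 43.1, "StreamPETR": 44.8, "BEVDet": 29.5, "BEVFormer": 52.1},
}

for det in ("DETR3D", "StreamPETR", "BEVDet", "BEVFormer"):
    gap = MAP["Ab"][det] - MAP["Bb"][det]        # cross-sensor degradation vs. SUB baseline
    recovered = MAP["Db"][det] - MAP["Ab"][det]  # gain from neural sensor adaptation
    remaining = MAP["Db"][det] - MAP["Bb"][det]  # gap still left w.r.t. the SUB baseline
    print(f"{det:10} gap {gap:+5.1f} | adaptation {recovered:+5.1f} | remaining {remaining:+5.1f}")
```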
We introduce the CamShift dataset, a novel dataset designed to systematically evaluate the cross-sensor domain gap in 3D object detection.
CamShift consists of two splits (sim-SUV and sim-SUB). The sim-SUV split captures scenes from the perspective of an SUV, while the sim-SUB split captures the same scenes from the viewpoint of a subcompact vehicle.
Apart from the camera perspectives, the two splits are identical: they cover the same scenes and differ only in the camera sensor setup.
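Because the two splits cover the same scenes, samples can be paired one-to-one across them. The snippet below assumes a hypothetical directory layout (data/camshift/sim-SUV and data/camshift/sim-SUB with matching relative file paths); the actual release may be organized differently.

```python
from pathlib import Path

# Hypothetical layout, for illustration only; the released dataset may be organized differently.
root = Path("data/camshift")
suv_root, sub_root = root / "sim-SUV", root / "sim-SUB"

# Pair SUV and SUB images of the same frame via their shared relative path.
pairs = [
    (suv_img, sub_root / suv_img.relative_to(suv_root))
    for suv_img in sorted(suv_root.rglob("*.jpg"))
    if (sub_root / suv_img.relative_to(suv_root)).exists()
]
print(f"{len(pairs)} paired SUV/SUB frames found")
```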
You can download a preview of the CamShift dataset here.
For complete dataset access, please reach out to camshift@gmx.de.
Category | Ratio [%] |
---|---|
Ambulance | 0.9 |
Bicycle | 3.7 |
Bus | 1.1 |
Car | 60.9 |
Human | 23.6 |
Motorcycle | 5.4 |
Truck | 4.4 |
Split | # Scenes | Town ID |
---|---|---
Train | 750 | 01, 02, 04, 05, 06, 07, 10, 12, 15
Val | 150 | 03, 13 |