Monocular 3D Object Detection for Intelligent Vehicles Based on Dynamic Bins and Depth Uncertainty

  • Abstract: To address the low recognition accuracy in current monocular 3D object detection for autonomous driving, caused by the insufficient multi-scale information capture of backbone networks, large 3D depth prediction errors, and the limited depth-encoding capability of Transformers, this paper proposes MonoDBDU, a monocular 3D object detection algorithm for autonomous vehicles. First, to address the backbone's insufficient multi-scale capture, a Bidirectional Attention-Gated Feature Fusion module is proposed and combined with the ResNeSt50 network (BAGFF-ResNeSt50) as the backbone feature extractor of the MonoDBDU framework: an attention-gating module uses deep, high-semantic features as guidance signals to dynamically weight shallow, high-resolution features, building a multi-scale feature fusion pathway with bidirectional, level-by-level feature transfer. Second, to reduce the large 3D depth prediction error, a dynamic-bins depth predictor fused with depth uncertainty is designed: uncertainty probabilities are incorporated into the depth features, and the widths and center positions of the bin intervals are set as learnable parameters, mitigating the depth error induced by both depth prediction and bin distribution; this predictor serves as the depth-estimation component of the algorithm. Finally, to overcome the limited depth-encoding capability of Transformers, a lightweight Transformer fused with depth uncertainty (DU-Transformer) is proposed: it adopts the shifted-window multi-head self-attention of Swin Transformer to further reduce the parameter count and explicitly models depth uncertainty, thereby constructing a Transformer structure guided by depth uncertainty. Simulation experiments show that MonoDBDU improves AP3D and APBEV by 3.28 and 1.01, respectively, on the KITTI dataset at IoU=0.7, and by 2.78 and 2.71, respectively, on the Waymo dataset at IoU=0.7. Real-vehicle experiments further verify that MonoDBDU is practical and effective.
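The attention-gating idea described above, where a deep, low-resolution, high-semantic feature acts as a guidance signal to dynamically re-weight a shallow, high-resolution feature, can be sketched as follows. This is a minimal illustrative sketch, not the paper's BAGFF implementation: the module name, channel sizes, and the additive-gate design (1x1 projections, ReLU, sigmoid) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Hypothetical sketch of attention-gated fusion: the deep feature
    guides a per-pixel [0, 1] weight applied to the shallow feature.
    Channel sizes and gate design are illustrative assumptions."""
    def __init__(self, shallow_ch, deep_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(shallow_ch, inter_ch, kernel_size=1)  # project shallow
        self.phi = nn.Conv2d(deep_ch, inter_ch, kernel_size=1)       # project deep guide
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)             # collapse to 1 map

    def forward(self, shallow, deep):
        # Upsample the deep guidance signal to the shallow feature's resolution.
        deep_up = F.interpolate(deep, size=shallow.shape[2:],
                                mode="bilinear", align_corners=False)
        # Additive attention followed by a sigmoid gate in [0, 1].
        attn = self.psi(F.relu(self.theta(shallow) + self.phi(deep_up)))
        gate = torch.sigmoid(attn)
        # Dynamically re-weighted shallow feature (same shape as input).
        return shallow * gate

gate = AttentionGate(shallow_ch=64, deep_ch=256, inter_ch=32)
shallow = torch.randn(1, 64, 96, 320)   # high-resolution, low-semantic
deep = torch.randn(1, 256, 24, 80)      # low-resolution, high-semantic
out = gate(shallow, deep)
print(tuple(out.shape))  # (1, 64, 96, 320)
```

In a bidirectional, level-by-level fusion chain such a gate would be applied at each pyramid level, with fused outputs passed both top-down and bottom-up.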
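The dynamic-bins depth predictor can likewise be sketched. The pattern below follows the general adaptive-bins recipe: predicted bin widths are normalized with a softmax (making them learnable and input-dependent), bin centers follow from the cumulative widths, and the final depth is the probability-weighted mean over bin centers, with a separate head emitting a per-pixel log-variance as the depth uncertainty. All names, the depth range, and the uncertainty head are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DynamicBinsDepth(nn.Module):
    """Hypothetical sketch of a dynamic-bins depth head with uncertainty:
    bin widths/centers are learnable and the depth is the expectation
    over bin centers. Depth range and head shapes are assumptions."""
    def __init__(self, in_ch, n_bins=80, d_min=0.5, d_max=80.0):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        self.bin_head = nn.Linear(in_ch, n_bins)      # per-image bin widths
        self.prob_head = nn.Conv2d(in_ch, n_bins, 1)  # per-pixel bin probabilities
        self.logvar_head = nn.Conv2d(in_ch, 1, 1)     # per-pixel depth uncertainty

    def forward(self, feat):
        # Positive, normalized widths spanning the depth range [d_min, d_max].
        widths = torch.softmax(self.bin_head(feat.mean(dim=(2, 3))), dim=1)
        widths = widths * (self.d_max - self.d_min)
        edges = self.d_min + torch.cumsum(widths, dim=1)
        centers = edges - 0.5 * widths                # (B, n_bins) bin centers
        # Per-pixel distribution over bins; depth = expectation over centers.
        probs = torch.softmax(self.prob_head(feat), dim=1)
        depth = (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
        log_var = self.logvar_head(feat)              # uncertainty as log-variance
        return depth, log_var

head = DynamicBinsDepth(in_ch=128)
feat = torch.randn(2, 128, 24, 80)
depth, log_var = head(feat)
print(tuple(depth.shape))  # (2, 1, 24, 80)
```

Because the depth is a convex combination of bin centers, every prediction is guaranteed to lie inside [d_min, d_max]; the log-variance output is the kind of signal a downstream uncertainty-guided Transformer could consume.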

     
