Abstract:
Single-modal object detectors have developed rapidly and achieved remarkable results in recent years. However, these detectors still exhibit significant limitations, primarily because they cannot leverage the complementary information intrinsic to multimodal images. Visible–infrared object detection addresses challenges such as poor visibility under low-light conditions by fusing information from visible and infrared images to exploit complementary features across the two modalities. However, the precise alignment of feature maps from different modalities and the efficient fusion of modality-specific information remain key challenges in this field. Although various methods have been proposed to address these issues, effectively handling modality differences, enhancing the complementarity of cross-modal information, and achieving efficient feature fusion continue to be bottlenecks for high-performance object detectors. To overcome these challenges, this study proposes a visible–infrared object detection method called F3M-Det, which significantly improves detection performance by enhancing, aligning, and fusing cross-modal features. The core concept of F3M-Det is to fully leverage the complementarity between visible and infrared images, thereby strengthening the model’s ability to understand and process cross-modal information. Specifically, F3M-Det consists of a feature extraction backbone, a feature enhancement module (FEM), a feature alignment module (FAM), and a feature fusion module (FFM). The FEM uses cross-modal attention to significantly enhance the expressive power of both visible and infrared image features; by capturing subtle differences and complementary information between the modalities, it enables F3M-Det to achieve higher detection accuracy. To reduce the computational cost of computing cross-attention over global feature maps while retaining their useful information, the FEM employs multiscale feature pooling to reduce the dimensionality of the feature maps. Next, the FAM is introduced to align feature maps from the two modalities. It combines global information with local details to ensure that features captured from different perspectives and scales are accurately aligned, reducing modality differences and improving the comparability of cross-modal information. This design allows the model to handle misalignment between modalities in complex environments, thereby enhancing the robustness and generalization ability of F3M-Det. Finally, the FFM is introduced to fuse cross-modal features efficiently. It incorporates frequency-aware mechanisms to suppress irrelevant modality differences during fusion while preserving useful complementary information, thereby enhancing the effectiveness of the fused features; the FFM also serves as a cross-scale feature fusion module (SFFM) to reduce information loss (a conceptual sketch is given at the end of this section). F3M-Det uses YOLOv5 as its baseline: it builds a dual-stream backbone network from CSPDarknet and integrates the FPN structure and detection head of YOLOv5. To validate the effectiveness of F3M-Det, we conducted comprehensive experimental evaluations on two widely used datasets: the unaligned DVTOD dataset and the aligned LLVIP dataset.
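As a rough, non-authoritative illustration of the cross-modal attention enhancement with multiscale pooling described above, the following PyTorch sketch shows one way such an FEM-style block could be organized; the class name, pooling sizes, and attention layout are assumptions made for illustration and are not the authors' implementation.

```python
# Minimal sketch (assumed names and layout, not the authors' code): cross-modal
# attention in which one modality's feature map queries a multiscale-pooled
# version of the other modality, so global attention stays affordable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, pool_sizes=(1, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes  # pooling grids used to shrink the key/value tokens
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def _pooled_tokens(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, sum(s*s), C): a compact token set for the other modality
        tokens = [F.adaptive_avg_pool2d(x, s).flatten(2).transpose(1, 2)
                  for s in self.pool_sizes]
        return torch.cat(tokens, dim=1)

    def forward(self, feat_q: torch.Tensor, feat_kv: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_q.shape
        q = feat_q.flatten(2).transpose(1, 2)            # queries: every spatial position
        kv = self._pooled_tokens(feat_kv)                # keys/values: pooled other modality
        enhanced, _ = self.attn(self.norm(q), kv, kv)    # cross-modal attention
        enhanced = (q + enhanced).transpose(1, 2).reshape(b, c, h, w)  # residual, back to map
        return enhanced
```

Applying such a block twice per scale (visible queries infrared, and vice versa) would yield a mutually enhanced feature pair; because the key/value set is pooled to a fixed token count, the attention cost grows linearly with the number of query positions rather than quadratically with H×W.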
The experimental results show that F3M-Det outperforms existing visible–infrared object detection methods on both datasets, demonstrating its advantage in handling cross-modal feature alignment and fusion. Additionally, ablation experiments were conducted to investigate the contribution of each module to F3M-Det’s performance. The results confirm that each proposed module improves detection accuracy, further validating the effectiveness and superiority of F3M-Det.
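The frequency-aware fusion mentioned above can likewise be pictured with a minimal sketch. The version below is only an assumed illustration, not the authors' FFM: it blends the visible and infrared spectra with a learned, frequency-wise gate and then returns to the spatial domain; the module name, gating scheme, and refinement layer are hypothetical.

```python
# Assumed illustration only (not the authors' FFM): a frequency-aware fusion step that
# blends the visible and infrared spectra with a learned, frequency-wise gate before
# returning to the spatial domain for refinement.
import torch
import torch.nn as nn

class FrequencyAwareFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # gate predicted from both amplitude spectra: per frequency, which modality to trust
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        h, w = vis.shape[-2:]
        vis_f = torch.fft.rfft2(vis, norm="ortho")       # complex spectra, (B, C, H, W//2+1)
        ir_f = torch.fft.rfft2(ir, norm="ortho")
        g = self.gate(torch.cat([vis_f.abs(), ir_f.abs()], dim=1))
        fused_f = g * vis_f + (1.0 - g) * ir_f           # frequency-selective blend
        fused = torch.fft.irfft2(fused_f, s=(h, w), norm="ortho")
        return self.refine(fused)                        # spatial-domain refinement
```

In this assumed formulation, the gate can, for example, keep the infrared stream's low-frequency thermal structure while retaining high-frequency texture from the visible stream, which matches the stated goal of suppressing irrelevant modality differences while preserving complementary detail.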