Abstract:
Single-modal object detectors have developed rapidly and achieved remarkable results in recent years. However, these detectors still exhibit significant limitations, primarily because they cannot leverage the complementary information intrinsic to multimodal images. Visible–infrared object detection addresses challenges such as poor visibility under low-light conditions by fusing information from visible and infrared images to exploit complementary features across the two modalities. However, the precise alignment of feature maps from different modalities and the efficient fusion of modality-specific information remain key challenges in this field. Although various methods have been proposed to address these issues, effectively handling modality differences, enhancing the complementarity of cross-modal information, and achieving efficient feature fusion continue to be bottlenecks for high-performance object detectors. To overcome these challenges, this study proposes a visible–infrared object detection method called F3M-Det, which significantly improves detection performance by enhancing, aligning, and fusing cross-modal features. The core concept of F3M-Det is to fully leverage the complementarity between visible and infrared images, thereby strengthening the model’s ability to understand and process cross-modal information. Specifically, F3M-Det consists of a feature extraction backbone, a feature enhancement module (FEM), a feature alignment module (FAM), and a feature fusion module (FFM). The FEM uses cross-modal attention to significantly enhance the expressive power of both visible and infrared image features; by capturing subtle differences and complementary information between the modalities, it enables F3M-Det to achieve higher detection accuracy. To reduce the computational cost of computing cross-attention over global feature maps while retaining their useful information, the FEM employs multiscale feature pooling to reduce the dimensionality of the feature maps. Next, the FAM is introduced to align feature maps from the two modalities. It combines global information with local details to ensure that features captured from different perspectives and scales are accurately aligned, reducing modality differences and improving the comparability of cross-modal information. This design allows the model to handle misalignment between modalities in complex environments, thereby enhancing the robustness and generalization ability of F3M-Det. Finally, the FFM is introduced to fuse cross-modal features efficiently. It incorporates frequency-aware mechanisms to suppress irrelevant modality differences during fusion while preserving useful complementary information, thereby enhancing the effectiveness of the fused features; the FFM also serves as a cross-scale feature fusion module (SFFM) to reduce information loss (a conceptual sketch is given at the end of this section). F3M-Det uses YOLOv5 as its baseline: it builds a dual-stream backbone network from CSPDarknet and integrates the FPN structure and detection head of YOLOv5. To validate the effectiveness of F3M-Det, we conducted comprehensive experimental evaluations on two widely used datasets: the unaligned DVTOD dataset and the aligned LLVIP dataset.
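As a rough, non-authoritative illustration of the cross-modal attention enhancement with multiscale pooling described above, the following PyTorch sketch shows one way such an FEM-style block could be organized; the class name, pooling sizes, and attention layout are assumptions made for illustration and are not the authors' implementation.

```python
# Minimal sketch (assumed names and layout, not the authors' code): cross-modal
# attention in which one modality's feature map queries a multiscale-pooled
# version of the other modality, so global attention stays affordable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, pool_sizes=(1, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes  # pooling grids used to shrink the key/value tokens
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def _pooled_tokens(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, sum(s*s), C): a compact token set for the other modality
        tokens = [F.adaptive_avg_pool2d(x, s).flatten(2).transpose(1, 2)
                  for s in self.pool_sizes]
        return torch.cat(tokens, dim=1)

    def forward(self, feat_q: torch.Tensor, feat_kv: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_q.shape
        q = feat_q.flatten(2).transpose(1, 2)            # queries: every spatial position
        kv = self._pooled_tokens(feat_kv)                # keys/values: pooled other modality
        enhanced, _ = self.attn(self.norm(q), kv, kv)    # cross-modal attention
        enhanced = (q + enhanced).transpose(1, 2).reshape(b, c, h, w)  # residual, back to map
        return enhanced
```

Applying such a block twice per scale (visible queries infrared, and vice versa) would yield a mutually enhanced feature pair; because the key/value set is pooled to a fixed token count, the attention cost grows linearly with the number of query positions rather than quadratically with H×W.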
The experimental results show that F3M-Det outperforms existing visible–infrared object detection methods on both datasets, demonstrating its advantage in handling cross-modal feature alignment and fusion. Additionally, ablation experiments were conducted to investigate the contribution of each module to F3M-Det’s performance. The results confirm that each proposed module improves detection accuracy, further validating the effectiveness and superiority of F3M-Det.
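The frequency-aware fusion mentioned above can likewise be pictured with a minimal sketch. The version below is only an assumed illustration, not the authors' FFM: it blends the visible and infrared spectra with a learned, frequency-wise gate and then returns to the spatial domain; the module name, gating scheme, and refinement layer are hypothetical.

```python
# Assumed illustration only (not the authors' FFM): a frequency-aware fusion step that
# blends the visible and infrared spectra with a learned, frequency-wise gate before
# returning to the spatial domain for refinement.
import torch
import torch.nn as nn

class FrequencyAwareFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # gate predicted from both amplitude spectra: per frequency, which modality to trust
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        h, w = vis.shape[-2:]
        vis_f = torch.fft.rfft2(vis, norm="ortho")       # complex spectra, (B, C, H, W//2+1)
        ir_f = torch.fft.rfft2(ir, norm="ortho")
        g = self.gate(torch.cat([vis_f.abs(), ir_f.abs()], dim=1))
        fused_f = g * vis_f + (1.0 - g) * ir_f           # frequency-selective blend
        fused = torch.fft.irfft2(fused_f, s=(h, w), norm="ortho")
        return self.refine(fused)                        # spatial-domain refinement
```

In this assumed formulation, the gate can, for example, keep the infrared stream's low-frequency thermal structure while retaining high-frequency texture from the visible stream, which matches the stated goal of suppressing irrelevant modality differences while preserving complementary detail.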