Abstract:
To meet the fusion-efficiency and perception requirements of infrared and visible imaging on resource-constrained UAV platforms, this paper proposes a multimodal image fusion method built on a text-guided, dual-branch heterogeneous encoder. A lightweight dual-branch heterogeneous encoding architecture is designed to exploit the complementary characteristics of the infrared and visible modalities: the infrared branch focuses on representing thermal targets and edge responses, while the visible branch emphasizes texture and detail modeling, thereby avoiding the feature redundancy and performance bottlenecks caused by homogeneous encoders. A lightweight cross-modal feature fusion module is further introduced to enhance cross-modal complementarity and expressive capacity. In addition, a pre-trained vision-language model is integrated to extract semantic text features, which guide and regulate the fusion process to improve the semantic consistency and adaptability of the fused images under diverse environmental conditions. Extensive experiments on three public multimodal image datasets demonstrate that the proposed method outperforms seven representative fusion algorithms on mainstream metrics including mutual information (MI), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). The results show that the proposed network effectively enhances the detail clarity and structural consistency of the fused images. Compared with baseline methods, it reduces inference time by approximately 50% with negligible loss in fusion quality, demonstrating strong efficiency and deployment potential.
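To make the described architecture concrete, the following is a minimal sketch of the overall idea: a shallow infrared branch, a deeper visible branch, and a fusion module modulated by a text embedding. This is not the authors' released code; the module names, channel sizes, and the FiLM-style text conditioning are illustrative assumptions.

```python
# Illustrative sketch only: dual-branch heterogeneous encoder with
# text-conditioned fusion. All layer choices are assumptions for clarity.
import torch
import torch.nn as nn


class IRBranch(nn.Module):
    """Infrared branch: shallow stack oriented toward thermal targets/edges."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class VISBranch(nn.Module):
    """Visible branch: deeper stack for texture and detail modeling."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class TextGuidedFusion(nn.Module):
    """Mix branch features, then modulate them with a text embedding
    (FiLM-style scale/shift) before decoding to the fused image."""
    def __init__(self, ch=32, text_dim=512):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, 1)
        self.to_scale = nn.Linear(text_dim, ch)
        self.to_shift = nn.Linear(text_dim, ch)
        self.decode = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, f_ir, f_vis, text_emb):
        f = self.mix(torch.cat([f_ir, f_vis], dim=1))
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(text_emb).unsqueeze(-1).unsqueeze(-1)
        return self.decode(f * (1 + scale) + shift)


if __name__ == "__main__":
    ir = torch.rand(1, 1, 128, 128)    # infrared input
    vis = torch.rand(1, 1, 128, 128)   # visible (luminance) input
    text = torch.rand(1, 512)          # e.g. an embedding from a CLIP-style text encoder
    fused = TextGuidedFusion()(IRBranch()(ir), VISBranch()(vis), text)
    print(fused.shape)                 # torch.Size([1, 1, 128, 128])
```

The sketch only conveys the structural asymmetry between the two branches and the point at which text features condition the fusion; the paper's actual modules, losses, and training procedure are described in the following sections.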