Abstract:
To meet the demands of infrared and visible image fusion for efficiency and perceptual quality on resource-constrained unmanned aerial vehicle (UAV) platforms, this paper proposes a text-guided lightweight multimodal image fusion network with heterogeneous encoders. The network adopts a lightweight dual-branch heterogeneous encoding architecture that represents infrared and visible information in a complementary manner: the infrared branch emphasizes thermal targets and edge responses, while the visible branch models texture and fine detail. This design avoids the feature redundancy and performance bottlenecks commonly associated with homogeneous encoders. To strengthen the collaborative representation of multimodal features, a lightweight cross-modal attention fusion module is introduced that jointly models attention across channel and spatial dimensions, reinforcing complementary interactions between modalities. Furthermore, semantic features extracted from the pre-trained vision–language model CLIP provide explicit semantic prior guidance: through hierarchical feature-level modulation, the weights of infrared and visible features are dynamically adjusted, improving the semantic consistency and environmental adaptability of the fused images.

Systematic comparative experiments and comprehensive evaluations were conducted on three publicly available multimodal image datasets, TNO, LLVIP, and M3FD, against nine representative image fusion algorithms. The proposed network achieves state-of-the-art performance on multiple mainstream evaluation metrics, including mutual information and structural similarity, and the fused images surpass existing methods in detail clarity, edge-structure consistency, and target discernibility. Ablation studies further show that the inference time of the proposed model is reduced by approximately 50% relative to baseline methods, achieving higher efficiency without significant performance degradation.

Beyond quantitative evaluation, qualitative experiments driven by textual instructions demonstrate the model's semantic responsiveness and content adaptability. In low-light enhancement tasks, the model markedly improves the brightness and visibility of the fused images and highlights thermal sources from the infrared modality; in daytime scenes, however, the same instruction may over-brighten background areas, reflecting the model's selective sensitivity to semantic inputs. In overexposure correction tasks, the model preserves thermal-contrast features from the infrared images, suppressing interference from overexposed regions of the visible images and producing fusion results dominated by infrared characteristics. When the infrared images exhibit low contrast, the model instead enhances texture and detail from the visible images, yielding results closer in appearance to the visible modality. In noise-robustness tasks, when the visible images are corrupted by noise, the model preferentially relies on the stable structural information of the infrared images to reconstruct the fused output, effectively mitigating noise-induced degradation and demonstrating strong anti-interference capability.
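To make the architectural description concrete, the following is a minimal PyTorch sketch of the two fusion components described above: a lightweight cross-modal attention block that reweights concatenated infrared/visible features along channel and spatial dimensions, and a FiLM-style modulation layer driven by a text embedding such as one produced by CLIP. Module names, channel sizes, and the exact modulation scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming CBAM-style attention and FiLM-style text modulation;
# the paper's exact layer configuration is not reproduced here.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuses infrared and visible feature maps with channel and spatial attention."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention over the concatenated IR/VIS features.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention from pooled channel statistics.
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_ir, feat_vis], dim=1)        # (B, 2C, H, W)
        x = x * self.channel_attn(x)                     # reweight channels
        avg_map = x.mean(dim=1, keepdim=True)            # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)            # (B, 1, H, W)
        x = x * self.spatial_attn(torch.cat([avg_map, max_map], dim=1))
        return self.project(x)                           # (B, C, H, W)


class TextGuidedModulation(nn.Module):
    """FiLM-style modulation of fused features by a text embedding (e.g., CLIP)."""

    def __init__(self, channels: int, text_dim: int = 512):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift


if __name__ == "__main__":
    fuse = CrossModalAttentionFusion(channels=64)
    modulate = TextGuidedModulation(channels=64, text_dim=512)
    ir = torch.randn(1, 64, 128, 128)
    vis = torch.randn(1, 64, 128, 128)
    text = torch.randn(1, 512)  # stand-in for a CLIP text embedding
    out = modulate(fuse(ir, vis), text)
    print(out.shape)  # torch.Size([1, 64, 128, 128])
```

In this sketch the text embedding rescales the fused feature channels, which is one simple way to realize the dynamic reweighting of infrared and visible contributions that the abstract attributes to hierarchical feature-level modulation.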
In summary, the proposed model integrates heterogeneous dual-branch encoding with cross-modal attention and semantic guidance to achieve improved fusion quality and adaptability on resource-limited UAV platforms. Experimental results confirm that it can dynamically adjust fusion strategies based on different semantic inputs, enhancing the consistency and relevance of fused images for various tasks. Moreover, the model achieves a favorable balance between computational efficiency and fusion performance, making it suitable for practical deployment in complex environments.