Text-guided lightweight multimodal image fusion with heterogeneous encoders

  • Abstract: To meet the fusion-efficiency and perceptual-performance requirements of infrared and visible image fusion on resource-constrained unmanned aerial vehicle (UAV) platforms, this paper proposes a text-guided lightweight multimodal image fusion network with heterogeneous encoders. The network adopts a lightweight dual-branch heterogeneous encoder built around the complementary roles of the two modalities: the infrared branch emphasizes thermal targets and edge responses, while the visible branch focuses on modeling texture and fine detail, thereby avoiding the feature redundancy and performance bottlenecks associated with homogeneous encoders. A lightweight cross-modal feature fusion module is further introduced to strengthen the complementarity and joint representation of multimodal information. In addition, semantic text features extracted by a pre-trained vision-language model are used to guide and modulate the fusion process, improving the semantic consistency and environmental adaptability of the fused images. On three public multimodal image datasets, TNO, LLVIP, and M3FD, the proposed method is systematically compared with nine representative image fusion algorithms. The results show that the proposed network performs favorably on multiple mainstream metrics, including mutual information and structural similarity, and that the fused images surpass those of existing methods in detail clarity, edge-structure consistency, and target discernibility. Ablation studies further show that the inference time of the proposed model is reduced by approximately 50% compared with the baseline, achieving higher efficiency without a significant loss in performance. Beyond the quantitative evaluation, qualitative experiments driven by textual instructions show that the model can flexibly adjust its fusion strategy for infrared and visible features according to different semantic instructions, adapting to low-light, overexposed, low-contrast, and noisy scenarios. While maintaining semantic consistency, it effectively enhances heat-source perception, structural clarity, and robustness to interference, exhibiting a degree of semantic controllability and content adaptability that is difficult to achieve with conventional unguided methods.
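To make the dual-branch heterogeneous encoding concrete, the following is a minimal PyTorch sketch of how an infrared branch emphasizing thermal/edge responses and a visible branch emphasizing texture could be built from lightweight depthwise-separable convolutions. The layer choices, channel widths, and the fixed Sobel edge prior are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a lightweight dual-branch heterogeneous encoder (assumed
# design): depthwise-separable convolutions keep the parameter count low for
# UAV deployment; the infrared branch adds an explicit edge prior, while the
# visible branch uses a slightly deeper stack for texture/detail modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise convolution, a common lightweight building block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return F.relu(self.pointwise(self.depthwise(x)))


class InfraredEncoder(nn.Module):
    """Infrared branch: emphasizes thermal targets and edge responses."""
    def __init__(self, out_ch=32):
        super().__init__()
        # Fixed Sobel kernels provide an explicit edge/gradient prior.
        sobel = torch.tensor([[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                              [[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]])
        self.register_buffer("sobel", sobel.unsqueeze(1))  # shape (2, 1, 3, 3)
        self.conv = nn.Sequential(
            DepthwiseSeparableConv(3, out_ch),   # 1 intensity + 2 edge maps
            DepthwiseSeparableConv(out_ch, out_ch),
        )

    def forward(self, ir):                       # ir: (B, 1, H, W)
        edges = F.conv2d(ir, self.sobel, padding=1)
        return self.conv(torch.cat([ir, edges], dim=1))


class VisibleEncoder(nn.Module):
    """Visible branch: models texture and fine detail with a denser conv stack."""
    def __init__(self, out_ch=32):
        super().__init__()
        self.conv = nn.Sequential(
            DepthwiseSeparableConv(1, out_ch),
            DepthwiseSeparableConv(out_ch, out_ch),
            DepthwiseSeparableConv(out_ch, out_ch),
        )

    def forward(self, vis):                      # vis: (B, 1, H, W)
        return self.conv(vis)
```

Keeping the two branches structurally different is what distinguishes this from a shared (homogeneous) encoder: each branch only carries the inductive bias useful for its own modality, so redundant filters are not duplicated across branches.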


    Abstract: To meet the demands for fusion efficiency and perceptual performance of infrared and visible images on resource-constrained unmanned aerial vehicle (UAV) platforms, this paper proposes a text-guided lightweight multimodal image fusion network with heterogeneous encoders. The network employs a lightweight dual-branch heterogeneous encoding architecture designed to represent infrared and visible image information in a complementary manner: the infrared-encoding branch emphasizes thermal targets and edge responses, while the visible-encoding branch focuses on modeling texture and detail information. This design effectively avoids the feature redundancy and performance bottlenecks commonly associated with homogeneous encoders. To enhance the collaborative representation of multimodal features, a lightweight cross-modal attention fusion module is introduced that jointly models attention relationships across the channel and spatial dimensions, strengthening complementary information interactions between the modalities. Furthermore, semantic features extracted by the pre-trained vision-language model CLIP provide explicit semantic prior guidance for the fusion process: through hierarchical feature-level modulation, the weights of infrared and visible features are dynamically adjusted, improving the semantic consistency and environmental adaptability of the fused images.

    Systematic comparative experiments and comprehensive evaluations were conducted on three publicly available multimodal image datasets, TNO, LLVIP, and M3FD, against nine representative image fusion algorithms. The results demonstrate that the proposed network achieves state-of-the-art performance on multiple mainstream evaluation metrics, including mutual information and structural similarity, and that the fused images surpass those of existing methods in detail clarity, edge-structure consistency, and target discernibility. Ablation studies further show that the inference time of the proposed model is reduced by approximately 50% compared with baseline methods, achieving higher efficiency without significant performance degradation.

    In addition to the quantitative evaluations, qualitative experiments guided by textual instructions were performed, demonstrating the model's strong semantic responsiveness and content adaptability. In low-light enhancement tasks, for example, the model significantly improves the brightness and visibility of the fused images and highlights thermal sources present in the infrared images; in daytime scenarios, however, the same instruction may cause excessive brightness in background areas, reflecting the model's selective sensitivity to semantic inputs. In overexposure correction tasks, the model preserves the thermal contrast of the infrared images, reducing interference from overexposed regions in the visible images and producing fusion results dominated by infrared characteristics. When the infrared images have low contrast, the model enhances texture and detail information from the visible images, generating fusion results closer to the visual appearance of the visible modality. In noise-robustness tasks, when the visible images are corrupted by noise, the model preferentially leverages the stable structural information of the infrared images to reconstruct the fused output, effectively mitigating noise-induced degradation and demonstrating strong anti-interference capability.
In summary, the proposed model integrates heterogeneous dual-branch encoding with cross-modal attention and semantic guidance to achieve improved fusion quality and adaptability on resource-limited UAV platforms. Experimental results confirm that it can dynamically adjust fusion strategies based on different semantic inputs, enhancing the consistency and relevance of fused images for various tasks. Moreover, the model achieves a favorable balance between computational efficiency and fusion performance, making it suitable for practical deployment in complex environments.
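To illustrate how cross-modal attention and text guidance can be combined, the following is a minimal PyTorch sketch of a fusion block that applies channel and spatial attention over concatenated infrared/visible features and modulates each modality with a FiLM-style scale and shift predicted from a text embedding (for example, one produced by a frozen CLIP text encoder). The gating scheme, dimensions, and module names are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of cross-modal attention fusion with text-guided modulation
# (assumed design, not the paper's exact module).
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, channels=32, text_dim=512):
        super().__init__()
        # Channel attention over the concatenated modalities.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // 4, 2 * channels, 1), nn.Sigmoid(),
        )
        # Spatial attention computed from pooled channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )
        # FiLM-style modulation: the text embedding predicts a per-channel
        # scale and shift for each modality, letting a semantic instruction
        # re-weight infrared vs. visible features before fusion.
        self.film = nn.Linear(text_dim, 4 * channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_ir, f_vis, text_emb):
        # f_ir, f_vis: (B, C, H, W); text_emb: (B, text_dim), e.g. a CLIP embedding.
        gamma_ir, beta_ir, gamma_vis, beta_vis = self.film(text_emb).chunk(4, dim=1)
        f_ir = f_ir * (1 + gamma_ir[..., None, None]) + beta_ir[..., None, None]
        f_vis = f_vis * (1 + gamma_vis[..., None, None]) + beta_vis[..., None, None]

        x = torch.cat([f_ir, f_vis], dim=1)
        x = x * self.channel_gate(x)                              # channel attention
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(pooled)                         # spatial attention
        return self.fuse(x)                                       # fused feature map


# Usage sketch: in practice text_emb would come from a frozen CLIP text encoder
# given an instruction such as "enhance the low-light scene".
fusion = CrossModalAttentionFusion(channels=32, text_dim=512)
fused = fusion(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64),
               torch.randn(1, 512))
```

In this reading, the text embedding only rescales features rather than generating content, which is one plausible way to obtain instruction-dependent fusion strategies (e.g., favoring infrared under noise or overexposure) while keeping the added computation small.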

     
