Abstract:
To meet the fusion-efficiency and perception requirements of infrared and visible imaging on resource-constrained UAV platforms, this paper proposes a multimodal image fusion method built on a text-guided, dual-branch heterogeneous encoder. A lightweight dual-branch heterogeneous encoding architecture is designed to exploit the complementary characteristics of the infrared and visible modalities: the infrared branch focuses on representing thermal targets and edge responses, while the visible branch emphasizes texture and detail modeling, thereby avoiding the feature redundancy and performance bottlenecks caused by homogeneous encoders. A lightweight cross-modal feature fusion module is further introduced to enhance cross-modal complementarity and expressive capacity. In addition, a pre-trained vision-language model is integrated to extract semantic text features, which guide and regulate the fusion process to improve the semantic consistency and adaptability of the fused images under diverse environmental conditions. Extensive experiments on three public multimodal image datasets demonstrate that the proposed method outperforms seven representative fusion algorithms on mainstream metrics including mutual information (MI), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). The results show that the proposed network effectively enhances the detail clarity and structural consistency of the fused images. Compared with baseline methods, it reduces inference time by approximately 50% with negligible loss in fusion quality, demonstrating strong efficiency and deployment potential.
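To make the described architecture concrete, the following is a minimal sketch of the overall idea: a shallow infrared branch, a deeper visible branch, and a fusion module modulated by a text embedding. This is not the authors' released code; the module names, channel sizes, and the FiLM-style text conditioning are illustrative assumptions.

```python
# Illustrative sketch only: dual-branch heterogeneous encoder with
# text-conditioned fusion. All layer choices are assumptions for clarity.
import torch
import torch.nn as nn


class IRBranch(nn.Module):
    """Infrared branch: shallow stack oriented toward thermal targets/edges."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class VISBranch(nn.Module):
    """Visible branch: deeper stack for texture and detail modeling."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class TextGuidedFusion(nn.Module):
    """Mix branch features, then modulate them with a text embedding
    (FiLM-style scale/shift) before decoding to the fused image."""
    def __init__(self, ch=32, text_dim=512):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, 1)
        self.to_scale = nn.Linear(text_dim, ch)
        self.to_shift = nn.Linear(text_dim, ch)
        self.decode = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, f_ir, f_vis, text_emb):
        f = self.mix(torch.cat([f_ir, f_vis], dim=1))
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(text_emb).unsqueeze(-1).unsqueeze(-1)
        return self.decode(f * (1 + scale) + shift)


if __name__ == "__main__":
    ir = torch.rand(1, 1, 128, 128)    # infrared input
    vis = torch.rand(1, 1, 128, 128)   # visible (luminance) input
    text = torch.rand(1, 512)          # e.g. an embedding from a CLIP-style text encoder
    fused = TextGuidedFusion()(IRBranch()(ir), VISBranch()(vis), text)
    print(fused.shape)                 # torch.Size([1, 1, 128, 128])
```

The sketch only conveys the structural asymmetry between the two branches and the point at which text features condition the fusion; the paper's actual modules, losses, and training procedure are described in the following sections.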