Illumination-adaptive granularity-progressive multimodal image fusion method

  • Abstract: To address the challenge of multi-scene visual perception under complex and changeable lighting conditions, this paper proposes an illumination-adaptive, granularity-progressive multimodal image fusion method. First, a scene information embedding module built on a large pretrained model is designed: a pretrained image encoder models the scene of the input visible-light image, and the resulting scene vector is processed by separate linear layers. The processed scene vectors then modulate the image features of the reconstruction stage along the channel dimension, so that the fusion model can generate fused images in different styles according to the scene illumination. Second, to overcome the limited representational power of existing feature extraction modules, a feature extraction module based on state-space equations is designed; it achieves global feature perception with linear complexity, reduces information loss during transmission, and improves the visual quality of the fused image. Finally, a granularity-progressive fusion module is designed: state-space equations globally aggregate the multimodal features, and a cross-modal coordinate attention mechanism fine-tunes the aggregated features, realizing multi-stage fusion of multimodal features from global to local and strengthening the network's ability to integrate information. During training, enhanced images generated from prior knowledge are used as labels, and homogeneous and heterogeneous loss functions are constructed for different environments, enabling scene-adaptive multimodal image fusion. Experimental results show that the proposed method outperforms 11 state-of-the-art algorithms on the low-light datasets MSRS and LLVIP, the mixed-illumination dataset TNO, the continuous-scene dataset RoadScene, and the hazy dataset M3FD, achieving better visual quality and higher quantitative metrics in both qualitative and quantitative comparisons. The proposed method shows considerable potential for tasks such as autonomous driving, military reconnaissance, and environmental surveillance.
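As a rough illustration of the scene-embedding idea described above, the following PyTorch sketch shows how a frozen encoder could summarize the visible image into a scene vector that linear layers map to per-channel scale and shift terms for the reconstruction features. All class names, dimensions, and the stand-in encoder are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the scene-information embedding idea (names are assumptions):
# a frozen pretrained encoder summarizes the visible image into a scene vector,
# linear layers map it to per-channel scale/shift, and the reconstruction features
# are modulated channel-wise so the fused result can follow the scene illumination.
import torch
import torch.nn as nn

class SceneConditioner(nn.Module):
    """Maps a scene vector to per-channel (scale, shift) and applies them to features."""
    def __init__(self, scene_dim: int, feat_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(scene_dim, feat_channels)
        self.to_shift = nn.Linear(scene_dim, feat_channels)

    def forward(self, feats: torch.Tensor, scene_vec: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); scene_vec: (B, scene_dim)
        scale = self.to_scale(scene_vec).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(scene_vec).unsqueeze(-1).unsqueeze(-1)
        return feats * (1.0 + scale) + shift  # channel-wise modulation

# Stand-in for the pretrained image encoder (the paper relies on a large pretrained
# model; a CLIP-style visual encoder, kept frozen, would be a typical choice).
scene_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
for p in scene_encoder.parameters():
    p.requires_grad_(False)  # scene modelling stays frozen

visible = torch.rand(1, 3, 256, 256)       # visible-light input
decoder_feats = torch.rand(1, 64, 64, 64)  # features inside the reconstruction stage

scene_vec = scene_encoder(visible)                       # (1, 128) scene vector
conditioner = SceneConditioner(scene_dim=128, feat_channels=64)
modulated = conditioner(decoder_feats, scene_vec)        # illumination-aware features
print(modulated.shape)  # torch.Size([1, 64, 64, 64])
```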

     

    Abstract: To address the challenges of multi-scene visual perception under complex and fluctuating lighting conditions, this study proposes a novel illumination-adaptive, granularity-progressive multimodal image fusion method. Visual perception in environments with varying lighting, such as urban areas at night or during harsh weather, poses significant challenges for traditional imaging systems. The proposed method integrates several techniques to ensure robust image fusion that dynamically adapts to different scene characteristics. First, a scene information embedding module built on a large pretrained model is designed to capture scene context from the input visible-light image. This module leverages a pretrained image encoder to model the scene, generating scene vectors that are processed through separate linear layers. The processed scene vectors are then progressively embedded into the fusion image reconstruction network, giving the fusion model the ability to perceive scene information. This integration allows the fusion network to adjust its behavior according to contextual lighting conditions, resulting in more accurate image fusion. Second, to overcome the limitations of existing feature extraction methods, a feature extraction module based on state-space equations is proposed. This module enables global feature perception with linear computational complexity, minimizing the loss of critical information during transmission. It enhances the visual quality of the fused images by reducing information loss and preserving the clarity of the reconstructed images, maintaining visual fidelity even under challenging lighting conditions and making it well suited to dynamic environments. Finally, a granularity-progressive fusion module is introduced. This module first employs state-space equations to globally aggregate multimodal features and then applies a cross-modal coordinate attention mechanism to fine-tune the aggregated features. This enables multi-stage fusion, from global to local granularity, enhancing the model's ability to integrate information across modalities; the multi-stage fusion process improves the coherence and detail of the output image, facilitating better scene interpretation and boosting model performance. During the training phase, prior knowledge is used to generate enhanced images as pseudo-labels, and homogeneous and heterogeneous loss functions are constructed for different environmental conditions, enabling adaptive learning. This strategy optimizes scene-adaptive multimodal image fusion by adjusting the fusion model to varying illumination conditions. Experimental results demonstrate the effectiveness of the proposed method. Extensive experiments on several benchmark datasets (MSRS and LLVIP for low-light scenarios, TNO for mixed lighting conditions, RoadScene for continuous scenes, and M3FD for hazy conditions) show that the proposed method outperforms 11 state-of-the-art algorithms in both qualitative and quantitative evaluations. The method achieves superior visual effects and higher quantitative metrics across all test scenarios, demonstrating its robustness and versatility. Compared with a two-stage method, the proposed approach also performs better in terms of visual effects and quantitative metrics. The proposed scene-adaptive fusion framework holds significant potential for applications such as autonomous driving, military reconnaissance, and environmental surveillance, where reliable visual perception under complex lighting conditions is essential. These results highlight the method's promise for real-world tasks involving dynamic lighting changes, setting a new benchmark in multimodal image fusion.
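The cross-modal coordinate attention step of the granularity-progressive fusion module could, for example, take the following form. This is a minimal sketch assuming the standard coordinate attention formulation (directional average pooling followed by 1x1 convolutions and sigmoid gates); how the gates are shared across modalities is an assumption, and the state-space global aggregation is replaced here by a simple placeholder.

```python
# Hedged sketch of a cross-modal coordinate attention step (an assumption about how
# the paper's fine-tuning stage could look): direction-aware descriptors are pooled
# from the globally aggregated feature, and the resulting height/width gates refine
# the infrared and visible branches before they are merged.
import torch
import torch.nn as nn

class CrossModalCoordAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.to_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.to_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, fused: torch.Tensor, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fused.shape
        # Coordinate pooling: keep positional information along each axis separately.
        pooled_h = fused.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        pooled_w = fused.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.shared(torch.cat([pooled_h, pooled_w], dim=2))          # (B, mid, H+W, 1)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.to_h(y_h))                              # (B, C, H, 1)
        a_w = torch.sigmoid(self.to_w(y_w.permute(0, 1, 3, 2)))          # (B, C, 1, W)
        # The same directional gates refine both modalities before local merging.
        return ir * a_h * a_w + vis * a_h * a_w

ir_feat = torch.rand(1, 64, 32, 32)
vis_feat = torch.rand(1, 64, 32, 32)
fused_global = ir_feat + vis_feat  # placeholder for the state-space global aggregation
refine = CrossModalCoordAttention(channels=64)
print(refine(fused_global, ir_feat, vis_feat).shape)  # torch.Size([1, 64, 32, 32])
```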

     
