Abstract:
To address the challenges of multi-scene visual perception under complex and fluctuating lighting conditions, this study proposes a novel illumination-condition-adaptive, granularity-progressive multimodal image fusion method. Visual perception in environments with varying illumination, such as urban areas at night or during harsh weather, poses significant challenges for traditional imaging systems. The proposed method integrates several techniques to achieve robust image fusion that dynamically adapts to different scene characteristics.
First, a large-model-based scene information embedding module is designed to capture scene context from the input visible-light image. This module leverages a pretrained image encoder to model the scene, producing scene vectors that are processed by dedicated linear layers. The processed scene vectors are then progressively embedded into the fusion image reconstruction network, giving the fusion model the ability to perceive scene information. This integration allows the fusion network to adjust its behavior according to the prevailing lighting conditions, yielding more accurate fusion results.
Second, to overcome the limitations of existing feature extraction methods, a feature extraction module based on state-space equations is proposed. This module enables global feature perception with linear computational complexity, minimizing the loss of critical information during feature propagation and preserving the clarity of the reconstructed images. As a result, visual fidelity is maintained even under challenging lighting conditions, making the approach well suited to dynamic environments.
Third, a granularity-progressive fusion module is introduced. It first employs state-space equations to globally aggregate the multimodal features and then applies a cross-modal coordinate attention mechanism to refine the aggregated features. This multi-stage, global-to-local fusion strengthens the model's ability to integrate information across modalities, improving the coherence and detail of the output image and facilitating better scene interpretation.
During training, prior knowledge is used to generate augmented images as pseudo-labels, and homogeneous and heterogeneous loss functions are constructed for different environmental conditions, enabling adaptive learning that tunes the fusion model to varying illumination.
Extensive experiments on several benchmark datasets, including MSRS and LLVIP for low-light scenarios, TNO for mixed lighting conditions, RoadScene for continuous scenes, and M3FD for hazy conditions, show that the proposed method outperforms 11 state-of-the-art algorithms in both qualitative and quantitative evaluations. The method achieves superior visual quality and higher quantitative metrics across all test scenarios, demonstrating its robustness and versatility. Furthermore, the proposed approach also outperforms a two-stage method in terms of visual quality and quantitative metrics.
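To make the described pipeline concrete, the sketch below shows one way the scene-embedding and granularity-progressive fusion ideas could be wired together in PyTorch. The class names (SceneEmbedding, CrossModalCoordAttention, GranularityProgressiveFusion), the FiLM-style scale/shift modulation, the cross-modal attention wiring, and all tensor dimensions are illustrative assumptions rather than the authors' implementation; in particular, the state-space (Mamba-style) aggregation block is stubbed with a 1x1 convolution placeholder.

```python
# Minimal sketch (assumptions, not the authors' reference code) of a scene-conditioned,
# coarse-to-fine fusion stage: scene vectors modulate the reconstruction path, a global
# aggregation step (state-space block stubbed here) merges the two modalities, and a
# coordinate-attention-style step refines the result per location.
import torch
import torch.nn as nn


class SceneEmbedding(nn.Module):
    """Maps a scene vector from a frozen pretrained image encoder (assumed backbone)
    to per-stage scale/shift pairs that modulate the fusion features."""
    def __init__(self, scene_dim: int, stage_channels: list):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(scene_dim, 2 * c) for c in stage_channels])

    def forward(self, scene_vec: torch.Tensor):
        # One (gamma, beta) pair per reconstruction stage.
        return [head(scene_vec).chunk(2, dim=-1) for head in self.heads]


class CrossModalCoordAttention(nn.Module):
    """Coordinate-attention-style refinement: directional pooling of the aggregated
    features yields attention maps that reweight each modality (assumed wiring)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.conv1 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, 1)
        self.conv_w = nn.Conv2d(hidden, channels, 1)

    def forward(self, fused, f_ir, f_vis):
        _, _, h, w = fused.shape
        pooled_h = fused.mean(dim=3, keepdim=True)                  # B x C x H x 1
        pooled_w = fused.mean(dim=2, keepdim=True).transpose(2, 3)  # B x C x W x 1
        y = self.act(self.conv1(torch.cat([pooled_h, pooled_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))        # B x C x 1 x W
        # Fine-grained, per-location reweighting of the two modalities.
        return f_ir * a_h * a_w + f_vis * (1 - a_h * a_w)


class GranularityProgressiveFusion(nn.Module):
    """One coarse-to-fine fusion stage: global aggregation (state-space block stubbed
    as a 1x1 conv), local refinement, and FiLM-style scene modulation."""
    def __init__(self, channels: int):
        super().__init__()
        self.global_agg = nn.Conv2d(2 * channels, channels, 1)  # placeholder for SSM block
        self.local_refine = CrossModalCoordAttention(channels)

    def forward(self, f_ir, f_vis, gamma, beta):
        fused = self.global_agg(torch.cat([f_ir, f_vis], dim=1))
        fused = self.local_refine(fused, f_ir, f_vis)
        # Scene-adaptive modulation injected into the reconstruction path.
        return fused * (1 + gamma[..., None, None]) + beta[..., None, None]


if __name__ == "__main__":
    f_ir = torch.randn(2, 64, 32, 32)    # infrared features
    f_vis = torch.randn(2, 64, 32, 32)   # visible-light features
    scene_vec = torch.randn(2, 512)      # from a frozen pretrained encoder (assumed dim)
    gamma, beta = SceneEmbedding(512, [64])(scene_vec)[0]
    out = GranularityProgressiveFusion(64)(f_ir, f_vis, gamma, beta)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In a full implementation, the placeholder aggregation would be replaced by a selective state-space block with linear complexity, and the scene vector would come from the frozen pretrained encoder described above, with one modulation head per reconstruction stage.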
The proposed scene-adaptive fusion framework holds significant potential for applications in fields such as autonomous driving, military reconnaissance, and environmental surveillance, where reliable visual perception under complex lighting conditions is essential. These results highlight the method’s promise for real-world tasks involving dynamic lighting changes, setting a new benchmark in multimodal image fusion.