Abstract:
Robust object detection under adverse weather conditions remains a pressing challenge for autonomous driving and intelligent transportation, because single-sensor systems are prone to performance degradation in rain, fog, or snow. To address this issue, we propose SeparateFusion, a novel multisensor fusion framework that integrates four-dimensional (4D) millimeter-wave radar and LiDAR data
via a deep neural network. By leveraging the resilience of radar to weather interference and the high spatial resolution of LiDAR, SeparateFusion delivers accurate, stable perception across diverse environments. The core idea is to treat geometry and semantics as complementary signals that should be modeled along separate but interacting paths, preserving the strengths of each modality while mitigating noise and misalignment early in the pipeline.

The architecture comprises two key modules: the geometry–semantic enhancement (GSE) encoder for early three-dimensional (3D) fusion and the bird's-eye-view (BEV) feature enhancement module (BMM) for two-dimensional (2D) feature refinement. In practice, the BMM pairs a lightweight multiscale gating unit with a Mamba-based mixer to refine BEV features. In the first stage, LiDAR and radar point clouds are independently projected into a shared pillar grid, ensuring spatial alignment. The GSE encoder then enhances the geometric and semantic information of each modality separately: geometric features capture structural layouts from point coordinates, while semantic features encode attributes such as intensity, Doppler velocity, and reflectivity. The encoder applies neighborhood-aware updates that preserve spatial continuity in the geometric stream, while semantic cues guide cross-modal correspondence. Restricting cross-modal interaction primarily to the semantic subspace mitigates discretization and registration errors that might otherwise propagate through deeper layers, while the geometric stream preserves neighborhood structure for stable aggregation. Following this enhancement, pillar-level features are extracted, enabling early multimodal fusion that aligns the modalities while preserving their individual advantages. In the second stage, the fused features are transformed into a BEV representation. The BMM processes this representation with the MambaVisionMixer structure to capture both local and long-range dependencies in the spatial domain, and a gating mechanism suppresses redundant or noisy signals so that the network focuses on discriminative information for detection. This two-stage design balances fine-grained geometry–semantic modeling in 3D space with high-level spatial reasoning in BEV space, contributing to strong robustness against weather-related degradation.

Extensive experiments on the View-of-Delft (VoD) dataset show that our method consistently outperforms both state-of-the-art single-sensor detectors and existing multisensor fusion approaches, achieving a mean average precision of 71.47% across the entire test area and 85.74% within the driving corridor, with notable gains in both global and lane-focused detection scenarios. Category-wise analysis further indicates consistent improvements for vehicles and vulnerable road users, with the clearest benefits at longer ranges, where LiDAR sparsification and reflectivity decay are more severe. We follow the standard VoD protocol for training and evaluation, using the same splits and metrics, and provide implementation details to facilitate reproducibility. Additional evaluations on simulated fog and snow datasets confirm that SeparateFusion maintains clear advantages over previous methods in low-visibility conditions, indicating strong generalization capability.
Ablation studies further validate the contributions of the GSE encoder and the BMM, showing that removing either component leads to a significant drop in detection accuracy and highlighting the complementary nature of early 3D geometry–semantic enhancement and later-stage BEV feature gating. In summary, SeparateFusion introduces a structured two-stage fusion approach for integrating radar and LiDAR data, combining early 3D geometry–semantic enhancement with later-stage BEV refinement under adaptive gating. The method achieves significant improvements over strong single-sensor and existing fusion-based object detectors under challenging weather, laying a promising foundation for next-generation all-weather intelligent perception in safety-critical applications.
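To make the two-stage design described above more concrete, the following minimal PyTorch sketch illustrates the overall flow: separate geometric and semantic streams with cross-modal interaction confined to the semantic subspace, followed by gated refinement of BEV features. This is an illustrative sketch, not the paper's implementation; the class names (GSEEncoder, BEVGatedMixer), channel widths, and layer choices are assumptions, and a plain depthwise convolution stands in for the MambaVisionMixer.

```python
# Illustrative sketch of the two-stage radar-LiDAR fusion (assumed names and sizes).
import torch
import torch.nn as nn


class GSEEncoder(nn.Module):
    """Stage 1 sketch: per-pillar geometry/semantic streams; cross-modal mixing only on semantics."""

    def __init__(self, geo_dim=3, sem_dim=4, hidden=64):
        super().__init__()
        self.geo_mlp = nn.Sequential(nn.Linear(geo_dim, hidden), nn.ReLU())
        self.sem_mlp = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU())
        # Cross-modal interaction is restricted to the semantic subspace.
        self.sem_cross = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, geo_lidar, sem_lidar, geo_radar, sem_radar):
        # geo_*: (B, P, geo_dim) pillar-wise coordinates; sem_*: (B, P, sem_dim) attributes
        g_l, g_r = self.geo_mlp(geo_lidar), self.geo_mlp(geo_radar)   # geometric streams stay separate
        s_l, s_r = self.sem_mlp(sem_lidar), self.sem_mlp(sem_radar)
        s_l2, _ = self.sem_cross(s_l, s_r, s_r)                        # LiDAR semantics attend to radar
        s_r2, _ = self.sem_cross(s_r, s_l, s_l)                        # radar semantics attend to LiDAR
        # Early fusion: concatenate per-pillar features from both modalities.
        return torch.cat([g_l + s_l2, g_r + s_r2], dim=-1)            # (B, P, 2*hidden)


class BEVGatedMixer(nn.Module):
    """Stage 2 sketch: a gating branch modulates a spatial mixer over the BEV map
    (a depthwise conv is used here in place of the MambaVisionMixer)."""

    def __init__(self, channels=128):
        super().__init__()
        self.mixer = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, bev):  # bev: (B, C, H, W)
        # Residual path keeps the original features; the gate suppresses noisy responses.
        return bev + self.gate(bev) * self.mixer(bev)


if __name__ == "__main__":
    B, P = 2, 1000
    enc = GSEEncoder()
    fused = enc(torch.randn(B, P, 3), torch.randn(B, P, 4),
                torch.randn(B, P, 3), torch.randn(B, P, 4))
    bev = torch.randn(B, 128, 160, 160)   # pillar features scattered onto an assumed BEV grid
    print(fused.shape, BEVGatedMixer()(bev).shape)
```

In this sketch, keeping attention out of the geometric branch is what mirrors the claim that neighborhood structure is preserved for stable aggregation while semantic cues carry the cross-modal correspondence.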