Abstract:
Unmanned aerial vehicle (UAV)-based aerial photography has demonstrated immense potential in burgeoning applications such as urban inspection, disaster rescue, and environmental monitoring. However, object detection in UAV imagery remains a significant challenge owing to extreme scale variations, complex background interference, and the prevalence of tiny objects. Unlike objects in natural-scene images, objects in aerial images, such as pedestrians and vehicles, typically occupy only a few pixels (e.g., less than 16×16 pixels). Existing one-stage detectors suffer from structural limitations, specifically frequency-agnostic feature extraction and irreversible information loss during downsampling, which erase the fine-grained features of pixel-level targets. To address these issues, this study proposes a refined spatial-aware distribution network (RSD-Net), a novel end-to-end architecture designed to establish a full-link spatial awareness mechanism for robust tiny-object detection. Our contributions are threefold. First, to resolve the mismatch between feature extraction and the physical attributes of targets, we design a stage-adaptive feature-extraction module (SA-C3k2). Unlike conventional static convolutions, which share weights across all spatial locations, SA-C3k2 incorporates a frequency-domain adaptation mechanism regulated by learnable coefficients. In shallow layers, it introduces a parallel edge extractor based on the Scharr operator to explicitly sharpen the edges of tiny objects, adaptively enhancing high-frequency texture signals to prevent feature loss. Conversely, in deep layers, it employs a learnable Gaussian kernel as a low-pass filter to smooth out complex background noise, such as tree branches and sea waves, thereby effectively suppressing false positives. Second, to prevent semantic dilution during cross-scale feature fusion, we construct a rep-parameterized spatial-preserving distribution neck (RSD-Neck).
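The stage-adaptive filtering idea behind SA-C3k2 can be illustrated with a minimal NumPy sketch (not the authors' implementation; the blending coefficient `alpha` and kernel sizes are illustrative assumptions): a shallow-stage high-pass branch built on the Scharr operator, and a deep-stage low-pass branch built on a Gaussian kernel.

```python
import numpy as np

# Illustrative sketch of frequency-adaptive branches; `alpha` stands in for
# the learnable coefficient that regulates the adaptation (an assumption).
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=np.float32)
SCHARR_Y = SCHARR_X.T

def conv2d(img, kernel):
    """Naive 'same' 2-D convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def gaussian_kernel(size=3, sigma=1.0):
    """Normalized isotropic Gaussian kernel (sums to 1)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (k / k.sum()).astype(np.float32)

def shallow_branch(feat, alpha=0.5):
    """High-frequency enhancement: add the Scharr gradient magnitude."""
    gx, gy = conv2d(feat, SCHARR_X), conv2d(feat, SCHARR_Y)
    return feat + alpha * np.sqrt(gx**2 + gy**2)

def deep_branch(feat, sigma=1.0):
    """Low-pass smoothing to suppress high-frequency background clutter."""
    return conv2d(feat, gaussian_kernel(3, sigma))
```

On a flat (texture-free) region both branches leave interior values essentially unchanged, whereas near edges the shallow branch amplifies the response and the deep branch attenuates it, mirroring the high-pass/low-pass division of labour described above.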
Addressing the violation of the Nyquist sampling theorem by conventional strided convolutions, this module integrates space-to-depth convolution (SPD-Conv) to achieve lossless downsampling. By shifting spatial information to the channel dimension through periodic sampling, SPD-Conv ensures that fine-grained details are preserved. Additionally, the neck incorporates a rep-parameterized local adjacent fusion block, which comprises spatial-detail and semantic-context modules. This design allows the network to model global context and optimize feature-interaction pathways, thus establishing a high-fidelity pipeline for feature transmission. Third, a dual-prior perception head (DP-Head) is introduced to address the failure of conventional intersection over union (IoU)-based metrics on tiny objects. Existing quality-estimation methods typically rely solely on regression statistics, which can be unreliable for pixel-level targets. To mitigate this, DP-Head fuses explicit visual texture priors (derived from the omnidirectional gradient magnitude) with implicit geometric distribution priors (derived from the top-k statistics of a general distribution). This establishes a robust “visual–statistical” dual-verification mechanism that significantly improves localization-quality estimation in ambiguous scenarios. Extensive experiments on the VisDrone2019-DET and NWPU VHR-10 datasets demonstrate the effectiveness of the proposed method. Compared with the baseline YOLOv11n, the RSD-Net achieves significant improvements: it increases mAP50 by 4.99 and 5.08 percentage points, and mAP50:95 by 3.82 and 7.20 percentage points, respectively, while remaining extremely lightweight at only 6.04 × 10^6 parameters. Notably, the RSD-Net outperforms the medium-scale YOLOv11m (which has 3× more parameters) in detection accuracy.
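The space-to-depth rearrangement underlying SPD-Conv can be sketched in a few lines of NumPy (an illustration of the general operation, not the paper's module): a stride-`s` periodic sampling moves each s×s spatial neighbourhood into the channel dimension, so the spatial resolution drops without discarding a single pixel.

```python
import numpy as np

def space_to_depth(x, s=2):
    """Rearrange x: (C, H, W) -> (C*s*s, H//s, W//s), losslessly."""
    c, h, w = x.shape
    assert h % s == 0 and w % s == 0, "H and W must be divisible by s"
    # Gather the s*s phase-shifted sub-grids and stack them along channels.
    subs = [x[:, i::s, j::s] for i in range(s) for j in range(s)]
    return np.concatenate(subs, axis=0)

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
y = space_to_depth(x, 2)
print(y.shape)  # (8, 2, 2): resolution halved, channels quadrupled
```

Because every input value reappears in the output, a subsequent non-strided convolution can still access the full fine-grained signal, in contrast to a strided convolution, which samples below the Nyquist rate and irreversibly drops pixels.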
Furthermore, a robustness evaluation on the TinyPerson dataset reveals that although current mainstream detectors (including YOLOv8n, YOLOv11n, and the latest YOLOv12n) stagnate at a recall bottleneck of approximately 16%, the RSD-Net achieves a breakthrough recall of 22.95% (a 6.87-percentage-point improvement over YOLOv12n). These results rigorously validate the superior cross-domain adaptability of the RSD-Net and its capability to efficiently detect pixel-level tiny objects in diverse and complex aerial environments.
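The "visual–statistical" dual-verification idea of the DP-Head can be sketched as follows. This is a loose illustration under stated assumptions, not the paper's formulation: the multiplicative fusion, the hyper-parameter `k`, and the squashing of the gradient magnitude are all assumptions made for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def geometric_prior(logits, k=4):
    """Implicit prior: mean of the top-k probabilities of a discretized
    box-offset distribution; a peaky distribution suggests confident
    localization (k=4 is an assumed hyper-parameter)."""
    p = softmax(logits)
    return float(np.mean(np.sort(p)[-k:]))

def texture_prior(patch):
    """Explicit prior: mean omnidirectional gradient magnitude of a local
    feature patch (finite differences), squashed into [0, 1)."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.sqrt(gx**2 + gy**2).mean()
    return float(mag / (mag + 1.0))

def dual_prior_quality(logits, patch, k=4):
    # Assumed fusion: both priors must agree for a high quality score.
    return geometric_prior(logits, k) * texture_prior(patch)
```

A sharply peaked offset distribution over a textured patch yields a high score, while a flat distribution or a texture-free patch drives the score toward zero, which is the dual-verification behaviour described in the abstract.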