Abstract:
Unmanned aerial vehicle (UAV)-based aerial photography has demonstrated immense potential in burgeoning applications such as urban inspection, disaster rescue, and environmental monitoring. However, object detection in UAV imagery remains a significant challenge owing to extreme scale variations, complex background interference, and the prevalence of tiny objects. Unlike objects in natural-scene images, objects in aerial images, such as pedestrians and vehicles, typically occupy only a few pixels (e.g., less than 16×16 pixels). Existing one-stage detectors suffer from structural limitations, specifically frequency-agnostic feature extraction and irreversible information loss during downsampling, which erase the fine-grained features of pixel-level targets. To address these issues, this study proposes a refined spatial-aware distribution network (RSD-Net), a novel end-to-end architecture designed to establish a full-link spatial awareness mechanism for robust tiny-object detection. Our contributions are threefold. First, to resolve the mismatch between feature extraction and the physical attributes of targets, we design a stage-adaptive feature-extraction module (SA-C3k2). Unlike conventional static convolutions, which share weights across all spatial locations, SA-C3k2 incorporates a frequency-domain adaptation mechanism regulated by learnable coefficients. In shallow layers, it introduces a parallel edge extractor based on the Scharr operator to explicitly sharpen the edges of tiny objects, adaptively enhancing high-frequency texture signals to prevent feature loss. Conversely, in deep layers, it employs a learnable Gaussian kernel as a low-pass filter to smooth out complex background noise, such as tree branches and sea waves, thereby effectively suppressing false positives. Second, to prevent semantic dilution during cross-scale feature fusion, we construct a rep-parameterized spatial-preserving distribution neck (RSD-Neck).
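The stage-adaptive filtering idea behind SA-C3k2 can be illustrated with a minimal NumPy sketch (not the authors' implementation; the blending coefficient `alpha` and kernel sizes are illustrative assumptions): a shallow-stage high-pass branch built on the Scharr operator, and a deep-stage low-pass branch built on a Gaussian kernel.

```python
import numpy as np

# Illustrative sketch of frequency-adaptive branches; `alpha` stands in for
# the learnable coefficient that regulates the adaptation (an assumption).
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=np.float32)
SCHARR_Y = SCHARR_X.T

def conv2d(img, kernel):
    """Naive 'same' 2-D convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def gaussian_kernel(size=3, sigma=1.0):
    """Normalized isotropic Gaussian kernel (sums to 1)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (k / k.sum()).astype(np.float32)

def shallow_branch(feat, alpha=0.5):
    """High-frequency enhancement: add the Scharr gradient magnitude."""
    gx, gy = conv2d(feat, SCHARR_X), conv2d(feat, SCHARR_Y)
    return feat + alpha * np.sqrt(gx**2 + gy**2)

def deep_branch(feat, sigma=1.0):
    """Low-pass smoothing to suppress high-frequency background clutter."""
    return conv2d(feat, gaussian_kernel(3, sigma))
```

On a flat (texture-free) region both branches leave interior values essentially unchanged, whereas near edges the shallow branch amplifies the response and the deep branch attenuates it, mirroring the high-pass/low-pass division of labour described above.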
Addressing the violation of the Nyquist sampling theorem by conventional strided convolutions, this module integrates space-to-depth convolution (SPD-Conv) to achieve lossless downsampling. By shifting spatial information to the channel dimension through periodic sampling, SPD-Conv ensures that fine-grained details are preserved. Additionally, the neck incorporates a rep-parameterized local adjacent fusion block, which comprises spatial-detail and semantic-context modules. This design allows the network to model global context and optimize feature-interaction pathways, thus establishing a high-fidelity pipeline for feature transmission. Third, a dual-prior perception head (DP-Head) is introduced to address the failure of conventional intersection over union (IoU)-based metrics on tiny objects. Existing quality-estimation methods typically rely solely on regression statistics, which can be unreliable for pixel-level targets. To mitigate this, DP-Head fuses explicit visual texture priors (derived from the omnidirectional gradient magnitude) with implicit geometric distribution priors (derived from the top-k statistics of a general distribution). This establishes a robust “visual–statistical” dual-verification mechanism that significantly improves localization-quality estimation in ambiguous scenarios. Extensive experiments on the VisDrone2019-DET and NWPU VHR-10 datasets demonstrate the effectiveness of the proposed method. Compared with the baseline YOLOv11n, the RSD-Net achieves significant improvements: it increases mAP50 by 4.99 and 5.08 percentage points, and mAP50:95 by 3.82 and 7.20 percentage points, respectively, while remaining extremely lightweight at only 6.04 × 10^6 parameters. Notably, the RSD-Net outperforms the medium-scale YOLOv11m (which has 3× more parameters) in detection accuracy.
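The space-to-depth rearrangement underlying SPD-Conv can be sketched in a few lines of NumPy (an illustration of the general operation, not the paper's module): a stride-`s` periodic sampling moves each s×s spatial neighbourhood into the channel dimension, so the spatial resolution drops without discarding a single pixel.

```python
import numpy as np

def space_to_depth(x, s=2):
    """Rearrange x: (C, H, W) -> (C*s*s, H//s, W//s), losslessly."""
    c, h, w = x.shape
    assert h % s == 0 and w % s == 0, "H and W must be divisible by s"
    # Gather the s*s phase-shifted sub-grids and stack them along channels.
    subs = [x[:, i::s, j::s] for i in range(s) for j in range(s)]
    return np.concatenate(subs, axis=0)

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
y = space_to_depth(x, 2)
print(y.shape)  # (8, 2, 2): resolution halved, channels quadrupled
```

Because every input value reappears in the output, a subsequent non-strided convolution can still access the full fine-grained signal, in contrast to a strided convolution, which samples below the Nyquist rate and irreversibly drops pixels.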
Furthermore, a robustness evaluation on the TinyPerson dataset reveals that although current mainstream detectors (including YOLOv8n, YOLOv11n, and the latest YOLOv12n) stagnate at a recall bottleneck of approximately 16%, the RSD-Net achieves a breakthrough recall of 22.95% (a 6.87-percentage-point improvement over YOLOv12n). These results rigorously validate the superior cross-domain adaptability of the RSD-Net and its capability to efficiently detect pixel-level tiny objects in diverse and complex aerial environments.
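The "visual–statistical" dual-verification idea of the DP-Head can be sketched as follows. This is a loose illustration under stated assumptions, not the paper's formulation: the multiplicative fusion, the hyper-parameter `k`, and the squashing of the gradient magnitude are all assumptions made for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def geometric_prior(logits, k=4):
    """Implicit prior: mean of the top-k probabilities of a discretized
    box-offset distribution; a peaky distribution suggests confident
    localization (k=4 is an assumed hyper-parameter)."""
    p = softmax(logits)
    return float(np.mean(np.sort(p)[-k:]))

def texture_prior(patch):
    """Explicit prior: mean omnidirectional gradient magnitude of a local
    feature patch (finite differences), squashed into [0, 1)."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.sqrt(gx**2 + gy**2).mean()
    return float(mag / (mag + 1.0))

def dual_prior_quality(logits, patch, k=4):
    # Assumed fusion: both priors must agree for a high quality score.
    return geometric_prior(logits, k) * texture_prior(patch)
```

A sharply peaked offset distribution over a textured patch yields a high score, while a flat distribution or a texture-free patch drives the score toward zero, which is the dual-verification behaviour described in the abstract.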