Authors: Author1, Author2, Author3, Author4, Author5*
Affiliation: School of Computer Science and Technology, Anhui University of Technology, Ma’anshan 243032, China
*Corresponding Author: email@example.com
Abstract: This paper addresses two critical challenges for tiny objects in images captured by unmanned aerial vehicles (UAVs): feature disappearance and background occlusion. We propose a novel end-to-end architecture, the Refined Spatial-aware Distribution Network (RSD-Net), designed to overcome two structural deficiencies of existing detectors: the lack of frequency-domain perception and the irreversible information loss during downsampling. Specifically: (1) We design a Stage-Adaptive feature extraction module (SA-C3k2), which uses explicit edge sharpening and frequency-domain filtering to adaptively enhance high-frequency textures in shallow layers and suppress background noise in deep layers. (2) We construct a Rep-parameterized Spatial-preserving Distribution neck network (RSD-Neck), integrating SPD-Conv for lossless downsampling with global context modeling to prevent semantic dilution during cross-scale feature fusion. (3) We introduce a Dual-Prior perception head (DP-Head), fusing explicit visual and implicit geometric distribution priors to achieve robust localization quality assessment. Extensive experiments demonstrate the superiority of RSD-Net. On the VisDrone2019-DET and NWPU VHR-10 datasets, RSD-Net improves mAP50 over the YOLOv11n baseline by 4.99 and 5.08 percentage points, and mAP50:95 by 3.82 and 7.20 percentage points, respectively, while keeping the parameter count lightweight (6.04M). In robustness testing on the challenging TinyPerson dataset, RSD-Net outperforms the recent YOLOv12n in both recall and precision, validating the effectiveness of our architecture for tiny object detection in UAV applications.
Keywords: Spatial-aware distribution; Frequency-aware feature; Lossless feature transmission; Dual-prior quality estimation; Tiny object detection; Unmanned aerial vehicle
1. Introduction
In recent years, unmanned aerial vehicle (UAV) aerial photography has demonstrated immense potential in emerging applications such as urban inspection, disaster rescue, and environmental monitoring. However, object detection in UAV imagery remains a significant challenge owing to extreme scale variations, complex background interference, and the prevalence of tiny objects. Unlike natural-scene images, objects in aerial images captured by UAVs, such as pedestrians and vehicles, typically occupy only a few pixels (e.g., smaller than 16 × 16 pixels). Existing one-stage detectors suffer from two structural limitations, frequency-agnostic feature extraction and irreversible information loss during downsampling, which together erase the fine-grained features of pixel-level targets.

To address these issues, this study proposes a Refined Spatial-aware Distribution Network (RSD-Net). Unlike external optimization strategies such as SAHI, RSD-Net builds an internal, end-to-end spatial awareness mechanism for robust tiny-object detection in UAV scenarios. Our contribution is threefold. First, to resolve the mismatch between feature extraction and physical attributes, we design a Stage-Adaptive feature extraction module (SA-C3k2). Second, to prevent semantic dilution during cross-scale feature fusion, we construct a Rep-parameterized Spatial-preserving Distribution neck (RSD-Neck). Third, to address the failure of conventional IoU-based metrics on tiny objects, we introduce a Dual-Prior perception head (DP-Head).
2. Related Work
Frequency-Aware Feature Extraction & Architectural Reparameterization. Traditional CNNs process all pixels with shared weights and therefore struggle to distinguish high-frequency textures from low-frequency backgrounds. In the domain of frequency-aware features, some works attempt to decouple high- and low-frequency information in the transform domain, demonstrating the importance of high-frequency components for localizing tiny objects in UAV imagery. Concurrently, structural re-parameterization techniques enhance feature expression without increasing inference latency. For weak-texture targets, recent detectors directly incorporate explicit geometric edge extraction modules. However, existing backbones still lack a stage-adaptive regulation mechanism to dynamically balance shallow sharpening and deep smoothing.
Multi-Scale Feature Fusion & Lossless Transmission. Multi-scale feature fusion aims to address drastic scale variations. Beyond basic FPN, advanced structures introduce recursive feature pyramids or explore efficient information interaction paradigms via “Heavy-Neck” designs. To capture long-range dependencies, mechanisms like Gather-and-Distribute (GD) are employed. Nonetheless, the “downsampling” operation remains a critical bottleneck for tiny object information. Maintaining high-resolution representations has been emphasized to avoid irreversible spatial precision loss. Our work seeks to organically combine global context modeling with lossless transmission.
Localization Quality Estimation & Distribution Modeling. Accurate localization quality estimation is crucial for dense object detection in UAV imagery. Pioneering works reformulated bounding box regression as a general probability distribution, using distribution statistics to implicitly infer localization quality. Subsequent research leveraged this distribution information for task alignment and knowledge distillation. While these distribution-based methods represent the state of the art, they are inherently implicit inferences that ignore explicit visual cues (e.g., edge gradients) from the image itself, making them prone to misjudgment in the complex scenarios typical of UAV operations.
3. Methodology
The overall architecture of the proposed Refined Spatial-aware Distribution Network (RSD-Net) is designed to tackle the aforementioned challenges in UAV-based detection. The network comprises a backbone enhanced with stage-adaptive feature extraction, a neck for lossless multi-scale feature distribution, and a head with dual-prior verification.
3.1 Stage-Adaptive Feature Extraction Module (SA-C3k2)
In UAV imagery, targets of different scales exhibit distinct feature representations across network depths. The standard CSPNet backbone employs identical residual structures across all levels, a static mechanism with clear limitations for drone-based perception: (1) Shallow Feature Loss: Tiny objects primarily rely on high-frequency edge information, but conventional convolutions can cause edge blurring during downsampling, leading to missed detections. (2) Deep Noise Interference: As the receptive field expands, complex background textures such as trees or waves are preserved. This high-frequency noise can be confused with target semantics, causing false positives.
To solve this “homogeneous hierarchical processing” issue, we propose the Stage-Adaptive C3k2 (SA-C3k2) module. Its core unit, the SA-Bottleneck, builds upon the standard bottleneck structure by incorporating a parallel Physical Prior Branch. This branch operates in parallel with the main convolutional path, with fusion achieved via a dual residual structure. The SA-Bottleneck operates in two distinct modes depending on the feature hierarchy level.
Shallow Mode (N2, N3): In shallow layers, to enhance the geometric saliency of tiny objects in UAV imagery, we introduce an Edge Extractor parallel to the 1×1 convolution. This branch captures high-frequency contour information of the input feature $$X_{in}$$ using the Scharr operator, adaptively scaled by a learnable coefficient $$\alpha$$.
$$Y_{\text{shallow}} = F_{\text{conv}}(X_{\text{in}}) + \alpha \cdot F_{\text{edge}} + X_{\text{in}}$$
Here, $$F_{\text{conv}}$$ represents the main path consisting of 1×1 and 3×3 convolutions. $$F_{\text{edge}}$$ is the output of the edge extraction branch. $$Y_{\text{shallow}}$$ denotes the output feature in shallow mode. The addition operation injects sharpened edge information into the feature flow, effectively preventing feature loss for small targets. Crucially, $$\alpha$$ is a learnable parameter, allowing the model to adaptively adjust the intensity of edge information injection based on the feature distribution at different layers via backpropagation: $$\alpha \leftarrow \alpha - \eta \frac{\partial \mathcal{L}}{\partial \alpha}$$, where $$\eta$$ is the learning rate and $$\mathcal{L}$$ is the total loss function.
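The shallow-mode update can be illustrated with a minimal NumPy sketch. It assumes that $$F_{\text{edge}}$$ is the per-channel Scharr gradient magnitude (the text does not state how the two directional responses are combined) and treats the main path $$F_{\text{conv}}$$ as an arbitrary callable. The names `filter3x3` and `shallow_bottleneck` are hypothetical, and $$\alpha$$ is passed as a plain number rather than a trained parameter.

```python
import numpy as np

# Standard 3x3 Scharr kernels for horizontal and vertical derivatives.
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=np.float64)
SCHARR_Y = SCHARR_X.T


def filter3x3(x, kernel):
    """Apply one 3x3 kernel to every channel of x (C, H, W), zero padding."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return out


def shallow_bottleneck(x, f_conv, alpha):
    """Y_shallow = F_conv(X_in) + alpha * F_edge + X_in.

    Assumption: F_edge is the Scharr gradient magnitude of X_in.
    """
    gx = filter3x3(x, SCHARR_X)
    gy = filter3x3(x, SCHARR_Y)
    f_edge = np.sqrt(gx ** 2 + gy ** 2)
    return f_conv(x) + alpha * f_edge + x
```

With a main path that returns zeros, a flat region passes through unchanged, while pixels next to a vertical step receive an extra response proportional to $$\alpha$$.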
Deep Mode (N4, N5): In deep layers, to suppress false responses from background texture, the parallel branch is replaced by a Gaussian Kernel. This branch acts as a learnable low-pass filter, smoothing the input feature, with the smoothing strength regulated by coefficient $$\beta$$.
$$Y_{\text{deep}} = F_{\text{conv}}(X_{\text{in}}) + \beta \cdot (X_{\text{in}} * K_{\text{gauss}}) + X_{\text{in}}$$
$$Y_{\text{deep}}$$ represents the output feature in deep mode, $$*$$ denotes the convolution operation, and $$K_{\text{gauss}}$$ is the learnable Gaussian convolution kernel performing low-pass filtering. This design allows the SA-Bottleneck to automatically “blur” irrelevant background clutter in deep layers, significantly enhancing the classifier’s focus on main semantics.
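A comparable sketch for deep mode is given below. The paper describes $$K_{\text{gauss}}$$ as learnable without giving its parameterization; this example assumes a normalized 3×3 kernel controlled by a single width `sigma`, applied per channel with edge-replicated padding. The function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma, size=3):
    """Normalized 2D Gaussian kernel (assumed parameterization of K_gauss)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def smooth(x, kernel):
    """Low-pass filter every channel of x (C, H, W) with the given kernel."""
    _, h, w = x.shape
    r = kernel.shape[0] // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return out


def deep_bottleneck(x, f_conv, beta, sigma=1.0):
    """Y_deep = F_conv(X_in) + beta * (X_in * K_gauss) + X_in."""
    return f_conv(x) + beta * smooth(x, gaussian_kernel(sigma)) + x
```

Because the kernel sums to one, constant regions are left untouched by the filter, while high-frequency fluctuations are attenuated before being added back with weight $$\beta$$.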
The SA-C3k2 module stacks multiple SA-Bottleneck units and uses a final 1×1 convolution for channel fusion. This design does not disrupt the original gradient propagation path but internalizes physical prior knowledge as an inductive bias in a “plug-in” manner.
3.2 Rep-parameterized Spatial-Preserving Distribution Neck (RSD-Neck)
While the Gather-and-Distribute (GD) mechanism effectively addresses long-range feature fusion, its Feature Alignment Module (FAM) has a structural flaw: it relies on average pooling to forcibly unify feature sizes from different levels, causing information loss for tiny objects at the very beginning of fusion.
To achieve “full information preservation” with manageable computation, we propose the Rep-parameterized Spatial-preserving Distribution neck (RSD-Neck). It consists of two core parallel branches: the Spatial Detail Stream and the Semantic Context Stream.
Feature Alignment via Lossless Downsampling: We abandon pooling operations and introduce Space-to-Depth Convolution (SPD-Conv) for lossless cross-level alignment. For an input feature $$X \in \mathbb{R}^{S \times S \times C_1}$$, the SPD layer performs periodic sampling with a scale factor, reorganizing it into $$X' \in \mathbb{R}^{\frac{S}{\text{scale}} \times \frac{S}{\text{scale}} \times (\text{scale}^2 C_1)}$$. This process is described as:
$$X'(i, j, :) = \text{Concat}_{u=0}^{\text{scale}-1} \text{Concat}_{v=0}^{\text{scale}-1} X(i \cdot \text{scale} + u, j \cdot \text{scale} + v, :)$$
Here, $$(i, j)$$ are spatial indices in the output feature map, $$\text{scale}$$ is the downsampling ratio (set to 2), and $$\text{Concat}$$ denotes concatenation along the channel dimension. This operation moves weak pixel signals that pooling would otherwise filter out from the spatial dimension to the channel dimension, so that subsequent fusion modules receive more of the original image information that is crucial in UAV scenes.
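The space-to-depth rearrangement can be sketched in a few lines of NumPy. The example covers only the SPD reshuffling (not the convolution that follows it in SPD-Conv), uses the same channel ordering as the equation (outer index $$u$$, inner index $$v$$), and includes a hypothetical inverse, `depth_to_space`, purely to show that no values are discarded.

```python
import numpy as np


def space_to_depth(x, scale=2):
    """Rearrange x (S, S, C) into (S/scale, S/scale, scale^2 * C)."""
    h, w, _ = x.shape
    assert h % scale == 0 and w % scale == 0, "size must be divisible by scale"
    # x[u::scale, v::scale][i, j] == x[i*scale + u, j*scale + v]
    blocks = [x[u::scale, v::scale, :]
              for u in range(scale) for v in range(scale)]
    return np.concatenate(blocks, axis=-1)


def depth_to_space(y, scale=2):
    """Inverse rearrangement, used here only to demonstrate losslessness."""
    h, w, cc = y.shape
    c = cc // (scale * scale)
    x = np.empty((h * scale, w * scale, c), dtype=y.dtype)
    idx = 0
    for u in range(scale):
        for v in range(scale):
            x[u::scale, v::scale, :] = y[:, :, idx * c:(idx + 1) * c]
            idx += 1
    return x
```

A 4×4×3 input becomes 2×2×12, and applying the inverse recovers the input exactly, which is the sense in which the downsampling is lossless. Average pooling, by contrast, maps each 2×2 neighborhood to a single value.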
Feature Fusion via Reparameterized Blocks: To enhance non-linear expression capability while maintaining inference speed, we design a Rep-parameterized Local Adjacent Fusion (Rep-LAF) module. During training, it contains multiple parallel convolutional branches (e.g., 3×3 conv, 1×1 conv, identity) to fit complex feature distributions. During inference, leveraging the additivity of convolutions, the parameters of all branches are condensed into a single 3×3 kernel via transformation:
$$W_{\text{eq}} = \frac{\gamma_3}{\sigma_3} W^{(3\times3)} + \text{Pad}\left(\frac{\gamma_1}{\sigma_1} W^{(1\times1)}\right)$$
$$b_{\text{eq}} = \left(b_3 - \frac{\mu_3 \gamma_3}{\sigma_3}\right) + \left(b_1 - \frac{\mu_1 \gamma_1}{\sigma_1}\right)$$
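where $$W_{\text{eq}}$$ and $$b_{\text{eq}}$$ are the equivalent weight and bias for inference. $$W^{(3\times3)}$$ and $$W^{(1\times1)}$$ are the convolutional kernel weights from the corresponding training branches. $$\mu, \sigma, \gamma$$ represent the mean, standard deviation, and scale factor of the Batch Normalization layers, and $$b_3$$, $$b_1$$ are the corresponding Batch Normalization shift terms. $$\text{Pad}(\cdot)$$ pads the 1×1 kernel with zeros to match the 3×3 size. An identity branch, when present, can be folded in the same way by treating it as a 1×1 convolution with an identity kernel.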
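The fusion formulas can be checked numerically. The sketch below is a NumPy illustration rather than the authors' implementation: it fuses one 3×3 branch and one 1×1 branch, each followed by Batch Normalization in inference form, and omits the identity branch as the printed equations do. It assumes $$\sigma$$ already denotes the standard deviation (with any stabilizing epsilon folded in); `conv2d`, `batchnorm`, and `fuse_branches` are hypothetical helper names.

```python
import numpy as np


def conv2d(x, w):
    """Stride-1 cross-correlation of x (Cin, H, W) with w (Cout, Cin, k, k),
    using zero padding so the spatial size is preserved."""
    cout, _, k, _ = w.shape
    _, h, wd = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((cout, h, wd))
    for dy in range(k):
        for dx in range(k):
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out


def batchnorm(y, mu, sigma, gamma, b):
    """Inference-time BN: gamma * (y - mu) / sigma + b, per output channel."""
    s = (None, None)
    return gamma[(...,) + s] * (y - mu[(...,) + s]) / sigma[(...,) + s] + b[(...,) + s]


def fuse_branches(w3, bn3, w1, bn1):
    """Return (W_eq, b_eq) for the 3x3 + 1x1 branches, per the equations."""
    mu3, s3, g3, b3 = bn3
    mu1, s1, g1, b1 = bn1
    # Pad(.) places the 1x1 weight at the center of a zero 3x3 kernel.
    w1_padded = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))
    w_eq = (g3 / s3)[:, None, None, None] * w3 \
        + (g1 / s1)[:, None, None, None] * w1_padded
    b_eq = (b3 - mu3 * g3 / s3) + (b1 - mu1 * g1 / s1)
    return w_eq, b_eq
```

Summing the two BN-normalized branch outputs and applying the single fused 3×3 convolution with bias $$b_{\text{eq}}$$ give the same result up to floating-point error, which is why the extra training branches add no inference cost.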
The RSD-Neck physically separates the feature flow into two paths with topological constraints: (1) Spatial Detail Stream: This path is explicitly tied to shallow, high-resolution features (N2~N3). The Rep-LAF modules here are constrained to extract and preserve high-frequency spatial details, injecting them into layers like B3/B4 to supplement localization information for UAV targets. (2) Semantic Context Stream: This path is explicitly tied to deep, low-resolution features (B3~B5). The Rep-LAF modules here are constrained to model long-range semantic context, injecting it into layers like P4/P5 to enhance category discrimination.
3.3 Dual-Prior Perception Head (DP-Head)
While the SA-C3k2 module significantly enhances feature expression, a critical challenge remains at the detection stage: how to accurately select the candidate box with the highest localization quality from thousands of proposals.
Existing heads (e.g., GFLV2) introduce statistical features based on regression distribution to assist quality assessment, but this is essentially a “blind inference” that only focuses on the variance of regression values, ignoring the spatial texture of the feature map itself. To compensate for this “blind spot” and maintain design consistency with SA-C3k2’s focus on edges, we propose the Dual-Prior Perception Head (DP-Head).
Its core is the embedded Dual-Prior Quality Module (DPQM), which innovatively constructs two parallel perception paths for complementary “explicit visual perception” and “implicit geometric statistics” verification.
Explicit Texture Prior Branch: To maintain design consistency with the front-end, we reintroduce the Scharr operator within the detection head. This branch operates directly on the regression feature map $$F_{\text{reg}}$$. It calculates the omnidirectional gradient magnitude as an explicit physical basis for measuring localization quality.
$$M_{\text{grad}} = \sqrt{(F_{\text{reg}} * K_x)^2 + (F_{\text{reg}} * K_y)^2}$$
Here, $$K_x$$ and $$K_y$$ are the horizontal and vertical Scharr operator kernels. The computed gradient map $$M_{\text{grad}}$$ is processed by Global Average Pooling (GAP) and an MLP to generate the texture prior score $$S_{\text{tex}}$$, quantifying the edge sharpness within the predicted box.
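As a rough illustration of the texture branch up to the pooling step, the sketch below computes $$M_{\text{grad}}$$ with the standard Scharr kernels and reduces it by Global Average Pooling to one value per channel. The learned MLP that maps this descriptor to $$S_{\text{tex}}$$ is omitted, edge-replicated padding is an assumption made to avoid spurious border responses, and the function names are hypothetical.

```python
import numpy as np

K_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=np.float64)
K_Y = K_X.T


def correlate3x3(x, kernel):
    """Per-channel 3x3 filtering of x (C, H, W) with edge-replicated padding."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return out


def gradient_magnitude(f_reg):
    """M_grad = sqrt((F_reg * K_x)^2 + (F_reg * K_y)^2)."""
    gx = correlate3x3(f_reg, K_X)
    gy = correlate3x3(f_reg, K_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def texture_descriptor(f_reg):
    """Global Average Pooling of M_grad: one scalar per channel (MLP omitted)."""
    return gradient_magnitude(f_reg).mean(axis=(1, 2))
```

A featureless map yields a descriptor of zero, whereas a map containing a boundary yields a positive value, giving the subsequent MLP a direct signal about edge content.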
Implicit Geometric Prior Branch: This branch inherits the statistical idea of GFLV2 to capture regression prediction uncertainty. For the generalized distribution prediction $$G$$ of a bounding box, we extract Top-k statistical features from its probability curve. Intuitively, a “sharper” curve indicates higher confidence in boundary location. These statistics are encoded by an MLP to generate the geometric prior score $$S_{\text{geo}}$$.
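The Top-k statistics can be sketched as follows. The example assumes the distribution $$G$$ is given as per-side logits over discrete bins, as in GFL-style heads, and appends the mean of the Top-k values to the feature vector; the choice of `k=4`, the appended mean, and the function names are illustrative assumptions. The MLP that turns these features into $$S_{\text{geo}}$$ is not shown.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def topk_statistics(logits, k=4):
    """logits: (4, n_bins), one discrete distribution per box side.

    Returns a flat vector of length 4 * (k + 1): the k largest
    probabilities of each side (descending) followed by their mean.
    """
    probs = softmax(logits, axis=-1)
    topk = -np.sort(-probs, axis=-1)[:, :k]
    stats = np.concatenate([topk, topk.mean(axis=-1, keepdims=True)], axis=-1)
    return stats.reshape(-1)
```

For a sharply peaked distribution the largest probability is close to one, while for a flat distribution over 16 bins every Top-k entry equals 1/16, so the feature vector separates confident from ambiguous boundary predictions.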
The final IoU quality prediction score $$I$$ is generated by channel-wise concatenation of $$S_{\text{tex}}$$ and $$S_{\text{geo}}$$ followed by a Sigmoid activation:
$$I = \sigma \left( \mathcal{F}_{\text{fusion}}([S_{\text{tex}}; S_{\text{geo}}]) \right)$$
where $$\mathcal{F}_{\text{fusion}}$$ denotes the fusion operation involving concatenation and an MLP. This score $$I$$ is multiplied with the classification confidence to serve as the final criterion for NMS ranking. This dual-verification mechanism effectively addresses the failure of a single statistical feature in the ambiguous scenarios common in UAV imagery.
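The effect of the fused score on ranking can be shown with a small pure-Python example. Here $$\mathcal{F}_{\text{fusion}}$$ is replaced by a single linear layer with made-up weights, and the candidate values are invented for illustration; only the structure (sigmoid of the fused priors, multiplied by classification confidence) follows the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quality_score(s_tex, s_geo, weights=(1.0, 1.0), bias=0.0):
    """I = sigmoid(F_fusion([S_tex; S_geo])).

    Assumption: F_fusion is a single linear layer with illustrative weights.
    """
    z = weights[0] * s_tex + weights[1] * s_geo + bias
    return sigmoid(z)


def rank_for_nms(candidates):
    """Sort candidates by classification confidence times quality score I."""
    scored = [(name, cls_conf * quality_score(s_tex, s_geo))
              for name, cls_conf, s_tex, s_geo in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: (name, classification confidence, S_tex, S_geo).
candidates = [("A", 0.90, -1.0, -0.5), ("B", 0.80, 1.5, 1.0)]
ranking = rank_for_nms(candidates)
```

In this toy case candidate B has the lower classification confidence but the higher predicted localization quality, so it is ranked ahead of A for NMS.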
4. Experiments and Analysis
4.1 Datasets and Implementation Details
We conducted training and evaluation on the VisDrone2019-DET and NWPU VHR-10 datasets. To validate generalization, we performed robustness tests on the TinyPerson dataset.
VisDrone2019-DET is a standard benchmark for UAV-based object detection, featuring dense, tiny objects with over 60% of targets smaller than 32×32 pixels. NWPU VHR-10 is a high-resolution optical remote sensing dataset with objects ranging from tiny vehicles to large structures. TinyPerson focuses on extremely small pedestrian targets (average ~18×18 pixels) with challenging background interference such as sea waves, making it well suited to testing extreme feature conditions in UAV contexts.
All experiments were conducted on an NVIDIA RTX A5000 GPU using PyTorch. Input images were resized to 640×640. The SGD optimizer was used with an initial learning rate of 0.01. Models were trained for 200 epochs on VisDrone and 360 epochs on NWPU VHR-10 and TinyPerson, with early stopping. Standard evaluation metrics including Precision (P), Recall (R), F1, AP, mAP50, and mAP50:95 were employed.
$$P = \frac{\text{TP}}{\text{TP} + \text{FP}} \times 100\%$$
$$R = \frac{\text{TP}}{\text{TP} + \text{FN}} \times 100\%$$
$$\text{mAP} = \frac{1}{C}\sum_{i=1}^{C} \text{AP}_i$$
where C is the number of classes.
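The metric definitions translate directly into code. The helper names below are hypothetical, the counts in the usage note are made-up examples, and F1 is computed with its standard definition as the harmonic mean of precision and recall.

```python
def precision(tp, fp):
    """P = TP / (TP + FP), in percent."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN), in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """mAP = (1 / C) * sum of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 30 true positives with 10 false positives and 70 false negatives give a precision of 75% and a recall of 30%.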
4.2 Comparative Experiments
We compared RSD-Net against a wide range of state-of-the-art detectors on VisDrone2019-DET and NWPU VHR-10. The selected models cover classical two-stage and one-stage detectors, lightweight real-time detectors for deployment, and end-to-end Transformer-based detectors. The results are summarized in Table 1.
| Dataset | Model | Params (M) | GFLOPs | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|---|
| VisDrone2019-DET | Faster R-CNN | 41.39 | 208.0 | 32.96 | – |
| VisDrone2019-DET | RetinaNet-FPN | 36.52 | 210.0 | 27.64 | – |
| VisDrone2019-DET | RTMDet-Tiny | 4.88 | 8.0 | 31.22 | – |
| VisDrone2019-DET | YOLOv5n | 1.90 | 4.5 | 26.54 | – |
| VisDrone2019-DET | YOLOv8n | 3.00 | 8.1 | 25.92 | – |
| VisDrone2019-DET | YOLOv11n (Baseline) | 2.62 | 6.3 | 32.65 | 18.83 |
| VisDrone2019-DET | YOLOv11m | 20.04 | 67.7 | 35.09 | – |
| VisDrone2019-DET | RT-DETR-R18 | 20.00 | 60.0 | 33.32 | – |
| VisDrone2019-DET | DAB-DETR-R50 | 43.50 | 94.4 | 43.35 | – |
| VisDrone2019-DET | DINO | 47.56 | 274.0 | 44.52 | – |
| VisDrone2019-DET | RSD-Net (Ours) | 6.04 | 10.8 | 37.64 | 22.65 |
| NWPU VHR-10 | Faster R-CNN | 41.39 | 208.0 | 82.34 | – |
| NWPU VHR-10 | RetinaNet-FPN | 36.52 | 210.0 | 85.47 | – |
| NWPU VHR-10 | RTMDet-Tiny | 4.88 | 8.0 | 84.88 | – |
| NWPU VHR-10 | YOLOv11n (Baseline) | 2.62 | 6.3 | 84.55 | 49.41 |
| NWPU VHR-10 | YOLOv11m | 20.04 | 67.7 | 86.82 | – |
| NWPU VHR-10 | RT-DETR-R18 | 20.00 | 60.0 | 88.71 | – |
| NWPU VHR-10 | DINO | 47.56 | 274.0 | 93.12 | – |
| NWPU VHR-10 | RSD-Net (Ours) | 6.04 | 10.8 | 89.63 | 56.61 |
Table 1 demonstrates that RSD-Net achieves a compelling balance between accuracy and model size. Notably, it surpasses the medium-sized YOLOv11m in mAP50 on VisDrone (37.64% vs. 35.09%) while using less than one-third of the parameters (6.04M vs. 20.04M), highlighting the high parameter efficiency of our architecture for UAV applications.
4.3 Ablation Studies
We performed systematic ablation studies on the VisDrone2019-DET dataset to validate the contribution of each proposed component, using YOLOv11n as the baseline. The modules are denoted as: A (RSD-Neck), B (SA-C3k2), C (DP-Head). Results are shown in Table 2.
| Components | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | Params (M) |
|---|---|---|---|---|---|
| Baseline (YOLOv11n) | 43.15 | 33.08 | 32.65 | 18.83 | 2.62 |
| Baseline + A | 44.73 | 33.66 | 33.99 | 20.08 | 5.93 |
| Baseline + B | 43.82 | 34.42 | 33.88 | 19.92 | 2.64 |
| Baseline + C | 44.76 | 35.55 | 35.66 | 21.26 | 2.67 |
| Baseline + A + B | 46.27 | 34.59 | 35.22 | 21.44 | 5.99 |
| Baseline + A + C | 46.99 | 34.72 | 36.01 | 21.31 | 5.99 |
| Baseline + B + C | 47.47 | 36.06 | 37.02 | 22.64 | 2.72 |
| Baseline + A + B + C (Full RSD-Net) | 48.27 | 36.92 | 37.64 | 22.65 | 6.04 |
Each of the three modules improves performance on its own. DP-Head provides the largest single-module gain in mAP50 (+3.01 percentage points). Although RSD-Neck increases the parameter count, its role in reconstructing deep features is indispensable. Combining all three modules raises mAP50 by 4.99 percentage points over the baseline.
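The gains quoted above follow from simple differences between rows of Table 2. The short script below reproduces that arithmetic; the dictionary copies the (mAP50, mAP50:95) values from the table, and the helper name `gain` is hypothetical.

```python
# (mAP50, mAP50:95) in percent, copied from Table 2.
TABLE2 = {
    "Baseline": (32.65, 18.83),
    "A": (33.99, 20.08),       # RSD-Neck
    "B": (33.88, 19.92),       # SA-C3k2
    "C": (35.66, 21.26),       # DP-Head
    "A+B+C": (37.64, 22.65),   # Full RSD-Net
}


def gain(config, metric_index=0):
    """Improvement over the baseline in percentage points, rounded to 2 places."""
    delta = TABLE2[config][metric_index] - TABLE2["Baseline"][metric_index]
    return round(delta, 2)


single_gains = {name: gain(name) for name in ("A", "B", "C")}
best_single = max(single_gains, key=single_gains.get)
full_gain_map50 = gain("A+B+C", 0)
full_gain_map5095 = gain("A+B+C", 1)
```

Running it identifies module C (DP-Head) as the largest single-module contributor and recovers the 4.99 and 3.82 percentage point gains of the full model on VisDrone2019-DET.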
Ablation on RSD-Neck Components: We further dissected the RSD-Neck. As shown in Table 3, replacing the original aggregation and injection modules with our SPD-FAM and Rep-LAF brings substantial gains. Their combination yields the best performance.
| Aggregation Module | Injection Module | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|
| AvgPool (Baseline) | Conv+Add | 32.65 | 18.83 |
| SPD-FAM (Ours) | Conv+Add | 33.88 | 19.92 |
| AvgPool | Rep-LAF (Ours) | 35.66 | 21.26 |
| SPD-FAM (Ours) | Rep-LAF (Ours) | 37.64 | 22.65 |
Ablation on Downsampling Operators: We evaluated different downsampling operators within the RSD-Neck framework (Table 4). Removing SPD-Conv (Baseline N) causes mAP50 to drop to 29.84%, below the original YOLOv11n (32.65%), showing that an effective downsampling mechanism is necessary. While standard strided convolution recovers performance, our proposed SPD-Conv achieves the best results, especially in Recall (36.92%), confirming its effectiveness in preserving fine-grained information for tiny objects in UAV imagery.
| Downsampling Method | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|
| None (Baseline N) | 41.22 | 30.68 | 29.84 | 18.96 |
| Conv 3×3, stride=2 | 47.15 | 35.85 | 36.55 | 21.82 |
| Max Pooling | 46.50 | 35.42 | 36.12 | 21.45 |
| Average Pooling | 46.65 | 35.58 | 36.25 | 21.58 |
| SPD-Conv (Ours) | 48.27 | 36.92 | 37.64 | 22.65 |
4.4 Robustness Evaluation on Extreme Scenarios (TinyPerson)
To validate adaptability under extreme conditions, we trained and tested models on the TinyPerson dataset. The results in Table 5 show that while all models struggle with this dataset (low absolute mAP), RSD-Net demonstrates a clear relative advantage. It breaks through the recall bottleneck of ~16% seen in mainstream detectors such as YOLOv12n, achieving a recall of 22.95% (a 6.87 percentage point improvement over YOLOv12n). This supports the cross-domain robustness of RSD-Net in handling pixel-level tiny objects in diverse and complex UAV environments.
| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|
| YOLOv8n | 30.44 | 15.67 | 12.49 | 4.35 |
| YOLOv11n | 31.83 | 16.05 | 13.47 | 4.76 |
| YOLOv12n | 31.61 | 16.08 | 13.07 | 4.52 |
| RSD-Net (Ours) | 32.56 | 22.95 | 18.40 | 6.40 |
5. Conclusion
This paper proposed the Refined Spatial-aware Distribution Network (RSD-Net) to address the challenges of feature annihilation and background interference in UAV-based tiny object detection. Specifically, the SA-C3k2 module enables frequency-domain adaptive feature extraction. The RSD-Neck introduces lossless downsampling via SPD-Conv and global context modeling. The DP-Head establishes a “visual-statistical” dual-verification mechanism for reliable localization quality assessment. Experimental results on VisDrone2019-DET and NWPU VHR-10 demonstrate significant performance improvements over the baseline. Robustness evaluation on TinyPerson confirms the architecture’s capability to handle pixel-level targets in extreme UAV scenarios. Future work will focus on further optimizing inference latency and exploring hardware acceleration deployment on FPGA or edge-side chips to meet the practical engineering requirements of real-time UAV inspection.
