Vision-based systems for detecting surveying drones play a critical role in aerial surveillance and defense. Small targets in complex environments pose three challenges: limited feature representation, low detection rates, and high computational cost. Traditional fusion strategies such as FPN and PANet do not address the semantic gaps between non-adjacent feature layers, so features degrade as they propagate. To overcome these limitations, we introduce PHE-DETR, a framework built on Progressive Hybrid Encoding, which decouples intra-scale feature refinement from cross-scale integration.

Architectural Framework
The PHE-DETR pipeline comprises three components: a multi-scale feature extractor, a progressive hybrid encoder, and an IoU-aware decoder. The HGNet backbone processes an input image $x \in \mathbb{R}^{H \times W \times C}$ and generates a five-level feature pyramid:
$$ \{P_1, P_2, P_3, P_4, P_5\} = \text{HGNet}(x) $$
The spatial resolution halves at each level: $320^2 \rightarrow 160^2 \rightarrow 80^2 \rightarrow 40^2 \rightarrow 20^2$. For surveying UAV detection, only $P_3$, $P_4$, and $P_5$ are processed further, balancing spatial detail against semantic richness.
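The resolution progression can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a 640×640 input and that level $P_k$ has stride $2^k$; both are inferred from the $320^2 \rightarrow 20^2$ progression rather than stated in the text, and `pyramid_shapes` is a hypothetical helper name.

```python
def pyramid_shapes(height, width, strides=(2, 4, 8, 16, 32)):
    """Return {level: (h, w)} for P1..P5.

    Assumes level P_k downsamples the input by 2**k (an assumption
    inferred from the stated resolutions, not given explicitly).
    """
    return {f"P{k}": (height // s, width // s)
            for k, s in enumerate(strides, start=1)}


# Assumed 640x640 input, consistent with a 320x320 P1 at stride 2.
shapes = pyramid_shapes(640, 640)

# Only the three deepest levels are kept for the encoder.
selected = {name: shapes[name] for name in ("P3", "P4", "P5")}

print(shapes)
print(selected)
```

Keeping only $P_3$ to $P_5$ drops the two highest-resolution maps, which together account for most of the spatial positions in the pyramid.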
Progressive Hybrid Encoding
Our encoder resolves feature misalignment through two specialized modules:
Attention-Based Intra-Scale Feature Interaction (AIFI)
AIFI applies self-attention to the deepest feature map, $P_5$:
$$ Q = K = V = \text{Flatten}(P_5) $$
$$ A_5 = \text{Reshape}(\text{Attention}(Q,K,V)) $$
This captures global contextual relationships critical for identifying obscured surveying drones.
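The two equations above can be sketched in NumPy as follows. This is an illustrative, single-head version that takes $Q = K = V = \text{Flatten}(P_5)$ literally; a practical encoder layer would normally add learned projections, positional encodings, and a feed-forward block, none of which are modelled here. The function name `aifi_sketch` and the channel count are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aifi_sketch(p5):
    """Single-head self-attention over a (C, H, W) feature map.

    Q = K = V = Flatten(P5); no learned projections (simplification).
    Returns the reshaped output A5 and the attention matrix.
    """
    c, h, w = p5.shape
    tokens = p5.reshape(c, h * w).T            # Flatten: (H*W, C)
    scores = tokens @ tokens.T / np.sqrt(c)    # scaled dot product
    attn = softmax(scores, axis=-1)            # each row sums to 1
    out = attn @ tokens                        # (H*W, C)
    return out.T.reshape(c, h, w), attn        # Reshape back to (C, H, W)


rng = np.random.default_rng(0)
p5 = rng.standard_normal((8, 20, 20))          # hypothetical 8-channel P5
a5, attn = aifi_sketch(p5)
print(a5.shape, attn.shape)
```

Because every one of the 400 positions attends to every other, the attention matrix is 400×400; restricting attention to the smallest map keeps this quadratic cost manageable.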
Adaptive Cross-Scale Feature Fusion (ACFM)
ACFM operates in two cascaded stages:
- Stage-1 (ACFM-2): Fuses adjacent layers
$$ A_3 = \text{ACFM-2}(P_3, P_4) $$
$$ A_4 = \text{ACFM-2}(P_4, P_3) $$
- Stage-2 (ACFM-3): Integrates $A_3$, $A_4$, $A_5$
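$$ S_l = \alpha^l \odot \text{Align}(A_3) + \beta^l \odot \text{Align}(A_4) + \gamma^l \odot \text{Align}(A_5) $$
where $\alpha^l$, $\beta^l$, and $\gamma^l$ are position-wise adaptive weight maps satisfying $\alpha_{i,j}^l + \beta_{i,j}^l + \gamma_{i,j}^l = 1$ at every position $(i,j)$, and $\text{Align}(\cdot)$ brings each input to the resolution of output level $l$. Feature alignment uses bilinear interpolation (upsampling) or strided convolution (downsampling).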
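A minimal NumPy sketch of the Stage-2 fusion at output level 4 is given below. It makes several assumptions that are not from the text: the three weights are obtained by a softmax over per-position logits (which guarantees they sum to one), the logits are random placeholders rather than predicted by a learned layer, and 2×2 average pooling stands in for the learned strided convolution. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of a (C, H, W) map using half-pixel centres."""
    _, h, w = x.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy


def avg_pool_2x(x):
    """Stride-2 downsampling; a stand-in for a learned strided convolution."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def acfm3_sketch(a3, a4, a5, logits):
    """Fuse A3, A4, A5 at the resolution of A4 with position-wise weights."""
    h, w = a4.shape[1:]
    aligned = [avg_pool_2x(a3), a4, bilinear_resize(a5, h, w)]
    weights = softmax(logits, axis=0)   # alpha, beta, gamma: each (H, W)
    fused = sum(wt[None] * feat for wt, feat in zip(weights, aligned))
    return fused, weights


rng = np.random.default_rng(0)
c = 4                                          # hypothetical channel count
a3 = rng.standard_normal((c, 80, 80))
a4 = rng.standard_normal((c, 40, 40))
a5 = rng.standard_normal((c, 20, 20))
logits = rng.standard_normal((3, 40, 40))      # placeholder weight logits
s4, weights = acfm3_sketch(a3, a4, a5, logits)
print(s4.shape)
```

Because the weights form a convex combination at every position, the fused map stays within the range spanned by the aligned inputs, and each location can favour whichever scale is most informative there.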
Dual-Flow Gradient Harmonizer (DFGH)
DFGH replaces the conventional RepC3 block, pairing an identity branch with a processed branch to improve gradient flow:
$$ \text{DFGH}(X) = \text{Concat}[X_{\text{identity}}, \text{Bottleneck}(\text{Bottleneck}(X_{\text{processed}}))] $$
Compared to RepC3, DFGH reduces parameters by 23% while enhancing gradient propagation for small surveying UAV features.
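The structure of the DFGH formula can be illustrated with a short NumPy sketch. The text does not specify how $X$ is divided or what a bottleneck contains, so the even channel split, the residual two-layer pointwise bottleneck, and all function names below are assumptions made purely for illustration.

```python
import numpy as np


def pointwise_conv(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def bottleneck(x, w1, w2):
    """Residual bottleneck built from two pointwise convs (assumed design)."""
    hidden = np.maximum(pointwise_conv(x, w1), 0.0)   # ReLU
    return x + pointwise_conv(hidden, w2)


def dfgh_sketch(x, params):
    """Concat[X_identity, Bottleneck(Bottleneck(X_processed))].

    The even channel split is an assumption; params holds one
    (w1, w2) weight pair per bottleneck.
    """
    half = x.shape[0] // 2
    x_identity, x_processed = x[:half], x[half:]
    y = x_processed
    for w1, w2 in params:
        y = bottleneck(y, w1, w2)
    return np.concatenate([x_identity, y], axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 40, 40))
half = x.shape[0] // 2
# Two bottlenecks, each squeezing the processed half to 2 channels and back.
params = [(rng.standard_normal((2, half)), rng.standard_normal((half, 2)))
          for _ in range(2)]
out = dfgh_sketch(x, params)
print(out.shape)
```

In this sketch the identity half reaches the output untouched, so gradients have a direct path back to the input regardless of what the bottlenecks learn.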
Experimental Validation
We evaluate performance on Det-Fly (13,271 images) and ARD-MAV (106,665 frames) datasets containing surveying drones under diverse conditions. Evaluation metrics include:
- mAP@50: Primary accuracy measure
- Parameters: Model complexity
- GFLOPs: Computational load
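For reference, mAP@50 counts a prediction as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full average-precision computation; the `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return iou(pred, gt) >= threshold


gt = (0, 0, 10, 10)
close_pred = (2, 0, 12, 10)   # overlap 80, union 120
far_pred = (8, 0, 18, 10)     # overlap 20, union 180
print(iou(gt, close_pred), is_true_positive(close_pred, gt))
print(iou(gt, far_pred), is_true_positive(far_pred, gt))
```

For small targets a shift of a few pixels changes IoU sharply, which is one reason the 0.5 threshold is demanding for distant drones.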
Detection Performance Comparison
| Method | Det-Fly mAP@50 | ARD-MAV mAP@50 | Params (M) | GFLOPs |
|---|---|---|---|---|
| YOLOv8-M | 94.4 | 77.1 | 25.8 | 78.7 |
| RT-DETR | 92.4 | 82.2 | 32.0 | 103.4 |
| MHAF-YOLO | 96.1 | – | 21.8 | 103.7 |
| PHE-DETR | 97.0 | 86.7 | 25.8 | 70.6 |
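The relative differences between PHE-DETR and RT-DETR follow directly from the table. The snippet below simply recomputes them from the reported values; the dictionary layout and variable names are illustrative.

```python
# Values copied from the comparison table above.
rt_detr = {"det_fly": 92.4, "ard_mav": 82.2, "params_m": 32.0, "gflops": 103.4}
phe_detr = {"det_fly": 97.0, "ard_mav": 86.7, "params_m": 25.8, "gflops": 70.6}


def relative_reduction(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline


param_cut = relative_reduction(rt_detr["params_m"], phe_detr["params_m"])
gflops_cut = relative_reduction(rt_detr["gflops"], phe_detr["gflops"])
ard_mav_gain = phe_detr["ard_mav"] - rt_detr["ard_mav"]
det_fly_gain = phe_detr["det_fly"] - rt_detr["det_fly"]

print(f"Parameters: {param_cut:.1f}% fewer")
print(f"GFLOPs:     {gflops_cut:.1f}% fewer")
print(f"ARD-MAV:    +{ard_mav_gain:.1f} mAP@50 points")
print(f"Det-Fly:    +{det_fly_gain:.1f} mAP@50 points")
```

The accuracy gains are absolute differences in mAP@50 points, whereas the parameter and GFLOPs figures are relative reductions.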
Ablation Studies
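The ablation shows that each fusion stage contributes to accuracy: mAP@50 rises from 95.8 with both stages disabled to 97.0 with the full design, which uses 1.8M more parameters: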
| ACFM-2 | ACFM-3 | mAP@50 | Δ Params |
|---|---|---|---|
| ✗ | ✗ | 95.8 | -1.8M |
| ASFF only | DFGH only | 96.2 | -1.5M |
| Full | Full | 97.0 | 0 |
The progressive fusion strategy reduces semantic gaps between non-adjacent layers by 63% compared to single-stage fusion. On an NVIDIA RTX 3090, PHE-DETR reaches 30.2 FPS at 640×640 input resolution, 6.7% faster than RT-DETR, while maintaining precision for surveying UAV identification.
Conclusion
PHE-DETR achieves the best accuracy among the compared methods for surveying drone detection through its multi-scale feature management. The Progressive Hybrid Encoding architecture delivers:
- About 19% fewer parameters (25.8M vs. 32.0M) and 32% fewer GFLOPs (70.6 vs. 103.4) than RT-DETR
- A 4.5-point mAP@50 improvement over RT-DETR on the challenging ARD-MAV dataset (86.7 vs. 82.2)
- Real-time performance (30.2 FPS) without accuracy compromise
Future work will explore temporal modeling for tracking surveying UAVs across consecutive frames and multi-spectral feature fusion for all-weather operation. The framework shows significant promise for deployment in border surveillance and critical infrastructure protection scenarios.
