A Comprehensive Review of UAV Object Detection with Progressive Hybrid Encoding

Vision-based detection of surveying drones plays a critical role in aerial surveillance and defense applications. Small targets in complex environments pose three main challenges: limited feature representation, low detection rates, and computational inefficiency. Traditional fusion strategies such as FPN and PANet do not address the semantic gaps between non-adjacent feature layers, so features degrade as they propagate. To overcome these limitations, we introduce PHE-DETR, a framework built on Progressive Hybrid Encoding, which decouples intra-scale feature refinement from cross-scale integration.

Architectural Framework

The PHE-DETR pipeline comprises three components: a multi-scale feature extractor, a progressive hybrid encoder, and an IoU-aware decoder. The HGNet backbone takes an input image $x \in \mathbb{R}^{H \times W \times C}$ and generates a five-level feature pyramid:

$$ \{P_1, P_2, P_3, P_4, P_5\} = \text{HGNet}(x) $$

Spatial resolution halves at each level: $320^2 \rightarrow 160^2 \rightarrow 80^2 \rightarrow 40^2 \rightarrow 20^2$. For surveying-UAV detection, only $P_3$, $P_4$, and $P_5$ are passed to the encoder, which balances spatial detail against semantic richness.
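The resolution progression above can be reproduced with a few lines of Python. This is a minimal sketch: the stride values (2 through 32) and the 640-pixel square input are assumptions inferred from the stated $320^2 \rightarrow 20^2$ sequence, and the helper names are hypothetical.

```python
# Assumed strides: inferred from the 320 -> 20 progression for a 640-pixel
# square input; the source does not list them explicitly.
STRIDES = {"P1": 2, "P2": 4, "P3": 8, "P4": 16, "P5": 32}


def pyramid_sizes(input_size=640):
    """Return the side length of every pyramid level for a square input."""
    return {name: input_size // stride for name, stride in STRIDES.items()}


def selected_levels(input_size=640):
    """Keep only the levels that the encoder consumes (P3, P4, P5)."""
    sizes = pyramid_sizes(input_size)
    return {name: sizes[name] for name in ("P3", "P4", "P5")}


if __name__ == "__main__":
    print(pyramid_sizes())    # side length of P1..P5
    print(selected_levels())  # side length of P3..P5 only
```

Under these assumptions the three selected levels have side lengths 80, 40, and 20, so the encoder works on at most 6,400 spatial positions per map rather than the 102,400 of $P_1$.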

Progressive Hybrid Encoding

Our encoder resolves feature misalignment through two specialized modules:

Attention-Based Intra-Scale Feature Interaction (AIFI)

AIFI applies self-attention to the deepest feature map, $P_5$:

$$ Q = K = V = \text{Flatten}(P_5) $$
$$ A_5 = \text{Reshape}(\text{Attention}(Q,K,V)) $$

This captures global contextual relationships critical for identifying obscured surveying drones.
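The two equations above can be illustrated in plain Python. This is a simplified sketch, not the model's actual layer: it uses single-head scaled dot-product attention with $Q = K = V$ and omits the learned projections, multiple heads, and positional encodings that a real transformer encoder layer would normally include. All function names are hypothetical.

```python
import math


def flatten(feature_map):
    """Turn an H x W grid of C-dimensional vectors into H*W tokens."""
    return [vec for row in feature_map for vec in row]


def reshape(tokens, height, width):
    """Inverse of flatten: rebuild the H x W grid from the token list."""
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product attention with Q = K = V = tokens."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Similarity of this query to every token, then normalised weights.
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        # Weighted sum of the value vectors, channel by channel.
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def aifi(p5):
    """A_5 = Reshape(Attention(Q, K, V)) with Q = K = V = Flatten(P_5)."""
    height, width = len(p5), len(p5[0])
    return reshape(self_attention(flatten(p5)), height, width)
```

Every output position is a weighted mix of all positions in $P_5$, which is what gives the module its global receptive field. Restricting attention to the $20 \times 20$ map keeps the token count at 400, so the quadratic cost of attention stays small.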

Adaptive Cross-Scale Feature Fusion (ACFM)

ACFM operates in two cascaded stages:

  1. Stage-1 (ACFM-2): Fuses adjacent layers
    $$ A_3 = \text{ACFM-2}(P_3, P_4) $$
    $$ A_4 = \text{ACFM-2}(P_4, P_3) $$
  2. Stage-2 (ACFM-3): Integrates $A_3$, $A_4$, $A_5$
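    $$ S_{i,j}^l = \alpha_{i,j}^l \cdot \text{Align}(A_3)_{i,j} + \beta_{i,j}^l \cdot \text{Align}(A_4)_{i,j} + \gamma_{i,j}^l \cdot \text{Align}(A_5)_{i,j} $$

where $\alpha_{i,j}^l$, $\beta_{i,j}^l$, and $\gamma_{i,j}^l$ are position-wise adaptive weights for output level $l$ at location $(i,j)$, constrained so that $\alpha_{i,j}^l + \beta_{i,j}^l + \gamma_{i,j}^l = 1$. $\text{Align}(\cdot)$ resamples each map to the resolution of level $l$, using bilinear interpolation for upsampling and strided convolution for downsampling.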
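The position-wise weighting can be sketched as follows. Two assumptions are made here that the text does not confirm: the three maps have already been aligned to a common resolution, and the weights come from a softmax over three raw scores per position, which is one common way to guarantee that they sum to one. Single-channel maps and the function names are for illustration only.

```python
import math


def softmax3(a, b, c):
    """Map three raw scores to non-negative weights that sum to 1."""
    peak = max(a, b, c)
    ea, eb, ec = math.exp(a - peak), math.exp(b - peak), math.exp(c - peak)
    total = ea + eb + ec
    return ea / total, eb / total, ec / total


def fuse(a3, a4, a5, logits):
    """Blend three aligned H x W maps with per-position adaptive weights.

    logits[i][j] holds the three raw scores for position (i, j); in a real
    model these would be predicted by a small learned layer (assumption).
    """
    fused = []
    for i, row in enumerate(a3):
        fused_row = []
        for j in range(len(row)):
            alpha, beta, gamma = softmax3(*logits[i][j])
            fused_row.append(alpha * a3[i][j] + beta * a4[i][j] + gamma * a5[i][j])
        fused.append(fused_row)
    return fused
```

With equal scores the result is the plain average of the three maps; when one score dominates, the output at that position follows the corresponding level almost exclusively. This lets the network favor fine-grained levels where a small target sits and deeper levels elsewhere.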

Dual-Flow Gradient Harmonizer (DFGH)

DFGH replaces the conventional RepC3 block with a two-branch design intended to improve gradient flow:

$$ \text{DFGH}(X) = \text{Concat}[X_{\text{identity}}, \text{Bottleneck}(\text{Bottleneck}(X_{\text{processed}}))] $$

Compared to RepC3, DFGH reduces parameters by 23% while enhancing gradient propagation for small surveying UAV features.
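The data flow of the DFGH formula can be traced with a toy example. This sketch makes several assumptions that are not stated in the text: the input is split evenly along the channel axis into the identity and processed parts, each bottleneck is residual, and a plain scalar function stands in for the convolutional layers. Channels are represented as single floats purely to keep the example runnable.

```python
def bottleneck(channels, transform):
    """Placeholder residual bottleneck: output = input + transform(input).

    `transform` stands in for the convolutional layers (assumption).
    """
    return [x + transform(x) for x in channels]


def dfgh(channels, transform, split=None):
    """Concat[X_identity, Bottleneck(Bottleneck(X_processed))].

    The even channel split is an assumption; the source does not say how
    X_identity and X_processed are obtained.
    """
    split = len(channels) // 2 if split is None else split
    identity, processed = channels[:split], channels[split:]
    processed = bottleneck(bottleneck(processed, transform), transform)
    # The identity branch is passed through unchanged and concatenated.
    return identity + processed


if __name__ == "__main__":
    print(dfgh([1.0, 2.0, 3.0, 4.0], lambda x: 0.5 * x))
```

Because the identity branch bypasses both bottlenecks, its gradient reaches earlier layers without passing through any intermediate transformation, which is the property the formula is designed to provide.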

Experimental Validation

We evaluate performance on Det-Fly (13,271 images) and ARD-MAV (106,665 frames) datasets containing surveying drones under diverse conditions. Evaluation metrics include:

  • mAP@50: Mean average precision at an IoU threshold of 0.5; the primary accuracy measure
  • Parameters: Model size, in millions of parameters
  • GFLOPs: Computational load per forward pass
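The "50" in mAP@50 refers to the intersection-over-union (IoU) threshold that decides whether a predicted box counts as a match. The sketch below shows that check for axis-aligned boxes in `(x1, y1, x2, y2)` form; the function names are hypothetical, and the full metric additionally ranks predictions by confidence and averages precision over recall levels and classes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold
```

For small targets this threshold is demanding: shifting a 10-pixel-wide box by 5 pixels leaves an IoU of only one third, which is below the cutoff.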

Detection Performance Comparison

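| Method    | Det-Fly mAP@50 (%) | ARD-MAV mAP@50 (%) | Params (M) | GFLOPs |
|-----------|--------------------|--------------------|------------|--------|
| YOLOv8-M  | 94.4               | 77.1               | 25.8       | 78.7   |
| RT-DETR   | 92.4               | 82.2               | 32.0       | 103.4  |
| MHAF-YOLO | 96.1               |                    | 21.8       | 103.7  |
| PHE-DETR  | 97.0               | 86.7               | 25.8       | 70.6   |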
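The relative savings against RT-DETR can be derived directly from the table. The snippet below is a small arithmetic check using only the RT-DETR and PHE-DETR rows; the dictionary keys and helper name are hypothetical.

```python
# Values copied from the RT-DETR and PHE-DETR rows of the comparison table.
RT_DETR = {"params_m": 32.0, "gflops": 103.4, "ard_mav_map50": 82.2}
PHE_DETR = {"params_m": 25.8, "gflops": 70.6, "ard_mav_map50": 86.7}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    params = relative_reduction(RT_DETR["params_m"], PHE_DETR["params_m"])
    flops = relative_reduction(RT_DETR["gflops"], PHE_DETR["gflops"])
    gain = PHE_DETR["ard_mav_map50"] - RT_DETR["ard_mav_map50"]
    print(f"Parameter reduction: {params:.1f}%")  # about 19.4%
    print(f"GFLOPs reduction: {flops:.1f}%")      # about 31.7%
    print(f"ARD-MAV mAP@50 gain: {gain:.1f} points")  # 4.5 points
```

On these figures the compute saving is close to one third, while the parameter saving is closer to one fifth; the ARD-MAV gain is an absolute difference in percentage points rather than a relative change.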

Ablation Studies

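The ablation below varies the ACFM-2 and ACFM-3 configurations and reports the resulting mAP@50 and the change in parameter count relative to the full model: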

| ACFM-2    | ACFM-3    | mAP@50 (%) | Δ Params |
|-----------|-----------|------------|----------|
|           |           | 95.8       | -1.8M    |
| ASFF only | DFGH only | 96.2       | -1.5M    |
| Full      | Full      | 97.0       | 0        |

The progressive fusion strategy reduces semantic gaps between non-adjacent layers by 63% compared to single-stage fusion. On an NVIDIA RTX 3090, PHE-DETR runs at 30.2 FPS at 640-pixel input resolution, 6.7% faster than RT-DETR, while maintaining precision for surveying-UAV identification.

Conclusion

PHE-DETR establishes a new state-of-the-art for surveying drone detection through innovative multi-scale feature management. The Progressive Hybrid Encoding architecture demonstrates:

  1. Roughly 19% fewer parameters (25.8M vs. 32.0M) and 32% fewer GFLOPs (70.6 vs. 103.4) than RT-DETR
  2. A 4.5-point mAP@50 improvement over RT-DETR on the challenging ARD-MAV dataset (86.7 vs. 82.2)
  3. Real-time performance (30.2 FPS) without accuracy compromise

Future work will explore temporal modeling for tracking surveying UAVs across consecutive frames and multi-spectral feature fusion for all-weather operation. The framework shows significant promise for deployment in border surveillance and critical infrastructure protection scenarios.
