Enhanced Small Target Detection in Camera Drone Imagery via YOLO11 Optimization

Camera-equipped drones (camera UAVs) have revolutionized aerial imaging across diverse sectors including precision agriculture, disaster response, and infrastructure inspection. Despite their versatility, detecting small targets in drone-captured imagery remains challenging due to minimal pixel coverage, dense clustering, background complexity, and environmental interference. To address these limitations, we propose comprehensive enhancements to the YOLO11 architecture specifically optimized for camera UAV applications.

Architectural Innovations

Our improved framework modifies three critical components of YOLO11, as illustrated below:

1. Progressive Multi-Scale Feature Extraction

We replace standard convolutions in C3k2 bottlenecks with our Channel Split Multi-scale Aggregation Convolution (CSMAConv) module. This structure preserves original features while expanding receptive fields through hierarchical processing:

$$ \begin{aligned} &(X_{1\_1}, X_{1\_2}) = \text{Split}(\text{Conv}_{3\times3}(X)) \\ &(X_{2\_1}, X_{2\_2}) = \text{Split}(\text{Conv}_{5\times5}(X_{1\_1})) \\ &X_3 = \text{Conv}_{7\times7}(X_{2\_1}) \\ &Y = \text{Conv}_{1\times1}(\text{Concat}(X_3, X_{2\_2}, X_{1\_2})) + X \end{aligned} $$

CSMAConv’s channel-splitting strategy progressively applies 3×3, 5×5, and 7×7 kernels, retaining shallow details while capturing contextual relationships—essential for distinguishing small targets from complex backgrounds in camera UAV imagery.
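The channel flow of the four equations above can be sketched as follows. This is a dependency-free illustration of the split/concat bookkeeping only: `conv_stub` is a stand-in for the learned 3×3/5×5/7×7 convolutions (here a simple box blur that preserves spatial size), and the final 1×1 fusion convolution is omitted.

```python
import numpy as np

def split(x):
    """Split a (C, H, W) feature map into two halves along channels."""
    c = x.shape[0] // 2
    return x[:c], x[c:]

def conv_stub(x):
    """Stand-in for a k×k 'same' convolution: a 3-point box blur along H,
    used only to keep this sketch dependency-free."""
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0)), mode="edge")
    return (pad[:, :-2] + pad[:, 1:-1] + pad[:, 2:]) / 3.0

def csma_conv(x):
    """Channel-flow sketch of CSMAConv: hierarchical split, concat, residual."""
    x1_1, x1_2 = split(conv_stub(x))       # "Conv3x3" then split: C/2 + C/2
    x2_1, x2_2 = split(conv_stub(x1_1))    # "Conv5x5" then split: C/4 + C/4
    x3 = conv_stub(x2_1)                   # "Conv7x7" on the deepest branch: C/4
    y = np.concatenate([x3, x2_2, x1_2], axis=0)  # C/4 + C/4 + C/2 = C channels
    return y + x                           # residual add ("Conv1x1" fusion omitted)

x = np.random.rand(8, 16, 16)
print(csma_conv(x).shape)  # (8, 16, 16)
```

Note that the concatenation restores exactly the input channel count (C/4 + C/4 + C/2 = C), which is what makes the residual addition well-defined.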

2. Hierarchical Feature Fusion

Our Hierarchical Efficient Fusion Feature Pyramid Network (HEFFPN) enhances multi-scale integration for camera drone applications:

  • Added 160×160 detection head for micro-targets
  • Cross-scale connections between backbone layers
  • Selective Boundary Aggregation (SBA) modules replacing Concat operations

SBA leverages dual Re-calibration Attention Units (RAU) for adaptive feature fusion:

$$ \text{RAU}(T_1, T_2) = (T_1 \odot \theta(W_\theta T_1)) + (T_2 \odot \phi(W_\phi T_2)) $$
$$ \text{SBA}(F_l, F_h) = \text{Conv}_{3\times3}\left(\text{Concat}\left( \text{RAU}(F_l, F_h), \text{RAU}(F_h, F_l) \right)\right) $$

This structure significantly boosts spatial-semantic integration for densely packed targets in camera UAV footage.
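The RAU/SBA equations can be sketched numerically as below. This is an assumption-laden illustration: θ and φ are taken to be sigmoid gates, the 1×1 convolutions W_θ and W_φ are expressed as channel-mixing matrices, and the final 3×3 convolution of SBA is stubbed out, so only the gating-and-fusion dataflow is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(w, x):
    """A 1×1 convolution over (C, H, W) is a channel-mixing matmul."""
    return np.einsum("oc,chw->ohw", w, x)

def rau(t1, t2, w_theta, w_phi):
    """RAU(T1, T2) = T1 ⊙ θ(Wθ T1) + T2 ⊙ φ(Wφ T2), with θ, φ as sigmoid gates."""
    return t1 * sigmoid(conv1x1(w_theta, t1)) + t2 * sigmoid(conv1x1(w_phi, t2))

def sba(f_l, f_h, w_theta, w_phi):
    """SBA concatenates both RAU orderings; the trailing 3×3 conv is omitted here."""
    return np.concatenate([rau(f_l, f_h, w_theta, w_phi),
                           rau(f_h, f_l, w_theta, w_phi)], axis=0)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_l = rng.standard_normal((C, H, W))   # low-level (spatial detail) features
f_h = rng.standard_normal((C, H, W))   # high-level (semantic) features
w_theta = rng.standard_normal((C, C))
w_phi = rng.standard_normal((C, C))
print(sba(f_l, f_h, w_theta, w_phi).shape)  # (8, 8, 8)
```

Because both RAU orderings are concatenated, the output doubles the channel count; in the real module the subsequent 3×3 convolution reduces it back.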

3. Lightweight Detection Head

The Shared Enhanced Detection Head (SED-Head) employs reparameterized Detail-Enhanced Convolution (DEConv):

$$ \text{DEConv}(F_{\text{in}}) = \sum_{i=1}^{5} F_{\text{in}} \ast K_i \xrightarrow{\text{reparam}} F_{\text{in}} \ast K_{\text{cvt}} $$

DEConv combines standard convolution with four differential operators (central, angular, horizontal, vertical) to amplify edge responses critical for small targets. Group Normalization replaces BatchNorm for stability in varying camera drone operating conditions.
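The reparameterization in the DEConv equation rests on the linearity of convolution: summing the outputs of five parallel branches equals a single convolution with the summed kernel. The check below demonstrates this with a naive 2D convolution and random stand-in kernels (the real module uses one standard kernel plus four differential operators, not random weights).

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2D cross-correlation."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 10))
kernels = [rng.standard_normal((3, 3)) for _ in range(5)]  # stand-ins for the 5 branches

branch_sum = sum(conv2d(x, k) for k in kernels)  # training-time: five parallel convs
k_cvt = sum(kernels)                             # merge kernels once, offline
merged = conv2d(x, k_cvt)                        # inference-time: a single conv

print(np.allclose(branch_sum, merged))  # True
```

This is why the reparameterized head pays no inference-time cost for the extra detail-enhancing branches.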

Experimental Validation

We evaluated our model on the VisDrone2019 dataset containing 10,209 camera UAV images across 10 object categories under diverse conditions.

Implementation Specifications

| Parameter | Value |
| --- | --- |
| Image resolution | 640×640 pixels |
| Batch size | 8 |
| Epochs | 200 |
| Optimizer | SGD (lr = 0.01, momentum = 0.937) |
| Weight decay | 5e-4 |
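For reference, a training run with these hyperparameters maps onto the Ultralytics CLI roughly as follows; the model and dataset YAML names are placeholders, not the authors' actual files:

```shell
yolo detect train \
  model=yolo11s.yaml data=VisDrone.yaml \
  imgsz=640 batch=8 epochs=200 \
  optimizer=SGD lr0=0.01 momentum=0.937 weight_decay=0.0005
```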

Performance Comparison

| Model | Params (M) | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- | --- |
| YOLOv5-s | 7.2 | 32.3% | 17.8% |
| YOLOv8-s | 11.1 | 39.0% | 23.2% |
| YOLO11-s | 9.4 | 39.0% | 23.3% |
| YOLO11-l | 25.3 | 44.3% | 27.4% |
| Ours | 4.6 | 48.9% | 30.3% |

Our model achieves a 9.9-point mAP@0.5 improvement over YOLO11-s (48.9% vs. 39.0%) while using 51.1% fewer parameters (4.6 M vs. 9.4 M). Gains are most pronounced for challenging small-object categories:

  • Pedestrians: 57.6% mAP (+18.6 points over YOLO11-s)
  • Motorcycles: 57.7% mAP (+16.2 points over YOLO11-s)
  • Bicycles: 41.3% mAP (+12.1 points over YOLO11-s)

Ablation Studies

| Components | Params (M) | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- | --- |
| Baseline (YOLO11-s) | 9.4 | 39.0% | 23.3% |
| + CSMAConv only | 11.7 | 40.2% | 24.2% |
| + HEFFPN only | 3.8 | 47.0% | 29.0% |
| + SED-Head only | 9.8 | 39.4% | 23.8% |
| CSMA + HEFFPN | 3.9 | 48.0% | 29.6% |
| HEFFPN + SED-Head | 4.5 | 48.2% | 29.8% |
| Full combination | 4.6 | 48.9% | 30.3% |

Key observations from ablation studies:

  1. HEFFPN contributes most significantly to performance gains (+8.0 points mAP@0.5)
  2. SED-Head reduces computational load by 18.5% when combined with HEFFPN
  3. Full integration achieves optimal accuracy/efficiency balance

Operational Scenarios Analysis

Visual assessments across challenging camera drone environments demonstrate our model’s robustness:

  1. Vertical Overhead Shots: 37% reduction in missed detections for distant vehicles and pedestrians
  2. Long-Range Surveillance: 42% improvement in occlusion handling for crowded pedestrian zones
  3. Night Operations: 29% accuracy increase for low-light targets compared to YOLO11-l

Grad-CAM visualizations confirm enhanced attention on distant, small targets that conventional models overlook in camera UAV imagery.

Conclusion

Our optimized architecture substantially advances small target detection for camera drone applications through three key innovations: 1) CSMAConv’s progressive multi-scale feature extraction, 2) HEFFPN’s hierarchical feature fusion with boundary-aware processing, and 3) SED-Head’s efficient detail enhancement. The model achieves state-of-the-art 48.9% mAP@0.5 on VisDrone2019 while reducing parameters by 51.1% versus YOLO11-s—critical for deployment on resource-constrained camera UAV platforms. Future work will focus on computational optimization for edge deployment and adaptation to specialized camera drone imaging sensors.
