DI-YOLO: Advancing Small Target Detection for Camera Drone Imagery

Aerial target detection faces critical challenges when processing images captured by camera drones: minuscule object dimensions (often 10-20 pixels), blurred textures due to atmospheric interference, and dense spatial distributions. Existing methods exhibit significant limitations in preserving structural information and effectively integrating multi-scale features. We present DI-YOLO, an enhanced YOLO architecture addressing these gaps through three innovations: a content-aware feature upsampler, parallel heterogeneous feature modulation, and shape-aware optimization.

The Content-Aware ReAssembly of FEatures (CARAFE) module dynamically adjusts upsampling strategies based on semantic content. Unlike fixed-weight interpolation, CARAFE generates location-specific kernels through a lightweight prediction network. Given input $X \in \mathbb{R}^{C\times H\times W}$, it first compresses channels via a grouped $1\times1$ convolution: $$X_c = \text{GroupConv}_{1\times1}(X)$$ A subsequent depthwise convolution extracts spatial features: $$X_d = \text{DWConv}_{k_{\text{encoder}}}(X_c)$$ The reassembly kernel is then predicted: $$W = \text{Softmax}(\text{PixelShuffle}(\text{Conv}_{1\times1}(X_d)))$$ Feature reconstruction occurs through content-adaptive weighting: $$Y(p) = \sum_{q\in\Omega_{k_{\text{up}}}(p)} W(p,q) \cdot X(q)$$ where $\Omega_{k_{\text{up}}}(p)$ is the $k_{\text{up}} \times k_{\text{up}}$ neighborhood around the source location of output pixel $p$. This preserves the critical edges and textures of small targets in camera UAV imagery.
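As a minimal PyTorch sketch of these equations: the hyperparameters `c_mid=64`, `k_encoder=3`, and `k_up=5` are illustrative defaults rather than confirmed DI-YOLO settings, and the channel compressor is written as a plain 1×1 convolution (the grouped variant would only change the `groups` argument).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware upsampler following the equations above (sketch)."""

    def __init__(self, channels, scale=2, c_mid=64, k_encoder=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Channel compressor (X_c); plain 1x1 conv stands in for GroupConv
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        # Depthwise content encoder (X_d)
        self.encode = nn.Conv2d(c_mid, c_mid, k_encoder,
                                padding=k_encoder // 2, groups=c_mid)
        # Predicts scale^2 * k_up^2 kernel weights per source pixel (W)
        self.predict = nn.Conv2d(c_mid, (scale * k_up) ** 2, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.predict(self.encode(self.compress(x)))     # (B, s^2*k^2, H, W)
        kernels = F.pixel_shuffle(kernels, self.scale)            # (B, k^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)                       # normalize each kernel
        # k_up x k_up neighborhood of every source pixel
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)  # (B, C*k^2, H*W)
        patches = patches.view(b, c, self.k_up ** 2, h, w)
        # Each output pixel p reuses the neighborhood of its source pixel
        patches = patches.repeat_interleave(self.scale, dim=3)
        patches = patches.repeat_interleave(self.scale, dim=4)
        # Y(p) = sum_q W(p, q) * X(q)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)        # (B, C, sH, sW)
```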

Our Parallel Heterogeneous Feature Modulator (PHFM) resolves global-local representation conflicts via triple-stream processing. The local-attention stream employs SoftPool for structural preservation: $$\text{SoftPool}(x) = \frac{\sum_{i\in\mathcal{R}} x_i e^{x_i}}{\sum_{i\in\mathcal{R}} e^{x_i}}$$ The global-attention stream captures long-range dependencies: $$\text{GlobalContext} = \text{MatMul}(X_{\text{reshape}}, A^{\top}_{\text{reshape}})$$ A gating function dynamically balances the responses: $$\text{PHFM}(X) = X \cdot \mathcal{F}(L(X), G(X)) \cdot T(X)$$ where $L$, $G$, and $T$ are the three streams and $\mathcal{F}$ is the gating fusion. This architecture enhances contextual awareness for clustered objects in drone photography while maintaining computational efficiency.
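The SoftPool operator can be written exactly from the formula above; the rest of PHFM is only partially specified by the equations, so in the sketch below the branch bodies (projection sizes, convolution choices, and the transform stream `T`) are plausible stand-ins rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    # SoftPool(x) = sum_i x_i * e^{x_i} / sum_i e^{x_i} over each window.
    # Subtracting the global max cancels in the ratio and avoids overflow.
    w = torch.exp(x - x.max())
    return (F.avg_pool2d(x * w, kernel_size, stride)
            / F.avg_pool2d(w, kernel_size, stride).clamp_min(1e-6))

class PHFMSketch(nn.Module):
    """Triple-stream modulator: local (SoftPool), global (matmul context),
    and a gated fusion, per PHFM(X) = X * F(L(X), G(X)) * T(X)."""

    def __init__(self, c):
        super().__init__()
        self.local = nn.Conv2d(c, c, 3, padding=1)   # L: local branch
        self.q = nn.Conv2d(c, c, 1)                  # G: global projections
        self.k = nn.Conv2d(c, c, 1)
        self.gate = nn.Conv2d(2 * c, c, 1)           # F: gating fusion
        self.trans = nn.Conv2d(c, c, 1)              # T: third stream

    def forward(self, x):
        b, c, h, w = x.shape
        # Local attention: SoftPool preserves small structures, then restore size
        l = F.interpolate(self.local(soft_pool2d(x)), size=(h, w), mode="nearest")
        # Global attention: long-range context via matrix multiplication
        q = self.q(x).flatten(2)                      # (B, C, HW)
        a = F.softmax(self.k(x).flatten(2), dim=-1)   # attention over positions
        ctx = torch.matmul(q, a.transpose(1, 2))      # (B, C, C) channel context
        g = torch.matmul(ctx, x.flatten(2)).view(b, c, h, w) / (h * w)
        # Gate balances the two responses, then modulates input and T(X)
        f = torch.sigmoid(self.gate(torch.cat([l, g], dim=1)))
        return x * f * torch.sigmoid(self.trans(x))
```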

Shape-IoU loss refines bounding box regression by incorporating geometric penalties: $$\text{Shape-IoU} = \text{IoU} - \mathcal{D} - 0.5 \cdot \mathcal{S}$$ The center-distance penalty $\mathcal{D}$ adapts to aspect ratios: $$\mathcal{D} = \frac{h_h \cdot d_x^2 + w_w \cdot d_y^2}{c^2}$$ where $d_x, d_y$ are the center offsets between the predicted and ground-truth boxes, $c$ is the diagonal of their smallest enclosing box, and the weights $h_h, w_w$ are derived from the ground-truth box's aspect ratio. The shape cost $\mathcal{S}$ quantifies dimensional discrepancies: $$\mathcal{S} = (1 - e^{-\omega_w})^4 + (1 - e^{-\omega_h})^4$$ $$\omega_w = h_h \cdot \frac{|w_1 - w_2|}{\max(w_1, w_2)}, \quad \omega_h = w_w \cdot \frac{|h_1 - h_2|}{\max(h_1, h_2)}$$ This significantly improves localization accuracy for irregular objects in camera UAV datasets.
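A self-contained sketch of the loss form $1 - \text{Shape-IoU}$, with boxes given as `(cx, cy, w, h)`. The weights $h_h, w_w$ here follow the published Shape-IoU definition (ground-truth dimensions raised to a dataset-dependent `scale` exponent); the default `scale=0.0`, which reduces both weights to 1, is an assumption.

```python
import torch

def shape_iou_loss(pred, target, scale=0.0, eps=1e-7):
    """pred, target: (N, 4) boxes as (cx, cy, w, h). Returns per-box loss."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)

    # Plain IoU
    iw = (torch.min(px + pw / 2, tx + tw / 2)
          - torch.max(px - pw / 2, tx - tw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, ty + th / 2)
          - torch.max(py - ph / 2, ty - th / 2)).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Aspect-ratio weights h_h, w_w from the ground-truth box
    hh = 2 * th.pow(scale) / (tw.pow(scale) + th.pow(scale) + eps)
    ww = 2 * tw.pow(scale) / (tw.pow(scale) + th.pow(scale) + eps)

    # c^2: squared diagonal of the smallest enclosing box
    cw = torch.max(px + pw / 2, tx + tw / 2) - torch.min(px - pw / 2, tx - tw / 2)
    ch = torch.max(py + ph / 2, ty + th / 2) - torch.min(py - ph / 2, ty - th / 2)
    c2 = cw ** 2 + ch ** 2 + eps

    # D: shape-weighted center-distance penalty
    dist = (hh * (px - tx) ** 2 + ww * (py - ty) ** 2) / c2

    # S: shape cost from width/height discrepancies
    omega_w = hh * (pw - tw).abs() / torch.max(pw, tw).clamp_min(eps)
    omega_h = ww * (ph - th).abs() / torch.max(ph, th).clamp_min(eps)
    shape_cost = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + dist + 0.5 * shape_cost
```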

Ablation results on VisDrone2019, with components added cumulatively to the YOLOv10 baseline (Δ is relative to the baseline):

| Component | mAP@0.5 | mAP@0.5:0.95 | ΔmAP@0.5 (rel.) |
|---|---|---|---|
| Baseline (YOLOv10) | 0.418 | 0.254 | |
| + CARAFE | 0.424 | 0.259 | +1.4% |
| + PHFM | 0.432 | 0.264 | +3.3% |
| + Shape-IoU | 0.439 | 0.268 | +5.0% |
| + P2 Detection Head | 0.471 | 0.290 | +12.7% |

Comprehensive evaluation used the VisDrone2019 (7,019 images, 10 classes) and DOTAv1.5 (400k+ instances, 16 classes) benchmarks. DI-YOLO’s architecture integrates these components into a quad-scale detection framework with modified neck connections and an additional P2 head for ultra-small targets. Training employed progressive augmentation reduction over 250 epochs on NVIDIA A100 hardware.
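The value of the quad-scale layout is easy to see from grid granularity. A quick illustration for a 640×640 input (the resolution and the standard YOLO strides of 4/8/16/32 are assumptions here, not confirmed DI-YOLO settings):

```python
# Grid coverage per detection scale for a 640x640 drone image.
# P2 (stride 4) is the extra head: a 10-20 px target gets several
# cells of support there, versus under one cell at P5.
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    cells = 640 // stride
    print(f"{name}: stride {stride:2d} -> {cells}x{cells} grid; "
          f"a 16 px object spans {16 / stride:.1f} cells per side")
```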

| Model | Params (M) | VisDrone2019 mAP@0.5 | DOTAv1.5 mAP@0.5 | FPS |
|---|---|---|---|---|
| Faster R-CNN | 41.39 | 0.329 | 0.310 | 22.6 |
| RT-DETRv2 | 42.42 | 0.460 | 0.421 | 88.4 |
| YOLOv12 | 19.62 | 0.435 | 0.393 | 223.4 |
| YOLOv10 | 16.50 | 0.418 | 0.381 | 181.3 |
| DI-YOLO | 26.14 | 0.471 | 0.427 | 113.4 |

DI-YOLO achieves the highest accuracy among the compared detectors. On VisDrone2019, it surpasses YOLOv10 by 12.7% in mAP@0.5 (0.471 vs. 0.418) and 14.2% in mAP@0.5:0.95 (0.290 vs. 0.254). DOTAv1.5 evaluations show 12.1% and 10.2% respective improvements. The model maintains real-time capability at 113.4 FPS, which is crucial for aerial surveillance applications. Qualitative analysis confirms superior detection of pedestrians, bicycles, and clustered vehicles in complex camera drone scenarios.

Visualization studies demonstrate PHFM’s activation enhancement: output feature maps show 3.2× higher response at small-target locations than the baseline. CARAFE’s adaptive upsampling preserves 37% more edge detail than nearest-neighbor interpolation. These advancements establish DI-YOLO as an effective solution for camera UAV applications requiring high-precision small object detection.

Future work will optimize computational efficiency for edge deployment and extend the framework to oriented bounding boxes. Integration with multi-modal camera drone data (thermal/LiDAR) presents promising research directions. The model’s adaptability suggests potential applications in agricultural monitoring and infrastructure inspection using commercial camera UAV platforms.
