Enhanced Small Target Detection for Non-Motorized Vehicles in Dense Aerial Scenes

Urban traffic monitoring faces unprecedented challenges with the exponential growth of non-motorized vehicles. Traditional surveillance systems struggle with limited coverage areas and occlusion issues, making camera drones indispensable for comprehensive traffic management. This research addresses the critical challenge of detecting small, densely clustered non-motorized targets in aerial imagery captured by camera UAVs, where targets often occupy less than 0.055% of total pixels and exhibit blurred contours. Existing detection frameworks suffer significant performance degradation under these conditions due to three primary limitations: insufficient high-frequency feature extraction, class imbalance during training, and inadequate representation of small targets in datasets.

We present LE-YOLOX, an optimized detection framework that systematically addresses these challenges through three key innovations. First, our Laplace-Enhanced Multi-Scale Attention (LE-MSA) mechanism explicitly extracts the high-frequency features critical for small target identification. Traditional attention mechanisms such as CBAM and SE rely on global pooling operations that act as low-pass filters, suppressing high-frequency components in Fourier space; LE-MSA instead injects the discrete Laplace operator:

$$ \nabla^2 f(i,j) = f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1) - 4f(i,j) $$

This discrete Laplace operator acts as a high-pass filter: its Fourier-domain response is proportional to \( -(u^2 + v^2) \), whose magnitude grows with spatial frequency, so fine details are amplified rather than averaged away. Our LE-MSA integrates this operator with dual attention pathways, as shown in Table 1.

Table 1. LE-MSA dual attention pathways.

| Component | Operation | Feature Enhancement |
| --- | --- | --- |
| Channel Attention | \( \mathrm{Concat}(AP_{orig}, MP_{orig}, AP_{HF}, MP_{HF}) \) | High-frequency feature preservation |
| Spatial Attention | \( f'' = f + f' \rightarrow \mathrm{Concat}(AP_{f''}, MP_{f''}) \) | Fused spatial detail enhancement |
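To make the operator concrete, the PyTorch sketch below applies the 3×3 Laplace kernel depthwise to a feature map and assembles the pooled channel-attention descriptor from Table 1. Function names and the exact pooling arrangement are our illustrative assumptions; the paper does not publish this code.

```python
import torch
import torch.nn.functional as F

# 3x3 kernel realizing the discrete Laplace operator above:
# f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1) - 4 f(i,j)
LAPLACE_KERNEL = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]])

def laplace_high_freq(feat: torch.Tensor) -> torch.Tensor:
    """Depthwise high-pass filtering of a (B, C, H, W) feature map."""
    c = feat.shape[1]
    kernel = LAPLACE_KERNEL.to(feat).expand(c, 1, 3, 3)
    return F.conv2d(feat, kernel, padding=1, groups=c)

def channel_attention_descriptor(feat: torch.Tensor) -> torch.Tensor:
    """Concat(AP_orig, MP_orig, AP_HF, MP_HF) from Table 1 -> (B, 4C, 1, 1)."""
    hf = laplace_high_freq(feat)                # high-frequency branch
    pooled = [F.adaptive_avg_pool2d(feat, 1),   # AP_orig
              F.adaptive_max_pool2d(feat, 1),   # MP_orig
              F.adaptive_avg_pool2d(hf, 1),     # AP_HF
              F.adaptive_max_pool2d(hf, 1)]     # MP_HF
    return torch.cat(pooled, dim=1)
```

A small learned projection over this concatenated descriptor would then yield the per-channel weights; we omit it since its width is not specified in the text.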

Second, we introduce a Composite Loss Function (CLF) combining VarifocalLoss and BCEWithLogitsLoss to handle extreme class imbalance:

$$ L_{CLF} = \alpha \left[ -q\log(p) - (1-q)p^\gamma \log(1-p) \right] + \beta \left[ -\frac{1}{N}\sum_{i=1}^{N} \left( y_i \log(\sigma(x_i)) + (1-y_i)\log(1-\sigma(x_i)) \right) \right] $$

where \( \alpha \) and \( \beta \) are balancing coefficients, \( \gamma \) modulates focus on hard samples, \( \sigma \) denotes the sigmoid function, and \( q \) is the continuous target score from VarifocalLoss (the prediction-to-ground-truth IoU for positives, 0 for negatives). This dual-loss approach dynamically adjusts sample weighting during training.
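A minimal sketch of this composite loss in PyTorch follows; the default \( \alpha \), \( \beta \), \( \gamma \) values and the mean reduction over predictions are assumptions, as the paper leaves them unspecified here.

```python
import torch
import torch.nn.functional as F

def clf_loss(logits: torch.Tensor, q: torch.Tensor,
             alpha: float = 1.0, beta: float = 1.0,
             gamma: float = 2.0) -> torch.Tensor:
    """Composite Loss Function: Varifocal-style term + BCEWithLogits term.

    logits: raw scores x_i; q: target scores in [0, 1]
    (q > 0 for positives, q = 0 for negatives).
    alpha, beta, gamma are assumed defaults, not values from the paper.
    """
    p = torch.sigmoid(logits).clamp(1e-6, 1.0 - 1e-6)
    # Varifocal term: positives weighted by target quality q; negative
    # contributions down-weighted by p**gamma to focus on hard samples.
    vfl = -(q * torch.log(p) + (1.0 - q) * p.pow(gamma) * torch.log(1.0 - p))
    # BCEWithLogits term with hard labels y_i = 1[q > 0].
    y = (q > 0).float()
    bce = F.binary_cross_entropy_with_logits(logits, y, reduction="none")
    # Mean reduction over all predictions (assumed).
    return (alpha * vfl + beta * bce).mean()
```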

Third, our multi-strategy augmentation pipeline specifically enhances small target representation. Adaptive Small Target Augmentation duplicates non-motorized vehicles satisfying \( \frac{Area_{target}}{Area_{image}} < 0.00055 \), pasting copies only at locations that pass IoU-based overlap filtering, while Region Augmentation boosts saturation and brightness in low-attention zones identified through activation heatmaps from Darknet-53’s final layer.
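The sketch below illustrates the area-ratio test and the IoU-based placement filter on a NumPy image; the rejection threshold (0.05), retry budget, integer-pixel box format, and helper names are all illustrative assumptions.

```python
import random

AREA_RATIO_THRESH = 0.00055  # duplicate targets below ~0.055% of image pixels

def iou(a, b):
    """IoU of two integer-pixel boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def augment_small_targets(image, boxes, max_tries=20, iou_thresh=0.05):
    """Paste copies of qualifying small targets into non-overlapping spots.

    image: HxWx3 numpy array; boxes: list of [x1, y1, x2, y2] annotations.
    """
    h, w = image.shape[:2]
    out_boxes = list(boxes)
    for x1, y1, x2, y2 in boxes:
        bw, bh = x2 - x1, y2 - y1
        if (bw * bh) / (w * h) >= AREA_RATIO_THRESH:
            continue  # only small non-motorized targets are duplicated
        patch = image[y1:y2, x1:x2].copy()
        for _ in range(max_tries):
            nx, ny = random.randint(0, w - bw), random.randint(0, h - bh)
            cand = [nx, ny, nx + bw, ny + bh]
            # IoU-based filtering: reject placements overlapping existing boxes
            if all(iou(cand, b) < iou_thresh for b in out_boxes):
                image[ny:ny + bh, nx:nx + bw] = patch
                out_boxes.append(cand)
                break
    return image, out_boxes
```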

Comprehensive evaluations were conducted using imagery captured by camera UAVs at 50-80m altitudes over complex urban intersections. Table 2 compares LE-YOLOX against state-of-the-art detectors across 2,021 annotated images containing six vehicle classes under dense traffic conditions.

Table 2. Comparison with state-of-the-art detectors (metrics in %, parameters in millions).

| Model | mAP@0.5 | Precision | Recall | Params (M) |
| --- | --- | --- | --- | --- |
| Faster R-CNN | 71.30 | 61.55 | 79.95 | 137.00 |
| YOLOv5 | 88.15 | 93.02 | 89.12 | 7.03 |
| YOLOv8 | 87.63 | 93.48 | 88.92 | 11.13 |
| YOLOv10 | 86.10 | 89.28 | 86.87 | 8.04 |
| YOLOX | 87.82 | 95.62 | 89.18 | 8.95 |
| LE-YOLOX | 90.78 | 94.64 | 91.38 | 10.33 |

Our ablation studies quantified individual contributions: LE-MSA alone boosted mAP@0.5 by 2.73% over baseline YOLOX, while CLF and augmentation strategies contributed 1.98% and 0.68% gains respectively. The integrated framework achieved 5-30% confidence score improvements for non-motorized vehicles in occlusion-heavy scenarios. Crucially, LE-YOLOX maintained real-time performance at 41.84 FPS on NVIDIA GTX 1650 hardware, demonstrating practical viability for camera drone deployment.

Visual comparisons in dense traffic scenarios reveal LE-YOLOX’s superior performance. Where baseline detectors exhibited 15-22% miss rates on clustered bicycles and tricycles, our approach correctly separated adjacent instances while suppressing false positives in complex backgrounds. The camera UAV perspective particularly benefits from our high-frequency enhancement, as atmospheric haze and motion blur disproportionately affect small targets.

This research establishes a robust framework for non-motorized vehicle detection in challenging aerial environments. Future work will extend evaluation to adverse conditions like nighttime and precipitation, where camera drone imagery presents additional noise challenges. The integration of temporal tracking mechanisms could further enhance counting accuracy during peak traffic flows observed in urban centers.
