Unmanned Aerial Vehicle-Based YOLOv8 Enhancement for Electric Rider Helmet Detection via Intelligent Communication

1. Introduction
The proliferation of electric vehicles (EVs) as a primary mode of short-distance transportation underscores their environmental and economic benefits. However, the critical safety issue of riders neglecting helmet use persists, contributing significantly to traffic injury and fatality rates. Traditional surveillance methods, reliant on manual patrols or fixed cameras, suffer from prohibitive costs, low efficiency in congested areas, and inadequate real-time performance. The integration of Unmanned Aerial Vehicle (UAV) platforms with intelligent communication technologies offers a transformative solution. Unmanned Aerial Vehicle systems provide unparalleled mobility and high-resolution aerial perspectives, while intelligent communication enables real-time, low-latency transmission of captured video streams to ground processing units. This synergy facilitates immediate analysis using advanced deep learning algorithms, addressing the dynamic nature of traffic scenarios. Our work leverages this UAV-intelligent communication framework, proposing significant enhancements to the YOLOv8 object detector specifically optimized for real-time helmet detection on EV riders from aerial footage, tackling challenges like small target sizes (due to altitude/viewpoint shifts), complex backgrounds, and high-resolution image processing.

2. Algorithmic Foundation: YOLOv8
YOLOv8, a state-of-the-art single-stage detector, forms the basis of our approach due to its inherent efficiency and accuracy balance. Its architecture comprises four key components:

  • Input: Employs Mosaic data augmentation and adaptive image scaling (to 640×640) for robust training; the adaptive scaling mitigates the mismatch issues that arise with widely varying target sizes.
  • Backbone (CSPNet++): Extracts multi-scale features using convolutional blocks, C2f modules, and SPPF (Spatial Pyramid Pooling Fast). The C2f structure enhances feature representation through bottleneck modules, while SPPF expands the receptive field to capture both global (rider) and local (helmet) features.
  • Neck (FPN+PAN): Combines Feature Pyramid Network (FPN) for top-down semantic feature enrichment and Path Aggregation Network (PAN) for bottom-up positional information enhancement. Concatenation operations fuse these pathways, bolstering multi-scale detection capability crucial for diverse rider/helmet sizes.
  • Head: Utilizes an anchor-free mechanism, directly predicting target center points and bounding box dimensions. This simplifies detection, reduces computational overhead (beneficial for UAV-derived data), and improves adaptability to scale variation. The optimized CIoU loss function enhances boundary regression.

The efficiency and real-time performance of YOLOv8 make it particularly suitable for integration within the Unmanned Aerial Vehicle pipeline, where rapid processing of transmitted video streams is paramount.
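
For orientation, the snippet below is a minimal sketch of running an off-the-shelf YOLOv8s model on a transmitted video stream via the Ultralytics Python API (assumed installed via pip); the source path and confidence threshold are illustrative placeholders, not values prescribed by this work.

```python
# Minimal inference sketch with the Ultralytics YOLOv8 API.
# "uav_stream.mp4" and conf=0.25 are illustrative placeholders.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # pretrained YOLOv8-small checkpoint

# Process frames at the 640x640 input size described above; stream=True yields results frame by frame.
for result in model.predict(source="uav_stream.mp4", imgsz=640, conf=0.25, stream=True):
    # result.boxes holds per-frame detections: xyxy coordinates, confidence scores, and class indices.
    print(result.boxes.xyxy, result.boxes.conf, result.boxes.cls)
```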

3. Proposed Enhancements for UAV Helmet Detection
Our methodology introduces three core innovations to the standard YOLOv8 architecture, specifically targeting the limitations encountered in Unmanned Aerial Vehicle-based helmet detection: handling small targets under viewpoint changes, precise feature reconstruction in high-resolution images, and robust bounding box regression in complex scenes.

3.1 Outlook-C2f: Enhanced Feature Focus
Standard attention mechanisms like CA or CBAM exhibit limited global modeling capacity, hindering performance on small targets and complex scenes prevalent in UAV footage. We integrate the Outlook attention mechanism into the backbone’s C2f module, replacing the standard bottleneck with a novel Bottleneck_OA to form the Outlook-C2f structure. Outlook excels by:

  1. Local Window Attention: Computes attention weights within local windows, efficiently aggregating global context layer-by-layer.
  2. High-Resolution Handling: Efficiently processes high-resolution UAV images by operating within windows.
  3. Robust Feature Discrimination: Combines local details with global context, improving target-background separation in cluttered traffic scenes.

The mechanism operates as follows:

  1. Linear projections generate attention weights \(A\) and value features \(V\) from the input feature map \(X \in \mathbb{R}^{H \times W \times C}\): \(A = XW_A,\ V = XW_V\), where \(W_A \in \mathbb{R}^{C \times K^4}\) and \(W_V \in \mathbb{R}^{C \times C}\).
  2. For each spatial location \((i, j)\), extract the local window \(V_{i,j} \in \mathbb{R}^{K \times K \times C}\) centered at \((i, j)\). Apply softmax normalization to the corresponding attention weights \(A_{i,j}\) and compute the weighted sum within the window: \(Y_{i,j} = \mathrm{SoftMax}(A_{i,j}) \cdot V_{i,j}\).
  3. Reconstruct the global output feature map \(Y\) by aggregating the local window outputs: \(Y = \sum_{i,j} Y_{i,j}\) (conceptually, via unfolding and re-folding).
  4. Apply layer normalization and a residual connection to produce the final enhanced features: \(X' = \mathrm{LayerNorm}(Y) + X\).

This integration within Bottleneck_OA significantly boosts the model’s ability to discern critical helmet features, especially small or partially occluded ones, without imposing substantial computational burdens, making it ideal for the Unmanned Aerial Vehicle application context.
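
For concreteness, the following is a minimal single-head PyTorch sketch of the Outlook attention computation in steps 1–4. It is an illustrative reimplementation with our own module and parameter names, not the authors' Bottleneck_OA code, and it omits multi-head and stride handling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head Outlook attention sketch (stride 1), following steps 1-4 above."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.v_proj = nn.Linear(dim, dim)                   # value projection W_V
        self.a_proj = nn.Linear(dim, kernel_size ** 4)      # attention projection W_A: K^2 x K^2 weights per location
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                    # x: (B, H, W, C), channels-last for the linear layers
        B, H, W, C = x.shape
        K = self.k
        v = self.v_proj(x).permute(0, 3, 1, 2)               # (B, C, H, W)
        # Step 2: gather the K*K local window of values around every spatial location.
        v = self.unfold(v).reshape(B, C, K * K, H * W).permute(0, 3, 2, 1)   # (B, HW, K*K, C)
        # Step 1 + softmax normalization: predict per-location attention weights.
        a = self.a_proj(x).reshape(B, H * W, K * K, K * K).softmax(dim=-1)   # (B, HW, K*K, K*K)
        # Weighted aggregation inside each window.
        y = a @ v                                            # (B, HW, K*K, C)
        # Step 3: fold the window outputs back into a global H x W feature map.
        y = y.permute(0, 3, 2, 1).reshape(B, C * K * K, H * W)
        y = F.fold(y, output_size=(H, W), kernel_size=K, padding=K // 2)     # (B, C, H, W)
        y = self.proj(y.permute(0, 2, 3, 1))                 # back to (B, H, W, C)
        return self.norm(y) + x                              # Step 4: LayerNorm + residual
```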

3.2 CARAFE: Content-Aware Feature Upsampling
Traditional upsampling methods (e.g., nearest-neighbor, bilinear interpolation) lack awareness of image content, leading to blurred features and suboptimal fusion, particularly detrimental for reconstructing small helmets in high-resolution UAV images. We replace standard upsampling in the FPN with CARAFE (Content-Aware ReAssembly of FEatures), a dynamic operator that generates adaptive kernels based on local content for superior feature reconstruction. CARAFE operates in three stages:

  1. Kernel Prediction: A lightweight convolutional layer (e.g., 3×3) processes the input feature map \(X\) to predict a normalized, content-aware kernel \(W\) for each target location: \(W = \mathrm{SoftMax}(\mathrm{Conv}(X))\).
  2. Feature Reassembly: For each target location \((i, j)\), the corresponding predicted kernel \(W_{i,j}\) is applied to the relevant local region \(U_{i,j}\) of the input feature map (obtained via unfolding) to compute the upsampled value: \(Y_{i,j} = \sum_{(m,n) \in \Omega} U_{m,n} \cdot W_{m,n}\), where \(\Omega\) denotes the kernel window neighborhood.
  3. Channel-to-Space Transformation: A PixelShuffle operation rearranges the channel dimension of the reassembled feature map \(Y\) to produce the final higher-resolution output: \(Y' = \mathrm{PixelShuffle}(Y)\).

By dynamically generating reconstruction weights based on semantic content, CARAFE preserves fine details crucial for small helmet detection during the feature fusion process within the FPN. Its lightweight kernel prediction ensures computational efficiency aligns well with the real-time demands of Unmanned Aerial Vehicle-based systems.
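
As an illustration, here is a compact PyTorch sketch of a CARAFE-style upsampler following the three stages above. In this sketch the channel-to-space rearrangement (PixelShuffle) is applied to the predicted kernels so that one `k_up` × `k_up` kernel is obtained per output location, a common arrangement that may differ in detail from the exact variant used in this work; the layer sizes (`compressed`, `k_up`, `k_enc`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware upsampling sketch (illustrative, not the original implementation).
    `scale` is the upsampling ratio, `k_up` the reassembly kernel, `k_enc` the kernel-prediction conv size."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, compressed=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, compressed, 1)                       # lightweight channel compressor
        self.encoder = nn.Conv2d(compressed, scale * scale * k_up * k_up,
                                 k_enc, padding=k_enc // 2)                      # Stage 1: kernel prediction
        self.unfold = nn.Unfold(k_up, padding=k_up // 2)

    def forward(self, x):                                                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Stage 1 (+ channel-to-space): one softmax-normalized k_up*k_up kernel per output location.
        w = self.encoder(self.compress(x))                                       # (B, s^2*k^2, H, W)
        w = F.pixel_shuffle(w, self.scale)                                       # (B, k^2, sH, sW)
        w = F.softmax(w, dim=1)
        # Stage 2: reassemble each output pixel from the k_up*k_up neighborhood of its source location.
        v = self.unfold(x).reshape(B, C, self.k_up ** 2, H, W)                   # (B, C, k^2, H, W)
        v = F.interpolate(v.reshape(B, -1, H, W), scale_factor=self.scale,
                          mode="nearest").reshape(B, C, self.k_up ** 2,
                                                  self.scale * H, self.scale * W)
        # Weighted sum over the window yields the upsampled feature map.
        return (v * w.unsqueeze(1)).sum(dim=2)                                   # (B, C, sH, sW)
```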

3.3 WIoU: Dynamic Bounding Box Regression
While CIoU improves upon standard IoU loss, its static formulation struggles with low-quality anchors (common with small or ambiguous UAV targets) and fails to effectively handle cases where aspect ratio differences are insignificant. To enhance localization precision, we adopt WIoUv1 (Wise-IoU) as the bounding box regression loss. WIoU introduces a dynamic non-monotonic focusing mechanism:

  1. Minimum Enclosing Box: Determine the smallest enclosing rectangle covering both the predicted box \((x, y, w, h)\) and the ground-truth box \((x_g, y_g, w_g, h_g)\). Its width \(W_g\) and height \(H_g\) are: \(W_g = \max(x + w/2,\ x_g + w_g/2) - \min(x - w/2,\ x_g - w_g/2)\) and \(H_g = \max(y + h/2,\ y_g + h_g/2) - \min(y - h/2,\ y_g - h_g/2)\).
  2. Distance Attention Weight: Define a weighting factor \(R_{\mathrm{WIoU}}\) based on the distance between box centers, normalized by the size of the minimum enclosing box: \(R_{\mathrm{WIoU}} = \exp\!\left(\frac{(x - x_g)^2 + (y - y_g)^2}{(W_g^2 + H_g^2)^{*}}\right)\), where the superscript \(*\) denotes detaching \(W_g\) and \(H_g\) from the computation graph to prevent harmful gradients; \(R_{\mathrm{WIoU}} \in [1, e]\).
  3. WIoU Loss: The final loss combines the standard IoU loss \(L_{\mathrm{IoU}}\) with the dynamic weight: \(L_{\mathrm{WIoU}} = R_{\mathrm{WIoU}} \cdot L_{\mathrm{IoU}}\), where \(L_{\mathrm{IoU}} = 1 - \mathrm{IoU} \in [0, 1]\).

This mechanism dynamically adjusts the loss contribution:

  • Low-Quality Anchors: \(R_{\mathrm{WIoU}}\) is large, amplifying \(L_{\mathrm{IoU}}\) and forcing the model to focus on improving these difficult predictions.
  • High-Quality Anchors: \(R_{\mathrm{WIoU}}\) is close to 1, so the loss reduces to near the standard \(L_{\mathrm{IoU}}\), avoiding unnecessary penalties and stabilizing training.
  • Center Distance Focus: When boxes overlap well (\(L_{\mathrm{IoU}} \approx 0\)), \(R_{\mathrm{WIoU}}\) primarily reflects the residual center-distance error.

WIoU significantly improves the model’s ability to precisely locate helmets, especially small or partially obscured ones captured by the Unmanned Aerial Vehicle, leading to higher mAP without compromising speed.
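
The following is a minimal PyTorch sketch of the WIoUv1 loss as defined above, assuming boxes in center-width-height format; it is an illustrative implementation, not the training code used in this work.

```python
import torch

def wiou_v1_loss(pred, target, eps=1e-7):
    """WIoU v1 sketch for axis-aligned boxes given as (cx, cy, w, h) tensors of shape (N, 4)."""
    px1, py1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    px2, py2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    gx1, gy1 = target[:, 0] - target[:, 2] / 2, target[:, 1] - target[:, 3] / 2
    gx2, gy2 = target[:, 0] + target[:, 2] / 2, target[:, 1] + target[:, 3] / 2

    # Standard IoU loss: L_IoU = 1 - IoU.
    inter = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0) * \
            (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter
    l_iou = 1 - inter / (union + eps)

    # Smallest enclosing box (W_g, H_g); detached so it contributes no harmful gradients.
    wg = torch.max(px2, gx2) - torch.min(px1, gx1)
    hg = torch.max(py2, gy2) - torch.min(py1, gy1)
    # Distance attention weight R_WIoU in [1, e].
    dist2 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    r_wiou = torch.exp(dist2 / (wg ** 2 + hg ** 2 + eps).detach())

    return (r_wiou * l_iou).mean()
```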

4. Experimental Setup
4.1 Implementation Details

  • Hardware: Intel Xeon E5-2680v4 CPU, NVIDIA RTX A4000 (16GB) GPU, 30GB RAM.
  • Software: Windows 10 OS, Python 3.10, Anaconda, PyTorch 2.0.1.
  • Training: Initialized with YOLOv8s weights (Backbone/Head transfer). Adam optimizer, initial learning rate 1e-3 with cosine annealing. Key training parameters:

| Training Parameter | Value |
| :----------------- | :---- |
| Batch Size         | 32    |
| Epochs             | 100   |
| Momentum           | 0.937 |
| Weight Decay       | 5e-4  |

  • Rationale: Batch size 32 balances memory constraints and training stability/performance. 100 epochs ensure sufficient learning for the dataset size. Momentum 0.937 optimizes convergence speed/stability. Weight decay 5e-4 provides effective L2 regularization.
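
The settings above map onto the Ultralytics training API roughly as follows; the argument names follow that API, and `helmet_uav.yaml` is an illustrative dataset configuration path (a sketch of its contents appears in Section 4.2 below).

```python
# Sketch: the training setup above expressed with the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # transfer from pretrained YOLOv8s weights
model.train(
    data="helmet_uav.yaml",  # illustrative dataset config path
    imgsz=640,
    epochs=100,
    batch=32,
    optimizer="Adam",
    lr0=1e-3,          # initial learning rate
    cos_lr=True,       # cosine annealing schedule
    momentum=0.937,
    weight_decay=5e-4,
)
```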

4.2 Dataset Construction
The lack of a suitable UAV-specific helmet dataset (e.g., VisDrone2021 lacks helmet annotations) necessitated constructing a dedicated real-world dataset:

  • Capture: Using UAVs, footage was collected around intersections in Guilin, China (Guomao Rd, Huacheng, Convention Center Plaza). Focus: EV lanes.
  • Variability: Captured across diverse conditions:
    • Time: Morning peak (40%), Noon (20%), Evening peak (40%).
    • Weather: Sunny (70%), Cloudy (20%), Rainy (10%).
    • Angles/Resolution: Multiple UAV angles, high-resolution emphasis.
  • Annotation: Manual labeling using labelImg. Two classes: Helmet_Worn and No_Helmet.
  • Splitting & Balancing:

| Dataset | Images | Helmet_Worn | No_Helmet | Split Ratio |
| :------ | :----- | :---------- | :-------- | :---------- |
| Train   | 4230   | 60%         | 40%       | 8           |
| Val     | 591    | 58%         | 42%       | 1           |
| Test    | 592    | 59%         | 41%       | 1           |

  • Augmentation: Flip, crop, brightness adjustment, Mosaic augmentation applied to training set to enhance robustness against variations expected in Unmanned Aerial Vehicle operations.
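
To tie the class definitions and splits above to the training sketch in Section 4.1, below is a hedged example of a YOLO-format dataset configuration for the two classes; all paths and directory names are illustrative placeholders.

```python
# Sketch: write a YOLO-format dataset config for the two annotation classes.
# All paths are illustrative, not the actual dataset layout.
from pathlib import Path

DATA_YAML = """\
path: datasets/helmet_uav   # dataset root (illustrative)
train: images/train         # 4230 training images
val: images/val             # 591 validation images
test: images/test           # 592 test images

names:
  0: Helmet_Worn
  1: No_Helmet
"""

Path("helmet_uav.yaml").write_text(DATA_YAML)
```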

4.3 Evaluation Metrics
Performance is rigorously assessed using:

  • FPS (Frames Per Second): Measures real-time processing capability. Higher FPS indicates faster detection, critical for live Unmanned Aerial Vehicle video analysis.
  • mAP (mean Average Precision): Primary accuracy metric. Computes the area under the Precision-Recall (P-R) curve averaged over all classes.
    • Precision: \(P = \frac{N_{TP}}{N_{TP} + N_{FP}}\)
    • Recall: \(R = \frac{N_{TP}}{N_{TP} + N_{FN}}\)
    • Average Precision (AP) for a class, approximated as the discrete area under the P-R curve: \(\eta_{AP} = \sum_{k=1}^{N} P(k)\,\Delta r(k)\)
    • mAP: \(\eta_{mAP} = \frac{1}{N_c}\sum_{c=1}^{N_c} \eta_{AP}(c)\), where \(N_c\) is the number of classes (2 here). mAP \(\in [0, 1]\); higher is better.
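
For reference, here is a small NumPy sketch of the discrete AP and mAP computations defined above; the monotone precision-envelope step follows common practice and is an assumption rather than something specified in the text.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the P-R curve: eta_AP = sum_k P(k) * delta_r(k).
    `recall` and `precision` are arrays over detections sorted by descending confidence."""
    # Pad the curve and enforce a monotonically decreasing precision envelope (common practice).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum precision times the recall increment wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """eta_mAP = (1 / N_c) * sum_c eta_AP(c); here N_c = 2 (Helmet_Worn, No_Helmet)."""
    return float(np.mean(ap_per_class))
```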

5. Results and Analysis
5.1 Ablation Study
We systematically evaluate the contribution of each proposed enhancement on our custom UAV helmet dataset using the same hardware/software setup. Results demonstrate the effectiveness of each module:

| Improvement Module     | mAP    | FPS    | mAP \(\Delta\) | FPS \(\Delta\) |
| :--------------------- | :----- | :----- | :------------ | :------------ |
| Baseline (YOLOv8)      | 0.953  | 26.33  | -             | -             |
| + Outlook-C2f          | 0.966  | 26.31  | +0.013        | -0.02         |
| + CARAFE               | 0.961  | 26.47  | +0.008        | +0.14         |
| + WIoUv1               | 0.965  | 26.33  | +0.012        | 0.00          |
| **Proposed (All)**     | **0.967** | **26.91** | **+0.014**    | **+0.58**     |

  • Outlook-C2f: Achieves a significant +1.3% mAP boost with negligible FPS drop (-0.02). This confirms its efficacy in enhancing focus on small helmet features under complex backgrounds without hindering real-time performance vital for Unmanned Aerial Vehicle systems.
  • CARAFE: Provides a +0.8% mAP gain and a slight FPS increase (+0.14). Its content-aware upsampling improves feature fusion quality for small targets, and its lightweight design contributes positively to speed.
  • WIoUv1: Delivers a +1.2% mAP improvement with no FPS penalty. The dynamic focusing mechanism effectively improves localization accuracy, particularly for challenging anchors common in UAV perspectives.
  • Combined Effect: Integrating all three enhancements yields the best performance: 96.7% mAP and 26.91 FPS, representing a +1.4% mAP and +0.58 FPS gain over the baseline YOLOv8. This synergy confirms the complementary nature of the improvements for robust UAV-based helmet detection. Computational cost remains practical (13.4 GFLOPs, 6.8M parameters).

5.2 Comparative Evaluation
Our proposed algorithm is benchmarked against prominent Two-stage and One-stage detectors on the UAV helmet test set:

| Algorithm       | Helmet_Worn Acc. | No_Helmet Acc. | mAP    | FPS    |
| :-------------- | :--------------- | :------------- | :----- | :----- |
| YOLOv5          | 0.935            | 0.911          | 0.949  | -      |
| YOLOv6          | 0.943            | 0.926          | 0.952  | -      |
| YOLOv8 (Baseline)| 0.951            | 0.932          | 0.953  | 26.33  |
| Faster R-CNN    | 0.892            | 0.883          | 0.904  | 19.24  |
| SSD             | 0.917            | 0.866          | 0.926  | 22.74  |
| **Proposed**    | **0.963**        | **0.927**      | **0.967** | **26.91** |

  • Superiority over Two-stage: Faster R-CNN suffers from high computational complexity (lowest FPS: 19.24) and struggles with small targets in high-resolution UAV images (lowest mAP: 90.4%). Our method's single-stage efficiency and multi-scale fusion yield significantly higher speed (26.91 FPS) and accuracy (96.7% mAP).
  • Advantage over One-stage: While SSD offers reasonable speed (FPS 22.74), its reliance on fixed priors harms small target detection (Helmet_Worn Acc. 91.7% vs our 96.3%). Our use of CARAFE specifically addresses this limitation. Compared to the YOLO family (v5, v6, v8 baseline), our enhancements (Outlook, WIoU) consistently boost accuracy (Helmet_Worn Acc. 96.3%, mAP 96.7%) while maintaining or slightly improving FPS (26.91 vs 26.33). This demonstrates the optimal balance between speed and precision achieved for Unmanned Aerial Vehicle deployment.
  • Qualitative Performance: Visual assessment confirms significant advantages:
    • Reduced Missed Detections: The enhanced model reliably detects small or distant helmets missed by baseline YOLOv8, especially against complex backgrounds.
    • Reduced False Positives: Misclassifications (e.g., pedestrians as No_Helmet riders) are substantially reduced.
    • Higher Confidence: Detections in complex scenes exhibit higher confidence scores.
    • Robustness: Low false alarm rate (5.2%) for non-helmet headwear (sun hats, hoods).

6. Conclusion
This work presents a robust and efficient solution for detecting helmet usage among electric vehicle riders leveraging Unmanned Aerial Vehicle platforms and intelligent communication. By integrating UAV mobility and real-time data transmission with a significantly enhanced YOLOv8 algorithm, we address critical challenges in aerial surveillance: small target detection under viewpoint changes, high-resolution image processing, and complex background clutter. Our core innovations are:

  1. Outlook-C2f Architecture: Replaces standard C2f, integrating Outlook attention into bottleneck modules. This dramatically improves focus on small helmet features within dense or complex scenes captured by the Unmanned Aerial Vehicle, verified by a +1.3% mAP gain.
  2. CARAFE Upsampling: Replaces conventional upsampling in the FPN. Its content-aware, dynamically weighted reconstruction preserves crucial spatial details for small helmets, yielding a +0.8% mAP and slight FPS improvement.
  3. WIoUv1 Loss: Substitutes CIoU for bounding box regression. Its dynamic non-monotonic focusing mechanism prioritizes difficult samples and refines localization, resulting in a +1.2% mAP increase.

Comprehensive experiments on a meticulously curated real-world UAV dataset demonstrate the effectiveness of each module and their synergistic combination. Our final model achieves state-of-the-art performance: 96.7% mAP at 26.91 FPS, outperforming established detectors like YOLOv5, YOLOv6, YOLOv8, Faster R-CNN, and SSD. This balance of high accuracy and real-time speed is essential for practical Unmanned Aerial Vehicle-assisted traffic safety enforcement. Future work will focus on further model lightweighting for direct onboard Unmanned Aerial Vehicle processing and exploring multi-UAV collaborative detection networks. The integration of intelligent communication ensures this system provides timely feedback, significantly enhancing road safety regulation efficacy.
