YOLOv8-Based Helmet Detection Method for Electric Vehicle Riders Combining Intelligent Communication and UAV-Assistance

In recent years, the rapid proliferation of electric vehicles (EVs) as a sustainable and cost-effective mode of transportation has highlighted critical safety concerns, particularly regarding rider helmet usage. Statistics indicate that non-compliance with helmet regulations significantly contributes to high injury and fatality rates in traffic accidents. Traditional helmet detection methods, such as manual patrols or static surveillance systems, often fall short in dynamic and congested traffic environments due to their inefficiency, high operational costs, and inability to provide real-time feedback. To address these limitations, this paper proposes an innovative approach that integrates intelligent communication with Unmanned Aerial Vehicle (UAV) technology, leveraging advanced deep learning algorithms for real-time helmet detection. The synergy between drone technology and intelligent communication enables seamless data transmission and processing, facilitating rapid analysis of high-resolution aerial imagery in complex scenarios. By employing an enhanced YOLOv8 model, this method achieves superior accuracy and speed in identifying helmet-wearing EV riders, even under challenging conditions like occlusions, varying lighting, and small target sizes. The integration of drone technology not only enhances surveillance coverage but also ensures adaptability to evolving traffic patterns, making it a pivotal solution for modern urban safety management.

The core of this research lies in optimizing the YOLOv8 architecture to tackle the unique challenges posed by UAV-based detection. UAVs equipped with high-definition cameras capture vast amounts of visual data from aerial perspectives, which are then transmitted via intelligent communication networks for real-time analysis. However, factors such as rapid viewpoint changes, complex backgrounds, and the small size of helmet targets in high-resolution images often degrade detection performance. To overcome these issues, this work introduces three key modifications: an Outlook-C2f module to enhance focus on dense small targets, a CARAFE-based upsampling strategy in the feature pyramid network for improved feature reconstruction, and a WIoU loss function for precise bounding box regression. These innovations collectively boost the model’s robustness and efficiency, as validated through extensive experiments on a custom dataset collected using drone technology. The results demonstrate significant improvements in mean average precision (mAP) and frames per second (FPS), underscoring the potential of this approach for practical deployment in intelligent traffic monitoring systems.

Drone technology has revolutionized various domains, including surveillance, agriculture, and disaster management, by providing unparalleled aerial perspectives and mobility. In the context of traffic safety, Unmanned Aerial Vehicles offer a versatile platform for monitoring helmet compliance among EV riders, enabling authorities to cover large areas efficiently. Unlike fixed cameras, UAVs can adapt to real-time traffic flow, capturing data from multiple angles and altitudes. This flexibility is crucial for detecting small targets like helmets, which may appear minuscule in wide-area shots. Moreover, intelligent communication protocols, such as 5G networks, facilitate low-latency data transfer between UAVs and ground stations, ensuring that detection algorithms can process information swiftly. This combination of drone technology and communication systems forms the backbone of the proposed helmet detection framework, allowing for continuous, real-time oversight without human intervention. By harnessing these advancements, this study aims to set a new benchmark in automated safety enforcement, reducing reliance on labor-intensive methods and minimizing response times in critical situations.

Previous research on helmet detection has primarily focused on static images or videos from ground-based cameras, with limited exploration of UAV applications. For instance, methods based on R-CNN and its variants, such as Faster R-CNN, have shown high accuracy but suffer from slow inference speeds, making them unsuitable for real-time scenarios. Similarly, one-stage detectors like YOLO and SSD offer faster processing but may struggle with small objects and complex backgrounds. Recent efforts, such as those incorporating Swin Transformers or lightweight networks like VanillaNet, have improved detection under occlusions; however, they often overlook the dynamic nature of UAV-captured footage. In contrast, this work specifically addresses the challenges of drone-based detection by refining YOLOv8, a state-of-the-art one-stage detector known for its balance between speed and accuracy. The proposed enhancements are designed to handle the intricacies of aerial imagery, such as scale variations and motion blur, thereby filling a gap in existing literature. Furthermore, the use of a custom dataset, gathered through UAV operations in diverse conditions, ensures that the model is trained on realistic data, enhancing its generalizability and practical utility.

The methodology of this study centers on an improved YOLOv8 model, which comprises four main components: input preprocessing, backbone for feature extraction, neck for multi-scale feature fusion, and head for detection output. The input stage employs Mosaic data augmentation and adaptive scaling to standardize images to 640×640 resolution, enhancing diversity and consistency. The backbone utilizes CSPNet++ with integrated C2f modules and SPPF to extract rich, multi-scale features. However, to better capture small target details, we replace the standard C2f with an Outlook-C2f architecture, which incorporates Outlook attention mechanisms. This attention mechanism operates by computing weighted averages within local windows, allowing the model to focus on critical regions without significant computational overhead. Mathematically, for an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the Outlook attention generates weights $A \in \mathbb{R}^{H \times W \times K^4}$ and value features $V \in \mathbb{R}^{H \times W \times C}$ through linear projections:

$$A = X W_A, \quad V = X W_V$$

where $W_A \in \mathbb{R}^{C \times K^4}$ and $W_V \in \mathbb{R}^{C \times C}$ are learnable matrices. For each spatial position $(i, j)$, a local window $V_{\Delta_{i,j}} \in \mathbb{R}^{K^2 \times C}$ is extracted, and the output is computed as:

$$Y_{\Delta_{i,j}} = \text{SoftMax}(A_{\Delta_{i,j}}) \cdot V_{\Delta_{i,j}}$$

These local outputs are then aggregated into a global feature map $Y$, which is combined with the input via layer normalization to produce the final output $X' = \text{LayerNorm}(Y + X)$. This process enhances the model’s ability to discern helmet features in cluttered environments, a common scenario in drone technology applications.
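The windowed aggregation above can be illustrated with a minimal plain-Python sketch for a single $K \times K$ window. The `attn_logits` and `values` arguments stand in for the per-position results of the $A = X W_A$ and $V = X W_V$ projections; the function names are illustrative, not taken from any released implementation:

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def outlook_window(attn_logits, values):
    """One Outlook-attention step for a single K*K local window.

    attn_logits: K^2 x K^2 nested list (A_{Delta_ij}, reshaped per position)
    values:      K^2 x C nested list   (V_{Delta_ij})
    returns:     K^2 x C output Y_{Delta_ij} = SoftMax(A_{Delta_ij}) . V_{Delta_ij}
    """
    k2 = len(values)
    channels = len(values[0])
    out = []
    for row in attn_logits:
        w = softmax(row)  # normalize attention weights over the window
        out.append([sum(w[p] * values[p][c] for p in range(k2))
                    for c in range(channels)])
    return out
```

With uniform logits the window output reduces to the mean of the value vectors, which makes the weighted-average interpretation of Outlook attention easy to verify by hand.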

In the neck section, the feature pyramid network (FPN) and path aggregation network (PAN) are enhanced by replacing conventional upsampling with CARAFE (Content-Aware ReAssembly of FEatures). CARAFE dynamically generates kernel weights based on input content, leading to more accurate feature reconstruction. Specifically, for an input feature map $X$, it first computes context-aware weights $W$ using a lightweight convolution and softmax:

$$W = \text{SoftMax}(\text{Conv}(X))$$

These weights are then applied to reassemble features through a weighted sum over local regions, followed by a PixelShuffle operation to increase spatial resolution:

$$Y_{i,j} = \sum_{m,n} U_{m,n} \cdot W_{i,j,m,n}, \quad Y' = \text{PixelShuffle}(Y)$$

where $U$ represents unfolded local patches. This approach preserves fine details, which is vital for detecting small helmets in high-resolution UAV imagery. Additionally, the detection head employs a Wise-IoU (WIoU) loss function to refine bounding box regression. Unlike CIoU, WIoU introduces a distance-aware weighting factor $R_{WIoU}$ that adapts to anchor quality, defined as:

$$L_{WIoU} = R_{WIoU} \cdot L_{IoU}$$

where $L_{IoU} = 1 - IoU$, and $R_{WIoU}$ is calculated based on the minimum enclosing box dimensions $W_g$ and $H_g$:

$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$$

Here, $(x, y)$ and $(x_{gt}, y_{gt})$ denote the center coordinates of the anchor and ground-truth boxes, respectively. This dynamic focusing mechanism prioritizes difficult samples, improving localization accuracy for helmets in varied poses and sizes, a key advantage in Unmanned Aerial Vehicle-based monitoring.
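A minimal sketch of this loss for axis-aligned boxes in the $(x_1, y_1, x_2, y_2)$ corner convention; the function name is illustrative. Following the Wise-IoU formulation, the denominator uses $W_g^2 + H_g^2$ (in training this term is detached from the gradient, which a plain-Python sketch cannot show):

```python
import math

def wiou_loss(pred, gt):
    """Wise-IoU loss L_WIoU = R_WIoU * (1 - IoU) for two (x1, y1, x2, y2) boxes."""
    # Intersection and union areas
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)
    # Minimum enclosing box dimensions (W_g, H_g) and center distance
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    r_wiou = math.exp((dx * dx + dy * dy) / (wg * wg + hg * hg))
    return r_wiou * (1.0 - iou)
```

For a perfectly aligned prediction the loss is zero, while any offset inflates both the IoU term and the exponential weighting factor, which is what lets the loss emphasize poorly localized anchors.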

To validate the proposed method, experiments were conducted on a custom dataset comprising 5,413 images captured by a DJI Mavic 2 UAV in urban areas of Guilin, China. The dataset includes diverse conditions—such as different times of day (morning, noon, evening) and weather (sunny, cloudy, rainy)—ensuring robustness. Images were annotated with two classes: helmet-wearing riders and non-helmet-wearing riders, and split into training (4,230 images), validation (591 images), and test (592 images) sets. Data augmentation techniques, including flipping, cropping, and Mosaic, were applied to enhance generalization. The experimental setup used an NVIDIA RTX A4000 GPU, with training parameters detailed in Table 1. The Adam optimizer was employed with an initial learning rate of 1e-3 and cosine annealing scheduling over 100 epochs.

Table 1: Training Parameters Configuration
Parameter      Value
Batch Size     32
Epochs         100
Momentum       0.937
Weight Decay   5e-4

Evaluation metrics included mean average precision (mAP) and frames per second (FPS). The mAP is derived from the precision-recall curve, where precision $P$ and recall $R$ are defined as:

$$P = \frac{N_{TP}}{N_{TP} + N_{FP}}, \quad R = \frac{N_{TP}}{N_{TP} + N_{FN}}$$

Here, $N_{TP}$, $N_{FP}$, and $N_{FN}$ represent true positives, false positives, and false negatives, respectively. The average precision (AP) for each class is approximated by:

$$AP \approx \sum_{k=1}^{N} P(k) \Delta r(k)$$

and mAP is the mean over all classes. FPS measures real-time performance, with higher values indicating faster detection.

Ablation studies were performed to assess the individual contributions of each modification, as summarized in Table 2. The baseline YOLOv8 achieved an mAP of 0.953 and FPS of 26.33. Adding the Outlook-C2f module increased mAP to 0.966 with minimal FPS drop (26.31), highlighting its efficacy in small target attention. Incorporating CARAFE improved mAP to 0.961 and FPS to 26.47, demonstrating its efficiency in feature reconstruction. Using WIoU alone boosted mAP to 0.965 while maintaining FPS at 26.33. The full integration of all components yielded the best results: mAP of 0.967 and FPS of 26.91, confirming their synergistic benefits for drone technology applications.
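The AP approximation defined above can be sketched in plain Python as a step sum $\sum_k P(k)\,\Delta r(k)$ over detections sorted by confidence; the function names are illustrative:

```python
def average_precision(scored, num_gt):
    """AP ~= sum of P(k) * delta_r(k) over confidence-ranked detections.

    scored: list of (confidence, is_true_positive) pairs
    num_gt: number of ground-truth objects for the class (N_TP + N_FN)
    """
    scored = sorted(scored, key=lambda t: -t[0])  # rank by confidence
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)          # P = N_TP / (N_TP + N_FP)
        recall = tp / num_gt                # R = N_TP / (N_TP + N_FN)
        ap += precision * (recall - prev_recall)  # P(k) * delta_r(k)
        prev_recall = recall
    return ap

def mean_ap(per_class_aps):
    """mAP is the mean of the per-class AP values."""
    return sum(per_class_aps) / len(per_class_aps)
```

A ranking in which every detection is a true positive yields AP = 1.0, while false positives ranked above true positives pull the precision at each recall step, and hence the AP, downward.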

Table 2: Ablation Study Results
Modification    mAP     FPS
Baseline        0.953   26.33
+ Outlook-C2f   0.966   26.31
+ CARAFE        0.961   26.47
+ WIoU          0.965   26.33
All Combined    0.967   26.91

Comparative experiments with state-of-the-art detectors further validated the superiority of the proposed method. As shown in Table 3, our approach outperformed YOLOv5 (mAP: 0.949, FPS: 28.48), YOLOv6 (mAP: 0.952, FPS: 28.12), YOLOv8 (mAP: 0.953, FPS: 26.33), Faster R-CNN (mAP: 0.904, FPS: 19.24), and SSD (mAP: 0.926, FPS: 22.74). Specifically, the proposed algorithm achieved an mAP of 0.967 and FPS of 26.91, indicating an optimal balance between accuracy and speed. The improvements are attributed to the tailored enhancements for UAV-based scenarios, such as the Outlook attention mechanism for localized focus and CARAFE for detailed upsampling. Visual comparisons, as illustrated in sample detection outputs, revealed that the proposed method reduced false positives and missed detections in complex backgrounds, such as distinguishing helmets from similar objects like sun hats or clothing hoods. For instance, in one test case, the baseline YOLOv8 failed to detect a helmet in a crowded scene, while our method correctly identified it with high confidence. These results underscore the practical viability of integrating drone technology with advanced deep learning for real-world safety enforcement.

Table 3: Comparative Experiment Results
Algorithm         Helmet-Wearing Accuracy   Non-Helmet Accuracy   mAP     FPS
YOLOv5            0.935                     0.911                 0.949   28.48
YOLOv6            0.943                     0.926                 0.952   28.12
YOLOv8            0.951                     0.932                 0.953   26.33
Faster R-CNN      0.892                     0.883                 0.904   19.24
SSD               0.917                     0.866                 0.926   22.74
Proposed Method   0.963                     0.927                 0.967   26.91

In conclusion, this study presents a robust helmet detection framework for electric vehicle riders by combining intelligent communication with Unmanned Aerial Vehicle technology and an enhanced YOLOv8 model. The proposed innovations—Outlook-C2f for attention-driven feature extraction, CARAFE for content-aware upsampling, and WIoU for refined localization—collectively address the challenges of small target detection in dynamic aerial environments. Experimental results on a real-world dataset demonstrate significant gains in both accuracy and speed, with the full model achieving an mAP of 96.7% and FPS of 26.91. These findings highlight the potential of drone technology to revolutionize traffic safety monitoring, offering a scalable and efficient alternative to traditional methods. Future work will focus on further model lightweighting for onboard UAV deployment and extending the approach to multi-object tracking and behavioral analysis. By continuously refining these technologies, we aim to contribute to safer urban mobility and reduce accident risks associated with non-helmet usage, ultimately saving lives through intelligent automation.

The integration of drone technology into public safety systems represents a paradigm shift, enabling proactive monitoring and rapid response. Unmanned Aerial Vehicles, with their aerial vantage points, provide comprehensive coverage that ground-based systems cannot match. When coupled with intelligent communication networks, they facilitate real-time data exchange, allowing authorities to enforce helmet regulations effectively. This study underscores the importance of adapting deep learning models to the unique demands of UAV operations, such as handling scale variations and motion artifacts. The success of the proposed method opens avenues for broader applications, including pedestrian safety, crowd management, and emergency response. As drone technology continues to evolve, its synergy with artificial intelligence will undoubtedly lead to more sophisticated and reliable solutions for enhancing public safety and operational efficiency in diverse scenarios.
