EAD-YOLO: A Novel Approach for Human Posture Recognition in Drone Technology

1. Introduction

The rapid advancement of drone technology has opened new frontiers in computer vision applications, particularly in human behavior analysis for public safety. In densely populated areas such as squares and parks, the risk of dangerous events like pedestrian falls is significantly high. Traditional surveillance systems suffer from limited field of view and fixed positioning, making them inadequate for comprehensive monitoring. Drone technology, with its remarkable flexibility and wide aerial perspective, offers a superior alternative for human posture recognition in such dynamic environments.

However, applying human pose estimation from drone-captured imagery presents unique challenges. The varying flight altitudes, which typically range from 10 to 18 meters in our study, cause significant scale variations in human appearance. Occlusion between pedestrians in crowded scenes and varying illumination conditions further compound the difficulty. These factors frequently lead to missed detections and false positives in keypoint localization, which are critical for accurate posture recognition.

Deep learning has achieved remarkable breakthroughs in computer vision, providing effective methodologies for human posture recognition from aerial perspectives. Recent approaches such as AlphaPose combined with LightGBM have shown improvements in low-quality image scenarios. Spatiotemporal graph convolutional networks incorporating attention mechanisms have been proposed for precise recognition of violent behaviors from aerial footage. Multi-UAV frameworks leveraging data from multiple orientations, altitudes, and distances have enhanced keypoint detection accuracy through diverse viewpoint aggregation.

Among existing methods, YOLO-Pose has demonstrated exceptional real-time detection capabilities by integrating human pose estimation within the YOLO framework. It inherits the single neural network structure of YOLO, simultaneously performing object detection and human keypoint localization, thereby significantly improving computational efficiency. However, the standard YOLOv8n-Pose model struggles with keypoint detection under the challenging conditions typical of drone technology applications, particularly when dealing with multi-scale targets, occluded joints, and poor illumination.

To address these limitations, we propose EAD-YOLO, an enhanced human keypoint detection method specifically designed for drone technology viewpoints. Our approach introduces three key innovations:

  • An improved SPPF module integrated with Efficient Layer Aggregation Networks (ELAN) to enhance multi-scale feature extraction capability, enabling robust detection across varying flight altitudes.
  • An Adaptive Lightweight Channel Split and Shuffling module (ALSS) to strengthen the recognition of occluded keypoints through adaptive feature weighting and cross-channel information exchange.
  • A lightweight dynamic upsampler named Dysample to reduce background noise interference while maintaining computational efficiency, particularly beneficial under low-light conditions.

The experimental results demonstrate that our improved model achieves a 3.02% increase in mAP@50 while maintaining only 3.59 million parameters and 9.3 GFLOPS, successfully meeting the stringent requirements of real-time human keypoint detection in drone technology applications.

2. Related Work

2.1 Human Pose Estimation in Aerial Imagery

Human pose estimation from aerial imagery has gained increasing attention due to the proliferation of drone technology. Traditional approaches relied on hand-crafted features and geometric constraints, but their performance degraded significantly under the complex conditions typical of aerial footage. The advent of deep convolutional neural networks has revolutionized this field, enabling end-to-end learning of pose representations directly from image data.

Several notable works have addressed the specific challenges of aerial human pose estimation. Song et al. proposed AlphaPose combined with LightGBM to improve recognition accuracy in low-quality images. Shao et al. introduced an attention-enhanced spatiotemporal graph convolutional network to handle motion blur and scale anomalies in aerial footage. Fu et al. developed a multi-UAV framework that leverages synchronized video sequences from multiple orientations to improve keypoint detection accuracy. Yu designed a comprehensive software system for human behavior recognition from drone footage, addressing the frequent viewpoint changes inherent in drone technology.

Despite these advances, existing methods still face significant challenges. Multi-altitude and multi-angle flight operations cause substantial variations in human body scale. Complex backgrounds and pedestrian occlusion frequently result in missed keypoint detections. Environmental illumination changes affect aerial image quality, making human keypoint extraction more difficult. These challenges necessitate specialized solutions tailored to the unique characteristics of drone technology.

2.2 YOLO-Based Pose Estimation

The YOLO (You Only Look Once) family of object detectors has been widely adopted for real-time applications due to its excellent balance between speed and accuracy. YOLO-Pose extends this capability to human pose estimation by integrating keypoint detection into the unified YOLO architecture. This approach inherits the efficiency of single-stage detection while simultaneously outputting both bounding boxes and keypoint coordinates.

YOLOv8n-Pose represents a lightweight variant of this family, making it particularly suitable for deployment on computational resource-constrained platforms such as those found in drone technology systems. However, when applied to aerial footage, YOLOv8n-Pose exhibits several limitations. The standard SPPF module provides insufficient multi-scale feature representation for the widely varying human sizes encountered at different flight altitudes. The C2f module in the neck network lacks adaptive mechanisms to handle occluded keypoints effectively. The conventional upsampling method fails to preserve fine-grained details under poor illumination conditions.

These observations motivate our research to develop targeted improvements to YOLOv8n-Pose, creating a variant specifically optimized for the challenges of drone technology-based human posture recognition.

3. Proposed Method: EAD-YOLO

EAD-YOLO is designed to overcome the key challenges faced by YOLOv8n-Pose in drone technology applications. Our network architecture incorporates three major innovations: an enhanced SPPELAN-H module for multi-scale feature extraction, an ALSS module for robust occlusion handling, and a Dysample upsampler for improved detail preservation under varying illumination. The overall architecture is carefully designed to balance detection accuracy with computational efficiency, making it suitable for real-time deployment on drone platforms.

3.1 Improved SPPF Module with ELAN

The original Spatial Pyramid Pooling Fast (SPPF) module in YOLOv8n-Pose employs sequential max-pooling operations at different scales to capture multi-resolution features. However, for drone technology applications where human body sizes vary dramatically with flight altitude, the standard SPPF module provides insufficient feature representation capability. To address this limitation, we propose SPPELAN-H, which integrates Efficient Layer Aggregation Networks (ELAN) into the SPPF structure.

The SPPELAN-H module introduces multiple max-pooling layers at different scales and employs hierarchical aggregation mechanisms to fuse features across layers. By using a CSP (Cross Stage Partial) structure, the module maintains computational efficiency while preserving input features from the previous layer. After convolution and pooling operations, feature maps undergo multi-scale feature extraction through cascade operations, generating more expressive feature representations that enhance the network’s ability to describe complex morphologies and multi-scale targets.
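The cascaded pooling idea can be illustrated with a minimal one-dimensional sketch. It is not the SPPELAN-H implementation: it omits the convolutions, the CSP branch, and the CBH blocks, and the 5-wide kernel and three pooling levels are assumptions borrowed from the standard SPPF design. It only shows why chaining small stride-1 max-poolings yields progressively larger receptive fields whose outputs can then be aggregated.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with 'same' padding (windows are clipped at the borders)."""
    radius = kernel // 2
    n = len(values)
    return [
        max(values[max(0, i - radius): min(n, i + radius + 1)])
        for i in range(n)
    ]


def cascaded_pyramid(values, kernel=5, levels=3):
    """Return [input, pool1, pool2, pool3]; each pooling is applied to the previous output."""
    outputs = [list(values)]
    for _ in range(levels):
        outputs.append(max_pool_1d(outputs[-1], kernel))
    return outputs


signal = [0, 1, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 3, 0, 0, 0]
pyramid = cascaded_pyramid(signal)

# Cascading three 5-wide poolings matches direct poolings of width 5, 9, and 13.
assert pyramid[1] == max_pool_1d(signal, 5)
assert pyramid[2] == max_pool_1d(signal, 9)
assert pyramid[3] == max_pool_1d(signal, 13)

# In a 2D network the four maps would be concatenated along the channel axis;
# here each position simply receives a 4-value multi-scale feature vector.
fused = list(zip(*pyramid))
print(fused[7])
```

The cascade reuses the same small kernel at every level, which is why this style of pyramid pooling is cheaper than running several large kernels in parallel.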

The modified structure incorporates CBH blocks, which consist of Conv2d, Batch Normalization, and HardSwish activation functions. The use of HardSwish instead of the original SiLU activation function provides more stable gradient computation during training. The SiLU activation function involves exponential calculations whose nonlinear characteristics can cause significant gradient variations across different intervals, potentially leading to training instability or gradient vanishing:

$$ \text{SiLU}(x) = x \cdot \frac{1}{1 + e^{-x}} $$

HardSwish offers smoother gradient computation and eliminates the potential numerical precision losses associated with different implementations of the Sigmoid approximation:

$$ \text{HardSwish}(x) = \begin{cases} 0, & x \leq -3 \\ x \cdot \frac{x+3}{6}, & -3 < x < 3 \\ x, & x \geq 3 \end{cases} $$

Because HardSwish is defined piecewise by polynomial segments, with no exponential term, its gradient is piecewise linear and cheap to compute, which contributes to improved training stability. This is particularly important for drone technology applications where models must generalize across diverse environmental conditions.
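The two activation functions and their derivatives can be compared numerically with a short sketch. The function names are illustrative, and the derivative formulas are obtained by differentiating the two definitions above rather than taken from the original implementation.

```python
import math


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_grad(x):
    """d/dx [x * s(x)] = s(x) * (1 + x * (1 - s(x)))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 + x * (1.0 - s))


def hard_swish(x):
    """Piecewise definition: 0 below -3, x above 3, x * (x + 3) / 6 in between."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0


def hard_swish_grad(x):
    """Derivative of HardSwish: 0, (2x + 3) / 6, or 1 depending on the segment."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    return (2.0 * x + 3.0) / 6.0


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(
        f"x={x:+.1f}  SiLU={silu(x):+.4f}  HardSwish={hard_swish(x):+.4f}  "
        f"dSiLU={silu_grad(x):+.4f}  dHardSwish={hard_swish_grad(x):+.4f}"
    )
```

HardSwish needs only comparisons, additions, and multiplications, whereas SiLU evaluates an exponential at every element.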

3.2 Adaptive Lightweight Channel Split and Shuffling Module (ALSS)

Pedestrian occlusion and overlap present significant challenges for keypoint detection in drone technology applications. To address these issues, we replace the C2f module in the neck network with the Adaptive Lightweight Channel Split and Shuffling (ALSS) module. The ALSS module employs a weighted channel splitting strategy to optimize feature extraction and enhances information exchange through channel shuffling operations, thereby improving detection accuracy for occluded human keypoints.

The ALSS module takes the output feature tensor from the previous layer as input, with the total number of input channels denoted as C. The input channels are weighted, with α and 1-α representing the proportions assigned to the first and second parts respectively, so the H × W × C input feature map is split into H × W × αC and H × W × (1-α)C. By dynamically allocating weights to retain different feature maps, the module selectively focuses on critical image information, enhancing the flexibility and adaptability of feature learning.

In our implementation, α is set to 0.2, ensuring that only a small portion of channel features enters Branch A, while the majority of channel features are directed to the more complex multi-level Branch B for processing. This configuration enables the network to focus more attention on occluded human keypoints. In Branch A, the input feature map channels undergo average pooling to reduce feature map size while preserving primary feature information, followed by convolution operations to extract local features, and finally channel shuffling.

The convolution process in Branch A is expressed as:

$$ X_{A}^{out} = \text{Conv}(\text{Pool}(X_{A}^{in})) $$

Branch B adopts a bottleneck structure to avoid branch redundancy, connecting feature maps along the channel direction. This design not only enhances computational efficiency but also improves information interaction and feature expression capabilities. The branch first applies a 3×3 convolution kernel for dimensionality reduction, with stride 1 for all subsequent convolution layers to ensure equal resolution between input and output feature maps. After dimensionality reduction, a dimension reduction coefficient β is introduced to help the model establish balanced relationships among multiple feature sources. At layers with fewer channels, smaller β values ensure that the model maintains sufficient feature extraction capability while preserving critical information in feature maps.

A 3×3 depthwise convolution processes each channel independently to enhance network nonlinearity, followed by a 1×1 pointwise convolution that sets the output dimension to C_out - αC. The entire process can be represented as:

$$ \begin{cases} X_{B}^{phase1} = \text{conv}(X_{B}^{in}) \\ X_{B}^{phase2} = \text{DWconv}_n(X_{B}^{phase1}) \\ X_{B}^{out} = \text{conv}(X_{B}^{phase2}) \end{cases} $$

In the final stage, the output features from Branch A and Branch B undergo channel shuffling operations to enhance information exchange between different feature channels. This operation captures and integrates multi-scale and multi-angle feature information by rearranging the input feature map channels, allowing previously isolated feature branches to share learned information and improving the diversity of feature expression.
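The channel bookkeeping of the split and the final shuffle can be sketched without any tensor library. This is an illustration under stated assumptions, not the ALSS implementation: the rounding rule for αC, the group count of 2, and the ShuffleNet-style reshape-transpose-flatten shuffle are all assumptions, since the text does not specify them. Channels are represented by labels so the reordering is visible.

```python
def split_channels(num_channels, alpha=0.2):
    """Return (Branch A channels, Branch B channels) for split ratio alpha.

    Rounding alpha * C to the nearest integer is an assumption.
    """
    branch_a = int(round(alpha * num_channels))
    return branch_a, num_channels - branch_a


def channel_shuffle(channels, groups=2):
    """Reshape to (groups, n / groups), transpose, and flatten.

    The group count is hypothetical; the text does not state it.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


num_a, num_b = split_channels(10, alpha=0.2)
branch_a_out = [f"A{i}" for i in range(num_a)]
branch_b_out = [f"B{i}" for i in range(num_b)]

# Concatenate both branch outputs along the channel axis, then shuffle.
shuffled = channel_shuffle(branch_a_out + branch_b_out)
print(num_a, num_b)
print(shuffled)
```

After the shuffle, channels that originated in Branch A sit next to channels from Branch B, so any subsequent grouped or local channel operation sees features from both branches.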

3.3 Lightweight Dynamic Upsampler Dysample

To improve the resolution of data captured under low-light conditions while maintaining computational real-time performance, we introduce the lightweight efficient dynamic sampler Dysample into the model. The original upsampler in YOLOv8n-Pose employs nearest-neighbor and bilinear interpolation methods to achieve higher resolution. However, these methods ignore image edge information, potentially causing loss of detailed human features in aerial images.

Dysample performs upsampling by incorporating spatial context information from the image, better preserving detailed features and edge information of drone-captured images without requiring additional high-resolution feature inputs. This approach maintains high performance during the upsampling process while reducing model computational complexity.

Given a feature map X_1 with dimensions C × H × W and a sampling set S with dimensions 2 × sH × sW (where s is the scaling factor), the dynamic upsampling process dynamically adjusts the proportion of the input feature map X_1 to ensure that post-processing can be optimized for different resolutions and scales. The grid_sample function resamples and interpolates the feature map X_1 according to positions in the sampling set S, while utilizing reverse mapping to associate each pixel in the output image with multiple pixels in the input feature map, obtaining more accurate feature expressions.

The Dysample sampling process involves the following equations:

$$ X_2 = \text{gridsample}(X_1, S) $$

$$ O = \text{linear}(X) $$

$$ O = 0.5 \cdot \text{sigmoid}(\text{linear}_1(X)) \cdot \text{linear}_2(X) $$

In these equations, S represents the output of the sampling point generator, and O represents the offset. With upsampling scale factor s and original sampling grid G, the feature map X with dimensions C × H × W passes through a linear layer with C input channels and 2s² output channels, generating an offset O with dimensions 2s² × H × W. By increasing the dimensionality, the module effectively learns more feature representations. Through pixel shuffling, the offset is reshaped to 2 × sH × sW and added to the original grid G to form the sampling set S = G + O, which grid_sample then uses to turn the low-resolution feature map into a high-resolution one.

The offset O can be modulated using a dynamic range factor, with the sigmoid function and a static factor of 0.5 controlling the dynamic range within [0, 0.5], ensuring sampling effectiveness and feature extraction stability. This is particularly beneficial for drone technology applications where varying illumination conditions significantly impact image quality.
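The shape bookkeeping and the offset modulation can be checked with a small pure-Python sketch. It is not the Dysample implementation: the linear layers are replaced by stand-in values, nested lists stand in for tensors, and the pixel-shuffle index convention is assumed to follow the common sub-pixel layout in which output position (c, h·s+i, w·s+j) reads input channel c·s² + i·s + j.

```python
import math


def pixel_shuffle(tensor, scale):
    """Rearrange a (C * scale^2, H, W) nested list into (C, scale * H, scale * W)."""
    in_channels, height, width = len(tensor), len(tensor[0]), len(tensor[0][0])
    out_channels = in_channels // (scale * scale)
    out = [[[0.0] * (width * scale) for _ in range(height * scale)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                for i in range(scale):
                    for j in range(scale):
                        src = c * scale * scale + i * scale + j
                        out[c][h * scale + i][w * scale + j] = tensor[src][h][w]
    return out


def modulated_offset(gate_logit, raw_offset):
    """One element of O = 0.5 * sigmoid(linear_1(X)) * linear_2(X)."""
    return 0.5 / (1.0 + math.exp(-gate_logit)) * raw_offset


scale, height, width = 2, 3, 4
# Stand-in for the linear layer outputs: 2 * scale^2 offset channels per position.
offsets = [[[modulated_offset(0.1 * h, 1.0) for _ in range(width)]
            for h in range(height)]
           for _ in range(2 * scale * scale)]

grid_offsets = pixel_shuffle(offsets, scale)
shape = (len(grid_offsets), len(grid_offsets[0]), len(grid_offsets[0][0]))
print(shape)  # 2 x sH x sW
```

With a raw offset of 1.0 the modulated values stay strictly between 0 and 0.5, which is the bounded range the sigmoid gate is meant to enforce.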

3.4 Keypoint-Based Human Posture Recognition

For posture recognition, we select two common human behaviors: falling and standing. The recognition process combines three categories of features derived from the detected keypoints. The first feature is the angle between the human body centerline and the horizontal axis of the image. Using the keypoint coordinates detected by the algorithm model, we identify the left and right shoulder keypoints and the left and right hip keypoints, calculate the midpoint coordinates of each pair, connect the two midpoints to obtain the human body centerline, and compute the angle θ between the centerline and the horizontal axis:

$$ \theta = \arccos\left( \frac{\text{hips}_{c_x} - \text{shoulders}_{c_x}}{\sqrt{(\text{hips}_{c_x} - \text{shoulders}_{c_x})^2 + (\text{hips}_{c_y} - \text{shoulders}_{c_y})^2}} \right) $$

The second feature is the aspect ratio of the human bounding rectangle. When a person is standing upright, the aspect ratio is relatively small; when falling, the aspect ratio increases. The aspect ratio (Ar) is calculated as:

$$ Ar = \frac{|Y_2 - Y_1|}{|X_2 - X_1|} $$

where |Y_2 - Y_1| represents the width of the bounding rectangle and |X_2 - X_1| represents the height, so Ar is the width-to-height ratio of the bounding rectangle.

The third feature is the vertical distance difference between the shoulder center point and the hip center point:

$$ d = |\text{shoulders}_{c_y} - \text{hips}_{c_y}| $$
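The three features can be computed from a keypoint list as in the following sketch. Several details are assumptions rather than facts from the method: keypoints are taken to be 17 (x, y) pixel pairs indexed in the annotation order used for the dataset (shoulders at 5 and 6, hips at 11 and 12), the box is given as (x1, y1, x2, y2) with x horizontal, and the aspect ratio is computed as width over height, which is what the description of the second feature implies.

```python
import math

# Assumed 0-based indices in a 17-keypoint layout (nose first, ankles last).
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12


def midpoint(p, q):
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def posture_features(keypoints, box):
    """Return (theta in degrees, width/height ratio, vertical shoulder-hip distance)."""
    shoulders = midpoint(keypoints[L_SHOULDER], keypoints[R_SHOULDER])
    hips = midpoint(keypoints[L_HIP], keypoints[R_HIP])

    dx, dy = hips[0] - shoulders[0], hips[1] - shoulders[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("shoulder and hip centers coincide")
    theta = math.degrees(math.acos(dx / length))

    x1, y1, x2, y2 = box
    aspect_ratio = abs(x2 - x1) / abs(y2 - y1)  # width / height
    distance = abs(shoulders[1] - hips[1])
    return theta, aspect_ratio, distance


# Hypothetical upright person: shoulders directly above hips, tall narrow box.
keypoints = [(0.0, 0.0)] * 17
keypoints[L_SHOULDER], keypoints[R_SHOULDER] = (100.0, 50.0), (120.0, 50.0)
keypoints[L_HIP], keypoints[R_HIP] = (102.0, 110.0), (118.0, 110.0)

theta, aspect_ratio, distance = posture_features(keypoints, (95.0, 30.0, 125.0, 180.0))
print(theta, aspect_ratio, distance)
```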

Since actual postures rarely satisfy all three feature conditions simultaneously, we employ a feature scoring mechanism. When the angle between the human centerline and the horizontal axis falls between 65° and 115°, the standing posture score increases by 0.8; when the angle is less than 30°, the falling posture score increases by 0.8. When the aspect ratio is less than 0.7, the standing posture score increases by 0.5; when the aspect ratio exceeds 1, the falling posture score increases by 0.5. The threshold for the vertical distance between shoulder and hip centers is set to 25; when the distance exceeds this threshold, the standing posture score increases by 0.5; when below the threshold, the falling posture score increases by 0.5. The final scores for falling and standing behaviors are calculated by accumulating the scores for each feature, and the behavior with the highest score is determined as the recognized posture.
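The scoring rule above can be written as a short function. The thresholds and score increments are taken from the paragraph; the function names, the assumption that the distance threshold of 25 is in pixels, and the handling of ties (which the text does not specify) are illustrative choices.

```python
def posture_scores(theta_deg, aspect_ratio, distance, distance_threshold=25.0):
    """Accumulate standing and falling scores from the three features."""
    standing, falling = 0.0, 0.0

    # Feature 1: angle between the body centerline and the horizontal axis.
    if 65.0 <= theta_deg <= 115.0:
        standing += 0.8
    elif theta_deg < 30.0:
        falling += 0.8

    # Feature 2: aspect ratio of the bounding rectangle.
    if aspect_ratio < 0.7:
        standing += 0.5
    elif aspect_ratio > 1.0:
        falling += 0.5

    # Feature 3: vertical distance between shoulder and hip centers.
    if distance > distance_threshold:
        standing += 0.5
    elif distance < distance_threshold:
        falling += 0.5

    return standing, falling


def classify(theta_deg, aspect_ratio, distance):
    standing, falling = posture_scores(theta_deg, aspect_ratio, distance)
    if standing == falling:
        return "undetermined"  # tie handling is not specified in the text
    return "standing" if standing > falling else "falling"


print(classify(90.0, 0.2, 60.0))   # all three features agree
print(classify(10.0, 1.5, 8.0))    # all three features agree
print(classify(20.0, 0.5, 30.0))   # features disagree; the scores decide
```

The third call shows why scores are accumulated rather than requiring all conditions at once: the angle votes for falling with 0.8, but the aspect ratio and distance together contribute 1.0 to standing.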

4. Experimental Setup

4.1 Experimental Environment

All experiments were conducted in a controlled environment with consistent hardware and software configurations. The experimental setup was designed to evaluate the performance of our proposed EAD-YOLO model under conditions representative of real-world drone technology applications. The configuration details are summarized in Table 1.

| Configuration Item | Specification |
| --- | --- |
| Operating System | Ubuntu 22.04 |
| CPU | Intel(R) Core(TM) i7-12700K |
| GPU | Nvidia GeForce RTX 4080S |
| Deep Learning Framework | PyTorch 1.12.1 |
| CUDA Version | 11.9 |
| Programming Language & Libraries | Python 3.9, NumPy, Matplotlib |

All training images were resized to 640×640 pixels. The initial learning rate was set to 0.01, batch size to 8, and the Adam optimizer was employed with a momentum factor of 0.937. The maximum number of training epochs was set to 500, ensuring sufficient convergence for all compared models.
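For reproduction purposes, the hyperparameters above can be collected in a single configuration mapping. The key names below follow the training arguments of the Ultralytics package, which is an assumption: the text does not state which training code was used, and the dataset file name in the commented call is hypothetical.

```python
# Hyperparameters as stated in the text; key names assume Ultralytics-style arguments.
train_config = {
    "imgsz": 640,         # training images resized to 640x640
    "lr0": 0.01,          # initial learning rate
    "batch": 8,           # batch size
    "optimizer": "Adam",  # optimizer
    "momentum": 0.937,    # momentum factor
    "epochs": 500,        # maximum number of training epochs
}

# Hypothetical usage, if the Ultralytics package and a dataset YAML are available:
# from ultralytics import YOLO
# YOLO("yolov8n-pose.yaml").train(data="drone_pose.yaml", **train_config)

print(train_config)
```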

4.2 Dataset Description

Existing aerial human pose datasets such as UAV-Human cover limited shooting angles and homogeneous environments, so they cannot adequately meet the requirements of our human posture recognition task. Therefore, we constructed a dedicated dataset by capturing images on campus and in the surrounding squares and parks. Our dataset comprises 1,500 drone-captured images of pedestrian postures, with shooting heights ranging from 10 m to 18 m. Specifically, 1,085 images were captured at heights of 15 m to 18 m, and 415 images at heights of 10 m to 15 m.

The dataset was collected during both sunny and cloudy periods to ensure diverse illumination conditions, and includes various scenarios such as different angles, occlusions, single targets, and multiple targets. The captured human behaviors include falling, climbing, and standing. All images have a resolution of 1920×1080 pixels. The dataset was split into training and validation sets in an 8:2 ratio.
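An 8:2 split of 1,500 images gives 1,200 training and 300 validation images. A minimal sketch of such a split follows; the seeded random shuffle and the image identifiers are assumptions, since the text does not describe how images were assigned to each set.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then cut into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


image_ids = [f"img_{i:04d}" for i in range(1500)]  # hypothetical identifiers
train_ids, val_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids))
```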

Each image was annotated with 17 human keypoints using Labelme: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. This comprehensive annotation scheme enables detailed posture analysis for various drone technology applications.

4.3 Evaluation Metrics

We employed a comprehensive set of evaluation metrics to assess model performance. These metrics include F1-Score, Precision, Recall, mean Average Precision (mAP), parameter count (Params), and floating-point operations (GFLOPS). The F1-Score, Precision, Recall, and mAP are calculated as follows:

$$ \begin{cases} R = \frac{TP}{TP + FN} \\ P = \frac{TP}{TP + FP} \\ F1 = \frac{2PR}{P + R} \\ AP = \int_{0}^{1} P(R) dR \\ mAP = \frac{1}{N} \sum_{n=1}^{N} AP(n) \end{cases} $$

In these equations, TP represents the number of positive samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN represents the number of positive samples incorrectly predicted as negative, R denotes recall, P denotes precision, AP represents the average precision for each class, and mAP represents the mean of all class AP values. Higher mAP values indicate better detection performance.
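These definitions translate directly into code. In the sketch below, the trapezoidal rule used to approximate the integral of P(R) is an assumption; evaluation toolkits typically use an interpolated precision curve instead, and the function names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R, and F1 from raw counts (assumes nonzero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from_pr(precision, recall):
    """F1 as the harmonic mean of precision and recall (any consistent unit)."""
    return 2 * precision * recall / (precision + recall)


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the area under the P(R) curve."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_values):
    return sum(ap_values) / len(ap_values)


print(precision_recall_f1(tp=90, fp=10, fn=30))
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))
print(round(f1_from_pr(94.03, 93.31), 2))
```

As a consistency check, applying `f1_from_pr` to a reported precision and recall pair should reproduce the corresponding reported F1 score up to rounding.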

5. Experimental Results and Analysis

5.1 ALSS Module Position Ablation Study

To determine the optimal placement of the ALSS module within the network, we conducted comparative experiments by replacing C2f modules at different positions: in the Backbone only, in the Neck only, in both Backbone and Neck, and in neither (baseline). The results are presented in Table 2.

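| Backbone | Neck | mAP@50 (%) | Params (×10⁶) | GFLOPS |
| --- | --- | --- | --- | --- |
| - | - | 93.66 | 3.30 | 9.2 |
| ✓ | ✓ | 93.45 | 2.86 | 6.7 |
| - | ✓ | 94.13 | 2.98 | 7.8 |
| ✓ | - | 93.78 | 2.94 | 7.4 |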

Analysis of Table 2 reveals several important insights. Although replacing all C2f modules with ALSS results in the lowest parameter count and computational complexity, the detection accuracy decreases most significantly compared to the baseline model. When ALSS is used to replace only the C2f modules in the neck network, the model achieves the highest keypoint detection accuracy while satisfying lightweight requirements. Therefore, in our EAD-YOLO model, we employ ALSS to replace only the C2f modules in the neck network, striking an optimal balance between accuracy and efficiency for drone technology applications.

5.2 Ablation Study

To validate the effectiveness of each improvement component, we designed a comprehensive ablation study. The experimental results are presented in Table 3, where Model 1 represents the original YOLOv8n-Pose, Model 2 incorporates SPPELAN-H, Model 3 adds both SPPELAN-H and ALSS modules, and Model 4 (EAD-YOLO) integrates SPPELAN-H, ALSS modules, and Dysample lightweight dynamic upsampler.

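| Model | F1 (%) | P (%) | R (%) | mAP@50 (%) | Params (×10⁶) | GFLOPS |
| --- | --- | --- | --- | --- | --- | --- |
| 1 (Baseline) | 93.67 | 94.03 | 93.31 | 93.35 | 3.30 | 9.2 |
| 2 (+SPPELAN-H) | 95.31 | 95.87 | 94.76 | 95.26 | 3.78 | 9.8 |
| 3 (+ALSS) | 95.88 | 95.94 | 95.82 | 95.81 | 3.73 | 9.4 |
| 4 (EAD-YOLO) | 96.03 | 96.42 | 95.64 | 96.37 | 3.59 | 9.3 |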

The results demonstrate that our improved EAD-YOLO model achieves significant performance gains over the baseline. The F1 score reaches 96.03%, precision reaches 96.42%, recall reaches 95.64%, and mAP@50 reaches 96.37%, an improvement of 3.02 percentage points over the original model's 93.35%. Importantly, these gains are achieved while maintaining computational efficiency, with only 3.59 million parameters and 9.3 GFLOPS. This confirms that EAD-YOLO successfully enhances keypoint detection accuracy without compromising real-time performance, meeting the demanding requirements of drone technology-based human keypoint detection tasks.

During the ablation study training process, the mAP@50 curves reveal that the improved model converges faster and achieves higher accuracy compared to the original model, further validating the effectiveness of our proposed enhancements for drone technology applications.

5.3 Comparison with Different Scale Models

To further validate the effectiveness of EAD-YOLO, we compared it with several existing pose estimation models of different scales, including DeepPose, OpenPose, HRNet-W32, YOLOv8s-Pose, and YOLOv9-Pose. The results are presented in Table 4.

| Model | mAP@50 (%) | Params (×10⁶) | GFLOPS | FPS |
| --- | --- | --- | --- | --- |
| DeepPose | 80.35 | 180 | 31.2 | 27.1 |
| OpenPose | 90.24 | 185 | 40.2 | 19.2 |
| HRNet-W32 | 89.32 | 28.5 | 47.9 | 55.3 |
| YOLOv8s-Pose | 94.67 | 11.6 | 30.4 | 63.7 |
| YOLOv9-Pose | 97.42 | 32.3 | 122.4 | 46.9 |
| EAD-YOLO (Ours) | 96.37 | 3.59 | 9.3 | 78.5 |

As shown in Table 4, compared to DeepPose, OpenPose, HRNet-W32, and YOLOv8s-Pose, our EAD-YOLO achieves mAP@50 improvements of 16.02%, 6.13%, 7.05%, and 1.70% respectively, while simultaneously exhibiting lower detection time and computational complexity. Compared to YOLOv9-Pose, our algorithm shows slightly lower detection accuracy. YOLOv9-Pose employs Programmable Gradient Information (PGI), which introduces multiple gradient propagation paths to reduce feature loss during layer-by-layer feature extraction and spatial transformation, maintaining high feature representation throughout the network. However, this comes at the cost of significantly increased computational requirements.
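The accuracy gaps and cost ratios discussed here follow from simple arithmetic on the Table 4 values, which the following sketch reproduces. The dictionary layout and function name are illustrative; the numbers are copied from the table.

```python
# (mAP@50 in %, parameters in millions, GFLOPS, FPS), copied from Table 4.
table4 = {
    "DeepPose": (80.35, 180.0, 31.2, 27.1),
    "OpenPose": (90.24, 185.0, 40.2, 19.2),
    "HRNet-W32": (89.32, 28.5, 47.9, 55.3),
    "YOLOv8s-Pose": (94.67, 11.6, 30.4, 63.7),
    "YOLOv9-Pose": (97.42, 32.3, 122.4, 46.9),
    "EAD-YOLO": (96.37, 3.59, 9.3, 78.5),
}


def compare_to(reference, table):
    """For every other model, report the reference model's mAP/FPS gain and cost ratios."""
    ref_map, ref_params, ref_gflops, ref_fps = table[reference]
    rows = {}
    for name, (map50, params, gflops, fps) in table.items():
        if name == reference:
            continue
        rows[name] = {
            "map_gain": round(ref_map - map50, 2),             # percentage points
            "params_ratio": round(params / ref_params, 1),     # other / reference
            "gflops_ratio": round(gflops / ref_gflops, 1),     # other / reference
            "fps_gain": round(ref_fps - fps, 1),
        }
    return rows


comparison = compare_to("EAD-YOLO", table4)
for name, row in comparison.items():
    print(name, row)
```

A negative `map_gain` marks a model that is more accurate than the reference; the accompanying ratios show how much larger and more expensive that model is.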

Balancing detection accuracy and real-time inference requirements, our improved algorithm still effectively meets the demands of aerial pedestrian keypoint detection at flight altitudes of 10 m to 18 m. The results demonstrate that EAD-YOLO achieves superior efficiency in computational resource utilization, validating the effectiveness of our proposed improvements for practical drone technology deployment.

5.4 Detection Result Analysis

To provide an intuitive visualization of the improved model’s effectiveness, we selected scenarios with different flight altitudes, occlusion conditions, and illumination settings for human keypoint recognition. The comparison between the original model and EAD-YOLO reveals significant improvements across all challenging conditions.

At flight altitudes of 10 m and 15 m, the original model exhibits unsatisfactory recognition of small target pedestrians and human keypoints, with low detection confidence, incomplete human detection, and missing human keypoints. In contrast, EAD-YOLO accurately identifies pedestrians and keypoints under these conditions. Under occlusion and low-light conditions, the original model produces false keypoint detections and incorrect connections between keypoints. EAD-YOLO maintains high accuracy and robustness, demonstrating its effectiveness for challenging drone technology scenarios.

Heatmap visualization further confirms the superiority of EAD-YOLO. Compared to the original model, our improved model accurately concentrates heat on human keypoints with broader heat distribution ranges, further validating the effectiveness of our approach for drone technology applications.

5.5 Posture Recognition Result Analysis

We tested the complete posture recognition pipeline, combining keypoint detection with the feature scoring mechanism for behavior classification. The results under different flight altitudes, strong illumination, and low-light conditions demonstrate the practical utility of our approach for drone technology applications.

At a flight altitude of 10 m, the original model produces inaccurate localization of hip and ankle keypoints. At 15 m altitude, the original model mislocates hand keypoints and exhibits low detection confidence. Under strong illumination, the original model incorrectly identifies the right shoulder keypoint, causing the vertical distance between shoulder and hip centers to exceed the threshold and the angle between the human centerline and horizontal axis to deviate excessively, resulting in incorrect behavior classification, mistaking falling for standing. Under low-light conditions, the original model’s identification of shoulder and hip keypoints differs significantly from ground truth.

While the original model may not produce incorrect behavior classifications in some cases, the inaccuracies in keypoint localization can easily lead to behavior misjudgment in practical applications. In contrast, EAD-YOLO combined with our posture recognition algorithm accurately locates human keypoints and correctly identifies standing and falling behaviors according to the decision criteria.

To quantify the posture recognition performance, we selected 450 images containing falling behavior from the UAV-Human dataset and calculated the angle information of pedestrian keypoint connections for each image. The human posture recognition results are presented in Table 5.

| Method | Posture | AP (%) | AR (%) |
| --- | --- | --- | --- |
| YOLOv8n-Pose + Our Posture Recognition | Standing | 90.2 | 88.0 |
| YOLOv8n-Pose + Our Posture Recognition | Falling | 86.5 | 83.4 |
| EAD-YOLO + Our Posture Recognition | Standing | 96.5 | 92.2 |
| EAD-YOLO + Our Posture Recognition | Falling | 95.0 | 88.6 |

From Table 5, it can be observed that compared to the baseline model, EAD-YOLO combined with our posture recognition method improves the AP for standing posture by 6.3 percentage points and AR by 4.2 percentage points; for falling posture, AP improves by 8.5 percentage points and AR by 5.2 percentage points (from 83.4% to 88.6%). These results demonstrate that our proposed method achieves superior performance for human posture recognition in drone technology applications.

6. Conclusion

In this paper, we have presented EAD-YOLO, a novel human keypoint detection method specifically designed to address the challenges of human posture recognition from drone technology viewpoints. Our approach tackles three critical issues that frequently cause missed and false detections: varying flight altitudes, pedestrian occlusion, and changing illumination conditions.

Through the introduction of the SPPELAN-H module, we have successfully resolved the difficulty of multi-scale feature extraction caused by different flight altitudes. The integration of Efficient Layer Aggregation Networks with the SPPF structure enhances the network’s ability to represent human bodies at varying scales, ensuring robust detection performance across the 10 m to 18 m altitude range typical of drone technology applications.

The replacement of C2f modules with the Adaptive Lightweight Channel Split and Shuffling module (ALSS) has significantly improved the network’s ability to handle occluded keypoints. By employing weighted channel splitting and cross-channel information exchange through shuffling operations, our model maintains high detection accuracy even when pedestrians overlap or are partially occluded, a common occurrence in crowded scenes captured by drone technology.

The adoption of the lightweight dynamic upsampler Dysample has enhanced the model’s capability to preserve fine-grained details under varying illumination conditions. By incorporating spatial context information during upsampling, our approach maintains feature integrity without increasing computational overhead, making it particularly suitable for drone technology systems that must operate under diverse environmental conditions.

Furthermore, we have developed a robust posture recognition framework that combines the detected keypoints with a feature scoring mechanism to distinguish between standing and falling behaviors. This framework leverages geometric relationships among keypoints to provide reliable behavior classification, enhancing the practical utility of our approach for public safety applications enabled by drone technology.

Extensive experimental results demonstrate that EAD-YOLO achieves a 3.02% improvement in mAP@50 compared to the baseline YOLOv8n-Pose model, while maintaining only 3.59 million parameters and 9.3 GFLOPS. The model achieves an F1 score of 96.03%, precision of 96.42%, and recall of 95.64%, with a real-time inference speed of 78.5 FPS. These performance characteristics make EAD-YOLO highly suitable for deployment on computational resource-constrained platforms typical of drone technology systems.

Our comprehensive ablation studies validate the contribution of each proposed component, and comparisons with existing methods including DeepPose, OpenPose, HRNet-W32, YOLOv8s-Pose, and YOLOv9-Pose demonstrate the superior balance between accuracy and efficiency achieved by our approach. The posture recognition results further confirm the practical applicability of our method for real-world drone technology applications, with AP improvements of 6.3% for standing and 8.5% for falling postures compared to the baseline.

Despite these promising results, we acknowledge some limitations of our current work. The dataset used in this study is primarily collected from campus and surrounding areas, which may not fully represent the diversity of scenarios encountered in real-world drone technology applications. Different geographic regions, cultural environments, and seasonal variations could introduce additional challenges that our model may not have encountered during training.

From the comparison results, we also note that YOLOv9-Pose achieves higher detection accuracy, albeit with significantly higher computational requirements. Future research could explore ways to incorporate the strengths of Programmable Gradient Information into our lightweight framework, potentially achieving better accuracy without sacrificing efficiency.

Looking forward, several directions for future research emerge from this work. First, we plan to expand our dataset to include more diverse scenarios, including different geographic locations, weather conditions, and cultural settings, to enhance the generalization capability of our model for widespread drone technology deployment. Second, we aim to explore multi-task learning approaches that simultaneously optimize for keypoint detection, posture recognition, and behavior prediction, creating a more comprehensive framework for human activity analysis from aerial perspectives. Third, we are interested in investigating temporal modeling techniques that leverage video sequences rather than single frames to improve the robustness and accuracy of posture recognition in dynamic drone technology scenarios.

Additionally, we recognize the importance of model compression and deployment optimization for practical drone technology systems. Future work will explore techniques such as quantization, pruning, and knowledge distillation to further reduce model size and computational requirements while maintaining detection accuracy, enabling deployment on even more resource-constrained drone platforms. We also plan to investigate the integration of our method with drone flight control systems, enabling adaptive altitude and angle adjustments based on detected human postures to optimize monitoring effectiveness.

In conclusion, EAD-YOLO represents a significant advancement in the application of drone technology for human posture recognition. By addressing the specific challenges of multi-scale detection, occlusion handling, and illumination variation, our approach provides a reliable and efficient solution for real-time human keypoint detection and behavior analysis from aerial perspectives. We believe that this work contributes valuable insights to the field of computer vision for drone technology and provides a solid foundation for future research and practical applications in public safety, surveillance, and human-computer interaction.
