EAD-YOLO: A Robust Human Pose Recognition Method for China UAVs

In recent years, the application of China UAV systems to public safety and surveillance has become increasingly important. Accurately recognizing human poses from aerial viewpoints is crucial for real-time monitoring of crowded environments such as squares and parks. However, variations in flight altitude, occlusions, and changing illumination often lead to missed or false keypoint detections. To address these issues, we propose EAD-YOLO, an enhanced human keypoint detection model based on an improved YOLOv8n-Pose and designed specifically for China UAV platforms.

Our proposed model integrates three key improvements. First, we fuse an efficient layer aggregation network (ELAN) into the SPPF module to enhance multi-scale feature extraction, allowing the network to better handle the scale variations caused by the different flight heights typical of China UAV operations. Second, we introduce the Adaptive Lightweight Channel Split and Shuffling (ALSS) module, which uses weighted channel splitting to improve recognition of occluded keypoints, a common problem in dense pedestrian scenes. Third, we employ a lightweight dynamic upsampler called Dysample to reduce background noise while maintaining low computational cost. Extensive experiments show that our model has 3.59 million parameters and requires 9.3 GFLOPS, and that it improves mAP@50 by 3.02 percentage points over the baseline, meeting the real-time requirements of China UAV-based pedestrian keypoint detection.

The rest of this paper is organized as follows. Section I introduces the background and motivation. Section II reviews related work. Section III details the improved architecture. Section IV describes the experimental setup. Section V presents the ablation study, comparative results, and analysis. Section VI concludes the paper.

I. Introduction

Densely populated public areas such as squares often witness dangerous events like pedestrian falls. The flexibility and wide field of view of China UAV systems offer significant advantages over traditional fixed surveillance cameras in such scenarios. However, varying flight altitudes (between 10 and 18 meters in our study) pose significant challenges for keypoint detection. Existing methods, such as AlphaPose combined with LightGBM, have proven effective on low-quality images, but they often struggle with the characteristics of aerial imagery, including scale changes, occlusion, and dynamic background motion.

To overcome these limitations, we base our work on YOLOv8n-Pose, a top-down approach known for its real-time efficiency. We propose three modifications to adapt it for China UAV viewpoints. The first modification is the SPPELAN-H module, which enhances the spatial pyramid pooling section to capture features at multiple scales. The second is the ALSS module, designed to focus on occluded keypoints by using weighted channel splitting. The third is the Dysample upsampler, which reduces computation while preserving edge information important for small keypoints. These contributions collectively improve detection accuracy without significantly increasing the computational burden, making our model suitable for edge deployment on China UAV hardware.

II. Related Work

Human pose estimation from aerial platforms has been an active research area. DeepPose and OpenPose were among the first to apply deep learning to human pose estimation, but their computational demands prevent real-time performance on resource-constrained China UAV systems. HRNet-W32 maintains high-resolution representations, yet its complexity often limits deployment in lightweight scenarios. Recent models like YOLOv9-Pose leverage programmable gradient information to maintain strong feature representations, but at the cost of more parameters and FLOPs. Our work builds on the YOLO series, which is widely adopted for real-time tasks, and specifically targets the challenges of China UAV perspectives, such as varying flight altitudes and occlusion patterns.

Previous studies have attempted to address these challenges. For example, spatial-temporal graph convolutional networks have been used for violent behavior recognition from aerial images. However, these approaches often require complex pre-processing and are sensitive to motion blur. In contrast, our model processes raw video frames directly, making it more practical for real-world China UAV deployments.

III. Proposed Method

In this section, we detail the three key modifications introduced in EAD-YOLO: the SPPELAN-H module, the ALSS module, and the Dysample upsampler. The overall network architecture is illustrated conceptually, with a focus on how these modules interact to improve keypoint detection under varied conditions.

A. Enhanced SPPF Module with ELAN (SPPELAN-H)

To handle the scale variations commonly encountered in China UAV imagery, we augment the standard SPPF module with an efficient layer aggregation network. The SPPELAN-H module replaces the original activation function with HardSwish to stabilize gradient propagation. The HardSwish activation function is defined as:

$$
\text{HardSwish}(x) = \begin{cases} 0, & x \leq -3 \\ x \cdot \frac{x+3}{6}, & -3 < x < 3 \\ x, & x \geq 3 \end{cases}
$$

This piecewise formulation is quadratic on −3 < x < 3 and linear elsewhere. It approximates SiLU without evaluating an exponential, which keeps the activation inexpensive while retaining non-linear expressiveness. The module uses multiple max-pooling layers with different kernel sizes, followed by concatenation and convolution, to build a more robust multi-scale feature representation. The CSP structure ensures computational efficiency while preserving input features. Experimental results (Model 2 in Table I) confirm that this module contributes to higher mAP under varying flight altitudes.
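As a minimal sketch of the activation defined above, the following pure-Python functions evaluate HardSwish and its derivative for a scalar input. The helper names (`relu6`, `hardswish`, `hardswish_grad`) are illustrative and not taken from the EAD-YOLO implementation; the ReLU6 form used here is algebraically equivalent to the three-case definition.

```python
def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hardswish(x: float) -> float:
    """HardSwish(x) = x * ReLU6(x + 3) / 6 (same as the three-case form)."""
    return x * relu6(x + 3.0) / 6.0


def hardswish_grad(x: float) -> float:
    """Derivative of HardSwish.

    The derivative jumps at x = -3 and x = 3; this sketch uses the value
    of the branch that owns each breakpoint in the case definition.
    """
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    return (2.0 * x + 3.0) / 6.0  # quadratic middle branch


if __name__ == "__main__":
    for x in (-4.0, -1.5, 0.0, 1.0, 3.0, 5.0):
        print(f"x={x:+.1f}  y={hardswish(x):+.4f}  dy/dx={hardswish_grad(x):+.4f}")
```

The middle branch has a non-constant slope, which is what distinguishes HardSwish from a purely piecewise-linear activation such as ReLU6.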

B. ALSS Module for Occlusion Handling

Occlusions are prevalent in crowded scenes captured by China UAV platforms. We replace the original C2f module in the neck of the network with the Adaptive Lightweight Channel Split and Shuffling Network (ALSS). This module splits the channels of the input feature tensor into two branches in the proportions α and 1−α. We set α = 0.2, so that only a small portion of the channels enters the lightweight branch A, while the majority are processed by the more complex branch B. Branch A performs average pooling followed by convolution, while branch B uses a bottleneck structure with depth-wise convolutions. The outputs of the two branches are then combined and channel-shuffled to enhance information interaction. The operation in branch A can be expressed as:

$$
X_{Aout} = Conv(Pool(X_{Ain}))
$$

This design forces the network to focus on difficult keypoints, such as those occluded by other pedestrians or objects, resulting in improved performance in keypoint localization under occlusion.
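The channel bookkeeping behind the split and shuffle can be sketched as follows, operating on channel indices rather than real tensors. This is an illustration under assumptions: the rounding rule for the α share, the choice of two shuffle groups, and the function names are hypothetical, and the pooling and convolution inside each branch are omitted.

```python
from __future__ import annotations


def split_channels(channels: list[int], alpha: float = 0.2) -> tuple[list[int], list[int]]:
    """Split channels into branch A (alpha share) and branch B (the rest).

    Rounding to the nearest integer, with at least one channel in branch A,
    is an assumption made for this sketch.
    """
    n_a = max(1, round(alpha * len(channels)))
    return channels[:n_a], channels[n_a:]


def channel_shuffle(channels: list[int], groups: int) -> list[int]:
    """Interleave channels across groups.

    Equivalent to reshaping to [groups, n // groups], transposing,
    and flattening again.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


if __name__ == "__main__":
    branch_a, branch_b = split_channels(list(range(10)), alpha=0.2)
    print("branch A:", branch_a)
    print("branch B:", branch_b)
    # Each branch would transform its channels here; the sketch only
    # recombines the indices and shuffles them.
    print("shuffled:", channel_shuffle(branch_a + branch_b, groups=2))
```

After the shuffle, channels that originated in branch A sit next to channels from branch B, which is how the two branches exchange information in later layers.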

C. Dysample Upsampler for Low-light Robustness

Low-light conditions are common at dawn, at dusk, and during nighttime surveillance missions with China UAVs. The original nearest-neighbor upsampler does not preserve edge information well, which causes loss of detail for small keypoints. We introduce Dysample, a lightweight dynamic upsampler that leverages spatial context. The upsampling process is defined as:

$$
X_2 = \text{grid\_sample}(X_1, S)
$$

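where S is the sampling set generated by a dynamic offset O. In its basic form, the offset is a linear projection of the input feature X:

$$
O = \text{linear}(X)
$$

To modulate the offset range at each position, a dynamic scope factor is applied through a second, sigmoid-gated projection:

$$
O = 0.5 \cdot \text{sigmoid}(\text{linear}_1(X)) \cdot \text{linear}_2(X)
$$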

In the ablation study (Table I), adding Dysample lowers the parameter count from 3.73 M to 3.59 M and the computation from 9.4 to 9.3 GFLOPS while raising mAP@50, which makes it well suited to real-time deployment on China UAV platforms.
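The two ingredients of this upsampler, a bounded dynamic offset and sampling at a shifted position, can be illustrated with a small pure-Python sketch. Several points are assumptions made for illustration: the sampling position is taken as the regular grid position plus the offset, coordinates are in pixels rather than the normalized range used by typical `grid_sample` implementations, sampling is bilinear with border clamping, and the scalars `gate` and `raw` stand in for the outputs of the two linear projections.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dynamic_offset(gate: float, raw: float) -> float:
    """O = 0.5 * sigmoid(linear_1(X)) * linear_2(X) for one position.

    `gate` and `raw` are placeholders for the two projection outputs.
    """
    return 0.5 * sigmoid(gate) * raw


def bilinear_sample(feat: list, x: float, y: float) -> float:
    """Sample a 2-D map at a fractional (x, y) pixel position."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map border
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = feat[y0][x0] * (1.0 - fx) + feat[y0][x1] * fx
    bottom = feat[y1][x0] * (1.0 - fx) + feat[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


if __name__ == "__main__":
    feature = [[0.0, 1.0], [2.0, 3.0]]
    ox = dynamic_offset(gate=0.0, raw=1.0)  # horizontal offset
    print("offset:", ox)
    print("value at shifted position:", bilinear_sample(feature, 0.0 + ox, 0.0))
```

Because the sigmoid lies in (0, 1), the gated offset never exceeds half of the raw projection in magnitude, which limits how far a sampling point can drift from its grid position.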

Figure 1 shows a typical China UAV system used for aerial data acquisition. The overhead view presents unique challenges, including scale and perspective distortions, which our model is designed to mitigate.

IV. Experimental Setup

A. Dataset and Implementation Details

We constructed a custom dataset of aerial pedestrian poses captured from China UAV platforms at two altitude ranges: 10–15 m and 15–18 m. The dataset consists of 1500 images, with 1085 captured at 15–18 m and 415 at 10–15 m, under variable weather conditions including sunlight and overcast skies. The images contain diverse scenes with varying numbers of pedestrians and occlusions. Frame resolution is 1920×1080. We split the data into training (80%) and validation (20%) sets. Annotation uses the 17-keypoint COCO format. Training hyperparameters are summarized in the table below.

Training Configuration
Hyperparameter	Value
Input Size	640×640
Initial Learning Rate	0.01
Batch Size	8
Optimizer	Adam
Momentum Factor	0.937
Epochs	500
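For reference, the configuration above and the 80/20 split can be written down as a short Python sketch. The dictionary keys, the function name `split_dataset`, and the seeded random shuffle are assumptions made for illustration; the text does not state how images were assigned to the two sets.

```python
import math
import random

# Values copied from the training configuration table; key names are illustrative.
TRAIN_CONFIG = {
    "input_size": (640, 640),
    "initial_lr": 0.01,
    "batch_size": 8,
    "optimizer": "Adam",
    "momentum": 0.937,
    "epochs": 500,
}


def split_dataset(image_ids, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then split into (train, val) lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle is an assumption
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    train_ids, val_ids = split_dataset(range(1500))
    steps = math.ceil(len(train_ids) / TRAIN_CONFIG["batch_size"])
    print(len(train_ids), "training images,", len(val_ids), "validation images")
    print(steps, "iterations per epoch at batch size", TRAIN_CONFIG["batch_size"])
```

With 1500 images, this split yields 1200 training and 300 validation images.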

B. Evaluation Metrics

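We evaluate our models using Precision (P), Recall (R), F1-Score, mAP@50, and computational complexity (Params and GFLOPS). With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, and N the number of AP terms being averaged, the accuracy metrics are defined as: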

$$
P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}
$$

$$
F1 = 2 \cdot \frac{P \cdot R}{P+R}
$$

$$
AP = \int_0^1 P(R)\, dR, \quad mAP = \frac{1}{N} \sum_{n=1}^N AP_n
$$
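The definitions above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the AP integral is approximated with the trapezoidal rule over sampled (R, P) points, which is a simplification of the interpolated schemes used by common evaluation toolkits.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions) -> float:
    """Trapezoidal approximation of the area under the P(R) curve."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(aps) -> float:
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Precision and recall of the baseline (Model 1) from Table I, in percent.
    print("F1 of Model 1:", round(f1_score(94.03, 93.31), 2))
```

Applied to the precision and recall reported for each model in Table I, `f1_score` reproduces the tabulated F1 values to two decimal places.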

V. Results and Discussion

A. Ablation Study

We conduct ablation experiments to evaluate each proposed component. Model 1 represents the baseline YOLOv8n-Pose, Model 2 adds SPPELAN-H, Model 3 adds both SPPELAN-H and ALSS, and Model 4 (EAD-YOLO) includes all three improvements. Results are shown in Table I.

Table I: Ablation Study Results
Model	F1 (%)	P (%)	R (%)	mAP@50 (%)	Params (10^6)	GFLOPS
1	93.67	94.03	93.31	93.35	3.30	9.2
2	95.31	95.87	94.76	95.26	3.78	9.8
3	95.88	95.94	95.82	95.81	3.73	9.4
4	96.03	96.42	95.64	96.37	3.59	9.3

The results show that each module improves detection accuracy: mAP@50 rises from 93.35% for the baseline to 95.26%, 95.81%, and 96.37% as the modules are added. Model 4 gains 3.02 percentage points in mAP@50 over the baseline at a modest cost, with parameters growing from 3.30 M to 3.59 M and computation from 9.2 to 9.3 GFLOPS. Its F1 score reaches 96.03%, reflecting balanced precision (96.42%) and recall (95.64%), although its recall is slightly below that of Model 3 (95.82%). The inclusion of ELAN in SPPF helps capture scale variations, ALSS enhances occlusion handling, and Dysample improves upsampling quality.
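The arithmetic behind these comparisons can be checked with a few lines of Python. The values are copied from Table I; the dictionary layout and function names are illustrative. The sketch separates an absolute gain in percentage points from a relative gain in percent, since the two are easy to confuse.

```python
# mAP@50 (%), parameters (millions), and GFLOPS from Table I, keyed by model number.
ABLATION = {
    1: {"map50": 93.35, "params_m": 3.30, "gflops": 9.2},
    2: {"map50": 95.26, "params_m": 3.78, "gflops": 9.8},
    3: {"map50": 95.81, "params_m": 3.73, "gflops": 9.4},
    4: {"map50": 96.37, "params_m": 3.59, "gflops": 9.3},
}


def gain(metric: str, model: int, reference: int = 1) -> float:
    """Absolute difference between two models, in the metric's own units."""
    return round(ABLATION[model][metric] - ABLATION[reference][metric], 2)


def relative_gain_percent(metric: str, model: int, reference: int = 1) -> float:
    """Difference expressed as a percentage of the reference value."""
    base = ABLATION[reference][metric]
    return round(100.0 * (ABLATION[model][metric] - base) / base, 2)


if __name__ == "__main__":
    for model in (2, 3, 4):
        print(f"Model {model}: mAP@50 gain over the previous model = "
              f"{gain('map50', model, model - 1)} points")
    print("Model 4 vs. baseline:", gain("map50", 4), "points,",
          relative_gain_percent("map50", 4), "% relative")
```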

B. Comparison with State-of-the-Art Models

We compare our model with other popular architectures. The results are summarized in Table II.

Table II: Comparison with Different Scale Models
Model	mAP@50 (%)	Params (10^6)	GFLOPS	FPS
DeepPose	80.35	180	31.2	27.1
OpenPose	90.24	185	40.2	19.2
HRNet-W32	89.32	28.5	47.9	55.3
YOLOv8s-Pose	94.67	11.6	30.4	63.7
YOLOv9-Pose	97.42	32.3	122.4	46.9
EAD-YOLO (Ours)	96.37	3.59	9.3	78.5

Our model achieves a competitive mAP@50 (96.37%) with far fewer parameters and FLOPs than YOLOv9-Pose and HRNet-W32. Its inference speed (78.5 FPS) is the highest among the compared models, which suits real-time applications on China UAV platforms. Although YOLOv9-Pose is 1.05 percentage points more accurate, it needs roughly 9 times the parameters (32.3 M vs. 3.59 M) and 13 times the computation (122.4 vs. 9.3 GFLOPS), which hinders deployment on the lightweight embedded systems typical of China UAVs. Our approach therefore offers a favorable balance between accuracy and efficiency.
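One way to make the accuracy-efficiency trade-off concrete is to ask which models in Table II are not dominated, that is, for which no other model is at least as accurate and at least as cheap. The sketch below applies this check to the tabulated values; the dominance analysis and the function name `pareto_front` are additions for illustration, not part of the evaluation protocol described above.

```python
# mAP@50 (%), parameters (millions), GFLOPS, and FPS copied from Table II.
MODELS = {
    "DeepPose": {"map50": 80.35, "params_m": 180.0, "gflops": 31.2, "fps": 27.1},
    "OpenPose": {"map50": 90.24, "params_m": 185.0, "gflops": 40.2, "fps": 19.2},
    "HRNet-W32": {"map50": 89.32, "params_m": 28.5, "gflops": 47.9, "fps": 55.3},
    "YOLOv8s-Pose": {"map50": 94.67, "params_m": 11.6, "gflops": 30.4, "fps": 63.7},
    "YOLOv9-Pose": {"map50": 97.42, "params_m": 32.3, "gflops": 122.4, "fps": 46.9},
    "EAD-YOLO": {"map50": 96.37, "params_m": 3.59, "gflops": 9.3, "fps": 78.5},
}


def pareto_front(models, gain_key="map50", cost_key="gflops"):
    """Names of models that no other model beats on both accuracy and cost."""
    front = []
    for name, m in models.items():
        dominated = any(
            o[gain_key] >= m[gain_key]
            and o[cost_key] <= m[cost_key]
            and (o[gain_key] > m[gain_key] or o[cost_key] < m[cost_key])
            for other, o in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    print("Non-dominated (mAP@50 vs. GFLOPS):", pareto_front(MODELS))
    ratio = MODELS["YOLOv9-Pose"]["gflops"] / MODELS["EAD-YOLO"]["gflops"]
    print(f"YOLOv9-Pose needs {ratio:.1f}x the GFLOPS of EAD-YOLO")
```

On these numbers only YOLOv9-Pose and EAD-YOLO remain non-dominated, which matches the discussion above: one is the most accurate, the other the cheapest.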

C. Qualitative Results

We visualize detection results under various conditions. Under low-altitude (10 m) conditions, the baseline model often misses keypoints on small pedestrians, while our model detects them accurately. In occlusion scenarios, such as when pedestrians overlap, the baseline model produces misplaced skeleton edges, whereas our model maintains keypoint consistency. Under low-light conditions, the Dysample upsampler better preserves edge information, leading to fewer false detections.

D. Pose Recognition after Keypoint Detection

We apply a decision algorithm to classify standing and falling poses based on keypoint geometry. The features include: (1) the angle between the body midline and horizontal axis, (2) the aspect ratio of the bounding box, and (3) the vertical distance between shoulder and hip centers. The angle is computed as:

$$
\theta = \arccos\left( \frac{\text{hip}_{sc_x} - \text{shoulder}_{sc_x}}{\sqrt{(\text{hip}_{sc_x} - \text{shoulder}_{sc_x})^2 + (\text{hip}_{sc_y} - \text{shoulder}_{sc_y})^2}} \right)
$$

The aspect ratio of the bounding box, whose opposite corners are (X_1, Y_1) and (X_2, Y_2), is:

$$
Ar = \frac{|Y_2 - Y_1|}{|X_2 - X_1|}
$$

The vertical distance between the shoulder and hip centers is:

$$
d = |\text{shoulder}_{sc_y} - \text{hip}_{sc_y}|
$$
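The three geometric features can be computed as in the following sketch. The function name, argument layout, and degree units are assumptions made for illustration, and the scoring rule that turns the features into a standing or falling decision is not specified here, so it is not implemented.

```python
import math


def pose_features(shoulder_center, hip_center, box):
    """Return (theta_deg, aspect_ratio, vertical_distance).

    shoulder_center, hip_center: (x, y) image coordinates.
    box: (x1, y1, x2, y2) opposite corners of the bounding box.
    """
    sx, sy = shoulder_center
    hx, hy = hip_center
    x1, y1, x2, y2 = box

    dx, dy = hx - sx, hy - sy
    norm = math.hypot(dx, dy)
    if norm == 0.0 or x1 == x2:
        raise ValueError("degenerate midline or zero-width bounding box")

    theta = math.degrees(math.acos(dx / norm))  # angle to the horizontal axis
    aspect_ratio = abs(y2 - y1) / abs(x2 - x1)  # height over width
    vertical_distance = abs(sy - hy)
    return theta, aspect_ratio, vertical_distance


if __name__ == "__main__":
    # Hypothetical upright person: vertical midline, tall bounding box.
    print(pose_features((100, 50), (100, 110), (80, 20, 120, 200)))
    # Hypothetical fallen person: horizontal midline, wide bounding box.
    print(pose_features((50, 100), (110, 100), (20, 80, 200, 120)))
```

For an upright midline the angle is 90 degrees and the aspect ratio exceeds 1; for a horizontal midline the angle is 0 or 180 degrees, the aspect ratio falls below 1, and the vertical distance shrinks toward zero.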

These three features are scored to decide the pose. We tested on 450 images containing falling and standing actions. Results are shown in Table III.

Table III: Pose Recognition Results
Method	Pose	AP (%)	AR (%)
Baseline + Decision	Standing	90.2	88.0
Baseline + Decision	Falling	86.5	83.4
EAD-YOLO + Decision	Standing	96.5	92.2
EAD-YOLO + Decision	Falling	95.0	88.6

The improved keypoint detection directly boosts pose recognition accuracy. For the falling pose, AP increases by 8.5 percentage points (86.5% to 95.0%) and AR by 5.2 percentage points (83.4% to 88.6%), confirming the effectiveness of our model in real-world China UAV applications.

VI. Conclusion

This paper presents EAD-YOLO, a lightweight and accurate human keypoint detection model tailored for China UAV perspectives. By incorporating SPPELAN-H for scale adaptability, ALSS for occlusion handling, and Dysample for illumination robustness, our model achieves accuracy competitive with much larger models using only 3.59 million parameters and 9.3 GFLOPS. The model attains 96.37% mAP@50, an improvement of 3.02 percentage points over the baseline, while maintaining real-time inference at 78.5 FPS. Our work demonstrates that careful architectural modifications can significantly enhance detection in challenging aerial conditions, providing a practical solution for public safety monitoring via China UAVs. Future work will focus on expanding the dataset to cover more diverse scenarios and exploring deployment on low-power embedded devices.