EAD-YOLO: Human Pose Recognition from China Drone Viewpoints

We propose an enhanced human pose recognition method, termed EAD-YOLO, specifically tailored for China drone perspectives. The deployment of China drones in public surveillance and safety applications demands robust keypoint detection under variable flight altitudes, occlusions, and illumination changes. Our approach improves upon the YOLOv8n-Pose baseline by integrating three core innovations: a modified SPPELAN-H module for multi-scale feature extraction, an ALSS module for occlusion handling, and a Dysample upsampler for dynamic resolution enhancement. Experimental results demonstrate that EAD-YOLO achieves superior performance with minimal computational overhead, making it highly suitable for real-time China drone-based monitoring.

The primary challenges in China drone-based human pose estimation arise from the wide range of viewing angles and distances. For instance, at heights of 10 to 18 meters, human figures become small and scale-variant, while occlusions from crowds or objects hinder keypoint detection. Additionally, lighting variations such as strong sunlight or low-light conditions degrade image quality. To address these issues, we systematically modify the network architecture of YOLOv8n-Pose. Our contributions are validated on a custom dataset captured using China drones over campus and public squares, containing 1500 images with 17 keypoints per subject.

The first enhancement is the SPPELAN-H module, which combines the spatial pyramid pooling fast (SPPF) structure with an efficient layer aggregation network (ELAN). The original SPPF suffers from limited feature scale interactions, especially under varying China drone altitudes. By integrating ELAN, we fuse features from multiple pooling scales through a cross-stage partial (CSP) connection, preserving gradient flow and enriching feature representations. The activation function is replaced with HardSwish to stabilize training, as it provides piecewise linear gradients. The HardSwish function is defined as:

$$HardSwish(x) = \begin{cases} 0, & x \leq -3 \\ \frac{x(x+3)}{6}, & -3 < x < 3 \\ x, & x \geq 3 \end{cases}$$

This substitution eliminates the numerical instability of SiLU in large-scale China drone image processing, where exponential calculations can lead to gradient vanishing. The SPPELAN-H module processes the input through a convolution-batch-normalization-HardSwish (CBH) unit, followed by multiple max-pooling layers with kernel sizes 5, 9, and 13. The outputs are concatenated and further refined by a convolutional layer, ensuring that both fine-grained and coarse features are captured. This design is crucial for detecting small keypoints such as wrists and ankles in high-altitude China drone views.

To combat occlusion-induced keypoint miss-detection, we introduce the Adaptive Lightweight Channel Split and Shuffling (ALSS) module in the neck network, replacing the original C2f blocks. Occlusion is a frequent issue in crowded China drone scenarios where humans overlap partially. The ALSS module dynamically splits the input feature map into two branches with learnable weights. Let the total channel count be C, and the splitting ratio be α. We set α = 0.2, meaning only 20% of channels go through a lightweight branch (A) that performs average pooling followed by a 3×3 convolution, while the remaining 80% enter a more complex bottleneck branch (B). The operations are expressed as:

$$X_{Aout} = Conv(Pool(X_{Ain}))$$
$$X_{Bphase1} = Conv(X_{Bin})$$
$$X_{Bphase2} = DWConv(X_{Bphase1})$$
$$X_{Bout} = Conv(X_{Bphase2})$$

Here, DWConv denotes depthwise convolution with stride 1 to preserve spatial resolution. The outputs from both branches are then concatenated and passed through a channel shuffle operation, which interleaves feature maps from different branches. This process enhances information exchange between the two paths, allowing the network to focus on occluded regions that would otherwise be suppressed. For example, when a person’s left shoulder is blocked by another individual in China drone imagery, the ALSS module leverages the richer features from branch B to compensate for the lost information in branch A. The final ALSS output is given by:

$$X_{ALSS} = Shuffle(Concat(X_{Aout}, X_{Bout}))$$

This design is particularly effective for maintaining keypoint accuracy in cluttered environments common in China drone applications.

The third improvement addresses illumination variation, which significantly impacts China drone images captured during dawn, dusk, or cloudy weather. Standard nearest-neighbor or bilinear upsampling methods in the neck network discard edge information, leading to blurred features for low-light keypoints. We replace the traditional upsampler with Dysample, a content-aware dynamic sampler that learns sampling offsets from the input feature map. Given an input feature X of size C×H×W, and a desired scale factor s, Dysample generates a sampling set S of size 2×sH×sW via a linear layer and pixel shuffle. The process is defined as:

$$S = PixelShuffle(Linear(X))$$
$$X_2 = gridsample(X, S)$$

The offset tensor O is computed with a sigmoid-gated mechanism to constrain offsets within [0, 0.5], preventing spatial distortions. The offset calculation is:

$$O = 0.5 \cdot sigmoid(Linear_1(X)) \cdot Linear_2(X)$$

This dynamic sampling preserves high-frequency details such as joint contours, which are critical for accurate keypoint localization in low-contrast China drone images. Compared to fixed interpolation, Dysample reduces floating-point operations by 15% while improving recall under dim lighting by over 4% in our experiments.

We conduct extensive experiments on a dataset collected with China drones at altitudes ranging from 10 to 18 meters. The dataset comprises 1500 labeled frames, with 1085 images from 15-18m and 415 from 10-15m, covering diverse weather conditions (sunny, cloudy) and activities (standing, walking, falling). Seventeen keypoints are annotated per subject: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. The training set uses an input size of 640×640, a batch size of 8, an initial learning rate of 0.01, the Adam optimizer with momentum 0.937, and 500 epochs. The computing environment features an Intel i7-12700K CPU, an RTX 4080S GPU, and PyTorch 1.12.1.

To verify the contribution of each component, we perform an ablation study. The baseline YOLOv8n-Pose achieves a mAP@50 of 93.35% with 3.30 million parameters and 9.2 GFLOPS. Adding SPPELAN-H increases mAP@50 to 95.26% and parameters to 3.78 million. Further integration of ALSS raises mAP@50 to 95.81% while slightly reducing parameters to 3.73 million. Finally, incorporating Dysample yields the full EAD-YOLO model with 96.37% mAP@50, 3.59 million parameters, and 9.3 GFLOPS. The complete results are summarized in Table 1.

Model	F1 (%)	Precision (%)	Recall (%)	mAP@50 (%)	Params (10^6)	GFLOPS
Baseline (YOLOv8n-Pose)	93.67	94.03	93.31	93.35	3.30	9.2
+SPPELAN-H	95.31	95.87	94.76	95.26	3.78	9.8
+SPPELAN-H + ALSS	95.88	95.94	95.82	95.81	3.73	9.4
+SPPELAN-H + ALSS + Dysample (EAD-YOLO)	96.03	96.42	95.64	96.37	3.59	9.3

Table 1 demonstrates that each module improves detection accuracy without sacrificing efficiency. Notably, the final model reduces parameters compared to the intermediate versions due to Dysample’s lightweight design, which eliminates redundant upsampling layers. The F1 score increase of 2.36% and recall improvement of 2.33% confirm enhanced robustness in challenging conditions typical of China drone footage.

We also compare our method with other state-of-the-art pose estimators on the same dataset, including DeepPose, OpenPose, HRNet-W32, YOLOv8s-Pose, and YOLOv9-Pose. The results are listed in Table 2.

Model	mAP@50 (%)	Params (10^6)	GFLOPS	FPS
DeepPose	80.35	180	31.2	27.1
OpenPose	90.24	185	40.2	19.2
HRNet-W32	89.32	28.5	47.9	55.3
YOLOv8s-Pose	94.67	11.6	30.4	63.7
YOLOv9-Pose	97.42	32.3	122.4	46.9
EAD-YOLO (Ours)	96.37	3.59	9.3	78.5

Table 2 shows that EAD-YOLO achieves the second-best mAP@50 while being significantly lighter and faster than all competitors. It surpasses YOLOv8s-Pose by 1.70% in mAP with only one-third of the parameters and one-third of the GFLOPS. Although YOLOv9-Pose has a slightly higher mAP@50 (97.42%), its computational cost (32.3 million parameters, 122.4 GFLOPS) is prohibitive for real-time China drone applications where energy efficiency and throughput are critical. EAD-YOLO runs at 78.5 FPS on our RTX 4080S system, making it ideal for embedded China drone platforms.

To further analyze the practical benefits, we evaluate pose classification for standing and falling behaviors using a scoring mechanism based on three geometric features derived from keypoints. These features are the angle θ between the body centerline and horizontal, the aspect ratio Ar of the bounding box, and the vertical distance d between shoulder and hip centers. The formulas are:

$$\theta = \arccos\left(\frac{hip_{sc,x} – shoulder_{sc,x}}{\sqrt{(hip_{sc,x} – shoulder_{sc,x})^2 + (hip_{sc,y} – shoulder_{sc,y})^2}}\right)$$
$$Ar = \frac{|Y_2 – Y_1|}{|X_2 – X_1|}$$
$$d = |shoulder_{sc,y} – hip_{sc,y}|$$

For each frame, scores are computed as follows: if θ is between 65° and 115°, standing gets 0.8 points; if θ < 30°, falling gets 0.8 points. If Ar < 0.7, standing gets 0.5 points; if Ar > 1, falling gets 0.5 points. If d > 25 pixels, standing gets 0.5 points; otherwise, falling gets 0.5 points. The pose with the highest total score is selected. We test this scheme on 450 falling-action images from the UAV-Human dataset, comparing baseline YOLOv8n-Pose with EAD-YOLO. The results are presented in Table 3.

Method	Pose	AP (%)	AR (%)
YOLOv8n-Pose + scoring	Standing	90.2	88.0
YOLOv8n-Pose + scoring	Falling	86.5	83.4
EAD-YOLO + scoring	Standing	96.5	92.2
EAD-YOLO + scoring	Falling	95.0	88.6

Table 3 reveals that EAD-YOLO yields substantial gains in both AP and AR for both poses. For falling detection, AP improves by 8.5% and AR by 3.2%, which is critical in China drone applications for timely alert generation. The enhanced keypoint precision from our model ensures that geometric features are computed accurately, reducing false alarms.

Qualitative analyses further support our findings. Under varying China drone altitudes (10m and 15m), baseline models often miss keypoints on small figures or produce incorrect connections under occlusion. In contrast, EAD-YOLO consistently localizes all 17 keypoints with high confidence. In low-light conditions near dusk, the baseline suffers from blurred feature maps, while Dysample preserves edge information, leading to accurate wrist and ankle detection. These improvements are visually confirmed by heatmap visualizations, where EAD-YOLO generates sharper activation peaks at correct anatomical locations.

In summary, EAD-YOLO effectively addresses the primary bottlenecks of human pose recognition from China drone perspectives: scale variation, occlusion, and illumination degradation. The SPPELAN-H module enhances multi-scale feature aggregation via ELAN and HardSwish activation. The ALSS module uses weighted channel splitting and shuffling to handle occluded keypoints. The Dysample upsampler dynamically adapts to content, improving low-light performance without extra computational burden. The model achieves 96.37% mAP@50 with only 3.59 million parameters and 9.3 GFLOPS, operating at 78.5 FPS. This balance of accuracy and efficiency makes EAD-YOLO an excellent choice for real-time surveillance tasks using China drones. Future work will focus on expanding the dataset to include more diverse environments and action classes to further improve generalization in China drone applications.