The agricultural sector, particularly specialty fruit cultivation, is increasingly vulnerable to the impacts of extreme weather events. Accurate and rapid assessment of crop damage is crucial for effective disaster response, insurance claim processing, and yield loss estimation. Traditional methods relying on manual field surveys are not only labor-intensive and time-consuming but also prone to subjective error and sampling bias, making them inefficient for large-scale and timely assessments. In this context, unmanned aerial vehicle (UAV) drone-based remote sensing has emerged as a powerful tool, offering a scalable, non-invasive, and efficient means to monitor agricultural fields. The aerial perspective provided by a UAV drone allows for the rapid capture of high-resolution imagery over extensive orchard areas, facilitating detailed analysis. However, automating the detection of specific damage indicators, such as fallen fruit in complex orchard environments, presents significant computer vision challenges. This study addresses the critical problem of automatically detecting pears knocked to the ground by hailstorms using imagery captured by a UAV drone. The targets are typically small, exhibit colors similar to surrounding leaves and soil, and are often partially occluded by debris or other fallen fruit, creating a highly unstructured and challenging detection scenario for standard models.

To tackle this problem, we propose a novel detection framework based on an enhanced version of the state-of-the-art YOLOv12 object detector. Our approach integrates several architectural improvements specifically designed to boost performance for small, occluded targets in cluttered backgrounds, which are characteristic of the UAV drone imagery of post-hazard orchards. The core contributions of our method are threefold: First, we incorporate a Coordinate Attention (CA) module into the backbone feature extraction layers to enhance the model’s ability to focus on spatially meaningful features of small pear targets. Second, we replace the conventional Path Aggregation Network (PANet) in the neck with a Bidirectional Feature Pyramid Network (BiFPN) structure to better fuse multi-scale features and preserve fine-grained details essential for small object detection from the UAV drone’s vantage point. Finally, we introduce the Shape-IoU loss function to replace the standard Complete-IoU (CIoU), improving the bounding box regression accuracy for irregularly shaped fruits by considering their inherent shape and scale properties. Comprehensive experiments demonstrate that our enhanced YOLOv12 model significantly outperforms several baseline and contemporary YOLO variants in terms of detection precision, recall, and mean average precision on a dedicated dataset of post-hail fallen pears captured by a UAV drone.
1. Dataset Curation and Preprocessing for UAV Drone Imagery
The foundation of any robust deep learning model is a high-quality, representative dataset. Our data collection campaign was designed to capture the real-world conditions of a pear orchard after a hailstorm using a commercial UAV drone platform. A quadcopter equipped with a high-resolution RGB camera was deployed. Flights were conducted at an altitude of 15 meters above ground level during midday hours to ensure consistent and ample lighting. A high overlap rate (85% frontal, 75% side) was maintained during the flight path to facilitate subsequent orthomosaic generation. The primary output was a large, georeferenced orthophoto map of the affected area.
This orthomosaic, while valuable for visualization, is too large for direct input into a neural network. Therefore, a critical preprocessing step involved tiling the large orthomosaic into smaller, manageable patches. We segmented the image into non-overlapping tiles of 512×512 pixels. This size represents a balance between providing sufficient contextual information for the model and maintaining a resolution where small fallen pears (often only a few dozen pixels in diameter) remain discernible. Each tile was then meticulously annotated by human experts using a polygon annotation tool to precisely outline every visible fallen pear. This process resulted in a substantial collection of annotated image tiles. To further enhance the dataset’s diversity and robustness—simulating various conditions a UAV drone might encounter, such as varying light angles, slight motion blur, or sensor noise—we applied a suite of data augmentation techniques exclusively to the training set. The final distribution of our dataset is summarized in Table 1.
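The tiling step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' actual preprocessing code; the function name and the choice to discard partial edge strips (rather than pad them) are assumptions.

```python
import numpy as np

def tile_orthomosaic(image: np.ndarray, tile: int = 512):
    """Yield (row, col, patch) for every full non-overlapping tile.

    Edge strips smaller than `tile` are discarded here for simplicity;
    padding them to full size is an equally valid choice.
    """
    h, w = image.shape[:2]
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            yield r, c, image[r:r + tile, c:c + tile]

# Example: a 2048x1536 mosaic yields (2048//512) * (1536//512) = 12 tiles.
mosaic = np.zeros((2048, 1536, 3), dtype=np.uint8)
tiles = list(tile_orthomosaic(mosaic))
```

In practice the georeferencing of each tile's origin `(row, col)` would also be recorded so that detections can be mapped back onto the orchard.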
| Split | Original Samples | After Augmentation |
|---|---|---|
| Training Set | 4,096 | 24,576 |
| Validation Set | 878 | 878 |
| Test Set | 878 | 878 |
| Total | 5,852 | 26,332 |
The augmentation techniques included random 45-degree rotations, horizontal flipping, Gaussian blurring, addition of Salt-and-Pepper noise, and HSV (Hue, Saturation, Value) color space perturbations. This process ensures the model learns invariant features and generalizes better to unseen data from future UAV drone surveys under slightly different conditions.
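As an illustration, two of the listed augmentations (horizontal flipping and salt-and-pepper noise) can be written directly in NumPy; a full pipeline covering rotation, blur, and HSV perturbation would typically use a library such as Albumentations or torchvision. The noise ratio below is an assumed value, and the box format `[x1, y1, x2, y2]` is illustrative.

```python
import numpy as np

def hflip(img: np.ndarray, boxes: np.ndarray):
    """Flip an image and its [x1, y1, x2, y2] boxes left-to-right."""
    w = img.shape[1]
    flipped = np.ascontiguousarray(img[:, ::-1])
    fb = boxes.copy().astype(float)
    fb[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror and swap x-coordinates
    return flipped, fb

def salt_and_pepper(img: np.ndarray, ratio: float = 0.01, rng=None):
    """Set a random fraction of pixels to pure black or white."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = img.copy()
    mask = rng.random(img.shape[:2]) < ratio
    noisy[mask] = rng.choice([0, 255], size=(int(mask.sum()), 1))
    return noisy
```

Note that geometric augmentations such as the flip must transform the annotations together with the pixels, while photometric ones (noise, blur, HSV jitter) leave the boxes unchanged.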
2. Methodology: The Enhanced YOLOv12 Architecture
Our detection pipeline is built upon the YOLOv12-nano (YOLOv12n) model, chosen for its favorable trade-off between speed and accuracy, which is suitable for potential real-time analysis on edge devices processing UAV drone streams. We introduce three targeted modifications to its architecture to address the specific challenges of fallen fruit detection.
2.1 Baseline: YOLOv12 Network
YOLOv12 represents an evolution in the YOLO series, emphasizing efficient real-time detection. Its innovations include an Area Attention (A2) module that uses spatial group convolution and channel re-calibration to maintain a large receptive field while reducing computational cost via dynamic sparse attention. It also incorporates a Residual Efficient Layer Aggregation Network (Residual ELAN) to mitigate gradient dispersion issues often associated with attention mechanisms, and employs memory-optimized Flash Attention for handling longer sequences. The model follows the common YOLO architecture pattern: a CNN backbone (CSPDarknet) for feature extraction, a neck for multi-scale feature aggregation, and a detection head for predicting bounding boxes and class probabilities.
2.2 Incorporating BiFPN for Multi-Scale Feature Fusion
The neck of the network is responsible for combining features from different levels of the backbone. The standard PANet (Path Aggregation Network) performs a bottom-up and then a top-down fusion. However, for detecting very small objects like fallen pears in UAV drone imagery, a more robust multi-scale fusion is beneficial. We replace PANet with the Bidirectional Weighted Feature Pyramid Network (BiFPN).
BiFPN introduces two key improvements: simplified bidirectional connections and learnable feature weighting. It first removes nodes that have only one input edge (deemed to have less contribution to feature fusion) and then adds an extra edge from the original input to the output node if they are at the same level. More importantly, it performs fast normalized fusion where different input features are assigned learnable weights during fusion. The fusion operation for a target node can be expressed as:
$$ O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i $$
where $I_i$ are the input features, $w_i$ are learnable weights (applied per-feature, per-channel, or per-pixel), and $\epsilon$ is a small constant for numerical stability. This allows the network to dynamically emphasize more informative features from different scales. For our task, this means high-resolution, low-level features from earlier backbone layers—which contain fine details critical for spotting tiny pears—can be adaptively weighted higher when fusing with semantically rich but coarse high-level features. This bidirectional, weighted fusion significantly enhances the model’s capability to detect small targets prevalent in UAV drone imagery.
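The fast normalized fusion above reduces to a short computation once the input features have been resized to a common shape. The following pure-NumPy sketch shows the arithmetic; in training, `w` would be learnable parameters updated by backpropagation, and the ReLU clamp keeping them non-negative follows the BiFPN formulation.

```python
import numpy as np

def fast_normalized_fusion(features, w, eps=1e-4):
    """O = sum_i [w_i / (eps + sum_j w_j)] * I_i, with w_i >= 0 via ReLU."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # clamp weights
    norm = w / (eps + w.sum())                       # fast normalization (no softmax)
    return sum(wi * f for wi, f in zip(norm, features))
```

With equal weights, the fusion degenerates to a simple average; the benefit appears when training shifts weight toward the more informative scale.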
2.3 Embedding Coordinate Attention (CA) for Spatial Awareness
Attention mechanisms help models focus on relevant parts of an image. We embed the Coordinate Attention (CA) module into the backbone of YOLOv12. Unlike channel-only attention (e.g., Squeeze-and-Excitation) or spatial-only attention, CA decomposes the attention process into two parallel 1D feature encoding processes along the height and width axes. This captures long-range dependencies with precise positional information, which is vital for locating small, scattered pears.
The CA mechanism operates as follows. For an input feature map $X$ with dimensions $C \times H \times W$, it first performs average pooling along the horizontal and vertical directions, generating two sets of direction-aware feature maps:
$$ z_c^h(h) = \frac{1}{W} \sum_{0 \leq i < W} x_c(h, i) $$
$$ z_c^w(w) = \frac{1}{H} \sum_{0 \leq j < H} x_c(j, w) $$
These pooled features $z^h$ and $z^w$ encode global spatial information along one direction while preserving precise location information along the other. They are then concatenated and transformed via a shared $1 \times 1$ convolution, batch normalization, and a non-linear activation (like Sigmoid or Swish) to produce an intermediate feature map $f$. This map is split back into two separate tensors $f^h$ and $f^w$ along the spatial dimension. Two additional $1 \times 1$ convolutions, $F_h$ and $F_w$, followed by sigmoid activations, generate the final attention weights $g^h$ and $g^w$:
$$ g^h = \sigma(F_h(f^h)) $$
$$ g^w = \sigma(F_w(f^w)) $$
The final output $Y$ of the CA module is obtained by applying these attention weights multiplicatively to the input feature map:
$$ y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) $$
This process allows the network to selectively highlight important regions both horizontally and vertically, making it exceptionally effective for identifying small fruit targets that might be adjacent to distracting elements like leaves or soil clods in the UAV drone image.
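The CA equations above can be traced numerically. In this sketch the shared 1×1 convolution, batch normalization, and non-linearity are collapsed into an identity followed by a sigmoid so the pooling-and-gating arithmetic stays visible; a real module would learn those transforms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x: np.ndarray) -> np.ndarray:
    """x: (C, H, W). Returns y_c(i, j) = x_c(i, j) * g^h_c(i) * g^w_c(j)."""
    z_h = x.mean(axis=2)               # (C, H): pool over width, Eq. z^h
    z_w = x.mean(axis=1)               # (C, W): pool over height, Eq. z^w
    g_h = sigmoid(z_h)[:, :, None]     # (C, H, 1) attention along height
    g_w = sigmoid(z_w)[:, None, :]     # (C, 1, W) attention along width
    return x * g_h * g_w               # multiplicative re-weighting
```

The two gates act like a separable spatial mask: a row containing a pear raises `g_h` for that row, a column raises `g_w`, and their product localizes the target.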
2.4 Optimizing Localization with Shape-IoU Loss
The loss function for bounding box regression is crucial for precise localization. The default CIoU loss in YOLOv12 considers the overlap area, center-point distance, and aspect ratio difference between the predicted box $B$ and the ground truth box $B_{gt}$. However, it does not explicitly consider the inherent shape and scale properties of the target object itself. For irregularly shaped fruits, this can lead to suboptimal regression.
We adopt the Shape-IoU loss, which introduces a shape-aware term. The loss is defined as:
$$ \mathcal{L}_{\text{Shape-IoU}} = \mathcal{L}_{\text{IoU}} + \mathcal{L}_{\text{dis}} + \mathcal{L}_{\text{shp}} $$
Where the components are:
1. IoU Loss: $\mathcal{L}_{\text{IoU}} = 1 - \mathrm{IoU}$
2. Distance Loss: This component considers the weighted distance between the centers of the two boxes.
$$ \mathcal{L}_{\text{dis}} = \frac{\rho_h}{h} (x – x_{gt})^2 + \frac{\rho_w}{w} (y – y_{gt})^2 $$
Here, $\rho_h$ and $\rho_w$ are weighting coefficients derived from the shapes of the boxes, and $h$, $w$ are the height and width of the predicted box. This makes the loss more sensitive to misalignment for certain box shapes.
3. Shape Loss: This is the key innovation, which directly penalizes differences in width and height proportionally to the box’s own scale.
$$ \mathcal{L}_{\text{shp}} = \sum_{t \in \{w,h\}} \left( 1 - e^{-\frac{\rho_{t'}}{t} \cdot \frac{|t - t_{gt}|}{\max(t, t_{gt})}} \right)^{\theta} $$
where $t$ iterates over width and height, $t'$ is the complementary dimension, and $\theta$ is a shaping parameter. The term $\frac{\rho_{t'}}{t}$ dynamically adjusts the penalty based on the box's own dimensions. This formulation encourages the predictor to pay more attention to getting the shape of small objects right, which is often a challenge in UAV drone imagery where targets occupy few pixels.
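A hedged scalar sketch of the three Shape-IoU components for axis-aligned boxes `(cx, cy, w, h)` follows. The text leaves the exact form of the weighting coefficients open, so this adopts the scale-powered ground-truth aspect weights from the Shape-IoU paper as one plausible instantiation; `scale` and `theta` defaults are likewise assumptions.

```python
import math

def shape_iou_loss(pred, gt, scale=1.0, theta=4.0):
    x, y, w, h = pred
    xg, yg, wg, hg = gt
    # Plain IoU term for axis-aligned boxes
    ix = max(0.0, min(x + w/2, xg + wg/2) - max(x - w/2, xg - wg/2))
    iy = max(0.0, min(y + h/2, yg + hg/2) - max(y - h/2, yg - hg/2))
    inter = ix * iy
    iou = inter / (w * h + wg * hg - inter)
    # Shape-derived weights (assumed form, following the Shape-IoU paper)
    rho_w = 2 * wg**scale / (wg**scale + hg**scale)
    rho_h = 2 * hg**scale / (wg**scale + hg**scale)
    # Distance term, weighted as in the text
    l_dis = rho_h / h * (x - xg)**2 + rho_w / w * (y - yg)**2
    # Shape term: relative width/height mismatch, weighted by the
    # complementary dimension's coefficient as the text specifies
    l_shp = 0.0
    for t, tg, rho in ((w, wg, rho_h), (h, hg, rho_w)):
        omega = rho / t * abs(t - tg) / max(t, tg)
        l_shp += (1 - math.exp(-omega)) ** theta
    return (1 - iou) + l_dis + l_shp
```

A perfectly regressed box yields zero loss (IoU = 1 and both penalty terms vanish), while any center shift or shape mismatch adds a positive penalty scaled by the box's own dimensions.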
The overall architecture of our proposed enhanced YOLOv12 model, integrating BiFPN, CA, and Shape-IoU, is designed to be a powerful detector for the specific challenges posed by UAV drone-based agricultural monitoring.
3. Experimental Setup and Evaluation Metrics
All experiments were conducted in a consistent software and hardware environment to ensure fair comparisons. The models were implemented using PyTorch 1.13.1 and trained on a workstation with an NVIDIA GeForce RTX 4070 GPU. The training parameters were standardized across all models, as detailed in Table 2.
| Hyperparameter | Value |
|---|---|
| Input Image Size | 640 x 640 pixels |
| Training Batch Size | 16 |
| Total Epochs | 300 |
| Optimizer | Adam |
| Initial Learning Rate | 0.001 |
| Momentum | 0.937 |
| Weight Decay | 0.0005 |
| Early Stopping Patience | 30 epochs |
To comprehensively evaluate model performance, we employed the following standard metrics calculated on the independent test set:
Precision (P): The proportion of correctly identified pears among all detections. $P = \frac{TP}{TP + FP}$
Recall (R): The proportion of actual pears that were successfully detected. $R = \frac{TP}{TP + FN}$
F1-Score: The harmonic mean of Precision and Recall. $F1 = 2 \cdot \frac{P \cdot R}{P + R}$
Mean Average Precision at IoU=0.5 (mAP@0.5): The average precision across all recall levels for an Intersection-over-Union (IoU) threshold of 0.5. This is the primary metric for overall detection accuracy.
Model Parameters and GFLOPs: Measures of model complexity and computational cost, relevant for deployment on systems processing continuous UAV drone feed.
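Once detections are matched to ground truth at the chosen IoU threshold, the first three metrics above reduce to counts of true positives, false positives, and false negatives; a minimal sketch with illustrative counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from matched-detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example with arbitrary counts: 888 correct detections, 112 spurious,
# 124 missed pears.
p, r, f1 = precision_recall_f1(888, 112, 124)  # p=0.888, r~0.877, f1~0.883
```

mAP@0.5 additionally sweeps the confidence threshold to trace the full precision-recall curve before averaging, which is why it is reported separately.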
4. Results and Analysis
4.1 Ablation Study
We first conducted an ablation study to validate the individual and combined contribution of our three proposed enhancements to the base YOLOv12n model. The results are systematically presented in Table 3.
| Shape-IoU | BiFPN | CA | mAP@0.5 (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 93.8 | 88.1 | 86.9 | 87.5 |
| ✓ | ✗ | ✗ | 94.2 (+0.4) | 88.4 | 87.1 | 87.7 |
| ✗ | ✓ | ✗ | 93.9 (+0.1) | 88.1 | 87.5 (+0.6) | 87.8 |
| ✗ | ✗ | ✓ | 94.3 (+0.5) | 88.4 | 87.6 | 88.0 |
| ✓ | ✓ | ✗ | 94.1 (+0.3) | 88.3 | 87.5 | 87.9 |
| ✓ | ✗ | ✓ | 93.9 (+0.1) | 89.3 (+1.2) | 87.3 | 88.3 |
| ✗ | ✓ | ✓ | 94.5 (+0.7) | 88.5 | 87.0 | 87.7 |
| ✓ | ✓ | ✓ | 94.9 (+1.1) | 88.8 (+0.7) | 87.6 (+0.7) | 88.2 (+0.7) |
The analysis reveals several key insights. Individually, each component provides a boost to different aspects of performance. The Shape-IoU loss consistently improves mAP@0.5, demonstrating its effectiveness in refining bounding box regression for this specific task. The CA attention module delivers the highest single-module mAP@0.5 gain of 0.5%, highlighting its power in enhancing feature representation for small targets captured by the UAV drone. BiFPN primarily boosts the recall, indicating its strength in reducing missed detections by better integrating fine-grained features. Crucially, when all three enhancements are combined, we observe synergistic improvements. The final model achieves a mAP@0.5 of 94.9%, which is a substantial 1.1 percentage point increase over the already strong baseline YOLOv12n. This confirms that our modifications address complementary challenges in the UAV drone-based fallen fruit detection pipeline.
4.2 Comparison of Attention Mechanisms
To further justify our choice of the Coordinate Attention module, we compared it against other popular attention mechanisms integrated into the same YOLOv12n backbone. The results, focusing on detection accuracy metrics, are shown in Table 4.
| Attention Mechanism | mAP@0.5 (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| ECA (Efficient Channel Attention) | 94.1 | 88.2 | 87.0 | 87.6 |
| CBAM (Convolutional Block Attention Module) | 94.4 | 88.6 | 87.2 | 87.9 |
| SimAM (Simple, Parameter-Free Attention) | 94.0 | 88.3 | 86.8 | 87.5 |
| CA (Coordinate Attention – Ours) | 94.5 | 88.5 | 87.6 | 88.0 |
The CA module achieves the highest mAP@0.5, recall, and F1-score, with precision (88.5%) within 0.1 percentage points of the best value (CBAM, 88.6%). This advantage can be attributed to its unique decomposition of spatial attention into vertical and horizontal directions. This explicit spatial encoding is especially beneficial for locating the small, discrete targets (fallen pears) within the expansive and cluttered scene of a UAV drone image, where channel-only (ECA) or sequential channel-then-spatial (CBAM) attention may not capture positional relationships as effectively.
4.3 Comparative Analysis with State-of-the-Art Models
We benchmarked our enhanced YOLOv12 model against a range of efficient, contemporary YOLO models that are potential candidates for deployment on systems analyzing UAV drone imagery. The comparative results, including efficiency metrics, are presented in Table 5.
| Model | Parameters (M) | GFLOPs | Precision (%) | Recall (%) | mAP@0.5 (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| YOLOv7-tiny | 6.02 | 13.2 | 86.2 | 86.2 | 92.4 | 86.2 |
| YOLOv8n | 3.01 | 8.1 | 87.4 | 85.6 | 93.3 | 86.5 |
| YOLOv9s | 9.74 | 39.6 | 88.1 | 86.4 | 92.2 | 87.2 |
| YOLOv10n | 2.71 | 8.4 | 88.4 | 84.8 | 93.7 | 86.6 |
| YOLOv11n | 2.59 | 6.4 | 87.1 | 78.7 | 89.1 | 82.7 |
| YOLOv12n (Baseline) | 2.52 | 6.0 | 88.1 | 86.9 | 93.8 | 87.5 |
| Our Enhanced YOLOv12 | 2.57 | 5.8 | 88.8 | 87.6 | 94.9 | 88.2 |
The results are conclusive. Our enhanced model achieves the highest mAP@0.5 (94.9%), precision (88.8%), recall (87.6%), and F1-score (88.2%) among all compared models. It outperforms the baseline YOLOv12n by 1.1 percentage points in mAP@0.5 while maintaining a nearly identical parameter count and even slightly reducing computational complexity (GFLOPs). This demonstrates the effectiveness of our architectural tweaks. Notably, it significantly surpasses other efficient models like YOLOv10n and YOLOv11n on this specific UAV drone-based task. The visual comparison of detection results further underscores this advantage. In challenging scenarios with dense clusters, strong shadows, or partial occlusion by leaves and debris—common in post-disaster UAV drone surveys—our model shows markedly fewer false negatives (missed pears) and false positives compared to other top performers like YOLOv8n and YOLOv10n. The model reliably identifies pears that are subtly different from the background and correctly rejects confusing elements, proving its robustness for real-world agricultural insurance and assessment applications using UAV drone technology.
5. Conclusion
In this study, we addressed the critical and practical challenge of automatically detecting pears fallen due to hailstorms, using imagery captured by UAV drones. The unstructured orchard environment, characterized by small target size, color similarity to background elements, and frequent occlusions, poses a significant hurdle for standard object detection models. To overcome this, we developed an enhanced detection framework based on the YOLOv12 architecture. Our key innovations include the integration of a Coordinate Attention module to boost spatial feature discrimination, the adoption of a BiFPN neck for superior multi-scale fusion tailored to small objects, and the implementation of a Shape-IoU loss function for more accurate bounding box regression that considers target shape. Extensive experimental evaluations on a dedicated dataset confirm the effectiveness of each component and their synergistic combination.
The proposed model achieved a mAP@0.5 of 94.9%, outperforming a suite of state-of-the-art lightweight detectors including YOLOv7-tiny, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and the baseline YOLOv12n. It maintains high efficiency with only 2.57 million parameters, making it suitable for potential real-time or on-edge analysis of UAV drone video streams. This work provides a robust and accurate technical solution for rapid post-disaster assessment in pear orchards. By enabling fast, automated quantification of fallen fruit from UAV drone surveys, it can significantly support insurance claim processing, yield loss estimation, and informed decision-making for orchard managers, contributing to the resilience and sustainability of modern precision agriculture.
Future work will focus on expanding the dataset to include more varied conditions (e.g., different times of day, soil moisture levels) and other fruit types to improve generalizability. Furthermore, exploring the integration of this detection model into an end-to-end system that can process geotagged UAV drone imagery to generate direct loss estimation maps for entire orchards represents a promising direction for full-scale deployment.
