Enhanced Multi-Target Apple Recognition in Orchard Environments Using an Improved YOLOv8 on UAV Drone Imagery

In modern precision agriculture, accurate and efficient identification of fruit targets is foundational for growth monitoring, yield estimation, and intelligent orchard management. In recent years, small Unmanned Aerial Vehicles (UAVs), or drones, have gained widespread application in agriculture owing to their flexibility, compact size, and low operational cost. Using UAV imagery for precise, rapid fruit identification and counting can significantly reduce labor costs and raise the level of intelligent orchard management. However, drone-collected images present unique challenges: uneven illumination, severe occlusion between fruits or between fruits and leaves, densely distributed targets, and variations in flight altitude and attitude. These factors often cause missed detections and false positives in existing recognition models. Developing precise recognition methods for fruit targets in complex orchard environments from UAV imagery is therefore of great significance for the construction of smart orchards.

Research on fruit recognition has progressed from traditional methods that rely on manually designed features to deep learning-based approaches. Two-stage detectors such as Faster R-CNN offer high accuracy, but their complexity often compromises detection speed. Single-stage detectors such as the YOLO series and SSD, known for their simplicity and fast inference, have performed well on citrus, grape, apple, and strawberry detection. However, many studies rely on datasets of static, high-quality images from smartphones or handheld cameras, which limits generalization to the dynamic and complex scenarios encountered by UAVs. Although some work has begun to address drone-based fruit detection by modifying YOLO architectures or incorporating Transformer modules, challenges remain in balancing lightweight design, accuracy, and robustness to the occlusion and scale variation prevalent in UAV imagery.

To address the specific challenges of multi-target apple recognition in complex orchard environments from the UAV perspective, we propose an improved YOLOv8-based detection method. Our approach systematically enhances the YOLOv8 architecture across its backbone, neck, and head to better handle uneven lighting, occlusion, and small targets.

Materials and Methodological Framework

The core of our work involves a custom dataset and a novel model architecture termed LP-YOLOv8.

Dataset Curation and Preprocessing

The apple image dataset was collected by our team in September 2023 at an apple planting base in Tianshui City, Gansu Province, China, using a DJI Phantom 4 RTK drone equipped with a high-resolution camera. A total of 2,000 images of mature apples were captured under various weather conditions (sunny, cloudy, overcast) and lighting scenarios (front-lighting, side-lighting, backlighting) between 8:00-11:00 and 13:00-18:00. To increase target prominence and augment data diversity, the original high-resolution images were cropped, yielding 5,000 image samples. Approximately 1,000 images affected by low-light or backlit conditions were enhanced with a histogram equalization algorithm applied in the HSI color space, which proved more effective than alternatives such as the Retinex algorithm at improving contrast and detail clarity for subsequent training. All apple instances in the preprocessed images were manually annotated, and the dataset was randomly split into training and test sets at an 8:2 ratio.
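As a rough illustration of the intensity-channel equalization described above, the NumPy sketch below equalizes the I component of a simple HSI decomposition (taken here as the per-pixel RGB mean, an assumption on our part) and rescales the color channels proportionally to preserve hue and saturation. The exact conversion and parameters of our pipeline are not reproduced; treat this as a minimal stand-in.

```python
import numpy as np

def equalize_intensity(img):
    """Histogram-equalize the intensity channel of an RGB image.

    img: uint8 array of shape (H, W, 3). The intensity channel is taken
    as the per-pixel mean of R, G, B (the I component of a simple HSI
    model); chroma is approximately preserved by rescaling each channel
    by the ratio of new to old intensity.
    """
    rgb = img.astype(np.float64)
    intensity = rgb.mean(axis=2)                      # I channel in [0, 255]
    hist, _ = np.histogram(intensity, bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf = cdf / cdf[-1]                               # normalized CDF in [0, 1]
    new_i = 255.0 * cdf[intensity.astype(np.uint8)]   # equalized intensity
    ratio = new_i / np.maximum(intensity, 1e-6)
    out = np.clip(rgb * ratio[..., None], 0, 255)
    return out.astype(np.uint8)
```

Applied to a low-contrast backlit crop, this stretches the intensity histogram toward the full [0, 255] range while leaving the channel ratios, and hence the apples' color cues, largely intact.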

The LP-YOLOv8 Architecture

The proposed LP-YOLOv8 model introduces targeted improvements to the standard YOLOv8n architecture to enhance its feature extraction, fusion, and detection capabilities for challenging drone-captured apple imagery.

1. Backbone Network Optimization with GELAN and LSKA

The original YOLOv8 backbone uses multiple CBS and C2f modules for feature extraction. While effective, the C2f module can introduce computational redundancy. We replace the C2f modules with a Generalized Efficient Layer Aggregation Network (GELAN) structure. GELAN combines ideas from CSPNet and ELAN, using a split path to maintain gradient flow and a RepNCSP branch for efficient feature processing. This change significantly reduces the parameter count while improving the network’s ability to learn features of small, occluded targets, as visualized in feature heatmaps.
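The split path mentioned above can be sketched at the shape level. The toy code below shows only the CSP-style routing, half the channels bypass the transform unchanged while the other half is processed and concatenated back; the real GELAN's convolutional RepNCSP branch and reparameterization are omitted, and the stand-in transform is hypothetical.

```python
import numpy as np

def csp_style_block(x, transform):
    """Shape-level sketch of the CSP/ELAN split path used by GELAN.

    x: feature map of shape (C, H, W). Half the channels bypass the
    transform unchanged (keeping a short gradient path); the other half
    is processed by `transform`, then the two halves are concatenated.
    """
    c = x.shape[0] // 2
    identity, branch = x[:c], x[c:]
    processed = transform(branch)            # stands in for the RepNCSP branch
    return np.concatenate([identity, processed], axis=0)

# Toy transform standing in for the RepNCSP branch: a per-channel scale.
x = np.random.rand(8, 16, 16)
y = csp_style_block(x, lambda b: b * 2.0)
```

The untouched half of the channels is what keeps gradients flowing directly to earlier layers, which is the property GELAN inherits from CSPNet.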

Furthermore, to enlarge the model’s receptive field and better capture global context information of apple targets against complex backgrounds, we integrate a Large Separable Kernel Attention (LSKA) mechanism into the SPPF module. LSKA decomposes a large 2D convolution kernel into cascaded horizontal and vertical 1D kernels, allowing the use of large-kernel depthwise convolutions without a quadratic increase in parameters. This design makes the module more attuned to the shape of targets rather than texture, improving robustness. The modified SPPF_LSKA module is placed after the concatenation operation in the SPPF block.
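The parameter saving from the 1D decomposition is easy to quantify: a k×k depthwise kernel costs k² weights per channel, while cascaded 1×k and k×1 depthwise kernels cost only 2k. The sketch below computes both counts; the kernel size (23) and channel count (256) are illustrative choices, not the exact LSKA configuration used in the model.

```python
def depthwise_kernel_params(k, channels):
    """Weights in a k x k depthwise convolution over `channels` channels."""
    return channels * k * k

def separable_kernel_params(k, channels):
    """Weights when the same receptive field is built from cascaded
    1 x k (horizontal) and k x 1 (vertical) depthwise convolutions,
    as in LSKA's large-kernel decomposition."""
    return channels * (k + k)

# For an illustrative 23x23 effective kernel on 256 channels, the
# separable form replaces k*k = 529 weights per channel with 2*k = 46.
full = depthwise_kernel_params(23, 256)   # 135,424
sep = separable_kernel_params(23, 256)    # 11,776
```

This linear-in-k cost is what allows very large receptive fields without the quadratic parameter growth noted above.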

2. Neck Network Enhancement with DySample

The neck of YOLOv8 performs multi-scale feature fusion via upsampling. We replace the original bilinear interpolation-based Upsample operator with a lightweight, dynamic upsampling operator called DySample. DySample generates content-aware sampling point offsets through a small sub-network, allowing for adaptive feature sampling that better preserves details for small and occluded objects, reducing boundary blurring inherent in fixed interpolation methods. The sampling process can be described as follows, where a scope factor mechanism (static or dynamic) is used to manage offset ranges and prevent sampling conflicts:

$$O = \mathrm{linear}(x)$$
$$S = G + O$$
$$x' = \mathrm{grid\_sample}(x, S)$$

Here, $x$ is the input feature map, $\mathrm{linear}(\cdot)$ is a linear mapping function, $G$ is the regular grid coordinate, $O$ is the generated offset, $S$ is the new sampling location, and $\mathrm{grid\_sample}(\cdot)$ is the differentiable sampling function producing the upsampled output $x'$.
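The three equations above can be traced in a toy NumPy version. This sketch uses nearest-neighbour gathering where the real operator uses bilinear `grid_sample`, and a plain per-pixel linear map for the offset sub-network; the scope-factor mechanism is omitted. It is a stand-in for intuition, not the DySample implementation.

```python
import numpy as np

def dysample_sketch(x, weight, scale=2):
    """Toy, nearest-neighbour version of the DySample equations.

    x:      input feature map, shape (C, H, W)
    weight: linear map producing one (dy, dx) offset per input pixel,
            shape (2, C) -- stands in for linear() in the equations
    Returns the upsampled map of shape (C, H*scale, W*scale).
    """
    C, H, W = x.shape
    Ho, Wo = H * scale, W * scale
    # Regular grid G: source coordinates for each output position
    gy, gx = np.meshgrid(np.arange(Ho) / scale, np.arange(Wo) / scale,
                         indexing="ij")
    # Content-aware offsets O = linear(x), replicated to output resolution
    off = np.tensordot(weight, x, axes=([1], [0]))         # (2, H, W)
    off = off.repeat(scale, axis=1).repeat(scale, axis=2)  # (2, Ho, Wo)
    # S = G + O, clamped to valid source coordinates
    sy = np.clip(gy + off[0], 0, H - 1)
    sx = np.clip(gx + off[1], 0, W - 1)
    # grid_sample(x, S): nearest-neighbour gather (bilinear in the real op)
    iy, ix = np.round(sy).astype(int), np.round(sx).astype(int)
    return x[:, iy, ix]
```

With a zero weight the offsets vanish and the operator degenerates to plain fixed-grid upsampling; non-zero, content-dependent offsets are what let DySample pull samples toward informative pixels near object boundaries.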

3. Head Layer Optimization with an Additional Small-Object Detection Head

The original YOLOv8 detection head operates on feature maps at three scales (e.g., 80×80, 40×40, 20×20). To strengthen detection of the small or heavily occluded apples common in UAV images taken at higher flight altitudes, we add an extra detection head fed by a higher-resolution, shallower feature map (160×160) from the backbone (the P2 layer), which provides the richer spatial detail needed to distinguish small targets. The neck network is modified accordingly to fuse features from four levels (P2, P3, P4, P5) and output four feature maps of different scales (160×160, 80×80, 40×40, 20×20) to the detection heads.
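The head grid sizes follow directly from the backbone strides. Assuming the standard 640×640 YOLOv8 input (an assumption; the paper does not restate the input size here), P2-P5 correspond to strides 4/8/16/32:

```python
def head_grid_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Grid sizes of the P2-P5 detection heads for a square input.

    The extra P2 head (stride 4) yields the 160x160 map carrying the
    fine spatial detail needed for small apples; strides 8/16/32 give
    the three original YOLOv8 heads (P3-P5).
    """
    return [input_size // s for s in strides]

sizes = head_grid_sizes()   # [160, 80, 40, 20]
```

Each cell of the stride-4 grid covers only a 4×4 pixel patch of the input, which is why the P2 head is the one that can localize apples occupying just a handful of pixels.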

Evaluation Metrics

The performance of the recognition model is evaluated using standard metrics: Precision (P), Recall (R), mean Average Precision (mAP@0.5), Frames Per Second (FPS), and the number of model parameters (Params). The formulas for key metrics are:

$$P = \frac{TP}{TP + FP} \times 100\%$$
$$R = \frac{TP}{TP + FN} \times 100\%$$
$$mAP = \frac{\sum AP}{N} \times 100\%$$

where $TP$, $FP$, and $FN$ represent the number of True Positives, False Positives, and False Negatives, respectively, $AP$ is the Average Precision for a class, and $N$ is the number of classes.

Experimental Results and Analysis

All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3050 Ti GPU. The models were trained for 300 epochs using the Stochastic Gradient Descent (SGD) optimizer.

Selection of Baseline Model

We first evaluated several YOLO-series models on our custom UAV apple dataset to select a suitable baseline. The results are summarized below:

| Model | Precision (P) | Recall (R) | mAP@0.5 | FPS | Params |
|---------|-------|-------|-------|-------|------------|
| YOLOv5s | 85.3% | 82.5% | 88.1% | 52.7 | 7,033,114 |
| YOLOv8n | 90.2% | 88.7% | 95.0% | 123.7 | 3,005,843 |
| YOLOv9 | 89.1% | 84.9% | 94.3% | 20.6 | 60,806,462 |
| YOLOv10 | 89.3% | 86.2% | 92.6% | 137.6 | 2,696,366 |
| YOLOv12 | 90.0% | 86.4% | 94.3% | 75.7 | 2,691,815 |

YOLOv8n achieved the best balance of precision, recall, mAP, and inference speed (FPS), making it the most suitable baseline for our improvements targeting efficient, accurate recognition from UAV image streams.

Ablation Study

We conducted an ablation study to validate the contribution of each proposed component in the LP-YOLOv8 model. The results are systematically presented in the following table:

| GELAN | SPPF_LSKA | DySample | P2 Head | P (%) | R (%) | mAP@0.5 (%) | Params | FPS |
|-------|-----------|----------|---------|-------|-------|-------------|-----------|-------|
| × | × | × | × | 90.2 | 88.7 | 95.0 | 3,005,843 | 123.7 |
| ✓ | × | × | × | 90.6 | 90.0 | 96.2 | 2,194,123 | 104.3 |
| × | ✓ | × | × | 89.7 | 91.3 | 96.3 | 3,278,739 | 117.6 |
| × | × | ✓ | × | 91.3 | 89.8 | 96.4 | 3,023,395 | 107.7 |
| × | × | × | ✓ | 90.5 | 91.0 | 96.5 | 2,926,692 | 98.4 |
| ✓ | ✓ | × | × | 90.0 | 91.1 | 96.4 | 2,869,147 | 120.2 |
| ✓ | ✓ | ✓ | × | 91.3 | 91.0 | 96.1 | 2,856,254 | 124.1 |
| ✓ | ✓ | ✓ | ✓ | 92.0 | 91.2 | 96.6 | 2,804,596 | 129.1 |

The ablation study clearly demonstrates the synergistic effect of our improvements. While individual components offer modest gains, their combination in the full LP-YOLOv8 model achieves the best overall performance: a precision of 92.0%, a recall of 91.2%, and an mAP@0.5 of 96.6%. Notably, this is achieved with a reduced parameter count (approximately 2.8 million) compared to the baseline and a slight increase in inference speed to 129.1 FPS. The addition of the P2 head, coupled with YOLOv8’s automatic channel compression mechanism for the neck, actually contributes to the overall parameter reduction. Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations further confirm that the combined GELAN and LSKA modifications lead to stronger and more focused feature responses on occluded and small apple targets compared to the baseline or other partial modifications.

Comparative Analysis with State-of-the-Art Models

To further benchmark LP-YOLOv8, we compared it against several classic and recent state-of-the-art detectors, including improved YOLO variants from the recent literature, on the same UAV apple dataset. The results are shown below:

| Model | Precision (P) | Recall (R) | mAP@0.5 | Params | FPS |
|-------------------|-------|-------|-------|------------|-------|
| YOLOv5s | 75.3% | 95.2% | 93.8% | 7,213,880 | 76.6 |
| YOLOv7 | 89.4% | 89.6% | 94.6% | 36,481,772 | 34.2 |
| YOLOv8n | 90.2% | 88.7% | 95.0% | 3,005,843 | 123.7 |
| RT-DETR | 86.9% | 85.9% | 92.8% | 32,808,131 | 37.0 |
| SGW-YOLOv8n [Ref] | 89.8% | 90.5% | 95.7% | 3,115,682 | 86.3 |
| YOLOv8s-CFB [Ref] | 89.1% | 90.1% | 95.5% | 2,636,953 | 112.6 |
| LP-YOLOv8 (Ours) | 92.0% | 91.2% | 96.6% | 2,804,596 | 129.1 |

Our LP-YOLOv8 model outperforms all compared models in precision, recall, and mAP@0.5, surpassing the baseline YOLOv8n by 1.8, 2.5, and 1.6 percentage points in these three metrics, respectively. It also maintains a high inference speed (129.1 FPS), significantly faster than many of the other models, including the transformer-based RT-DETR. Visual comparisons on challenging cases (e.g., backlit scenes, heavily occluded fruits) confirm that LP-YOLOv8 produces fewer missed detections and false positives, with consistently higher prediction confidence for small and obscured apples in the UAV imagery.

Practical Deployment and Conclusion

The trained LP-YOLOv8 model has been integrated and deployed in a self-developed smart agriculture management system on a cloud platform. The system, built on a B/S (browser/server) framework, receives apple orchard images captured by UAVs in real time over the mobile internet and automatically invokes the LP-YOLOv8 model through an API to perform apple recognition on each image. The deployment demonstrates practical utility, with a per-image inference time of approximately 400 ms, meeting the requirements of large-scale orchard fruit recognition in real-world applications.

In conclusion, to address precise multi-target apple recognition in complex orchard environments from UAV imagery, we proposed an improved YOLOv8-based method. By optimizing the backbone with GELAN and LSKA, the neck with DySample, and the head with an additional small-target detection layer, the resulting LP-YOLOv8 model achieves superior performance. On our custom drone-captured apple dataset, it attained a precision of 92.0%, a recall of 91.2%, an mAP@0.5 of 96.6%, and an inference speed of 129.1 FPS, with a reduced parameter count. Comparative and ablation studies validate the effectiveness of each component and of the overall architecture. The model's successful deployment in a cloud-based management system underscores its potential as technical support for accurate multi-target fruit recognition in complex environments from UAV imagery. Future work will focus on improving generalization across geographical regions, apple varieties, and more diverse weather conditions.
