Accurate Wheat Spike Counting via Drone Video Streams and Deep Learning

In modern agriculture, the accurate monitoring of crop traits such as wheat spike count is crucial for yield prediction, breeding programs, and cultivation management. Traditional methods relying on manual counting are labor-intensive, time-consuming, and prone to human error, especially in large-scale fields. With advancements in drone technology, Unmanned Aerial Vehicles (UAVs) have emerged as powerful tools for capturing high-resolution imagery and video streams of crops. These UAV-based systems, when integrated with deep learning algorithms, enable efficient and real-time analysis of plant phenotypes. In this study, we explore the feasibility of using deep learning models to accurately count wheat spikes from UAV video streams, addressing challenges such as varying planting densities, environmental conditions, and spike developmental stages.

We selected ten major winter wheat cultivars from the Huang-Huai region, including varieties like Fanmai 8, Zhoumai 36, and Zhongmai 895, to represent diverse spike types. Three planting densities—1.2 million, 2.4 million, and 3.6 million plants per hectare—were established in randomized block designs. Additionally, a production field of Zhongmai 578 with a density of 3 million plants per hectare was used for validation. Data collection was conducted during the grain-filling stages using a consumer-grade DJI Air 3 drone equipped with a 48-megapixel camera. The drone was flown at an altitude of 3 meters above the canopy, with a tilt angle of 37 degrees and a speed of 0.5–1 m/s, capturing video streams at a resolution of 3840 × 2160 and 60 frames per second. To ensure robustness, videos were cropped into 450 × 450 pixel sub-videos, and frames were extracted every 45 frames to minimize overlap, resulting in 528 initial images. Data augmentation techniques, including horizontal flip, vertical flip, Gaussian blur, and contrast adjustment, were applied to generate 3,168 images, which were split into training, validation, and test sets in an 8:1:1 ratio.
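The frame-sampling and 8:1:1 split arithmetic above can be sketched in a few lines. This is an illustrative sketch only; the clip length used below is hypothetical, and the actual cropping and extraction tooling used in the study is not specified here.

```python
def sampled_frame_indices(total_frames: int, step: int = 45) -> list[int]:
    """Indices of frames kept when extracting every `step`-th frame."""
    return list(range(0, total_frames, step))

def split_counts(n_images: int, ratios=(8, 1, 1)) -> tuple[int, int, int]:
    """Split n_images into train/val/test sets by the given integer ratios."""
    total = sum(ratios)
    train = n_images * ratios[0] // total
    val = n_images * ratios[1] // total
    test = n_images - train - val  # remainder goes to the test set
    return train, val, test

# A hypothetical 10 s clip at 60 fps has 600 frames; sampling every
# 45th frame keeps 14 of them.
print(len(sampled_frame_indices(600)))  # 14
print(split_counts(3168))               # (2534, 316, 318)
```

With the study's 3,168 augmented images, an exact 8:1:1 split does not divide evenly, so a couple of images necessarily spill into one of the partitions.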

We employed four state-of-the-art deep learning models—YOLOv5, YOLOv6, YOLOv8, and YOLOv10—for wheat spike detection and counting. These models are single-stage object detectors known for their balance between speed and accuracy. The training process utilized the PyTorch framework with stochastic gradient descent (SGD) optimizer, an initial learning rate of 0.01, and 150 epochs. Key evaluation metrics included precision (Pr), recall (Re), F1-score (F1), average precision (AP), accuracy (Acc), coefficient of determination (r²), root mean square error (RMSE), and mean absolute error (MAE). The formulas for these metrics are as follows:

Precision: $$Pr = \frac{TP}{TP + FP}$$

Recall: $$Re = \frac{TP}{TP + FN}$$

F1-score: $$F1 = \frac{2 \times (Pr \times Re)}{Pr + Re}$$

Average Precision: $$AP = \int_0^1 Pr(Re) dRe$$

Accuracy: $$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

Coefficient of Determination: $$r^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

Root Mean Square Error: $$RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}}$$

Mean Absolute Error: $$MAE = \frac{\sum_{i=1}^n |y_i - \hat{y}_i|}{n}$$

where $TP$, $FP$, $FN$, and $TN$ denote true positives, false positives, false negatives, and true negatives, respectively; $y_i$ is the actual spike count for sample $i$, $\hat{y}_i$ is the predicted count, and $\bar{y}$ is the mean of the actual counts.
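The metrics defined above translate directly into code. The following is a minimal reference implementation of these formulas; the confusion counts and arrays passed in are the caller's, not values from the study.

```python
import math

def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Precision, recall, F1-score, and accuracy from confusion counts."""
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": pr, "recall": re, "f1": f1, "accuracy": acc}

def regression_metrics(y, y_hat) -> dict:
    """r^2, RMSE, and MAE between actual and predicted spike counts."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(a - p) for a, p in zip(y, y_hat)) / n,
    }
```

Note that AP is omitted here because it requires the full precision–recall curve over confidence thresholds, not a single confusion matrix.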

The training losses of all models decreased steadily over iterations, indicating convergence. YOLOv6 had the shortest training time at 6.37 hours, while YOLOv8 took the longest at 24.37 hours. In terms of detection efficiency, YOLOv6 achieved the highest frame rate (13.04 fps), whereas YOLOv8 and YOLOv5 were slower but more accurate. The performance metrics on the test set revealed that YOLOv8 outperformed the others with a recall of 90.90%, F1-score of 93.00%, AP of 97.20%, and accuracy of 88.00%. YOLOv10, despite having the highest precision (96.60%), showed lower recall and accuracy, indicating a tendency to miss spikes. The results are summarized in Table 1.

Table 1: Performance Comparison of Deep Learning Models for Wheat Spike Detection

| Model | Precision (%) | Recall (%) | F1-score (%) | Average Precision (%) | Accuracy (%) | Training Time (h) | FPS |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 92.10 | 90.10 | 92.00 | 96.60 | 85.70 | 11.15 | 12.80 |
| YOLOv6 | 96.10 | 80.30 | 89.00 | 93.80 | 80.90 | 6.37 | 13.04 |
| YOLOv8 | 93.10 | 90.90 | 93.00 | 97.20 | 88.00 | 24.37 | 11.95 |
| YOLOv10 | 96.60 | 72.70 | 82.00 | 88.60 | 69.80 | 8.57 | 10.07 |

Under different planting densities, the correlation between model-predicted spike counts and manual counts decreased as density increased. For YOLOv8, the r² values were 0.92, 0.81, and 0.79 at densities of 1.2, 2.4, and 3.6 million plants per hectare, respectively. The RMSE and MAE also rose with density, indicating higher error in denser canopies. Statistical analysis using t-tests showed that YOLOv8’s performance was significantly better (p < 0.05 or p < 0.01) than other models at higher densities, as detailed in Table 2. This highlights the robustness of YOLOv8 in handling occlusions and overlapping spikes, common challenges in drone-based imagery.

Table 2: Model Performance Across Planting Densities (r² values)

| Planting Density (million plants/ha) | YOLOv5 | YOLOv6 | YOLOv8 | YOLOv10 |
|---|---|---|---|---|
| 1.2 | 0.89 | 0.85 | 0.92 | 0.84 |
| 2.4 | 0.78 | 0.72 | 0.81 | 0.71 |
| 3.6 | 0.75 | 0.68 | 0.79 | 0.67 |

Validation on the production field across grain-filling stages (early, mid, and late) further confirmed YOLOv8’s superiority. In early grain-filling, YOLOv8 achieved an r² of 0.78 with RMSE and MAE of 3.62 and 3.30 spikes per image, respectively. By late grain-filling, r² had improved to 0.91 and RMSE to 3.10, while MAE was 3.51 spikes per image. Other models, such as YOLOv5 and YOLOv10, showed lower consistency and higher errors. For real-time video stream processing, YOLOv8 maintained the highest accuracy with an r² of 0.90, RMSE of 10.00 spikes per image, and MAE of 7.80, outperforming YOLOv5 (r² = 0.84), YOLOv6 (r² = 0.74), and YOLOv10 (r² = 0.72). The detailed spike detection results, including true positives, false positives, and false negatives, are presented in Table 3.

Table 3: Spike Detection Results in Production Field Videos

| Model | True Spikes | Detected Spikes | True Positives | False Positives | False Negatives |
|---|---|---|---|---|---|
| YOLOv5 | 2759 | 2708 | 2517 | 191 | 242 |
| YOLOv6 | 2759 | 2386 | 2264 | 122 | 495 |
| YOLOv8 | 2759 | 2713 | 2544 | 169 | 215 |
| YOLOv10 | 2759 | 2114 | 2038 | 76 | 721 |
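The confusion counts in Table 3 can be sanity-checked by recomputing each model's video-level precision and recall directly from them (note these video-level values need not match the frame-level test-set metrics in Table 1 exactly, though they are broadly consistent):

```python
# Confusion counts copied from Table 3 (production-field videos).
table3 = {
    "YOLOv5":  {"tp": 2517, "fp": 191, "fn": 242},
    "YOLOv6":  {"tp": 2264, "fp": 122, "fn": 495},
    "YOLOv8":  {"tp": 2544, "fp": 169, "fn": 215},
    "YOLOv10": {"tp": 2038, "fp": 76,  "fn": 721},
}

for model, c in table3.items():
    pr = c["tp"] / (c["tp"] + c["fp"])  # precision = TP / detected spikes
    re = c["tp"] / (c["tp"] + c["fn"])  # recall = TP / true spikes
    print(f"{model}: precision={pr:.3f}, recall={re:.3f}")
```

The counts are internally consistent: for every model, TP + FN equals the 2,759 true spikes, and TP + FP equals the detected-spike total, and the recomputed values reproduce the ranking reported above (YOLOv8 best on recall, YOLOv10 highest precision but lowest recall).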

The integration of drone technology with deep learning, particularly using YOLOv8, demonstrates significant potential for automated wheat spike counting. The Unmanned Aerial Vehicle platform provides high-quality video streams that capture dynamic crop features, while the deep learning model excels in detecting spikes under varying conditions. YOLOv8’s anchor-free detection head and C2f backbone module contribute to its high performance in dense environments. Although YOLOv6 and YOLOv10 offer faster processing, their lower recall and accuracy limit their practical application. Data augmentation techniques, such as Gaussian blur and flipping, enhanced model generalization, reducing overfitting to specific lighting or weather conditions. This aligns with previous studies emphasizing the role of augmentation in improving robustness for agricultural drone-based systems.

The use of drone technology for real-time monitoring allows continuous data collection, enabling timely decisions in precision agriculture. The Unmanned Aerial Vehicle’s ability to cover large areas efficiently complements the deep learning models’ scalability. However, challenges remain, such as handling extreme occlusions in high-density fields and adapting to different wheat cultivars. Future work could focus on integrating multi-spectral data from advanced UAV sensors or developing hybrid models that combine object detection with density estimation for improved accuracy.

In conclusion, our study validates that YOLOv8, when combined with drone video streams, provides a reliable and efficient solution for wheat spike counting. It outperforms other models in terms of recall, F1-score, and accuracy across diverse planting densities and growth stages. The application of this approach can revolutionize yield prediction, breeding selection, and field management, underscoring the transformative impact of drone technology and deep learning in modern agriculture. As Unmanned Aerial Vehicles become more accessible, their integration with advanced algorithms like YOLOv8 will pave the way for sustainable and data-driven farming practices.
