In modern agriculture, accurate estimation of crop yield is crucial for food security and informed decision-making. Rice, as a staple food globally, requires efficient monitoring techniques, especially in complex field environments. Traditional manual counting of rice panicles is labor-intensive and error-prone, prompting the need for automated solutions. With advancements in remote sensing, UAV drone technology has emerged as a powerful tool for capturing high-resolution aerial imagery, enabling non-invasive and scalable data collection. However, detecting rice panicles from UAV drone images poses challenges due to factors like occlusion, varying panicle morphology, uneven lighting, and background clutter. Existing detection methods often struggle with balancing accuracy and computational efficiency, limiting deployment on resource-constrained devices such as UAV drones. In this study, we propose a lightweight detection model, CBLP YOLO11n, designed specifically for UAV drone-based rice panicle counting. Our approach integrates several enhancements to improve feature extraction, multi-scale fusion, and detection head efficiency while reducing model complexity. We focus on leveraging UAV drone imagery to address real-world agricultural scenarios, ensuring robustness and practicality.

The use of UAV drones in agriculture has revolutionized data acquisition, allowing for rapid and cost-effective field surveys. For rice panicle detection, UAV drone images provide detailed spatial information, but they also introduce complexities like small target sizes and environmental noise. Previous methods, including traditional machine learning and deep learning approaches, have shown limitations in handling these intricacies. For instance, image segmentation techniques, while precise, are computationally expensive and sensitive to background interference. In contrast, object detection models offer a balance between speed and accuracy, making them suitable for UAV drone applications. Among these, YOLO (You Only Look Once) variants are popular due to their real-time performance. We base our work on YOLO11n, the lightweight version of YOLO11, and introduce modifications to enhance its suitability for UAV drone-based rice panicle detection. Our goal is to achieve high detection precision with minimal resource usage, facilitating deployment on UAV drones for in-field monitoring.
To address the challenges, we collected UAV drone imagery from rice fields using a DJI Air 2S drone flown at a low altitude of 3 meters to capture high-resolution images. The dataset comprises diverse scenarios, including clear, occluded, and overlapping panicles, reflecting real-world conditions. We processed the data by cropping images into smaller patches and applying augmentation techniques such as rotation and noise addition to improve model generalization. Additionally, we employed Real-ESRGAN for super-resolution reconstruction to enhance image details, which is critical for detecting small panicles in UAV drone images. The model’s effectiveness is evaluated using metrics such as precision, recall, mean average precision (mAP), parameter count, and floating-point operations (FLOPs). Our experiments demonstrate that CBLP YOLO11n outperforms baseline and other state-of-the-art models, offering a lightweight solution for UAV drone-based rice panicle counting.
In the following sections, we detail our methodology, including the architectural improvements to YOLO11n. We introduce the C3k2 CFCGLU module for enhanced feature extraction, BiFPN for efficient multi-scale feature fusion, a Lightweight Detail-Enhanced Shared Detection Head (LDSDH) to reduce complexity, and the Powerful IoUv2 (PIoUv2) loss function for better bounding box regression. We present results through tables and formulas, highlighting the model’s performance in various scenarios. Finally, we discuss implications for agricultural applications using UAV drones, emphasizing the model’s potential for real-time field deployment.
Materials and Methods
Our study relies on UAV drone imagery acquired from rice fields to train and evaluate the detection model. The UAV drone, a DJI Air 2S, was operated at an altitude of 3 meters with the camera oriented vertically downward, capturing images at a resolution of 5472 × 3648 pixels. This setup ensures detailed coverage of rice panicles while minimizing distortion. The field conditions included sunny weather with varying lighting, simulating typical UAV drone survey environments. A total of 116 original images were collected, which were subsequently split into 512 × 512 patches with 10% overlap to create a manageable dataset for training. We annotated rice panicles using bounding boxes, resulting in 1,281 valid images. The dataset was divided into training, validation, and test sets in an 8:1:1 ratio. To augment the training data, we applied transformations such as horizontal and vertical flipping, random rotation, cropping, Gaussian noise addition, and brightness and contrast adjustments. This augmentation increases dataset diversity and model robustness, which is crucial for handling UAV drone images under different conditions.
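The patch extraction above can be sketched in a few lines. A minimal NumPy example, assuming only full-size patches are kept (border remainders dropped) and a stride implied by the 10% overlap:

```python
import numpy as np

def crop_patches(image, patch=512, overlap=0.10):
    """Slide a patch-sized window over the image with the given
    fractional overlap between neighbouring patches."""
    stride = int(patch * (1 - overlap))  # 512 * 0.9 = 460 px step
    h, w = image.shape[:2]
    patches = []
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
    return patches

# A blank 5472 x 3648 frame, matching the Air 2S capture resolution
frame = np.zeros((3648, 5472, 3), dtype=np.uint8)
tiles = crop_patches(frame)
```

A production pipeline might pad the image so border regions are also covered; this sketch simply illustrates the overlapping-window logic.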
Furthermore, we utilized Real-ESRGAN for super-resolution reconstruction to enhance image quality. This step addresses limitations in UAV drone imagery, such as blurriness or low resolution caused by factors like wind or transmission issues. Real-ESRGAN employs a generator built from Residual-in-Residual Dense Blocks (RRDB) to reconstruct high-detail images. The process can be summarized as follows. Let the input feature be \(x\); the Residual Dense Block (RDB) output is then:
$$ F_{RDB}(x) = x + W_{\text{fusion}}[F_0, F_1, \ldots, F_N] $$
where \(W_{\text{fusion}}\) is a \(1 \times 1\) convolution, and \(F_N\) is the output of the \(N\)-th layer. The RRDB output combines multiple RDBs with residual scaling factors \(\alpha_i\):
$$ F_{RRDB} = x + \sum_{i=1}^{M} \alpha_i F_{RDB_i}(x) $$
This enhancement improves feature extraction for small rice panicles in UAV drone images, leading to better detection accuracy. After processing, the training set expanded to 1,536 images, while validation and test sets remained unchanged for fair evaluation.
Model Architecture: CBLP YOLO11n
We base our model on YOLO11n, a lightweight version of YOLO11 designed for efficient object detection. However, to tailor it for UAV drone-based rice panicle detection, we introduce several modifications. The original YOLO11n consists of Backbone, Neck, and Head networks. Our improvements include replacing the C3k2 modules in the Backbone with C3k2 CFCGLU, substituting the Neck’s PANet with BiFPN, designing a Lightweight Detail-Enhanced Shared Detection Head (LDSDH), and adopting the Powerful IoUv2 loss function. These changes aim to enhance feature representation, reduce computational cost, and improve detection accuracy for rice panicles in UAV drone imagery.
The C3k2 CFCGLU module integrates ConvFormer and Convolutional Gated Linear Unit (CGLU) to boost feature extraction. ConvFormer replaces self-attention with separable convolution, reducing parameters while maintaining performance. The CGLU adds a gating mechanism with depthwise convolution to emphasize important features. This module is particularly effective for capturing details of rice panicles in complex backgrounds from UAV drone images. The BiFPN (Bidirectional Feature Pyramid Network) enhances multi-scale feature fusion by adding skip connections and removing redundant nodes. It assigns adaptive weights to different input features, improving the fusion of low-level details and high-level semantics. This is vital for detecting rice panicles of varying sizes in UAV drone captures, where small and occluded targets are common. The feature fusion process in BiFPN can be expressed as a weighted sum:
$$ P_{\text{out}} = \sum_{i} w_i \cdot F_i $$
where \(w_i\) are learnable weights and \(F_i\) are input features from different scales. This dynamic weighting helps the model focus on relevant features for rice panicle detection.
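A minimal sketch of this weighted fusion, following the "fast normalized fusion" scheme of the original BiFPN (non-negative weights normalized to sum to roughly one); the feature maps and weight values below are toy data:

```python
import numpy as np

def fused(features, weights, eps=1e-4):
    """Fast normalized fusion: clamp weights to be non-negative,
    normalize them, then take the weighted sum of the features."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU on weights
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

# Two same-resolution feature maps (after resizing) and learned scalars
p_td = np.ones((64, 64)) * 2.0
p_in = np.ones((64, 64)) * 4.0
out = fused([p_td, p_in], weights=[1.0, 3.0])  # ~0.25*2 + 0.75*4 = 3.5
```

In training, the scalar weights are learnable parameters, so the network can adaptively favor whichever scale carries more useful panicle detail.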
For the detection head, we propose LDSDH to reduce complexity. It uses shared convolutions and a Detail-Enhanced Convolution (DEConv) module, which combines standard convolution with differential convolutions (e.g., central, horizontal, vertical, angular) to capture both intensity and gradient information. The DEConv operation can be formulated as:
$$ F_{\text{out}} = \text{DEConv}(F_{\text{in}}) = \sum_{i=1}^{5} F_{\text{in}} \ast K_i = F_{\text{in}} \ast \left( \sum_{i=1}^{5} K_i \right) = F_{\text{in}} \ast K_{\text{combined}} $$
where \(K_i\) represent kernels from different convolutional branches. This reparameterization reduces parameters while enhancing detail extraction, crucial for accurate panicle localization in UAV drone images.
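The reparameterization identity above (a sum of convolutions equals one convolution with the summed kernel) follows from the linearity of convolution and can be checked numerically; the five random kernels below stand in for the central, horizontal, vertical, angular, and vanilla branches:

```python
import numpy as np

def conv2d(x, k):
    """Plain 'valid' 2-D correlation, enough to check the identity."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 16))
kernels = [rng.standard_normal((3, 3)) for _ in range(5)]  # 5 branch kernels

branch_sum = sum(conv2d(x, k) for k in kernels)  # five convs, then sum
merged = conv2d(x, sum(kernels))                 # one conv with K_combined
```

Because the two results match exactly, the five branches can be folded into a single kernel at inference time, which is where the parameter savings come from.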
Additionally, we replace the CIoU loss function with Powerful IoUv2 (PIoUv2) to accelerate convergence and improve bounding box regression. PIoUv2 introduces a non-monotonic attention layer controlled by a hyperparameter \(\lambda\). The loss function is defined as:
$$ P = \frac{d_{\omega1}}{\omega_{gt}} + \frac{d_{\omega2}}{\omega_{gt}} + \frac{d_{h1}}{h_{gt}} + \frac{d_{h2}}{h_{gt}} $$
where \(\omega_{gt}\) and \(h_{gt}\) are the width and height of the ground truth box, and the \(d\) terms are the edge-to-edge distances between the predicted and ground-truth boxes: \(d_{\omega1}\) and \(d_{\omega2}\) along the width (left and right edges), \(d_{h1}\) and \(d_{h2}\) along the height (top and bottom edges). The PIoU loss is:
$$ L_{PIoU} = L_{IoU} + 1 - e^{-P^2} \quad (0 \leq L_{PIoU} \leq 2) $$
and the attention factor \(g\) is:
$$ g = e^{-P} \quad (g \in (0,1]) $$
The final PIoUv2 loss incorporates an attention function \(u(\lambda g)\):
$$ u(\lambda g) = 3 \lambda g e^{-(\lambda g)^2} $$
Thus, the total loss is:
$$ L_{PIoUv2} = u(\lambda g) L_{PIoU} = 3 (\lambda g) e^{-(\lambda g)^2} L_{PIoU} $$
This loss function mitigates issues like anchor box over-enlargement, common in complex UAV drone scenes, leading to faster and more accurate detection.
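Putting the equations above together, a direct Python transcription (boxes as `(x1, y1, x2, y2)`; the value of `lam` below is a placeholder to be tuned, not a value reported in this study):

```python
import math

def iou(b1, b2):
    """Standard IoU for (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def piou_v2(pred, gt, lam=1.3):
    """PIoUv2 as defined by the equations above; lam is the
    attention hyperparameter (placeholder default)."""
    w_gt = gt[2] - gt[0]
    h_gt = gt[3] - gt[1]
    # edge-to-edge distances, normalized by the ground-truth size
    P = (abs(pred[0] - gt[0]) / w_gt + abs(pred[2] - gt[2]) / w_gt
         + abs(pred[1] - gt[1]) / h_gt + abs(pred[3] - gt[3]) / h_gt)
    l_piou = (1.0 - iou(pred, gt)) + 1.0 - math.exp(-P ** 2)
    g = math.exp(-P)                          # attention factor, g in (0, 1]
    u = 3.0 * lam * g * math.exp(-(lam * g) ** 2)
    return u * l_piou
```

For a perfectly aligned prediction, \(P = 0\) and \(L_{IoU} = 0\), so the loss vanishes; any misalignment produces a positive, smoothly increasing penalty.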
Experimental Results and Analysis
We conducted experiments to evaluate CBLP YOLO11n using the UAV drone image dataset. Training was performed on a system with an NVIDIA GeForce RTX 3060 GPU using the PyTorch framework. Hyperparameters included an image size of 640 × 640, a batch size of 16, an initial learning rate of 0.01, and the SGD optimizer. We compared our model with the baseline YOLO11n and other state-of-the-art models, including SSD, Faster R-CNN, YOLO variants, and RT-DETR. Performance metrics included precision (P), recall (R), mAP@0.5, FLOPs, parameter count, inference speed, and model memory usage.
First, we assessed the impact of data augmentation and super-resolution on detection. As shown in Table 1, augmentation improved mAP by 0.8 percentage points, confirming its necessity for UAV drone image processing.
| Data Processing | Precision (%) | Recall (%) | mAP@0.5 (%) |
|---|---|---|---|
| No Augmentation | 86.0 | 85.6 | 91.8 |
| With Augmentation | 86.3 | 86.8 | 92.6 |
Next, we compared loss functions. Table 2 demonstrates that PIoUv2 achieved the highest precision, recall, and mAP, outperforming CIoU, GIoU, SIoU, MPDIoU, and Wise IoUv3. This aligns with our goal of efficient detection for UAV drone applications.
| Loss Function | Precision (%) | Recall (%) | mAP@0.5 (%) |
|---|---|---|---|
| CIoU | 86.3 | 86.8 | 92.6 |
| GIoU | 87.1 | 86.9 | 92.6 |
| SIoU | 87.1 | 86.7 | 93.0 |
| MPDIoU | 87.4 | 86.4 | 93.0 |
| Wise IoUv3 | 87.1 | 86.3 | 92.8 |
| Powerful IoUv2 | 87.7 | 87.2 | 93.2 |
Ablation studies were performed to validate each component of CBLP YOLO11n. Table 3 summarizes the results, showing that our full model achieves the best mAP of 93.9% with reduced parameters and FLOPs. The C3k2 CFCGLU module improved recall, BiFPN enhanced mAP while cutting parameters, LDSDH boosted precision with lower computation, and PIoUv2 accelerated convergence. The combination of all components yielded a balanced improvement, making it suitable for UAV drone deployment.
| Model | C3k2 CFCGLU | BiFPN | LDSDH | PIoUv2 | Precision (%) | Recall (%) | mAP@0.5 (%) | FLOPs (G) | Parameters (M) |
|---|---|---|---|---|---|---|---|---|---|
| YOLO11n | – | – | – | – | 86.3 | 86.8 | 92.6 | 6.313 | 2.582 |
| Model 1 | ✓ | – | – | – | 86.9 | 87.6 | 92.9 | 6.179 | 2.440 |
| Model 2 | – | ✓ | – | – | 88.3 | 86.9 | 93.2 | 6.277 | 1.923 |
| Model 3 | – | – | ✓ | – | 88.2 | 86.4 | 93.2 | 4.777 | 2.186 |
| Model 4 | – | – | – | ✓ | 86.8 | 86.5 | 93.1 | 6.143 | 1.780 |
| Model 5 | – | ✓ | ✓ | – | 88.4 | 86.3 | 93.0 | 4.950 | 1.676 |
| Model 6 | ✓ | ✓ | ✓ | – | 87.2 | 86.8 | 93.4 | 4.816 | 1.534 |
| Model 7 | – | ✓ | ✓ | ✓ | 88.6 | 86.7 | 93.5 | 4.950 | 1.676 |
| CBLP YOLO11n | ✓ | ✓ | ✓ | ✓ | 88.2 | 87.9 | 93.9 | 4.816 | 1.534 |
Comparative analysis with other models is presented in Table 4. CBLP YOLO11n achieves the highest mAP of 93.9% and the smallest memory usage of 3.78 MB, outperforming SSD, Faster R-CNN, YOLO variants, and RT-DETR. This highlights its efficiency for UAV drone-based applications, where low memory and high accuracy are critical.
| Model | mAP@0.5 (%) | FLOPs (G) | Parameters (M) | Inference Speed (fps) | Memory Usage (MB) |
|---|---|---|---|---|---|
| SSD | 89.4 | 30.428 | 23.746 | 40.30 | 91.67 |
| Faster R-CNN | 87.1 | 137.216 | 41.348 | 17.60 | 315.79 |
| YOLOv5n | 91.5 | 5.973 | 2.216 | 82.60 | 4.47 |
| YOLOv7-tiny | 90.6 | 13.022 | 6.007 | 74.60 | 11.72 |
| YOLOv8n | 92.4 | 8.085 | 3.006 | 88.90 | 5.99 |
| RT-DETR | 86.3 | 125.976 | 40.134 | 25.80 | 63.11 |
| YOLOv9t | 89.3 | 10.693 | 2.617 | 25.30 | 5.83 |
| YOLOv10n | 91.2 | 8.568 | 2.763 | 85.50 | 5.55 |
| YOLO11n | 92.6 | 6.313 | 2.582 | 122.44 | 5.27 |
| YOLO12n | 92.3 | 5.819 | 2.509 | 81.40 | 5.23 |
| CBLP YOLO11n | 93.9 | 4.816 | 1.534 | 103.81 | 3.78 |
We also evaluated counting accuracy by comparing predicted panicle counts with manual annotations. A linear regression analysis showed that CBLP YOLO11n achieved an R² value of 0.86, higher than the baseline’s 0.80, indicating reliable counting performance for UAV drone imagery. This is essential for yield estimation in precision agriculture.
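The counting evaluation reduces to comparing per-image predicted counts with manual counts. The study fits a regression line; as a simple stand-in, the sketch below (with toy counts, not the study's data) scores predictions against the counts directly via the coefficient of determination:

```python
import numpy as np

def r_squared(pred, true):
    """Coefficient of determination for predicted vs. manual counts."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

manual = [12, 18, 25, 30, 41]   # toy manual annotations per image
model = [13, 17, 24, 32, 40]    # toy model-predicted counts
score = r_squared(model, manual)
```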
Discussion and Implications
The proposed CBLP YOLO11n model demonstrates significant advantages for UAV drone-based rice panicle detection. By integrating lightweight modules and advanced loss functions, it addresses common challenges in agricultural UAV drone applications, such as small target detection and computational constraints. The use of UAV drones for data acquisition allows scalable monitoring, and our model’s low memory footprint enables deployment on embedded systems onboard UAV drones. This facilitates real-time in-field analysis, reducing reliance on cloud processing and enhancing operational efficiency.
In practice, UAV drone surveys can cover large rice fields quickly, generating vast amounts of imagery. CBLP YOLO11n’s efficiency ensures timely processing, providing farmers with accurate panicle counts for yield prediction. The model’s robustness to occlusions and varying lighting conditions, as seen in our experiments, makes it suitable for diverse environments captured by UAV drones. Future work could involve extending the model to other crops or integrating multispectral UAV drone data for enhanced feature extraction. Additionally, on-device optimization for UAV drone hardware could further improve speed and energy efficiency.
Conclusion
In this study, we developed CBLP YOLO11n, a lightweight detection model for rice panicles using UAV drone imagery. Our approach combines enhanced feature extraction with C3k2 CFCGLU, efficient multi-scale fusion via BiFPN, a streamlined detection head with LDSDH, and improved bounding box regression using the Powerful IoUv2 loss. Experimental results show that the model achieves a precision of 88.2%, recall of 87.9%, and mAP of 93.9%, with a 40.6% reduction in parameters and a 23.7% reduction in FLOPs compared to YOLO11n. It outperforms other state-of-the-art models in accuracy and memory efficiency, making it ideal for deployment on resource-limited UAV drones. The success of this work highlights the potential of UAV drone technology combined with advanced deep learning for agricultural monitoring, paving the way for automated and precise crop yield estimation in real-world settings. As UAV drone usage expands in agriculture, lightweight models like CBLP YOLO11n will play a crucial role in enabling sustainable and data-driven farming practices.
