The efficient and precise assessment of disease severity is a cornerstone of modern precision agriculture, enabling timely interventions, optimized resource application, and accurate yield loss predictions. Among the major oilseed crops, sunflower (Helianthus annuus L.) holds significant value due to its adaptability to marginal soils. However, its cultivation is threatened by Sclerotinia Head Rot, a devastating disease caused by the fungus Sclerotinia sclerotiorum. This pathogen leads to seed rot, reduced yield, and diminished oil quality. Traditional field scouting methods for assessing disease severity are labor-intensive, subjective, and lack scalability. Therefore, there is a pressing need for automated, objective, and high-throughput monitoring techniques.
Recent advancements in computer vision, particularly deep learning-based object detection, offer promising solutions for agricultural disease monitoring. While significant progress has been made in detecting foliar diseases in various crops using ground-based imagery, the unique challenge of monitoring diseases on the downward-facing heads of mature sunflowers remains under-explored. Ground-based imaging is often impractical for this task due to occlusion and accessibility issues. Unmanned drone technology emerges as a superior platform, capable of efficiently capturing high-resolution, top-down imagery over large field areas, thus providing an ideal data source for head rot assessment.

This study addresses the critical gap in automated sunflower head rot severity assessment by proposing a novel, lightweight detection model named YOLOv12n-RCL. The model is built upon the efficient YOLOv12n architecture and is specifically tailored for processing imagery captured by unmanned drones. We integrated several key enhancements to overcome the challenges of detecting variable-sized lesions, distinguishing subtle inter-class differences, and maintaining robustness against complex field backgrounds like soil and weeds. Our contributions focus on enhancing feature extraction, improving feature map reconstruction, and optimizing the detection head for both accuracy and efficiency, making it suitable for potential deployment on edge computing devices.
Materials and Methodology
1. Data Acquisition and Dataset Construction
High-resolution visible-light imagery was acquired from a sunflower field trial using a DJI M300 multi-rotor unmanned drone equipped with a Zenmuse P1 camera. The flight mission was conducted at an altitude of 30 meters with high overlap rates to ensure comprehensive coverage. The captured imagery had a native resolution of 8192 × 5460 pixels.
To create a manageable dataset for model training and to reduce background interference, the original large-scale orthomosaic was partitioned into smaller image tiles. Non-target and blurry tiles were manually removed, resulting in a curated set of 1,300 image patches containing sunflower heads. Each sunflower head instance was then annotated with a bounding box and classified into one of five severity grades based on the proportion of the head area covered by visible rot lesions:
- Grade 0: No visible lesions.
- Grade 1: Lesion area > 0% and ≤ 25%.
- Grade 2: Lesion area 26% – 50%.
- Grade 3: Lesion area 51% – 75%.
- Grade 4: Lesion area ≥ 76%.
The lesion proportion $$P_b$$ for grading was calculated as:
$$P_b = \frac{P_t}{P_h}$$
where $$P_t$$ is the number of pixels in the lesion region and $$P_h$$ is the total number of pixels in the head region, measured from the manually annotated masks.
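As a concrete illustration of this grading rule, the sketch below computes $$P_b$$ from binary head and lesion masks and maps it to Grades 0–4. The mask representation and function name are our own illustrative choices, not part of the annotation tooling.

```python
import numpy as np


def severity_grade(head_mask: np.ndarray, lesion_mask: np.ndarray) -> int:
    """Compute P_b = P_t / P_h from binary masks and map it to severity Grades 0-4."""
    p_h = int(head_mask.sum())    # P_h: total pixels in the head region
    p_t = int(lesion_mask.sum())  # P_t: pixels in the lesion region
    if p_h == 0:
        raise ValueError("Empty head mask")
    if p_t == 0:
        return 0                  # Grade 0: no visible lesions
    p_b = p_t / p_h
    if p_b <= 0.25:
        return 1                  # Grade 1
    if p_b <= 0.50:
        return 2                  # Grade 2
    if p_b <= 0.75:
        return 3                  # Grade 3
    return 4                      # Grade 4
```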
The annotated dataset was randomly split into training (80%), validation (10%), and test (10%) sets. To enhance model generalization and prevent overfitting, data augmentation techniques including rotation, translation, color jittering, and brightness adjustment were applied exclusively to the training set images. The final distribution of labels across sets is summarized in Table 1.
| Disease Grade | Original Training Set | Augmented Training Set | Validation Set | Test Set |
|---|---|---|---|---|
| 0 | 7,108 | 28,432 | 1,012 | 900 |
| 1 | 6,517 | 26,068 | 855 | 872 |
| 2 | 6,633 | 26,532 | 881 | 767 |
| 3 | 6,400 | 25,600 | 860 | 847 |
| 4 | 6,817 | 27,364 | 810 | 873 |
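The training-only augmentation described above (rotation, translation, color jittering, brightness adjustment) can be sketched with the Albumentations library as follows; the library choice and all parameter values are illustrative assumptions, since only the transform types are specified in the text.

```python
import albumentations as A

# Illustrative training-set augmentation pipeline; parameter values are assumptions.
train_aug = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                                # rotation
        A.Affine(translate_percent=0.1, p=0.5),                   # translation
        A.ColorJitter(p=0.5),                                     # color jittering
        A.RandomBrightnessContrast(brightness_limit=0.2, p=0.5),  # brightness adjustment
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Example call (image: HxWx3 array, bboxes: YOLO-format boxes, class_labels: grade indices):
# augmented = train_aug(image=image, bboxes=bboxes, class_labels=class_labels)
```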
2. The Proposed YOLOv12n-RCL Architecture
We selected YOLOv12n as our baseline model due to its favorable balance between speed and accuracy for embedded applications. To tailor it for the specific challenges of drone-based head rot detection, we introduced three major modifications, leading to the YOLOv12n-RCL (Receptive field, Coordinate attention, CARAFE, Lightweight head) model.
2.1. Enhanced Feature Extraction with C3K2-RC Modules
The original C3K2 modules in the backbone and neck networks were replaced with our proposed C3K2-RC modules. This redesign integrates two components to boost feature discrimination:
- Receptive Field Attention Convolution (RFAConv): This module replaces standard convolutions within the bottleneck. It employs a dual-branch structure where one branch generates a spatial attention map by aggregating features via average pooling and group convolution, while the other branch extracts spatial features using group convolution. The final output is an element-wise product of the attention weights and the spatial features, allowing the model to adaptively focus on important regions within an expanded receptive field. The process can be described as:
$$ \begin{cases}
f_w = \text{Softmax}\left(g^{1\times1}_{conv}(\text{AvgPool}(X))\right) \\
f_e = \text{ReLU}\left(\text{BatchNorm}\left(g^{3\times3}_{conv}(X)\right)\right) \\
f_x = f_w \odot f_e
\end{cases} $$
where $$X$$ is the input feature map, $$g^{k\times k}_{conv}$$ denotes a $$k \times k$$ group convolution, and $$\odot$$ is element-wise multiplication (an illustrative sketch of this branch is given at the end of this subsection).
- Coordinate Attention (CA) Mechanism: Following the RFAConv-enhanced bottleneck, a CA block is appended. CA simultaneously captures cross-channel information and long-range spatial dependencies by performing global pooling along the height and width dimensions. It encodes precise positional information into the channel attention, which is crucial for accurately locating irregular lesion boundaries.
The integration of RFAConv and CA forms the C3K2-RC module, significantly strengthening the model’s ability to extract discriminative features for the different rot severity grades under the complex field conditions captured by the unmanned drone.
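To make the RFAConv branch concrete, the following PyTorch sketch implements the three equations above (attention weights from pooled features, BN + ReLU spatial features, and their element-wise product). It is an illustrative re-implementation; the pooling window, group count, and softmax axis are our own assumptions rather than the released module code.

```python
import torch
import torch.nn as nn


class RFAConvSketch(nn.Module):
    """Illustrative sketch of the RFAConv attention branch described by the equations above."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        # channels must be divisible by groups
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)                  # AvgPool(X)
        self.weight_conv = nn.Conv2d(channels, channels, 1, groups=groups)            # g^{1x1}_conv
        self.feat_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)   # g^{3x3}_conv
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.softmax = nn.Softmax(dim=1)  # normalization axis is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_w = self.softmax(self.weight_conv(self.pool(x)))  # f_w: receptive-field attention weights
        f_e = self.act(self.bn(self.feat_conv(x)))          # f_e: spatial features
        return f_w * f_e                                    # f_x = f_w ⊙ f_e
```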
2.2. Improved Feature Reconstruction with CARAFE
The standard nearest-neighbor upsampling operator in the neck was replaced with the Content-Aware Reassembly of FEatures (CARAFE) operator. Unlike fixed-kernel upsampling, CARAFE dynamically generates adaptive upsampling kernels based on the content of the input feature map. For a given location, it predicts a kernel from a larger receptive field and uses it to reassemble features from the corresponding local region. This content-aware mechanism enables better recovery of fine-grained details and sharper edges of the head rot lesions, which is vital for distinguishing between adjacent severity grades where boundary information is key.
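For reference, a simplified content-aware upsampling sketch in the spirit of CARAFE is shown below: a kernel-prediction step (channel compression, content encoding, pixel shuffle, softmax) followed by feature reassembly with the predicted kernels. The channel widths and kernel sizes are illustrative defaults, not the exact configuration used in our neck.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CARAFESketch(nn.Module):
    """Simplified content-aware upsampler (illustrative, not the reference implementation):
    predict per-location reassembly kernels from the input content, then apply them."""

    def __init__(self, channels: int, scale: int = 2, up_kernel: int = 5, mid_channels: int = 64):
        super().__init__()
        self.scale, self.k = scale, up_kernel
        self.compress = nn.Conv2d(channels, mid_channels, 1)                 # channel compressor
        self.encoder = nn.Conv2d(mid_channels, scale * scale * up_kernel * up_kernel,
                                 kernel_size=3, padding=1)                   # content encoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s, k = self.scale, self.k
        # 1) Kernel prediction: (B, s*s*k*k, H, W) -> (B, k*k, sH, sW), softmax over the k*k axis.
        kernels = F.softmax(F.pixel_shuffle(self.encoder(self.compress(x)), s), dim=1)
        # 2) Feature reassembly: each output pixel is a weighted sum of the k x k neighbourhood
        #    of its source pixel in the low-resolution map.
        patches = F.unfold(x, kernel_size=k, padding=k // 2)                 # (B, C*k*k, H*W)
        patches = patches.view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")     # replicate per block
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)                   # (B, C, sH, sW)


# Example: CARAFESketch(256)(torch.randn(1, 256, 40, 40)) -> tensor of shape (1, 256, 80, 80)
```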
2.3. Lightweight and Effective Detection Head: LSCSBD
The original decoupled head was replaced with a Lightweight Shared Convolutional head with Separated Batch Normalization Detection (LSCSBD). This design reduces computational complexity and parameter count by sharing convolutional layers across different feature pyramid levels while maintaining separate paths for classification and regression. Crucially, it employs separated Batch Normalization and activation layers (BnAct) to preserve and enhance the representational capacity for detecting small-scale lesion features. This modification leads to a more efficient model better suited for real-time inference on hardware with limited computational resources, a common requirement for processing data streams from an unmanned drone in real-time.
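The core idea of the LSCSBD head can be sketched as follows: one set of convolution weights shared across all pyramid levels, with a separate BatchNorm + activation (BnAct) per level so that each scale keeps its own statistics. Layer depth, channel width, and output layout here are illustrative assumptions, not the exact head configuration.

```python
import torch
import torch.nn as nn


class SharedConvHeadSketch(nn.Module):
    """Sketch of a shared-convolution detection head with level-specific BnAct (illustrative)."""

    def __init__(self, channels: int, num_classes: int, num_levels: int = 3):
        super().__init__()
        self.shared_conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # shared weights
        self.bn_act = nn.ModuleList(
            [nn.Sequential(nn.BatchNorm2d(channels), nn.SiLU()) for _ in range(num_levels)]
        )
        self.cls_pred = nn.Conv2d(channels, num_classes, 1)   # classification branch (5 grades)
        self.reg_pred = nn.Conv2d(channels, 4, 1)             # bounding-box regression branch

    def forward(self, feats):
        # feats: list of feature maps, one per pyramid level (e.g. P3, P4, P5)
        outputs = []
        for i, f in enumerate(feats):
            f = self.bn_act[i](self.shared_conv(f))           # shared conv, level-specific BnAct
            outputs.append((self.cls_pred(f), self.reg_pred(f)))
        return outputs
```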
3. Experimental Setup and Evaluation Metrics
The model was trained for 200 epochs using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.935, and weight decay of 0.0005. The input image size was set to 1024 × 1024 pixels with a batch size of 16. Performance was evaluated using standard object detection metrics: Precision (P), Recall (R), mean Average Precision at IoU = 0.5 (mAP@0.5), and mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95). Model complexity was assessed via the number of parameters, floating-point operations (FLOPs), and the size of the saved weights file. Detection speed was measured in frames per second (FPS).
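For reproducibility, this training configuration can be expressed through the Ultralytics training API roughly as below. The YAML file names are placeholders, and the YOLOv12n-RCL architecture is assumed to be supplied as a custom model definition; this is a sketch of the setup, not a verbatim training script.

```python
from ultralytics import YOLO

# Illustrative training call; model/data YAML names are hypothetical placeholders.
model = YOLO("yolov12n-rcl.yaml")          # custom architecture definition (assumed)
model.train(
    data="sunflower_head_rot.yaml",        # dataset config with the 5 severity classes (assumed)
    epochs=200,
    imgsz=1024,
    batch=16,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.935,
    weight_decay=0.0005,
)
```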
Results and Analysis
1. Ablation Study on Model Components
A comprehensive ablation study was conducted to validate the contribution of each proposed module. The results, presented in Table 2 (■ indicates the module is included, □ that it is not), clearly demonstrate the incremental benefits and synergistic effects of our modifications.
| C3K2-RC | CARAFE | LSCSBD | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|---|---|
| □ | □ | □ | 79.6 | 78.0 | 81.4 | 46.2 | 2.57 | 6.5 | 5.6 |
| ■ | □ | □ | 81.1 | 77.8 | 82.2 | 48.3 | 2.59 | 6.6 | 5.7 |
| □ | ■ | □ | 79.6 | 78.4 | 81.6 | 47.0 | 2.23 | 5.8 | 4.9 |
| □ | □ | ■ | 80.2 | 78.7 | 82.7 | 48.4 | 2.44 | 6.4 | 5.3 |
| ■ | ■ | □ | 82.7 | 80.1 | 83.6 | 49.7 | 2.33 | 6.3 | 5.1 |
| ■ | □ | ■ | 83.7 | 81.2 | 84.5 | 50.1 | 2.46 | 6.5 | 5.4 |
| □ | ■ | ■ | 79.9 | 78.0 | 82.1 | 48.4 | 2.09 | 5.6 | 4.5 |
| ■ | ■ | ■ | 83.4 | 81.2 | 84.8 | 50.2 | 2.11 | 5.7 | 4.5 |
The final YOLOv12n-RCL model, integrating all three modules, achieved the best overall performance. Compared to the baseline YOLOv12n, it improved P, R, mAP@0.5, and mAP@0.5:0.95 by 3.8, 3.2, 3.4, and 4.0 percentage points, respectively. Remarkably, this significant gain in accuracy was accompanied by a reduction in model complexity: parameters decreased by 17.9%, FLOPs by 12.3%, and model size by 19.6%. This demonstrates an effective trade-off, yielding a model that is both more accurate and more efficient.
2. Comparison with State-of-the-Art Detectors
We compared YOLOv12n-RCL against a wide range of popular object detection models on our sunflower head rot test set. The results, summarized in Table 3, highlight the superiority of our approach.
| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 70.1 | 65.6 | 68.8 | 40.2 | 137.1 | 370.2 | 114.2 |
| SSD | 73.1 | 66.4 | 72.1 | 42.3 | 24.0 | 61.1 | 105.2 |
| CenterNet | 72.1 | 67.1 | 75.6 | 43.7 | 32.7 | 70.2 | 90.6 |
| RT-DETR-R50 | 71.9 | 68.2 | 78.7 | 44.1 | 42.8 | 130.5 | 96.7 |
| YOLOv5s | 78.9 | 77.3 | 81.2 | 46.4 | 7.26 | 16.4 | 13.7 |
| YOLOv7-tiny | 80.1 | 81.0 | 80.5 | 45.5 | 6.02 | 13.2 | 11.7 |
| YOLOv8n | 79.1 | 78.0 | 81.4 | 46.7 | 3.01 | 8.2 | 6.3 |
| YOLOv9t | 76.5 | 74.2 | 79.5 | 44.7 | 1.73 | 6.4 | 4.3 |
| YOLOv10n | 79.2 | 78.7 | 80.4 | 45.1 | 2.69 | 8.2 | 5.8 |
| YOLOv11n | 78.3 | 78.8 | 80.6 | 46.0 | 2.61 | 6.6 | 5.7 |
| YOLOv12n (Baseline) | 79.6 | 78.0 | 81.4 | 46.2 | 2.57 | 6.5 | 5.6 |
| YOLOv12n-RCL (Ours) | 83.4 | 81.2 | 84.8 | 50.2 | 2.11 | 5.7 | 4.5 |
Our model outperformed all competitors across the primary accuracy metrics (P, R, mAP@0.5, mAP@0.5:0.95). While YOLOv9t has slightly fewer parameters, its accuracy is substantially lower. YOLOv12n-RCL strikes the best balance, delivering state-of-the-art detection performance while maintaining a highly compact and efficient form factor, a critical advantage for processing data from an unmanned drone.
3. Confusion Matrix and Misclassification Analysis
A normalized confusion matrix was generated to analyze the model’s per-grade performance and to identify specific points of confusion. It showed balanced recall across all five grades (~83% for Grade 0, ~81% for Grade 1, ~79% for Grade 2, ~81% for Grade 3, and ~82% for Grade 4). The highest misclassification rate occurred between the adjacent severity Grades 2 and 3, which is understandable given the sometimes subtle difference in lesion coverage between the 26–50% and 51–75% ranges. Importantly, misclassification between non-adjacent grades was very low (below 10%), confirming the model’s strong discriminative capability.
4. Edge Device Deployment and Real-World Application
To evaluate practical utility, the model was optimized and deployed on a Jetson Orin Nano edge computing device, simulating a potential onboard processing unit for an unmanned drone. The model was converted and accelerated using TensorRT with INT8 quantization. The optimized YOLOv12n-RCL achieved an inference speed of 27.5 FPS, a 47.1% improvement over the accelerated baseline model (18.7 FPS). This frame rate is sufficient for real-time analysis of video streams or rapid processing of captured images during a drone flight.
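The deployment step can be sketched with the Ultralytics export API as below; the weight and calibration file names are placeholders, and the actual engine was built and benchmarked on the Jetson Orin Nano itself.

```python
from ultralytics import YOLO

# Illustrative export of the trained model to a TensorRT engine with INT8 quantization.
# File names are hypothetical; INT8 export requires a calibration dataset YAML.
model = YOLO("yolov12n-rcl.pt")              # trained weights (assumed path)
model.export(
    format="engine",                         # TensorRT engine
    int8=True,                               # INT8 quantization
    data="sunflower_head_rot.yaml",          # calibration dataset config (assumed)
    imgsz=1024,
)
```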
In a practical test on 70 field images containing 1,298 annotated heads, the deployed YOLOv12n-RCL model demonstrated robust performance with only 10 missed detections and 8 false positives, significantly fewer than the baseline. Furthermore, the detection results from processed image tiles can be spatially stitched back together to generate a field-scale severity map. This map visually illustrates the spatial distribution of head rot across the field, providing actionable intelligence for targeted scouting or variable-rate application of fungicides, forming a complete “sense-and-act” pipeline powered by the unmanned drone platform.
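Mapping tile-level detections back into the orthomosaic frame only requires adding each tile’s top-left offset to its box coordinates; the sketch below shows this bookkeeping (the data layout is an assumption), after which the graded boxes can be rasterized or plotted as the field-scale severity map.

```python
def tile_box_to_field(box_xyxy, tile_origin_xy):
    """Shift a tile-local bounding box (x1, y1, x2, y2) by the tile's top-left corner
    (ox, oy) in the full orthomosaic so all detections share one coordinate frame."""
    ox, oy = tile_origin_xy
    x1, y1, x2, y2 = box_xyxy
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


# Hypothetical usage: tile_results holds (tile_id, box, grade) triples and tile_origins
# maps tile_id -> (ox, oy) pixel offsets recorded when the orthomosaic was partitioned.
# field_map = [(tile_box_to_field(box, tile_origins[tid]), grade)
#              for tid, box, grade in tile_results]
```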
Discussion
The proposed YOLOv12n-RCL model effectively addresses the challenge of automated sunflower head rot severity assessment using unmanned drone imagery. The integration of RFAConv and CA mechanisms (C3K2-RC) empowers the model to capture more contextual information and focus on salient lesion features, which is crucial in heterogeneous field environments. The CARAFE upsampler preserves critical detail during feature map enlargement, aiding in the precise delineation of lesion boundaries. Finally, the LSCSBD head maintains high detection fidelity for smaller lesions while reducing computational overhead.
The results confirm that a carefully designed, lightweight model can outperform larger, more generic architectures for this specific agricultural vision task. The significant reduction in model size and computational cost, coupled with the increase in accuracy, makes YOLOv12n-RCL a strong candidate for integration into real-time drone-based crop monitoring systems. This approach moves beyond simple presence/absence detection to provide quantitative severity grading, which is far more valuable for informed decision-making in crop protection and yield forecasting.
Current limitations include the model’s training on imagery from a single growth stage (mature, heads facing down) and a fixed drone altitude. Future work will involve collecting and annotating multi-temporal datasets encompassing different sunflower growth stages (e.g., flowering, seed development) and incorporating imagery from varying flight altitudes to enhance model robustness and generalizability across a wider range of operational scenarios for the unmanned drone.
Conclusion
This study presents a novel deep learning framework, YOLOv12n-RCL, for the detection and severity grading of Sclerotinia Head Rot in sunflowers using unmanned drone imagery. By innovatively integrating receptive field attention, coordinate attention, content-aware upsampling, and a lightweight detection head, the model achieves a superior balance between high accuracy and operational efficiency. Experimental results demonstrate that YOLOv12n-RCL outperforms a suite of state-of-the-art detectors, achieving a mAP@0.5 of 84.8% while reducing parameters by 17.9% and model size by 19.6% compared to its baseline. Successful deployment on an edge device at 27.5 FPS confirms its potential for real-time, in-field disease monitoring. This work provides a robust algorithmic foundation for developing intelligent scouting and precision sprayer control systems, contributing to more sustainable and efficient sunflower production through advanced unmanned drone technology.
