Semi-supervised Detection of Tea Leaf Blight from UAV Remote Sensing Imagery

The accurate and timely detection of plant diseases is paramount for ensuring crop health and yield. Among the various threats to tea plantations, Tea Leaf Blight (TLB) is a prevalent and destructive fungal disease. Early and accurate identification of TLB is crucial for implementing targeted control measures, preventing widespread damage, and reducing economic losses. Traditional methods relying on manual scouting by experts are labor-intensive, time-consuming, and often inadequate for monitoring large and remote tea-growing areas.

In recent years, the integration of Unmanned Aerial Vehicle (UAV) remote sensing with deep learning has opened new avenues for precision agriculture. UAVs equipped with high-resolution cameras can rapidly capture extensive aerial imagery of tea plantations, providing a macroscopic and efficient means of crop monitoring. Deep learning models, particularly convolutional neural networks (CNNs), have demonstrated remarkable capabilities in automatically learning and extracting discriminative features for disease recognition from such imagery. However, applying these techniques to TLB detection presents several challenges that existing methods struggle to address effectively.

Firstly, in UAV-acquired imagery, TLB-infected leaves often appear as small, densely clustered lesions against a complex background of healthy foliage, soil, and shadows. This creates a classic “small and dense object detection” problem where individual lesions are difficult to isolate and recognize. Secondly, the inherent limitations of UAV-based photography, such as motion blur, varying illumination, and the distance from the canopy, often result in low-resolution representations of the disease spots. The boundaries of these lesions are frequently blurred and ambiguous, making it hard for models to distinguish between the foreground (disease) and the background (healthy leaf or other objects). Thirdly, the prevailing paradigm for training deep learning models is fully supervised learning, which requires a large volume of meticulously annotated image data. Annotating thousands of high-resolution UAV images, marking each small and often ambiguous disease spot, is an extremely tedious, expensive, and time-consuming process that demands significant domain expertise. This reliance on massive labeled datasets constitutes a major bottleneck for practical deployment.

Semi-supervised learning (SSL) offers a promising solution to this data annotation bottleneck. SSL frameworks are designed to leverage a small set of labeled data alongside a much larger pool of unlabeled data to train high-performance models. The core idea is to use the model’s own predictions on the unlabeled data, after careful refinement, as pseudo-labels to guide further learning. This approach can dramatically reduce the annotation burden while still achieving performance competitive with fully supervised models. Therefore, the central objective of this work is to develop a robust semi-supervised object detection framework specifically tailored to overcome the challenges of detecting small, dense, and blurry TLB lesions in UAV remote sensing imagery. We aim to maximize detection accuracy while minimizing the dependency on manually annotated data.

The proposed method, named SSTLBdet (Semi-Supervised detection method for Tea Leaf Blight), is built upon the well-established teacher-student semi-supervised learning paradigm but introduces several novel components to address the specific difficulties of TLB detection. The overall framework operates in two main phases. The initial “burn-in” phase trains a base object detector (we adopt the anchor-free FCOS detector) using only the limited labeled data available, which provides a preliminary model to start the process. The iterative “self-training” phase then begins. In each iteration, the “teacher” model generates predictions on the unlabeled data. A key innovation is the Iterative Annotation strategy, which employs a dynamic metric to select the most informative and challenging unlabeled samples for pseudo-label generation, prioritizing images with uncertain, visually rich, and diversely sized lesions.

To tackle the issue of inconsistency between classification and localization scores for dense, ambiguous objects, we introduce a Weighted Joint Confidence Estimation (WJCE) module. Instead of relying on a single score, WJCE computes a composite confidence by strategically weighting the classification and localization scores, leading to more reliable pseudo-labels. The pseudo-labels filtered by WJCE are then processed by an Adaptive Sample Selection (ASS) module. ASS goes beyond simple thresholding by adaptively defining three regions: high-confidence positives, clear negatives, and an “ambiguous region” containing potential lesions that are hard to classify. This module further applies classification and localization mining strategies to learn from these challenging ambiguous samples, which are often small or blurry lesions. The refined pseudo-labels from the teacher model are then used, together with the original labeled data, to train the “student” model. The student model’s parameters are subsequently used to update the teacher model via an Exponential Moving Average (EMA), creating a virtuous cycle of improving pseudo-label quality and model performance. The entire process can be summarized by the following supervised and unsupervised loss functions guiding the student model:

$$L = L_{sup} + \lambda L_{unsup}$$

$$L_{sup} = \frac{1}{N_l} \sum_{i=1}^{N_l} [L_{cls}(x^l_i, y^l_{cls}) + L_{loc}(x^l_i, y^l_{loc})]$$

$$L_{unsup} = \frac{1}{N_{cls}}\sum_{i=1}^{N_{cls}} L_{cls}(x^u_i, \tilde{y}_{cls}) + \frac{1}{N_{loc}}\sum_{i=1}^{N_{loc}} L_{loc}(x^u_i, \tilde{y}_{loc}) + \frac{\alpha}{N_{loc}}\sum_{i=1}^{N_{loc}} L_{iou}(x^u_i, \tilde{x}^u_i)$$

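where $L_{sup}$ is the supervised loss on labeled data, $L_{unsup}$ is the unsupervised loss on pseudo-labeled data, and $\lambda$ is a balancing weight. $L_{cls}$, $L_{loc}$, and $L_{iou}$ are the classification, localization, and IoU loss terms, and $\alpha$ weights the IoU term. $N_l$ is the number of labeled samples, while $N_{cls}$ and $N_{loc}$ are the numbers of pseudo-labeled samples used for classification and localization, respectively. $\tilde{y}$ represents the pseudo-labels generated by the teacher model.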
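
The combination of the two loss terms and the EMA update of the teacher described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the representation of model weights as a dictionary of floats, and the decay values are assumptions made for the example, since the text does not state the EMA decay or the value of $\lambda$.

```python
def total_loss(l_sup, l_unsup, lam):
    """Combine the two terms as L = L_sup + lambda * L_unsup."""
    return l_sup + lam * l_unsup


def ema_update(teacher, student, decay=0.999):
    """Return the teacher parameters after one EMA step toward the student.

    teacher, student: dicts mapping parameter name -> float. These are
    hypothetical stand-ins for real model weights. decay=0.999 is an
    assumed default, not a value taken from the paper.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher moves a small step toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
print(total_loss(l_sup=1.0, l_unsup=0.5, lam=2.0))
```

A decay close to 1 makes the teacher change slowly, so a single noisy student update has little effect on the pseudo-labels produced in the next iteration.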

The performance of SSTLBdet was rigorously evaluated on a dedicated dataset of UAV remote sensing imagery collected from tea plantations. The dataset was annotated by experts and augmented using techniques like rotation, flipping, scaling, and mosaic augmentation to enhance diversity. We compared SSTLBdet against several state-of-the-art semi-supervised object detection methods under different labeling ratios (i.e., using only 1%, 5%, 10%, 20%, or 30% of the training data as labeled, with the rest as unlabeled).
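
A labeling-ratio split of this kind can be sketched as follows. The text does not describe how the labeled subset was sampled, so the seeded random shuffle, the function name `split_by_label_ratio`, and the dataset size of 1000 images are assumptions used only for illustration.

```python
import random


def split_by_label_ratio(image_ids, ratio, seed=0):
    """Split image ids into (labeled, unlabeled) lists for a given ratio.

    ratio is the labeled fraction, e.g. 0.10 for the 10% setting.
    At least one image is kept as labeled so a burn-in phase has data.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_labeled = max(1, round(len(ids) * ratio))
    return ids[:n_labeled], ids[n_labeled:]


labeled, unlabeled = split_by_label_ratio(range(1000), ratio=0.05)
print(len(labeled), len(unlabeled))  # 50 950
```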

Table 1: Comparative performance (mAP@0.5 in %) of different semi-supervised object detection methods on the UAV TLB dataset under varying labeling ratios.
Model	Labeling Ratio 1%	Labeling Ratio 5%	Labeling Ratio 10%	Labeling Ratio 20%	Labeling Ratio 30%
Faster R-CNN (Supervised)	21.19	31.81	44.05	50.41	52.04
STAC	29.56	38.60	51.58	54.58	56.36
Unbiased Teacher	43.90	51.41	59.82	60.04	61.98
Soft Teacher	43.28	58.05	62.05	64.88	66.98
DSL	46.60	53.29	65.32	69.05	71.27
FCOS (Supervised Baseline)	17.15	28.47	39.78	43.99	45.42
ARSL	48.31	57.25	72.89	73.39	77.76
SSTLBdet (Ours)	51.28	59.85	75.09	77.99	78.12

The results in Table 1 show that the proposed SSTLBdet framework outperforms all other semi-supervised methods at every labeling ratio. Notably, with only 30% labeled data, SSTLBdet achieves a mean Average Precision (mAP@0.5) of 78.12%, a 32.70 percentage point improvement over the supervised FCOS baseline trained on the same 30% of the data and 0.36 points above the recent ARSL method. This is competitive with fully supervised models trained on 100% of the labeled data, as the comparison with other TLB-specific detection methods in Table 2 indicates.
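
The margins quoted above follow directly from Table 1. The short script below recomputes them; it copies three rows of the table verbatim, and the names `MAP50` and `gains` are chosen here for illustration.

```python
RATIOS = ["1%", "5%", "10%", "20%", "30%"]

# mAP@0.5 values copied from Table 1.
MAP50 = {
    "FCOS (Supervised Baseline)": [17.15, 28.47, 39.78, 43.99, 45.42],
    "ARSL": [48.31, 57.25, 72.89, 73.39, 77.76],
    "SSTLBdet (Ours)": [51.28, 59.85, 75.09, 77.99, 78.12],
}


def gains(method, reference):
    """Percentage-point gain of `method` over `reference` at each ratio."""
    return [round(m - r, 2) for m, r in zip(MAP50[method], MAP50[reference])]


over_fcos = gains("SSTLBdet (Ours)", "FCOS (Supervised Baseline)")
over_arsl = gains("SSTLBdet (Ours)", "ARSL")
for ratio, g_fcos, g_arsl in zip(RATIOS, over_fcos, over_arsl):
    print(f"{ratio:>3}: +{g_fcos:.2f} vs FCOS, +{g_arsl:.2f} vs ARSL")
```

The gain over the supervised FCOS baseline stays above 31 points at every ratio, while the margin over ARSL ranges from 0.36 points (30%) to 4.60 points (20%).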

Table 2: Performance comparison between the proposed semi-supervised SSTLBdet (using 30% labels) and other fully supervised TLB detection methods (requiring 100% labels).
Method	Labeling Ratio	F1-Score (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
LC3Net (Fully Supervised)	100%	76.39	79.21	71.12
DDMA-YOLO (Fully Supervised)	100%	73.20	78.52	70.50
LWDNet (Fully Supervised)	100%	76.58	80.12	71.41
SSTLBdet (Ours, Semi-Supervised)	30%	71.24	78.12	69.95

Table 2 puts these results in perspective. Using only 30% of the annotation effort, the semi-supervised SSTLBdet reaches an mAP@0.5 of 78.12%, within 0.40 to 2.00 percentage points of the three fully supervised methods, and an mAP@0.5:0.95 of 69.95%, within 0.55 to 1.46 points. Its F1-Score of 71.24% trails the fully supervised methods by 1.96 to 5.34 points. The framework therefore cuts the annotation workload substantially at the cost of only a small loss in detection accuracy, which is a primary goal of employing UAVs for scalable agricultural monitoring.

To validate the contribution of each proposed component within the SSTLBdet framework, we conducted a series of ablation studies. We start with a baseline FCOS model trained only on 30% labeled data (no semi-supervision). Then, we incrementally add our core modules: the basic Semi-Supervised (SS) teacher-student framework, the Iterative Annotation (IA) module, the Weighted Joint Confidence Estimation (WJCE) module, and finally the Adaptive Sample Selection (ASS) module.

Table 3: Ablation study on the components of the proposed SSTLBdet framework (evaluated with 30% labeled data).
Semi-Supervised	Iterative Annotation	WJCE	ASS	AP@0.5 (%)	AP@0.5:0.95 (%)
×	×	×	×	58.98	45.59
✓	×	×	×	70.23	53.21
✓	✓	×	×	74.13	57.64
✓	✓	✓	×	77.29	62.77
✓	✓	✓	✓	78.12	63.32

The results in Table 3 show a progressive improvement with each component. The semi-supervised framework alone raises AP@0.5 from 58.98% to 70.23% (+11.25 points) over the supervised baseline. The Iterative Annotation module adds a further 3.90 points by selecting valuable pseudo-labels, and the WJCE module, which addresses classification-localization inconsistency, adds another 3.16 points. Finally, the ASS module, designed to handle ambiguous and small lesions, contributes a smaller gain of 0.83 points and brings AP@0.5 to its peak of 78.12%. AP@0.5:0.95 follows the same trend, rising from 45.59% to 63.32%. This step-by-step improvement supports the contribution of each proposed component to tackling the specific challenges posed by UAV-based TLB imagery.

Furthermore, we dissected the Iterative Annotation module to understand the impact of its different sample scoring strategies: Difficulty (measuring prediction uncertainty), Informativeness (measuring visual richness), and Diversity (measuring size variation). The baseline is SSTLBdet without the iterative strategy (using all pseudo-labels).

Table 4: Ablation on the scoring strategies within the Iterative Annotation module.
Difficulty	Informativeness	Diversity	5% Label mAP	10% Label mAP	20% Label mAP
×	×	×	47.84	64.39	67.26
✓	×	×	56.03	61.13	64.59
×	✓	×	55.92	60.98	62.87
×	×	✓	57.40	62.26	69.05
✓	✓	✓	59.85	75.09	77.99

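Table 4 shows that each individual scoring criterion improves on the baseline at the 5% labeling ratio (by 8.08 to 9.56 points), but the single criteria are less reliable at higher ratios: at 10% all three fall below the baseline, and at 20% only Diversity exceeds it. Combining the three criteria in the integrated metric score gives the best result at every ratio (59.85, 75.09, and 77.99). This indicates that selecting samples that are simultaneously uncertain, information-rich, and diverse, rather than relying on any one criterion, is key to efficient semi-supervised learning for this task.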
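
To make the idea of a combined selection score concrete, the sketch below ranks unlabeled images by the sum of three normalized criteria. It is a hypothetical illustration only: the text does not give the formulas behind the integrated metric, so the proxies used here (mean binary entropy for Difficulty, detection count for Informativeness, spread of box areas for Diversity), the equal weighting, and all function names are assumptions.

```python
import math
from statistics import pstdev


def difficulty(scores):
    """Mean binary entropy of detection confidences (assumed uncertainty proxy)."""
    def entropy(p):
        p = min(max(p, 1e-6), 1 - 1e-6)  # avoid log(0)
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(entropy(p) for p in scores) / len(scores) if scores else 0.0


def diversity(areas):
    """Spread of predicted box areas (assumed size-variation proxy)."""
    return pstdev(areas) if len(areas) > 1 else 0.0


def min_max(values):
    """Rescale a non-empty list to [0, 1]; constant lists map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def select_images(images, k):
    """Return the ids of the k images with the highest combined score.

    images: non-empty list of dicts with keys "id", "scores", "areas".
    Informativeness is approximated by the detection count (assumed proxy).
    """
    d = min_max([difficulty(im["scores"]) for im in images])
    i = min_max([float(len(im["scores"])) for im in images])
    v = min_max([diversity(im["areas"]) for im in images])
    ranked = sorted(
        zip(images, d, i, v), key=lambda t: t[1] + t[2] + t[3], reverse=True
    )
    return [im["id"] for im, *_ in ranked[:k]]


images = [
    {"id": "a", "scores": [0.99, 0.98], "areas": [100.0, 100.0]},
    {"id": "b", "scores": [0.5, 0.6, 0.55], "areas": [50.0, 400.0, 900.0]},
    {"id": "c", "scores": [0.9], "areas": [200.0]},
]
print(select_images(images, k=1))  # ['b']
```

Image `b` is selected because it has the most uncertain confidences, the most detections, and the widest range of box sizes, so it tops all three criteria at once.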

The core of the WJCE module is the weighted fusion of classification ($S_{cls}$) and localization ($S_{loc}$) scores. The formula for the final joint confidence $S$ is:

$$S = (S_{cls})^a \cdot (S_{loc})^b$$
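
The fusion formula can be written in a few lines of Python. This is a minimal sketch: the function name `joint_confidence` and the input check are assumptions, and the default exponents are the values this section reports for TLB detection ($a=0.8$, $b=1.2$).

```python
def joint_confidence(s_cls, s_loc, a=0.8, b=1.2):
    """Weighted joint confidence S = s_cls**a * s_loc**b for scores in [0, 1]."""
    if not (0.0 <= s_cls <= 1.0 and 0.0 <= s_loc <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return (s_cls ** a) * (s_loc ** b)


# Two candidates with swapped scores: the better-localized one ranks higher.
well_localized = joint_confidence(s_cls=0.6, s_loc=0.9)
poorly_localized = joint_confidence(s_cls=0.9, s_loc=0.6)
print(well_localized > poorly_localized)  # True
```

With $b > a$, a low localization score lowers the joint confidence more than an equally low classification score does, which is why the first candidate is preferred.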

Through empirical analysis, we found the best-performing exponents for TLB detection to be $a=0.8$ and $b=1.2$. Because both scores lie in $[0, 1]$, the larger exponent on $S_{loc}$ penalizes poor localization more strongly than poor classification, giving slightly more importance to localization quality, which is critical for precise detection of small, dense lesions. The ASS module uses adaptive thresholds ($\tau_{neg}$ and $\tau_{pos}$) to partition pixels or proposals into negative, ambiguous, and positive sets. The positive threshold $\tau_{pos}$ is calculated in a class-adaptive manner to be more sensitive:

$$\tau_{pos} = \left( \frac{\sum_{i=1}^{N_{pos}} \mathbb{1}\{\tilde{S}_i = 1\} \, p_i }{N_{pos}} \right)^\gamma \cdot \tau$$

where $\tilde{S}_i$ is the IoU score, $p_i$ is the classification probability, $N_{pos}$ is the number of positive pixels, $\gamma$ is a focusing parameter (set to 0.7), and $\tau$ is a base threshold.
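
The threshold formula and the three-way split can be sketched as follows. This is an illustrative reading rather than the authors' code: the indicator is represented as a boolean flag per candidate, the base threshold `tau=0.5` and the value of `tau_neg` are made-up example values, the fallback for an empty positive set is an assumption, and so is the choice of which boundary is inclusive.

```python
def adaptive_pos_threshold(probs, is_positive, tau, gamma=0.7):
    """tau_pos = (mean classification probability over positives) ** gamma * tau."""
    pos = [p for p, flag in zip(probs, is_positive) if flag]
    if not pos:
        return tau  # assumed fallback when no candidate is flagged positive
    return (sum(pos) / len(pos)) ** gamma * tau


def partition(confidences, tau_neg, tau_pos):
    """Split candidate indices into negative, ambiguous, and positive sets."""
    neg = [i for i, s in enumerate(confidences) if s < tau_neg]
    amb = [i for i, s in enumerate(confidences) if tau_neg <= s < tau_pos]
    pos = [i for i, s in enumerate(confidences) if s >= tau_pos]
    return neg, amb, pos


# Example values are hypothetical.
tau_pos = adaptive_pos_threshold(
    probs=[0.9, 0.8, 0.2], is_positive=[True, True, False], tau=0.5
)
neg, amb, pos = partition([0.05, 0.3, 0.7], tau_neg=0.1, tau_pos=tau_pos)
print(neg, amb, pos)  # [0] [1] [2]
```

Because the mean probability is at most 1 and $\gamma > 0$, the computed $\tau_{pos}$ never exceeds the base threshold $\tau$; lower average confidence on a class lowers its positive threshold, which is what makes the selection more sensitive.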

In conclusion, this work presents SSTLBdet, a novel semi-supervised object detection framework specifically designed for the challenging task of detecting Tea Leaf Blight from UAV remote sensing imagery. By integrating an Iterative Annotation strategy for intelligent sample selection, a Weighted Joint Confidence Estimation module to resolve score inconsistencies for dense objects, and an Adaptive Sample Selection module to learn from ambiguous and small lesions, the proposed method addresses the core difficulties of this application. Experiments on a dedicated UAV dataset demonstrate that SSTLBdet outperforms existing semi-supervised detection methods across all tested labeling ratios. It achieves detection accuracy comparable to state-of-the-art fully supervised models while requiring only a fraction (e.g., 30%) of the manual annotation effort. This research underscores the potential of combining advanced semi-supervised learning techniques with UAV technology for efficient, accurate, and scalable crop disease monitoring in precision agriculture. The principles and modules developed here could be adapted to other agricultural object detection tasks involving small, dense, or ambiguous targets in UAV imagery, further amplifying the impact of autonomous aerial systems in smart farming.