Chilli pepper cultivation is a cornerstone of the agricultural economy in Guizhou Province, China. Traditional crop management, particularly on the region's fragmented, mountainous plots, often relies on inefficient water and fertilizer practices. The result is suboptimal nutrient uptake, reduced yields, and constrained industry growth. A critical bottleneck is the lack of efficient, scalable methods for monitoring soil moisture content (SMC) across these dispersed fields. Conventional techniques such as soil moisture stations or Time-Domain Reflectometry (TDR) are costly and provide only point measurements with limited spatial coverage, which makes them impractical for field-scale assessment and hinders precise irrigation scheduling. Unmanned aerial vehicle (UAV) remote sensing offers a practical alternative. UAV platforms equipped with hyperspectral sensors are flexible to deploy, provide high spatial and temporal resolution, and capture detailed spectral information from crop canopies. Coupled with machine learning algorithms, this technology enables non-destructive, dynamic monitoring of crop physiological status, including indirect estimation of root-zone soil moisture. This study integrates UAV-based hyperspectral data with machine learning to develop a model for the dynamic inversion of SMC in mountainous chilli pepper fields, providing a technological pathway towards intelligent water management.

The research was conducted in Dongfeng Village, Liupanshui City, Guizhou Province (26°12′N, 105°25′E), an area with a humid subtropical monsoon climate and an annual precipitation of 1200–1500 mm. The terrain is typical of the region's sloped and fragmented farmland. The field experiment comprised 24 treatment plots, combining three irrigation levels (50%, 65%, and 80% of field capacity) with eight fertilization treatments. The UAV platform was a DJI M350 RTK fitted with an LR1601-IRIS push-broom hyperspectral imager covering the 384–1025 nm range at a spectral resolution of 2.7 nm. Nine flight campaigns were carried out across three key growth stages: seedling, initial flowering, and early fruiting. Flights were conducted at 30 m altitude under clear skies between 11:00 and 13:00, producing imagery with a ground sample distance of ~1.03 cm. Concurrently with each UAV flight, in-situ SMC was measured at a depth of 20 cm using MiniTrase TDR probes, yielding 216 synchronized pairs of canopy reflectance and soil moisture (24 plots × 9 campaigns).
The core methodological challenge is to extract, from the high-dimensional hyperspectral data, the features most sensitive to the soil water status expressed in the chilli pepper canopy. The process involved two main steps: spectral feature selection and model construction. First, standard Vegetation Indices (VIs) were calculated, and their association with measured SMC was evaluated using Grey Relational Analysis (GRA) and Pearson correlation. The indices with the highest and most consistent correlations were selected. In parallel, the reflectance values of all 256 original spectral bands were analyzed. GRA served as an initial screen, retaining the top 50% of bands ranked by their grey relational grade with SMC. A Random Forest (RF) algorithm was then applied to this subset to compute an importance score for each band, from which the most sensitive spectral features were selected. The selected VIs and spectral bands were combined to form the feature set used as model input. The general workflow is summarized in the following conceptual formula, where \( R_{\lambda} \) is the reflectance at wavelength \( \lambda \) and \( f_{model} \) is the machine learning function:
$$ SMC_{estimated} = f_{model}(VI_1, VI_2, \dots, VI_n, R_{\lambda_1}, R_{\lambda_2}, \dots, R_{\lambda_m}) $$
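The GRA screening step can be illustrated with a minimal sketch. The text does not state the normalization or the distinguishing coefficient used, so min-max normalization and ρ = 0.5 (a common default) are assumptions here, as are the function names and the toy reflectance values. The subsequent RF importance ranking is not shown.

```python
def min_max(seq):
    """Scale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(seq), max(seq)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in seq]


def grey_relational_grades(reference, candidates, rho=0.5):
    """Grey relational grade of each candidate series against the reference.

    rho is the distinguishing coefficient (0.5 is an assumed default).
    """
    ref = min_max(reference)
    deltas = [[abs(r - c) for r, c in zip(ref, min_max(cand))]
              for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:  # every candidate matches the reference exactly
        return [1.0] * len(candidates)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def top_fraction(grades, fraction=0.5):
    """Indices of the highest-graded candidates (at least one is kept)."""
    n_keep = max(1, int(len(grades) * fraction))
    order = sorted(range(len(grades)), key=lambda i: grades[i], reverse=True)
    return sorted(order[:n_keep])


# Toy data (hypothetical): SMC in % and reflectance of four bands per plot.
smc = [18.0, 22.0, 25.0, 30.0]
bands = [
    [0.10, 0.14, 0.17, 0.22],  # follows SMC closely
    [0.30, 0.29, 0.31, 0.30],  # nearly flat
    [0.50, 0.40, 0.35, 0.20],  # decreases with SMC
    [0.21, 0.25, 0.27, 0.33],  # follows SMC
]
grades = grey_relational_grades(smc, bands)
print("grades:", [round(g, 3) for g in grades])
print("bands kept:", top_fraction(grades, 0.5))
```

With min-max normalization, a band that falls as SMC rises receives a low grade even though it is informative, which is one reason a second ranking stage such as RF importance is useful after the screen.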
Five machine learning algorithms were trained and compared for the SMC inversion task: Partial Least Squares Regression (PLSR), L2-regularized linear regression (ridge regression, abbreviated L2), Decision Tree (DT), Random Forest (RF), and Categorical Boosting (CatBoost). The 216 samples were randomly split into a training set (151 samples, about 70%) and an independent test set (65 samples, about 30%). Model performance was evaluated using the coefficient of determination (R²), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE).
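As a minimal sketch of the evaluation protocol, the split sizes and the three metrics can be written in plain Python. The random seed, the function names, and the example values are illustrative assumptions; the study's actual split is not reproduced here.

```python
import math
import random


def split_indices(n_samples, n_train, seed=42):
    """Randomly partition sample indices into train and test lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary assumption
    return idx[:n_train], idx[n_train:]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (here % SMC)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error in %, assuming no zero targets."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


train_idx, test_idx = split_indices(216, 151)
print(len(train_idx), len(test_idx))

# Hypothetical measured vs. predicted SMC (%) for three test plots.
measured = [20.0, 25.0, 30.0]
predicted = [22.0, 25.0, 27.0]
print(r2_score(measured, predicted), rmse(measured, predicted),
      mape(measured, predicted))
```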
The feature selection process yielded a highly informative set of predictors. Among the Vegetation Indices, the Enhanced Vegetation Index (EVI), Difference Vegetation Index (DVI), and Ratio Vegetation Index 2 (RVI2) demonstrated the strongest and most stable correlations with SMC across growth stages. The formulas for these key indices are:
$$ EVI = 2.5 \times \frac{R_{NIR} - R_{Red}}{R_{NIR} + 6 \times R_{Red} - 7.5 \times R_{Blue} + 1} $$
$$ DVI = R_{NIR} - R_{Red} $$
$$ RVI2 = \frac{R_{NIR}}{R_{Green}} $$
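The three indices translate directly into code. The sketch below assumes that broadband NIR, red, green, and blue reflectances (0 to 1 scale) have already been derived from the hyperspectral cube; the band-to-region mapping is not given in the text, and the sample values are hypothetical.

```python
def evi(nir, red, blue):
    """Enhanced Vegetation Index with the coefficients from the formula above."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def dvi(nir, red):
    """Difference Vegetation Index."""
    return nir - red


def rvi2(nir, green):
    """Ratio Vegetation Index 2 (NIR over green)."""
    return nir / green


# Hypothetical mean canopy reflectance for one plot.
plot = {"nir": 0.45, "red": 0.05, "green": 0.10, "blue": 0.04}
features = {
    "EVI": evi(plot["nir"], plot["red"], plot["blue"]),
    "DVI": dvi(plot["nir"], plot["red"]),
    "RVI2": rvi2(plot["nir"], plot["green"]),
}
print(features)
```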
Concurrently, the GRA and RF-based band selection identified 36 sensitive spectral bands. The most influential individual bands were B5 (400.9 nm), B128 (926.7 nm), B1 (384.6 nm), and B33 (516.2 nm). The feature importance distribution for a subset of the top bands is summarized in Table 1.
| Band (Center Wavelength) | Feature Importance | Band (Center Wavelength) | Feature Importance |
|---|---|---|---|
| B5 (400.9 nm) | 4.70% | B29 (~505 nm) | 1.70% |
| B128 (926.7 nm) | 3.20% | B67 (~652 nm) | 1.60% |
| B1 (384.6 nm) | 2.70% | B79 (~730 nm) | 1.60% |
| B33 (516.2 nm) | 2.70% | B125 (923 nm) | 1.60% |
| B122 (911 nm) | 2.10% | B15 (429 nm) | 1.50% |
Initial modeling with raw spectral data exposed the limitations of the linear models (PLSR and L2), which could not represent the non-linear relationship between canopy reflectance and SMC. The single DT model was prone to overfitting. RF performed better, but CatBoost achieved the highest coefficient of determination, especially after further optimization of the input data (spectral smoothing with a Wiener filter and inclusion of irrigation level as a categorical variable). The final performance of the key models on the independent test set is presented in Table 2. The scatter plot of predicted versus measured SMC for the optimal CatBoost model shows points clustering tightly along the 1:1 line.
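The input optimization can be sketched as follows. This is an illustrative pure-Python version of a one-dimensional adaptive (local-statistics) Wiener filter, not the study's implementation: the window size, the edge handling (truncated windows), the noise estimate (mean of the local variances), and the irrigation label format are all assumptions.

```python
def wiener_smooth(spectrum, window=5):
    """Adaptive Wiener smoothing of a 1-D reflectance spectrum.

    Each value is pulled towards its local mean; the pull is stronger where
    the local variance is close to the estimated noise power.
    """
    if not spectrum:
        return []
    half = window // 2
    means, variances = [], []
    for i in range(len(spectrum)):
        seg = spectrum[max(0, i - half): i + half + 1]  # truncated at edges
        m = sum(seg) / len(seg)
        v = max(0.0, sum(s * s for s in seg) / len(seg) - m * m)
        means.append(m)
        variances.append(v)
    noise = sum(variances) / len(variances)  # assumed noise-power estimate
    smoothed = []
    for x, m, v in zip(spectrum, means, variances):
        if v <= noise:
            smoothed.append(m)  # locally flat: use the mean
        else:
            smoothed.append(m + (1.0 - noise / v) * (x - m))
    return smoothed


def build_feature_row(spectrum, irrigation_pct, window=5):
    """Smoothed reflectances followed by a categorical irrigation label."""
    return wiener_smooth(spectrum, window) + [f"FC{irrigation_pct}"]


# Hypothetical noisy reflectance values for a few adjacent bands.
raw = [0.041, 0.047, 0.043, 0.052, 0.049, 0.058, 0.055, 0.063]
row = build_feature_row(raw, 65)
print(row)
```

Keeping the irrigation level as a string label rather than a number lets a gradient boosting library with native categorical support treat it as a category instead of an ordered quantity.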
| Model | R² | RMSE (%) | MAPE (%) |
|---|---|---|---|
| Partial Least Squares (PLSR) | 0.489 | 3.551 | 6.578 |
| L2 Regularized Regression | 0.304 | 5.671 | 12.524 |
| Random Forest (RF) | 0.647 | 2.237 | 4.053 |
| CatBoost | 0.758 | 2.897 | 5.902 |
The selection of EVI, DVI, and RVI2 is consistent with established remote sensing practice: EVI in particular was designed to reduce saturation under moderate-to-dense vegetation cover, and all three respond to canopy biophysical parameters that change under plant water stress. The sensitive spectral bands further support the physical basis of the model. The high importance of near-infrared bands (e.g., B128 at 926.7 nm) is plausibly related to the liquid-water absorption feature centered near 970 nm, on whose shoulder these bands lie. The importance of bands in the violet-blue (B1, B5) and green (B33) regions is linked to pigment absorption and scattering processes within the leaf, which are influenced by plant turgor and water status. The performance of the CatBoost model can be attributed to its design. Unlike linear models, it can capture non-linear patterns. Compared with a single DT or the bagging-based RF, CatBoost's boosting framework iteratively corrects the errors of previous trees, its ordered boosting scheme limits target leakage, and it handles categorical features such as irrigation treatment natively. These properties suit the structured, heterogeneous agricultural dataset captured by the UAV.
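The idea that each new tree corrects the errors of the ensemble so far can be shown with a toy example. The sketch below is plain gradient boosting with one-split regression stumps on a single feature under squared error; it is not CatBoost and omits ordered boosting and categorical handling. The data, learning rate, and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump: returns (threshold, left, right)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep the right side non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially to the residuals of the current prediction."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lm, rm = fit_stump(x, residuals)
        stumps.append((thr, lm, rm))
        pred = [pi + learning_rate * (lm if xi <= thr else rm)
                for xi, pi in zip(x, pred)]
        mse_history.append(
            sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return base, stumps, mse_history


def predict(base, stumps, xi, learning_rate=0.3):
    """Sum the base value and the scaled contribution of every stump."""
    out = base
    for thr, lm, rm in stumps:
        out += learning_rate * (lm if xi <= thr else rm)
    return out


# Hypothetical vegetation index values and measured SMC (%).
vi = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
smc = [16.0, 17.0, 21.0, 24.0, 25.0, 29.0, 30.0]
base, stumps, history = boost(vi, smc)
print("training MSE after round 1:", history[0])
print("training MSE after last round:", history[-1])
```

The training error shrinks from round to round because every stump is fitted to what the previous rounds left unexplained, whereas bagging methods such as RF fit each tree independently to a resampled copy of the data.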
In conclusion, this research demonstrates a viable framework for dynamic SMC monitoring in mountainous chilli pepper fields. Canopy spectral data captured by a UAV equipped with a hyperspectral sensor, processed through a tailored feature selection pipeline, yielded a combination of vegetation indices and specific spectral bands sensitive to soil water status. In the comparative analysis, the CatBoost algorithm achieved the highest test-set R² (0.758) among the models evaluated. The study underscores the practical value of integrating UAV remote sensing with machine learning for precision agriculture. The resulting model gives farmers and agronomists a scalable, non-destructive tool for assessing soil moisture variability across fragmented landscapes, providing a data layer for irrigation decision-making and supporting sustainable water use in Guizhou's chilli pepper industry.
