
Accurate and high-resolution crop classification is fundamental for precision agriculture, food security assessment, and land use management. Traditional satellite-based remote sensing, while valuable for large-scale monitoring, often suffers from limitations such as long revisit periods, cloud contamination, and coarse spatial resolution, which hinder timely and fine-grained agricultural monitoring. The advent of unmanned drone technology has revolutionized data acquisition in agriculture, providing unprecedented flexibility, spatial resolution, and data richness. By equipping unmanned drone platforms with advanced sensors like multispectral cameras and Light Detection and Ranging (LiDAR) scanners, we can now capture complementary data modalities—detailed spectral reflectance and precise three-dimensional structural information—simultaneously and on-demand. This capability is crucial for discriminating between crop types that may appear spectrally similar but possess distinct structural characteristics, especially in complex, small-plot farming landscapes.
This study addresses the critical challenge of achieving fine-grained crop classification in heterogeneous agricultural scenes by proposing a novel deep learning framework that synergistically fuses high-resolution multispectral imagery and LiDAR-derived structural data collected via unmanned drone. We hypothesize that a deep, guided fusion of spectral and height features will significantly outperform models using single data sources or simplistic fusion strategies. Our primary objective is to develop a robust model capable of accurately delineating crop boundaries and classifying different species under real-world, complex field conditions, thereby providing a powerful tool for high-precision, time-sensitive agricultural management.
1. Data Acquisition and Feature Construction
The research was conducted in a typical agricultural region characterized by a mixed-cropping system. Data acquisition was performed using a DJI Matrice 300 RTK unmanned drone platform, ensuring high-precision geotagging. The platform was equipped with two key payloads, operated in successive flights over the same area:
- Multispectral Sensor (Sentera): Captured imagery across five narrow bands: Blue (B: 450-520nm), Green (G: 515-600nm), Red (R: 620-750nm), Red-edge (RE: 720-780nm), and Near-Infrared (NIR: 780-900nm). The flight was conducted at 90m altitude, yielding a ground sampling distance (GSD) of approximately 4.15 cm. Images were processed and radiometrically calibrated using specialized software to generate a seamless, high-resolution multispectral orthomosaic.
- LiDAR Sensor (Zenmuse L2): Collected dense 3D point clouds during a separate flight at the same altitude. The raw point cloud data underwent standard preprocessing steps, including noise filtering, ground point classification, and interpolation, to generate a Digital Surface Model (DSM) and a Digital Terrain Model (DTM). The crucial structural feature, the Canopy Height Model (CHM), was computed as the difference between the DSM and the DTM:
$$ \text{CHM}(x,y) = \text{DSM}(x,y) - \text{DTM}(x,y) $$
This CHM raster, aligned with the multispectral imagery, provides a direct measure of vegetation height at each pixel location.
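To make this step concrete, below is a minimal sketch of the CHM computation on co-registered rasters; the file names and the use of rasterio are illustrative assumptions, not the exact processing chain used in this study.

```python
import numpy as np
import rasterio  # assumed I/O library; any GDAL-based reader works

# Hypothetical file names; the DSM and DTM are assumed to be co-registered
# rasters on the same grid, as produced by the LiDAR preprocessing step.
with rasterio.open("dsm.tif") as dsm_src, rasterio.open("dtm.tif") as dtm_src:
    dsm = dsm_src.read(1).astype(np.float32)
    dtm = dtm_src.read(1).astype(np.float32)
    profile = dsm_src.profile  # reuse georeferencing for the output

chm = dsm - dtm                   # CHM(x, y) = DSM(x, y) - DTM(x, y)
chm = np.clip(chm, 0.0, None)     # negative heights are interpolation noise

profile.update(dtype="float32")
with rasterio.open("chm.tif", "w", **profile) as dst:
    dst.write(chm, 1)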
Manual annotation was performed on the high-resolution visible composite to create a pixel-wise ground truth label map with three classes: Background, Corn, and Sunflower. The aligned multispectral bands, CHM layer, and label map were then partitioned into non-overlapping patches of 272×272 pixels to create the dataset for model training and evaluation.
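A minimal sketch of the patching step follows, assuming the aligned bands, CHM, and label map are available as numpy arrays (all array names are hypothetical):

```python
import numpy as np

def to_patches(stack: np.ndarray, size: int = 272) -> np.ndarray:
    """Cut an aligned (H, W, C) stack into non-overlapping size x size patches.

    Edge pixels that do not fill a complete patch are discarded here;
    padding the raster is an equally valid choice.
    """
    h, w = stack.shape[:2]
    rows, cols = h // size, w // size
    patches = [
        stack[r * size:(r + 1) * size, c * size:(c + 1) * size]
        for r in range(rows) for c in range(cols)
    ]
    return np.stack(patches)  # (N, size, size, C)

# Hypothetical inputs: 5 multispectral bands, 1 CHM layer, 1 label map,
# all co-registered on the same pixel grid.
# data = np.dstack([bands, chm[..., None]])          # (H, W, 6)
# x_patches = to_patches(data)                       # model inputs
# y_patches = to_patches(labels[..., None])[..., 0]  # ground truth
```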
2. Methodology: The EOFFM-FusionUnet Architecture
Our proposed model, named EOFFM-FusionUnet, is built upon a dual-stream encoder-decoder architecture (a variant of Res-UNet) specifically designed for multi-modal feature extraction and fusion. The core innovation lies in two custom modules: the Elevation-Optical Feature Fusion Module (EOFFM) and the Spatial Boundary Detail Extraction Module (SBDEM). The overall architecture is illustrated in Figure 1.
2.1 Dual-Stream Feature Encoding
The model processes two input streams in parallel:
- Stream 1 (Spectral Features): Takes the stacked multispectral bands (e.g., R, G, B, RE, NIR) as input. It follows a standard convolutional downsampling path to extract hierarchical spectral-spatial features.
- Stream 2 (Structural Features): Takes the single-channel CHM as input. It follows an identical downsampling path to extract hierarchical features related to canopy height and structure.
This separation allows each stream to specialize in learning representations from its respective modality without early-stage interference.
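A schematic PyTorch sketch of this dual-stream encoding is given below; the block design and channel widths are our assumptions, since the exact Res-UNet configuration is not specified here.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions; a stand-in for the Res-UNet encoder block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class DualStreamEncoder(nn.Module):
    """Parallel spectral (5-band) and structural (1-band CHM) encoders."""
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        def stream(c_in):
            blocks, c = nn.ModuleList(), c_in
            for w in widths:
                blocks.append(ConvBlock(c, w))
                c = w
            return blocks
        self.spec, self.chm = stream(5), stream(1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x_spec, x_chm):
        feats = []  # per-level (spectral, structural) feature pairs for EOFFM
        for b_s, b_c in zip(self.spec, self.chm):
            x_spec, x_chm = b_s(x_spec), b_c(x_chm)
            feats.append((x_spec, x_chm))
            x_spec, x_chm = self.pool(x_spec), self.pool(x_chm)
        return feats
```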
2.2 The Elevation-Optical Feature Fusion Module (EOFFM)
Simply concatenating features from the two streams is suboptimal. The EOFFM module is introduced at corresponding levels of the encoder to perform a learned, attention-guided fusion. As shown in Figure 2, its operation can be summarized as:
- Feature Transformation: Features from both streams, $F_{spec}^l$ and $F_{chm}^l$ at level $l$, are first passed through independent dense convolution blocks (Dense Layer) for enhanced representation.
- Global Enhanced Fusion (GEFM): This is the core of EOFFM. It employs a shared cross-attention mechanism. The transformed features are projected into Query (Q), Key (K), and Value (V) matrices. The Query is derived from a joint representation, while Keys and Values are modality-specific.
- An optical-attention path computes how structural (CHM) features should attend to and refine spectral features.
- A structural-attention path computes how spectral features should attend to and refine structural features.
The attention mechanism is implemented as:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where $d_k$ is the dimension of the key vectors.
- Residual Fusion: The outputs from the two attention paths are combined with the original input features via learnable weights $\gamma_1$ and $\gamma_2$ in a residual manner:
$$ F_{fused}^l = \gamma_1 \cdot (\text{Optical-Attn Output}) + \gamma_2 \cdot (\text{Structural-Attn Output}) + F_{spec}^l $$
This generates a fused feature map $F_{fused}^l$ that carries complementary information from both the spectral and structural modalities.
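The following PyTorch sketch illustrates one possible reading of EOFFM at a single encoder level; single-head attention, 1×1 projections, and zero-initialized residual weights are our assumptions, and the dense blocks are omitted for brevity.

```python
import torch
import torch.nn as nn

class EOFFM(nn.Module):
    """Attention-guided fusion of spectral and CHM features at one encoder level."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(2 * c, c, 1)    # Query from the joint representation
        self.k_spec = nn.Conv2d(c, c, 1)   # modality-specific Keys/Values
        self.v_spec = nn.Conv2d(c, c, 1)
        self.k_chm = nn.Conv2d(c, c, 1)
        self.v_chm = nn.Conv2d(c, c, 1)
        self.gamma1 = nn.Parameter(torch.zeros(1))  # learnable residual weights
        self.gamma2 = nn.Parameter(torch.zeros(1))

    @staticmethod
    def attend(q, k, v):
        """softmax(QK^T / sqrt(d_k)) V over flattened spatial positions."""
        b, c, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)   # (B, HW, C)
        k = k.flatten(2)                   # (B, C, HW)
        v = v.flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, f_spec, f_chm):
        q = self.q(torch.cat([f_spec, f_chm], dim=1))
        opt = self.attend(q, self.k_spec(f_spec), self.v_spec(f_spec))  # optical path
        strc = self.attend(q, self.k_chm(f_chm), self.v_chm(f_chm))     # structural path
        return self.gamma1 * opt + self.gamma2 * strc + f_spec          # residual fusion
```

Because full spatial attention is quadratic in the number of pixels, a module of this form is usually only affordable at the deeper, downsampled encoder levels.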
2.3 The Spatial Boundary Detail Extraction Module (SBDEM)
Accurate boundary delineation is critical for fine-grained classification. The SBDEM, integrated into the spectral stream, enhances the model’s sensitivity to crop edges. As depicted in Figure 3, it operates as follows:
- Multi-directional Wavelet Decomposition: The input feature map $X$ is decomposed using 2D discrete wavelet transform (DWT) into four sub-bands capturing different spatial frequencies and orientations:
$$ \{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \text{DWT}(X) $$
These represent the approximation (LL), horizontal (LH), vertical (HL), and diagonal (HH) detail components, respectively.
- Feature Reconstruction and Enhancement: The sub-bands are concatenated and reconstructed. This reconstructed detail map is then fused with features from a parallel standard convolution path.
- FReLU Activation: The combined features are activated using a Funnel ReLU (FReLU), defined as $ \text{FReLU}(x) = \max(x, T(x)) $, where $T(x)$ is a contextually conditioned transformation. This activation helps preserve fine boundary information that might be lost with standard ReLU.
The output of SBDEM is a boundary-enhanced feature map, which is later concatenated with the main decoder features.
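A minimal sketch of the SBDEM idea follows, with a single-level Haar DWT implemented as fixed strided convolutions; the wavelet basis, channel widths, and block layout are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """Single-level 2D Haar DWT producing LL/LH/HL/HH sub-bands at half resolution."""
    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        self.register_buffer("kernels", torch.stack([ll, lh, hl, hh])[:, None])  # (4,1,2,2)

    def forward(self, x):
        b, c, h, w = x.shape  # H and W are assumed even
        x = x.reshape(b * c, 1, h, w)
        bands = F.conv2d(x, self.kernels, stride=2)   # (B*C, 4, H/2, W/2)
        return bands.reshape(b, c * 4, h // 2, w // 2)

class FReLU(nn.Module):
    """Funnel ReLU: max(x, T(x)) with T a depthwise 3x3 conv + BN."""
    def __init__(self, c):
        super().__init__()
        self.t = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c), nn.BatchNorm2d(c))
    def forward(self, x):
        return torch.max(x, self.t(x))

class SBDEM(nn.Module):
    """Boundary-detail branch: DWT sub-bands fused with a standard conv path."""
    def __init__(self, c):
        super().__init__()
        self.dwt = HaarDWT()
        self.detail = nn.Sequential(  # reconstruct a detail map from the sub-bands
            nn.Conv2d(4 * c, c, 1),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        self.conv = nn.Conv2d(c, c, 3, padding=1)  # parallel standard conv path
        self.act = FReLU(c)

    def forward(self, x):
        return self.act(self.detail(self.dwt(x)) + self.conv(x))
```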
2.4 Decoder and Loss Function
The fused features from multiple levels are progressively upsampled in the decoder via transpose convolutions. At the final decoder layer, the high-level semantic features are concatenated with the boundary-enhanced features from SBDEM. A final 1×1 convolution followed by softmax produces the per-pixel class probability map.
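A minimal sketch of one decoder stage and the classification head is shown below; the channel counts, and the assumption that skip connections carry the fused encoder features, are illustrative.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """Upsample semantic features via transpose conv and merge the skip connection."""
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
    def forward(self, x, skip):
        return self.fuse(torch.cat([self.up(x), skip], dim=1))

# Final head: the SBDEM output (assumed 32 channels here) is concatenated with
# the last decoder features (assumed 64), then a 1x1 conv + softmax yields
# per-pixel probabilities over the three classes.
head = nn.Sequential(nn.Conv2d(64 + 32, 3, kernel_size=1), nn.Softmax(dim=1))
```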
To handle class imbalance and improve segmentation quality, a combined loss function is used:
$$ \mathcal{L}_{total} = 0.5 \cdot \mathcal{L}_{Dice} + 0.5 \cdot \mathcal{L}_{mIoU} $$
where the Dice Loss $\mathcal{L}_{Dice}$ and the mIoU Loss $\mathcal{L}_{mIoU}$ for $C$ classes are defined as:
$$ \mathcal{L}_{Dice} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \cdot TP_c + \epsilon}{2 \cdot TP_c + FP_c + FN_c + \epsilon} $$
$$ \mathcal{L}_{mIoU} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c} $$
$TP_c, FP_c, FN_c$ are true positives, false positives, and false negatives for class $c$, and $\epsilon$ is a smoothing constant.
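A minimal PyTorch sketch of the combined loss follows, computing soft per-class $TP_c$, $FP_c$, and $FN_c$ from predicted probabilities (a common implementation choice, assumed here):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, num_classes=3, eps=1e-6):
    """0.5 * Dice + 0.5 * mIoU loss over C classes.

    logits: (B, C, H, W) raw scores; target: (B, H, W) integer labels.
    Soft TP/FP/FN are derived from class probabilities so the loss is
    differentiable; hard counts apply only at evaluation time.
    """
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    tp = (probs * onehot).sum(dim=(0, 2, 3))
    fp = (probs * (1 - onehot)).sum(dim=(0, 2, 3))
    fn = ((1 - probs) * onehot).sum(dim=(0, 2, 3))

    dice = 1 - ((2 * tp + eps) / (2 * tp + fp + fn + eps)).mean()
    miou = 1 - (tp / (tp + fp + fn + eps)).mean()  # eps added for stability
    return 0.5 * dice + 0.5 * miou
```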
3. Experimental Results and Analysis
We conducted extensive experiments to validate the effectiveness of our proposed EOFFM-FusionUnet model. The dataset was split into training, validation, and test sets. All models were trained using the Adam optimizer with a learning rate of 0.001 for 100 epochs.
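For concreteness, a minimal sketch of the stated training configuration is given below; the stand-in model and synthetic loader exist only so the loop executes, and `combined_loss` refers to the sketch in Section 2.4.

```python
import torch
import torch.nn as nn

class StandIn(nn.Module):
    """Trivial stand-in for EOFFM-FusionUnet so the loop below executes."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(6, 3, 1)  # 5 spectral bands + CHM -> 3 classes
    def forward(self, x_spec, x_chm):
        return self.head(torch.cat([x_spec, x_chm], dim=1))

model = StandIn()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr = 0.001 as stated

# Hypothetical loader yielding (spectral patch, CHM patch, label patch) triples.
loader = [(torch.randn(2, 5, 272, 272), torch.randn(2, 1, 272, 272),
           torch.randint(0, 3, (2, 272, 272)))]

for epoch in range(100):  # 100 epochs as stated
    for x_spec, x_chm, y in loader:
        optimizer.zero_grad()
        loss = combined_loss(model(x_spec, x_chm), y)
        loss.backward()
        optimizer.step()
```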
3.1 Comparative Analysis with State-of-the-Art Models
We compared EOFFM-FusionUnet against nine prominent semantic segmentation models, including U-Net, DeepLabV3+, SegFormer, and U-NetFormer. The performance was evaluated using Overall Accuracy (OA), mean Intersection over Union (mIoU), F1-Score, and Kappa coefficient.
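For reference, all four reported metrics can be computed from a confusion matrix, as in the following numpy sketch (macro-averaging for mIoU and F1 is our assumption):

```python
import numpy as np

def metrics(cm: np.ndarray):
    """OA, mIoU, F1, and Cohen's kappa from a C x C confusion matrix.

    cm[i, j] counts pixels of true class i predicted as class j.
    """
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted totals minus true positives
    fn = cm.sum(axis=1) - tp   # actual totals minus true positives
    n = cm.sum()

    oa = tp.sum() / n
    miou = np.mean(tp / (tp + fp + fn))
    f1 = np.mean(2 * tp / (2 * tp + fp + fn))
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, miou, f1, kappa
```

The comparison results are summarized below.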
| Model | OA (%) | mIoU (%) | F1-Score (%) | Kappa |
|---|---|---|---|---|
| U-Net | 75.47 | 61.88 | 76.31 | 0.6334 |
| DeepLabV3+ | 77.52 | 61.62 | 75.99 | 0.6247 |
| SegFormer | 80.02 | 67.36 | 80.46 | 0.6881 |
| U-NetFormer | 81.24 | 68.74 | 81.41 | 0.7018 |
| BiSeNetV2 | 79.63 | 66.33 | 79.72 | 0.6791 |
| EOFFM-FusionUnet (Ours) | 83.89 | 71.59 | 83.37 | 0.7314 |
The results clearly demonstrate the superiority of our model. EOFFM-FusionUnet achieves the highest scores across all metrics, with an OA of 83.89% and an mIoU of 71.59%. It outperforms the strong transformer-based SegFormer by 3.87% in OA and 4.23% in mIoU, and the recent U-NetFormer by 2.65% in OA and 2.85% in mIoU. This significant improvement validates the advantage of our dedicated multi-modal fusion strategy over architectures designed primarily for single-modality data or using basic fusion techniques.
3.2 Ablation Studies
To dissect the contribution of each component, we performed systematic ablation studies.
A) Input Modality Ablation: We evaluated the impact of different input feature combinations using a baseline dual-stream Res-UNet. The results are summarized in Table 2.
| Input Modality | OA (%) | mIoU (%) | Gain over RGB (OA) |
|---|---|---|---|
| RGB only | 70.13 | 47.68 | – |
| RGB + NIR | 74.62 | 53.24 | +4.49% |
| RGB + NIR + Red-edge | 75.85 | 54.89 | +5.72% |
| RGB + CHM | 78.43 | 60.27 | +8.30% |
| RGB + NIR + Red-edge + CHM (Full) | 78.11 | 69.94 | +7.98% |
The analysis reveals several key insights: 1) Adding the NIR and Red-edge bands improves performance over visible RGB alone, highlighting the importance of spectral regions sensitive to vegetation health and structure. 2) Incorporating the CHM provides the largest gain of any single added modality, increasing OA by over 8% compared to RGB; this underscores the critical complementary value of the 3D information provided by the unmanned drone LiDAR. 3) The full combination of all spectral bands and CHM yields the best mIoU, demonstrating the synergistic effect of comprehensive spectral-structural data fusion.
B) Module Ablation: We incrementally added our proposed modules to the baseline dual-stream Res-UNet (trained with the full modality set). The results are shown in Table 3.
| Model Configuration | OA (%) | mIoU (%) | F1-Score (%) |
|---|---|---|---|
| Baseline (Dual-Stream Res-UNet) | 72.48 | 61.35 | 69.81 |
| Baseline + SBDEM | 76.53 | 66.82 | 73.08 |
| Baseline + EOFFM | 78.11 | 69.94 | 77.92 |
| Baseline + SBDEM + EOFFM (Full Model) | 83.89 | 71.59 | 83.37 |
Each module contributes positively to the final performance. Added individually, EOFFM provides a larger boost than SBDEM, emphasizing the central role of effective feature fusion. However, the combination of both modules leads to the best results, with an 11.41% increase in OA over the baseline. This demonstrates that the collaborative design of guided multi-modal fusion (EOFFM) and enhanced boundary awareness (SBDEM) is highly effective for the complex task of fine-grained crop classification from unmanned drone data.
3.3 Qualitative Results and Visualization
Visual inspection of the classification maps provides compelling evidence of the model’s capabilities. In complex scenes where crop plots are interwoven, traditional models like U-Net or DeepLabV3+ often produce blurry boundaries and misclassify small or irregularly shaped plots. SegFormer and U-NetFormer show clearer boundaries but still exhibit confusion between spectrally similar crops in shadowed or densely planted areas.
In contrast, the predictions from EOFFM-FusionUnet exhibit several superior characteristics:
- Sharper Boundaries: Crop plot edges are delineated with higher precision, closely matching the manual annotations. This is a direct benefit of the SBDEM module.
- Reduced Confusion: The model shows a stronger ability to distinguish between corn and sunflower, even when their spectral signatures overlap. This is attributed to the EOFFM module’s ability to leverage the distinct height profiles of these crops (corn being significantly taller) to disambiguate the spectral information.
- Robustness in Heterogeneous Areas: The model performs consistently well across the image, including in areas with mixed weeds, bare soil, and varying planting densities.
The final model output is a geographically referenced classification map that can be directly overlaid on the original orthomosaic, providing an immediately actionable product for farm-scale management.
4. Discussion and Conclusion
The integration of multispectral and LiDAR data from unmanned drone platforms represents a paradigm shift towards high-resolution, multi-dimensional crop monitoring. This study successfully demonstrates that deep, learned fusion of these complementary data streams is not only feasible but highly advantageous. The proposed EOFFM-FusionUnet model, through its dedicated EOFFM and SBDEM modules, effectively addresses the core challenges of multi-modal feature alignment, synergistic representation learning, and precise boundary segmentation.
The experimental results lead to several conclusive findings:
- Multi-modal Superiority: The fusion of high-resolution spectral (particularly Red-edge and NIR) and structural (CHM) features captured by unmanned drone significantly enhances classification accuracy compared to using any single data source. The CHM feature proved to be exceptionally valuable.
- Architectural Efficacy: The EOFFM-FusionUnet architecture, with its dual-stream design and specialized fusion/boundary modules, outperforms a wide range of state-of-the-art semantic segmentation models. It establishes a new effective benchmark for unmanned drone-based crop classification tasks.
- Practical Applicability: The model generates accurate, high-resolution, and georeferenced crop maps. This output is directly usable for precision agriculture applications such as variable-rate input application, yield estimation, and field-level crop health assessment, all with a timeliness that satellite systems cannot match.
In conclusion, this research presents a robust and effective framework for fine-grained crop classification by harnessing the full potential of unmanned drone remote sensing. The proposed methodology bridges the gap between high-resolution data acquisition and intelligent data analysis, offering a powerful tool for advancing precision agriculture and sustainable land management. Future work will focus on testing the model's generalizability across different geographical regions, crop types, and growth stages, and on exploring the integration of temporal sequences of unmanned drone acquisitions to capture phenological dynamics for even more robust classification.
