Accurate segmentation of terrain features in drone-captured aerial imagery remains challenging because of complex environmental variation and large differences in object scale. Traditional methods that rely on local features adapt poorly to new scenes, while standard deep learning approaches such as UNet struggle with multi-scale feature extraction. This work introduces an enhanced U2Net model that incorporates Efficient Channel Attention (ECA) modules to address these limitations. The ECA mechanism dynamically reweights feature channels, improving discrimination between key terrain elements such as water bodies, vegetation, and infrastructure in UAV imagery.

The U2Net architecture utilizes Residual U-blocks (RSU) for multi-scale feature extraction. Each RSU consists of an input convolution layer (Cin), a symmetrical encoder-decoder structure (M), and a residual fusion output (Cout). The encoder progressively downsamples spatial resolution while expanding receptive fields through cascaded RSU modules. The decoder then upsamples feature maps using bilinear interpolation, with skip connections fusing shallow and deep features. The standard RSU operation is defined as:
$$F_{\text{out}} = F_1(x) + U(F_1(x))$$
where $F_1(x)$ is the initial feature transformation and $U(\cdot)$ represents the U-shaped encoder-decoder operations.
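The residual composition above can be illustrated with a toy one-dimensional sketch. The stand-ins are assumptions for illustration only and are not the model's layers: `scale2` replaces the learned input transformation $F_1$, and `toy_u` replaces the encoder-decoder $U$ with a single average-pooling step followed by nearest-neighbor upsampling (the real decoder uses bilinear interpolation and learned convolutions).

```python
def scale2(x):
    """Hypothetical stand-in for F1: doubles every value."""
    return [2.0 * v for v in x]


def toy_u(h):
    """Hypothetical stand-in for U: pool by 2, then upsample by 2.

    Assumes len(h) is even so the output length matches the input.
    """
    pooled = [(h[i] + h[i + 1]) / 2.0 for i in range(0, len(h) - 1, 2)]
    return [v for v in pooled for _ in range(2)]


def rsu_toy(x, f1, u):
    """Residual fusion: F_out = F1(x) + U(F1(x))."""
    h = f1(x)
    return [a + b for a, b in zip(h, u(h))]


# F1 gives [2, 6, 10, 14]; U gives [4, 4, 12, 12]; their sum is the output.
out = rsu_toy([1.0, 3.0, 5.0, 7.0], scale2, toy_u)
```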
Our modification integrates ECA modules within each RSU to amplify relevant channel features. The ECA mechanism first applies global average pooling to the input feature map $X \in \mathbb{R}^{C \times H \times W}$:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i,j)$$
Channel weights are then learned via a 1D convolution with adaptive kernel size $k$ determined by channel dimension $C$:
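$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\text{odd}}$$
where $|t|_{\text{odd}}$ denotes the nearest odd integer to $t$, and $\gamma$ and $b$ control how the kernel size scales with the channel count. The final recalibrated features are obtained through channel-wise multiplication: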
$$\hat{X} = \sigma(\text{Conv1D}(z)) \otimes X$$
where $\sigma$ is the sigmoid function and $\otimes$ denotes channel-wise multiplication. This enables selective emphasis on channels critical for distinguishing overlapping terrain classes in drone imagery.
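A minimal pure-Python sketch of the three ECA steps (global average pooling, 1D convolution across channels, sigmoid gating) may make the equations more concrete. It assumes odd rounding of the kernel size and the defaults $\gamma = 2$, $b = 1$ from the original ECA-Net paper, which this text does not state. The kernel weights passed to `eca_forward` are hypothetical placeholders for learned parameters, and zero padding at the channel boundaries is likewise an assumption.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size: truncate log2(C)/gamma + b/gamma, then force odd.

    gamma=2 and b=1 are assumed defaults, not values given in this text.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1


def eca_forward(x, kernel):
    """Apply ECA to x, a list of C channels, each an H x W nested list.

    kernel is an odd-length list of weights (hypothetical learned values).
    Returns the recalibrated feature map and the per-channel weights.
    """
    num_channels = len(x)
    # Global average pooling: one scalar z_c per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # 1D convolution over neighbouring channels, zero-padded at both ends.
    pad = len(kernel) // 2
    zp = [0.0] * pad + z + [0.0] * pad
    conv = [
        sum(w * zp[c + t] for t, w in enumerate(kernel))
        for c in range(num_channels)
    ]
    # Sigmoid gate, then channel-wise multiplication with the input.
    weights = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    out = [
        [[weights[c] * v for v in row] for row in x[c]]
        for c in range(num_channels)
    ]
    return out, weights


# Two 2x2 channels; the identity-like kernel [0, 1, 0] leaves z unchanged,
# so each channel is gated by the sigmoid of its own mean.
x = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
out, weights = eca_forward(x, kernel=[0.0, 1.0, 0.0])
```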
| Model | Precision | Recall | F1-Score | mIoU |
|---|---|---|---|---|
| UNet | 0.782 | 0.645 | 0.701 | 0.589 |
| Original U2Net | 0.805 | 0.682 | 0.769 | 0.649 |
| ECA-U2Net (Ours) | 0.832 | 0.731 | 0.794 | 0.690 |
Experiments used the LoveDA dataset, which contains 0.3 m resolution UAV images of urban and rural areas. Models were evaluated using pixel-wise metrics:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
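$$\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{TP}_i}{\text{TP}_i + \text{FP}_i + \text{FN}_i}$$
where $N$ is the number of classes and $\text{TP}_i$, $\text{FP}_i$, and $\text{FN}_i$ are the true-positive, false-positive, and false-negative pixel counts for class $i$.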
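As a hedged illustration of these metrics, the sketch below tallies per-class TP, FP, and FN counts from flattened label maps and then evaluates precision, recall, and mIoU. The function names and the tiny six-pixel example are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something specified here.

```python
def per_class_counts(pred, truth, num_classes):
    """Return [TP, FP, FN] per class from flat lists of class indices."""
    counts = [[0, 0, 0] for _ in range(num_classes)]
    for p, t in zip(pred, truth):
        if p == t:
            counts[t][0] += 1  # correct pixel: TP for its class
        else:
            counts[p][1] += 1  # FP for the predicted class
            counts[t][2] += 1  # FN for the true class
    return counts


def safe_div(num, den):
    """Assumed convention: an empty denominator yields 0.0."""
    return num / den if den else 0.0


def precision(tp, fp, fn):
    return safe_div(tp, tp + fp)


def recall(tp, fp, fn):
    return safe_div(tp, tp + fn)


def mean_iou(counts):
    """Average of TP / (TP + FP + FN) over all classes."""
    ious = [safe_div(tp, tp + fp + fn) for tp, fp, fn in counts]
    return sum(ious) / len(ious)


# Six pixels, three hypothetical classes (0, 1, 2).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
counts = per_class_counts(pred, truth, num_classes=3)
miou = mean_iou(counts)  # per-class IoUs are 1/3, 2/3, and 1/2
```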
Our ECA-U2Net improved precision by 2.7 percentage points (0.805 to 0.832) and recall by 4.9 percentage points (0.682 to 0.731) over the baseline U2Net. The attention mechanism was particularly effective at separating spectrally similar classes (e.g., water vs. shadows) by amplifying discriminative high-frequency features.
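The gains over the baseline can be checked directly against the results table. The sketch below recomputes the absolute differences (in percentage points) and the relative changes from the tabulated values; the dictionary layout and the function name are illustrative choices only.

```python
# Values copied from the results table.
baseline = {"precision": 0.805, "recall": 0.682, "f1": 0.769, "miou": 0.649}
ours = {"precision": 0.832, "recall": 0.731, "f1": 0.794, "miou": 0.690}


def gains(base, new):
    """Map each metric to (percentage-point gain, relative gain in %)."""
    result = {}
    for name in base:
        diff = new[name] - base[name]
        result[name] = (round(diff * 100, 1), round(diff / base[name] * 100, 1))
    return result


g = gains(baseline, ours)
# g["precision"][0] is 2.7 points and g["recall"][0] is 4.9 points.
```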
Training used a composite loss combining deep supervision outputs and final predictions:
$$\mathcal{L}_{\text{total}} = \sum_{m=1}^{M} w_{\text{side}}^{(m)} \mathcal{L}_{\text{side}}^{(m)} + w_{\text{fuse}} \mathcal{L}_{\text{fuse}}$$
where $M$ is the number of side outputs, $w_{\text{side}}^{(m)}$ and $w_{\text{fuse}}$ are the corresponding loss weights, and $\mathcal{L}_{\text{side}}^{(m)}$ and $\mathcal{L}_{\text{fuse}}$ are binary cross-entropy terms. The model converged within 100 epochs, indicating stable optimization even on complex drone scenes.
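To make the composite loss concrete, here is a small sketch that sums weighted binary cross-entropy terms over the side outputs and the fused output. Setting all weights to 1.0 is an assumption (the values used in training are not given here), the clamping constant `eps` is an implementation detail added for numerical safety, and predictions are represented as flat lists of probabilities purely for illustration.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def total_loss(side_preds, fused_pred, target, side_weights=None, fuse_weight=1.0):
    """L_total = sum_m w_side^(m) * L_side^(m) + w_fuse * L_fuse.

    Weights default to 1.0, which is an assumption for this sketch.
    """
    if side_weights is None:
        side_weights = [1.0] * len(side_preds)
    side = sum(w * bce(p, target) for w, p in zip(side_weights, side_preds))
    return side + fuse_weight * bce(fused_pred, target)


# Two side outputs and one fused output, all maximally uncertain (p = 0.5),
# so each term equals ln 2 and the total is 3 * ln 2.
target = [1, 0]
loss = total_loss([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], target)
```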
The enhanced architecture provides robust segmentation for land-cover mapping, infrastructure inspection, and environmental monitoring using UAV platforms. Future work will explore lightweight deployment for real-time onboard processing during drone surveys.
