The intelligent analysis of Unmanned Aerial Vehicle (UAV) imagery is a cornerstone technology in low-altitude remote sensing, playing a pivotal role in diverse application scenarios such as urban planning, precision agriculture, and disaster emergency response. Compared to satellite remote sensing, which is often constrained by revisit cycles and imaging resolution, UAVs offer superior mobility, enabling centimeter-level resolution imaging and facilitating rapid, large-scale acquisition of high-resolution data. This provides an excellent foundation for high-precision, fine-grained semantic segmentation—a core technology for scene understanding that achieves pixel-level semantic annotation, thereby enhancing the autonomous perception and intelligent decision-making capabilities of UAVs. However, constrained by the hardware resources of UAV-embedded computing platforms, segmentation models must optimize computational efficiency and control parameter scale while maintaining accuracy. Concurrently, data characteristics inherent to high-resolution imagery, such as multi-scale objects and complex background interference, further complicate algorithm design.
To achieve high-precision semantic segmentation of remote sensing images, numerous deep learning-based methods have been proposed. Early models like the Fully Convolutional Network (FCN) utilized a fully convolutional structure for end-to-end pixel-level prediction, laying the foundation for semantic segmentation tasks. U-Net introduced skip-connection mechanisms to effectively fuse shallow and deep features, improving segmentation performance. The DeepLab series further enhanced the processing of multi-scale contextual information by combining atrous convolution with conditional random fields. While these methods have achieved significant progress in segmentation accuracy, they generally suffer from large parameter counts and high computational costs, making it difficult for them to meet the practical requirements of lightweight design and real-time performance on resource-constrained platforms like UAVs. For real-time applications, lightweight semantic segmentation networks have become a research hotspot. The BiSeNet series balanced speed and accuracy through dual-path and feature fusion architectures. STDC improved boundary quality by introducing a detail guidance module at shallow layers. DDRNet enhanced multi-scale object recognition using a dual-resolution parallel structure. PIDNet strengthened contour information by adding a dedicated boundary branch. However, in UAV remote sensing scenarios, these methods still face three common issues: (1) downsampling leads to the loss of detailed textures, resulting in blurred boundaries and poor performance on small objects; (2) the efficiency of multi-scale modeling and semantic enhancement is limited, and high-computation structures are detrimental to lightweight deployment; (3) feature fusion is predominantly unidirectional, leading to insufficient utilization of detail information. These problems collectively constrain the performance ceiling of lightweight segmentation networks in complex UAV remote sensing scenarios.

To address the aforementioned challenges, we propose a lightweight semantic segmentation network for UAV remote sensing imagery, termed the Wavelet Detail Collaborative Network (WDCNet). In contrast to existing methods, the core innovation of WDCNet lies in constructing a design paradigm of “detail preservation – semantic enhancement – bidirectional fusion,” achieving systematic collaborative optimization across three stages: feature generation, semantic enhancement, and fusion interaction.
Related Work
Lightweight Semantic Segmentation Networks and Feature Fusion
To improve segmentation efficiency and deployment performance, lightweight semantic segmentation networks have garnered significant attention, with continuous development in architectural design and feature fusion strategies. BiSeNetV2, as a typical representative, employs a bilateral structure with a semantic branch and a detail branch, introducing a Bilateral Guided Aggregation (BGA) module for rapid fusion of semantic and detail features, striking a balance between speed and accuracy. However, BGA employs unidirectional guidance, lacking feedback from details to semantics, which limits performance in boundary regions and on small objects. DDRNet adopts a high-low resolution parallel structure, enhancing the recognition of multi-scale targets, but the detail path participation is insufficient, and its computational complexity is relatively high, which hinders lightweight deployment. PIDNet enhances contour information by introducing a boundary branch, yet its fusion approach still leans heavily on the semantic path, resulting in low efficiency in utilizing edge information and limited improvement. Overall, the common issues among these methods are a singular fusion approach and insufficient utilization of details.
Detail Feature Extraction and Preservation Methods
In remote sensing image segmentation, edge and detail features are crucial for improving accuracy. Although low-level features possess limited semantic information, they are irreplaceable for capturing object contours and fine-grained structures. ICNet enhanced boundary perception through a high-resolution detail branch, but the detail path lacked independent guidance and supervision, causing detail information to be easily suppressed by semantic features, limiting boundary recovery. The STDC network introduced a detail guidance module at shallow layers to enhance boundary expression through constrained supervision, thereby improving the model’s perception of complex object contours. However, it primarily relied on spatial features, making it difficult to preserve high-frequency information over long distances. Recently, some studies have begun to explore the use of frequency-domain information to enhance the expression of detail features. For instance, models like WaveletCNNs and Wave-ViT successfully extracted edge and texture details by incorporating wavelet transforms and leveraging their multi-resolution analysis properties. However, these methods focus on frequency-domain feature extraction and lack deep integration with spatial detail paths, limiting their application in structural segmentation tasks for UAV imagery.
Attention Mechanisms and Multi-scale Semantic Feature Representation
In semantic segmentation tasks, accurately extracting semantic information is key to understanding complex scenes, especially in UAV remote sensing images where multi-scale objects and cluttered backgrounds pose greater challenges. Recently, attention mechanisms and multi-scale feature fusion have been widely used to improve segmentation performance. Traditional attention mechanisms like Squeeze-and-Excitation (SE) and the Convolutional Block Attention Module (CBAM) enhanced responses in salient regions by modulating channel or spatial feature weights. However, such methods typically focus on a single dimension, lacking synergistic enhancement between channel and spatial information, making it difficult to handle the representation needs of heterogeneous objects and complex textures in remote sensing imagery. To this end, Multi-dimensional Collaborative Attention (MCA) introduced a multi-dimensional collaboration mechanism to enhance feature interaction across different dimensions. However, its structure is relatively simple, and its stability and efficiency in complex scenes remain insufficient. Context pyramid-like structures can enhance multi-scale representation, but may suffer from information loss in lightweight networks.
Methodology
Network Overall Architecture
The overall architecture of WDCNet is inspired by the decoupled dual-branch structure idea and is systematically optimized from three aspects—detail preservation, semantic enhancement, and bidirectional fusion—to address the shortcomings of existing lightweight methods in UAV remote sensing scenarios. In the detail branch, a Wavelet Transform Downsampling Module (WTDM) replaces traditional strided convolution to preserve more high-frequency detail information and boundary texture during downsampling. The branch maintains a three-level resolution reduction structure (1/2, 1/4, 1/8) and incorporates a Detail Guidance Module (DGM) at its output, which enhances the shallow features’ perception of object contours through explicit edge supervision, thereby significantly mitigating boundary blurring caused by downsampling and improving the segmentation accuracy of fine-grained structures. In the semantic branch, a five-level resolution reduction structure is retained. At the 1/4, 1/8, and 1/16 resolution stages, a Residual Multidimensional Collaborative Attention Module (RMCAM) is introduced to achieve collaborative attention enhancement across channel and spatial dimensions, highlighting salient regions, suppressing background interference, and enhancing modeling capability for heterogeneous objects. At the end of the semantic branch, a Context Pyramid Module (CPM) integrates multi-scale receptive fields and context information aggregation strategies to more efficiently combine global and local semantic features. At the branch fusion stage, an Enhanced Bilateral Guided Aggregation Module (EBGA) is proposed, introducing a reverse guidance path from the detail branch to the semantic branch, establishing a bidirectional feedback mechanism between detail and semantic features.
This mechanism not only strengthens the detail branch’s ability to preserve structural information but also enhances the collaborative expression between the semantic branch and boundary structures, achieving a better balance between spatial consistency and semantic sufficiency. The fused features are progressively restored to spatial resolution by the decoder to generate the final segmentation prediction.
Detail Branch Module Design
To address the issues of fine-grained structure loss during downsampling and insufficient boundary response in UAV remote sensing images, a detail enhancement unit composed of the Wavelet Transform Downsampling Module (WTDM) and the Detail Guidance Module (DGM) is proposed in the detail branch, collaboratively improving detail perception through both frequency-domain detail preservation and explicit boundary supervision.
The WTDM decomposes the input feature map into frequency domains based on the Haar Wavelet Transform (HWT), yielding one low-frequency component $LL$ and three directional high-frequency components $LH$, $HL$, and $HH$. The low-frequency component primarily retains the overall structural information of the image, while the high-frequency components reflect edge and detail variations. This process effectively distinguishes structural and textural information while halving the spatial resolution, providing a richer representation for subsequent detail feature extraction. The three high-frequency components are stacked and sequentially passed through convolution, batch normalization, and ReLU activation to extract detail response features $F_H$; the low-frequency component is independently encoded to obtain structural features $F_L$. Subsequently, the two are fused via element-wise addition to generate the output feature $I_{out}$, with the calculation process as shown in Equations (1)-(3).
$$F_H = \text{ReLU}(\text{BN}(\text{Conv}([I_{LH}, I_{HL}, I_{HH}])))$$
$$F_L = \text{ReLU}(\text{BN}(\text{Conv}(I_{LL})))$$
$$I_{out} = F_H + F_L$$
Compared to traditional max-pooling or strided convolution that downsamples directly in the spatial domain, WTDM preserves critical high-frequency information through frequency-domain decomposition, endowing the network with stronger expressive power in boundary, small object, and complex texture regions.
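Equations (1)-(3) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors’ released code: the 2×2 Haar decomposition is implemented by pixel slicing, and the channel widths of the two Conv-BN-ReLU paths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WTDMSketch(nn.Module):
    """Minimal sketch of the Wavelet Transform Downsampling Module (Eqs. 1-3)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # High-frequency path: LH, HL, HH stacked along channels -> Conv-BN-ReLU
        self.high = nn.Sequential(
            nn.Conv2d(3 * in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Low-frequency path: LL encoded independently
        self.low = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        # 2x2 Haar transform via even/odd pixel slicing (halves H and W)
        a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2   # low-frequency structure
        lh = (-a - b + c + d) / 2  # horizontal detail
        hl = (-a + b - c + d) / 2  # vertical detail
        hh = (a - b - c + d) / 2   # diagonal detail
        f_h = self.high(torch.cat([lh, hl, hh], dim=1))  # Eq. (1)
        f_l = self.low(ll)                               # Eq. (2)
        return f_h + f_l                                 # Eq. (3)
```

A 64×64 input thus yields a 32×32 output while the high-frequency responses remain explicitly encoded rather than being discarded by pooling.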
Although WTDM effectively preserves high-frequency details during downsampling, the network’s focus on key boundary regions remains insufficient. Therefore, the Detail Guidance Module (DGM) is introduced at the output of the third level of the detail branch. By establishing explicit boundary supervision, it guides shallow features to enhance their ability to respond to structural contours. In terms of design, DGM first processes the detail branch features through a “Detail Head”: features pass through a 3×3 convolutional layer, batch normalization, and ReLU activation to extract local context information; then channel compression is achieved via a 1×1 convolution to obtain the detail prediction map $F_d$. Simultaneously, to generate the detail ground truth map $G_d$, a multi-scale parallel feature extraction strategy based on the Laplacian convolution kernel is employed. Specifically, the original segmentation ground truth is subjected to 3×3 Laplacian convolution kernels with strides of 1, 2, and 4, where the branches with strides 2 and 4 are restored to the original resolution via 2× and 4× bilinear upsampling, respectively. Subsequently, the three sets of features are concatenated along the channel dimension and fused via a 1×1 convolution to obtain the multi-scale integrated detail ground truth map $G_d$. During training, a combination of Binary Cross-Entropy (BCE) loss and Dice loss is used to optimize the similarity between the detail prediction map and the ground truth map. The loss function for DGM is defined as follows, where Dice loss focuses on improving overall boundary consistency, and BCE loss enhances local pixel classification accuracy. This module is only enabled during training, thus introducing no additional inference computational overhead.
$$L_{\text{detail}} = L_{\text{dice}}(F_d, G_d) + L_{\text{bce}}(F_d, G_d)$$
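The multi-scale construction of the detail ground truth $G_d$ might look as follows. This is a sketch under stated assumptions: the paper’s 1×1 fusion is learnable whereas fixed averaging weights are used here, and the `clamp`/`sigmoid` normalization of the Laplacian response is an assumption.

```python
import torch
import torch.nn.functional as F

def detail_ground_truth(seg_gt):
    """Sketch of the multi-scale Laplacian detail ground truth G_d.

    `seg_gt` is a (B, 1, H, W) float tensor of segmentation labels.
    """
    lap = torch.tensor([[[[-1., -1., -1.],
                          [-1.,  8., -1.],
                          [-1., -1., -1.]]]])  # 3x3 Laplacian kernel
    maps = []
    for s in (1, 2, 4):  # parallel branches with strides 1, 2, 4
        e = F.conv2d(seg_gt, lap, stride=s, padding=1).clamp(min=0)
        if s > 1:  # strides 2 and 4 are restored by bilinear upsampling
            e = F.interpolate(e, size=seg_gt.shape[-2:], mode='bilinear',
                              align_corners=False)
        maps.append(e)
    x = torch.cat(maps, dim=1)                  # concat along channels
    w = torch.full((1, 3, 1, 1), 1.0 / 3.0)    # fixed stand-in for the 1x1 fusion
    return torch.sigmoid(F.conv2d(x, w))        # fused binary boundary map
```

Because $G_d$ is derived from labels only, this computation (like the DGM head it supervises) is needed solely during training.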
Semantic Branch Module Design
To enhance the network’s semantic representation capability in complex UAV remote sensing scenes, a Residual Multidimensional Collaborative Attention Module (RMCAM) and a Context Pyramid Module (CPM) are introduced in the semantic branch. RMCAM, from the perspective of multi-dimensional feature enhancement, improves the discriminability and fine-grained expressive ability of mid-to-high-level semantic features. CPM, from the perspective of multi-scale context fusion, achieves a dynamic balance between global semantics and local details. Their synergy effectively improves the overall feature modeling capability of the semantic branch.
The targets in UAV remote sensing images exhibit characteristics such as varying scales and rich details. A single-dimensional attention mechanism struggles to simultaneously capture global semantics and fine-grained structural features. RMCAM employs a three-dimensional collaborative attention mechanism across channels, height, and width, modeling contextual dependencies in different dimensions separately, and introduces residual connections to mitigate the attenuation of deep information during feature transmission. The module contains three parallel branches. The processing flow for each branch is: first, dimension permutation is performed on the input feature; then statistical information for that dimension is extracted via a Squeeze Transform (ST); subsequently, an Excitation Transform (ET) generates the attention weights for that dimension, which are element-wise multiplied with the permuted feature to obtain the enhanced feature.
The ST phase fuses global average pooling (GAP), max pooling (MaxP), and standard deviation pooling (StdP) statistics, employing learnable parameters to weight GAP and MaxP before adding the StdP result:
$$S = \frac{1}{3}\left[\alpha \cdot \text{GAP}(F) + \beta \cdot \text{MaxP}(F) + \text{StdP}(F)\right]$$
where $\alpha$ and $\beta$ are trainable parameters, and StdP reflects the dispersion degree of the feature distribution, which can enhance responses in boundary and texture variation regions. The ET phase captures local interaction features through channel permutation and dynamic convolution (1×K), where the kernel size K is determined by:
$$K = \text{odd}\left(\frac{\ln(C)}{\gamma \lambda}\right)$$
where $C$ is the number of channels after permutation, $\lambda=1.5$, $\gamma=1$, and $\text{odd}(\cdot)$ denotes taking the nearest odd integer. Finally, the enhanced features from the three directions are fused with the input feature in a residual manner:
$$F' = \frac{1}{3}(F'_C + F'_H + F'_W) + F$$
where $F'_C$, $F'_H$, and $F'_W$ are the enhanced channel, height, and width features, respectively, and $F$ is the original input feature. RMCAM can effectively improve the perception of fine-grained semantic differences and provide more discriminative high-level semantic features for subsequent context fusion.
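The ST statistics and the dynamic kernel-size rule above can be illustrated with a small sketch. Fixing $\alpha = \beta = 1$ and interpreting $\text{odd}(\cdot)$ as round-then-increment-to-odd are assumptions made here for concreteness.

```python
import math
import torch

def dynamic_kernel_size(C, lam=1.5, gamma=1.0):
    """Kernel size K = odd(ln(C) / (gamma * lam)) for the ET phase.

    Rounding to the nearest integer and bumping evens up by one is an
    assumed reading of odd(.), in the spirit of ECA-style designs.
    """
    k = int(round(math.log(C) / (gamma * lam)))
    if k % 2 == 0:
        k += 1
    return max(k, 1)

def squeeze_transform(f, alpha=1.0, beta=1.0):
    """ST statistics for a (B, C, H, W) feature: weighted GAP and MaxP
    plus StdP, averaged; alpha/beta are the learnable scalars, fixed here."""
    gap = f.mean(dim=(2, 3))   # global average pooling
    maxp = f.amax(dim=(2, 3))  # max pooling
    stdp = f.std(dim=(2, 3))   # standard deviation pooling
    return (alpha * gap + beta * maxp + stdp) / 3.0
```

For a 64-channel feature, $\ln(64)/1.5 \approx 2.77$, so the excitation convolution would use a 1×3 kernel under this reading.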
Although RMCAM enhances multi-dimensional feature interaction, challenges remain in capturing global dependencies among targets at different scales and preserving detail information. Therefore, the CPM module is introduced at the high-level of the semantic branch. It achieves multi-scale contextual feature extraction from local to global through five parallel branches and employs a stepwise fusion strategy along with Context Embedding Blocks (CEB) to establish a balance between global semantics and detail characterization.
The CPM consists of five parallel branches with receptive fields gradually expanding from local to global. The first branch uses 1×1 convolution for channel compression and feeds into a CEBlock, injecting global semantic information while maintaining spatial dimensions to highlight small objects and boundary details. Branches two to four employ 5×5, 9×9, and 17×17 convolutions with strides of 2, 4, and 8, respectively, to extract medium and large-scale context features. After dimensionality reduction via 1×1 convolution and upsampling to restore the original resolution, they are fused stepwise by adding to the features of the adjacent smaller receptive field branch, then passed to a CEBlock. This stepwise fusion mechanism ensures that high-frequency details from small receptive field branches are prioritized in the feature transmission chain, while low-frequency global semantics from large receptive field branches are synergistically enhanced with features from smaller receptive fields before introduction, thereby implicitly balancing the weights of multi-scale features structurally. The fifth branch uses global convolution (kernel size equal to the feature map size) to obtain full-image context features, which after 1×1 reduction and upsampling are fused with the fourth branch and further enhanced with local details via a CEBlock. The computations for the five branches are:
$$B_i = \text{CEBlock}(\text{Up}(\text{Conv}_{1\times1}(\text{Conv}_{k_i \times k_i, s_i}(F_{in}))) + \delta_{i>1} \cdot B_{i-1}), \quad i=1,\ldots,5$$
$$F_{out} = \text{Conv}_{1\times1}(\text{Concat}(B_1, B_2, B_3, B_4, B_5)) + \text{Conv}_{1\times1}(F_{in})$$
where $(k_1, k_2, k_3, k_4, k_5) = (1,5,9,17, H \times W)$, $(s_1, s_2, s_3, s_4, s_5) = (1,2,4,8,1)$, and $\delta_{i>1}$ is a conditional weighting factor equal to 1 when $i>1$, otherwise 0. The CEBlock is defined as:
$$\text{CEBlock}(F) = \text{Conv}_{3\times3}(F + \text{Conv}_{1\times1}(\text{GAP}(F)))$$
where GAP is global average pooling. This design, while introducing global semantics, explicitly recovers local high-frequency details through 3×3 convolution, effectively suppressing the smoothing effect caused by large receptive field fusion and significantly reducing the loss of boundary and small object information. Through stepwise fusion and the synergy of the CEBlock, CPM ensures detail priority in transmission via its structural ordering during multi-scale context feature extraction, and achieves adaptive adjustment of the contribution ratios of different scale features via learnable convolution parameters, thereby maintaining both detail integrity and global consistency. This offers significant advantages for small object detection, boundary segmentation, and complex background parsing in UAV remote sensing images. RMCAM focuses on capturing and enhancing fine-grained differences in the multi-dimensional feature space, improving the local discriminability of mid-to-high-level features. CPM, while integrating multi-scale context, preserves high-frequency details and introduces global semantics, achieving dynamic balance across scales. The two are complementary in structure and function: RMCAM outputs discriminative, multi-dimensionally enhanced features, providing a high-quality semantic foundation for CPM’s multi-scale fusion; CPM’s global-local coordination mechanism amplifies RMCAM’s optimization effect on details. This multi-dimensional, multi-scale collaborative optimization strategy enables the semantic branch to simultaneously handle small objects, complex boundaries, and long-range dependencies in high-resolution UAV remote sensing images.
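The CEBlock equation above admits a direct translation into PyTorch; the channel count is illustrative, and keeping input and output channels equal is an assumption.

```python
import torch
import torch.nn as nn

class CEBlock(nn.Module):
    """Sketch of the Context Embedding Block: GAP injects image-level
    context, a 1x1 conv re-projects it, and a 3x3 conv recovers local
    high-frequency detail, matching CEBlock(F) = Conv3x3(F + Conv1x1(GAP(F)))."""

    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, 1)
        self.refine = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f):
        g = f.mean(dim=(2, 3), keepdim=True)  # GAP(F), shape (B, C, 1, 1)
        return self.refine(f + self.proj(g))  # broadcast add, then 3x3 conv
```

The broadcast addition means every spatial location receives the same global summary, which the trailing 3×3 convolution then re-localizes.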
Enhanced Bilateral Guided Aggregation
To improve the fusion effect of detail and semantic features, an Enhanced Bilateral Guided Aggregation module (EBGA) is designed based on the Bilateral Guided Aggregation (BGA) module from BiSeNetV2. This module retains the advantages of the original dual-branch structure while proposing several structural improvements to enhance the collaborative representation of details and semantics, addressing the dual requirements of boundary accuracy and global modeling in high-resolution UAV remote sensing image segmentation. Firstly, based on the original unidirectional guidance mechanism where the semantic branch guides the detail branch, a reverse guidance from the detail branch to the semantic branch is added. This allows high-resolution spatial information to directly enhance the feature discriminability and localization accuracy of the semantic branch in boundary regions, effectively mitigating boundary blurring issues. Secondly, during branch feature fusion, EBGA compresses all intermediate feature channels to half of the original number to reduce redundant computation while maintaining key information transmission. After feature addition fusion, a 3×3 dilated convolution is introduced to expand the receptive field and enhance global semantic perception, combined with BN and ReLU activation to improve the nonlinear expressive capability of the features, endowing the fused features with both rich semantics and detailed spatial information. Finally, to further enhance the discriminability and selectivity of inter-channel information, EBGA integrates a Lightweight Channel Recalibration Module (LCRM) at the output.
LCRM first extracts channel statistics via global average pooling (GAP), then generates attention weights through 1×1 convolution and nonlinear activation to adaptively strengthen key channel responses; simultaneously, residual connections are retained to maintain the original information flow while enhancing features, thereby improving model stability and segmentation performance.
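A possible reading of LCRM in code is given below; the reduction-free single 1×1 convolution gate with sigmoid activation is an assumption consistent with the module’s “lightweight” framing, not a confirmed design.

```python
import torch
import torch.nn as nn

class LCRM(nn.Module):
    """Sketch of the Lightweight Channel Recalibration Module:
    GAP statistics -> 1x1 conv + sigmoid gate -> residual recalibration."""

    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, f):
        # Channel attention weights from global statistics, shape (B, C, 1, 1)
        w = self.gate(f.mean(dim=(2, 3), keepdim=True))
        # Recalibrate channels while the residual path keeps the original flow
        return f + f * w
```

The residual form guarantees the gate can only add emphasis on top of the unmodified fused features, which matches the stability argument in the text.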
Loss Function Design
To address the characteristics of UAV remote sensing images, such as diverse target morphology, scale variation, and boundary blurring, a multi-supervised training scheme fusing global and detail feature constraints is designed, enabling the model to balance overall semantic parsing and precise segmentation of complex regions during optimization. The overall loss is weighted from the supervision signals of the Detail Guidance Module, multi-scale auxiliary supervision in the semantic branch, the dual-branch fusion location, and the final segmentation head. The loss for the Detail Guidance Module $L_{\text{detail}}$ is defined earlier. Its weight is set to 1 in the total loss to maintain consistent gradient magnitude with the main branches and effectively alleviate issues with small objects and class imbalance. In the semantic branch, to enhance the discriminative ability of features at different semantic levels, auxiliary supervision heads are introduced at the four stages with feature resolutions of 1/4, 1/8, 1/16, and 1/32. Their loss function is defined as:
$$L_{\text{aux}}^{(i)} = \alpha L_{CE}^{(i)} + \beta L_{\text{Dice}}^{(i)}, \quad i=1,2,3,4$$
where the cross-entropy loss $L_{CE}$ measures category prediction accuracy; the Dice loss $L_{\text{Dice}}$ strengthens boundary consistency and segmentation effectiveness in small object regions. The weight ratio $\alpha=0.4$, $\beta=0.6$ references empirical settings from remote sensing segmentation and is validated in our experiments. At the final segmentation head and the dual-branch fusion location, the cross-entropy loss with Online Hard Example Mining (OHEM) strategy is adopted, corresponding to the segmentation head output ($L_{\text{OHEM1}}$) and the dual-branch fusion output ($L_{\text{OHEM2}}$). Their mathematical definition and combination form are:
$$L_{\text{OHEM}k} = -\frac{1}{|S|} \sum_{j \in S} \sum_{c=1}^{C} y_{j,c} \ln(p_{j,c}), \quad k \in \{1,2\}$$
$$L_{\text{OHEM-total}} = L_{\text{OHEM1}} + \gamma L_{\text{OHEM2}}$$
where $S$ denotes the set of top-k hard samples selected from all pixels based on prediction correctness probability from low to high, $C$ is the total number of categories, $y_{j,c}$ is the one-hot label for pixel $j$, and $p_{j,c}$ is the prediction probability; based on experimental comparison results, $\gamma$ is set to 0.4. This strategy improves the model’s discrimination of complex boundaries and detail regions by focusing on hard samples and reducing gradient interference from easy samples. In summary, the total loss function is defined as:
$$L_{\text{total}} = L_{\text{detail}} + \sum_{i=1}^{4} L_{\text{aux}}^{(i)} + L_{\text{OHEM-total}}$$
This total loss combines high-resolution detail guidance (DGM), multi-scale deep supervision (semantic branch), global constraints, and fusion supervision (OHEM-total), significantly improving overall accuracy, boundary quality, and small object recognition in UAV remote sensing segmentation tasks characterized by class imbalance and multi-scale complex objects.
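Given the individual loss terms, the overall objective reduces to a simple weighted sum; the weights below follow the text (DGM weight 1, $\gamma = 0.4$).

```python
def total_loss(l_detail, l_aux, l_ohem1, l_ohem2, gamma=0.4):
    """Combine the loss terms of L_total.

    l_detail : DGM loss (weight 1)
    l_aux    : iterable of the four auxiliary losses L_aux^(1..4)
    l_ohem1  : OHEM loss at the segmentation head
    l_ohem2  : OHEM loss at the dual-branch fusion output (weight gamma)
    """
    return l_detail + sum(l_aux) + l_ohem1 + gamma * l_ohem2
```

In a PyTorch training loop each argument would be a scalar tensor, so the returned value backpropagates through all supervision heads at once.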
Experiments and Analysis
Datasets and Preprocessing
Three representative datasets are selected for the UAV image semantic segmentation task: Vaihingen, Potsdam, and UAVid, covering aerial urban imagery, finely annotated ground scenes, and dynamic urban traffic environments, aiming to comprehensively evaluate the adaptability and generalization capability of the proposed method. The Vaihingen and Potsdam datasets are from ISPRS, with images uniformly cropped to 512×512 resolution. Vaihingen contains 344 training, 100 validation, and 198 test images. Potsdam contains 3450, 1000, and 1016 images for training, validation, and testing, respectively. UAVid focuses on dynamic urban roads, with images cropped to 1024×1024, divided into 2400 training, 300 validation, and 540 test images.
Differentiated data augmentation strategies are designed based on dataset characteristics. For Vaihingen and Potsdam, images are first randomly scaled, followed by random cropping, horizontal flipping, and lighting perturbation, with labels transformed synchronously. For UAVid, images are first scaled to a base resolution of 2048×1024, then randomly scaled within a range of 0.5 to 2.0, followed by the same cropping and flipping steps as the first two datasets.
During validation and testing, a unified preprocessing pipeline is employed to ensure objective and reproducible results, including scale normalization, size adjustment, and label packaging, without random augmentation. In the testing phase, multi-scale enhancement and horizontal flip combined prediction (Test-time Augmentation, TTA) is further employed. This strategy has been proven effective in enhancing model robustness in several semantic segmentation studies. Specifically, multi-scale prediction enhances adaptability to different object sizes, while flip prediction reduces model bias towards image orientation. To ensure fairness and reproducibility, TTA uses consistent configuration for all datasets and is only used during testing. All input images are standardized before feeding into the model, with pixel values normalized to the [0,1] range.
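The combined multi-scale and horizontal-flip TTA can be sketched as below. The scale set is an illustrative assumption; the paper states only that one consistent configuration is used across all datasets.

```python
import torch
import torch.nn.functional as F

def tta_predict(model, image, scales=(0.75, 1.0, 1.25)):
    """Average model logits over several scales and horizontal flips.

    `image` is a (B, 3, H, W) tensor; `model` maps an image tensor to
    per-pixel logits of the same spatial size.
    """
    _, _, H, W = image.shape
    logits_sum = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode='bilinear',
                          align_corners=False)
        for flip in (False, True):
            xin = torch.flip(x, dims=[3]) if flip else x
            out = model(xin)
            if flip:
                out = torch.flip(out, dims=[3])  # undo flip before averaging
            logits_sum = logits_sum + F.interpolate(
                out, size=(H, W), mode='bilinear', align_corners=False)
    return logits_sum / (2 * len(scales))
```

Because every augmented prediction is mapped back to the original geometry before averaging, the ensemble stays label-consistent pixel by pixel.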
Experimental Environment and Training Parameters
Experiments were conducted on a system running Ubuntu 22.04.5, with hardware including an NVIDIA RTX 3090 GPU, Intel Xeon Platinum 8352V CPU, and 125 GB RAM. The runtime environment was based on CUDA 12.4 and cuDNN 9.0.1. The software platform used Python 3.8.20 and PyTorch 2.4.1 for model construction and training, and relied on the MMSegmentation v1.2.2 framework for semantic segmentation module development and debugging.
The training batch size was set to 4, using a poly learning rate decay strategy (Power=0.9). Considering the differences in sample size, class distribution, and scene characteristics among the three datasets, training hyperparameters were set with reference to common configurations in existing remote sensing semantic segmentation research. The specific parameters are summarized in the table below.
| Dataset | Optimizer | Learning Rate | Momentum | Max Iterations |
|---|---|---|---|---|
| Vaihingen | Adam | 0.001 | β1=0.9, β2=0.999 | 80000 |
| Potsdam | SGD | 0.01 | 0.9 | 160000 |
| UAVid | SGD | 0.001 | 0.9 | 80000 |
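The poly decay schedule (Power = 0.9) used for all three datasets is:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate decay: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

For the Potsdam setting (base rate 0.01, 160000 iterations), the rate starts at 0.01 and decays smoothly to zero at the final iteration.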
Evaluation Metrics
To comprehensively measure model performance in semantic segmentation tasks, three mainstream evaluation metrics are selected: Mean Intersection over Union (mIoU), Mean Dice Coefficient (mDice), and Mean Accuracy (mAcc), providing quantitative analysis from dimensions of region overlap, prediction consistency, and classification accuracy. The formulas are as follows:
$$\text{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i}$$
$$\text{mDice} = \frac{1}{C} \sum_{i=1}^{C} \frac{2TP_i}{2TP_i + FP_i + FN_i}$$
$$\text{mAcc} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
where $C$ represents the total number of categories, $TP_i$, $FP_i$, $FN_i$ represent the true positive, false positive, and false negative counts for the $i$-th class, respectively, and in mAcc, $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$.
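All three metrics can be computed from a single confusion matrix; a NumPy sketch:

```python
import numpy as np

def seg_metrics(conf):
    """Compute (mIoU, mDice, mAcc) from a CxC confusion matrix `conf`,
    where conf[i, j] counts pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)        # true positives per class
    fp = conf.sum(axis=0) - tp              # false positives per class
    fn = conf.sum(axis=1) - tp              # false negatives per class
    miou = np.mean(tp / (tp + fp + fn))
    mdice = np.mean(2 * tp / (2 * tp + fp + fn))
    macc = np.mean(tp / conf.sum(axis=1))   # per-class recall, averaged
    return miou, mdice, macc
```

Note that mDice is a monotone transform of per-class IoU (Dice = 2·IoU/(1+IoU)), which is why the two metrics track each other closely in the comparison tables.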
Ablation Study and Analysis
To verify the effectiveness of the proposed modules in improving semantic segmentation performance and lightweight design, a step-by-step ablation study was conducted on the Vaihingen dataset using BiSeNetV2 as the baseline model, evaluating the contribution of each module from the dimensions of accuracy metric mIoU and parameter count Params (MB).
First, the effects of the Residual Multidimensional Collaborative Attention Module (RMCAM), Wavelet Transform Downsampling Module (WTDM), and Context Pyramid Module (CPM) within the backbone network were assessed. The results are shown in the table below.
| Baseline | RMCAM | WTDM | CPM | mIoU/% | Params/MB |
|---|---|---|---|---|---|
| √ | × | × | × | 68.46 | 3.35 |
| √ | √ | × | × | 69.54 | 3.35 |
| √ | × | √ | × | 69.93 | 3.28 |
| √ | √ | √ | × | 70.74 | 3.28 |
| √ | √ | √ | √ | 71.87 | 3.53 |
The baseline model achieved an mIoU of 68.46% with 3.35MB parameters. Introducing RMCAM increased mIoU by 1.08 percentage points with a negligible parameter increase, demonstrating its ability to enhance feature expression while maintaining lightweight. Replacing with WTDM increased mIoU to 69.93% while reducing parameters to 3.28MB, benefiting from its replacement of strided convolution which reduces kernel count and better preserves edges and details. Combining RMCAM and WTDM achieved an mIoU of 70.74% while maintaining 3.28MB parameters. Further adding CPM increased mIoU to 71.87% with a slight parameter increase of 0.18MB, bringing the cumulative accuracy gain over the baseline to 3.41 percentage points.
Following the analysis of internal backbone modules, ablation experiments on external modules—the EBGA module, DGM module, and the improved loss function—were further conducted. The results are shown below.
| Backbone | EBGA | DGM | Improved Loss | mIoU/% | Params/MB |
|---|---|---|---|---|---|
| √ | × | × | × | 71.87 | 3.53 |
| √ | √ | × | × | 72.84 | 3.86 |
| √ | √ | √ | × | 73.04 | 3.86 |
| √ | √ | √ | √ | 73.45 | 3.86 |
With all internal modules integrated, the gains from the external modules were then evaluated. Adding the Enhanced Bilateral Guided Aggregation module (EBGA) increased mIoU to 72.84%, with parameters rising to 3.86 MB. Introducing the Detail Guidance Module (DGM) further raised mIoU to 73.04% with no change in parameter count, since DGM operates only during training and does not affect the inference-time model size. Finally, adding the improved loss function achieved an mIoU of 73.45% with parameters unchanged, further enhancing accuracy by optimizing the objective to mitigate class imbalance and boundary blur.
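The improved loss combines OHEM with a Dice term (as stated in the conclusion). A minimal sketch of that combination is below; the `keep_ratio` and balancing weight `lam`, and all function names, are assumed values for illustration, not the paper's settings:

```python
import numpy as np

def ohem_ce(probs, labels, keep_ratio=0.5):
    """Online hard example mining: average CE over only the hardest pixels.

    probs: (N, C) softmax outputs; labels: (N,) ground-truth class indices.
    """
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-8)
    k = max(1, int(keep_ratio * len(ce)))
    return np.sort(ce)[-k:].mean()          # keep the k largest losses

def dice_loss(probs, labels, num_classes):
    """Soft Dice loss averaged over classes, countering class imbalance."""
    onehot = np.eye(num_classes)[labels]    # (N, C) one-hot targets
    inter = (probs * onehot).sum(axis=0)
    union = probs.sum(axis=0) + onehot.sum(axis=0)
    return 1.0 - (2 * inter / (union + 1e-8)).mean()

def combined_loss(probs, labels, num_classes, lam=1.0):
    # lam is an assumed balancing weight between the two terms
    return ohem_ce(probs, labels) + lam * dice_loss(probs, labels, num_classes)
```

OHEM focuses the gradient on hard pixels (typically boundaries and small objects), while the Dice term normalizes per class, which is consistent with the boundary and small-object gains reported above.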
Comparative Experimental Results and Analysis
To verify the effectiveness and advancement of the proposed WDCNet model in UAV remote sensing image semantic segmentation tasks, systematic experiments were conducted on three typical remote sensing segmentation datasets: UAVid, Vaihingen, and Potsdam, with analysis of both quantitative metrics and visual results.
The UAVid dataset features a typical low-altitude UAV perspective and complex dynamic backgrounds, and is primarily used to assess model adaptability and robustness in low-altitude remote sensing scenarios. The quantitative comparison of different semantic segmentation models on the UAVid dataset is shown in the table below. The proposed WDCNet achieved the best overall performance, with mIoU, mDice, and mAcc reaching 73.25%, 83.21%, and 81.76%, respectively, surpassing the strongest compared method, DDRNet, by 3.22, 1.91, and 2.57 percentage points. The improvement in mDice is the most pronounced, indicating that WDCNet holds a stronger advantage in boundary delineation and fine-grained structure preservation. The per-class IoU results show that WDCNet leads in most categories. In the ‘human’ class, IoU reached 44.82%, exceeding DDRNet by over 9 percentage points and demonstrating the role of the detail guidance module and optimized loss function in recognizing small objects and boundary regions. In the ‘road’, ‘static car’, and ‘moving car’ classes, IoU reached 88.29%, 72.02%, and 77.64%, respectively, clear improvements over the other methods. The model's parameter count is only 3.86 MB, comparable to the lightweight BiSeNetV2, yet it significantly outperforms all compared methods in accuracy, achieving an excellent balance between performance and efficiency.
| Methods | Clutter | Building | Road | Static Car | Tree | Low Veg. | Human | Moving Car | mIoU/% | mDice/% | mAcc/% | Params/MB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSPNet | 64.46 | 89.91 | 78.70 | 67.26 | 76.63 | 67.11 | 32.91 | 72.79 | 68.72 | 80.31 | 78.35 | 12.64 |
| ICNet | 62.85 | 90.07 | 76.92 | 62.49 | 75.03 | 66.36 | 33.67 | 71.72 | 67.39 | 79.42 | 78.31 | 47.53 |
| BiSeNet V2 | 61.56 | 89.67 | 75.76 | 63.17 | 75.07 | 66.73 | 33.45 | 72.23 | 67.21 | 79.28 | 77.25 | 3.35 |
| DDRNet | 65.77 | 90.90 | 80.11 | 68.87 | 77.04 | 67.55 | 35.19 | 74.83 | 70.03 | 81.30 | 79.19 | 20.29 |
| WDCNet (Ours) | 66.30 | 91.28 | 88.29 | 72.02 | 77.51 | 68.18 | 44.82 | 77.64 | 73.25 | 83.21 | 81.76 | 3.86 |
On the Vaihingen dataset, WDCNet achieved mIoU, mDice, and mAcc of 73.45%, 84.03%, and 81.93%, respectively, surpassing the second-best DeepLabV3+ by 1.67, 1.55, and 2.18 percentage points and ranking first overall. The gain in mIoU is particularly notable, further validating its robustness and class balance in complex remote sensing scenes. Per-class IoU analysis shows that WDCNet outperforms the other compared methods in classes such as clutter, building, and car, demonstrating stronger capability in segmenting blurred boundaries and small objects. In terms of model efficiency, WDCNet's parameter count is far lower than that of DeepLabV3+ and DDRNet, and close to that of BiSeNet V2 while delivering significantly better accuracy.
| Methods | Clutter | Building | Tree | Low Veg. | Imp. Sur. | Car | mIoU/% | mDice/% | mAcc/% | Params/MB |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | 25.68 | 64.84 | 77.70 | 67.55 | 86.46 | 81.75 | 67.33 | 78.39 | 76.19 | 28.98 |
| DeepLabV3+ | 41.16 | 66.77 | 78.87 | 69.99 | 89.39 | 84.48 | 71.78 | 82.48 | 79.75 | 12.32 |
| BiSeNet V2 | 31.77 | 59.84 | 78.53 | 69.22 | 88.20 | 83.33 | 68.46 | 79.58 | 76.17 | 3.35 |
| WDCNet (Ours) | 50.00 | 69.63 | 78.29 | 68.98 | 89.07 | 84.73 | 73.45 | 84.03 | 81.93 | 3.86 |
On the Potsdam dataset, WDCNet achieved mIoU, mDice, and mAcc of 76.54%, 85.42%, and 84.26%, respectively, outperforming all compared methods. In mIoU it surpassed DDRNet by 1.69 percentage points, demonstrating its stability and generalizability in multi-class segmentation tasks. This advantage is primarily attributed to the synergy of the context pyramid and detail guidance modules, which enables the model to excel at recovering complex structures and representing small objects. In the ‘building’ and ‘car’ classes, WDCNet achieved IoUs of 89.39% and 84.90%, significantly outperforming DDRNet and SegFormer-B1 and indicating stronger advantages in maintaining building boundary continuity and recognizing complete vehicle targets. It also leads in texture-complex classes such as ‘tree’ and ‘low vegetation’, reflecting its robustness in spatial context modeling and fine-grained structure recovery. In terms of model efficiency, WDCNet's parameter count is not only far lower than that of large models such as FastFCN but also substantially smaller than most mainstream methods, while its accuracy exceeds that of various lightweight models, achieving a balance of high efficiency and high accuracy.
| Methods | Clutter | Building | Tree | Low Veg. | Imp. Sur. | Car | mIoU/% | mDice/% | mAcc/% | Params/MB |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | 34.68 | 89.34 | 77.26 | 73.94 | 89.15 | 83.37 | 74.62 | 83.88 | 82.97 | 28.98 |
| BiSeNet V2 | 38.83 | 87.84 | 75.10 | 73.88 | 89.49 | 83.67 | 74.80 | 84.30 | 83.41 | 3.35 |
| DDRNet | 36.72 | 88.86 | 75.87 | 73.30 | 89.89 | 84.42 | 74.85 | 84.15 | 83.38 | 20.29 |
| WDCNet (Ours) | 39.67 | 89.39 | 78.09 | 75.99 | 91.24 | 84.90 | 76.54 | 85.42 | 84.26 | 3.86 |
Conclusion
This paper addresses the problems of boundary detail loss, insufficient utilization of multi-scale context, and unidirectional fusion of semantics and details in semantic segmentation of UAV remote sensing images by proposing a lightweight, high-precision network named WDCNet. It establishes a systematic design paradigm of “detail preservation – semantic enhancement – bidirectional fusion.” In the detail path, a wavelet transform downsampling module and a detail guidance module are introduced to preserve high-frequency details and boundary contours. In the semantic branch, a residual multidimensional collaborative attention module and a context pyramid module are combined to efficiently model multi-scale context. At the fusion stage, an enhanced bilateral guided aggregation module realizes a bidirectional feedback mechanism between details and semantics. In addition, a combined OHEM and Dice loss is employed to mitigate the impact of class imbalance and hard samples, further improving boundary and small-object segmentation performance. Experimental results on three typical remote sensing datasets—UAVid, Vaihingen, and Potsdam—demonstrate that WDCNet surpasses representative lightweight methods in accuracy, boundary parsing quality, and small-object recognition while maintaining low model complexity. This validates its feasibility and application potential for high-precision segmentation under resource-constrained conditions and confirms its deployment feasibility on embedded and resource-limited platforms, meeting the intelligent parsing needs of various low-altitude remote sensing tasks. Future work will integrate graph neural networks to enhance global semantic modeling, combine weak supervision and few-shot learning to improve adaptability to low-annotation scenarios, and pursue cross-modal segmentation with heterogeneous sensors to expand multi-source remote sensing fusion applications.
