The proliferation of unmanned aerial vehicle (UAV) technology has revolutionized data acquisition across numerous civilian and military domains. Offering extensive coverage, real-time data collection, long endurance, and relatively low operational cost, UAV platforms have become indispensable. A core technology for the intelligent interpretation of this vast visual data is semantic segmentation of UAV imagery: pixel-wise classification that assigns an object category label to every pixel of an aerial image. This pixel-level understanding is fundamental for efficient geographic information extraction and forms the data backbone for autonomous decision-making in many UAV applications. At the same time, the broad, overhead perspective of UAV imagery poses significant challenges for segmentation algorithms: 1) the open, high-altitude viewpoint produces large variations in object scale, from large structures to small vehicles or individuals, and often results in blurred or indistinct object boundaries; 2) the extensive coverage typically yields a highly imbalanced class distribution, in which minority classes (e.g., pedestrians, vehicles) occupy a tiny fraction of the pixels compared with majority classes (e.g., sky, vegetation). This imbalance biases training towards the dominant classes and degrades segmentation accuracy for smaller, yet often operationally critical, objects.
In recent years, semantic segmentation has seen remarkable progress driven by deep convolutional neural networks. The Fully Convolutional Network (FCN) established the foundation for end-to-end pixel-level prediction. Subsequent models such as U-Net, with its symmetric encoder-decoder and skip connections, excelled at preserving fine-grained detail. PSPNet introduced pyramid pooling to capture multi-scale context, while the DeepLab series addressed the loss of spatial resolution through atrous (dilated) convolutions and modules such as Atrous Spatial Pyramid Pooling (ASPP). DeepLab V3+, in particular, improved boundary delineation by adopting an encoder-decoder structure. Despite these advances, precise segmentation in complex UAV scenes, which exhibit extreme scale variation, ambiguous boundaries, and severe class imbalance, remains an open challenge; existing methods often struggle to address these intertwined issues comprehensively.

To tackle the aforementioned challenges of UAV image analysis, we propose a comprehensively enhanced DeepLab V3+ network. Our improvements are designed to strengthen feature representation, refine boundary details, and counteract the training bias caused by class imbalance, making the model more robust and accurate for UAV semantic segmentation.
The first major enhancement is the design of a novel CA-PA (Channel Attention – Position Attention) dual-attention mechanism integrated into the feature extraction backbone. Attention mechanisms allow the model to dynamically focus on the most informative parts of the feature maps. Our CA-PA module sequentially applies channel and spatial attention. The Channel Attention (CA) component learns to emphasize or suppress specific feature channels. To reduce parameter complexity compared to standard implementations, we employ shared one-dimensional convolutions. Given an input feature map $F$, the channel attention process is formulated as:
$$M_c(F) = \sigma(\text{Conv1D}_k(\text{AvgPool}(F)) + \text{Conv1D}_k(\text{MaxPool}(F))) \odot F + F$$
where $M_c(F)$ is the output, $\sigma$ is the Sigmoid function, $\text{Conv1D}_k$ denotes a 1D convolution with kernel size $k=3$, $\text{AvgPool}$ and $\text{MaxPool}$ are global pooling operations, and $\odot$ is element-wise multiplication. The refined features $X = M_c(F)$ are then passed to the Position Attention (PA) module. To capture long-range spatial dependencies without the prohibitive memory cost of a full Non-Local block, we integrate a Pyramid Pooling Module (PPM) to downsample the Key and Value projections. This drastically reduces the size of the affinity matrix from $HW \times HW$ to $HW \times 110$, where $H$ and $W$ are spatial dimensions. The position attention is computed as:
$$M_q(X) = \sigma(f_{1\times1}(X^T)), \quad M_k(X) = \text{PPM}(f_{1\times1}(X)), \quad M_v(X) = \text{PPM}(f_{1\times1}(X))$$
$$M_p(X) = M_v(X) \times (M_k(X) \times M_q(X))^T + X$$
Here, $M_q, M_k, M_v$ are Query, Key, and Value transformations, $f_{1\times1}$ is a 1×1 convolution, and PPM denotes pyramid pooling. The final output of the CA-PA mechanism is $O(F) = M_p(M_c(F))$. This dual-attention mechanism enables our model to adaptively focus on both semantically salient channels and spatially critical regions across different scales, which is vital for processing diverse UAV scenes.
The second key improvement addresses the fusion of low-level and high-level features in the DeepLab V3+ decoder. Simple concatenation can lead to feature degradation and blurred boundaries. We therefore introduce a Feature Fusion Module (FFM) that merges the upsampled high-level semantic features from the encoder with the detail-rich low-level features from the backbone. The FFM first concatenates the two feature streams, then applies a channel attention mechanism built from global average pooling, a 1×1 convolution, and a Sigmoid activation, producing a channel-wise weighting vector that highlights the more informative channels of the concatenated tensor. A final 1×1 convolution with a residual connection refines the fused features. This preserves detailed boundary information from the low-level features and reinforces it with the contextual understanding of the high-level features, yielding sharper segmentation masks for objects captured by the UAV.
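A minimal PyTorch sketch of the FFM described above, assuming bilinear upsampling of the high-level stream and a 1×1 projection on the residual path when channel counts differ; the channel arguments and names are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Fuses detail-rich low-level features with upsampled high-level semantics."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        in_ch = low_ch + high_ch
        # Channel attention: GAP -> 1x1 conv -> Sigmoid gives per-channel weights.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, in_ch, 1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(in_ch, out_ch, 1)   # final 1x1 refinement
        # Projection so the residual connection matches the output channels.
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, low, high):                    # low: (B, Cl, H, W), high: (B, Ch, h, w)
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([low, high], dim=1)            # concatenate the two streams
        weighted = x * self.gate(x)                  # re-weight channels
        return self.refine(weighted) + self.skip(x)  # refine with residual connection
```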
The third cornerstone of our method is the adoption of a hybrid loss function to combat the class imbalance prevalent in UAV datasets. The standard Cross-Entropy (CE) loss tends to be dominated by majority classes. Dice Loss directly optimizes the overlap between prediction and ground truth and is naturally more robust to imbalance, but it can suffer from training instability. Focal Loss down-weights the contribution of well-classified, easy examples, forcing the network to focus on harder, minority-class pixels. We combine their strengths through a weighted sum:
$$L_{\text{Dice}} = 1 - \frac{2|X \cap Y|}{|X| + |Y|}, \quad L_{\text{Focal}} = -\alpha (1 - p_c)^{\gamma} \log(p_c)$$
$$L_{\text{Hybrid}} = w_1 \cdot L_{\text{Dice}} + w_2 \cdot L_{\text{Focal}}$$
where $X$ and $Y$ are the ground truth and prediction sets, $p_c$ is the predicted probability for the target class $c$, and $\alpha$ and $\gamma$ are hyperparameters. We set $w_1=0.8$ and $w_2=0.2$ based on empirical validation. This hybrid loss provides a stable training signal that significantly improves segmentation performance for under-represented classes in UAV imagery.
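A compact PyTorch sketch of this hybrid loss follows; the focal parameters ($\alpha = 0.25$, $\gamma = 2$) and the Dice smoothing term are illustrative defaults that the text above does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridLoss(nn.Module):
    """L_Hybrid = w1 * L_Dice + w2 * L_Focal with w1 = 0.8, w2 = 0.2."""
    def __init__(self, num_classes, w1=0.8, w2=0.2, alpha=0.25, gamma=2.0, smooth=1e-6):
        super().__init__()
        self.num_classes = num_classes
        self.w1, self.w2 = w1, w2
        self.alpha, self.gamma, self.smooth = alpha, gamma, smooth

    def forward(self, logits, target):               # logits: (B, C, H, W), target: (B, H, W)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()

        # Dice loss: one minus the mean per-class overlap.
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.smooth) / (union + self.smooth)).mean()

        # Focal loss: down-weight easy, well-classified pixels.
        log_pt = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        focal = (-self.alpha * (1.0 - pt) ** self.gamma * log_pt).mean()

        return self.w1 * dice + self.w2 * focal
```

Usage mirrors a standard criterion, e.g. `loss = HybridLoss(num_classes=12)(logits, labels)`.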
We conducted extensive experiments to validate the proposed method. The primary evaluation was performed on the Aeroscapes dataset, a challenging UAV benchmark with 11 foreground classes plus a background class and significant class imbalance. We also tested generalization on the UAVid dataset. Our experimental setup used standard training protocols for fair comparison. The key evaluation metrics are Mean Intersection over Union (MIoU) and Mean Pixel Accuracy (MPA), calculated as follows:
$$\text{MIoU} = \frac{1}{m+1} \sum_{i=0}^{m} \frac{p_{ii}}{\sum_{j=0}^{m} p_{ij} + \sum_{j=0}^{m} p_{ji} - p_{ii}}, \quad \text{MPA} = \frac{1}{m+1} \sum_{i=0}^{m} \frac{p_{ii}}{\sum_{j=0}^{m} p_{ij}}$$
where $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and $m$ is the number of foreground classes.
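For reference, a short NumPy sketch that accumulates the confusion matrix and computes the two metrics exactly as defined above; the function names are ours.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate an (m+1) x (m+1) matrix where entry (i, j) counts pixels of
    class i predicted as class j (both arguments are flattened label arrays)."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_mpa(conf):
    """MIoU and MPA from an accumulated confusion matrix."""
    tp = np.diag(conf)                        # p_ii
    gt = conf.sum(axis=1)                     # sum_j p_ij (ground-truth pixels per class)
    pred = conf.sum(axis=0)                   # sum_j p_ji (predicted pixels per class)
    iou = tp / (gt + pred - tp + 1e-10)
    acc = tp / (gt + 1e-10)
    return iou.mean(), acc.mean()             # MIoU, MPA
```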
Ablation studies conclusively demonstrate the contribution of each proposed component. The table below summarizes the results, showing that each module individually improves performance, and their combination yields the best result.
| Model | CA-PA | FFM | Hybrid Loss | MIoU (%) | MPA (%) | FPS |
|---|---|---|---|---|---|---|
| Baseline (DeepLab V3+) | | | | 68.72 | 76.79 | 31.17 |
| Model 1 | ✓ | | | 70.40 | 78.34 | 30.40 |
| Model 2 | | ✓ | | 70.25 | 76.99 | 29.27 |
| Model 3 | | | ✓ | 71.87 | 79.46 | 31.12 |
| Model 4 | ✓ | ✓ | | 71.02 | 78.51 | 29.04 |
| Model 5 | ✓ | | ✓ | 73.08 | 81.20 | 29.85 |
| Model 6 | | ✓ | ✓ | 72.96 | 81.21 | 29.28 |
| Our Full Model | ✓ | ✓ | ✓ | 74.26 | 81.95 | 29.14 |
Per-class analysis reveals that the CA-PA module significantly improves segmentation of multi-scale objects such as “road” and “vegetation.” The FFM sharpens the boundaries of objects with complex outlines. Most notably, the hybrid loss brings dramatic improvements for minority classes such as “bicycle,” “UAV drone,” and “obstacle,” as seen in the detailed IoU table below. This is crucial for practical UAV operations, where detecting such small objects is paramount.
| Class | Baseline IoU (%) | Our Model IoU (%) |
|---|---|---|
| Background | 80.44 | 76.68 |
| Road | 86.87 | 83.64 |
| Vegetation | 94.50 | 93.23 |
| Building | 79.28 | 78.64 |
| Sky | 95.39 | 96.21 |
| Animal | 66.15 | 71.30 |
| Boat | 72.91 | 81.92 |
| Person | 59.29 | 65.74 |
| Bicycle | 19.52 | 43.20 |
| Car | 85.91 | 90.67 |
| UAV Drone | 59.68 | 72.88 |
| Obstacle | 24.69 | 37.00 |
We compared our full model against several state-of-the-art semantic segmentation networks on the Aeroscapes dataset. The results, consolidated in the following table, show that our method achieves the highest MIoU and MPA, outperforming strong competitors such as HRNet, SegFormer, and the recent efficient model SegMAN. This demonstrates the effectiveness of our integrated improvements for the UAV domain.
| Network | Backbone | MIoU (%) | MPA (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| FCN | ResNet-50 | 51.03 | 64.39 | 32.95 | 823.95 |
| U-Net | ResNet-50 | 61.66 | 69.50 | 43.93 | 538.90 |
| PSPNet | ResNet-50 | 62.74 | 69.85 | 46.71 | 352.95 |
| HRNet | HRNetV2-w32 | 69.94 | 79.58 | 29.54 | 271.33 |
| SegFormer | SegFormer-b2 | 69.14 | 76.39 | 27.36 | 337.81 |
| SegMAN | SegMAN-B | 73.13 | 79.50 | 51.78 | 152.55 |
| DeepLab V3+ | Xception | 68.72 | 76.79 | 54.71 | 497.49 |
| Our Model | Xception | 74.26 | 81.95 | 55.52 | 563.10 |
Generalization tests on the UAVid dataset further confirm the robustness of our approach. Our model achieved an MIoU of 61.75% and an MPA of 72.34%, gains of 2.89 and 3.47 percentage points over the baseline DeepLab V3+, respectively. This performance is competitive with the leading SegMAN model and superior to the other benchmarks, demonstrating adaptability to different UAV imaging scenarios and label sets.
In conclusion, we have presented a significantly enhanced DeepLab V3+ network tailored to the demands of UAV image semantic segmentation. By integrating a novel CA-PA attention mechanism, a dedicated Feature Fusion Module, and a hybrid Dice-Focal loss, our method effectively addresses the core challenges of multi-scale object recognition, boundary ambiguity, and severe class imbalance. Comprehensive experiments on standard UAV benchmarks show that our model achieves state-of-the-art performance, delivering more precise and reliable segmentation maps and thereby improving the perception and autonomous capabilities of UAV systems. Future work will focus on model optimization techniques such as pruning and quantization to increase inference speed without compromising accuracy, facilitating real-time onboard processing for UAV applications.
