SGLO: Advancing Rotating Target Detection in China UAV Drone Remote Sensing Imagery

In the rapidly evolving field of computer vision, the detection of targets within remote sensing images captured by China UAV drone platforms has emerged as a critical task with immense practical significance. These unmanned aerial vehicles provide a wealth of high-resolution imagery that fuels applications in surveillance, agriculture, urban planning, and environmental monitoring. However, the unique challenges presented by this domain, particularly the prevalence of small targets with arbitrary orientations and complex backgrounds, persistently hinder the performance of conventional detection frameworks. Small targets, due to their limited pixel footprint, often lose discriminative geometric and texture features through successive downsampling operations in deep networks. Simultaneously, the arbitrary rotational alignment of objects, such as vehicles, ships, or buildings in aerial views, renders traditional horizontal bounding boxes insufficient, leading to inaccurate localization and inclusion of excessive background noise. This confluence of factors—small scale, dense arrangements, and rotational variability—demands specialized algorithmic solutions to achieve the precision required for real-world China UAV drone operations.

The advent of deep learning has revolutionized object detection, with methodologies broadly categorized into two-stage and single-stage detectors. Pioneering two-stage networks, such as Faster R-CNN and its variants like Mask R-CNN, propose regions of interest before performing classification and regression, offering high accuracy at the cost of computational speed. In contrast, single-stage detectors like the YOLO series and SSD prioritize inference efficiency by directly predicting bounding boxes and class probabilities from image features in a single pass. While these models have achieved remarkable success in natural imagery, their direct application to rotated object detection in aerial scenes often falls short. Rotation-aware detectors such as R3Det and SCRDet++, together with Transformer-based models like AO2-DETR, have further spurred progress by introducing refined mechanisms for handling rotation, such as feature refinement modules and modulated loss functions. Nonetheless, a significant gap remains in designing a network that holistically addresses the trifecta of small target sensitivity, rotational robustness, and computational efficiency—a balance paramount for deployment on resource-constrained China UAV drone systems where real-time processing is often required.

To bridge this gap, we propose a novel network architecture named SGLO (Small-target Guided Lightweight Oriented object detector), specifically designed for rotating target detection in remote sensing imagery acquired by China UAV drone platforms. Our work is motivated by the need for a model that does not compromise detection accuracy for speed, nor vice versa, especially when dealing with the ubiquitous small targets in drone-captured scenes. The core innovations of SGLO are threefold, each targeting a specific limitation in existing pipelines. First, we enhance the backbone’s feature extraction capability, particularly for small objects, by integrating a space-to-depth convolution (SPDConv) module. Second, we redesign the neck structure with a multi-scale feature gathering and distributing mechanism, augmented with an additional very small-scale output path, to preserve and fuse fine-grained details critical for small target detection. Third, we introduce a lightweight oriented detection head that drastically reduces computational overhead without sacrificing performance. Through extensive experimentation on standard aerial datasets, we demonstrate that SGLO achieves state-of-the-art balance, pushing the boundaries of what is possible for intelligent analysis in China UAV drone applications.

The foundation of our approach is a critical analysis of feature representation. In standard convolutional neural networks (CNNs), repeated strided convolutions or pooling operations gradually reduce the spatial resolution of feature maps. While this expands the receptive field and captures higher-level semantics, it invariably leads to the erosion of features corresponding to small objects. For a China UAV drone flying at high altitude, a vehicle may occupy only a few dozen pixels in the image. After several downsampling stages, the semantic information for such a target becomes vanishingly small, making detection nearly impossible. To counteract this, we replace standard convolution blocks in the early stages of the backbone with the SPDConv module. The operation transforms a feature map \(X\) of spatial dimensions \(S \times S\) without losing information. It systematically partitions the input into a set of sub-feature maps. Formally, for a given scale factor `scale`, a sub-feature map \(f_{i,j}\) is extracted as:

$$f_{i,j} = X \left[\, i : S : scale,\ \ j : S : scale \,\right]$$

where \(i, j \in \{0, 1, \ldots, scale-1\}\) are the row and column offsets of an interleaved slice taken with step `scale`. This yields \(scale^2\) sub-feature maps, each of size \(\frac{S}{scale} \times \frac{S}{scale}\). These are then concatenated along the channel dimension and passed through a non-strided \(1 \times 1\) convolution. This process effectively performs a learnable downsampling that preserves fine spatial details. Unlike max-pooling or strided convolution, which discard information, SPDConv redistributes spatial information into channels, allowing subsequent layers to operate on a lower-resolution map while still having access to the complete set of localized details from the original resolution. This is particularly beneficial for the myriad small objects encountered in China UAV drone imagery, ensuring their features remain distinct and detectable throughout the network’s depth.
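The space-to-depth step can be sketched in a few lines of NumPy. This is a minimal illustration following the interleaved slicing used in the original SPD-Conv formulation, operating on a toy single-channel map; the learnable non-strided \(1 \times 1\) convolution that follows it in SGLO is omitted.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange an (S, S, C) feature map into (S/scale, S/scale, C*scale^2)
    by interleaved slicing, so that no pixel is discarded."""
    S = x.shape[0]
    assert S % scale == 0, "spatial size must be divisible by scale"
    # One sub-map per (i, j) offset: f_{i,j} = X[i::scale, j::scale]
    subs = [x[i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=-1)  # stack sub-maps along channels

# A 4x4 single-channel map becomes 2x2 with 4 channels; every input
# value appears exactly once in the output.
x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
y = space_to_depth(x, scale=2)
assert y.shape == (2, 2, 4)
assert sorted(y.flatten().tolist()) == sorted(x.flatten().tolist())
```

The second assertion is the point: the multiset of values is unchanged, so downsampling here is a rearrangement rather than a loss of information.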

Beyond improved feature extraction, effective multi-scale feature fusion is paramount. The path aggregation network (PANet) used in models like YOLOv8 facilitates information flow between different scales via top-down and bottom-up paths. However, this sequential fusion can lead to information dilution, especially for features from non-adjacent layers. To overcome this, SGLO incorporates an enhanced version of the Gather-and-Distribute (GD) mechanism from Gold-YOLO. Our redesigned neck, termed Multi_Gold, consists of two complementary GD processes: Low-GD and High-GD. Low-GD operates on the intermediate features \(\{B_2, B_3, B_4, B_5\}\) from the backbone. Its goal is to gather and align these multi-scale features. The Feature Alignment Module (FAM) within Low-GD uses operations like bilinear interpolation and average pooling to resize all input features to a common spatial dimension, typically that of \(B_4\). If we denote the aligned feature set as \(F_{\text{align}}\), the process is:

$$F_{\text{align}} = \text{LOW}_{\text{FAM}}([B_2, B_3, B_4, B_5])$$

The aligned features are then fused through an Information Fusion Module (IFM), which typically comprises a series of RepVGG blocks (RepBlock). The fused feature \(F_{\text{fuse}}\) is split to generate injection features for different levels:

$$F_{\text{fuse}} = \text{RepBlock}(F_{\text{align}})$$
$$[F_{\text{Inject\_P3}}, F_{\text{Inject\_P4}}] = \text{Split}(F_{\text{fuse}})$$
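The align–fuse–split pipeline above can be sketched as follows. This is a toy NumPy sketch, not the actual Multi_Gold implementation: average pooling and nearest-neighbour upsampling stand in for the FAM's resizing operations, a random channel-mixing matrix stands in for the IFM's RepBlock, and the pyramid sizes (32, 16, 8, 4) and target resolution are illustrative assumptions.

```python
import numpy as np

def avg_pool(x, k):
    # Downsample an (H, W, C) map by factor k with average pooling.
    H, W, C = x.shape
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def upsample_nearest(x, k):
    # Upsample an (H, W, C) map by factor k with nearest-neighbour repetition.
    return x.repeat(k, axis=0).repeat(k, axis=1)

def low_fam(feats, target):
    # Align all features to target x target, then concatenate channels.
    aligned = []
    for f in feats:
        s = f.shape[0]
        if s > target:
            f = avg_pool(f, s // target)
        elif s < target:
            f = upsample_nearest(f, target // s)
        aligned.append(f)
    return np.concatenate(aligned, axis=-1)

# Toy pyramid B2..B5 (sizes 32, 16, 8, 4; 4 channels each), aligned to
# the B4-like resolution of 8x8.
feats = [np.random.rand(s, s, 4) for s in (32, 16, 8, 4)]
f_align = low_fam(feats, target=8)
assert f_align.shape == (8, 8, 16)

# Stand-in for the IFM RepBlock: a (here random) learned channel mix.
W = np.random.rand(16, 16)
f_fuse = f_align @ W

# Split into injection features for the P3 and P4 pathways.
inj_p3, inj_p4 = np.split(f_fuse, 2, axis=-1)
assert inj_p3.shape == inj_p4.shape == (8, 8, 8)
```

The key structural point is that every pyramid level contributes channels to the shared pool before the split, rather than only adjacent levels exchanging information.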

Finally, the Inject module integrates these global context features into the local feature pathways. For a local feature \(x_{\text{local}}\) and a global injection feature \(x_{\text{global}}\), it computes a global activation map and a global embedding, fusing them as:

$$\text{Inject}_{\text{act}} = \sigma(\text{Conv}(x_{\text{global}}))$$
$$\text{Inject}_{\text{embed}} = \text{Conv}(x_{\text{global}})$$
$$x_{\text{fused}} = \text{Refine}(x_{\text{local}} \odot \text{Inject}_{\text{act}} + \text{Inject}_{\text{embed}})$$

where \(\sigma\) is the sigmoid function, \(\odot\) denotes element-wise multiplication, and Refine is a feature refinement block. High-GD performs a similar process on the higher-level features \(\{P_3, P_4, P_5\}\) produced after Low-GD. Crucially, our Multi_Gold extends this paradigm by introducing an additional output path for very small-scale features. For an input image of 640×640, while standard detectors output feature maps of sizes 20×20, 40×40, and 80×80, we add a 160×160 map. This map is generated by upsampling and refining a low-level feature from the GD process and concatenating it with a shallow backbone feature, creating a dedicated pathway attuned to the minute targets prevalent in China UAV drone surveys.
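The injection equations above reduce to a gated residual fusion. A minimal sketch, with \(1 \times 1\) convolutions modelled as per-pixel channel mixes and the trailing Refine block omitted; the shapes and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inject(x_local, x_global, W_act, W_embed):
    """x_fused = x_local * sigma(conv(x_global)) + conv(x_global),
    with 1x1 convolutions modelled as channel-mixing matrices."""
    act = sigmoid(x_global @ W_act)   # per-pixel gating map in (0, 1)
    embed = x_global @ W_embed        # global embedding term
    return x_local * act + embed

# Toy 8x8 features with 16 channels.
x_local = np.random.rand(8, 8, 16)
x_global = np.random.rand(8, 8, 16)
W_act = np.random.randn(16, 16) * 0.1
W_embed = np.random.randn(16, 16) * 0.1

x_fused = inject(x_local, x_global, W_act, W_embed)
assert x_fused.shape == (8, 8, 16)
```

The sigmoid gate lets the global context modulate each local position multiplicatively, while the embedding term contributes additively, so a level can both reweight and supplement its own features.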

The final component of SGLO addresses computational efficiency in the detection head. Predicting oriented bounding boxes (OBB) requires outputting additional parameters like the rotation angle, increasing the complexity of the final layers. The standard detection head in oriented models employs multiple convolutional blocks, which become a computational bottleneck, especially when processing high-resolution feature maps from the small-scale path. Inspired by lightweight asymmetric dual-head designs, we propose LADH_OBB (Lightweight Asymmetric Dual Head for OBB). It replaces the standard cascade of convolutions with a more efficient structure: a depthwise separable convolution (DSConv) followed by two standard 1×1 convolutions. The DSConv factorizes a standard convolution into a depthwise convolution (applying a single filter per input channel) and a pointwise convolution (1×1 convolution to combine channels), dramatically reducing parameters and FLOPs. For an input feature map \(F \in \mathbb{R}^{C \times H \times W}\), a standard 3×3 conv with \(C\) output channels has computational cost proportional to \(9 \cdot C^2 \cdot H \cdot W\). In contrast, DSConv has cost proportional to \(9 \cdot C \cdot H \cdot W + C^2 \cdot H \cdot W\). The subsequent 1×1 convolutions efficiently generate the final classification, regression, and angle predictions. This design significantly lowers the GFLOPs attributed to the detection head, making the entire SGLO network more suitable for on-board processing in China UAV drone systems where power and compute are limited.
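The cost relations quoted above can be verified with a few lines of arithmetic. The channel count and spatial size below are illustrative assumptions, not measurements of the actual LADH_OBB head:

```python
# Cost of a 3x3 standard convolution vs. a depthwise separable one, per
# the proportionality relations in the text (equal input/output channel
# count C, spatial size H x W; biases and normalisation ignored).
def std_cost(C, H, W, k=3):
    return k * k * C * C * H * W              # standard convolution

def ds_cost(C, H, W, k=3):
    return k * k * C * H * W + C * C * H * W  # depthwise + pointwise

# Illustrative head-sized input: C = 64 channels on an 80x80 map.
C, H, W = 64, 80, 80
ratio = ds_cost(C, H, W) / std_cost(C, H, W)
assert abs(ratio - (1 / C + 1 / 9)) < 1e-12   # ratio = 1/C + 1/9, ~0.127

# Parameter counts follow the same factorisation.
standard_params = C * C * 3 * 3               # full 3x3 kernel
dsconv_params = C * (3 * 3 + C)               # C*(Kh*Kw + C_out)
assert (standard_params, dsconv_params) == (36864, 4672)
```

The ratio simplifies to \(1/C + 1/9\), so for any reasonably wide layer the depthwise separable variant costs roughly one eighth of the standard convolution.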

We now present a comprehensive empirical evaluation of SGLO. All experiments are conducted on two benchmark datasets relevant to China UAV drone applications: DIOR-R (a large-scale oriented object detection dataset for optical remote sensing) and VisDrone2019 (a dataset focused on object detection in UAV-captured images, containing many small instances). We use standard evaluation metrics: Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5 and 0.5:0.95 (denoted mAP@50 and mAP@95). We also report computational complexity in Giga Floating Point Operations (GFLOPs). The baseline for our ablation studies is YOLOv8n-OBB, an efficient oriented object detector.

The first ablation study on the DIOR-R dataset investigates the individual and combined contributions of our three core modules: Multi_Gold (A), SPDConv (B), and LADH_OBB (C). The results are summarized in the table below.

| Experiment Configuration | P (%) | R (%) | mAP@50 (%) | mAP@95 (%) | GFLOPs |
|---|---|---|---|---|---|
| Baseline (YOLOv8n-OBB) | 86.4 | 78.6 | 83.5 | 65.4 | 8.9 |
| Baseline + A (Multi_Gold) | 87.7 | 80.2 | 85.8 | 66.6 | 14.2 |
| Baseline + B (SPDConv) | 87.0 | 78.9 | 84.1 | 65.9 | 9.7 |
| Baseline + C (LADH_OBB) | 87.3 | 77.1 | 83.0 | 64.5 | 5.7 |
| Baseline + A + B | 87.3 | 81.4 | 86.4 | 67.5 | 15.5 |
| Baseline + A + C | 86.3 | 81.1 | 85.6 | 67.3 | 9.7 |
| Baseline + B + C | 87.2 | 78.0 | 84.0 | 65.5 | 6.6 |
| SGLO (Baseline + A + B + C) | 86.9 | 81.3 | 86.3 | 67.2 | 10.4 |

The table reveals several insights. Incorporating Multi_Gold alone provides a substantial boost of 2.3% in mAP@50, confirming its efficacy in multi-scale fusion, albeit with a significant increase in GFLOPs. SPDConv offers a moderate gain of 0.6% with a minimal computational increase. LADH_OBB slightly reduces mAP@50 by 0.5% but achieves a remarkable reduction in GFLOPs from 8.9 to 5.7, highlighting its efficiency. The combination of A and B yields the highest mAP@50 (86.4%) but at the highest computational cost (15.5 GFLOPs). Our full SGLO model (A+B+C) achieves a near-optimal mAP@50 of 86.3%—a 2.8% improvement over the baseline—while maintaining a manageable GFLOPs of 10.4, which is only 1.5 GFLOPs more than the baseline. This represents an excellent trade-off, making SGLO both accurate and efficient for China UAV drone deployment scenarios.

We further conducted an ablation on the optimal number of SPDConv modules within the SGLO backbone. The results are shown in the following table.

| Experiment | Number of SPDConv Modules | mAP@50 (%) | mAP@95 (%) | GFLOPs |
|---|---|---|---|---|
| a | 0 | 85.6 | 67.3 | 9.7 |
| b | 1 | 86.3 | 67.2 | 10.4 |
| c | 2 | 86.2 | 67.2 | 11.1 |
| d | 3 | 86.2 | 67.4 | 11.8 |
| e | 4 | 86.3 | 67.3 | 13.4 |

The data indicates that a single SPDConv module (Experiment b) provides the best performance-to-computation ratio, reaching the best mAP@50 of 86.3% (otherwise matched only by the four-module variant at 13.4 GFLOPs) with just a 0.7 GFLOPs increase over using none. Adding more modules yields diminishing returns and unnecessarily inflates complexity. Therefore, we settled on one SPDConv module for the final SGLO configuration.

To position SGLO within the broader landscape, we compare it against several state-of-the-art rotating object detectors on the DIOR-R dataset. The per-class mAP@50 results and the overall mean are presented below.

| Category | Oriented_RCNN | Rotated_FasterRCNN | Rotated_FCOS | S2ANet | R3Det | Roi-Transformer | YOLOv8s-OBB | SGLO (Ours) |
|---|---|---|---|---|---|---|---|---|
| airplane | 69.4 | 70.7 | 73.6 | 7.1 | 78.7 | 90.9 | 97.3 | 97.3 |
| airport | 1.9 | 3.3 | 6.9 | 13.3 | 17.8 | 16.7 | 74.2 | 82.2 |
| baseball field | 80.0 | 80.5 | 82.7 | 82.5 | 86.4 | 90.1 | 94.5 | 94.5 |
| basketball court | 61.5 | 56.4 | 51.3 | 42.7 | 67.1 | 75.0 | 94.7 | 93.5 |
| bridge | 15.7 | 8.1 | 14.4 | 17.4 | 17.5 | 13.9 | 63.0 | 63.3 |
| chimney | 66.5 | 73.9 | 66.1 | 54.3 | 77.0 | 78.9 | 92.1 | 89.3 |
| dam | 9.3 | 11.5 | 13.3 | 12.1 | 11.3 | 9.1 | 55.3 | 65.1 |
| e-service-area | 28.3 | 22.4 | 51.4 | 47.3 | 48.0 | 48.0 | 96.3 | 96.3 |
| e-toll-station | 30.9 | 36.1 | 51.4 | 40.5 | 44.9 | 43.2 | 85.8 | 83.0 |
| harbor | 11.5 | 20.4 | 18.2 | 51.8 | 10.1 | 34.7 | 68.0 | 68.6 |
| golffield | 13.2 | 22.8 | 49.1 | 63.3 | 48.1 | 63.9 | 89.5 | 91.7 |
| ground track field | 55.9 | 55.2 | 62.2 | 18.8 | 75.9 | 35.7 | 91.1 | 92.5 |
| overpass | 35.1 | 21.2 | 38.4 | 28.1 | 39.6 | 42.0 | 74.1 | 73.7 |
| ship | 60.2 | 59.3 | 60.3 | 76.8 | 68.1 | 86.1 | 97.4 | 97.4 |
| stadium | 80.1 | 80.3 | 69.2 | 81.9 | 83.9 | 87.3 | 94.3 | 95.2 |
| storagetank | 49.6 | 50.9 | 51.6 | 52.9 | 64.6 | 60.4 | 94.1 | 95.0 |
| tenniscourt | 79.2 | 76.6 | 79.9 | 75.9 | 82.9 | 81.5 | 98.6 | 98.1 |
| trainstation | 11.3 | 16.3 | 25.3 | 24.5 | 21.9 | 12.3 | 77.6 | 83.3 |
| vehicle | 35.5 | 35.5 | 36.3 | 39.3 | 40.5 | 51.2 | 74.6 | 75.6 |
| windmill | 25.1 | 39.6 | 48.1 | 43.8 | 49.4 | 33.3 | 93.1 | 91.2 |
| mAP@50 | 41.03 | 42.05 | 46.64 | 47.12 | 51.68 | 52.71 | 85.3 | 86.3 |

SGLO achieves the highest overall mAP@50 of 86.3%, outperforming all traditional rotating detectors by a large margin (e.g., +33.6% over Roi-Transformer) and surpassing the strong YOLOv8s-OBB baseline by 1.0%. Notably, SGLO posts substantial gains on historically difficult categories such as ‘dam’ (+9.8%), ‘airport’ (+8.0%), and ‘trainstation’ (+5.7%), while also improving on small-object classes like ‘vehicle’ that are critical to China UAV drone missions. Remarkably, SGLO accomplishes this while using only about 40% of the GFLOPs of YOLOv8s-OBB, underscoring its efficiency.

To specifically validate SGLO’s prowess in small object detection, a hallmark challenge for China UAV drone imagery, we conducted experiments on the VisDrone2019 dataset. This dataset is particularly rich in small, densely packed objects like pedestrians and vehicles from an aerial perspective. We compared SGLO against several popular object detectors, both horizontal and oriented, with results summarized below.

| Method | mAP@50 (%) | mAP@95 (%) | GFLOPs |
|---|---|---|---|
| Faster-RCNN | 33.5 | 19.3 | 206.73 |
| YOLOv5s | 31.7 | 18.0 | 23.8 |
| SSD | 24.1 | 17.2 | 87.90 |
| YOLOv6s | 31.1 | 17.7 | 44.0 |
| YOLOv8s | 32.2 | 18.3 | 28.5 |
| YOLOv9s | 33.0 | 18.9 | 26.7 |
| YOLOv10s | 32.8 | 18.5 | 21.4 |
| SGLO (Ours) | 34.9 | 19.8 | 29.5 |

SGLO again achieves the best performance, with an mAP@50 of 34.9% and mAP@95 of 19.8%, representing improvements of 2.7% and 1.5%, respectively, over the YOLOv8s baseline. The computational cost is only marginally higher (29.5 vs. 28.5 GFLOPs), demonstrating that our architectural modifications deliver superior small-target detection capability without a prohibitive computational penalty. This makes SGLO an exceptionally compelling choice for processing the continuous video streams or large image batches generated by China UAV drones during mapping or inspection sorties.

The effectiveness of our design choices can be further understood through a mathematical lens. The SPDConv operation ensures that the information loss from downsampling is minimized. Consider an input feature map with a small target occupying a region \(R\). Standard strided convolution might completely skip this region if the stride is large, or blur it with neighboring pixels. SPDConv, however, ensures that every pixel in \(R\) is allocated to a specific channel in the output feature map \(X'\). The subsequent \(1 \times 1\) convolution learns to combine these channels, effectively allowing the network to “remember” the spatial structure of \(R\) even at a lower resolution. This is crucial for maintaining the integrity of small object signatures.
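This losslessness can be demonstrated directly: the space-to-depth rearrangement is exactly invertible, so a lone bright pixel standing in for a tiny target is never averaged away or skipped. A minimal NumPy sketch using the interleaved-slicing reading of SPD on a single-channel map:

```python
import numpy as np

def space_to_depth(x, scale=2):
    # (S, S) -> (S/scale, S/scale, scale^2) via interleaved slicing.
    return np.stack([x[i::scale, j::scale]
                     for i in range(scale) for j in range(scale)], axis=-1)

def depth_to_space(y, scale=2):
    # Exact inverse: scatter each channel back to its (i, j) offset.
    h, w, _ = y.shape
    x = np.zeros((h * scale, w * scale), dtype=y.dtype)
    for c in range(scale * scale):
        i, j = divmod(c, scale)
        x[i::scale, j::scale] = y[..., c]
    return x

# A single bright pixel (a tiny target) survives intact: it lands in
# exactly one output channel, and the round trip is lossless.
x = np.zeros((8, 8)); x[3, 5] = 1.0
y = space_to_depth(x)
assert y.sum() == 1.0                        # the pixel is preserved
assert np.array_equal(depth_to_space(y), x)  # perfectly invertible
```

By contrast, a stride-2 convolution samples only a quarter of the positions, so whether the pixel contributes at all depends on where it happens to fall relative to the sampling grid.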

Similarly, the GD mechanism in Multi_Gold can be viewed as an optimized form of feature pyramid fusion. Traditional FPN/PANet uses a limited set of pairwise connections. The GD mechanism creates a dense connection graph where features from all scales contribute to a shared global context pool \(F_{\text{fuse}}\), which is then adaptively injected back into each scale. This process enhances gradient flow and feature representation. The addition of the 160×160 output path explicitly models the feature response at a very high spatial frequency, which is where the signatures of the smallest targets in China UAV drone imagery reside.

The detection loss function for oriented boxes also plays a role. We employ a combination of classification loss (e.g., Varifocal loss), regression loss for the bounding box center and dimensions, and a rotation loss. For an oriented box parameterized by \((x, y, w, h, \theta)\), the regression loss often uses a smooth L1 or GIoU loss for the horizontal components of the rotated box, while the angle \(\theta\) may be regressed directly or via a periodic loss based on sine and cosine components to handle angular periodicity:

$$\mathcal{L}_{\text{angle}} = \mathcal{L}( \sin(\theta_{\text{pred}}), \sin(\theta_{\text{gt}}) ) + \mathcal{L}( \cos(\theta_{\text{pred}}), \cos(\theta_{\text{gt}}) )$$

where \(\mathcal{L}\) could be smooth L1 loss. The lightweight head LADH_OBB reduces the complexity of computing these final predictions. The depthwise separable convolution’s efficiency stems from its factorization. The number of parameters for a standard convolution kernel \(K \in \mathbb{R}^{C_{\text{in}} \times C_{\text{out}} \times K_h \times K_w}\) is \(C_{\text{in}} \cdot C_{\text{out}} \cdot K_h \cdot K_w\). For DSConv, the depthwise kernel \(D \in \mathbb{R}^{C_{\text{in}} \times 1 \times K_h \times K_w}\) has \(C_{\text{in}} \cdot K_h \cdot K_w\) parameters, and the pointwise kernel \(P \in \mathbb{R}^{C_{\text{in}} \times C_{\text{out}} \times 1 \times 1}\) has \(C_{\text{in}} \cdot C_{\text{out}}\) parameters. The total is \(C_{\text{in}} \cdot (K_h \cdot K_w + C_{\text{out}})\), which is significantly smaller when \(C_{\text{out}}\) is large. This parameter efficiency directly translates to lower GFLOPs, a key metric for China UAV drone edge devices.
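Returning to the angle loss defined above, its key property is easy to verify numerically: because it compares sine and cosine components rather than raw angles, it is invariant to full rotations. A minimal sketch with smooth L1 as the component loss:

```python
import numpy as np

def smooth_l1(a, b, beta=1.0):
    # Standard smooth L1: quadratic near zero, linear in the tails.
    d = np.abs(a - b)
    return np.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta)

def angle_loss(theta_pred, theta_gt):
    # L(sin θp, sin θg) + L(cos θp, cos θg): periodic in 2π by construction.
    return (smooth_l1(np.sin(theta_pred), np.sin(theta_gt))
            + smooth_l1(np.cos(theta_pred), np.cos(theta_gt)))

# An angle and the same angle shifted by a full turn incur (numerically)
# zero loss, whereas direct regression on θ would see an error of 2π.
assert np.isclose(angle_loss(0.3, 0.3 + 2 * np.pi), 0.0)
assert angle_loss(0.3, 1.5) > 0.0
```

This removes the discontinuity that direct angle regression suffers at the boundary of the angle range, which would otherwise produce large gradients for boxes that are nearly identical.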

In conclusion, the SGLO network represents a significant step forward in rotating target detection for remote sensing imagery, particularly from China UAV drone platforms. By innovatively integrating an information-preserving downsampling module (SPDConv), a powerful and extended multi-scale fusion neck (Multi_Gold), and a computationally frugal detection head (LADH_OBB), we have created a model that excels in detecting small, rotated objects amidst complex backgrounds. Our extensive experimental validation on DIOR-R and VisDrone2019 datasets confirms SGLO’s superiority over a range of state-of-the-art methods, achieving the highest mAP scores while maintaining a favorable computational profile. The advancements embodied in SGLO have the potential to enhance the autonomy and intelligence of China UAV drone systems across numerous civilian and security applications, from precision agriculture and infrastructure inspection to border surveillance and disaster response. Future work may focus on further optimizing the network for specific hardware accelerators common in drone systems and exploring self-supervised learning techniques to reduce reliance on large annotated datasets, which are often scarce for specialized China UAV drone operations.
