In computer vision, target detection in remote sensing images has emerged as a critical research area, particularly with the advancement of Unmanned Aerial Vehicle (UAV) technology. The proliferation of high-resolution imagery from satellites and UAV systems has unlocked immense potential across applications. However, detecting small targets in such imagery remains a formidable challenge: their limited pixel coverage obscures geometric, texture, orientation, and shape features, degrading detection accuracy, especially for rotated objects against complex backgrounds. Traditional deep learning approaches, including two-stage methods such as Faster R-CNN and Mask R-CNN as well as single-stage detectors such as SSD and the YOLO series, have made strides but struggle with the dual tasks of small-target and rotation-aware detection. Transformer networks have further propelled rotated object detection, yet balancing detection performance, speed, and computational efficiency remains elusive. To address these issues, we propose SGLO, a network for small-scale rotating target detection in UAV imagery that combines SPDConv, a Multi_Gold neck structure, and a lightweight detection head to achieve superior results.

The core of our SGLO network builds upon the YOLOv8-OBB baseline, incorporating several key innovations to tackle the inherent difficulties of UAV-based remote sensing. First, we integrate the SPDConv module into the backbone in place of conventional CBS modules, enhancing feature extraction across scales. SPDConv slices the input feature map into sub-feature maps via space-to-depth sampling and then applies a non-strided convolution, preserving the fine-grained details crucial for small targets. For an input feature map $X$ of size $S \times S$, the SPD layer extracts sub-feature maps $f_{i,j}$ with scale factor $\text{scale}$, defined by:
$$f_{i,j} = X\left[\, i : S : \text{scale},\; j : S : \text{scale} \,\right]$$
where $i, j = 0, 1, \ldots, \text{scale}-1$, so that $f_{i,j}$ gathers every $\text{scale}$-th pixel starting at offset $(i, j)$. The sub-feature maps are concatenated along the channel dimension, and a non-strided convolution then mixes the expanded channels to produce the output $Y$, downsampling spatially without discarding fine-grained information or adding significant computational overhead. This is vital for handling the diverse target sizes encountered in UAV imagery.
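To make the operation concrete, the following PyTorch sketch implements the space-to-depth slicing and the subsequent non-strided convolution; the $3 \times 3$ kernel and the channel widths are illustrative assumptions, not necessarily SGLO's exact configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth slicing followed by a non-strided convolution.

    Assumes the input height and width are divisible by `scale`.
    """

    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Non-strided convolution over the channel-expanded map.
        self.conv = nn.Conv2d(in_channels * scale ** 2, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # f_{i,j} = X[i::scale, j::scale]: every scale-th pixel at offset (i, j).
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        # Concatenate along channels, then mix with the non-strided conv.
        return self.conv(torch.cat(subs, dim=1))
```

With scale = 2, a $C \times S \times S$ map becomes $4C \times \frac{S}{2} \times \frac{S}{2}$ before the convolution, so spatial downsampling happens without discarding any pixel values.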
Second, we redesign the neck with a Multi_Gold module, which employs a Gather-and-Distribute (GD) mechanism to fuse multi-scale features more effectively. The mechanism consists of LOW-GD and HIGH-GD branches, each comprising a Feature Alignment Module (FAM), an Information Fusion Module (IFM), and an Information Injection Module (Inject). The FAM aligns features from different layers to a common resolution, downsampling high-resolution maps by average pooling and upsampling low-resolution maps by bilinear interpolation, to ensure spatial consistency. Mathematically, for input features $B_2, B_3, B_4, B_5$, the alignment is expressed as:
$$F_{\text{align}} = \text{LOW}_{\text{FAM}}([B_2, B_3, B_4, B_5])$$
The IFM then fuses these aligned features using RepBlock layers, splitting them into components for injection:
$$F_{\text{fuse}} = \text{RepBlock}(F_{\text{align}})$$
$$F_{\text{Inject\_P3}}, F_{\text{Inject\_P4}} = \text{Split}(F_{\text{fuse}})$$
The Inject module then integrates this global information into the local features at each level, enhancing their expressiveness for small targets. Additionally, we introduce a small-scale feature output (e.g., $160 \times 160$) in Multi_Gold, derived via upsampling and concatenation, to capture the low-level, high-resolution features on which the very small targets prevalent in UAV data depend. This multi-scale design significantly boosts detection of diminutive objects.
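The sketch below traces the LOW-GD data flow under simplifying assumptions: a plain convolution stands in for the RepBlock, features are aligned to the resolution of the second input level, and all channel widths are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowGD(nn.Module):
    """Gather-and-distribute, low stage: FAM -> IFM -> split for injection."""

    def __init__(self, in_channels: list, fused_channels: int = 256):
        super().__init__()
        # Stand-in for the RepBlock fusion layers (IFM).
        self.fuse = nn.Conv2d(sum(in_channels), 2 * fused_channels,
                              kernel_size=3, padding=1)

    def forward(self, feats):
        # FAM: align B2..B5 to one shared resolution (here, the B3 grid),
        # pooling larger maps down and interpolating smaller maps up.
        target = feats[1].shape[-2:]
        aligned = [
            F.adaptive_avg_pool2d(f, target) if f.shape[-2] > target[0]
            else F.interpolate(f, size=target, mode="bilinear",
                               align_corners=False)
            for f in feats
        ]
        # IFM: fuse the aligned stack, then split into per-level injections.
        fused = self.fuse(torch.cat(aligned, dim=1))
        inject_p3, inject_p4 = fused.chunk(2, dim=1)
        return inject_p3, inject_p4
```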
Third, we develop a lightweight rotating detection head, LADH_OBB, based on the LADH framework. This head replaces complex CBS modules with Depthwise Separable Convolutions (DSConv) and standard convolutions, reducing computational load while maintaining accuracy. The DSConv operation can be represented as:
$$\text{DSConv}(X) = \text{PointwiseConv}(\text{DepthwiseConv}(X))$$
where DepthwiseConv applies a single spatial filter per input channel and PointwiseConv ($1 \times 1$) recombines the channels. This design cuts GFLOPs substantially, making the network efficient enough for real-time use on UAV platforms. Heatmap comparisons on datasets such as DIOR-R show that LADH_OBB focuses more precisely on objects, reducing false detections and improving robustness.
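A minimal PyTorch rendering of the DSConv factorization, assuming a $3 \times 3$ depthwise kernel:

```python
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: PointwiseConv(DepthwiseConv(X))."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Depthwise: one 3x3 spatial filter per input channel (groups = channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels)
        # Pointwise: 1x1 convolution recombines channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

Relative to a standard $k \times k$ convolution, this factorization cuts multiply-accumulate operations by roughly a factor of $\frac{1}{C_{\text{out}}} + \frac{1}{k^2}$, which is the main source of the head's GFLOPs savings.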
To validate our approach, we conducted extensive experiments on the DIOR-R and VisDrone2019 datasets, standard benchmarks for remote sensing and UAV imagery. DIOR-R contains 23,463 images across 20 classes with widely varying object resolutions, while VisDrone2019 includes 8,629 images dominated by small targets such as pedestrians and vehicles. Our experimental setup used Ubuntu 20.04, PyTorch 2.0.0, CUDA 11.8, and an NVIDIA RTX 3090 GPU. We trained all models for 300 epochs with stochastic gradient descent, an initial learning rate of 0.01, and a batch size of 8, without pre-trained weights, to ensure a fair comparison.
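As a reproducibility aid, the baseline regime above maps onto the Ultralytics training API roughly as follows; the dataset YAML name is a placeholder, and SGLO's custom modules are not assumed to be part of the stock package.

```python
from ultralytics import YOLO

# Build the YOLOv8n-OBB baseline from its config (no pre-trained weights).
model = YOLO("yolov8n-obb.yaml")

model.train(
    data="DIOR-R.yaml",   # placeholder dataset config
    epochs=300,
    batch=8,
    lr0=0.01,             # initial learning rate
    optimizer="SGD",
    pretrained=False,
)
```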
Evaluation metrics included precision (P), recall (R), average precision (AP), mean average precision (mAP), and computational cost in GFLOPs. Precision and recall are defined as:
$$P = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
$$R = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. AP is computed as the integral under the precision-recall curve:
$$\text{AP} = \int_0^1 P(R) \, dR$$
and mAP averages AP across all classes:
$$\text{mAP} = \frac{1}{n} \sum_{i=1}^n \text{AP}_i$$
Together with GFLOPs, these metrics capture the balance between accuracy and efficiency that UAV applications demand.
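For concreteness, the sketch below computes AP as the area under a precision-recall curve and averages it into mAP; it assumes per-threshold precision and recall arrays sorted by increasing recall, which is one common way to discretize the integral above.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP = integral of P(R) dR, approximated by trapezoidal integration."""
    # Extend the curve to span recall 0 -> 1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Enforce a monotonically non-increasing precision envelope.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.trapz(p, r))

def mean_average_precision(per_class_ap) -> float:
    """mAP: the mean of AP over all n classes."""
    return float(np.mean(per_class_ap))
```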
We performed ablation studies on the DIOR-R dataset to isolate the contribution of each module. Starting from the YOLOv8n-OBB baseline, we incrementally added Multi_Gold (A), SPDConv (B), and LADH_OBB (C). The results, summarized in the table below, show that combining all three modules delivers nearly the best accuracy at a markedly lower computational cost than the strongest two-module variant.
| Experiments | A: Multi_Gold | B: SPDConv | C: LADH_OBB | P (%) | R (%) | mAP@50 (%) | mAP@95 (%) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Baseline | | | | 86.4 | 78.6 | 83.5 | 65.4 | 8.9 |
| Baseline+A | √ | | | 87.7 | 80.2 | 85.8 | 66.6 | 14.2 |
| Baseline+B | | √ | | 87.0 | 78.9 | 84.1 | 65.9 | 9.7 |
| Baseline+C | | | √ | 87.3 | 77.1 | 83.0 | 64.5 | 5.7 |
| Baseline+A+B | √ | √ | | 87.3 | 81.4 | 86.4 | 67.5 | 15.5 |
| Baseline+A+C | √ | | √ | 86.3 | 81.1 | 85.6 | 67.3 | 9.7 |
| Baseline+B+C | | √ | √ | 87.2 | 78.0 | 84.0 | 65.5 | 6.6 |
| Baseline+A+B+C (SGLO) | √ | √ | √ | 86.9 | 81.3 | 86.3 | 67.2 | 10.4 |
As shown, SGLO achieves an mAP@50 of 86.3% and an mAP@95 of 67.2% on DIOR-R, improvements of 2.8 and 1.8 percentage points over the baseline, respectively, at a cost of only 1.5 additional GFLOPs. This highlights the effectiveness of our modules in enhancing detection while controlling computational load. A further ablation on the number of SPDConv modules showed that a single instance strikes the best balance; adding more improves accuracy only marginally at a higher cost.
Comparative experiments on DIOR-R against state-of-the-art methods, including Oriented_RCNN, Rotated_FasterRCNN, Rotated_FCOS, S2ANet, R3Det, and RoI-Transformer, underscore SGLO's advantages. The table below details per-class mAP@50 results, where SGLO outperforms the others in categories such as airplane, dam, and vehicle, which are common in UAV scenarios.
| Class | Oriented_RCNN (%) | Rotated_FasterRCNN (%) | Rotated_FCOS (%) | S2ANet (%) | R3Det (%) | RoI-Transformer (%) | YOLOv8s-OBB (%) | SGLO (Ours) (%) |
|---|---|---|---|---|---|---|---|---|
| airplane | 69.4 | 70.7 | 73.6 | 7.1 | 78.7 | 90.9 | 97.3 | 97.3 |
| airport | 1.9 | 3.3 | 6.9 | 13.3 | 17.8 | 16.7 | 74.2 | 82.2 |
| baseball field | 80.0 | 80.5 | 82.7 | 82.5 | 86.4 | 90.1 | 94.5 | 94.5 |
| basketball court | 61.5 | 56.4 | 51.3 | 42.7 | 67.1 | 75.0 | 94.7 | 93.5 |
| bridge | 15.7 | 8.1 | 14.4 | 17.4 | 17.5 | 13.9 | 63.0 | 63.3 |
| chimney | 66.5 | 73.9 | 66.1 | 54.3 | 77.0 | 78.9 | 92.1 | 89.3 |
| dam | 9.3 | 11.5 | 13.3 | 12.1 | 11.3 | 9.1 | 55.3 | 65.1 |
| expressway service area | 28.3 | 22.4 | 51.4 | 47.3 | 48.0 | 48.0 | 96.3 | 96.3 |
| expressway toll station | 30.9 | 36.1 | 51.4 | 40.5 | 44.9 | 43.2 | 85.8 | 83.0 |
| harbor | 11.5 | 20.4 | 18.2 | 51.8 | 10.1 | 34.7 | 68.0 | 68.6 |
| golf field | 13.2 | 22.8 | 49.1 | 63.3 | 48.1 | 63.9 | 89.5 | 91.7 |
| ground track field | 55.9 | 55.2 | 62.2 | 18.8 | 75.9 | 35.7 | 91.1 | 92.5 |
| overpass | 35.1 | 21.2 | 38.4 | 28.1 | 39.6 | 42.0 | 74.1 | 73.7 |
| ship | 60.2 | 59.3 | 60.3 | 76.8 | 68.1 | 86.1 | 97.4 | 97.4 |
| stadium | 80.1 | 80.3 | 69.2 | 81.9 | 83.9 | 87.3 | 94.3 | 95.2 |
| storage tank | 49.6 | 50.9 | 51.6 | 52.9 | 64.6 | 60.4 | 94.1 | 95.0 |
| tennis court | 79.2 | 76.6 | 79.9 | 75.9 | 82.9 | 81.5 | 98.6 | 98.1 |
| train station | 11.3 | 16.3 | 25.3 | 24.5 | 21.9 | 12.3 | 77.6 | 83.3 |
| vehicle | 35.5 | 35.5 | 36.3 | 39.3 | 40.5 | 51.2 | 74.6 | 75.6 |
| windmill | 25.1 | 39.6 | 48.1 | 43.8 | 49.4 | 33.3 | 93.1 | 91.2 |
| mAP@50 | 41.03 | 42.05 | 46.64 | 47.12 | 51.68 | 52.71 | 85.3 | 86.3 |
SGLO achieves an overall mAP@50 of 86.3%, surpassing YOLOv8s-OBB by 1.0 percentage point while using only about 40% of its computation. On VisDrone2019, which emphasizes small targets, SGLO attains an mAP@50 of 34.9% and an mAP@95 of 19.8%, gains of 2.7 and 1.5 percentage points over YOLOv8s, respectively, at a minimal GFLOPs increase. The comparative table below sets these results against other methods, demonstrating SGLO's efficacy in UAV contexts.
| Method | mAP@50 (%) | mAP@95 (%) | GFLOPs |
|---|---|---|---|
| Faster-RCNN | 33.5 | 19.3 | 206.73 |
| YOLOv5s | 31.7 | 18.0 | 23.8 |
| SSD | 24.1 | 17.2 | 87.90 |
| YOLOv6s | 31.1 | 17.7 | 44.0 |
| YOLOv8s | 32.2 | 18.3 | 28.5 |
| YOLOv9s | 33.0 | 18.9 | 26.7 |
| YOLOv10s | 32.8 | 18.5 | 21.4 |
| SGLO (Ours) | 34.9 | 19.8 | 29.5 |
Visual analysis on VisDrone2019 and DIOR-R confirms SGLO's ability to detect smaller and more distant objects, even in challenging conditions such as low light. In crowded scenes, for instance, SGLO identifies pedestrians and vehicles that other models miss, and its rotated bounding boxes fit objects more tightly, mitigating occlusion issues. This is particularly valuable for UAV operations, where accuracy in cluttered environments is paramount.
In conclusion, our SGLO network addresses the dual challenges of small-target and rotation-aware detection in UAV remote sensing imagery through the SPDConv, Multi_Gold, and LADH_OBB modules. By improving feature extraction, multi-scale fusion, and computational efficiency, SGLO achieves state-of-the-art performance on benchmark datasets while remaining practical for real-world applications. Future work will focus on optimizing the network for edge deployment on UAV systems, exploring dynamic scale adaptation, and integrating temporal information for video-based detection. The advances presented here underscore the potential of deep learning to push UAV technology toward more reliable and efficient aerial surveillance and monitoring.
