In recent years, with the widespread deployment of unmanned autonomous platforms such as drones and unmanned vehicles, object detection in unmanned aerial vehicle (UAV) images has emerged as a critical research focus in computer vision. The unique characteristics of UAV-captured imagery, including variable target scales, complex backgrounds, and small object sizes, pose significant challenges for developing lightweight, high-performance detection algorithms that balance accuracy and inference speed. In China, the rapid adoption of UAV technology across sectors like defense, agriculture, traffic monitoring, and disaster response has intensified the demand for robust detection methods tailored to these scenarios. This paper presents FE-YOLO, a feature enhancement-based object detection method designed specifically for UAV images, aiming to achieve accurate and rapid detection in complex environments.
The core innovation lies in addressing two key issues: enhancing multi-scale feature extraction to preserve target details, and reducing redundancy in the detection head to improve efficiency. Traditional methods often struggle with small objects in UAV imagery because of information loss during downsampling and background clutter. By introducing two new modules, FE-YOLO mitigates these problems, offering a practical solution for real-time applications in China’s diverse UAV operations.

Object detection algorithms have evolved from traditional handcrafted feature approaches to deep learning-based methods, with one-stage detectors like YOLO (You Only Look Once) gaining popularity for their speed-accuracy trade-off. However, UAV images exacerbate challenges such as scale variation and occlusion, necessitating specialized enhancements. Previous works have explored context aggregation, attention mechanisms, and transformer-convolution hybrids, but often at the cost of increased complexity. This paper builds on YOLOv8 as a baseline, proposing modifications that boost performance without substantial overhead, making it suitable for resource-constrained UAV systems in China.
The contributions of this work are threefold. First, a multi-scale feature extraction module, MFConv, is designed to capture shallow detail and deep semantic features concurrently, reducing background interference. Second, a residual bottleneck detection head, RBDetect, replaces stacked convolutions to alleviate gradient vanishing and parameter redundancy. Third, extensive experiments on the VisDrone dataset, which includes scenes from China, show that FE-YOLO raises \( mAP_{0.5} \) from 40.7% to 41.5% over YOLOv8s while reducing parameters by 1.6% and GFLOPs by 3.5%. These advancements underscore the potential for deploying efficient detection in China’s UAV ecosystems, from urban surveillance to agricultural monitoring.
Related Work
Deep learning-based object detection can be categorized into two-stage and one-stage methods. Two-stage detectors, such as Faster R-CNN, generate region proposals before classification, offering high accuracy but slower speed. In contrast, one-stage detectors like YOLO and SSD directly predict bounding boxes and class probabilities, favoring real-time performance. For UAV applications, the latter is often preferred due to the need for quick response in dynamic environments. Recent variants of YOLO, including YOLOv5 and YOLOv8, have improved feature pyramids and decoupled heads, yet challenges persist in handling small objects and complex backgrounds.
In the context of UAV imagery, researchers have proposed various enhancements. For instance, contextual information integration helps capture multi-scale features, while attention mechanisms focus on relevant regions. Transformer-based modules, like Swin Transformer, enhance global context but increase computational load. Additionally, lightweight designs using depthwise separable convolutions or bottleneck structures aim to maintain efficiency. However, many approaches overlook the synergy between feature extraction and detection head optimization. This paper revisits these aspects, proposing a holistic improvement for UAV-specific detection, particularly relevant to China’s expanding drone usage in diverse terrains.
Methodology: FE-YOLO Architecture
The FE-YOLO framework modifies the YOLOv8s baseline by incorporating two key components: the MFConv module in the backbone and the RBDetect head. The overall structure retains the standard pipeline of backbone, neck, and head, but with enhanced feature representation and computational efficiency. The backbone extracts hierarchical features, the neck fuses multi-scale information, and the head generates predictions. By refining these stages, FE-YOLO better adapts to UAV image characteristics, such as small targets and cluttered scenes common in China’s urban and rural drone surveys.
Multi-Scale Feature Extraction Module (MFConv)
The MFConv module replaces the standard C2f block in the backbone to improve feature extraction across scales. It processes input features through multiple branches with different receptive fields, inspired by spatial pyramid pooling. Specifically, given an input feature map \( X \), MFConv applies depthwise separable convolutions with kernel sizes of \( 1 \times 1 \), \( 3 \times 3 \), and \( 5 \times 5 \), denoted as \( F_n \) for kernel size \( n \). The outputs are combined via concatenation and element-wise addition to preserve fine-grained details and contextual information.
The process can be summarized mathematically. Let \( DSConv_{n \times n}(X) \) represent a depthwise separable convolution with kernel size \( n \times n \) applied to \( X \). Then, the feature maps are computed as:
$$ F_1 = DSConv_{1 \times 1}(X), $$
$$ F_3 = DSConv_{3 \times 3}(X), $$
$$ F_5 = DSConv_{5 \times 5}(X). $$
These maps are concatenated in two stages: first, \( \text{Cat}(F_1, F_3) \) and \( \text{Cat}(F_3, F_5) \), where \( \text{Cat} \) denotes concatenation along the channel dimension. The results are further concatenated and added to the original input. Because the addition is element-wise, the concatenated map must have the same number of channels as \( X \):
$$ F_{\text{out}} = X + \text{Cat}(\text{Cat}(F_1, F_3), \text{Cat}(F_3, F_5)). $$
This design integrates features from several receptive fields, enhancing the model’s ability to detect objects of different sizes. This is crucial for UAV images, where targets range from pedestrians to vehicles. By retaining more details, MFConv reduces information loss in shallow layers, improving detection accuracy in complex backgrounds typical of China’s drone-captured scenes.
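The channel arithmetic behind the concatenation and residual addition can be made concrete with a short sketch. It is not the authors' implementation. It assumes that each branch outputs one quarter of the input channels, so that the four concatenated maps add back up to the width of \( X \), and the function names are hypothetical. It also compares the weight count of depthwise separable branches with that of standard convolutions, ignoring bias and normalization parameters.

```python
def dsconv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise separable convolution (no bias, no normalization).

    Depthwise step: one k x k filter per input channel (c_in * k * k weights).
    Pointwise step: a 1 x 1 convolution mixing channels (c_in * c_out weights).
    """
    return c_in * k * k + c_in * c_out


def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias, no normalization)."""
    return c_in * c_out * k * k


def mfconv_bookkeeping(channels: int, kernel_sizes=(1, 3, 5)) -> dict:
    """Trace channel counts through Cat(Cat(F1, F3), Cat(F3, F5)).

    Assumption (not stated in the text): each branch outputs channels // 4
    channels, so the concatenation matches the input width.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4 under this assumption")
    branch_out = channels // 4
    # F3 appears in both inner concatenations, so four maps are stacked in total.
    concat_channels = 4 * branch_out
    separable = sum(dsconv_params(channels, branch_out, k) for k in kernel_sizes)
    standard = sum(conv_params(channels, branch_out, k) for k in kernel_sizes)
    return {
        "branch_out": branch_out,
        "concat_channels": concat_channels,
        "residual_add_valid": concat_channels == channels,
        "separable_params": separable,
        "standard_params": standard,
    }


print(mfconv_bookkeeping(64))
```

Under this assumption a 64-channel input yields three 16-channel branches. Other choices that satisfy the same shape constraint, such as a projection after the concatenation, are equally compatible with the formula.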
Residual Bottleneck Detection Head (RBDetect)
The detection head in YOLOv8 uses a decoupled structure with stacked convolutions, which may lead to gradient vanishing and parameter redundancy. To address this, RBDetect introduces a residual bottleneck block that simplifies the architecture while maintaining feature richness. Given an input feature map \( X \), RBDetect first applies a standard convolution module (CBS) with a \( 3 \times 3 \) kernel, followed by a residual bottleneck block. This block consists of three convolutions: a \( 1 \times 1 \) convolution for channel reduction, a \( 3 \times 3 \) convolution for spatial feature extraction, and another \( 1 \times 1 \) convolution for channel expansion, with a skip connection adding the input to the output.
Mathematically, the residual bottleneck output \( F_{\text{out1}} \) is computed as:
$$ F_{\text{out1}} = X + \text{Conv}_{1 \times 1}(\text{Conv}_{3 \times 3}(\text{Conv}_{1 \times 1}(X))), $$
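where \( \text{Conv}_{k \times k} \) denotes a standard convolution with kernel size \( k \times k \) and \( X \) denotes the input to the bottleneck block. Because the bottleneck operates on the output of the CBS module, write \( \text{RB}(\cdot) \) for the mapping \( X \mapsto F_{\text{out1}} \) defined above. For a head input \( X \), the final output \( F_{\text{out}} \) is then obtained through a 2D convolution:
$$ F_{\text{out}} = \text{Conv2d}(\text{RB}(\text{CBS}_{3 \times 3}(X))). $$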
This design reduces the number of parameters and GFLOPs compared to the original head, while the residual connection mitigates gradient issues, facilitating deeper feature extraction. For UAV applications, RBDetect helps retain subtle target characteristics, lowering miss rates in crowded or occluded scenarios often encountered in China’s aerial imagery.
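A rough weight count illustrates why a bottleneck can be cheaper than stacked convolutions. The sketch below is illustrative only and rests on two assumptions not stated in the text: the original branch consists of two stacked \( 3 \times 3 \) convolutions that preserve the channel count, and the bottleneck halves the channels in its hidden layers. The final Conv2d is common to both designs and is omitted, as are bias and normalization parameters. All function names are hypothetical.

```python
def conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias, no normalization)."""
    return c_in * c_out * k * k


def stacked_branch_weights(channels: int, depth: int = 2) -> int:
    """Weights in `depth` stacked 3x3 convolutions that keep the channel count."""
    return depth * conv_weights(channels, channels, 3)


def rb_branch_weights(channels: int, reduction: float = 0.5) -> int:
    """Weights in a 3x3 CBS module followed by a 1x1 -> 3x3 -> 1x1 bottleneck.

    `reduction` is an assumed hidden-channel ratio, not a value from the paper.
    """
    hidden = int(channels * reduction)
    cbs = conv_weights(channels, channels, 3)    # leading 3x3 CBS module
    reduce = conv_weights(channels, hidden, 1)   # 1x1 channel reduction
    spatial = conv_weights(hidden, hidden, 3)    # 3x3 spatial extraction
    expand = conv_weights(hidden, channels, 1)   # 1x1 channel expansion
    return cbs + reduce + spatial + expand


stacked = stacked_branch_weights(64)
bottleneck = rb_branch_weights(64)
print(stacked, bottleneck, round(bottleneck / stacked, 3))
```

The ratio depends entirely on the assumed reduction factor and branch layout, so it should not be read as a reproduction of the parameter counts reported for FE-YOLO.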
Experimental Design and Setup
To validate FE-YOLO, experiments were conducted on the VisDrone dataset, a benchmark collected from multiple cities in China that represents diverse UAV environments. The dataset includes 6,471 training images, 548 validation images, and 3,190 test images, with annotations for ten object classes such as pedestrians, cars, and bicycles. Each image averages 53 objects in training and 71 in testing, featuring scale variations and occlusions that challenge detection algorithms.
The evaluation metrics include precision (\( P \)), recall (\( R \)), average precision (\( AP \)), and mean average precision (\( mAP \)), calculated as follows:
$$ P = \frac{TP}{TP + FP}, $$
$$ R = \frac{TP}{TP + FN}, $$
$$ AP = \int_0^1 P(R) \, dR, $$
$$ mAP = \frac{1}{n} \sum_{i=1}^n AP_i, $$
where \( TP \), \( FP \), and \( FN \) denote true positives, false positives, and false negatives, respectively, and \( n \) is the number of classes. The primary metrics are \( mAP_{0.5} \) (IoU threshold of 0.5) and \( mAP_{0.5:0.95} \) (average over IoU thresholds from 0.5 to 0.95 in steps of 0.05).
Training was performed on an Ubuntu system with an NVIDIA RTX 3090 GPU using the PyTorch framework. Hyperparameters were an initial learning rate of 0.01, the SGD optimizer with momentum 0.937, a batch size of 8, a weight decay of 0.0005, and 300 training epochs. These settings are intended to keep the comparison with baseline models such as YOLOv8s fair.
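The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the experiments. It assumes the detections of one class have already been matched to ground truth at a fixed IoU threshold and sorted by descending confidence, and it approximates the integral with the common all-point interpolation of the precision-recall curve. Function names are hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from raw counts; an empty denominator yields 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(is_tp: list, num_gt: int) -> float:
    """Area under the precision-recall curve for one class.

    is_tp  -- True/False per detection, sorted by descending confidence
    num_gt -- number of ground-truth objects of this class
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: list) -> float:
    """Unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision_recall(8, 2, 4))
print(average_precision([True, False, True], num_gt=2))
```

Averaging `mean_average_precision` over several IoU thresholds would give a quantity of the \( mAP_{0.5:0.95} \) type. Benchmark toolkits may differ in interpolation details.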
Results and Analysis
The performance of FE-YOLO is compared against several state-of-the-art methods on the VisDrone dataset, as shown in Table 1. FE-YOLO achieves the highest \( mAP_{0.5} \) and \( mAP_{0.5:0.95} \) values among the compared methods (41.5% and 24.4%), demonstrating its effectiveness for UAV image detection. The improvement over YOLOv8s, 0.8 and 0.4 percentage points respectively, highlights the benefits of the proposed modules.
| Method | \( mAP_{0.5} \) (%) | \( mAP_{0.5:0.95} \) (%) |
|---|---|---|
| Faster R-CNN | 40.0 | 21.5 |
| RetinaNet | 35.9 | 19.4 |
| YOLOv5l | 36.2 | 20.1 |
| YOLOv8s | 40.7 | 24.0 |
| FE-YOLO | 41.5 | 24.4 |
Ablation studies further dissect the contributions of MFConv and RBDetect, as presented in Table 2. Adding MFConv alone raises precision (50.7% to 51.8%) and \( mAP_{0.5} \) (40.7% to 41.3%) with nearly unchanged recall, at the cost of a slight increase in parameters and GFLOPs. RBDetect alone reduces model complexity while improving all three accuracy metrics. The combined use of both modules yields the best balance, with a gain of 0.8 percentage points in \( mAP_{0.5} \) and relative reductions of 1.6% in parameters and 3.5% in GFLOPs over YOLOv8s.
| MFConv | RBDetect | Precision (%) | Recall (%) | \( mAP_{0.5} \) (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| No | No | 50.7 | 40.3 | 40.7 | 11.16 | 28.6 |
| Yes | No | 51.8 | 40.4 | 41.3 | 11.33 | 29.6 |
| No | Yes | 51.6 | 40.9 | 40.9 | 10.8 | 26.7 |
| Yes | Yes | 52.1 | 40.4 | 41.5 | 10.98 | 27.6 |
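The headline figures can be re-derived from the first and last rows of Table 2. The short check below makes the arithmetic explicit. The accuracy gain is an absolute difference in percentage points, whereas the complexity reductions are relative to the baseline. Variable and function names are illustrative.

```python
# First and last rows of Table 2 (YOLOv8s baseline and full FE-YOLO).
BASELINE = {"map50": 40.7, "params_m": 11.16, "gflops": 28.6}
FE_YOLO = {"map50": 41.5, "params_m": 10.98, "gflops": 27.6}


def relative_reduction(baseline: float, value: float) -> float:
    """Reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


map_gain_points = FE_YOLO["map50"] - BASELINE["map50"]
params_reduction = relative_reduction(BASELINE["params_m"], FE_YOLO["params_m"])
gflops_reduction = relative_reduction(BASELINE["gflops"], FE_YOLO["gflops"])

print(round(map_gain_points, 1))   # absolute gain in percentage points
print(round(params_reduction, 1))  # relative reduction in percent
print(round(gflops_reduction, 1))  # relative reduction in percent
```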
Visual results illustrate FE-YOLO’s advantage in detecting small and occluded objects. For instance, in crowded urban scenes from China’s UAV footage, FE-YOLO reduces miss rates for pedestrians and vehicles compared to YOLOv8s, even under varying lighting and occlusion conditions. The enhanced feature extraction allows better discrimination of targets from complex backgrounds, a common issue in aerial imagery. These findings support FE-YOLO’s practicality for real-world drone applications in China, where accuracy and speed are paramount for tasks like traffic monitoring or disaster assessment.
The mathematical formulation of FE-YOLO’s modules underscores its design rationale. MFConv’s multi-scale processing can be expressed as a function of input \( X \):
$$ \text{MFConv}(X) = X + \sum_{n \in \{1,3,5\}} w_n \cdot DSConv_{n \times n}(X), $$
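where \( w_n \) are learned weights for feature fusion. This weighted sum is a simplified abstraction: the module as defined in the methodology fuses the branches by channel-wise concatenation rather than by a weighted sum. Similarly, RBDetect’s residual learning helps optimize the loss function \( \mathcal{L} \) during training: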
$$ \mathcal{L} = \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{cls}} + \lambda \cdot \mathcal{L}_{\text{reg}}, $$
with the residual connection reducing gradient decay, as shown by the partial derivative:
$$ \frac{\partial \mathcal{L}}{\partial X} \approx \frac{\partial \mathcal{L}}{\partial F_{\text{out1}}} \cdot \left(1 + \frac{\partial \text{Conv}_{1 \times 1}(\text{Conv}_{3 \times 3}(\text{Conv}_{1 \times 1}(X)))}{\partial X}\right). $$
Because of the identity term, the gradient reaching \( X \) does not depend solely on the convolutional path. This is consistent with FE-YOLO’s stable training and improved detection, especially for small objects in UAV images from China.
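The effect of the identity term can be illustrated with a toy scalar model. It is a deliberate simplification, not an analysis of the actual network: each layer is reduced to a single local derivative, and the gradient through a chain of layers is the product of the per-layer factors. The value 0.1 and the function names are arbitrary choices for illustration.

```python
def plain_chain_gradient(local_grads: list) -> float:
    """Gradient through stacked layers y = f(x): product of local derivatives."""
    grad = 1.0
    for d in local_grads:
        grad *= d
    return grad


def residual_chain_gradient(local_grads: list) -> float:
    """Gradient through residual layers y = x + f(x): each factor is 1 + f'(x)."""
    grad = 1.0
    for d in local_grads:
        grad *= 1.0 + d
    return grad


small_grads = [0.1] * 10  # ten layers whose local derivative is small
print(plain_chain_gradient(small_grads))     # shrinks toward zero
print(residual_chain_gradient(small_grads))  # stays away from zero
```

With small local derivatives the plain product collapses toward zero, while the residual product does not, because every factor contains the identity contribution. The sketch says nothing about gradient growth, which real networks control through normalization and initialization.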
Conclusion and Future Work
This paper presented FE-YOLO, a feature enhancement-based object detection method for UAV images, designed to address challenges like scale variation and background clutter. By integrating the MFConv module for multi-scale feature extraction and the RBDetect head for efficient prediction, FE-YOLO achieves higher accuracy and lower complexity on the VisDrone dataset, which includes diverse scenes from China. The results demonstrate its potential for real-time drone applications, where balancing performance and resource usage is critical.
However, limitations remain, such as occasional misses in highly complex environments. Future work will explore advanced feature fusion techniques in the neck module, such as attention-based fusion or dynamic routing, to further suppress background noise and enhance target representations. Additionally, adapting FE-YOLO to other UAV datasets from China, with varying weather conditions or altitudes, could improve generalization. Ultimately, this research contributes to the growing body of work on UAV-based computer vision, supporting China’s initiatives in smart cities, precision agriculture, and public safety through efficient drone technology.
In summary, FE-YOLO offers a lightweight yet powerful solution for UAV image object detection, with proven benefits in accuracy and efficiency. As drone usage expands globally, particularly in China with its vast aerial monitoring needs, methods like FE-YOLO will play a key role in enabling reliable autonomous systems. The integration of feature enhancement principles sets a foundation for future innovations in this dynamic field.
