Enhanced UAV Drone Aerial Image Object Detection with Optimized YOLOv11n

In recent years, Unmanned Aerial Vehicle (UAV) technology has garnered widespread application in the field of object detection due to its broad field of view and high-speed flight capabilities. However, UAV-based object detection faces numerous challenges, such as the small size of targets and susceptibility to interference from complex backgrounds, which significantly constrain the performance of detection algorithms. Consequently, improving the detection accuracy for small targets has become a focal point of research.

In the domain of object detection, deep learning methods have significantly surpassed traditional approaches and become the mainstream. Current object detection algorithms can be broadly categorized into two types: two-stage and one-stage detectors. Representative two-stage algorithms include R-CNN, Fast R-CNN, and Faster R-CNN. These first generate region proposals and then use a detection network to screen and refine them, including position adjustment, which ensures relatively high detection accuracy. However, the complexity of this pipeline also leads to relatively slow processing, limiting performance in scenarios with strict real-time demands such as UAV target detection. In contrast, one-stage detection algorithms, notably the YOLO series, SSD, and RetinaNet, offer distinct advantages. They condense the detection pipeline into a single compact neural network, eliminating the intermediate region-proposal step and greatly reducing computational load, which substantially increases detection speed while maintaining competitive accuracy. Consequently, one-stage detectors such as the YOLO series, with their better real-time performance and lower computational cost, have become the dominant choice for such applications.

To address the aforementioned challenges, numerous researchers have proposed various detection solutions. Some approaches tackle the issue from the perspective of data augmentation, employing methods such as background replacement to mitigate the monotonous backgrounds of UAV images; however, such work does little to improve small-target detection accuracy. Other studies integrate modules like the C3Transformer into the backbone of YOLOv5-based models to consolidate global information, replace part of the backbone convolutions with depthwise separable convolutions, and integrate attention mechanisms such as SE to enhance model efficiency. Another line of work integrates Transformer Prediction Heads (TPH) into YOLOv5 for accurate object localization in high-density scenes from a UAV perspective, effectively improving detection precision. Further YOLOv5-based improvements add a lightweight network such as MobileNetV2, which increases detection speed while also improving accuracy. Work on YOLOv7 uses the YOLOv7-tiny network as a baseline to balance speed and accuracy, adds a Convolutional Block Attention Module (CBAM) to strengthen the learning of important features, and employs data augmentation methods such as mosaic and mixup to improve detection of small targets and reduce missed detections. Research based on YOLOv8 introduces an attention scale sequence fusion mechanism and improves the loss function using the Wise-IoU mechanism, enhancing the flexibility and robustness of feature extraction and small-target detection at different levels.
While the aforementioned algorithms have shown significant effectiveness in improving detection accuracy, their precision remains insufficient when dealing with scenes involving complex occlusions between small targets, manifesting as increased missed detections and false alarms.

In light of this, we propose an improved model named GR-YOLOv11n, based on enhancements to YOLOv11. The improvements of our algorithm compared to the baseline YOLOv11n are as follows:

  1. To address the problem of losing tiny features, we introduce GhostNet v2. Through efficient convolution operations and a lightweight network structure design, GhostNet v2 effectively mitigates the loss of minute features in deep networks, enabling precise capture and localization of features from small UAV drone targets.
  2. To meet the demand for model lightweighting while maintaining high performance, we design the C2f-RepNCSPFPN module. This module significantly reduces computational complexity and the number of parameters by optimizing the network structure and parameter configuration, while preserving high detection accuracy. This improvement helps reduce the model’s resource consumption, making it more lightweight, without compromising detection performance.

Improved Network Architecture

YOLOv11 represents the latest version in the YOLO series of object detection models, featuring significant improvements and optimizations in model architecture, feature extraction capability, and computational efficiency. Architectural innovations in YOLOv11 include the introduction of the C3k2 block, SPPF, and C2PSA components, which contribute to enhanced feature extraction. The YOLOv11 network structure primarily consists of three parts: the Backbone, the Neck, and the Head.

The proposed GR-YOLOv11n algorithm’s network structure is illustrated in the provided framework diagram. Firstly, the GhostNet v2 module is introduced into the Backbone. As an enhanced version of GhostNet, GhostNet v2 strengthens feature representation and global information capture by incorporating a Decoupled Fully Connected (DFC) attention mechanism, thereby improving the detection rate for small-sized objects captured by UAV drones. Secondly, by designing the C2f-RepNCSPFPN module, the model can extract information more comprehensively during the feature extraction process, enhancing overall detection performance for aerial imagery.

Core Technical Improvements

1. The GhostNet v2 Module

In the exploration of lightweight neural network architectures to optimize the balance between computational resources and model performance, an innovative solution is introduced: employing GhostNet v2 as a network architecture component. GhostNet v2 is a lightweight neural network architecture that optimizes computational efficiency through depthwise separable convolutions and dynamic filter configurations while maintaining high accuracy, which is crucial for processing streaming data from UAV drones.

The core idea of GhostNet v2 is to reduce redundant computation by introducing a novel structure called the Ghost module and to enhance the model’s expressive power through a Decoupled Fully Connected (DFC) attention mechanism. The Ghost module utilizes cheap operations (like depthwise separable convolution) to generate redundant features, thereby obtaining rich feature representations at a lower computational cost. The DFC attention mechanism reduces computational complexity by decoupling fully connected layers while capturing long-range spatial information, further enhancing the model’s expressive power for complex UAV drone scenes.

For an input feature map $X$, the Ghost module replaces a standard convolution with a two-step process. First, a $1 \times 1$ convolution is used to generate intrinsic features $Y'$:

$$ Y' = X * F^{1 \times 1}, $$

where $*$ denotes the convolution operation, $F^{1 \times 1}$ is the pointwise convolution kernel, and $Y' \in \mathbb{R}^{H \times W \times C'_{out}}$ is the output intrinsic feature map. Its channel number $C'_{out}$ is smaller than the original output channel number $C_{out}$ ($C'_{out} < C_{out}$).

When lightweight operations (like depthwise separable convolution) act on these intrinsic features, they can generate richer feature expressions while controlling computational costs. Finally, the two parts of features are concatenated along the channel dimension to obtain the final output $Y$:

$$ Y = \text{Concat}([Y', Y' * F^{dp}]), $$

where $F^{dp}$ represents the depthwise separable convolution kernel, and $Y \in \mathbb{R}^{H \times W \times C_{out}}$ is the output feature map with $C_{out}$ channels. Although the Ghost module can significantly reduce computational cost, its representational capacity is also diminished. The GhostNet v2 structure addresses this.
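The two-step Ghost computation above can be sketched in PyTorch as follows. This is a minimal illustration rather than the reference GhostNet v2 implementation: the even split between intrinsic and cheap channels and the 3×3 depthwise kernel are illustrative choices.

```python
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a cheap stand-in for a standard convolution.

    Half of the output channels (the "intrinsic" features Y') come from a 1x1
    pointwise convolution; the other half are generated from Y' by a cheap
    depthwise convolution, and the two parts are concatenated along channels.
    """

    def __init__(self, c_in, c_out, dw_kernel=3):
        super().__init__()
        c_intrinsic = c_out // 2  # C'_out < C_out
        # Step 1: pointwise convolution producing the intrinsic features Y'
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_intrinsic, 1, bias=False),
            nn.BatchNorm2d(c_intrinsic),
            nn.ReLU(inplace=True),
        )
        # Step 2: cheap depthwise convolution F^dp applied to Y'
        self.cheap = nn.Sequential(
            nn.Conv2d(c_intrinsic, c_intrinsic, dw_kernel,
                      padding=dw_kernel // 2, groups=c_intrinsic, bias=False),
            nn.BatchNorm2d(c_intrinsic),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_prime = self.primary(x)  # Y' = X * F^{1x1}
        # Y = Concat([Y', Y' * F^dp])
        return torch.cat([y_prime, self.cheap(y_prime)], dim=1)
```

Because the depthwise branch costs far less than a full convolution over all $C_{out}$ channels, the module approximates a standard convolution at a fraction of the FLOPs.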

GhostNet employs an inverted residual bottleneck containing two Ghost modules. The first module produces expanded features with more channels, while the second module reduces the channel count to obtain the output features. It was found that applying the DFC attention to the features from the first module yields higher model performance for UAV drone image analysis. Therefore, the DFC is ultimately multiplied only with the expanded features.
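A sketch of DFC-style attention along the lines described above, assuming the commonly reported configuration (2× downsampling, horizontal then vertical 1-D depthwise convolutions, sigmoid gating); the kernel size and downsampling factor are illustrative, not taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DFCAttention(nn.Module):
    """Sketch of decoupled fully connected (DFC) attention.

    The per-axis "fully connected" aggregation is approximated by two 1-D
    depthwise convolutions (horizontal, then vertical) on a downsampled map,
    capturing long-range spatial context at low cost. The sigmoid-gated map is
    upsampled back and multiplied with the expanded Ghost features.
    """

    def __init__(self, channels, k=5):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1, bias=False)
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels, bias=False)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, kernel_size=2)            # cheap 2x downsampling
        a = self.vertical(self.horizontal(self.proj(a)))
        a = torch.sigmoid(a)                          # attention gate in [0, 1]
        a = F.interpolate(a, size=(h, w), mode="nearest")
        return x * a                                  # gate the expanded features
```

In the bottleneck described above, this gate would be applied only to the expanded features produced by the first Ghost module.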

2. Design of the C2f-RepNCSPFPN Module

To optimize network performance for UAV drone aerial image object detection tasks while balancing computational efficiency and model accuracy, the C2f-RepNCSPFPN module was developed. Building upon and surpassing existing network structure improvement strategies, this module aims to achieve dual optimization of computational load and parameter count through innovative architectural design, while avoiding the performance bottlenecks commonly caused by increased memory access in traditional methods.

The C2f module enhances feature expression capability through its unique structure design, which enables effective splitting and parallel processing of feature maps. The module first adjusts the channel number of the input feature map using a $1 \times 1$ convolution. It then splits the feature map into two branches via a Split operation. These two branches perform feature extraction and fusion through Bottleneck structures and a Concat operation, respectively. Finally, the processed feature map is output through another $1 \times 1$ convolutional layer. This process not only improves the nonlinear representation ability of features but also enhances gradient flow through residual connections, accelerating model convergence.
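The C2f flow just described (channel adjustment, split, chained bottlenecks, concatenation, final 1×1 convolution) can be sketched as follows; the channel bookkeeping and the number of bottlenecks are illustrative choices, not the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """3x3 -> 3x3 bottleneck with a residual shortcut (aids gradient flow)."""

    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.conv2(self.act(self.conv1(x))))


class C2f(nn.Module):
    """Sketch of C2f: 1x1 conv -> split -> bottleneck chain -> concat -> 1x1."""

    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_out, 1, bias=False)  # adjust channel count
        c_half = c_out // 2
        self.blocks = nn.ModuleList(Bottleneck(c_half) for _ in range(n))
        # fuse: one passthrough half + one processed half + n bottleneck outputs
        self.cv2 = nn.Conv2d((n + 2) * c_half, c_out, 1, bias=False)

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)  # Split into two branches
        ys = [a, b]
        for block in self.blocks:
            ys.append(block(ys[-1]))        # chain bottlenecks, keep each output
        return self.cv2(torch.cat(ys, dim=1))  # Concat, then final 1x1 conv
```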

The RepNCSPFPN module adopts a top-down path and upsampling strategy to achieve effective fusion between feature maps at different levels. By progressively fusing high-level semantic feature maps with low-level detail feature maps, it generates a feature pyramid rich in both semantic and detailed information. This significantly improves the model’s ability to detect targets of different scales, which is crucial for enhancing target detection accuracy, especially when handling small targets and complex scenes encountered by UAV drones.
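A generic top-down fusion sketch of the kind described above, not the RepNCSPFPN module itself: lateral 1×1 convolutions align channel widths, high-level semantic maps are upsampled and added to lower-level detail maps, and 3×3 convolutions smooth the fused outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownFusion(nn.Module):
    """Illustrative top-down feature-pyramid fusion (FPN-style)."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        width = channels[0]
        # 1x1 lateral convs bring every level to a common channel width
        self.laterals = nn.ModuleList(nn.Conv2d(c, width, 1) for c in channels)
        # 3x3 convs smooth each fused map
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1)
                                    for _ in channels)

    def forward(self, feats):
        # feats: [P3, P4, P5], ordered high resolution -> low resolution
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        outs = [laterals[-1]]                      # start from the deepest level
        for i in range(len(laterals) - 2, -1, -1):
            up = F.interpolate(outs[0], scale_factor=2, mode="nearest")
            outs.insert(0, laterals[i] + up)       # fuse semantics with detail
        return [s(o) for s, o in zip(self.smooth, outs)]
```

Each output level thus carries both high-level semantics (from above) and fine detail (from its own resolution), which is what benefits multi-scale and small-target detection.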

The combined C2f-RepNCSPFPN module processes the input feature map through a $1 \times 1$ convolutional layer, followed by a Split operation that divides the feature map into two parts. The repetitive asymmetric convolution block, RepNCSP Module, is used for feature extraction. The feature maps processed by the RepNCSP module are then concatenated with the directly passed-through feature maps. The final output is produced after this concatenation. The integrated C2f-RepNCSPFPN module enables YOLOv11n to achieve efficient feature extraction and fusion while maintaining model lightweighting, leading to better performance in object detection tasks. This combination not only improves the model’s detection accuracy but also optimizes computational performance, making the model more suitable for real-time UAV drone application scenarios.

Experiments and Results Analysis

1. Dataset and Experimental Setup

Dataset: The VisDrone2019 dataset, collected by the AISKYEYE team at Tianjin University’s Machine Learning and Data Mining Laboratory, is a comprehensive and challenging benchmark for visual object detection, tracking, and counting from a UAV drone perspective. It consists of high-quality videos and images from complex urban environments, covering diverse weather conditions, viewpoint variations, target scale differences, and dense scenes. This provides researchers with rich data resources to explore and improve the performance of UAV drone vision systems. The dataset is acquired via UAV drone capture, encompassing diverse scenarios under different meteorological conditions across 14 Chinese cities. The VisDrone2019 dataset includes 8,629 images captured by UAV drones, with a focus on small and medium-sized objects. The training, validation, and test sets consist of 6,471, 548, and 1,610 images, respectively. The dataset is divided into 10 categories: pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motorcycle.

To validate the model’s generalizability and adaptability across different scenarios, a generality experiment was conducted on the RDD2020 (Road Damage Dataset 2020). This dataset covers damage images from various road types, containing rich scene variations and complex background information, which can effectively assess the model’s generalization capability in road damage detection tasks.

Experimental Environment and Evaluation Metrics: The experiments are based on an Ubuntu 22.04 operating system. The environment includes: PyTorch 2.0.0, Python 3.8.10, CUDA 11.8. The GPU is an NVIDIA GeForce RTX 3090 with 24GiB of memory. Default data augmentation methods are used. To ensure fair and comparable model results, no pre-trained weights are used for any ablation or comparative experiments.

To better evaluate the model’s performance and compare it with the baseline, the experiments employ Precision (P), Recall (R), mean Average Precision (mAP@0.5, mAP@0.5:0.95), number of model parameters (Params), and computational complexity (GFLOPs) for assessment. Their calculation formulas are as follows:

$$ P = \frac{TP}{TP + FP}, $$

$$ R = \frac{TP}{TP + FN}, $$

$$ mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \quad \text{where} \quad AP = \int_{0}^{1} P(R) dR. $$

Here, $N$ is the number of categories in the dataset. $TP$ represents true positives (correctly predicted positive results), $FP$ represents false positives (incorrectly predicted positive results), and $FN$ represents false negatives (actual positives predicted as negative). IoU (Intersection over Union) is the ratio of the intersection area to the union area of two bounding boxes. mAP@0.5 represents the mAP value at an IoU threshold of 0.5. mAP@0.5:0.95 is the average mAP calculated at IoU thresholds from 0.5 to 0.95 in steps of 0.05. Higher values for mAP@0.5 and mAP@0.5:0.95 indicate better overall model accuracy and performance in the object detection task. The Average Precision (AP) for each class can be viewed as the area under the Precision-Recall (P-R) curve.
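The metrics above can be computed directly from detection counts and the P-R curve. A plain-Python sketch follows; it uses simple trapezoidal integration for AP, whereas detection toolkits typically use interpolated variants of the P-R curve.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def average_precision(recalls, precisions):
    """AP = area under the P-R curve (trapezoidal rule);
    inputs must be sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap


def mean_ap(per_class_ap):
    """mAP = mean of per-class AP over the N categories."""
    return sum(per_class_ap) / len(per_class_ap)
```

mAP@0.5:0.95 is then simply `mean_ap` evaluated at each IoU threshold from 0.5 to 0.95 in steps of 0.05, averaged.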

The detailed hyperparameter settings during training are summarized in the following table:

Table 1: Detailed Experimental Parameter Settings

| Parameter | Value |
| --- | --- |
| Training Epochs | 200 |
| Batch Size | 16 |
| Image Size | 640 × 640 |
| Initial Learning Rate | 0.01 |
| Learning Rate Momentum | 0.937 |
| Weight Decay Coefficient | 0.0005 |
| Optimizer | SGD |
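Assuming the Ultralytics training CLI, a hypothetical invocation matching these settings might look like the following; the dataset YAML name is an assumption, and `pretrained=False` reflects the no-pre-trained-weights protocol described above.

```shell
# Hypothetical reproduction of Table 1's hyperparameters with the
# Ultralytics CLI (dataset YAML path is an assumption, not from the paper)
yolo detect train model=yolo11n.yaml data=VisDrone.yaml \
  epochs=200 batch=16 imgsz=640 optimizer=SGD \
  lr0=0.01 momentum=0.937 weight_decay=0.0005 pretrained=False
```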

2. Experimental Results and Analysis

Ablation Study

This study uses the mainstream UAV drone target detection dataset, VisDrone2019, to conduct multiple sets of experiments under the same environment to evaluate the impact of enhancement modules on the baseline model. The results of introducing different improvement strategies are shown in the table below (bold data indicates the best result, a checkmark indicates the module is used).

Table 2: Comparison of Ablation Experiment Results for the Impact of Different Improvement Strategies on Baseline Model Performance

| Method | GhostNet v2 | C2f-RepNCSPFPN | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/10⁶ | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (YOLOv11n) | | | 43.6 | 33.2 | 32.8 | 19.0 | 2.59 | 6.5 |
| Improvement 1 | ✓ | | 43.4 | 33.8 | 33.2 | 19.1 | 2.38 | **6.4** |
| Improvement 2 | | ✓ | 50.0 | 38.5 | 39.2 | 23.5 | 2.82 | 9.8 |
| Improvement 3 (GR-YOLOv11n) | ✓ | ✓ | **52.0** | **38.6** | **40.4** | **24.4** | **2.35** | 8.4 |

To validate the effectiveness of the proposed improvements for small target detection, the efficacy of each enhancement module was tested on the model. As shown in Table 2, using YOLOv11n as the baseline model, the first step introduces GhostNet v2 into the backbone network, replacing the original C2f modules. This improved the model’s mAP@0.5 by 0.4 percentage points compared to the baseline. The parameter count and floating-point operations slightly decreased, verifying the effectiveness of Improvement 1. In Improvement 2, embedding the C2f-RepNCSPFPN module in the Neck further increased mAP@0.5 and mAP@0.5:0.95 by 6.4 and 4.5 percentage points, respectively, compared to the baseline, with substantial improvements in P and R. Although the improved module increased parameters and computational complexity compared to the baseline, the increase is within an acceptable range.

Compared to the original YOLOv11n algorithm, the proposed GR-YOLOv11n algorithm for small target detection demonstrated significant performance improvement on the VisDrone2019 dataset. Specifically, Precision (P), Recall (R), and mAP@0.5 increased by 8.4, 5.4, and 7.6 percentage points, respectively. This enhancement not only strengthens the model’s ability to recognize small UAV drone targets but also significantly boosts overall detection performance. Simultaneously, through efficient network architecture design, the improved algorithm successfully reduced the number of parameters by 0.24 million, validating the effectiveness of the proposed algorithm.

Comparative Experiments

In this study, comparative experiments were conducted to evaluate the proposed GR-YOLOv11n in object detection tasks. The proposed algorithm was comprehensively compared with Faster R-CNN, SSD, YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv7-tiny, YOLOv8n, and the baseline YOLOv11n. All algorithms were tested on the VisDrone2019 dataset.

Table 3: Performance Comparison of Different Object Detection Models on the VisDrone2019 Dataset

| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/10⁶ | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 42.6 | 33.4 | 32.2 | 18.6 | 41.2 | 206.7 |
| SSD | 21.3 | 36.0 | 23.9 | 12.3 | 24.5 | 87.9 |
| YOLOv3-Tiny | 37.4 | 23.7 | 22.8 | 12.7 | 12.1 | 19.1 |
| YOLOv5n | 42.3 | 31.8 | 31.5 | 18.1 | 2.18 | 5.9 |
| YOLOv6n | 38.6 | 30.2 | 29.0 | 16.9 | 4.16 | 11.6 |
| YOLOv7-Tiny | 43.4 | 36.1 | 32.8 | 16.7 | 6.03 | 13.3 |
| YOLOv8n | 42.1 | 32.8 | 32.0 | 18.6 | 2.69 | 7.0 |
| YOLOv11n (Baseline) | 43.6 | 33.2 | 32.8 | 19.0 | 2.59 | 6.5 |
| GR-YOLOv11n (Ours) | 52.0 | 38.6 | 40.4 | 24.4 | 2.35 | 8.4 |

From the comparative results in Table 3, the baseline YOLOv11n performs strongly across the key metrics, achieving a P of 43.6%, R of 33.2%, mAP@0.5 of 32.8%, and mAP@0.5:0.95 of 19.0%; among the comparison models (excluding the proposed algorithm), it matches or exceeds the others on precision and both mAP metrics, although YOLOv7-tiny (36.1%) and SSD (36.0%) attain higher recall and YOLOv7-tiny ties it on mAP@0.5. YOLOv11n also shows a good balance between model size and computational complexity, with 2.59M parameters and 6.5 GFLOPs, maintaining relatively high detection performance at low computational cost. The SSD algorithm has low detection accuracy, while Faster R-CNN has a large parameter count, a complex structure, and slow detection speed, making both unsuitable for UAV image target detection. The proposed algorithm outperforms YOLOv11n on all four accuracy metrics, achieving a P of 52.0%, R of 38.6%, mAP@0.5 of 40.4%, and mAP@0.5:0.95 of 24.4%, while remaining efficient with 2.35M parameters and 8.4 GFLOPs. This indicates that the proposed algorithm further optimizes the network structure and detection performance while inheriting the advantages of YOLOv11n, giving it stronger competitiveness in UAV object detection tasks.

The following table shows the mAP@0.5 detection results for the 10 target categories in the VisDrone2019 dataset across various mainstream algorithms. The proposed algorithm outperforms the current mainstream models in every category. Compared to the baseline YOLOv11n, accuracy increases for each category, with the largest gains for 'truck' and 'bus'.

Table 4: Comparison of Detection Results (mAP@0.5) of Different Algorithms on the VisDrone2019 Dataset (%)

| Method | Pedestrian | Person | Bicycle | Car | Van | Truck | Tricycle | Awning-tricycle | Bus | Motorcycle | mAP@0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv3-Tiny | 18.2 | 14.1 | 4.2 | 60.1 | 29.6 | 21.4 | 14.1 | 6.4 | 38.4 | 20.8 | 22.8 |
| YOLOv5n | 33.1 | 26.4 | 7.3 | 74.8 | 36.7 | 26.6 | 18.4 | 10.5 | 46.7 | 34.7 | 31.5 |
| YOLOv6n | 30.6 | 23.3 | 3.5 | 73.5 | 34.5 | 24.6 | 17.5 | 11.0 | 41.1 | 30.2 | 29.0 |
| YOLOv7-Tiny | 34.2 | 24.8 | 8.2 | 74.8 | 38.5 | 27.8 | 20.7 | 12.4 | 48.9 | 38.4 | 32.8 |
| YOLOv8n | 33.6 | 26.7 | 7.8 | 75.1 | 37.0 | 28.0 | 21.1 | 12.0 | 43.5 | 35.3 | 32.0 |
| YOLOv11n | 34.5 | 26.3 | 7.9 | 75.5 | 38.3 | 28.0 | 21.2 | 12.7 | 47.2 | 36.2 | 32.8 |
| GR-YOLOv11n (Ours) | 41.3 | 33.3 | 13.8 | 79.5 | 44.8 | 39.3 | 29.7 | 17.8 | 59.5 | 45.0 | 40.4 |

Generality Experiment

To verify the model’s generality and adaptability across different scenarios, a generality experiment was conducted on the RDD2020 dataset. The experimental results are shown in the table below.

Table 5: Comparative Experimental Results of YOLOv11n and Improved GR-YOLOv11n on the RDD2020 Dataset

| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/10⁶ | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv11n | 54.1 | 44.6 | 45.5 | 20.7 | 2.59 | 6.4 |
| GR-YOLOv11n (Ours) | 56.6 | 48.2 | 46.3 | 21.8 | 2.35 | 8.4 |

The experimental results indicate that applying the proposed algorithm to the RDD2020 dataset increased the detection accuracy values mAP@0.5 and mAP@0.5:0.95 by 0.8 and 1.1 percentage points, respectively. Precision improved by 2.5 percentage points, and Recall improved by 3.6 percentage points. In terms of computational complexity, the proposed algorithm is 8.4 GFLOPs, which, although slightly higher than the baseline model, is within an acceptable range. All obtained metrics show improvement, fully demonstrating that the proposed improvement method has good applicability and robustness. This also confirms again that the proposed algorithm is well-suited for UAV drone image target detection.

Visualization Comparison and Heatmap Analysis

This study demonstrates the evaluation metrics of the improved model on the VisDrone2019 dataset and analyzes the results through visualizations. The visualization results show that the target scenes contain various complex backgrounds and interfering factors, with significant variations in the size, shape, and position of target objects. In such scenarios, the YOLOv11n algorithm can detect some targets but exhibits obvious shortcomings in recognizing small or occluded UAV targets. In contrast, the proposed improved algorithm shows clear advantages in its detection results. By optimizing the model structure and improving the detection strategy, it locates and identifies target objects more accurately. For small targets, the improved algorithm effectively captures target feature information, reducing the missed detection rate. The baseline algorithm also exhibits obvious false detections in some cases, such as misidentifying a van as a car, which the improved algorithm corrects. In complex backgrounds, the improved algorithm likewise significantly enhances detection performance, with a markedly reduced false alarm rate. The detection boxes of the improved algorithm fit the contours of target objects more tightly, and the detection results are more accurate and stable.

To more intuitively observe the features learned by the trained network, a heatmap analysis was conducted for the baseline model YOLOv11n and GR-YOLOv11n. The analysis reveals that when YOLOv11n handles object detection tasks, its activated regions are mainly concentrated in the center of the target, with relatively weak feature extraction for target edges. While this feature extraction method can locate targets to some extent, it leads to decreased detection accuracy when dealing with complex scenes or irregularly shaped UAV drone targets. In images with low contrast between the target and the background, the heatmap of YOLOv11n shows lower attention to target edges, making it susceptible to interference from background noise.

In contrast, the proposed improved algorithm shows significant advantages in heatmap representation. The heatmap of the improved algorithm not only shows strong activation in the central region of the target but also exhibits more sufficient feature extraction for target edges. This comprehensive focus on features enables the model to more accurately identify the contours and shapes of targets, thereby improving detection accuracy and robustness. Especially when handling small or occluded UAV drone targets, the heatmap of the improved algorithm shows stronger attention to target details, which helps reduce missed detections and false alarms. Furthermore, the heatmap of the improved algorithm also shows better suppression capability for complex backgrounds. In some images with complex backgrounds, the improved algorithm can more effectively focus attention on the target without being misled by interfering features in the background. This indicates that the improved algorithm already possesses stronger discriminative and anti-interference capabilities during the feature extraction stage.

Conclusion

Addressing the issues of low recognition accuracy and high false alarm rates in existing UAV drone image target detection, an improved algorithm based on YOLOv11 was proposed. This algorithm optimizes the feature extraction and fusion process by employing the GhostNet v2 and C2f-RepNCSPFPN modules, enhancing the model’s robustness and generalization capability. GhostNet v2 reduces computational load through its special convolutional structure while maintaining high accuracy. The C2f-RepNCSPFPN module combines the characteristics of the C2f module and the RepNCSPFPN module, achieving effective fusion between feature maps at different levels through effective splitting and parallel processing of feature maps, as well as a top-down path and upsampling strategy. This improves the model’s ability to detect targets of various scales in aerial imagery. Under the combined effect of these improvements, the performance of UAV drone image target detection is significantly enhanced.

The results show that on the VisDrone2019 dataset, the algorithm increased mAP@0.5 and mAP@0.5:0.95 by 7.6 and 5.4 percentage points, respectively, outperforming the other mainstream algorithms evaluated. Even under complex scenes and varying external conditions, it meets the detection requirements of UAV operations. Future research can focus on further optimizing the model structure to improve detection precision, thereby increasing the model's practical value for real-world UAV applications.
