In recent years, the rapid development of deep learning has propelled the advancement of edge devices. Since the low-altitude economy was first included in the planning outline, this field has been vigorously promoted. As a core application carrier of the low-altitude economy, UAV drones can compensate for information losses caused by weather, airspace restrictions, and other factors in tasks such as low-altitude observation, regional monitoring, and scene inspection. However, aerial images captured by UAV drones often suffer from high viewing angles, small targets, and complex backgrounds, leading to difficulties in feature extraction during detection. Targets may occlude each other, resulting in missed detections and false alarms. Therefore, designing a small target detection algorithm based on deep learning for complex scenarios from the perspective of UAV drones is of significant practical importance.
Currently, deep learning-based target detection algorithms are divided into single-stage and two-stage methods. Two-stage algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN, first screen candidate target regions and then perform refined target classification and localization to improve detection accuracy. While these algorithms offer certain advantages in precision, they have large parameter counts and slow detection speeds, making them unsuitable for deployment on real-time detection devices. Single-stage detection algorithms, like SSD and the YOLO series, reduce computational steps and achieve efficient real-time detection of targets. However, in small target detection under complex backgrounds, due to difficulties in feature extraction, issues such as missed detections and false alarms often arise, reducing the model’s detection accuracy.
To address these challenges, researchers continuously optimize algorithms to enhance the recognition and localization capabilities for small targets in complex scenarios, enabling more efficient and accurate UAV applications. For instance, some studies have introduced dual-level routing attention mechanisms in the feature extraction part, proposed loss functions like FPIoU, and adopted dynamic detection heads combined with attention mechanisms to improve YOLOv5-based algorithms for UAV small target detection. Others have redesigned feature extraction models to enrich target feature information and designed hybrid domain attention mechanisms to suppress background interference. However, these approaches often face issues such as high model complexity, slow detection speeds, or insufficient accuracy. Specifically, for UAV drone images with high-resolution features, complex backgrounds, and overlapping targets, methods like bidirectional growth networks, spatial channel enhancement modules, and multi-scale feature extraction modules have been proposed to optimize YOLOv8. While improvements are achieved, there remains a need for better balance between parameter count, computational complexity, and detection performance.

To overcome these limitations, we base our work on YOLOv11, which offers significant improvements in feature extraction, efficiency, speed, accuracy, and environmental adaptability compared to previous YOLO versions. We propose a small target detection algorithm for complex scenarios from the UAV drone viewpoint, named DLSRF-Net, which aims to enhance feature fusion and extraction stages, reduce model load, and improve detection accuracy for small targets in complex environments. Our contributions include: designing the Adaptive Depthwise Separable Receptive Field Attention Convolution (DSRFAConv) module to enhance the model’s ability to extract receptive field features of small targets while reducing computational cost; developing a novel Lightweight Multi-scale Linear Attention (LMLA) mechanism to address high computational complexity in detecting dense small targets in high-resolution complex scenes and improve attention to small targets; and creating the Depthwise Separable Receptive Field Multi-level Feature Fusion module (RSCDI) to integrate DSRFAConv, upsampling modules, spatial attention, and channel attention mechanisms, achieving deep fusion of multi-scale features and enhancing the model’s feature representation capability for small target detection in complex scenarios.
Methodology
The overall structure of DLSRF-Net consists of three parts: the backbone network, the neck network, and the detection head, whose main functions are feature extraction, feature fusion, and detection output, respectively. Since small targets seen from the UAV perspective provide limited receptive-field features, and traditional convolutions introduce large parameter counts, we employ DSRFAConv to enhance the model’s ability to extract receptive field features of small targets and to streamline the computation. Small targets in UAV views are easily occluded, and multiple targets exhibit multi-scale characteristics; we therefore design a multi-branch lightweight multi-scale linear attention mechanism to improve the model’s attention to small targets. The original model’s upsampling and fully connected layers can cause information loss and increase computational cost; we therefore introduce newly designed upsampling and fully connected layers to better preserve the target feature information extracted by the model, improving detection accuracy for small targets and enhancing robustness and generalization.
Depthwise Separable Receptive Field Attention Convolution (DSRFAConv)
In traditional convolution operations, receptive field features overlap, and each receptive field uses the same features, so target attention weights are shared across all receptive field features. This prevents distinct differentiation of information at different positions, limiting model performance. RFAConv addresses this by focusing on the spatial features of receptive fields to improve detection effectiveness. However, aerial images from UAVs have low resolution, leading to insufficient receptive fields during feature extraction, which reduces sensitivity to small target locations. Additionally, the extra receptive field extraction operations increase computational load. Compared to standard convolution, depthwise separable convolution does not apply the same kernel across all input channels, resulting in lower parameter counts and computational costs and making the model more suitable for mobile devices or resource-constrained environments. We therefore propose DSRFAConv, which integrates RFAConv with depthwise separable convolution.
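The parameter savings of the depthwise separable design can be checked with a quick calculation. The channel sizes below are illustrative, not taken from DSRFAConv itself:

```python
# Parameter counts for a standard k x k convolution versus a depthwise
# separable one (depthwise k x k + pointwise 1 x 1); bias terms omitted.

def standard_conv_params(c_in, c_out, k):
    # Every output channel has its own k x k kernel over all input channels.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise: one k x k kernel per input channel.
    # Pointwise: 1 x 1 convolution mixing the channels.
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)        # 64*128*9 = 73728
dws = depthwise_separable_params(c_in, c_out, k)  # 64*9 + 64*128 = 8768
print(std, dws, round(std / dws, 1))              # roughly an 8x reduction
```

For these (hypothetical) dimensions, the separable form needs about one eighth of the parameters, which is the source of the efficiency gain claimed above.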
The processing flow of RFAConv is as follows: for an input feature \( X \), RFAConv first uses grouped convolution to convert spatial features into receptive field features, where each receptive field corresponds to a sensing window. By reshaping, each sensing window is adjusted to \( k \) times its original size, and each input receptive field feature is averaged via pooling to compress it into a constant. Then, grouped convolution is used to enhance feature connections between receptive fields, and Softmax generates attention weights for each feature. Finally, attention weights are multiplied with receptive field features to weight them based on importance, yielding the output. This process can be expressed as:
$$ F = \text{Softmax}(g_{i\times i}(\text{AvgPool}(X))) \times \text{ReLU}(\text{Norm}(g_{k\times k}(X))) = A_{rf} \times F_{rf} $$
where \( g_{i\times i} \) denotes grouped convolution of size \( i \times i \), \( k \) is the kernel size, Norm represents normalization, and \( F \) is the result of multiplying attention feature \( A_{rf} \) with receptive field feature \( F_{rf} \).
DSRFAConv replaces the basic convolution modules in the backbone layers. It employs RFAConv to capture more receptive field spatial features and uses depthwise separable convolution (DSConv) to perform grouped convolution along feature dimensions, followed by pointwise convolution (PConv) to merge all channels into the output feature map. This reduces computational load and improves efficiency, enhancing the model’s accuracy in detecting and recognizing small targets from UAV drones.
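The flow described above can be sketched in PyTorch. This is a minimal illustration, not the authors' exact implementation: the grouping, normalization, and reshape details of DSRFAConv may differ, and the layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class DSRFAConvSketch(nn.Module):
    """Sketch: receptive-field-style attention weighting followed by a
    depthwise separable convolution (depthwise k x k, then pointwise 1 x 1)."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # Attention branch: average pooling, grouped 1x1 conv, spatial softmax.
        self.pool = nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.attn = nn.Conv2d(c_in, c_in, kernel_size=1, groups=c_in)
        # Depthwise separable branch.
        self.dw = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pw = nn.Conv2d(c_in, c_out, 1)
        self.norm = nn.BatchNorm2d(c_in)
        self.act = nn.ReLU()

    def forward(self, x):
        a = torch.softmax(self.attn(self.pool(x)).flatten(2), dim=-1)
        a = a.view_as(x)                      # per-position attention weights
        f = self.act(self.norm(self.dw(x)))   # receptive field features
        return self.pw(a * f)                 # weight features, merge channels

x = torch.randn(1, 16, 32, 32)
y = DSRFAConvSketch(16, 32)(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

The pointwise convolution at the end plays the role of the PConv channel-merging step described above.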
Lightweight Multi-scale Linear Attention (LMLA)
Attention mechanisms can enhance the model’s focus on small target features and reduce processing of complex backgrounds. Multi-scale Linear Attention (MLA) aggregates Q, K, V token information from around small target features to obtain multi-scale tokens. During aggregation, the Q, K, V tokens are kept independent, using only small-kernel DSConv for separation to avoid hardware efficiency loss. During GPU training, all DSConv operations are aggregated into a single DSConv, and all 1×1 convolutions into a single 1×1 grouped convolution, improving the model’s ability to aggregate multi-scale features. Additionally, MLA leverages the associative property of matrix multiplication to reduce computational complexity and memory usage from quadratic to linear while retaining the original performance. This process can be represented as:
$$ O_i = \frac{\sum_{j=1}^N [\text{ReLU}(Q_i)\text{ReLU}(K_j)^T] V_j}{\text{ReLU}(Q_i) \sum_{j=1}^N \text{ReLU}(K_j)^T} = \frac{\text{ReLU}(Q_i)(\sum_{j=1}^N \text{ReLU}(K_j)^T V_j)}{\text{ReLU}(Q_i)(\sum_{j=1}^N \text{ReLU}(K_j)^T)} $$
where \( Q = xW_Q \), \( K = xW_K \), and \( V = xW_V \) are linear projections of the input \( x \); \( O_i \) is the \( i \)-th row of the linear attention output matrix; \( j \) is the summation index over all input positions; and \( N \) is the input sequence length, i.e., the number of tokens. With this method, computational complexity and memory usage are both \( O(N) \).
Compared to Softmax attention, linear attention decouples the Softmax function into two independent functions, allowing the attention computation order to be changed from \( (Q \cdot K) \cdot V \) to \( Q \cdot (K \cdot V) \), reducing overall computational complexity to linear. ReLU linear attention uses ReLU as the kernel function, which is more hardware-friendly. Unlike Softmax attention, ReLU linear attention leverages the associative property of the Matmul function to lower complexity from quadratic to linear without altering functionality, avoiding inefficiencies of Softmax operations on hardware.
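The associativity argument can be verified numerically. The toy example below computes ReLU linear attention in both orders; the normalizing denominator from the equation above is omitted for brevity, and all dimensions are made-up illustrative values:

```python
# ReLU linear attention computed in both orders. Reordering
# (Q K^T) V  ->  Q (K^T V)  changes the cost from O(N^2) to O(N) in the
# sequence length while producing the same result, since only matrix
# associativity is used. Normalization is omitted for brevity.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

random.seed(0)
N, d = 8, 4                                    # sequence length, head dim
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
Qr, Kr = relu(Q), relu(K)

Kt = [list(col) for col in zip(*Kr)]           # K^T, shape d x N
quad = matmul(matmul(Qr, Kt), V)               # (Q K^T) V : quadratic order
lin = matmul(Qr, matmul(Kt, V))                # Q (K^T V) : linear order
err = max(abs(a - b) for ra, rb in zip(quad, lin) for a, b in zip(ra, rb))
print(err < 1e-9)  # True: the two orders agree up to floating-point error
```

The second order never materializes the \( N \times N \) attention matrix, which is what makes the memory footprint linear in \( N \).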
In UAV drone perspectives, images often contain targets of various scales. For example, the VisDrone2021 dataset includes distant tiny targets like pedestrians, vans, bicycles, and motorcycles, as well as larger targets like buses, and others occluded by backgrounds such as trees or buildings. This complexity reduces the model’s attention to tiny targets, leading to poor detection performance. Therefore, we incorporate and reconstruct MLA by adding 3×3 DSConv and GConv fusion branches. The improved MLA is named the Multi-branch Lightweight Multi-scale Linear Attention (LMLA) mechanism. LMLA enhances the model’s attention to small targets, improves detection performance, reduces computational load, decreases training latency, and boosts overall model performance for UAV drones.
Depthwise Separable Receptive Field Multi-level Feature Fusion (RSCDI)
In YOLOv11, upsampling operations are used to restore feature map resolution, transmit and highlight contextual information, helping the model better understand image content. Concatenation (Concat) operations appear between different network layers for feature map fusion, merging multiple small feature maps into larger ones, but this significantly increases computational load. Since small targets from UAV drone perspectives have low resolution, traditional upsampling operations can cause information loss for such low-resolution small targets when integrating contextual information. Simultaneously, Concat operations add computational costs. Therefore, we address these issues by integrating DSRFAConv, Conv, upsampling modules, and the SDI module, proposing the RSCDI module.
The RSCDI module first divides the input features into feature maps at different levels according to a channel-number list and applies DSRFAConv to each level separately. A 1×1 Conv then transmits the features, and the transmitted feature maps undergo spatial attention and channel attention processing to fuse local spatial information with global channel information. Next, the feature maps at different levels are adjusted to the same resolution: a feature map whose resolution is lower than the target is upsampled; one whose resolution is higher is downsampled; one of equal resolution is used directly. SmoothConv (a 3×3 Conv) is applied to the adjusted feature maps for smoothing, removing noise while preserving detail. Finally, a Hadamard product (element-wise multiplication) yields the fused feature map, maximizing the integration of local spatial and global channel information, highlighting key features of small targets, and suppressing irrelevant information so that the model focuses more on small targets. This enhances the model’s detection effectiveness for small targets in complex UAV scenarios.
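The alignment-and-fusion flow above can be sketched as follows. This illustration covers only the resolution alignment, smoothing, and Hadamard fusion steps; the attention branches and DSRFAConv are omitted, and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sdi_style_fuse(feats, channels, target_hw):
    """Sketch of SDI-style fusion: project each level to a common channel
    width with a 1x1 conv, resample to the target resolution, smooth with
    a 3x3 conv (SmoothConv), then combine via element-wise product."""
    proj = [nn.Conv2d(f.shape[1], channels, 1) for f in feats]  # 1x1 transmit
    smooth = nn.Conv2d(channels, channels, 3, padding=1)        # SmoothConv
    fused = None
    for f, p in zip(feats, proj):
        g = p(f)
        g = F.interpolate(g, size=target_hw, mode="bilinear",
                          align_corners=False)  # up- or downsample as needed
        g = smooth(g)
        fused = g if fused is None else fused * g  # Hadamard fusion
    return fused

# Three pyramid levels with illustrative shapes, fused at the middle scale.
feats = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)]]
out = sdi_style_fuse(feats, channels=64, target_hw=(40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```

Because the levels are multiplied rather than concatenated, the fused map keeps a fixed channel width, which is where the computational saving over Concat comes from.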
Experimental Validation
Datasets
We use the public VisDrone2021 dataset, collected and released by the AISKYEYE team of the Lab of Machine Learning and Data Mining at Tianjin University. The benchmark includes 288 video clips captured by various UAV-mounted cameras, covering different locations, environments, objects, and target densities, with data collected under varying weather and lighting conditions. The image subset comprises a training set of 6,471 images, a validation set of 548 images, and a test set of 1,610 images, covering 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motorcycle. The dataset contains a large number of targets, most of which appear small or blurred, making it highly relevant to target detection research in UAV scenarios; it effectively supports performance evaluation and robustness verification for real small target detection and meets our experimental requirements.
Experimental Environment and Evaluation Metrics
Experimental environment: We use Ubuntu 20.04 OS (kernel version 3.10), PyTorch 2.3.0, CUDA 12.1, CPU Intel(R) Xeon(R) Gold 6154 @ 3.00 GHz, and GPU NVIDIA GeForce RTX3090@24G. Parameter configuration: optimizer SGD, weight decay coefficient 0.0005, initial learning rate 0.01, epochs 200.
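If one were to reproduce these settings with the Ultralytics training interface, they would map roughly to the configuration below; the model and dataset file names are placeholders, not artifacts released with this work:

```python
# Hypothetical reproduction of the training configuration above using the
# Ultralytics YOLO API; "yolo11n.yaml" and "VisDrone.yaml" are placeholders.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")      # baseline architecture (placeholder)
model.train(
    data="VisDrone.yaml",         # dataset config (placeholder)
    epochs=200,
    optimizer="SGD",
    lr0=0.01,                     # initial learning rate
    weight_decay=0.0005,
)
```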
To evaluate model accuracy and usability in small target detection, we employ three evaluation metrics: Precision (P), Recall (R), and mean Average Precision (mAP50, mAP50-95).
1) Precision assesses the proportion of correctly predicted positive samples among all predicted positive samples. In object detection, if the predicted bounding box overlaps with the ground truth, it is considered correct. The formula is:
$$ P = \frac{TP}{TP + FP} $$
where TP is the number of true positives, and FP is the number of false positives.
2) Recall evaluates the proportion of all actual positive samples that the model can find. In object detection, if the ground truth bounding box overlaps with the predicted one, the sample is correctly recalled. The formula is:
$$ R = \frac{TP}{TP + FN} $$
where FN is the number of false negatives.
3) mean Average Precision comprehensively measures overall performance in multi-class object detection tasks. mAP50 denotes mAP at an IoU threshold of 50%; mAP50-95 is a stricter metric, requiring calculation of mAP at different IoU thresholds from 50% to 95%, then averaging, thus more accurately assessing model performance across IoU thresholds. The formulas are:
$$ AP = \int_0^1 P(R) dR $$
$$ mAP = \frac{1}{n} \sum_{i=1}^n AP(i) $$
where \( n \) is the number of classes.
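The metric definitions above reduce to simple arithmetic once TP, FP, FN, and per-class AP values are available. The counts and AP values below are made up for illustration; in practice AP comes from integrating the precision-recall curve at a given IoU threshold:

```python
# Precision and recall from raw detection counts, plus mAP as the mean of
# per-class average precisions. All numbers here are illustrative only.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def mean_ap(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)

tp, fp, fn = 80, 20, 40
print(precision(tp, fp))             # 0.8
print(recall(tp, fn))                # ≈ 0.667
print(mean_ap([0.52, 0.31, 0.40]))   # ≈ 0.41
```

mAP50-95 is obtained by repeating the AP computation at IoU thresholds 0.50, 0.55, …, 0.95 and averaging the ten results.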
Ablation Study Results and Analysis
To verify the effectiveness of our proposed methods, we conduct ablation experiments on the VisDrone2021 dataset using YOLOv11 as the baseline model. Results are shown in Table 1, where \( N_p \) is the parameter count, FPS (frames per second) measures real-time performance, and FLOPs (the number of floating-point operations) measures computational load.
| Module | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | \( N_p \) (10^6) | FPS (fps) | FLOPs (10^9) |
|---|---|---|---|---|---|---|---|
| Baseline | 42.7 | 33.0 | 32.9 | 18.8 | 2.6 | 82.3 | 6.3 |
| RFAConv | 44.5 | 33.5 | 33.6 | 19.7 | 2.6 | 61.4 | 6.7 |
| MLA | 43.4 | 34.1 | 33.5 | 19.6 | 2.5 | 69.7 | 7.1 |
| SDI | 45.4 | 33.9 | 34.3 | 20.2 | 3.3 | 74.4 | 12.8 |
| DSRFAConv | 44.9 | 34.1 | 34.4 | 19.9 | 2.4 | 66.6 | 6.6 |
| LMLA | 43.9 | 34.7 | 33.9 | 19.2 | 2.5 | 73.6 | 6.1 |
| RSCDI | 46.1 | 35.6 | 35.4 | 21.2 | 2.6 | 77.8 | 9.7 |
| DSRFAConv + LMLA | 45.4 | 36.2 | 35.1 | 20.9 | 2.5 | 80.1 | 6.9 |
| DSRFAConv + RSCDI | 46.6 | 36.1 | 36.4 | 21.7 | 2.6 | 83.7 | 8.7 |
| LMLA + RSCDI | 47.7 | 36.3 | 36.8 | 22.7 | 2.4 | 73.6 | 9.1 |
| DSRFAConv + LMLA + RSCDI | 49.9 | 38.1 | 40.1 | 24.2 | 2.8 | 109.8 | 11.9 |
Compared to RFAConv, the improved DSRFAConv improves every metric, including FPS, while reducing computational cost. Compared to MLA, the improved LMLA effectively lowers computational load thanks to its multi-branch design. Compared to the original SDI module, the improved RSCDI module not only reduces the parameter count by 21% but also improves overall detection performance. Replacing the Conv modules in the neck with DSRFAConv increases both mAP50 and mAP50-95, indicating better extraction of receptive field features for small targets. Introducing LMLA into the base model reduces the parameter count and improves overall performance, showing that LMLA effectively enhances attention to small targets while reducing computational load. With RSCDI, all metrics improve markedly, demonstrating that RSCDI effectively preserves contextual information for small targets. Combining DSRFAConv and LMLA improves overall performance beyond DSRFAConv alone, as LMLA’s ReLU kernel avoids the hardware inefficiencies of Softmax operations. Combining DSRFAConv and RSCDI boosts detection performance, since RSCDI integrates the DSRFAConv, Conv, upsampling, and SDI modules, with DSRFAConv aiding contextual integration and feature map connection. Combining LMLA and RSCDI reduces the parameter count by 7.7% and substantially improves comprehensive performance, benefiting from LMLA’s lightweight design and RSCDI’s suppression of irrelevant information. Including all three modules yields the best overall results: although the parameter count and computational load increase slightly relative to the baseline, P, R, mAP50, and mAP50-95 improve by 7.2, 5.1, 7.2, and 5.4 percentage points, respectively, and detection speed also rises notably, with FPS increasing by 27.5 fps.
Ablation results confirm that each proposed module contributes to improved detection performance; compared to the baseline algorithm, our algorithm shows significant enhancement in small target detection tasks for complex scenarios from UAV drone perspectives.
Comparative Experiment Results and Analysis
To validate the effectiveness of our proposed algorithm, we compare it with other state-of-the-art algorithms. Based on parameter count and computational load, we categorize model sizes into a basic size N (\( N_p < 4.5 \times 10^6 \) and FLOPs < \( 1.5 \times 10^{10} \)) and a larger size M (\( N_p > 4.5 \times 10^6 \) and FLOPs > \( 1.5 \times 10^{10} \)). This allows the model to better adapt to platforms with different loads and meet real-time detection requirements. Test set results after 200 epochs of training are shown in Table 2.
| Model Size | Algorithm | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | \( N_p \) (10^6) | FPS (fps) | FLOPs (10^9) |
|---|---|---|---|---|---|---|---|---|
| N | Drone-YOLO | 46.5 | 35.6 | 36.1 | 21.3 | 3.0 | 79.4 | 12.4 |
| | YOLOv5n | 42.8 | 32.0 | 32.3 | 18.2 | 2.2 | 96.6 | 5.8 |
| | YOLOv6n | 39.8 | 31.2 | 30.3 | 17.5 | 4.2 | 97.8 | 11.8 |
| | YOLOv8n | 44.2 | 31.9 | 32.4 | 19.0 | 3.0 | 101.5 | 8.1 |
| | YOLOv8-Ghostp2 | 44.0 | 32.3 | 32.6 | 18.8 | 1.6 | 66.8 | 7.3 |
| | YOLOv9t | 44.0 | 33.1 | 33.3 | 19.3 | 2.0 | 71.8 | 7.6 |
| | YOLOv10n | 43.0 | 32.4 | 32.5 | 18.7 | 2.7 | 70.2 | 8.2 |
| | YOLOv11n | 42.7 | 33.0 | 32.9 | 18.8 | 2.6 | 82.3 | 6.3 |
| | YOLOv12n | 41.7 | 31.4 | 30.9 | 17.8 | 2.5 | 46.7 | 6.0 |
| | DLSRF-Net | 49.9 | 38.1 | 40.1 | 24.2 | 2.8 | 109.8 | 11.9 |
| M | YOLOv8s | 49.5 | 36.9 | 37.8 | 22.6 | 11.2 | 98.6 | 28.5 |
| | YOLOv9s | 50.0 | 38.7 | 39.4 | 23.4 | 7.2 | 36.7 | 26.7 |
| | YOLOv10s | 48.0 | 37.5 | 37.9 | 22.5 | 8.1 | 68.9 | 24.5 |
| | YOLOv11s | 48.2 | 37.7 | 37.9 | 22.7 | 9.4 | 79.6 | 21.3 |
| | YOLOv12s | 48.5 | 36.8 | 36.9 | 22.0 | 9.1 | 47.8 | 19.3 |
| | RT-DETR | 54.4 | 42.7 | 45.3 | 27.5 | 19.8 | 44.8 | 36.7 |
| | DLSRF-Net | 60.1 | 45.6 | 47.3 | 28.6 | 7.1 | 103.1 | 25.5 |
In the N size category, our algorithm achieves the best performance in P, R, mAP50, and mAP50-95, improving by 7.2, 5.1, 7.2, and 5.4 percentage points over the baseline model YOLOv11n. Although YOLOv8-Ghostp2 has the lowest parameter count among these algorithms, its detection performance still lags. In the M size category, our algorithm achieves the best results across all metrics, improving by 11.9, 7.9, 9.4, and 5.9 percentage points over the baseline model YOLOv11s. While our method shows slight increases in FLOPs for both sizes, its FPS outperforms the other algorithms, balancing parameter count, computational load, detection accuracy, and speed. Moreover, our algorithm’s overall performance surpasses the heavily parameterized RT-DETR with only 35.9% of its parameters. In summary, our algorithm enhances the extraction of receptive field features for small targets while increasing attention to them, effectively integrating contextual information and suppressing irrelevant information. The experimental results validate the effectiveness of our algorithm for small target detection in complex UAV scenarios, achieving excellent performance in both the basic and larger model sizes.
Visualization Results
For a more intuitive comparison of performance differences between our algorithm and others, we conduct a visualization analysis. Figures 1 and 2 show the mAP50 and mAP50-95 curves of different models under the N and M size categories on the VisDrone2021 dataset. Regardless of model size, our algorithm achieves the best mAP50 and mAP50-95 with the fastest convergence, reaching higher accuracy in a relatively short time, indicating good detection performance for small targets in complex scenarios and more accurate localization of tiny targets.
To visually compare detection accuracy improvements for each target category in the VisDrone2021 dataset, we plot precision comparison charts based on detection results from baseline and our models under N and M sizes, as shown in Figure 3. Under both N and M model sizes, our algorithm improves detection accuracy for every category in the VisDrone2021 dataset, especially for tiny targets like pedestrians, bicycles, and motorcycles, demonstrating the effectiveness of our algorithm for small target detection in UAV drone applications.
To visualize algorithm performance, we use GradCAM++ technology to generate heatmaps for trained models, with results shown in Figure 4. Our algorithm detects more feature points, focuses more on small targets, and significantly suppresses backgrounds under both model sizes, indicating that our proposed algorithm outperforms the baseline model in detecting small targets in complex backgrounds from UAV drone perspectives.
Generalization Ability Verification
To verify the generalization ability of our proposed method on other UAV drone small target datasets, we conduct experiments on the DOTA and SSDD datasets. The DOTA dataset contains aerial images from different sensors and platforms, including objects of various sizes, orientations, and shapes across 16 categories. The SSDD dataset involves images under various sea conditions, lighting conditions, ship types, and sizes. We perform experiments on both datasets according to model sizes N and M, with results shown in Table 3.
| Dataset | Model Size | Model | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | \( N_p \) (10^6) | FPS (fps) | FLOPs (10^9) |
|---|---|---|---|---|---|---|---|---|---|
| DOTA | N | YOLOv10n | 55.7 | 34.0 | 35.0 | 20.9 | 2.7 | 62.7 | 8.2 |
| | | YOLOv11n | 64.6 | 35.8 | 38.3 | 22.9 | 2.5 | 77.1 | 6.3 |
| | | YOLOv12n | 65.1 | 35.5 | 37.3 | 22.1 | 2.5 | 49.3 | 6.0 |
| | | DLSRF-Net | 66.2 | 39.6 | 42.9 | 24.9 | 2.8 | 99.3 | 11.9 |
| DOTA | M | YOLOv10s | 61.7 | 37.6 | 39.7 | 24.0 | 8.1 | 62.5 | 24.5 |
| | | YOLOv11s | 69.5 | 39.3 | 43.3 | 26.5 | 9.4 | 70.4 | 21.3 |
| | | YOLOv12s | 68.9 | 39.7 | 44.2 | 25.9 | 9.1 | 48.8 | 19.3 |
| | | DLSRF-Net | 71.7 | 42.2 | 46.2 | 28.3 | 7.1 | 94.6 | 25.2 |
| SSDD | N | YOLOv10n | 94.9 | 92.7 | 97.2 | 72.2 | 2.7 | 61.1 | 8.2 |
| | | YOLOv11n | 95.9 | 93.4 | 98.1 | 72.2 | 2.6 | 75.2 | 6.3 |
| | | YOLOv12n | 96.6 | 92.7 | 97.4 | 71.0 | 2.5 | 50.3 | 6.0 |
| | | DLSRF-Net | 97.9 | 95.6 | 98.7 | 74.8 | 2.8 | 101.7 | 11.9 |
| SSDD | M | YOLOv10s | 94.2 | 94.9 | 98.1 | 74.1 | 8.1 | 62.4 | 24.5 |
| | | YOLOv11s | 95.7 | 95.4 | 98.2 | 74.6 | 9.4 | 80.3 | 21.3 |
| | | YOLOv12s | 97.1 | 94.3 | 98.3 | 74.2 | 9.1 | 49.8 | 19.3 |
| | | DLSRF-Net | 98.1 | 96.7 | 98.9 | 74.6 | 7.1 | 99.6 | 25.5 |
For the DOTA dataset, the improved model shows clear improvements in P, R, mAP50, and mAP50-95. For the SSDD dataset, the improved model demonstrates significant FPS enhancement. Experimental results indicate that under different model sizes, our algorithm outperforms other advanced algorithms on both datasets. Compared to the baseline model YOLOv11, our improved method shows varying degrees of improvement across all metrics for both datasets under both model sizes, proving the generalization ability of our improved method for UAV drone small target detection.
Discussion
Our proposed DLSRF-Net algorithm addresses key challenges in small target detection from the UAV perspective by integrating novel modules that enhance feature extraction, attention, and fusion. The DSRFAConv module effectively combines receptive field attention with depthwise separable convolution, improving the model’s ability to capture spatial features of small targets while reducing computational overhead, which is crucial for UAVs operating in real-time, resource-constrained environments. The LMLA mechanism provides a lightweight solution for multi-scale attention, enabling the model to focus on dense small targets without incurring high computational costs. The RSCDI module ensures that contextual information is preserved and irrelevant information is suppressed, leading to better feature representation for small targets in complex backgrounds.
The experimental results on the VisDrone2021 dataset demonstrate significant improvements over the baseline and state-of-the-art models. In both the N and M size categories, DLSRF-Net achieves higher precision, recall, and mAP values, indicating robust detection performance. The ablation studies confirm that each module contributes positively to overall performance, with the combination of all three yielding the best results. The visualization results further support these findings, showing that our model’s attention is more focused on small targets and less distracted by backgrounds.
Moreover, the generalization experiments on DOTA and SSDD datasets show that our algorithm maintains high performance across different UAV drone scenarios, including aerial imagery with varied objects and maritime environments. This versatility is essential for practical applications of UAV drones in diverse fields such as surveillance, monitoring, and inspection. The balance between parameter count, computational load, and detection accuracy makes DLSRF-Net suitable for deployment on edge devices, aligning with the needs of the low-altitude economy.
Conclusion
In this work, we propose DLSRF-Net, a small target detection algorithm for complex environments from the UAV perspective, based on a reconstructed YOLOv11. We design the DSRFAConv module to increase the model’s receptive field and enhance detection capability for small targets while reducing computational load. We develop the LMLA mechanism to improve attention to small targets in a lightweight manner. We create the RSCDI module to preserve feature information and suppress irrelevant information. By categorizing model sizes into N and M to meet different detection demands, our algorithm achieves significant improvements. Under size N, mAP50 and mAP50-95 reach 40.1% and 24.2%, improving by 7.2 and 5.4 percentage points over the baseline network. Under size M, mAP50 and mAP50-95 reach 47.3% and 28.6%, improving by 9.4 and 5.9 percentage points. Experimental validation shows that our model outperforms the baseline at both sizes, and at size M it surpasses the heavily parameterized RT-DETR with only 35.9% of its parameters. Additionally, generalization experiments on the DOTA and SSDD datasets confirm good generalization for UAV small target detection. In summary, our method effectively detects small targets from the UAV perspective, achieving excellent detection performance with relatively low parameter counts. Future work will focus on further lightweight design and detection speed improvement so that the model can be deployed on lightweight UAVs, better adapting to the various UAV application scenarios emerging under the development of the low-altitude economy.
