Multiscale Feature Fused YOLOv11n Algorithm for Unmanned Aerial Vehicle Small Target Detection

In recent years, Unmanned Aerial Vehicle (UAV) technology has advanced rapidly and found extensive applications in critical fields such as power line inspection, traffic monitoring, and agricultural protection. The flexibility and portability of UAV systems offer significant advantages over traditional visual equipment. However, UAV operations often face challenges in small target detection, such as identifying minor defects on transmission lines, distant vehicles or pedestrians in traffic surveillance, and pests or diseased plants in crops. These small targets are typically tiny, have blurred textures, and are susceptible to variations in lighting and viewpoint, making them difficult to detect accurately with existing algorithms. This limitation severely impacts the effectiveness of UAVs in practical scenarios. Therefore, developing efficient small target detection algorithms tailored to UAV platforms is crucial for enhancing their performance in complex operational environments.

In the domain of UAV small target detection, deep learning-based object detection algorithms are broadly categorized into single-stage and two-stage methods. Single-stage algorithms, such as the YOLO series, the SSD series, RetinaNet, FCOS, and DETR, perform object localization and classification simultaneously, offering high speed and real-time capability, which makes them well suited to resource-constrained UAV platforms. However, these methods generally achieve lower accuracy. Two-stage algorithms, such as Faster R-CNN, Mask R-CNN, and Cascade R-CNN, first generate region proposals and then refine localization and classification, achieving higher accuracy, especially for small targets. Nonetheless, their computational complexity makes them unsuitable for real-time UAV applications, and the region proposal mechanism may miss small objects. Overall, single-stage algorithms hold a clear advantage for UAV small target detection owing to their speed and adaptability.

Recent research has made progress in addressing these challenges. For instance, some studies have introduced Transformer-based detection heads with higher-order spatial feature extraction to capture more discriminative spatial relationships for small targets. Others have developed novel feature fusion networks that preserve fine-grained detail from lower-level feature maps, improving small target detection. Lightweight backbone networks have also been designed to reduce computational overhead while maintaining accuracy. Despite these improvements, high parameter counts and computational complexity persist, limiting deployment on UAV platforms. To overcome these limitations, we propose an enhanced YOLOv11n algorithm that incorporates multiscale feature fusion techniques. Specifically, we replace the original C3k2 module with a Mixed Aggregation Network (MANet), design a Shared Feature Pyramid Convolution (SFPC) module to substitute for the SPPF layer, and integrate a weighted Bidirectional Feature Pyramid Network (BiFPN) for better feature integration. These modifications aim to boost detection accuracy while keeping the model lightweight enough for UAV deployment.

The YOLOv11n algorithm serves as our baseline because its efficiency suits UAV platforms. Its architecture consists of a backbone for feature extraction, a neck for multiscale feature fusion, and a head for detection. The backbone includes CBS modules (convolution, batch normalization, and activation layers), C3k2 modules (stacked convolutions with Bottleneck structures), and an SPPF module (repeated max-pooling layers). The neck combines a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN) to fuse features across scales, enhancing detection for targets of various sizes. However, in UAV scenarios, small targets often lose detail through pooling operations and insufficient feature integration, which motivates our improvements.
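
As a point of reference, a minimal PyTorch sketch of the CBS building block described above (convolution, batch normalization, SiLU activation) is shown below; the class name and default kernel size are illustrative, not the official Ultralytics implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block referenced above.
    Names and defaults are illustrative, not the official implementation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```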

We first replace the C3k2 module in YOLOv11n with MANet to enhance feature extraction for small targets. MANet employs a hybrid aggregation strategy with multiple branches, including Depthwise Separable Convolution (DSConv) modules and ConvNeck modules. The DSConv branch combines 1×1 convolutions, depthwise separable convolutions, and pointwise convolutions in parallel to capture rich features efficiently, reducing computational cost while improving feature representation. The ConvNeck branch further processes features with 3×3 convolutions followed by batch normalization and SiLU activation, strengthening nonlinear modeling. The branch outputs are merged by concatenation and a 1×1 convolution. This structure allows MANet to adaptively adjust receptive fields and capture multiscale features, making it more effective for small targets in UAV imagery. The hybrid convolution strategy, mixing k×k and 3×3 convolutions, strengthens the network's ability to learn complex features and addresses issues such as blurred edges and low contrast in small objects.
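
The sketch below illustrates the kind of multi-branch aggregation described above, assuming a parallel 1×1 branch, a depthwise-separable branch, and a 3×3 ConvNeck-style branch merged by a 1×1 convolution; the class names, branch layout, and channel counts are assumptions for illustration, not the exact MANet implementation.

```python
import torch
import torch.nn as nn

class DSConvBranch(nn.Module):
    """Depthwise-separable branch: depthwise 3x3 followed by pointwise 1x1.
    A simplification of the DSConv branch described in the text."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False)
        self.pw = nn.Conv2d(c, c, 1, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class MANetBlock(nn.Module):
    """Hypothetical mixed-aggregation block: parallel 1x1, depthwise-separable,
    and 3x3 'ConvNeck' branches, concatenated and fused by a 1x1 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c_in, c_in, 1, bias=False),
                                     nn.BatchNorm2d(c_in), nn.SiLU())
        self.branch2 = DSConvBranch(c_in)
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, c_in, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(c_in), nn.SiLU())
        self.fuse = nn.Conv2d(3 * c_in, c_out, 1, bias=False)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y)
```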

Next, we design the SFPC module to replace the SPPF layer, whose pooling often discards fine detail. SFPC uses dilated convolutions with different dilation rates to extract multiscale features without a significant increase in parameters. The process begins with a 1×1 convolution that compresses the input channels to reduce redundancy. Multiple dilated convolutions with varying rates are then applied in parallel to expand the receptive field and capture both local detail and global context. The multiscale features are concatenated along the channel dimension and fused by another 1×1 convolution. This preserves the receptive field expansion of SPPF while retaining the detail crucial for small targets in UAV images: smaller dilation rates focus on local features for fine-grained detection, while larger rates capture broader context. SFPC thus mitigates the resolution loss associated with traditional pooling and improves detection under extreme scale variation.
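
A minimal sketch of an SFPC-style module under the stated design (1×1 channel compression, parallel dilated 3×3 convolutions, concatenation, 1×1 fusion) follows; the dilation rates and the channel reduction factor are assumed values.

```python
import torch
import torch.nn as nn

class SFPC(nn.Module):
    """Sketch of the described SFPC idea: compress channels with a 1x1 conv,
    apply parallel dilated 3x3 convolutions with different rates, then
    concatenate and fuse with another 1x1 conv. Rates are assumed."""
    def __init__(self, c_in, c_out, rates=(1, 2, 3, 5)):
        super().__init__()
        c_mid = c_in // 2  # channel compression to reduce redundancy
        self.compress = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.branches = nn.ModuleList([
            nn.Conv2d(c_mid, c_mid, 3, padding=r, dilation=r, bias=False)
            for r in rates
        ])
        self.fuse = nn.Conv2d(c_mid * len(rates), c_out, 1, bias=False)

    def forward(self, x):
        x = self.compress(x)
        feats = [b(x) for b in self.branches]        # multiscale receptive fields
        return self.fuse(torch.cat(feats, dim=1))    # channel-wise fusion
```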

Furthermore, we enhance the neck by integrating BiFPN in place of the original PAN structure. BiFPN optimizes feature fusion by removing nodes with only a single input connection and adding bidirectional paths, reducing computational load while improving information flow. In a traditional FPN, fusion involves multiple convolutions per feature layer, with computational complexity of $$O(K \times H \times W \times C^2)$$ for K layers, feature map size H × W, and C channels. BiFPN reduces this to $$O(H \times W \times C)$$ by minimizing fusion nodes and using efficient convolutions. It combines top-down and bottom-up pathways: high-level features are upsampled and fused with low-level features, and vice versa, enabling better integration of semantic and detail information, while additional direct connections reduce information loss during propagation. This design enhances detection of both small and large targets, making it well suited to resource-constrained UAV applications.
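
The weighted fusion at each BiFPN node can be sketched as the fast normalized fusion commonly used in BiFPN-style necks, shown below with illustrative names; this is not the exact implementation used in our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion used in BiFPN-style necks: learnable non-negative
    weights blend feature maps of the same shape. A minimal sketch."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)                  # keep weights non-negative
        w = w / (w.sum() + self.eps)        # normalize so weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, feats))

# Usage: fuse an upsampled high-level map with a same-resolution low-level map.
# fuse = WeightedFusion(2)
# p_out = fuse([p_low, F.interpolate(p_high, scale_factor=2, mode="nearest")])
```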

The improved YOLOv11n architecture integrates these components: MANet for robust feature extraction, SFPC for multiscale context, and BiFPN for efficient fusion. Together they address the key challenges of UAV small target detection, such as low resolution and complex backgrounds.

For evaluation, we use the VisDrone2019 dataset, which contains images collected in 14 cities across urban and rural areas, with 6,471 training, 548 validation, and 1,610 test images. The dataset is annotated with 10 classes: pedestrian, person, car, van, bus, truck, motorcycle, bicycle, tricycle, and awning-tricycle. Experiments are conducted on an Ubuntu system with an NVIDIA RTX 4090 GPU (24 GB VRAM), an Intel Core i7 processor, and 32 GB RAM. The model is implemented in PyTorch with CUDA 11.3 acceleration. Training uses an input resolution of 640×640 pixels, a batch size of 32, 8 data-loading threads, a learning rate of 0.01, momentum of 0.937, and 200 epochs.
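
For reproducibility, the training setup above could be expressed with the Ultralytics training interface roughly as follows; the model and dataset YAML paths are hypothetical placeholders for our modified configuration.

```python
# Training configuration sketch assuming the Ultralytics YOLO interface;
# the model and dataset YAML paths are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")      # placeholder for the modified model config
model.train(
    data="VisDrone.yaml",         # dataset config (placeholder path)
    imgsz=640,                    # 640x640 input resolution
    batch=32,
    workers=8,                    # data-loading threads
    lr0=0.01,                     # initial learning rate
    momentum=0.937,
    epochs=200,
)
```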

We adopt standard evaluation metrics: mean Average Precision at an IoU threshold of 0.5 (mAP50), precision (P), recall (R), GFLOPs, and parameter count. Precision measures the accuracy of positive predictions, defined as $$P = \frac{TP}{TP + FP}$$, where TP denotes true positives and FP false positives. Recall measures the ability to find all positives, given by $$R = \frac{TP}{TP + FN}$$, where FN denotes false negatives. The average precision (AP) for each class is the area under the precision–recall curve, $$AP = \int_0^1 P(R)\,dR$$, and mAP is the mean over all n classes: $$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$. GFLOPs measure computational cost in billions of floating-point operations per forward pass, and the parameter count reflects model size.
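
These metrics can be computed from raw detection counts and a precision–recall curve roughly as follows; the trapezoidal integration is a simplification of the interpolated AP used by standard detection evaluators.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, as defined above."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Approximate AP = integral of P(R) dR via the trapezoidal rule,
    assuming recalls are sorted in increasing order."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))

# mAP is the mean of per-class AP values, e.g.:
# map50 = np.mean([average_precision(r_i, p_i) for r_i, p_i in per_class_curves])
```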

Comparative experiments with popular algorithms demonstrate the effectiveness of our improved YOLOv11n model. Results are summarized in Table 1.

Table 1: Comparative Results of Different Models on VisDrone Dataset
| Model | Parameters (×10⁶) | GFLOPs | P (%) | R (%) | mAP50 (%) |
|---|---|---|---|---|---|
| SSD | 12.30 | 63.2 | 20.7 | 35.1 | 23.8 |
| Faster R-CNN | 63.20 | 370.0 | 34.2 | 36.0 | 30.7 |
| YOLOv3-tiny | 12.10 | 18.9 | 38.7 | 24.2 | 23.5 |
| YOLOv5n | 2.51 | 7.1 | 42.1 | 32.0 | 31.8 |
| YOLOv6n | 4.26 | 11.4 | 40.4 | 31.2 | 29.4 |
| YOLOv8n | 3.01 | 8.9 | 44.7 | 33.9 | 33.4 |
| YOLOv10n | 2.26 | 6.5 | 44.2 | 34.2 | 34.2 |
| YOLOv11n (Baseline) | 2.58 | 6.3 | 43.2 | 33.7 | 32.8 |
| YOLOv11s | 9.40 | 21.3 | 50.7 | 38.0 | 38.8 |
| Our Method | 2.97 | 11.0 | 48.1 | 37.7 | 38.4 |

Our method achieves a mAP50 of 38.4%, outperforming the baseline YOLOv11n by 5.6 percentage points as well as other lightweight models such as YOLOv5n, YOLOv8n, and YOLOv10n. Although YOLOv11s attains a slightly higher mAP50 (38.8%), our model requires only 31.6% of its parameters and 51.6% of its GFLOPs, making it more suitable for UAV deployment. This highlights the trade-off between accuracy and efficiency in practical UAV applications.

Ablation studies validate the contribution of each component, as shown in Table 2. We incrementally add MANet, SFPC, and BiFPN to the baseline.

Table 2: Ablation Study Results
| Experiment | MANet | SFPC | BiFPN | Parameters (×10⁶) | GFLOPs | P (%) | R (%) | mAP50 (%) |
|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | × | × | × | 2.58 | 6.3 | 43.2 | 33.7 | 32.8 |
| 2 | ✓ | × | × | 3.92 | 11.3 | 47.5 | 36.1 | 36.7 |
| 3 | ✓ | ✓ | × | 4.06 | 11.3 | 47.1 | 37.4 | 37.3 |
| 4 (Full) | ✓ | ✓ | ✓ | 2.97 | 11.0 | 48.1 | 37.9 | 38.4 |

In Experiment 2, replacing C3k2 with MANet increases mAP50 by 3.9 percentage points, demonstrating its effectiveness in multiscale feature extraction through DSConv and hybrid convolutions. The multi-branch structure captures both fine detail and global context, improving sensitivity to small targets in UAV imagery, although the parameter count rises from 2.58M to 3.92M due to the added complexity.

Adding SFPC in Experiment 3 raises mAP50 by a further 0.6 percentage points, as the parallel dilated convolutions with varying rates enhance both local and global feature capture. This is particularly beneficial for small targets in low-light UAV conditions, where edges are obscured by noise. Parameters increase to 4.06M with the additional convolution layers.

Integrating BiFPN in Experiment 4 improves mAP50 by another 1.1 percentage points while reducing parameters to 2.97M and GFLOPs to 11.0. BiFPN's bidirectional paths and pruned fusion nodes enhance feature integration without excessive computation, which is crucial for resource-constrained UAV systems. Overall, the full model achieves a 5.6-percentage-point mAP50 gain over the baseline.

Qualitative results on the test set show that our method detects more small targets in both dense and sparse scenes than YOLOv11n. For instance, in urban road scenarios it accurately identifies tricycles and motorcycles; in pedestrian areas it detects bicycles and cars on crosswalks; and in crowded settings it recognizes more persons and pedestrians. These improvements underscore the algorithm's robustness in real-world UAV applications.

In conclusion, our enhanced YOLOv11n algorithm addresses small target detection in UAV imagery by incorporating MANet for feature extraction, SFPC for multiscale context, and BiFPN for efficient fusion. Experiments on the VisDrone dataset confirm significant accuracy gains at much lower computational cost than YOLOv11s, achieving a mAP50 of 38.4%. This makes it well suited for UAV deployments in fields such as surveillance and inspection. Future work will focus on optimizing feature extraction for extreme conditions, such as nighttime operation, using advanced attention mechanisms and fusion strategies to better handle low-light scenarios.
