MEM-YOLO11: A Multi-Feature Selection Mechanism for Small Object Detection in UAV Drones

We present MEM-YOLO11, a model tailored for small object detection in images captured by unmanned aerial vehicles (UAVs). Detecting small objects in remote sensing imagery is difficult for several reasons: objects vary widely in scale, appear at arbitrary orientations, and are often embedded in cluttered backgrounds. The real-time constraints of UAV platforms complicate the task further. Our work enhances the YOLOv11n baseline with three purpose-built modules and a refined label assignment strategy, improving precision, recall, and mean average precision while maintaining computational efficiency.

The core contributions of this work are as follows. First, we introduce a Multi-scale Feature Extraction and Edge Enhancement (MFEE) module, which strengthens the backbone's ability to capture both fine-grained details and high-level semantics. Second, we design an Effective Learning of Feature Representations (ELFR) module that adds a higher-resolution P2 detection layer with reduced parameter overhead, using SPDConv slicing to preserve small-object information. Third, we propose a Multi-head Receptive Field Enhancement (MRFE) module that mitigates information loss due to occlusion and reduces spatial and channel redundancy. Finally, we tune the hyperparameters of the Task-Aligned Assigner (TAL) to better align the classification and localization tasks. Extensive experiments on the VisDrone2019 and WiderPerson datasets show that MEM-YOLO11 outperforms the baseline and many state-of-the-art detectors, making it a strong candidate for deployment on UAVs.

Methodology

1. MFEE Module

The MFEE module is inserted into the backbone network to enhance multi-scale feature representation and edge information. Inspired by CSPNet, the module first applies adaptive average pooling (AdaptiveAvgPool) to extract feature maps at four scales (3×3, 6×6, 9×9, and 12×12). Each scale is processed by a C3K2 convolution block so that the scales complement one another semantically. The outputs are then upsampled and fused through concatenation and convolution. In addition, an EdgeEnhancer component sharpens high-frequency edge details: it subtracts a smoothed feature map (obtained via 3×3 AvgPool2d with stride=1, padding=1) from the original input, which isolates and amplifies the high-frequency components. The resulting edge-enhanced map is processed by another C3K2 block and added back to the original feature. This design allows the backbone to better capture the boundaries of small objects in UAV imagery.
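The EdgeEnhancer step described above can be illustrated with a minimal, dependency-free sketch. It operates on a single 2D feature map stored as nested lists and assumes zero padding whose zeros count toward the mean (an assumption; the padding convention is not stated here). The `refine` argument is a hypothetical stand-in for the C3K2 block, and all function names are illustrative rather than taken from the MEM-YOLO11 code.

```python
def avg_pool_3x3(x):
    """3x3 average pooling, stride 1, zero padding 1 (padded zeros count in the mean)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        total += x[ii][jj]
            out[i][j] = total / 9.0
    return out


def edge_residual(x):
    """High-frequency part of x: the input minus its smoothed copy."""
    smooth = avg_pool_3x3(x)
    return [[x[i][j] - smooth[i][j] for j in range(len(x[0]))] for i in range(len(x))]


def edge_enhance(x, refine=lambda e: e):
    """Add the (optionally refined) edge residual back to the input.

    `refine` stands in for the C3K2 block; the identity default is an
    assumption made only to keep the sketch dependency-free.
    """
    edge = refine(edge_residual(x))
    return [[x[i][j] + edge[i][j] for j in range(len(x[0]))] for i in range(len(x))]


# A 5x5 map with a vertical step edge between columns 1 and 2.
step = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
residual = edge_residual(step)
enhanced = edge_enhance(step)
```

Away from the image border, the residual is nonzero only in the two columns adjacent to the step, so adding it back steepens the transition while leaving flat regions unchanged.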

The four adaptive pooling branches cover a spectrum from global to local contexts, avoiding the loss of detail that occurs with a single pooling scale. The smallest branch (3×3) is further processed by a local convolution to compensate for its reduced resolution. The overall multi-scale fusion is expressed as:

$$
F_{\text{fused}} = \text{Conv}\left( \text{Concat}\left( \text{Upsample}(\text{C3K2}(P_{3\times3} \circ \text{LocalConv})), \text{Upsample}(\text{C3K2}(P_{6\times6})), \text{Upsample}(\text{C3K2}(P_{9\times9})), \text{C3K2}(P_{12\times12}) \right) \right)
$$

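where $P_{s \times s}$ denotes the feature map after AdaptiveAvgPool to size $s \times s$, and $P_{3\times3} \circ \text{LocalConv}$ denotes the $3\times3$ branch after the local convolution described above.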
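To make the "global to local" spectrum of the four pooling branches concrete, the following sketch computes which input positions are averaged into each output cell along one axis. It assumes the floor/ceil bin rule commonly used by adaptive average pooling implementations; the 36-pixel axis length and the function names are hypothetical choices for illustration.

```python
def adaptive_bins(in_size, out_size):
    """Return the [start, end) input range averaged into each output cell."""
    bins = []
    for i in range(out_size):
        start = (i * in_size) // out_size              # floor(i * in / out)
        end = -((-(i + 1) * in_size) // out_size)      # ceil((i + 1) * in / out)
        bins.append((start, end))
    return bins


def adaptive_avg_pool_1d(values, out_size):
    """Average `values` into `out_size` cells using the bins above."""
    return [sum(values[a:b]) / (b - a) for a, b in adaptive_bins(len(values), out_size)]


# Width of the first bin along one axis of a hypothetical 36-pixel feature map.
widths = {}
for s in (3, 6, 9, 12):
    start, end = adaptive_bins(36, s)[0]
    widths[s] = end - start
```

Under these assumptions the 3×3 branch averages over the widest windows (the most global context), while the 12×12 branch keeps the most local detail.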

2. ELFR Module

Detecting small objects in UAV footage is particularly challenging because such objects occupy only a few pixels. Traditional approaches add a P2 detection layer to capture high-resolution features, but this often leads to excessive computation. The ELFR module introduces an improved P2 layer integrated into a modified feature pyramid. The P2 feature map first passes through an SPDConv slice operation, which reduces parameters while retaining small-object information. It is then fused with the P3 feature map and processed by the ELFR block. The ELFR block splits the input channels in a 1:3 ratio: only 25% of the channels pass through the OmniKernel module, while the remaining 75% are kept intact. The two parts are concatenated and passed through a final convolution. The OmniKernel module combines multiple kernel sizes to capture both local and global dependencies. This design enhances global-to-local feature learning in the neck network without excessive computational overhead, making it suitable for real-time UAV applications.

The ELFR output is computed as:

$$
F_{\text{ELFR}} = \text{Conv}\left( \text{Concat}\left( \text{OmniKernel}(F_{\text{partial}}),\; F_{\text{identity}} \right) \right)
$$

where $F_{\text{partial}}$ denotes the 25% channel slice, and $F_{\text{identity}}$ the remaining 75%.
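The two ideas behind ELFR, slicing without discarding pixels and processing only a quarter of the channels, can be sketched in plain Python. This is an illustration under stated assumptions: `space_to_depth` shows the kind of 2×2 strided slicing that SPDConv-style layers rely on, `partial_apply` shows the 1:3 channel split, and the `op` argument is a hypothetical placeholder for the OmniKernel module. The final convolution is omitted.

```python
def space_to_depth(x):
    """Split an HxW map (H and W even) into four (H/2)x(W/2) maps by 2x2 striding."""
    return [
        [row[dj::2] for row in x[di::2]]
        for di in (0, 1)
        for dj in (0, 1)
    ]


def partial_apply(channels, op, ratio=0.25):
    """Apply `op` to the first `ratio` of the channels; pass the rest through unchanged."""
    k = int(len(channels) * ratio)
    processed = [op(c) for c in channels[:k]]
    return processed + channels[k:]  # concatenation of the two parts


# Slicing halves the resolution but keeps every value of the 4x4 map.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
slices = space_to_depth(grid)

# Eight hypothetical one-value channels; only the first two are transformed.
channels = [[float(i)] for i in range(8)]
mixed = partial_apply(channels, op=lambda ch: [v * 10 for v in ch])
```

Because the slicing only rearranges values into extra channels, no small-object pixel is dropped, unlike strided convolution or pooling. Restricting the expensive operator to 25% of the channels is what keeps the added cost low.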

3. MRFE Module

To cope with the occlusion common in dense UAV scenes, we design the MRFE module. It consists of two sub-modules: a Spatial Reduction Unit (SRU) and a Channel Reduction Unit (CRU). The SRU suppresses spatial redundancy, while the CRU reduces channel redundancy. A MaxPooling (mix pooling) operation is used because it retains more discriminative information than average pooling. The MRFE module aggregates features from multiple receptive fields, compensating for occluded regions and enhancing robustness. The output is expressed as:

$$
F_{\text{MRFE}} = \text{CRU}\left( \text{SRU}\left( \text{MaxPool}(F_{\text{in}}) \right) \right)
$$

By combining the two reduction units, MRFE reduces model parameters and FLOPs while improving multi-scale feature fusion. This makes MEM-YOLO11 lightweight enough for deployment on edge devices such as UAVs.

4. Task-Aligned Assigner Hyperparameter Tuning

The Task-Aligned Assigner assigns positive/negative labels based on a metric $t = s^{\alpha} \times u^{\beta}$, where $s$ is the classification score and $u$ is the IoU. The hyperparameters $\alpha$ and $\beta$, along with the top-K selection, control the alignment between classification and localization tasks. We empirically tuned these parameters:

Table 1: Hyperparameter tuning results on VisDrone2019 dataset

| topK | $\alpha$ | $\beta$ | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|---|---|
| 13 | 0.5 | 6 | 44.1 | 34.6 | 34.7 | 18.1 |
| 10 | 0.5 | 6 | 44.8 | 33.8 | 34.9 | 18.5 |
| 7 | 0.5 | 6 | 44.9 | 34.2 | 35.2 | 18.6 |
| 5 | 0.5 | 6 | 44.1 | 34.5 | 35.2 | 19.3 |
| 7 | 0.75 | 6 | 44.8 | 34.9 | 35.5 | 19.1 |
| 7 | 0.5 | 5 | 45.3 | 35.1 | 36.1 | 19.5 |
| 7 | 0.5 | 4 | 45.1 | 35.0 | 35.6 | 19.4 |

The best performance is achieved with topK=7, $\alpha=0.5$, $\beta=5$, yielding the highest mAP50 of 36.1%. This configuration is adopted in all subsequent experiments.
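The alignment metric and top-K selection can be sketched as follows. This is a simplified illustration, not the assigner used in the experiments: it ranks candidates for a single ground-truth box and ignores the usual requirement that an anchor lie inside the box. The function names and the `(anchor_id, score, iou)` tuple layout are assumptions; the defaults are the tuned values reported above.

```python
def alignment_metric(score, iou, alpha=0.5, beta=5.0):
    """t = s^alpha * u^beta for one candidate (s: class score, u: IoU)."""
    return (score ** alpha) * (iou ** beta)


def select_topk(candidates, topk=7, alpha=0.5, beta=5.0):
    """Return the ids of the `topk` candidates with the largest alignment metric.

    `candidates` is a list of (anchor_id, score, iou) tuples for one ground-truth box.
    """
    ranked = sorted(
        candidates,
        key=lambda c: alignment_metric(c[1], c[2], alpha, beta),
        reverse=True,
    )
    return [c[0] for c in ranked[:topk]]


# A confident but poorly localized anchor versus a less confident, well-localized one.
candidates = [("a", 0.9, 0.5), ("b", 0.4, 0.8)]
best = select_topk(candidates, topk=1)
```

With $\beta$ much larger than $\alpha$, the IoU term dominates the ranking, so the well-localized anchor `"b"` is preferred even though its classification score is lower. Lowering `topk` makes the assignment more selective by keeping fewer positives per object.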

Experiments and Results

5. Datasets and Evaluation Metrics

We evaluate MEM-YOLO11 on two challenging datasets: VisDrone2019 (8,629 images across 10 categories including pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor) and WiderPerson (7,000 selected images with dense pedestrian scenes). Metrics include Precision (P), Recall (R), mAP50, and mAP50:95.

Precision and Recall are defined as:

$$
P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}
$$

Average Precision (AP) is the area under the Precision-Recall curve:

$$
AP = \int_0^1 P(R) \, dR
$$

Mean Average Precision (mAP) is the mean of AP over all classes.
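The definitions above can be turned into a short worked example. The sketch assumes detections are already matched to ground truth and sorted by descending confidence, and it integrates the precision-recall curve with an all-point monotone envelope. That interpolation scheme is an assumption, since the text does not state which one was used, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    `is_tp` lists, in descending confidence order, whether each detection
    matched a ground-truth object; `num_gt` is the number of such objects.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (monotone envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """Mean of AP over all classes."""
    return sum(per_class_ap) / len(per_class_ap)


ap_example = average_precision([True, False, True], num_gt=2)
```

In the example, the second detection is a false positive, so the envelope precision at full recall is 2/3 and the AP is 0.5 × 1 + 0.5 × 2/3 = 5/6.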

6. Ablation Study

We conduct ablation experiments on the VisDrone2019 dataset to verify the contribution of each proposed module. The baseline is YOLOv11n. Results are shown in Table 2.

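Table 2: Ablation experiments on VisDrone2019 (baseline: YOLOv11n)

| YOLOv11n | MFEE | ELFR | MRFE | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|---|---|---|
| | | | | 37.7 | 29.5 | 29.6 | 15.1 |
| | | | | 39.7 | 29.7 | 30.5 | 15.6 |
| | | | | 40.1 | 30.5 | 31.8 | 16.3 |
| | | | | 44.1 | 34.6 | 34.7 | 18.1 |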

Each module progressively improves performance. Relative to the baseline, the full MEM-YOLO11 model with the tuned assigner settings from Table 1 (P 45.3%, R 35.1%, mAP50 36.1%) gains 7.6 percentage points in precision, 5.6 in recall, and 6.5 in mAP50. Integrating all three modules yields the best ablation results, demonstrating their complementary effects.

7. Comparison with State-of-the-Art on VisDrone2019

We compare MEM-YOLO11 with popular detectors such as Faster R-CNN, Cascade R-CNN, RetinaNet, SSD, CenterNet, CornerNet, YOLOv3-tiny, YOLOv5n/s, YOLOv6n, YOLOX-n, YOLOv7-tiny, YOLOv8n, YOLOv9-t, YOLOv10n, and YOLOv11n. Table 3 reports the per-class mAP50 (%) for all ten categories.

Table 3: Per-class mAP50 (%) comparison on VisDrone2019 dataset

| Method | all | P1 | P2 | B1 | C | V | T1 | T2 | A | B2 | M |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | 21.5 | 19.3 | 14.6 | 5.8 | 51.2 | 32.1 | 20.9 | 13.5 | 6.8 | 30.7 | 20.8 |
| Cascade R-CNN | 22.2 | 19.5 | 13.5 | 7.6 | 52.6 | 31.5 | 20.6 | 14.8 | 8.6 | 32.9 | 20.4 |
| RetinaNet | 13.4 | 14.1 | 8.3 | 1.5 | 40.7 | 19.2 | 12.5 | 6.9 | 5.7 | 15.4 | 10.6 |
| SSD300 | 20.0 | 13.1 | 11.3 | 9.1 | 43.6 | 25.2 | 26.1 | 14.2 | 6.7 | 37.9 | 12.6 |
| RSF-SSD | 27.4 | 14.1 | 13.0 | 10.5 | 56.8 | 41 | 32.1 | 13.5 | 20.1 | 51.5 | 17.9 |
| CenterNet | 26.0 | 28.0 | 11.6 | 9.0 | 51.0 | 36.2 | 27.9 | 20.1 | 19.9 | 37.7 | 21.0 |
| CornerNet | 17.7 | 20.0 | 6.6 | 4.6 | 40.9 | 20.2 | 20.5 | 14.0 | 9.3 | 24.4 | 12.1 |
| YOLOv3-tiny | 23.7 | 17.8 | 16.6 | 3.1 | 49.2 | 11.9 | 9.8 | 7.9 | 3.5 | 11.6 | 17.6 |
| YOLOv5n | 22.0 | 30.1 | 22.6 | 2.9 | 65.7 | 15.2 | 14.4 | 8.2 | 6.3 | 31.9 | 22.4 |
| YOLOv5s | 31.3 | 39.5 | 30.1 | 9.3 | 70.1 | 34.2 | 22.3 | 17.2 | 9.3 | 40.3 | 40.5 |
| YOLOv6n | 25.5 | 27.3 | 19.7 | 3.1 | 66.9 | 32.5 | 20.5 | 15.1 | 7.1 | 35.4 | 27.6 |
| YOLOX-n | 30.7 | 33.9 | 29.6 | 7.9 | 71.1 | 36.2 | 24.7 | 18.1 | 10.2 | 38.2 | 37.4 |
| YOLOv7-tiny | 31.7 | 38.1 | 36.3 | 8.1 | 75.1 | 34.2 | 29.6 | 18.7 | 8.6 | 41.2 | 27.1 |
| YOLOv8n | 27.0 | 22.5 | 12.6 | 6.16 | 65.4 | 29.9 | 33.0 | 13.0 | 15.0 | 50.2 | 22.1 |
| YOLOv9-t | 31.1 | 29.8 | 24.5 | 8.1 | 73.8 | 35.4 | 22.6 | 20.8 | 11.7 | 45.1 | 40.8 |
| YOLOv10n | 27.3 | 24.7 | 14.1 | 7.2 | 67.3 | 30.0 | 29.0 | 12.0 | 13.5 | 48.8 | 26.4 |
| YOLOv11n | 27.7 | 24.1 | 12.5 | 6.9 | 68.1 | 30.2 | 31.1 | 14.5 | 13.2 | 51.6 | 25.0 |
| MEM-YOLO11 | 36.1 | 40.7 | 20.4 | 11.3 | 77.6 | 36.6 | 34.5 | 15.2 | 20.9 | 54.3 | 31.5 |

MEM-YOLO11 achieves the highest overall mAP50 (36.1%) and, according to Table 3, leads in 6 of the 10 categories: pedestrian (P1), bicycle (B1), car (C), truck (T1), awning-tricycle (A), and bus (B2). It trails the best competing detector on person (P2), van (V), tricycle (T2), and motor (M). Overall, these results indicate strong detection capability for UAV imagery.

8. Results on WiderPerson Dataset

To validate generalizability, we compare MEM-YOLO11 with YOLOv11n on the WiderPerson dataset, which features dense crowds and heavy occlusion. Results are shown in Table 4.

Table 4: Comparison on WiderPerson dataset

| Method | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Parameters | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv11n | 75.7 | 61.6 | 70.2 | 43.2 | 2,584,102 | 6.3 |
| MEM-YOLO11 | 82.5 | 65.8 | 75.8 | 48.9 | 2,214,945 | 5.4 |

MEM-YOLO11 improves precision by 6.8 percentage points, recall by 4.2, mAP50 by 5.6, and mAP50:95 by 5.7 (43.2% to 48.9%), while reducing both parameter count and GFLOPs. This confirms that our model is not only more accurate but also more efficient, making it particularly suitable for real-time applications on UAVs.
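The efficiency claim can be quantified directly from the parameter and GFLOPs columns of Table 4. The short calculation below is only a worked check of that arithmetic; the helper name `relative_reduction` is an illustrative choice.

```python
# Values copied from Table 4.
baseline = {"params": 2_584_102, "gflops": 6.3}   # YOLOv11n
proposed = {"params": 2_214_945, "gflops": 5.4}   # MEM-YOLO11


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


param_cut = relative_reduction(baseline["params"], proposed["params"])
gflops_cut = relative_reduction(baseline["gflops"], proposed["gflops"])
print(f"parameters: -{param_cut:.1f}%, GFLOPs: -{gflops_cut:.1f}%")
```

Both reductions come to roughly 14% relative to YOLOv11n.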

Conclusion

We have presented MEM-YOLO11, a comprehensive enhancement of YOLOv11n for small object detection in UAV imagery. By incorporating the MFEE module for multi-scale edge-enhanced features, the ELFR module for efficient high-resolution learning, the MRFE module for multi-head receptive field aggregation, and optimized TAL hyperparameters, our model achieves substantial gains in detection accuracy on both VisDrone2019 and WiderPerson. The reduction in parameters and computational cost further underscores its applicability to resource-constrained UAV platforms. Future work will explore extending the multi-feature selection mechanism to temporal video streams and adapting the model for even smaller platforms such as micro-drones.
