Enhanced YOLOv11n for UAV-Based Small Object Detection in Forest Search and Rescue

The application of unmanned aerial vehicles (UAVs), or drones, in forest search and rescue (SAR) missions represents a significant technological advancement, offering the potential to cover vast, rugged, and inaccessible terrains rapidly. The core of such autonomous SAR systems lies in robust computer vision algorithms capable of reliably detecting humans, often appearing as small, occluded objects within complex, cluttered backgrounds. Traditional manual search methods are inefficient and perilous, underscoring the need for automated, vision-based solutions deployed on UAV platforms.

Deep learning has revolutionized object detection, with algorithms broadly categorized into two-stage and single-stage detectors. While two-stage detectors such as the R-CNN series offer high accuracy, their computational complexity hinders real-time performance, a critical requirement for agile UAV operations. Single-stage detectors, particularly the YOLO (You Only Look Once) family, have become the de facto standard for real-time applications due to their excellent speed-accuracy trade-off. However, detecting small targets from a UAV’s perspective remains a formidable challenge. Targets occupy few pixels, their features are minimal and easily confused with background elements like rocks or vegetation, and they are frequently occluded by tree canopies. Furthermore, varying lighting conditions, shadows, and the sheer scale of the search area exacerbate these difficulties. While recent research has introduced enhancements such as Transformer-based heads, parallel dilated convolutions, and specialized loss functions, problems remain: missed detections, false alarms, and a model size and computational cost that are often too high, relative to the accuracy gained, for onboard deployment.

To address these challenges, this work presents a comprehensively improved YOLOv11n model specifically optimized for small object detection in UAV-based forest SAR scenarios. Our contributions are threefold: 1) We augment the backbone network by integrating a Context-Guided mechanism into the C3k2 module, enhancing its ability to suppress background noise and aggregate multi-contextual information crucial for distinguishing small targets. 2) We redesign the Neck with a novel, lightweight architecture combining a High-Level Screening Feature Fusion Pyramid (HSFPN) for controlled multi-scale fusion and a Context Anchor Attention (CAA) module for robust spatial dependency modeling, significantly boosting feature representation for multi-scale targets. 3) We embed an Efficient Multi-Scale Attention (EMA) module before the detection heads to dynamically refine feature weights, thereby increasing sensitivity to small, critical features. Extensive experiments on a private forest SAR dataset demonstrate that our proposed model achieves superior detection accuracy while reducing parameters and computational costs, offering an efficient, precise, and practical solution for real-time drone-based search and rescue.

1. Overview of the YOLOv11 Architecture

YOLOv11, released by Ultralytics, represents a continued evolution in the YOLO series, pushing the boundaries of real-time object detection with enhanced accuracy and efficiency. Its architecture retains the proven backbone-neck-head design pattern but introduces key refinements. The backbone is responsible for hierarchical feature extraction from the input image. A significant update in YOLOv11 is the replacement of the C2f module with the C3k2 module. The C3k2 module employs parallel convolutional branches with different kernel sizes (e.g., 3×3 and 5×5) within a Cross-Stage Partial framework. This design allows the network to capture multi-scale contextual information simultaneously, improving adaptability to objects of varying sizes, a vital property for UAV imagery where target scale can change rapidly with altitude. Another notable addition is the C2PSA module, which integrates Pyramid Slice Attention (PSA) into a CSP structure. The PSA mechanism adaptively allocates attention across different feature scales, enhancing the model’s focus on informative regions while maintaining computational efficiency, which is particularly beneficial in the complex, cluttered backgrounds common in forest environments captured by drones.

The Neck, typically a Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) combination, facilitates multi-scale feature fusion. It merges deep, semantically rich features from the backbone’s later layers with shallow, high-resolution features from earlier layers, enabling the detection of objects across a wide scale range. Finally, the detection Head performs the ultimate task of classification and bounding box regression on the fused features from the Neck, outputting the final detections. While this architecture is powerful, its default configuration, especially the lightweight ‘nano’ (n) variant, requires targeted optimizations to meet the stringent demands of small, obscured target detection in dynamic UAV operations.

2. Proposed Improvements to YOLOv11n

The standard YOLOv11n model, while efficient, shows limitations in the specific context of UAV-based forest SAR: its feature extraction may lack sufficient contextual reasoning for small targets against noisy backgrounds, its neck structure may lose fine-grained details during fusion, and its head may not optimally prioritize subtle features. To overcome these limitations, we introduce targeted enhancements across all three stages of the architecture.

2.1 Enhanced Backbone with Context-Guided C3k2

The C3k2 module in the backbone is pivotal for feature extraction. However, for small targets in UAV imagery, its standard convolutions may have a limited effective receptive field and insufficient integration of broader contextual cues. We integrate principles from the lightweight Context Guided Network (CGNet) to refine this module, creating the C3k2-ContextGuided block. The core idea is to jointly model local features, surrounding context, and global context. Within the module, a branch employs 3×3 depthwise separable convolutions to efficiently capture fine-grained local details of a potential target. A parallel branch uses 3×3 dilated convolutions to aggregate a wider surrounding context without reducing spatial resolution, helping to understand the target’s environment. These features are then concatenated. Finally, a global context is extracted via Global Average Pooling and fused back, providing a scene-level understanding that aids in suppressing irrelevant background activations. This process can be summarized as learning a joint feature map that combines localized detail with hierarchical contextual awareness, making the backbone more robust to the complex textures and patterns encountered in forest scenes by a SAR drone. The integration of this mechanism is lightweight and maintains a similar parameter count to the original module while significantly boosting discriminative power for small, ambiguous objects.
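The joint modeling of local features, surrounding context, and global context described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' C3k2-ContextGuided implementation: the class name `ContextGuidedBlock`, the 1×1 channel reduction, the dilation rate of 2, the bottleneck ratio, and the residual connection are all hypothetical choices loosely modeled on CGNet.

```python
import torch
import torch.nn as nn


class ContextGuidedBlock(nn.Module):
    """Hypothetical sketch: local + surrounding + global context (CGNet-style)."""

    def __init__(self, channels: int, dilation: int = 2, reduction: int = 8):
        super().__init__()
        assert channels % 2 == 0, "channels must be even so the two branches concatenate back"
        half = channels // 2
        # Pointwise reduction so the two branches together restore `channels`.
        self.reduce = nn.Conv2d(channels, half, kernel_size=1, bias=False)
        # Local branch: 3x3 depthwise convolution (one filter per channel).
        self.local = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        # Surrounding branch: 3x3 depthwise dilated convolution; padding equal
        # to the dilation keeps the spatial resolution unchanged.
        self.surround = nn.Conv2d(half, half, 3, padding=dilation,
                                  dilation=dilation, groups=half, bias=False)
        self.norm_act = nn.Sequential(nn.BatchNorm2d(channels), nn.PReLU(channels))
        # Global branch: global average pooling, bottleneck, sigmoid channel gate.
        hidden = max(channels // reduction, 1)
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        joint = torch.cat([self.local(y), self.surround(y)], dim=1)
        joint = self.norm_act(joint)
        out = joint * self.global_gate(joint)  # scene-level channel reweighting
        return x + out  # residual connection (an assumption of this sketch)


block = ContextGuidedBlock(64).eval()
x = torch.randn(1, 64, 40, 40)
with torch.no_grad():
    y = block(x)
print(tuple(y.shape))
```

Because both convolution branches are depthwise, the block stays cheap; whether its parameter count matches the original C3k2 bottleneck depends on how it is wired into the module, which this sketch does not attempt to reproduce.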

2.2 Redesigned Neck with CAA-HSFPN

The Neck is crucial for constructing a strong feature pyramid. Traditional FPN/PAN structures can suffer from information loss during up/down-sampling and may not optimally fuse features across scales for tiny objects. We propose a new Neck architecture that synergistically combines a High-Level Screening Feature Fusion Pyramid (HSFPN) with a Context Anchor Attention (CAA) mechanism, forming the CAA-HSFPN module.

The HSFPN component introduces a selective feature fusion strategy. Instead of direct element-wise addition, it uses a channel attention mechanism on high-level features to generate a soft selection weight map. This map is then used to gate and filter the corresponding lower-level features before fusion. This process, formalized for a high-level feature \( f_{high} \) and a low-level feature \( f_{low} \), can be described as:
$$ f_{att} = \sigma(MLP(AvgPool(f_{high})) + MLP(MaxPool(f_{high}))) $$
$$ f_{out} = f_{high} + f_{att} \otimes Conv(f_{low}) $$
where \( \sigma \) is the sigmoid function, \( MLP \) is a shared multi-layer perceptron, \( \otimes \) denotes element-wise multiplication, and \( Conv \) is a convolution layer for channel alignment. This allows semantically rich high-level features to guide the integration of precise spatial details from low-level features, selectively enhancing relevant information and mitigating noise.
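The two equations above can be written as a short PyTorch module. This is a sketch under assumptions rather than the authors' code: the name `SelectiveFusion`, the reduction ratio in the shared MLP, and the 2× transposed-convolution upsampling that brings \( f_{high} \) to the resolution of \( f_{low} \) are illustrative choices.

```python
import torch
import torch.nn as nn


class SelectiveFusion(nn.Module):
    """Hypothetical sketch of f_out = f_high + f_att * Conv(f_low)."""

    def __init__(self, high_ch: int, low_ch: int, reduction: int = 4):
        super().__init__()
        hidden = max(high_ch // reduction, 1)
        # Shared MLP, implemented with 1x1 convolutions on pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(high_ch, hidden, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, high_ch, 1, bias=False),
        )
        # Conv for channel alignment of the low-level feature.
        self.align = nn.Conv2d(low_ch, high_ch, 1, bias=False)
        # Assumption: the high-level map has half the resolution of the low-level map.
        self.up = nn.ConvTranspose2d(high_ch, high_ch, kernel_size=2, stride=2)

    def forward(self, f_high: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        f_high = self.up(f_high)                           # match f_low's H and W
        avg = torch.mean(f_high, dim=(2, 3), keepdim=True)  # AvgPool(f_high)
        mx = torch.amax(f_high, dim=(2, 3), keepdim=True)   # MaxPool(f_high)
        f_att = torch.sigmoid(self.mlp(avg) + self.mlp(mx)) # per-channel gate in (0, 1)
        return f_high + f_att * self.align(f_low)


fuse = SelectiveFusion(high_ch=128, low_ch=64).eval()
f_high = torch.randn(1, 128, 20, 20)
f_low = torch.randn(1, 64, 40, 40)
with torch.no_grad():
    f_out = fuse(f_high, f_low)
print(tuple(f_out.shape))
```

The gate has shape (N, C, 1, 1), so it broadcasts over all spatial positions: each channel of the aligned low-level feature is scaled by a single weight derived from the high-level feature.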

However, channel attention alone lacks explicit spatial relationship modeling, and the CAA module fills this gap. The CAA mechanism is designed to capture long-range spatial dependencies. It first uses average pooling and a convolution to gather local context. Then, it employs two sequential depthwise separable convolutions to efficiently model interactions across the entire spatial domain, establishing dependencies between distant pixels that might belong to the same occluded person. Finally, a second convolution followed by a sigmoid generates a spatial attention map that re-weights the feature map, emphasizing important regions. The output of the CAA process enhances the feature \( X \) as:
$$ Y = X \otimes \sigma(Conv_2(DWConv_2(DWConv_1(Conv_1(Pool(X)))))) $$
where \( Pool \) is average pooling, \( Conv_1 \) and \( Conv_2 \) are convolution layers, and \( DWConv \) denotes depthwise separable convolution.
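A compact PyTorch sketch of this attention path is given below. It follows the textual description (pooling, a convolution, two depthwise convolutions, a final convolution, and a sigmoid gate), with several assumptions that are not stated in the text: the 7×7 pooling window, the 11-tap horizontal and vertical strip kernels, and the use of 1×1 convolutions are borrowed from the commonly published CAA design, and the class name `ContextAnchorAttention` is hypothetical.

```python
import torch
import torch.nn as nn


class ContextAnchorAttention(nn.Module):
    """Hypothetical sketch: Y = X * sigmoid(Conv2(DW2(DW1(Conv1(Pool(X))))))."""

    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        # Stride-1 average pooling keeps the spatial size (7x7 window, padding 3).
        self.pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)
        self.conv1 = nn.Conv2d(channels, channels, 1)
        # Two depthwise strip convolutions: horizontal (1 x k), then vertical (k x 1).
        self.dw_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                              groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                              groups=channels)
        self.conv2 = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv2(self.dw_v(self.dw_h(self.conv1(self.pool(x)))))
        return x * torch.sigmoid(attn)  # gate values lie in (0, 1)


caa = ContextAnchorAttention(32).eval()
x = torch.randn(2, 32, 24, 24)
with torch.no_grad():
    y = caa(x)
print(tuple(y.shape))
```

The pair of strip kernels covers an 11×11 neighbourhood with 22 weights per channel instead of 121, which is how this kind of block keeps long-range context cheap.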

In our CAA-HSFPN module, we replace the standard convolutions in the feature processing paths with CAA blocks. This equips the Neck with powerful spatial modeling capability. The overall flow in the new Neck involves refined upsampling (using nn.ConvTranspose2d), feature processing via CAA-HSFPN blocks, and controlled fusion using concatenation, addition, and multiplication operations. This design ensures that multi-scale features are not only fused selectively but are also spatially coherent and contextually enriched, leading to significantly improved feature maps for detecting small, partially visible targets from a UAV’s vantage point.

2.3 Detection Head Augmented with EMA Attention

Before the final classification and regression layers in the detection head, we insert an Efficient Multi-Scale Attention (EMA) module. The EMA module is designed to preserve precise spatial information while capturing cross-dimensional dependencies. It first splits the \( C \) channels into \( G \) groups and folds the groups into the batch dimension so that each group is processed independently. For one feature group of shape \( (C/G, H, W) \), average pooling along the height and along the width yields two directional descriptors, \( X_{avg}^H \) (shape \( C/G, 1, W \)) and \( X_{avg}^W \) (shape \( C/G, H, 1 \)). These are concatenated, processed by a shared 1×1 convolution, and split again to produce attention maps for the two directions, which are then applied back to the group features. In parallel, a second branch uses a 3×3 convolution to capture local multi-scale context. The outputs from both branches are combined and calibrated using global descriptors generated via 2D global average pooling, yielding the final attention weights. The output is a feature map in which both spatial and channel dimensions have been dynamically recalibrated. This cross-dimensional interaction emphasizes the informative pixels and channels of the feature \( F \), which is critical for highlighting the subtle signatures of a distant person in a forest scene captured by a drone. By embedding EMA before the detection heads, the network learns to allocate higher attention weights to the most discriminative features for small objects, which is intended to reduce false negatives and improve localization accuracy.
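The grouping, directional pooling, and cross-branch calibration steps can be sketched in PyTorch as below. This follows the widely circulated reference design of EMA as best it can be reconstructed from the description; the group count of 8, the GroupNorm layer, and the softmax-weighted matrix products used for calibration are assumptions not spelled out in the text, and the class name is illustrative.

```python
import torch
import torch.nn as nn


class EMASketch(nn.Module):
    """Hypothetical sketch of Efficient Multi-Scale Attention."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        gx = x.reshape(b * self.g, c // self.g, h, w)   # fold groups into batch
        strip_h = gx.mean(dim=3, keepdim=True)          # (B*G, C/G, H, 1)
        strip_w = gx.mean(dim=2, keepdim=True)          # (B*G, C/G, 1, W)
        # Concatenate along one axis, apply the shared 1x1 conv, split again.
        hw = self.conv1x1(torch.cat([strip_h, strip_w.transpose(2, 3)], dim=2))
        a_h, a_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * a_h.sigmoid() * a_w.transpose(2, 3).sigmoid())
        x2 = self.conv3x3(gx)                           # local-context branch
        # Global descriptors (2D global average pooling) from each branch.
        d1 = torch.softmax(x1.mean(dim=(2, 3)), dim=-1).unsqueeze(1)  # (B*G, 1, C/G)
        d2 = torch.softmax(x2.mean(dim=(2, 3)), dim=-1).unsqueeze(1)
        # Each descriptor weights the other branch's flattened features.
        weights = d1 @ x2.flatten(2) + d2 @ x1.flatten(2)             # (B*G, 1, H*W)
        weights = weights.reshape(b * self.g, 1, h, w).sigmoid()
        return (gx * weights).reshape(b, c, h, w)


ema = EMASketch(64).eval()
x = torch.randn(2, 64, 16, 20)
with torch.no_grad():
    y = ema(x)
print(tuple(y.shape))
```

The input and output shapes are identical, so a module of this kind can be placed in front of each detection head without changing the surrounding layers.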

3. Experiments and Results Analysis

3.1 Dataset and Implementation Details

We evaluate our method on a private Forest SAR dataset comprising 4,098 high-resolution (4000×3000) images captured by UAVs under various forest conditions, lighting, and seasons. The dataset is annotated with bounding boxes for persons (the target class) and is split into training, validation, and test sets in an 8:1:1 ratio. All experiments are conducted on a system with an NVIDIA RTX 2080 Ti GPU using the PyTorch framework. We train models for 500 epochs with a batch size of 8 and an input image size of 1280×1280. No pre-trained weights are used, so that the comparison reflects architectural differences only.
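For illustration, an 8:1:1 split of 4,098 images can be produced as follows. The exact per-split counts and the splitting procedure are not given in the text, so the seed, the rounding rule (remainder assigned to the test set), and the function name are assumptions of this sketch.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test by integer ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for a reproducible split
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train_ids, val_ids, test_ids = split_indices(4098)
print(len(train_ids), len(val_ids), len(test_ids))  # 3278 409 411
```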

3.2 Evaluation Metrics

We employ standard object detection metrics: Precision (P), Recall (R), mean Average Precision at IoU threshold 0.5 (mAP@0.5), and mAP across IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95). Precision and Recall are defined as:
$$ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$
where \( TP \), \( FP \), and \( FN \) are true positives, false positives, and false negatives, respectively. The Average Precision (AP) is the area under the Precision-Recall curve, and mAP is the mean AP over all classes (in our case, one class). Given the resource constraints of UAV platforms, we also report the number of parameters (Params) and Giga Floating Point Operations (GFLOPs) to measure model complexity and computational cost.
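The definitions above can be made concrete with a small pure-Python example. It assumes detections have already been matched to ground truth at the chosen IoU threshold, and it integrates the precision-recall curve with all-point interpolation; evaluation toolkits may sample the curve differently (for example at fixed recall points), so absolute values can differ slightly. The function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN), with 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for i in order:  # sweep the confidence threshold from high to low
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    return sum((recalls[k] - recalls[k - 1]) * precisions[k]
               for k in range(1, len(recalls)))


p, r = precision_recall(tp=3, fp=1, fn=1)
ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6],
                       is_tp=[True, False, True, True],
                       num_gt=4)
print(p, r, ap)  # 0.75 0.75 0.625
```

With a single class, as in this dataset, mAP equals the AP of the person class.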

3.3 Comparative Experiments

We compare our improved YOLOv11n model against several state-of-the-art and popular object detectors. The results, presented in Table 1, highlight the effectiveness of our approach for the UAV-based small target detection task.

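Table 1. Comparison with other detectors.

| Method | P | R | mAP50 | mAP50-95 | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 0.593 | 0.612 | 0.547 | 0.498 | 61.519 | 360.0 |
| SSD | 0.469 | 0.540 | 0.441 | 0.325 | 13.487 | 64.5 |
| YOLOv7n | 0.817 | 0.691 | 0.754 | 0.507 | 6.527 | 15.3 |
| YOLOv8n | 0.825 | 0.702 | 0.765 | 0.503 | 3.241 | 7.3 |
| YOLOv10n | 0.856 | 0.711 | 0.684 | 0.492 | 2.167 | 6.5 |
| YOLOv11n (Baseline) | 0.843 | 0.698 | 0.781 | 0.510 | 2.582 | 6.3 |
| Ours | 0.914 | 0.746 | 0.846 | 0.538 | 1.799 | 5.9 |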

Our model achieves the highest Precision (0.914), Recall (0.746), mAP50 (0.846), and mAP50-95 (0.538) among the compared methods. Relative to the baseline YOLOv11n, mAP50 rises by 6.5 percentage points (0.781 to 0.846), while the parameter count falls by 0.78M (2.582M to 1.799M) and the computational cost by 0.4 GFLOPs (6.3 to 5.9). The model substantially outperforms the two-stage Faster R-CNN in both accuracy and efficiency, making it far more suitable for real-time UAV applications. It also surpasses the other lightweight YOLO variants (v7n, v8n, v10n), giving the best balance of accuracy and efficiency among the compared detectors on this dataset.
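The differences quoted from Table 1 can be reproduced with a few lines of Python. The snippet also separates the absolute gain in mAP50 (in percentage points) from the relative gain, since the two are easy to confuse; the dictionary keys are illustrative names.

```python
baseline = {"P": 0.843, "R": 0.698, "mAP50": 0.781, "mAP50-95": 0.510,
            "params_M": 2.582, "gflops": 6.3}
ours = {"P": 0.914, "R": 0.746, "mAP50": 0.846, "mAP50-95": 0.538,
        "params_M": 1.799, "gflops": 5.9}

# Absolute differences (Ours minus baseline), rounded to table precision.
delta = {k: round(ours[k] - baseline[k], 3) for k in baseline}

# Relative changes, in percent of the baseline value.
rel_map50 = round(100 * (ours["mAP50"] - baseline["mAP50"]) / baseline["mAP50"], 1)
rel_params = round(100 * (baseline["params_M"] - ours["params_M"]) / baseline["params_M"], 1)

print(delta["mAP50"], delta["params_M"], delta["gflops"])  # 0.065 -0.783 -0.4
print(rel_map50, rel_params)                               # 8.3 30.3
```

In other words, the 6.5-point absolute gain corresponds to a relative mAP50 improvement of about 8.3%, and the parameter count drops by about 30.3%.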

3.4 Ablation Studies

To dissect the contribution of each proposed component, we conduct a series of ablation experiments starting from the YOLOv11n baseline. The results are systematically presented in Table 2.

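Table 2. Ablation results for the proposed components (C3k2-CG, CAA, HSFPN, EMA). The marks showing which components are enabled in each intermediate row are missing from the source; the first row is the YOLOv11n baseline and the last row is the full model.

| Configuration | P | R | mAP50 | mAP50-95 | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.843 | 0.698 | 0.781 | 0.510 | 2.582 | 6.3 |
|  | 0.884 | 0.740 | 0.843 | 0.531 | 2.180 | 5.4 |
|  | 0.842 | 0.719 | 0.791 | 0.510 | 2.000 | 6.4 |
|  | 0.828 | 0.732 | 0.791 | 0.500 | 1.832 | 5.6 |
|  | 0.895 | 0.678 | 0.813 | 0.519 | 2.000 | 6.4 |
|  | 0.874 | 0.744 | 0.844 | 0.529 | 2.596 | 6.5 |
|  | 0.851 | 0.706 | 0.782 | 0.501 | 1.797 | 5.8 |
|  | 0.867 | 0.733 | 0.827 | 0.520 | 2.194 | 5.6 |
|  | 0.872 | 0.678 | 0.782 | 0.518 | 2.002 | 6.5 |
| Full model (Ours) | 0.914 | 0.746 | 0.846 | 0.538 | 1.799 | 5.9 |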

The ablation study yields clear insights. The C3k2-ContextGuided module provides the most substantial individual boost to mAP50 (+6.2 percentage points), confirming the importance of enriched contextual feature extraction in the backbone for UAV imagery. The CAA-HSFPN neck (combining CAA and HSFPN) effectively improves overall metrics while reducing parameters. The EMA module notably increases Recall (+4.6 percentage points when added alone to the baseline), showing its strength in reducing missed detections. The full combination of all three proposed improvements yields the best overall performance, achieving the peak scores in Precision, Recall, mAP50, and mAP50-95. At 1.799M parameters it is also within 0.002M of the smallest configuration in Table 2 (1.797M), although its 5.9 GFLOPs is not the lowest computational cost in the table (5.4 GFLOPs). This combined effect supports our holistic design approach for optimizing a drone-based detection system.

3.5 Qualitative Results

Visual comparisons further illustrate the advantages of our model. In challenging scenarios with complex backgrounds, partial occlusion, and varying lighting, the baseline YOLOv11n often produces false positives (mistaking background elements for persons) or fails to detect very small or obscured targets. In contrast, our improved model demonstrates superior robustness. It suppresses false alarms caused by background clutter like rocks and dense vegetation, and it achieves higher confidence and more accurate localization for genuine small targets. This visual evidence aligns with the quantitative metrics, showing that our enhancements effectively address the core difficulties of small object detection from a UAV’s perspective in forest environments.

4. Conclusion

In this work, we have presented a comprehensively enhanced YOLOv11n model tailored for the critical task of small object detection in UAV-based forest search and rescue operations. By integrating a Context-Guided mechanism into the backbone, designing a novel CAA-HSFPN neck for efficient and spatially-aware multi-scale fusion, and augmenting the detection heads with an EMA attention module, we have addressed key limitations related to background clutter, feature degradation, and insufficient focus on subtle target signatures. Extensive experiments on a dedicated forest SAR dataset demonstrate that our model achieves the best accuracy-efficiency trade-off among the compared detectors, improving mAP50 by 6.5 percentage points over the baseline while simultaneously reducing the model’s parameter count and computational footprint. This makes it a strong candidate for deployment on resource-constrained UAV platforms requiring real-time, high-precision detection capabilities in complex, dynamic environments. Future work will focus on expanding the dataset to include more extreme environmental conditions and exploring further optimizations for edge deployment on diverse UAV hardware. The proposed method offers a practical and effective step toward more autonomous and reliable drone-assisted search and rescue systems.
