RSB-YOLOv11: A High-Precision Target Detection Framework for UAV Recognition in Complex Environments

The rapid proliferation of Unmanned Aerial Vehicles (UAVs), or drones, has revolutionized numerous sectors from logistics and agriculture to surveillance and disaster response. In China, the UAV industry has experienced explosive growth, leading to widespread adoption. However, this accessibility has also given rise to significant security concerns, notably the phenomenon of “black flights” – unauthorized drone operations near sensitive areas like airports, government facilities, and public events. These incidents pose substantial risks to aviation safety, critical infrastructure, and personal privacy. Therefore, developing robust, accurate, and real-time UAV detection systems is a critical technological imperative for ensuring public safety and national security in the era of ubiquitous drones.

Traditional methods for low-altitude UAV detection, such as radar, radio frequency (RF) scanners, and acoustic sensors, are often hampered by limitations including high cost, limited effective range, and vulnerability to environmental interference. For instance, radar systems may struggle to distinguish a small UAV from a bird, leading to false alarms. Computer vision-based detection, powered by deep learning, offers a promising alternative. It leverages widely available optical sensors (cameras) and can provide rich contextual information, making it suitable for integration into comprehensive anti-UAV systems. The core challenge lies in achieving high-precision detection of UAVs that appear as small, fast-moving targets against complex, dynamic backgrounds like urban skylines, cloudy skies, or forested areas.

While general-purpose object detectors like the YOLO (You Only Look Once) series have shown remarkable performance, they are not optimized for the specific difficulties presented by UAV detection. The primary challenges include: (1) Extreme Scale Variation: A drone’s apparent size in an image can vary dramatically with distance, often making it a “small object” occupying fewer pixels. (2) Complex Background Clutter: Urban environments, clouds, and trees create visual noise that can obscure or mimic drone features. (3) Diverse Appearances: Drones come in various shapes, colors, and configurations (quadcopters, hexacopters, fixed-wing). (4) Real-Time Requirement: Effective countermeasures demand detection and tracking with very low latency.

To address these challenges, we propose RSB-YOLOv11 (Recognition of Small targets with a Boosted framework-YOLOv11), a novel and efficient model specifically designed for high-precision UAV detection. Our contributions are threefold:

  1. We introduce a lightweight yet powerful building block, the C3k2-RepVitBlock (RVB), which replaces standard convolutions. It employs a re-parameterizable structure with parallel depth-wise convolutions for efficient spatial and channel modeling, significantly reducing computational redundancy while enhancing the multi-scale feature representation crucial for small UAV targets.
  2. We incorporate a Channel-Spatial Attention (CSA) mechanism into the backbone network’s critical modules. Unlike sequential attention modules, CSA performs simultaneous element-wise weighting across channels and space, enabling more focused and synergistic feature refinement. This allows the model to better suppress complex background noise and amplify discriminative features of drones.
  3. We redesign the neck of the network by constructing a Scale-aware Bidirectional Feature Pyramid Network (S-BiFPN). This structure introduces an additional high-resolution feature layer (P2) dedicated to small targets and employs learnable, bi-directional cross-scale connections with weighted fusion. This ensures that fine-grained details from early layers and high-level semantic information from deep layers are effectively integrated, greatly improving the detection capability for multi-scale UAVs.

Extensive experiments on the challenging Complex Background Dataset (CBD) demonstrate that RSB-YOLOv11 achieves state-of-the-art performance. Compared to the baseline YOLOv11n, our model improves mAP@0.5 by 3.7% (from 90.6% to 94.3%) and mAP@0.5:0.95 by 4.4% (from 53.3% to 57.7%), while simultaneously reducing the number of parameters. This balance of high accuracy and efficiency makes it highly suitable for real-world deployment in anti-UAV systems.

1. Related Work

Deep Learning-based Object Detection. Object detection frameworks are primarily categorized into two-stage and one-stage detectors. Two-stage detectors, like Faster R-CNN, first generate region proposals and then classify and refine them, offering high accuracy at the cost of speed. One-stage detectors, such as the YOLO family and SSD, perform localization and classification in a single pass, favoring inference speed. For real-time applications like UAV surveillance, one-stage detectors are predominantly favored. The YOLO series has evolved rapidly, with versions like YOLOv5, YOLOv7, YOLOv8, and the recent YOLOv10, YOLOv11, and YOLOv12, each introducing architectural improvements for better speed-accuracy trade-offs. Our work is built upon the efficient YOLOv11n, which provides a strong baseline.

UAV and Small Object Detection. Detecting UAVs is inherently a small object detection problem in most practical scenarios. Researchers have adapted general detectors to this task. Common strategies include: (1) Multi-scale Feature Fusion: Enhancing structures like Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN) to better integrate features from different levels. The BiFPN structure, with its fast normalized fusion, is a notable advance. (2) Attention Mechanisms: Modules like Squeeze-and-Excitation (SE), Convolutional Block Attention Module (CBAM), and Coordinate Attention (CA) are integrated to help models focus on relevant features. (3) Context Modeling: Leveraging surrounding information to aid in recognizing small, ambiguous targets. (4) Data Augmentation: Techniques like mosaic augmentation and copy-paste are used to increase the prevalence and variability of small objects in training data. While these methods provide improvements, there remains a need for a holistic architecture that cohesively addresses feature representation, efficient computation, and targeted attention for the specific case of UAV detection in clutter.

2. Methodology: The RSB-YOLOv11 Framework

The overall architecture of RSB-YOLOv11 is illustrated in Figure 1. It retains the general flow of YOLO detectors: a backbone for feature extraction, a neck for multi-scale feature fusion, and a detection head for final predictions. Our key innovations lie in the redesign of core components within the backbone and neck to optimize for UAV detection.

2.1. Lightweight and Effective Feature Extraction with C3k2-RVB and C3k2-RC

The original YOLOv11 uses a C3k2 module, which contains a series of standard convolutions and bottleneck blocks. While effective, this structure can be computationally heavy and may not optimally capture the fine-grained features needed for small drones. We draw inspiration from the RepViT block, which rethinks mobile CNNs from a Vision Transformer perspective, separating token mixing (spatial interaction) and channel mixing operations.

Our proposed C3k2-RepVitBlock (RVB) module, shown in Figure 2, is designed as a lightweight and powerful replacement. Its core consists of two parallel branches of depth-wise convolutions (DWConv): a 3×3 DWConv for capturing local spatial context (edges, textures) and a 1×1 DWConv for modeling channel-wise relationships. This parallel design allows the model to capture multi-scale spatial features relevant to drones of different sizes simultaneously. A key advantage is the use of structural re-parameterization. During training, the parallel branches learn diverse features. During inference, they can be merged into a single 3×3 DWConv layer, significantly reducing computational load and latency without sacrificing performance. This is crucial for deploying efficient anti-UAV systems that must run on edge devices.
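To make the branch-merging step concrete, the following single-channel NumPy sketch illustrates structural re-parameterization. It is an illustration under simplifying assumptions, not the actual RVB code: batch-norm folding and multi-channel handling are omitted, and the identity shortcut is treated as a Dirac kernel.

```python
import numpy as np

def dwconv2d(x, k):
    """Naive single-channel 2-D cross-correlation with 'same' zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def merge_branches(k3, k1, with_identity=True):
    """Fold parallel 3x3 DWConv, 1x1 DWConv and identity shortcut
    into one equivalent 3x3 kernel (structural re-parameterization)."""
    merged = k3.copy()
    merged[1, 1] += k1[0, 0]   # the 1x1 kernel sits at the centre tap
    if with_identity:
        merged[1, 1] += 1.0    # identity = Dirac kernel at the centre
    return merged
```

At inference time only the merged kernel is applied, so the parallel branch structure costs nothing at deployment: the output of `dwconv2d(x, k3) + dwconv2d(x, k1) + x` is numerically identical to `dwconv2d(x, merge_branches(k3, k1))`.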

To further enhance the module’s ability to discriminate UAVs from complex backgrounds, we integrate an attention mechanism. Instead of simply cascading channel and spatial attention, which can lead to information separation, we adopt the Channel-Spatial Attention (CSA) module, forming the C3k2-RC block (Figure 3). The CSA mechanism, detailed in Figure 4, performs channel and spatial attention in a more coupled manner.

Mathematically, for an input feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the CSA module works as follows:

Channel Attention Branch: This branch generates a channel attention vector $\mathbf{A_c} \in \mathbb{R}^{C \times 1 \times 1}$.
$$ \mathbf{Z} = \text{GAP}(\mathbf{F}) $$
$$ \mathbf{A_c} = \sigma(\text{Linear}_2(\delta(\text{Linear}_1(\mathbf{Z})))) $$
where $\text{GAP}$ is global average pooling, $\text{Linear}_1$ reduces channels by a ratio $r$, $\delta$ denotes the GELU activation function, $\text{Linear}_2$ expands channels back to $C$, and $\sigma$ is the sigmoid function.

Spatial Attention Branch: This branch generates a spatial attention map $\mathbf{A_s} \in \mathbb{R}^{1 \times H \times W}$ using a large-kernel depth-wise convolution for broad spatial context, which is vital for distinguishing a drone from background objects.
$$ \mathbf{A_s} = \sigma(\text{DWConv}_{7\times7}(\mathbf{F})) $$

Fusion: The final refined feature map $\mathbf{F}'$ is obtained by an element-wise product of the two attention weights and a residual connection:
$$ \mathbf{F}' = \mathbf{F} \otimes \mathbf{A_c} \otimes \mathbf{A_s} + \mathbf{F} $$
This synergistic application of attention forces the model to focus on “what” (channel) and “where” (space) simultaneously, making it highly effective for pinpointing small, salient UAV targets amidst clutter.
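The CSA forward pass above can be sketched in NumPy. This is a minimal illustration with placeholder weights, not the paper's implementation: the GELU uses the common tanh approximation, and collapsing the depth-wise output to a single spatial map via a channel mean is our assumption, since the equations leave the $C \to 1$ reduction implicit.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dwconv_same(x, k):
    """Depth-wise 'same' convolution: x is (C, H, W), k is (C, kh, kw)."""
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def csa(F, W1, W2, k_spatial):
    """F: (C,H,W); W1: (C//r, C); W2: (C, C//r); k_spatial: (C, 7, 7)."""
    z = F.mean(axis=(1, 2))                # GAP -> Z, shape (C,)
    a_c = sigmoid(W2 @ gelu(W1 @ z))       # channel attention A_c, shape (C,)
    # collapse the depth-wise output to one map (channel mean is an assumption)
    a_s = sigmoid(dwconv_same(F, k_spatial).mean(axis=0))      # A_s, (H, W)
    return F * a_c[:, None, None] * a_s[None, :, :] + F        # F' = F⊗A_c⊗A_s + F
```

Note that the residual connection guarantees the input signal always passes through: even when both attention maps saturate at 0.5, the output is simply a rescaled copy of the input, so the module cannot destroy features, only re-weight them.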

2.2. Enhanced Multi-Scale Fusion with S-BiFPN

The neck of a detector is responsible for fusing features from different stages of the backbone to build a rich, multi-scale representation. The original PAN-FPN structure is a unidirectional top-down and bottom-up process. For detecting UAV targets whose scale can vary immensely within a single scene, we need more aggressive and intelligent fusion.

We construct a Scale-aware Bidirectional Feature Pyramid Network (S-BiFPN), depicted in Figure 5. It extends the standard BiFPN in two ways: First, it introduces an additional high-resolution feature layer (P2) from the very early stage of the backbone. This layer preserves the finest spatial details, which are critical for detecting tiny drones that may be only a handful of pixels in size. Second, it implements bi-directional (top-down and bottom-up) cross-scale connections with learnable fusion weights. Unlike simple concatenation or addition, weighted fusion allows the network to dynamically decide the importance of input features from different scales during training. The fusion for a node can be represented as:
$$ \mathbf{P}_{out} = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot \mathbf{P}_{in}^{(i)} $$
where $w_i$ are learnable weights for each input feature $\mathbf{P}_{in}^{(i)}$, and $\epsilon$ is a small constant for numerical stability. This structure ensures that both the fine-grained details from P2/P3 and the high-level semantic cues from P4/P5 are seamlessly and effectively integrated, providing a robust feature pyramid for detecting UAVs across all sizes.
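The fast normalized fusion rule can be sketched in a few lines of NumPy. This is a minimal illustration of the fusion arithmetic (clamping weights to be non-negative follows the original BiFPN), not the full S-BiFPN node, which also resizes inputs to a common resolution:

```python
import numpy as np

def weighted_fusion(features, w, eps=1e-4):
    """Fast normalized fusion of same-resolution feature maps.

    features: list of arrays with identical shape (after resizing);
    w: one learnable scalar per input feature.
    """
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # keep weights non-negative
    norm = w / (eps + w.sum())                       # normalized to sum to ~1
    return sum(wi * f for wi, f in zip(norm, features))
```

During training, gradients flow into the scalars `w`, so each fusion node learns how much every scale should contribute; with equal weights the node degenerates to a plain average, and a dominant weight lets one scale pass through almost unchanged.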

3. Experiments and Analysis

3.1. Dataset and Implementation Details

We evaluate our model on the public Complex Background Dataset (CBD), which contains 12,490 images with annotations for UAVs and birds under diverse and challenging conditions—urban settings, open skies, mountainous regions, and varying weather. This dataset closely reflects the real-world scenarios faced by anti-UAV systems. We use the standard split: 9,725 images for training and 2,765 for validation. All models are trained from scratch without pre-trained weights to ensure a fair comparison. The training hyperparameters are consistent across all experiments and are summarized in Table 1.

Table 1: Training Hyperparameter Configuration
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Input Size | 640 × 640 | Optimizer | SGD |
| Epochs | 350 | Momentum | 0.937 |
| Batch Size | 16 | Weight Decay | 5e-4 |
| Initial LR | 0.01 | Early Stopping | 50 epochs |
| LR Scheduler | Cosine Annealing | Warm-up Epochs | 3 |

Evaluation Metrics: We use standard object detection metrics: Precision (P), Recall (R), mean Average Precision (mAP) at IoU threshold 0.5 (mAP@0.5) and the average mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (mAP@0.5:0.95). The F1-score, the harmonic mean of Precision and Recall, is also reported. Their formulas are:
$$ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$
$$ F1 = \frac{2 \times P \times R}{P + R} $$
$$ AP = \int_0^1 P(R) dR, \quad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively, and $N$ is the number of classes.
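As a sanity check, the metrics above can be computed directly. The sketch below is illustrative: `average_precision` uses the all-point interpolation with a monotone precision envelope common in modern evaluators, which is an assumption on our part since the text only gives the integral form of AP.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """P, R and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(recall, precision):
    """Area under the P-R curve with a monotone precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]         # indices where recall steps up
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP@0.5 averages this AP over classes at an IoU threshold of 0.5, while mAP@0.5:0.95 additionally averages over the ten IoU thresholds 0.50, 0.55, ..., 0.95.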

3.2. Ablation Studies

We conduct systematic ablation studies on the CBD dataset to validate the contribution of each proposed component. The baseline is the standard YOLOv11n model.

Study 1: Effectiveness of Individual Components. Results are shown in Table 2. Introducing only the lightweight C3k2-RVB module (Exp B) reduces parameters and FLOPs but slightly hurts mAP, indicating that lightweight design alone is not sufficient. Adding only the CSA attention to RVB (Exp C) brings modest gains. The most significant individual improvement comes from the S-BiFPN neck (Exp D), which boosts mAP@0.5 by 3.3%, underscoring the critical importance of advanced multi-scale fusion for UAV detection. Combining S-BiFPN with either RVB (Exp F) or the full C3k2-RC (Exp G) yields excellent results. Our full model, RSB-YOLOv11 (Exp H), which integrates all three components, achieves the best performance: 94.3% mAP@0.5 and 57.7% mAP@0.5:0.95, representing gains of 3.7% and 4.4% over the baseline, respectively. Notably, it does this with fewer parameters than the baseline and maintains a real-time inference speed of 53 FPS.

Table 2: Ablation Study on Proposed Components
| Exp | C3k2-RVB | C3k2-RC | S-BiFPN | FLOPs (G) | Params (M) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | F1 | FPS |
|---|---|---|---|---|---|---|---|---|---|
| A (Baseline) | | | | 6.4 | 2.59 | 90.6 | 53.3 | 86.98 | 101 |
| B | ✓ | | | 5.8 | 2.28 | 89.6 | 52.9 | 86.19 | 91 |
| C | | ✓ | | 6.7 | 2.34 | 90.5 | 53.4 | 87.00 | 65 |
| D | | | ✓ | 10.7 | 2.72 | 93.9 | 57.2 | 90.30 | 79 |
| E | ✓ | ✓ | | 6.7 | 2.34 | 91.2 | 54.3 | 87.36 | 68 |
| F | ✓ | | ✓ | 9.8 | 2.28 | 94.1 | 57.1 | 90.64 | 61 |
| G | | ✓ | ✓ | 11.3 | 2.63 | 93.9 | 56.9 | 91.08 | 59 |
| H (Ours) | ✓ | ✓ | ✓ | 10.7 | 2.34 | 94.3 | 57.7 | 91.30 | 53 |

Study 2: Analysis of Attention Mechanism and Placement. We investigate the impact of the CSA mechanism and its placement within the backbone. As shown in Table 3, integrating CSA into all C3k2-RVB blocks (Variant IV) delivers the best performance. Placing it only in shallow layers (Variant II) helps, but full integration allows for consistent feature refinement across all semantic levels. We also compare CSA against other popular attention modules like CBAM, SCSA, MLCA, and MSCA. CSA outperforms them all, particularly in F1-score, demonstrating its superior ability to jointly model channel and spatial dependencies for reducing false positives and negatives in UAV detection.

Table 3: Comparison of Different Attention Mechanisms and Placements
| Method / Attention | FLOPs (G) | Params (M) | mAP@0.5 (%) | F1 |
|---|---|---|---|---|
| Variant I (No Attention) | 9.8 | 2.28 | 94.1 | 90.64 |
| Variant II (CSA on Shallow) | 10.7 | 2.33 | 94.2 | 90.39 |
| Variant III (CSA on Deep) | 9.8 | 2.29 | 93.5 | 91.04 |
| Variant IV (CSA on All – Our Choice) | 10.7 | 2.34 | 94.3 | 91.30 |
| CBAM | 9.8 | 2.30 | 93.3 | 89.83 |
| SCSA | 10.6 | 2.33 | 94.0 | 90.92 |
| MLCA | 10.6 | 2.32 | 94.0 | 90.63 |
| MSCA | 11.0 | 2.37 | 93.4 | 89.73 |

3.3. Comparison with State-of-the-Art Models

We compare RSB-YOLOv11 against a wide range of contemporary object detectors on the CBD dataset. The results, summarized in Table 4, clearly demonstrate the superiority of our approach. Our model achieves the highest mAP@0.5 (94.3%) and mAP@0.5:0.95 (57.7%) among all compared methods. Notably, it significantly outperforms other YOLO variants of similar or even larger scale (e.g., YOLOv8s, YOLOv11s) while using fewer parameters and FLOPs. It also surpasses recent efficient architectures like YOLOv10 and YOLOv12. Compared to transformer-based detectors like RT-DETR, our CNN-based model is far more efficient and accurate for this task. The high F1-score of 91.30 indicates an excellent balance between precision and recall, which is essential for a reliable anti-UAV system to minimize both missed detections and false alarms.

Table 4: Performance Comparison with State-of-the-Art Detectors on the CBD Dataset
| Model | Params (M) | FLOPs (G) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | F1 |
|---|---|---|---|---|---|
| RT-DETR | 31.99 | 103.4 | 78.6 | 40.0 | 73.99 |
| FBRT-YOLO | 0.9 | 6.7 | 89.8 | 54.1 | 87.20 |
| Mamba-YOLO | 5.98 | 13.6 | 92.1 | 55.7 | 89.26 |
| YOLOv3 | 8.67 | 12.9 | 88.7 | 49.2 | 87.42 |
| YOLOv5s | 7.01 | 15.8 | 92.4 | 55.2 | 90.20 |
| YOLOv7-tiny | 6.02 | 13.2 | 91.7 | 53.4 | 89.10 |
| YOLOv8n | 3.01 | 8.1 | 89.6 | 53.5 | 87.27 |
| YOLOv8s | 11.13 | 28.4 | 91.1 | 56.5 | 88.65 |
| YOLOv10n | 2.70 | 8.2 | 88.3 | 51.7 | 86.23 |
| YOLOv10s | 8.04 | 24.4 | 90.7 | 54.2 | 88.04 |
| YOLOv11n (Baseline) | 2.59 | 6.4 | 90.6 | 53.3 | 86.98 |
| YOLOv11s | 9.41 | 21.3 | 91.2 | 55.7 | 88.17 |
| YOLOv12n | 2.56 | 6.3 | 88.6 | 52.2 | 84.67 |
| YOLOv12s | 9.23 | 21.2 | 91.0 | 55.3 | 87.76 |
| RSB-YOLOv11 (Ours) | 2.34 | 10.7 | 94.3 | 57.7 | 91.30 |

3.4. Qualitative Results and Robustness Analysis

Visual comparisons, as shown in sample detection results, further validate the effectiveness of RSB-YOLOv11. In scenarios with strong background clutter (e.g., urban buildings, dense clouds), the baseline YOLOv11n often suffers from missed detections, especially for smaller, distant drones. In contrast, our model consistently identifies these challenging targets with higher confidence scores. It also demonstrates superior discriminative power, effectively avoiding false positives where background objects might resemble a drone. This robustness is a direct consequence of the synergistic design: the S-BiFPN provides rich multi-scale features, the CSA mechanism filters out noise and highlights key drone characteristics, and the efficient RVB block ensures this is done without computational explosion. This makes RSB-YOLOv11 a highly practical solution for real-world UAV surveillance systems operating in complex urban and natural environments.

4. Conclusion

In this paper, we presented RSB-YOLOv11, a high-precision object detection model tailored for the challenging task of UAV recognition in complex backgrounds. To address the specific difficulties of small target size, scale variation, and background clutter prevalent in UAV detection, we introduced three key innovations: a lightweight and re-parameterizable C3k2-RVB module for efficient feature extraction, a synergistic Channel-Spatial Attention (CSA) mechanism for targeted feature refinement, and a Scale-aware BiFPN (S-BiFPN) structure for enhanced multi-scale feature fusion. Extensive experiments on the CBD dataset demonstrate that our model achieves state-of-the-art performance, significantly outperforming the original YOLOv11 and other contemporary detectors in terms of accuracy (mAP) and efficiency (parameters). RSB-YOLOv11 offers a compelling solution for developing reliable, real-time vision-based anti-UAV systems. Future work will focus on extending the model’s robustness to more extreme conditions such as nighttime, heavy rain, and fog, and further exploring ultra-lightweight designs for deployment on resource-constrained edge devices.
