In the rapidly evolving landscape of low-altitude economies, the proliferation of small UAV drones equipped with cameras has revolutionized applications ranging from daily life documentation and agricultural monitoring to security patrols and military reconnaissance. Particularly in scenarios such as disaster zone surveillance, dense urban area personnel tracking, and battlefield intelligence gathering, UAV drone surveillance technology provides a critical means for large-scale video and image acquisition. However, a persistent challenge remains: how can UAV drones accurately perceive their surroundings during flight to achieve effective obstacle avoidance? To prevent mid-air collisions, target detection technology based on the UAV drone perspective has emerged as a hot research topic. This technology, through the detection and precise model identification of target UAV drones, offers robust support for maintaining low-altitude safety and economic order.
Currently, numerous studies have focused on UAV drone target detection. In terms of detection algorithms, methods based on convolutional neural networks have become mainstream and achieved remarkable success across various domains. Yet, their performance in UAV drone-specific scenarios still requires enhancement. Existing target detection algorithms are primarily divided into two-stage and single-stage detectors. Two-stage algorithms, such as R-CNN and Faster R-CNN, first generate region proposals and then perform fine regression and classification on these proposals. While offering high detection accuracy, they suffer from relatively slow inference speeds. Single-stage algorithms, like the YOLO series, SSD, and RetinaNet, predict targets directly in an end-to-end manner, boasting faster detection speeds but often at a slight cost to precision. YOLOv13, as the latest iteration in the YOLO series, maintains high detection accuracy while offering lightweight advantages, making it suitable for efficient real-time target detection tasks on UAV drone platforms. Therefore, this study concentrates on UAV drone target detection algorithms based on YOLOv13.
Regarding datasets, specialized datasets for UAV drone detection are a foundational resource driving the development of UAV drone detection, identification, and countermeasure systems. These datasets aim to train object detection models to accurately identify UAV drone targets against complex aerial backgrounds, with core challenges including the small size, high speed, and potential confusion with birds or aircraft. Several dedicated UAV drone datasets exist, such as Anti-UAV, which provides video sequences in visible and infrared spectra for detection and tracking tasks; Det-Fly, which focuses on long-distance, small-target UAV drone detection to address scale challenges; and others like MIDGARD and UAVDT that offer diverse data support. However, most existing datasets treat UAV drones as a single category, lacking support for precise identification of different UAV drone models. To address this gap, we constructed a UAV drone target detection dataset from a UAV drone perspective. Through multi-scenario actual flight experiments, we collected data and utilized the YOLOv13 model for training and validation to achieve precise detection and classification of different UAV drone models. The key distinction of our dataset is that during annotation, specific UAV drone models are explicitly labeled, rather than lumping all targets under a generic “UAV drone” category. This enables precise model discrimination in practical applications, providing technical support for UAV drone model identification in low-altitude security.

Our research delves into the intricacies of UAV drone detection. We recognize that effective detection hinges on robust feature extraction and an efficient model architecture. The YOLOv13 algorithm represents a significant advancement in this regard. Building upon the traditional YOLO framework of Backbone, Neck, and Head, YOLOv13 introduces the Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism, enabling full-pipeline aggregation and distribution (FullPAD). This enhances feature representation across the network. The HyperACE module integrates a global high-order perception branch built on the C3AH module and a local low-order perception branch built on the DS-C3k module. Through adaptive hypergraph computation, it models high-order visual correlations in images with linear complexity, significantly improving feature perception and integration capabilities for UAV drone targets. The FullPAD mechanism extracts multi-scale feature maps from the backbone, feeds them into the HyperACE module for enhancement, and redistributes the optimized features to various network stages through different FullPAD channels. This facilitates fine-grained information flow and representation synergy, improving gradient propagation efficiency and thereby boosting detection performance. Mathematically, the feature enhancement via HyperACE can be represented as:
$$ \mathbf{F}_{\text{enhanced}} = \text{HyperACE}(\mathbf{F}_{\text{input}}) = \mathcal{H}(\mathbf{F}_{\text{input}}) + \mathcal{L}(\mathbf{F}_{\text{input}}) $$
where $\mathbf{F}_{\text{input}}$ is the input feature map, $\mathcal{H}$ denotes the high-order perception function leveraging hypergraph structures, and $\mathcal{L}$ represents the low-order perception function for local details. The adaptive correlation is computed as:
$$ \mathbf{A} = \sigma(\mathbf{W} \cdot \text{GraphConv}(\mathbf{F}_{\text{input}}, \mathbf{G})) $$
Here, $\mathbf{G}$ is the constructed hypergraph, $\text{GraphConv}$ is a graph convolution operation, $\mathbf{W}$ is a learnable weight matrix, and $\sigma$ is an activation function. This allows the model to capture complex relationships among UAV drone parts and backgrounds.
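The adaptive-correlation equation above can be made concrete with a toy numerical sketch of hypergraph message passing. This is purely illustrative, not the actual HyperACE implementation: the incidence matrix `H`, the mean-aggregation rule, and the sigmoid gate standing in for $\sigma(\mathbf{W} \cdot \text{GraphConv}(\cdot))$ are all simplifying assumptions.

```python
import math

def hypergraph_conv(X, H):
    """Toy hypergraph 'convolution': nodes exchange information through
    hyperedges. X is an n x d list of node features; H is an n x m
    incidence matrix (H[i][e] = 1 if node i belongs to hyperedge e).
    Each hyperedge averages its members' features, then each node
    averages the features of the hyperedges it belongs to."""
    n, d, m = len(X), len(X[0]), len(H[0])
    # Hyperedge features: mean of member node features.
    E = []
    for e in range(m):
        members = [i for i in range(n) if H[i][e]]
        E.append([sum(X[i][k] for i in members) / len(members) for k in range(d)])
    # Node update: mean of incident hyperedge features.
    out = []
    for i in range(n):
        edges = [e for e in range(m) if H[i][e]]
        out.append([sum(E[e][k] for e in edges) / len(edges) for k in range(d)])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Four nodes, two hyperedges grouping nodes {0, 1} and {2, 3}.
X = [[1.0], [3.0], [5.0], [7.0]]
H = [[1, 0], [1, 0], [0, 1], [0, 1]]
F = hypergraph_conv(X, H)            # [[2.0], [2.0], [6.0], [6.0]]
# Illustrative stand-in for the gated correlation A = sigma(W . GraphConv(F, G)):
A = [[sigmoid(v) for v in row] for row in F]
```

The key point the sketch captures is that a hyperedge couples an arbitrary group of nodes at once, so one aggregation step mixes information among all its members, rather than only along pairwise edges as in an ordinary graph.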
Additionally, YOLOv13 employs large-kernel depthwise separable convolutions (DSConv) as basic units, constructing a series of lightweight feature extraction modules like DSConv, DS-Bottleneck, DS-C3k, and DS-C3k2. These reduce model complexity while balancing detection accuracy and computational efficiency. The depthwise separable convolution operation can be expressed as:
$$ \text{DSConv}(\mathbf{X}) = \text{PointwiseConv}(\text{DepthwiseConv}(\mathbf{X})) $$
where $\mathbf{X}$ is the input tensor, $\text{DepthwiseConv}$ applies a single filter per input channel, and $\text{PointwiseConv}$ combines channels via a 1×1 convolution. This factorization drastically cuts parameters and FLOPs, crucial for deployment on resource-constrained UAV drone platforms.
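The savings from this factorization are easy to quantify. The short sketch below counts parameters for a standard convolution versus its depthwise separable counterpart; the 64-to-128-channel layer is a hypothetical example, not a specific YOLOv13 layer.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dsconv_params(c_in, c_out, k):
    """Depthwise (one k x k filter per input channel) plus
    pointwise (1 x 1 channel-mixing) parameters."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer: 64 -> 128 channels with a 3 x 3 kernel.
std = conv_params(64, 128, 3)    # 73728
ds = dsconv_params(64, 128, 3)   # 576 + 8192 = 8768
print(std, ds, round(std / ds, 1))  # roughly an 8.4x reduction
```

The same ratio applies to FLOPs at a fixed spatial resolution, which is where the real-time benefit on embedded hardware comes from.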
To quantify the detection performance, we rely on standard metrics. The bounding box prediction in YOLO involves predicting the center coordinates $(b_x, b_y)$, width $b_w$, height $b_h$, and objectness score. For a grid cell at offset $(c_x, c_y)$ from the image top-left, the predictions are:
$$ b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y $$
$$ b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} $$
where $t_x, t_y, t_w, t_h$ are model outputs, $\sigma$ is the sigmoid function, and $p_w, p_h$ are anchor dimensions. The objectness score $P_{\text{obj}}$ and class probabilities $P_{\text{class}}$ are predicted simultaneously. The loss function $\mathcal{L}_{\text{YOLO}}$ combines localization, confidence, and classification losses:
$$ \mathcal{L}_{\text{YOLO}} = \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \right] $$
$$ + \lambda_{\text{obj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( P_{\text{obj}} - \hat{P}_{\text{obj}} \right)^2 + \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( P_{\text{obj}} - \hat{P}_{\text{obj}} \right)^2 $$
$$ + \lambda_{\text{class}} \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( P_{\text{class}}^c - \hat{P}_{\text{class}}^c \right)^2 $$
Here, $S^2$ is the number of grid cells, $B$ is the number of anchors per cell, $\mathbb{1}$ are indicator functions, $\hat{\cdot}$ denotes ground truth, and $\lambda$ terms are weighting coefficients. This multi-part loss ensures accurate detection of UAV drone instances.
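The box-decoding equations above translate directly into code. The sketch below applies the sigmoid/exponential parameterization to hypothetical raw outputs; the grid offsets and anchor sizes are illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw outputs (t_x, t_y, t_w, t_h) into a box in grid units,
    following the YOLO parameterization: sigmoid keeps the center inside
    its cell, and the exponential scales the anchor dimensions."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero raw outputs place the center at the cell midpoint and keep the
# anchor size unchanged: cell (3, 4) with a 2 x 2 anchor.
print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 2.0, 2.0))  # (3.5, 4.5, 2.0, 2.0)
```

The sigmoid on $t_x, t_y$ is what makes training stable: each cell can only claim centers within its own bounds, so neighboring cells never compete for the same ground-truth center.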
In our experimental design, we constructed a UAV drone dataset from the onboard perspective of a DJI Air 3. The dataset comprises 2440 images with a resolution of 3840 × 2160, split into training, validation, and test sets in an 8:1:1 ratio. It encompasses diverse environments such as building areas, grasslands, skies, and streets, captured from multiple viewpoints, including low-angle, high-angle, and oblique views. The dataset includes three UAV drone models: Air 3, Mavic 2, and Mavic 3. To balance learning difficulty and class representation, given the high visual similarity between Mavic 2 and Mavic 3, we maintained an approximate 1:1 instance ratio between the Air series and the Mavic series. Detailed statistics are summarized in Table 1.
| UAV Drone Model | Number of Instances | Percentage (%) |
|---|---|---|
| Air 3 | 1873 | 52.4 |
| Mavic 2 | 1294 | 36.2 |
| Mavic 3 | 417 | 11.4 |
| Total | 3584 | 100.0 |
Data collection involved a formation flight: a DJI Mavic 2 at the center, flanked by two DJI Air 3 drones, with a DJI Mavic 3 hovering above. Relative altitude variations of 1–2 meters were introduced between the Mavic 2 and the Air 3 drones, while the Mavic 3 performed circular motions overhead. Frames were extracted from the recorded video at one frame in every five, then manually annotated with the LabelImg tool in YOLO format, with tight bounding boxes drawn around each UAV drone. This rigorous annotation process is crucial for training precise detectors.
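A YOLO-format label stores each box as class index plus normalized center and size. The helper below converts a pixel-space box to that format; the example box coordinates are hypothetical, chosen only to match our 3840 × 2160 frame resolution.

```python
def to_yolo(cls, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a YOLO-format annotation line:
    'class x_center y_center width height', all normalized to [0, 1]."""
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A hypothetical small UAV drone box in a 3840 x 2160 frame.
print(to_yolo(0, 1800, 1000, 1944, 1090, 3840, 2160))
```

Normalizing by the image dimensions is what lets the same label file serve any training resolution, e.g. our 640 × 640 resize.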
For training, we utilized an Ubuntu 20.04 system with an RTX 2080 Ti GPU (11 GB VRAM) and 64 GB RAM. Software environments included CUDA 10.0 and PyTorch 1.2.0. We selected two YOLOv13 variants: YOLOv13n (lightweight) and YOLOv13s (more parameters for higher accuracy). Input images were resized to 640 × 640, with training conducted for 100 epochs and a batch size of 8. We employed a cosine annealing learning rate scheduler, defined as:
$$ \eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos \left( \frac{T_{\text{cur}}}{T_{\max}} \pi \right) \right) $$
where $\eta_t$ is the learning rate at epoch $t$, $\eta_{\min} = 0.0001$, $\eta_{\max} = 0.01$, $T_{\text{cur}}$ is the current epoch number, and $T_{\max} = 50$ is the period length. This scheduler helps stabilize convergence. Additionally, we applied Mosaic data augmentation, which combines four training images into one, enhancing dataset diversity and model generalization. The Mosaic operation can be described as creating a composite image $\mathbf{I}_{\text{mosaic}}$ from four random images $\mathbf{I}_1, \mathbf{I}_2, \mathbf{I}_3, \mathbf{I}_4$:
$$ \mathbf{I}_{\text{mosaic}} = \text{Concat}_{2\times2} \left( \text{Crop}(\mathbf{I}_1), \text{Crop}(\mathbf{I}_2), \text{Crop}(\mathbf{I}_3), \text{Crop}(\mathbf{I}_4) \right) $$
where $\text{Crop}$ denotes random cropping and $\text{Concat}_{2\times2}$ arranges them in a 2×2 grid. This exposes the model to varied UAV drone scales and contexts.
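The cosine schedule above is a one-line function of the epoch index. The sketch below uses the hyperparameters stated in the text ($\eta_{\min} = 0.0001$, $\eta_{\max} = 0.01$, $T_{\max} = 50$) and checks the endpoints.

```python
import math

def cosine_lr(t, eta_min=0.0001, eta_max=0.01, t_max=50):
    """Cosine annealing: starts at eta_max, decays smoothly to eta_min
    over t_max epochs, following the formula above."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(t / t_max * math.pi))

print(cosine_lr(0))   # 0.01   (start of the cycle)
print(cosine_lr(50))  # 0.0001 (end of the cycle)
```

Because the derivative of the cosine vanishes at both ends, the learning rate changes slowly at the start (preserving early progress) and at the end (letting the model settle into a minimum), with the fastest decay in mid-training.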
Evaluation metrics are essential for assessing UAV drone detection performance. We used recall, precision, intersection over union (IoU), average precision (AP), and mean average precision (mAP). Recall and precision are defined as:
$$ \text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP} $$
where $TP$, $FN$, and $FP$ are true positives, false negatives, and false positives, respectively. IoU measures overlap between predicted and ground-truth boxes:
$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} $$
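For axis-aligned boxes in corner format, the IoU ratio above reduces to a few lines of arithmetic. This is a generic sketch of the standard computation, with illustrative coordinates.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection extents, clamped at zero when the boxes do not overlap.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, approx. 0.1429
```

Note how strict the metric is: these two unit-offset boxes overlap a quarter of each other's area yet score only 1/7, which is why mAP75 is so much harder to achieve than mAP50.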
AP is the area under the precision-recall curve for a single class, and mAP is the mean AP across all classes. We report mAP at IoU thresholds of 0.50 (mAP50), 0.75 (mAP75), and the average over thresholds from 0.50 to 0.95 (mAP50-95). Inference speed is measured in frames per second (FPS), calculated as:
$$ \text{FPS} = \frac{N_{\text{frames}}}{T_{\text{total}}} $$
where $N_{\text{frames}}$ is the number of processed images and $T_{\text{total}}$ is the total time in seconds.
Training the YOLOv13n model on our UAV drone dataset yielded stable convergence, as illustrated by the loss curves and metric trends in Figure 2. The loss decreased rapidly at first and then gradually plateaued, indicating convergence, while mAP values steadily increased and stabilized, demonstrating improved detection capability. Quantitative results on the test set are presented in Table 2, showcasing the performance of our YOLOv13n+Mosaic model across the different UAV drone models.
| UAV Drone Model | Precision (%) | Recall (%) | mAP50 (%) | mAP75 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|
| Air 3 | 86.7 | 90.2 | 89.8 | 12.5 | 32.1 |
| Mavic 2 | 96.8 | 98.1 | 97.0 | 25.6 | 44.7 |
| Mavic 3 | 91.8 | 91.9 | 99.5 | 59.7 | 62.2 |
| All Classes | 91.8 | 93.4 | 95.4 | 32.6 | 46.3 |
The model achieved mAP50 values near or above 90% for all UAV drone models, with the highest for Mavic 3 (99.5%) and the lowest for Air 3 (89.8%). We hypothesize that the abundance and visual diversity of Air 3 samples introduce more false positives, while Mavic 3, often captured against simple sky backgrounds, is easier to characterize. The drop in mAP75 and mAP50-95 relative to mAP50 indicates that precise localization at higher IoU thresholds remains challenging, likely because visual similarities among the UAV drone models keep confidence scores between 50% and 70%. Nonetheless, the consistently high mAP50 indicates detection robust enough for real-time UAV drone identification tasks.
To comprehensively evaluate our approach, we conducted comparative experiments, as summarized in Table 3. We compared YOLOv13n+Mosaic against the baseline YOLOv13n without Mosaic augmentation and the larger YOLOv13s model.
| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP75 (%) | mAP50-95 (%) | FPS |
|---|---|---|---|---|---|---|
| YOLOv13n+Mosaic | 91.8 | 93.4 | 95.4 | 32.6 | 46.3 | 58.5 |
| YOLOv13s | 94.5 | 96.1 | 97.1 | 47.0 | 46.7 | 44.4 |
| YOLOv13n (baseline) | 92.8 | 92.8 | 93.5 | 30.6 | 41.4 | 61.3 |
The results demonstrate that incorporating Mosaic data augmentation significantly boosts the performance of YOLOv13n, bringing it close to the more parameter-rich YOLOv13s in terms of mAP50 (95.4% vs. 97.1%), while maintaining a higher FPS (58.5 vs. 44.4). The baseline YOLOv13n, without augmentation, shows lower metrics across the board. This underscores the value of data augmentation in enhancing model generalization for UAV drone detection. The lightweight nature of YOLOv13n+Mosaic, with its balanced accuracy and speed, makes it ideal for deployment on embedded UAV drone platforms with limited computational resources.
Further analysis involves the impact of hyperparameters on UAV drone detection. We explored different learning rates and batch sizes, noting that the cosine scheduler with an initial rate of 0.01 yielded optimal convergence. The role of anchor boxes tailored for UAV drone scales is also critical. Using k-means clustering on our dataset, we derived custom anchors that better match the aspect ratios and sizes of UAV drones. The anchor dimensions $(p_w, p_h)$ are optimized to minimize the distance metric:
$$ d(\text{box}, \text{anchor}) = 1 – \text{IoU}(\text{box}, \text{anchor}) $$
This customization improves bounding box regression accuracy for UAV drone targets.
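The anchor-fitting step can be sketched as k-means over (width, height) pairs with $d = 1 - \text{IoU}$ as the distance, where boxes are compared aligned at the origin so only their shapes matter. This is a generic sketch of the technique, not our exact clustering code, and the sample boxes are hypothetical.

```python
import random

def wh_iou(a, b):
    """IoU of two boxes aligned at the origin, given as (w, h) pairs."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """k-means on (w, h) pairs using d = 1 - IoU as the distance,
    i.e. each box joins the cluster whose center it overlaps most."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            j = max(range(k), key=lambda c: wh_iou(b, centers[c]))
            clusters[j].append(b)
        centers = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sorted(centers)

# Hypothetical box shapes: a small-drone group and a large-drone group.
boxes = [(10, 10), (12, 11), (11, 9), (50, 40), (48, 42), (52, 41)]
print(kmeans_anchors(boxes, 2))  # [(11.0, 10.0), (50.0, 41.0)]
```

Using $1 - \text{IoU}$ instead of Euclidean distance keeps large boxes from dominating the clustering: a 10-pixel error matters far more for a small UAV drone box than for a large one, and the IoU-based distance reflects that naturally.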
Another aspect is the computational complexity of the models. We calculate the number of parameters and FLOPs to assess efficiency. For YOLOv13n, the approximate parameter count is 2.5 million, while YOLOv13s has around 9.0 million. The FLOPs for an input of 640×640 can be estimated as:
$$ \text{FLOPs} = \sum_{l} C_{l,\text{in}} \cdot C_{l,\text{out}} \cdot K_{l}^2 \cdot H_{l} \cdot W_{l} $$
where $l$ indexes layers, $C$ are channels, $K$ is kernel size, and $H, W$ are feature map dimensions. Reduced FLOPs in YOLOv13n contribute to higher FPS, essential for real-time UAV drone applications.
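The per-layer term of the FLOPs sum above is a straightforward product. The sketch below evaluates it for a single hypothetical early layer; the channel counts and resolution are illustrative, not taken from the YOLOv13 architecture.

```python
def conv_flops(c_in, c_out, k, h, w):
    """Multiply-accumulate count of one convolutional layer,
    per the per-layer term C_in * C_out * K^2 * H * W."""
    return c_in * c_out * k * k * h * w

# Hypothetical stem layer: 3 -> 16 channels, 3 x 3 kernel, 320 x 320 output.
print(conv_flops(3, 16, 3, 320, 320) / 1e6, "MFLOPs")  # 44.2368 MFLOPs
```

Summing this term over all layers (plus the smaller contributions of activations and normalization) gives the model totals used to compare YOLOv13n and YOLOv13s.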
In terms of practical deployment, the detection pipeline for a UAV drone involves capturing images, preprocessing (resizing, normalization), inference via the trained YOLOv13 model, and post-processing (non-maximum suppression). The end-to-end latency must be low to enable timely obstacle avoidance. Our model achieves an inference time of approximately 17 ms per image (58.5 FPS) on the RTX 2080 Ti, and we anticipate further optimization for embedded GPUs common in UAV drones.
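The post-processing step of that pipeline, greedy non-maximum suppression, can be sketched in a few lines. This is the standard algorithm rather than our deployment code, and the boxes and scores are hypothetical.

```python
def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    all remaining boxes that overlap it above iou_thresh.
    Boxes are (x_min, y_min, x_max, y_max); returns kept indices."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

# Two overlapping detections of one drone plus a distant second drone.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the duplicate (index 1) is suppressed
```

The IoU threshold is a latency-free knob worth tuning for formation flights like ours: too low a threshold can merge two genuinely close drones into a single detection.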
We also considered challenges specific to UAV drone detection. These include occlusion, where a UAV drone might be partially hidden by clouds or structures; scale variation, as UAV drones appear smaller at distances; and adversarial conditions like poor lighting or motion blur. Data augmentation techniques like Mosaic help mitigate these by presenting varied scenarios during training. Additionally, future work could integrate temporal information from video sequences to track UAV drones across frames, enhancing robustness.
The importance of UAV drone detection extends beyond collision avoidance. In security applications, identifying UAV drone models can aid in threat assessment—for instance, distinguishing between commercial and potentially hostile UAV drones. Our dataset’s fine-grained annotations enable such discrimination. Moreover, the proliferation of UAV drones in urban air mobility necessitates reliable detection systems for air traffic management. Our study contributes to this ecosystem by providing a method that balances accuracy and speed.
To summarize, we have developed and validated a UAV drone detection method based on YOLOv13 combined with Mosaic data augmentation. Our self-built UAV drone dataset, encompassing multiple models and scenarios, addresses the gap in publicly available data for UAV drone model identification. The YOLOv13 architecture, enhanced with HyperACE and FullPAD mechanisms, demonstrates superior feature extraction capabilities for UAV drone targets. Experimental results confirm that our approach achieves a high mAP50 of 95.4%, with the lightweight YOLOv13n+Mosaic variant offering an excellent trade-off between precision and inference speed. This makes it suitable for real-time deployment on UAV drone platforms, advancing the state of low-altitude safety and autonomous navigation. Future directions include expanding the dataset with more UAV drone models and environmental conditions, exploring neural architecture search for optimized models, and testing on edge devices for in-flight UAV drone detection systems.
In conclusion, the rapid growth of UAV drone technology demands continuous innovation in detection methodologies. Our work underscores the effectiveness of combining advanced neural networks like YOLOv13 with strategic data augmentation for UAV drone detection. As UAV drones become increasingly integrated into daily life and critical operations, robust detection systems will be paramount. We are confident that our contributions will inspire further research and practical implementations in the dynamic field of UAV drone perception and safety.
