A Deep Learning-Based Method for Multi-Motion Target Recognition in UAV Platforms

In recent years, the integration of unmanned aerial vehicles (UAVs), commonly known as drones, into various sectors has revolutionized data acquisition and monitoring capabilities. UAVs offer unparalleled flexibility in capturing high-resolution imagery from multiple altitudes and angles, making them ideal for applications such as environmental surveillance, security patrols, and urban management. In road traffic safety supervision in particular, UAVs equipped with target detection algorithms enable real-time monitoring of vehicle movements. However, recognizing multiple moving targets from aerial perspectives poses significant challenges, including large scale variations, irregular dense distributions, and class imbalance, which often limit model performance and scalability. To address these issues, I propose an improved vehicle target recognition method based on YOLOv7, named YOLO-QYF, designed specifically for UAV platforms. This method enhances detection accuracy and efficiency through several key modifications, which I detail in this article.

The core of my approach involves four strategic enhancements to the baseline YOLOv7 model. First, I introduce the QARepVGG module to replace the computationally expensive Extended Efficient Layer Aggregation Network (E-ELAN) in the backbone, reducing model parameters while maintaining performance. Second, I incorporate a Content-Aware Feature Reassembly (CARAFE) module during feature fusion to minimize information loss during upsampling, thereby reducing missed detections in dense scenarios. Third, I add a Coordinate Attention (CA) mechanism to the early layers of the feature extraction network to enhance the localization capability for targets of interest, especially small objects. Finally, I adopt the WIoU loss function to mitigate sample imbalance issues and improve model generalization. These improvements collectively aim to optimize the model for the dynamic and complex environments captured by UAVs.

To understand the methodology, let me start with the network architecture of YOLO-QYF. The overall framework builds upon YOLOv7 but integrates the aforementioned modules to address specific limitations in UAV-based target detection. The backbone network utilizes QARepVGG blocks, which employ a multi-branch design during training that is fused into a single 3×3 convolutional branch for inference. This fusion reduces computational overhead without sacrificing accuracy. The fusion of a convolutional layer and its batch normalization layer can be expressed as: $$ \text{BN}(\text{Conv}(x)) = W_{\text{fused}} * x + b_{\text{fused}} $$ where $$ W_{\text{fused}} = \frac{\gamma \times W}{\sqrt{\sigma^2 + \epsilon}} $$ and $$ b_{\text{fused}} = \frac{\gamma \times (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta $$ with \( \gamma \) and \( \beta \) the batch normalization scale and shift, \( \mu \) and \( \sigma^2 \) the running mean and variance, and \( \epsilon \) a small constant for numerical stability. This linear transformation simplifies the structure, making it more suitable for deployment on resource-constrained UAV platforms.
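The conv-BN folding above can be sketched in a few lines of NumPy. This is an illustrative helper of my own (function name and array layout are assumptions, not code from the implementation); it folds the BN statistics into the convolution weights exactly as the two formulas prescribe:

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    W: (C_out, C_in, k, k) conv weights; b: (C_out,) conv bias;
    gamma, beta, mu, var: (C_out,) BN scale, shift, running mean, variance.
    Returns (W_fused, b_fused) such that Conv(x; W_fused, b_fused)
    equals BN(Conv(x; W, b)).
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel factor
    W_fused = W * scale[:, None, None, None]    # gamma * W / sqrt(var + eps)
    b_fused = scale * (b - mu) + beta           # gamma * (b - mu) / sqrt(...) + beta
    return W_fused, b_fused
```

After fusion the BN layer disappears entirely, which is what makes the single-branch inference graph cheaper than the training-time multi-branch one.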

In the neck of the network, the CARAFE module performs content-aware upsampling to preserve semantic information during feature map transmission. Unlike traditional upsampling methods that focus on local pixel details, CARAFE predicts a reassembly kernel based on input features, allowing for adaptive recombination. The process involves two stages: kernel prediction and feature reassembly. For a given input feature map \( X \), the upsampling kernel \( W_r \) is predicted as: $$ W_r = \psi(N(X_l, K_{\text{encoder}})) $$ where \( \psi \) denotes the kernel prediction function, \( X_l \) is the input at location \( l = (i, j) \), and \( N(X_l, K) \) represents a \( K \times K \) sub-region centered at \( l \). The output feature map \( X'_{l'} \) is then computed as: $$ X'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_r(n, m) X(i+n, j+m) $$ where \( r = \lfloor K_{\text{up}} / 2 \rfloor \). This mechanism enhances the utilization of feature information, crucial for detecting small and densely packed vehicles in UAV imagery.
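The second stage, feature reassembly, can be sketched directly from the sum above. The snippet below is a minimal NumPy illustration of reassembly alone, under the assumption that the per-location kernels have already been predicted and softmax-normalized elsewhere (the kernel-prediction convolution is not modeled here):

```python
import numpy as np

def carafe_reassemble(X, kernels, scale=2):
    """CARAFE feature reassembly (second stage only).

    X: (C, H, W) input feature map.
    kernels: (scale*H, scale*W, K, K) predicted reassembly kernels,
             assumed already softmax-normalized per output location.
    Each output pixel l' is a weighted sum over the KxK neighborhood
    of the source location l = l' // scale.
    """
    C, H, W = X.shape
    Ho, Wo, K, _ = kernels.shape
    r = K // 2
    Xp = np.pad(X, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((C, Ho, Wo), dtype=X.dtype)
    for ip in range(Ho):
        for jp in range(Wo):
            i, j = ip // scale, jp // scale        # source location l
            patch = Xp[:, i:i + K, j:j + K]        # KxK sub-region N(X_l, K)
            out[:, ip, jp] = (patch * kernels[ip, jp]).sum(axis=(1, 2))
    return out
```

With a trivial 1×1 kernel of all ones this degenerates to nearest-neighbor upsampling; the learned kernels are what make the recombination content-aware.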

To further improve target localization, I integrate the Coordinate Attention (CA) mechanism into the feature extraction network. CA captures long-range dependencies along spatial dimensions by decomposing attention into height and width directions, which is beneficial for identifying targets in cluttered scenes. For an input feature map of size \( H \times W \times C \), average pooling is applied along the horizontal and vertical directions to obtain feature maps \( Z^h_c(h) \) and \( Z^w_c(w) \): $$ Z^h_c(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) $$ $$ Z^w_c(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) $$. These are then concatenated and transformed through a shared 1×1 convolution, then split and passed through sigmoid activations to produce attention weights \( g^h \) and \( g^w \). The final output is weighted as: $$ y_c(i, j) = x_c(i, j) g^h_c(i) g^w_c(j) $$. This attention mechanism helps the model focus on relevant regions, enhancing detection accuracy for vehicles captured by UAVs.
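The data flow above is easy to make concrete. The NumPy sketch below follows the pooling and re-weighting steps; as a simplifying assumption, the shared learned 1×1 convolution is replaced by the identity, so the attention weights are just sigmoids of the pooled features rather than a trained transform:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coordinate_attention(x):
    """Minimal Coordinate Attention forward pass (NumPy sketch).

    x: (C, H, W). Directional average pooling gives z_h (C, H) and
    z_w (C, W); the learned 1x1 transform is omitted here (assumption),
    so the gates are sigmoids of the pooled features directly.
    """
    z_h = x.mean(axis=2)            # Z^h_c(h) = (1/W) sum_i x_c(h, i)
    z_w = x.mean(axis=1)            # Z^w_c(w) = (1/H) sum_j x_c(j, w)
    g_h = sigmoid(z_h)              # (C, H) height-direction gate
    g_w = sigmoid(z_w)              # (C, W) width-direction gate
    # y_c(i, j) = x_c(i, j) * g^h_c(i) * g^w_c(j)
    return x * g_h[:, :, None] * g_w[:, None, :]
```

The key design point is that each output pixel is modulated by one weight per row and one per column, so the module encodes position along both axes at negligible parameter cost.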

For loss function optimization, I replace the CIoU loss with WIoU to address sample imbalance. The WIoU loss introduces a dynamic non-monotonic focusing mechanism that reduces the influence of low-quality samples during training. It is defined as: $$ L_{\text{WIoU}} = r \, R_{\text{WIoU}} \, L_{\text{IoU}} $$ where \( L_{\text{IoU}} = 1 - \frac{|B \cap B_{\text{gt}}|}{|B \cup B_{\text{gt}}|} \) is the IoU loss, \( r \) is the non-monotonic focusing coefficient given by \( r = \frac{\beta}{\delta \alpha^{\beta - \delta}} \) with hyperparameters \( \alpha \) and \( \delta \), and \( \beta = \frac{L_{\text{IoU}}}{\bar{L}_{\text{IoU}}} \) is the outlier degree, with \( \bar{L}_{\text{IoU}} \) the running mean of the IoU loss. The distance penalty \( R_{\text{WIoU}} \) is computed as: $$ R_{\text{WIoU}} = \exp \left( \frac{(x - x_{\text{gt}})^2 + (y - y_{\text{gt}})^2}{(W_g^2 + H_g^2)^*} \right) $$ where \( (x, y) \) and \( (x_{\text{gt}}, y_{\text{gt}}) \) are the center coordinates of the anchor and ground truth boxes, \( W_g \) and \( H_g \) are the width and height of the smallest box enclosing both, and the superscript \( * \) indicates the term is detached from the gradient computation. This loss function improves model generalization across diverse traffic scenarios observed by UAVs.
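Putting the three factors together gives a short loss routine. The sketch below is a NumPy illustration under stated assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, the running mean of the IoU loss is supplied by the caller rather than tracked internally, and the `alpha`/`delta` defaults are hyperparameter choices, not values confirmed by this article:

```python
import numpy as np

def iou_loss(box, gt):
    """L_IoU = 1 - IoU for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box) + area(gt) - inter
    return 1.0 - inter / union

def wiou_loss(box, gt, mean_liou, alpha=1.9, delta=3.0):
    """WIoU sketch: mean_liou is the running mean of L_IoU (assumed
    tracked by the caller); alpha and delta are focusing hyperparameters."""
    liou = iou_loss(box, gt)
    # Distance penalty R_WIoU; W_g, H_g come from the smallest enclosing
    # box and are treated as constants (the '*' detach in the formula).
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    Wg = max(box[2], gt[2]) - min(box[0], gt[0])
    Hg = max(box[3], gt[3]) - min(box[1], gt[1])
    r_wiou = np.exp(((cx - gx) ** 2 + (cy - gy) ** 2) / (Wg ** 2 + Hg ** 2))
    beta = liou / mean_liou                        # outlier degree
    r = beta / (delta * alpha ** (beta - delta))   # non-monotonic focusing
    return r * r_wiou * liou
```

Note how the focusing coefficient down-weights both very easy boxes (small \( \beta \)) and extreme outliers (large \( \beta \)), which is what "non-monotonic" refers to.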

To evaluate the proposed method, I designed experiments using a custom dataset that simulates real-world traffic conditions. This dataset includes three distinct traffic flow environments—free flow, synchronous flow, and blocking flow—captured by UAVs at altitudes of 30m, 60m, and 90m. The dataset comprises 6,604 images with five vehicle categories: cars, SUVs, vans, buses, and trucks. The distribution of vehicles varies across flows, reflecting realistic challenges in UAV-based detection. Additionally, I tested on the VisDrone2021 dataset to assess generalizability. Experimental settings are summarized in the table below, highlighting key parameters for training and evaluation.

Parameter            Configuration
CPU                  Intel i9-12900F
GPU                  RTX 3090 Ti
CUDA Version         11.7
Operating System     Ubuntu 18.04
Programming Language Python 3.8
Training-Test Split  8:2
Image Size           640×640
Batch Size           32
Optimizer            Adam
Epochs               300

Evaluation metrics include parameter count (Parameters), floating-point operations (GFLOPs), frames per second (FPS), the precision-recall (P-R) curve, and average precision (AP). The mean average precision (mAP) is computed as: $$ \text{mAP} = \frac{\sum_{i=1}^{N} \text{AP}_i}{N} $$ where \( N \) is the number of categories. GFLOPs are calculated using: $$ \text{GFLOPs} = \frac{2HW(K_h K_w C_{\text{in}} + 1) C_{\text{out}}}{10^9} $$ where \( H \) and \( W \) are feature map dimensions, \( K_h \) and \( K_w \) are kernel sizes, and \( C_{\text{in}} \) and \( C_{\text{out}} \) are input and output channels. These metrics provide a comprehensive assessment of model performance for UAV applications.
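Both formulas are direct one-liners; the helper names below are my own illustration of the definitions, not part of the evaluation code:

```python
def conv_gflops(H, W, Kh, Kw, Cin, Cout):
    """GFLOPs of one convolutional layer, per the formula above.
    A multiply-add counts as two operations; the +1 covers the bias."""
    return 2 * H * W * (Kh * Kw * Cin + 1) * Cout / 1e9

def mean_ap(aps):
    """mAP: the mean of the per-category AP values."""
    return sum(aps) / len(aps)
```

Summing `conv_gflops` over every layer of a network yields the GFLOPs figures reported in the tables below.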

I conducted ablation studies to validate the contribution of each module. First, I compared different attention mechanisms under various traffic flows. The results, shown in the table below, demonstrate that the Coordinate Attention (CA) mechanism achieves the best balance between accuracy and efficiency across all scenarios, making it suitable for real-time detection on UAV platforms.

Dataset Model Precision (%) Recall (%) mAP (%) Parameters (M) GFLOPs FPS
Free Flow YOLOv7 89.0 89.9 94.4 37.218 105.2 85.21
+CBAM 90.1 90.3 94.6 37.901 106.5 80.52
+SE 89.2 88.7 94.6 37.225 106.1 78.62
+SimAM 91.2 90.2 94.8 37.712 106.7 80.65
+ECA 91.5 91.2 94.3 37.232 105.2 83.58
+CoTNet 91.8 90.3 95.4 39.375 110.2 77.36
Proposed (CA) 92.9 90.7 95.8 37.803 105.7 81.98
Synchronous Flow YOLOv7 91.3 86.2 94.0 37.218 105.2 84.34
+CBAM 90.8 88.1 94.5 37.901 106.5 71.89
+SE 90.2 87.5 93.8 37.225 106.1 71.02
+SimAM 90.4 87.7 94.0 37.712 106.7 72.05
+ECA 89.7 87.1 93.4 37.232 105.2 74.46
+CoTNet 91.4 88.7 94.6 39.375 110.2 69.86
Proposed (CA) 90.4 89.6 94.9 37.803 105.7 72.46
Blocking Flow YOLOv7 83.0 90.5 91.1 37.218 105.2 82.72
+CBAM 86.8 90.8 93.1 37.901 106.5 67.16
+SE 87.2 90.5 92.9 37.225 106.1 67.03
+SimAM 86.7 92.1 93.3 37.712 106.7 68.26
+ECA 86.1 90.3 92.5 37.232 105.2 69.07
+CoTNet 87.1 91.2 93.2 39.375 110.2 65.37
Proposed (CA) 88.4 91.8 94.0 37.803 105.7 68.97

Next, I performed comprehensive ablation experiments on the proposed YOLO-QYF model by incrementally adding modules. The results, summarized in the table below, confirm that each component contributes to performance gains. For instance, replacing E-ELAN with QARepVGG reduces parameters while improving mAP across all traffic flows. Adding CARAFE enhances feature utilization, and integrating CA attention boosts localization accuracy. The WIoU loss further refines model generalization. Overall, YOLO-QYF achieves significant improvements over the baseline, with mAP values of 96.4%, 95.6%, and 94.5% for free flow, synchronous flow, and blocking flow, respectively, while maintaining real-time FPS rates above 76.

Dataset Model Configuration Precision (%) Recall (%) mAP (%) Parameters (M) GFLOPs FPS
Free Flow Baseline (YOLOv7) 89.0 89.9 94.4 37.218 105.2 85.21
+ QARepVGG 94.1 90.7 96.0 36.590 129.5 78.74
+ CARAFE 91.3 91.1 95.9 37.263 105.3 81.30
+ CA 92.9 90.7 95.8 37.803 105.7 81.98
+ WIoU 89.3 90.9 94.8 37.218 105.2 75.76
YOLO-QYF (Full) 88.7 94.0 96.4 38.176 129.9 81.62
Synchronous Flow Baseline (YOLOv7) 91.3 86.2 94.0 37.218 105.2 84.34
+ QARepVGG 91.1 92.6 94.3 36.590 129.5 76.92
+ CARAFE 92.4 90.5 95.8 37.263 105.3 80.64
+ CA 90.4 89.6 94.9 37.803 105.7 72.46
+ WIoU 90.7 91.5 95.4 37.218 105.2 79.37
YOLO-QYF (Full) 90.4 91.6 95.6 38.176 129.9 78.13
Blocking Flow Baseline (YOLOv7) 83.0 90.5 91.1 37.218 105.2 82.72
+ QARepVGG 88.7 90.7 93.4 36.590 129.5 65.34
+ CARAFE 88.6 95.2 94.2 37.263 105.3 75.72
+ CA 88.4 91.8 94.0 37.803 105.7 68.97
+ WIoU 89.1 93.4 94.1 37.218 105.2 71.94
YOLO-QYF (Full) 91.0 94.3 94.5 38.176 129.9 76.34

To assess feasibility and advancement, I compared YOLO-QYF with state-of-the-art models across different traffic flows. The tables below present AP values for each vehicle category under free flow, synchronous flow, and blocking flow environments. These results highlight the robustness of my method in handling the scale variations and dense distributions common in UAV imagery. For instance, in free flow, YOLO-QYF achieves an mAP of 96.4%, outperforming YOLOv5, YOLOv6, YOLOv7, YOLOv8, FCOS, DETR, and QueryDet. Similar trends are observed in synchronous and blocking flows, demonstrating consistent performance improvements.

Performance Comparison on Free Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 98.0 95.8 87.9 98.6 95.4 95.2
YOLOv6 97.1 93.1 85.4 96.6 93.6 93.2
YOLOv7 97.9 95.7 86.5 99.3 92.5 94.4
YOLOv8 97.6 94.6 86.8 97.1 97.3 94.7
FCOS 97.5 94.8 87.6 98.2 94.8 94.6
DETR 96.8 95.1 88.5 97.0 97.6 95.0
QueryDet 98.2 95.8 89.2 98.9 96.4 95.7
YOLO-QYF 98.4 96.4 90.6 99.3 97.2 96.4
Performance Comparison on Synchronous Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 98.2 92.3 93.6 96.1 92.0 94.4
YOLOv6 96.5 91.7 90.8 95.1 92.9 93.4
YOLOv7 98.2 90.9 91.8 95.4 93.7 94.0
YOLOv8 98.1 92.6 91.3 95.0 91.9 93.8
FCOS 97.8 92.1 91.6 95.6 92.6 94.0
DETR 96.8 93.0 92.4 96.3 93.1 94.3
QueryDet 98.7 93.2 92.6 95.8 92.8 94.6
YOLO-QYF 98.4 94.1 95.5 97.4 92.9 95.6
Performance Comparison on Blocking Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 96.0 93.4 92.4 96.6 92.2 94.1
YOLOv6 93.1 92.8 83.6 96.1 91.1 91.3
YOLOv7 88.8 97.2 82.3 95.2 92.2 91.1
YOLOv8 95.9 93.5 91.9 97.1 88.1 93.3
FCOS 93.8 94.0 88.4 96.5 91.7 92.9
DETR 91.6 92.7 86.6 96.0 90.7 91.5
QueryDet 93.8 95.1 89.3 96.1 92.8 93.4
YOLO-QYF 94.0 97.2 90.1 96.4 94.7 94.5

I also evaluated the method on the VisDrone2021 dataset to test generalizability in mixed traffic density scenarios. The results, shown in the table below, indicate that YOLO-QYF achieves an mAP of 60.6%, outperforming the baseline models and demonstrating its effectiveness for diverse UAV applications. Notably, it shows improvements in car and bus detection, which are critical for traffic monitoring.

Performance Comparison on VisDrone2021 Dataset
Model Car AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 78.2 45.8 62.0 49.1 58.8
YOLOv6 77.4 47.6 60.5 50.1 58.9
YOLOv7 79.1 47.1 61.1 49.3 59.2
YOLOv8 78.9 48.3 60.8 50.6 59.7
FCOS 78.5 46.9 61.0 50.1 59.1
DETR 78.7 48.0 62.1 51.1 60.0
QueryDet 78.3 46.6 63.2 50.2 59.6
YOLO-QYF 79.6 48.2 63.6 50.9 60.6

Furthermore, I conducted cross-validation experiments to assess the generalization performance of YOLO-QYF. By training the model on one traffic flow and testing on the others, I observed that weights trained in the synchronous flow environment yield the best overall results, achieving mAP values of 97.3% on free flow and 95.8% on blocking flow. This indicates that the model learns robust features adaptable to various conditions, which is essential for real-world deployment of UAVs in dynamic settings.

In conclusion, the proposed YOLO-QYF method effectively addresses key challenges in UAV-based multi-vehicle target recognition, including scale variations, dense distributions, and class imbalance. By integrating QARepVGG, CARAFE, Coordinate Attention, and the WIoU loss, the model achieves superior accuracy and real-time performance across diverse traffic scenarios captured by UAVs. Experimental results on custom and benchmark datasets validate its feasibility and advancement over existing methods. Future work will focus on optimizing model deployment on UAV platforms, exploring more efficient feature extraction and fusion techniques to further reduce model size and enhance practicality. This research contributes to the growing field of aerial surveillance, empowering UAVs with reliable and efficient target detection capabilities for applications in traffic management, security, and beyond.
