A Deep Learning-Based Method for Multi-Motion Target Recognition in UAV Platforms

In recent years, the integration of unmanned aerial vehicles (UAVs), commonly known as drones, into various sectors has revolutionized data acquisition and monitoring capabilities. UAVs offer unparalleled flexibility in capturing high-resolution imagery from multiple altitudes and angles, making them ideal for applications such as environmental surveillance, security patrols, and urban management. In road traffic safety supervision in particular, UAVs equipped with target detection algorithms enable real-time monitoring of vehicle movements. However, recognizing multiple moving targets from aerial perspectives poses significant challenges, including large scale variations, irregular dense distributions, and class imbalance, which often limit model performance and scalability. To address these issues, I propose an improved vehicle target recognition method based on YOLOv7, named YOLO-QYF, designed specifically for UAV platforms. This method enhances detection accuracy and efficiency through several key modifications, which I detail in this article.

The core of my approach involves four strategic enhancements to the baseline YOLOv7 model. First, I introduce the QARepVGG module to replace the computationally expensive Extended Efficient Layer Aggregation Network (E-ELAN) in the backbone, reducing model parameters while maintaining performance. Second, I incorporate a Content-Aware Feature Reassembly (CARAFE) module during feature fusion to minimize information loss during upsampling, thereby reducing missed detections in dense scenarios. Third, I add a Coordinate Attention (CA) mechanism to the early layers of the feature extraction network to enhance the localization capability for targets of interest, especially small objects. Finally, I adopt the WIoU loss function to mitigate sample imbalance issues and improve model generalization. These improvements collectively aim to optimize the model for the dynamic and complex environments captured by UAVs.

To understand the methodology, let me start with the network architecture of YOLO-QYF. The overall framework builds upon YOLOv7 but integrates the aforementioned modules to address specific limitations in UAV-based target detection. The backbone network utilizes QARepVGG blocks, which employ a multi-branch design during training that is fused into a single 3×3 convolutional branch for inference. This fusion reduces computational overhead without sacrificing accuracy. The fusion of a convolutional layer and its batch normalization layer can be expressed as: $$ \text{BN}(\text{Conv}(x)) = W_{\text{fused}} * x + b_{\text{fused}} $$ where $$ W_{\text{fused}} = \frac{\gamma \times W}{\sqrt{\sigma^2 + \epsilon}} $$ and $$ b_{\text{fused}} = \frac{\gamma \times (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta $$ with \( \gamma \) and \( \beta \) the batch normalization scale and shift, \( \mu \) and \( \sigma^2 \) the running mean and variance, and \( \epsilon \) a small constant for numerical stability. This linear transformation simplifies the structure, making it more suitable for deployment on resource-constrained UAV platforms.
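The conv-BN folding above can be sketched in a few lines of NumPy. This is an illustrative helper of my own (function name and array layout are assumptions, not code from the implementation); it folds the BN statistics into the convolution weights exactly as the two formulas prescribe:

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    W: (C_out, C_in, k, k) conv weights; b: (C_out,) conv bias;
    gamma, beta, mu, var: (C_out,) BN scale, shift, running mean, variance.
    Returns (W_fused, b_fused) such that Conv(x; W_fused, b_fused)
    equals BN(Conv(x; W, b)).
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel factor
    W_fused = W * scale[:, None, None, None]    # gamma * W / sqrt(var + eps)
    b_fused = scale * (b - mu) + beta           # gamma * (b - mu) / sqrt(...) + beta
    return W_fused, b_fused
```

After fusion the BN layer disappears entirely, which is what makes the single-branch inference graph cheaper than the training-time multi-branch one.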

In the neck of the network, the CARAFE module performs content-aware upsampling to preserve semantic information during feature map transmission. Unlike traditional upsampling methods that focus on local pixel details, CARAFE predicts a reassembly kernel based on input features, allowing for adaptive recombination. The process involves two stages: kernel prediction and feature reassembly. For a given input feature map \( X \), the upsampling kernel \( W_r \) is predicted as: $$ W_r = \psi(N(X_l, K_{\text{encoder}})) $$ where \( \psi \) denotes the kernel prediction function, \( X_l \) is the input at location \( l = (i, j) \), and \( N(X_l, K) \) represents a \( K \times K \) sub-region centered at \( l \). The output feature map \( X'_{l'} \) is then computed as: $$ X'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_r(n, m) X(i+n, j+m) $$ where \( r = \lfloor K_{\text{up}} / 2 \rfloor \). This mechanism enhances the utilization of feature information, crucial for detecting small and densely packed vehicles in UAV imagery.
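The second stage, feature reassembly, can be sketched directly from the sum above. The snippet below is a minimal NumPy illustration of reassembly alone, under the assumption that the per-location kernels have already been predicted and softmax-normalized elsewhere (the kernel-prediction convolution is not modeled here):

```python
import numpy as np

def carafe_reassemble(X, kernels, scale=2):
    """CARAFE feature reassembly (second stage only).

    X: (C, H, W) input feature map.
    kernels: (scale*H, scale*W, K, K) predicted reassembly kernels,
             assumed already softmax-normalized per output location.
    Each output pixel l' is a weighted sum over the KxK neighborhood
    of the source location l = l' // scale.
    """
    C, H, W = X.shape
    Ho, Wo, K, _ = kernels.shape
    r = K // 2
    Xp = np.pad(X, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((C, Ho, Wo), dtype=X.dtype)
    for ip in range(Ho):
        for jp in range(Wo):
            i, j = ip // scale, jp // scale        # source location l
            patch = Xp[:, i:i + K, j:j + K]        # KxK sub-region N(X_l, K)
            out[:, ip, jp] = (patch * kernels[ip, jp]).sum(axis=(1, 2))
    return out
```

With a trivial 1×1 kernel of all ones this degenerates to nearest-neighbor upsampling; the learned kernels are what make the recombination content-aware.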

To further improve target localization, I integrate the Coordinate Attention (CA) mechanism into the feature extraction network. CA captures long-range dependencies along spatial dimensions by decomposing attention into height and width directions, which is beneficial for identifying targets in cluttered scenes. For an input feature map of size \( H \times W \times C \), average pooling is applied along the horizontal and vertical directions to obtain feature maps \( Z^h_c(h) \) and \( Z^w_c(w) \): $$ Z^h_c(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) $$ $$ Z^w_c(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) $$. These are then concatenated and transformed through a shared 1×1 convolution, then split and passed through sigmoid activations to produce attention weights \( g^h \) and \( g^w \). The final output is weighted as: $$ y_c(i, j) = x_c(i, j) g^h_c(i) g^w_c(j) $$. This attention mechanism helps the model focus on relevant regions, enhancing detection accuracy for vehicles captured by UAVs.
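The data flow above is easy to make concrete. The NumPy sketch below follows the pooling and re-weighting steps; as a simplifying assumption, the shared learned 1×1 convolution is replaced by the identity, so the attention weights are just sigmoids of the pooled features rather than a trained transform:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coordinate_attention(x):
    """Minimal Coordinate Attention forward pass (NumPy sketch).

    x: (C, H, W). Directional average pooling gives z_h (C, H) and
    z_w (C, W); the learned 1x1 transform is omitted here (assumption),
    so the gates are sigmoids of the pooled features directly.
    """
    z_h = x.mean(axis=2)            # Z^h_c(h) = (1/W) sum_i x_c(h, i)
    z_w = x.mean(axis=1)            # Z^w_c(w) = (1/H) sum_j x_c(j, w)
    g_h = sigmoid(z_h)              # (C, H) height-direction gate
    g_w = sigmoid(z_w)              # (C, W) width-direction gate
    # y_c(i, j) = x_c(i, j) * g^h_c(i) * g^w_c(j)
    return x * g_h[:, :, None] * g_w[:, None, :]
```

The key design point is that each output pixel is modulated by one weight per row and one per column, so the module encodes position along both axes at negligible parameter cost.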

For loss function optimization, I replace the CIoU loss with WIoU to address sample imbalance. The WIoU loss introduces a dynamic non-monotonic focusing mechanism that reduces the influence of low-quality samples during training. It is defined as: $$ L_{\text{WIoU}} = r \, R_{\text{WIoU}} \, L_{\text{IoU}} $$ where \( L_{\text{IoU}} = 1 - \frac{|B \cap B_{\text{gt}}|}{|B \cup B_{\text{gt}}|} \) is the IoU loss, \( r \) is the non-monotonic focusing coefficient given by \( r = \frac{\beta}{\delta \alpha^{\beta - \delta}} \) with hyperparameters \( \alpha \) and \( \delta \), and \( \beta = \frac{L_{\text{IoU}}}{\bar{L}_{\text{IoU}}} \) is the outlier degree, with \( \bar{L}_{\text{IoU}} \) the running mean of the IoU loss. The distance penalty \( R_{\text{WIoU}} \) is computed as: $$ R_{\text{WIoU}} = \exp \left( \frac{(x - x_{\text{gt}})^2 + (y - y_{\text{gt}})^2}{(W_g^2 + H_g^2)^*} \right) $$ where \( (x, y) \) and \( (x_{\text{gt}}, y_{\text{gt}}) \) are the center coordinates of the anchor and ground truth boxes, \( W_g \) and \( H_g \) are the width and height of the smallest box enclosing both, and the superscript \( * \) indicates the term is detached from the gradient computation. This loss function improves model generalization across diverse traffic scenarios observed by UAVs.
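Putting the three factors together gives a short loss routine. The sketch below is a NumPy illustration under stated assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, the running mean of the IoU loss is supplied by the caller rather than tracked internally, and the `alpha`/`delta` defaults are hyperparameter choices, not values confirmed by this article:

```python
import numpy as np

def iou_loss(box, gt):
    """L_IoU = 1 - IoU for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box) + area(gt) - inter
    return 1.0 - inter / union

def wiou_loss(box, gt, mean_liou, alpha=1.9, delta=3.0):
    """WIoU sketch: mean_liou is the running mean of L_IoU (assumed
    tracked by the caller); alpha and delta are focusing hyperparameters."""
    liou = iou_loss(box, gt)
    # Distance penalty R_WIoU; W_g, H_g come from the smallest enclosing
    # box and are treated as constants (the '*' detach in the formula).
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    Wg = max(box[2], gt[2]) - min(box[0], gt[0])
    Hg = max(box[3], gt[3]) - min(box[1], gt[1])
    r_wiou = np.exp(((cx - gx) ** 2 + (cy - gy) ** 2) / (Wg ** 2 + Hg ** 2))
    beta = liou / mean_liou                        # outlier degree
    r = beta / (delta * alpha ** (beta - delta))   # non-monotonic focusing
    return r * r_wiou * liou
```

Note how the focusing coefficient down-weights both very easy boxes (small \( \beta \)) and extreme outliers (large \( \beta \)), which is what "non-monotonic" refers to.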

To evaluate the proposed method, I designed experiments using a custom dataset that simulates real-world traffic conditions. This dataset includes three distinct traffic flow environments—free flow, synchronous flow, and blocking flow—captured by UAVs at altitudes of 30m, 60m, and 90m. The dataset comprises 6,604 images with five vehicle categories: cars, SUVs, vans, buses, and trucks. The distribution of vehicles varies across flows, reflecting realistic challenges in UAV-based detection. Additionally, I tested on the VisDrone2021 dataset to assess generalizability. Experimental settings are summarized in the table below, highlighting key parameters for training and evaluation.

Parameter            Configuration
CPU                  Intel i9-12900F
GPU                  RTX 3090 Ti
CUDA Version         11.7
Operating System     Ubuntu 18.04
Programming Language Python 3.8
Training-Test Split  8:2
Image Size           640×640
Batch Size           32
Optimizer            Adam
Epochs               300

Evaluation metrics include parameter count (Parameters), floating-point operations (GFLOPs), frames per second (FPS), the precision-recall (P-R) curve, and average precision (AP). The mean average precision (mAP) is computed as: $$ \text{mAP} = \frac{\sum_{i=1}^{N} \text{AP}_i}{N} $$ where \( N \) is the number of categories. GFLOPs are calculated using: $$ \text{GFLOPs} = \frac{2HW(K_h K_w C_{\text{in}} + 1) C_{\text{out}}}{10^9} $$ where \( H \) and \( W \) are feature map dimensions, \( K_h \) and \( K_w \) are kernel sizes, and \( C_{\text{in}} \) and \( C_{\text{out}} \) are input and output channels. These metrics provide a comprehensive assessment of model performance for UAV applications.
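Both formulas are direct one-liners; the helper names below are my own illustration of the definitions, not part of the evaluation code:

```python
def conv_gflops(H, W, Kh, Kw, Cin, Cout):
    """GFLOPs of one convolutional layer, per the formula above.
    A multiply-add counts as two operations; the +1 covers the bias."""
    return 2 * H * W * (Kh * Kw * Cin + 1) * Cout / 1e9

def mean_ap(aps):
    """mAP: the mean of the per-category AP values."""
    return sum(aps) / len(aps)
```

Summing `conv_gflops` over every layer of a network yields the GFLOPs figures reported in the tables below.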

I conducted ablation studies to validate the contribution of each module. First, I compared different attention mechanisms under various traffic flows. The results, shown in the table below, demonstrate that the Coordinate Attention (CA) mechanism achieves the best balance between accuracy and efficiency across all scenarios, making it suitable for real-time detection on UAV platforms.

Dataset Model Precision (%) Recall (%) mAP (%) Parameters (M) GFLOPs FPS
Free Flow YOLOv7 89.0 89.9 94.4 37.218 105.2 85.21
+CBAM 90.1 90.3 94.6 37.901 106.5 80.52
+SE 89.2 88.7 94.6 37.225 106.1 78.62
+SimAM 91.2 90.2 94.8 37.712 106.7 80.65
+ECA 91.5 91.2 94.3 37.232 105.2 83.58
+CoTNet 91.8 90.3 95.4 39.375 110.2 77.36
Proposed (CA) 92.9 90.7 95.8 37.803 105.7 81.98
Synchronous Flow YOLOv7 91.3 86.2 94.0 37.218 105.2 84.34
+CBAM 90.8 88.1 94.5 37.901 106.5 71.89
+SE 90.2 87.5 93.8 37.225 106.1 71.02
+SimAM 90.4 87.7 94.0 37.712 106.7 72.05
+ECA 89.7 87.1 93.4 37.232 105.2 74.46
+CoTNet 91.4 88.7 94.6 39.375 110.2 69.86
Proposed (CA) 90.4 89.6 94.9 37.803 105.7 72.46
Blocking Flow YOLOv7 83.0 90.5 91.1 37.218 105.2 82.72
+CBAM 86.8 90.8 93.1 37.901 106.5 67.16
+SE 87.2 90.5 92.9 37.225 106.1 67.03
+SimAM 86.7 92.1 93.3 37.712 106.7 68.26
+ECA 86.1 90.3 92.5 37.232 105.2 69.07
+CoTNet 87.1 91.2 93.2 39.375 110.2 65.37
Proposed (CA) 88.4 91.8 94.0 37.803 105.7 68.97

Next, I performed comprehensive ablation experiments on the proposed YOLO-QYF model by incrementally adding modules. The results, summarized in the table below, confirm that each component contributes to performance gains. For instance, replacing E-ELAN with QARepVGG reduces parameters while improving mAP across all traffic flows. Adding CARAFE enhances feature utilization, and integrating CA attention boosts localization accuracy. The WIoU loss further refines model generalization. Overall, YOLO-QYF achieves significant improvements over the baseline, with mAP values of 96.4%, 95.6%, and 94.5% for free flow, synchronous flow, and blocking flow, respectively, while maintaining real-time FPS rates above 76.

Dataset Model Configuration Precision (%) Recall (%) mAP (%) Parameters (M) GFLOPs FPS
Free Flow Baseline (YOLOv7) 89.0 89.9 94.4 37.218 105.2 85.21
+ QARepVGG 94.1 90.7 96.0 36.590 129.5 78.74
+ CARAFE 91.3 91.1 95.9 37.263 105.3 81.30
+ CA 92.9 90.7 95.8 37.803 105.7 81.98
+ WIoU 89.3 90.9 94.8 37.218 105.2 75.76
YOLO-QYF (Full) 88.7 94.0 96.4 38.176 129.9 81.62
Synchronous Flow Baseline (YOLOv7) 91.3 86.2 94.0 37.218 105.2 84.34
+ QARepVGG 91.1 92.6 94.3 36.590 129.5 76.92
+ CARAFE 92.4 90.5 95.8 37.263 105.3 80.64
+ CA 90.4 89.6 94.9 37.803 105.7 72.46
+ WIoU 90.7 91.5 95.4 37.218 105.2 79.37
YOLO-QYF (Full) 90.4 91.6 95.6 38.176 129.9 78.13
Blocking Flow Baseline (YOLOv7) 83.0 90.5 91.1 37.218 105.2 82.72
+ QARepVGG 88.7 90.7 93.4 36.590 129.5 65.34
+ CARAFE 88.6 95.2 94.2 37.263 105.3 75.72
+ CA 88.4 91.8 94.0 37.803 105.7 68.97
+ WIoU 89.1 93.4 94.1 37.218 105.2 71.94
YOLO-QYF (Full) 91.0 94.3 94.5 38.176 129.9 76.34

To assess feasibility and advancement, I compared YOLO-QYF with state-of-the-art models across different traffic flows. The tables below present AP values for each vehicle category under free flow, synchronous flow, and blocking flow environments. These results highlight the robustness of my method in handling the scale variations and dense distributions common in UAV imagery. For instance, in free flow, YOLO-QYF achieves an mAP of 96.4%, outperforming YOLOv5, YOLOv6, YOLOv7, YOLOv8, FCOS, DETR, and QueryDet. Similar trends are observed in synchronous and blocking flows, demonstrating consistent performance improvements.

Performance Comparison on Free Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 98.0 95.8 87.9 98.6 95.4 95.2
YOLOv6 97.1 93.1 85.4 96.6 93.6 93.2
YOLOv7 97.9 95.7 86.5 99.3 92.5 94.4
YOLOv8 97.6 94.6 86.8 97.1 97.3 94.7
FCOS 97.5 94.8 87.6 98.2 94.8 94.6
DETR 96.8 95.1 88.5 97.0 97.6 95.0
QueryDet 98.2 95.8 89.2 98.9 96.4 95.7
YOLO-QYF 98.4 96.4 90.6 99.3 97.2 96.4
Performance Comparison on Synchronous Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 98.2 92.3 93.6 96.1 92.0 94.4
YOLOv6 96.5 91.7 90.8 95.1 92.9 93.4
YOLOv7 98.2 90.9 91.8 95.4 93.7 94.0
YOLOv8 98.1 92.6 91.3 95.0 91.9 93.8
FCOS 97.8 92.1 91.6 95.6 92.6 94.0
DETR 96.8 93.0 92.4 96.3 93.1 94.3
QueryDet 98.7 93.2 92.6 95.8 92.8 94.6
YOLO-QYF 98.4 94.1 95.5 97.4 92.9 95.6
Performance Comparison on Blocking Flow Environment
Model Car AP (%) SUV AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 96.0 93.4 92.4 96.6 92.2 94.1
YOLOv6 93.1 92.8 83.6 96.1 91.1 91.3
YOLOv7 88.8 97.2 82.3 95.2 92.2 91.1
YOLOv8 95.9 93.5 91.9 97.1 88.1 93.3
FCOS 93.8 94.0 88.4 96.5 91.7 92.9
DETR 91.6 92.7 86.6 96.0 90.7 91.5
QueryDet 93.8 95.1 89.3 96.1 92.8 93.4
YOLO-QYF 94.0 97.2 90.1 96.4 94.7 94.5

I also evaluated the method on the VisDrone2021 dataset to test generalizability in mixed traffic density scenarios. The results, shown in the table below, indicate that YOLO-QYF achieves an mAP of 60.6%, outperforming the baseline models and demonstrating its effectiveness for diverse UAV applications. Notably, it shows improvements in car and bus detection, which are critical for traffic monitoring.

Performance Comparison on VisDrone2021 Dataset
Model Car AP (%) Van AP (%) Bus AP (%) Truck AP (%) mAP (%)
YOLOv5 78.2 45.8 62.0 49.1 58.8
YOLOv6 77.4 47.6 60.5 50.1 58.9
YOLOv7 79.1 47.1 61.1 49.3 59.2
YOLOv8 78.9 48.3 60.8 50.6 59.7
FCOS 78.5 46.9 61.0 50.1 59.1
DETR 78.7 48.0 62.1 51.1 60.0
QueryDet 78.3 46.6 63.2 50.2 59.6
YOLO-QYF 79.6 48.2 63.6 50.9 60.6

Furthermore, I conducted cross-validation experiments to assess the generalization performance of YOLO-QYF. By training the model on one traffic flow and testing on the others, I observed that weights trained in the synchronous flow environment yield the best overall results, achieving mAP values of 97.3% on free flow and 95.8% on blocking flow. This indicates that the model learns robust features adaptable to various conditions, which is essential for real-world deployment of UAVs in dynamic settings.

In conclusion, the proposed YOLO-QYF method effectively addresses key challenges in UAV-based multi-vehicle target recognition, including scale variations, dense distributions, and class imbalance. By integrating QARepVGG, CARAFE, Coordinate Attention, and the WIoU loss, the model achieves superior accuracy and real-time performance across diverse traffic scenarios captured by UAVs. Experimental results on custom and benchmark datasets validate its feasibility and advancement over existing methods. Future work will focus on optimizing model deployment on UAV platforms, exploring more efficient feature extraction and fusion techniques to further reduce model size and enhance practicality. This research contributes to the growing field of aerial surveillance, empowering UAVs with reliable and efficient target detection capabilities for applications in traffic management, security, and beyond.
