Improved Target Vehicle Detection Algorithm for UAV Infrared Imagery

The rapid advancement of UAVs has significantly expanded their application potential across fields such as target detection, urban traffic monitoring, and military reconnaissance. The integration of infrared imaging systems further enhances UAV capabilities, enabling all-weather detection and recognition by capturing thermal radiation from targets, even in low-light or adverse weather conditions. However, target detection in UAV-captured infrared imagery presents several challenges. Infrared images often exhibit low resolution, with targets appearing as small objects carrying limited feature information, which makes accurate identification difficult. Additionally, complex backgrounds with thermal characteristics similar to those of targets can lead to false positives and missed detections. Imaging interference factors, such as atmospheric attenuation and sensor noise, further degrade image quality and complicate detection. Improving detection accuracy in UAV infrared imagery has therefore become a critical research focus. In this context, I propose an enhanced algorithm based on YOLOv8n, termed YOLOv8n-LMS (YOLOv8n with Local-aware attention, Multi-scale feature fusion, and Small-target optimization), designed to address these limitations and boost detection performance for target vehicles in UAV infrared imagery.

Traditional methods for detecting weak, small targets in infrared imagery often rely on spatial filtering techniques that analyze the characteristics of targets and their surrounding backgrounds. For instance, algorithms such as the multiscale patch-based contrast measure (MPCM) filter infrared images to detect small targets across various scales and orientations. However, such methods frequently lack robustness in complex scenarios. In recent years, deep learning has substantially improved target detection accuracy for UAVs. Single-stage detectors, such as SSD (Single Shot MultiBox Detector) and the YOLO (You Only Look Once) series, recast target detection as a regression problem, improving both detection precision and real-time performance. Despite these advances, challenges persist in detecting small targets and extracting discriminative features from infrared imagery, including weak feature representation and susceptibility to background clutter. This paper introduces an improved algorithm based on YOLOv8n to address these issues and enhance detection accuracy in UAV infrared imagery.

The contributions of this work are threefold. First, I incorporate a Local Region Self-Attention (LRSA) mechanism into the backbone of YOLOv8n, constructing a C2f-LRSA feature extraction module. This module employs an efficient local self-attention mechanism suited to tasks such as image super-resolution, low-light enhancement, object detection, and lightweight computer vision applications. It strengthens local feature representation while keeping the model lightweight, improving the accuracy and quality of UAV visual tasks. Second, I design a Multi-scale Edge-enhanced Upsampling Module (MEUM) that upsamples feature maps to higher resolutions while compensating for the loss of fine edge detail that upsampling normally causes, improving the model's ability to capture object boundaries in UAV imagery. Third, I introduce a P2 small-target detection layer that leverages high-resolution shallow feature maps to capture fine-grained local features of small targets. This layer is particularly effective in scenarios requiring high-precision small-target detection, such as UAV aerial imagery or video surveillance, and effectively reduces the miss rate for small targets.

Overview of the YOLOv8n Algorithm

The YOLOv8 series encompasses multiple model variants (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) to cater to diverse accuracy and performance requirements. Among these, YOLOv8n is the most lightweight version, balancing detection efficiency against computational cost, which makes it suitable for resource-constrained real-time applications such as UAV deployment. The architecture of YOLOv8n consists of three core components: the Backbone, the Neck, and the Head. The Backbone extracts fundamental features from input images. The Neck integrates features from different hierarchical levels, enhancing the model's capability to detect multi-scale targets, a crucial aspect for UAV imagery where object sizes vary significantly. The Head produces the final detection results.

YOLOv8n introduces several structural improvements over its predecessors. It replaces the C3 module used in YOLOv5 with a more efficient C2f module, which enhances feature fusion, training efficiency, and detection robustness. In the feature integration stage, YOLOv8n incorporates an enhanced Path Aggregation Feature Pyramid Network (PAFPN) that facilitates cross-scale information flow between the upsampling and downsampling paths, enabling more comprehensive fusion of multi-level feature maps in both spatial and semantic dimensions. This significantly strengthens the model's ability to discern multi-scale targets, which is vital for UAV applications. For the Head, YOLOv8n adopts an anchor-free mechanism, eliminating the dependency on predefined anchor boxes and improving generalization. Furthermore, YOLOv8n refines its loss functions, pairing Distribution Focal Loss (DFL) with CIoU loss for bounding box regression to improve localization accuracy and spatial positioning.

The detection formulation of YOLOv8n can be summarized through its loss function. Because the head is anchor-free and decoupled, the model directly predicts, at every location of each output feature map, a class probability vector and a discretized distribution over bounding box offsets; there is no separate objectness branch as in earlier YOLO versions. The total loss $L$ combines a classification loss $L_{cls}$ (binary cross-entropy) with two bounding box regression terms, the CIoU loss $L_{CIoU}$ and the Distribution Focal Loss $L_{DFL}$:

$$L = \lambda_{cls} L_{cls} + \lambda_{box} L_{CIoU} + \lambda_{dfl} L_{DFL}$$

where $\lambda_{cls}$, $\lambda_{box}$, and $\lambda_{dfl}$ are weighting coefficients. The DFL term represents each box edge as a discrete probability distribution rather than a single value, which sharpens localization and is beneficial for detecting vehicles in challenging UAV infrared imagery.
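To make the role of DFL concrete, the following minimal NumPy sketch computes the Distribution Focal Loss for one box edge; the bin probabilities and target offset are illustrative values, not taken from the paper.

```python
import numpy as np

def dfl(probs, target):
    """Distribution Focal Loss for one bounding box edge.

    probs: softmax probabilities over the discrete offset bins 0..n-1
    target: continuous ground-truth offset lying between two adjacent bins
    DFL concentrates probability mass on the two bins that bracket the
    target, weighted by how close the target is to each bin.
    """
    lo = int(np.floor(target))
    hi = lo + 1
    w_lo = hi - target          # weight for the lower bin
    w_hi = target - lo          # weight for the upper bin
    return -(w_lo * np.log(probs[lo]) + w_hi * np.log(probs[hi]))

# A target of 2.7 lies between bins 2 and 3, so those two bins carry the loss.
p = np.array([0.05, 0.05, 0.30, 0.55, 0.05])
loss = dfl(p, 2.7)
```

The loss is minimized when the predicted distribution places weight 0.3 on bin 2 and 0.7 on bin 3, i.e. exactly bracketing the continuous target.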

The Proposed YOLOv8n-LMS Algorithm

Despite the strong performance of YOLOv8n in general object detection tasks, it exhibits limitations on small targets and low-quality images, often producing false positives and missed detections. To address these issues, I propose YOLOv8n-LMS, an improved algorithm tailored for target vehicle detection in UAV infrared imagery. The enhancements comprise a C2f-LRSA module with local-aware attention, a MEUM module for multi-scale feature fusion, and a P2 small-target detection head. These modifications collectively boost detection accuracy and robustness.

C2f-LRSA Module with Local Region Self-Attention

In UAV infrared vehicle detection tasks, complex environmental conditions pose significant challenges. Thermal radiation signatures of target vehicles can become indistinct due to background interference, atmospheric attenuation, or partial occlusion. Moreover, the low contrast of infrared imagery makes it difficult for models to capture key vehicle features, such as edges or thermal regions like headlights. To enhance local feature extraction, I introduce a lightweight Local Region Self-Attention (LRSA) module into the backbone of YOLOv8n. LRSA is an efficient feature modeling approach that combines local partitioning with self-attention. Its core idea is to divide the input feature map into overlapping local patches using a sliding window, then perform multi-head self-attention independently within each patch. This strengthens the modeling of relationships between pixels in local regions while preserving sensitivity to fine detail.

The LRSA module operates as follows: Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, it is first partitioned into patches of size $ps \times ps$. For each patch, linear transformations generate query ($Q$), key ($K$), and value ($V$) matrices. Self-attention is computed within the patch, and the outputs are reconstructed to form the complete feature map. A depthwise separable convolution-based feedforward network (ConvFFN) is then applied to further extract local contextual information. The process incorporates LayerNorm and residual connections to ensure training stability and feature expressiveness. The attention computation for a patch can be expressed as:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. By restricting attention to patches, LRSA reduces the computational complexity from the $O((HW)^2)$ of global self-attention to roughly $O(HW \cdot ps^2)$, while still capturing the fine-grained patterns essential for small targets in UAV imagery.
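As a rough illustration (not the exact module), the NumPy sketch below computes self-attention independently inside non-overlapping $ps \times ps$ patches of a feature map. The paper's LRSA uses overlapping windows and learned multi-head projections; the projection matrices here are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(x, ps):
    """Self-attention computed independently inside non-overlapping
    ps x ps patches of a feature map x of shape (H, W, C).
    The Q/K/V projections are random stand-ins for learned weights."""
    H, W, C = x.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    out = np.empty_like(x)
    for i in range(0, H, ps):
        for j in range(0, W, ps):
            patch = x[i:i+ps, j:j+ps].reshape(-1, C)   # ps*ps tokens
            Q, K, V = patch @ Wq, patch @ Wk, patch @ Wv
            attn = softmax(Q @ K.T / np.sqrt(C))       # (ps*ps, ps*ps)
            out[i:i+ps, j:j+ps] = (attn @ V).reshape(ps, ps, C)
    return out

feat = np.random.default_rng(1).standard_normal((8, 8, 4))
y = local_self_attention(feat, ps=4)   # attention over four 4x4 patches
```

Each patch attends only among its own 16 tokens, which is what keeps the cost linear in the number of patches rather than quadratic in the full map.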

Based on the LRSA mechanism, I construct the C2f-LRSA module to replace the standard C2f module in YOLOv8n. The C2f-LRSA module first applies two convolutional layers to the input features, then integrates the LRSA module to enhance local feature extraction. This design improves local feature representation while also strengthening global perception and context utilization, making it better suited to the nuances of UAV infrared imagery. The structure of the C2f-LRSA module is summarized in the following table:

| Layer  | Operation                   | Output Shape           |
|--------|-----------------------------|------------------------|
| Input  | -                           | $H \times W \times C$  |
| Conv1  | 1×1 convolution             | $H \times W \times C'$ |
| Conv2  | 3×3 convolution             | $H \times W \times C'$ |
| LRSA   | Local Region Self-Attention | $H \times W \times C'$ |
| Output | Concatenation & convolution | $H \times W \times C$  |

Multi-scale Edge-enhanced Upsampling Module (MEUM)

To address the loss of edge detail during upsampling in the neck network, I design the Multi-scale Edge-enhanced Upsampling Module (MEUM). This module enriches the semantic representation of feature maps by fusing multi-scale edge information with the main-branch features, which is crucial for accurately delineating vehicle boundaries in UAV infrared imagery. MEUM first upsamples the input feature map $X_{in}$ to double its resolution. A main branch built from a 1×1 convolution, channel normalization, and a Sigmoid activation produces the initial features. In parallel, the module introduces $N$ scale paths, each applying successive 3×3 average pooling operations to obtain multi-scale blurred features. A custom edge enhancer then strengthens edges by computing the residual between the feature map and its blurred versions, passing it through a convolutional layer, and adding the result back to the original features. This significantly improves boundary sensitivity. Finally, the features from all scale branches are concatenated with the main branch and fused through a convolutional layer to produce the enhanced output feature map $X_{out}$.

The edge enhancement operation can be formulated as:

$$E = \text{Conv}(X_{\text{in}} - \text{Blur}(X_{\text{in}}))$$

where $\text{Blur}(\cdot)$ denotes the average pooling operation, and $\text{Conv}(\cdot)$ is a convolutional layer. The output $X_{\text{out}}$ is then:

$$X_{\text{out}} = \text{Conv}(\text{Concat}(X_{\text{main}}, E_1, E_2, \dots, E_N))$$

where $X_{\text{main}}$ is the main-branch feature and $E_i$ are the edge-enhanced features from the different scales. This multi-scale approach preserves both coarse and fine edge information, strengthening the model's ability to detect vehicles under the varied conditions UAVs encounter.
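A minimal single-channel NumPy sketch of the edge-residual idea behind MEUM, with a plain scale factor standing in for the learned convolution (the real module operates on multi-channel maps and concatenates the branches rather than summing them):

```python
import numpy as np

def box_blur(x, k=3):
    """k x k average pooling with stride 1 and zero padding at the border."""
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p)               # zero-pad both dimensions by p
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i+k, j:j+k].mean()
    return out

def edge_enhance(x, n_scales=2, alpha=1.0):
    """Each scale path blurs the input once more; the residual
    x - blur(x) isolates edge detail, which is scaled (alpha stands in
    for the learned Conv) and added back to the original features."""
    edges = []
    blurred = x
    for _ in range(n_scales):
        blurred = box_blur(blurred)
        edges.append(alpha * (x - blurred))   # residual = edge detail
    return x + sum(edges)

img = np.zeros((6, 6))
img[:, 3:] = 1.0                    # vertical step edge between cols 2 and 3
enhanced = edge_enhance(img)
```

Flat regions are left untouched (their blur residual is zero), while pixels on the step edge are pushed past their original value, which is the boundary-sharpening effect the module exploits.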

P2 Small-Target Detection Layer

In the original YOLOv8n architecture, the detection head consists of three layers, P3, P4, and P5, corresponding to feature maps at different depths. While effective for general object detection, this design struggles with small targets: deep feature maps lose spatial detail through repeated downsampling, leading to missed detections. To mitigate this, I introduce a shallow detection head, P2, into YOLOv8n, forming a four-head detection structure. The P2 head operates on high-resolution shallow feature maps that retain more spatial detail and texture. Compared with deeper layers, the P2 feature maps have a smaller receptive field and focus on local information, making them well suited to capturing fine-grained features such as the edges and contours of small targets. This enables more precise localization and recognition of vehicles in UAV infrared imagery.

The P2 layer is designed to be lightweight, adding only minimal computational overhead. It integrates with the feature fusion structures (e.g., FPN or PAN) to maintain information flow with the deeper layers. Including P2 markedly improves the recall rate and detection accuracy for small targets, enhancing overall performance in the complex scenarios typical of UAV operations. The four-head detection structure is summarized below:

| Detection Head | Feature Map Resolution | Primary Role                |
|----------------|------------------------|-----------------------------|
| P2             | High (e.g., 160×160)   | Small target detection      |
| P3             | Medium (e.g., 80×80)   | Medium target detection     |
| P4             | Low (e.g., 40×40)      | Large target detection      |
| P5             | Very low (e.g., 20×20) | Very large target detection |
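The resolutions in the table follow directly from the usual YOLO downsampling strides (assumed here to be 4, 8, 16, and 32 for P2 through P5) applied to the 640-pixel input used later in the experiments:

```python
# Grid sizes of the four detection heads for a 640x640 input, assuming
# the conventional YOLO strides of 4, 8, 16, and 32 for P2..P5.
IMGSZ = 640
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

grids = {head: IMGSZ // s for head, s in STRIDES.items()}
for head, side in grids.items():
    print(f"{head}: {side}x{side} grid ({side * side} candidate locations)")
```

The 160×160 P2 grid offers four times as many candidate locations as P3, which is why it can localize targets only a few pixels wide.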

Experimental Setup and Evaluation Metrics

To validate the proposed YOLOv8n-LMS algorithm, I conduct experiments on a publicly available infrared UAV vehicle detection dataset. The dataset comprises 17,990 infrared images captured from a top-down perspective by UAVs, covering various times and locations to ensure diversity. It is split into training, validation, and test sets in an 8:1:1 ratio, yielding 14,392 images for training, 1,799 for validation, and 1,799 for testing.

The experimental environment runs Windows 11 with an Intel Core i5-13500HX CPU and an NVIDIA GeForce RTX 4060 GPU; the software stack comprises Python 3.10.0, CUDA 11.8, and PyTorch 2.6.0. Uniform training settings are applied to all models: an input image size (imgsz) of 640 pixels, a batch size of 8, and 200 training epochs. The optimizer is Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01.
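Assuming the Ultralytics toolchain that ships YOLOv8, these settings would correspond to a training invocation along the following lines; the dataset YAML name is a placeholder, and flag names may vary between versions.

```shell
# Hypothetical Ultralytics CLI call mirroring the settings above;
# infrared_uav.yaml is a placeholder for the dataset config file.
yolo detect train model=yolov8n.pt data=infrared_uav.yaml \
    imgsz=640 batch=8 epochs=200 optimizer=SGD lr0=0.01
```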

To evaluate algorithm performance, I use standard metrics: Precision ($P$), Recall ($R$), and mean Average Precision (mAP). Precision and Recall are defined as:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

where $TP$ (True Positive) is the number of correctly identified positive samples, $FP$ (False Positive) is the number of negative samples incorrectly identified as positive, and $FN$ (False Negative) is the number of positive samples incorrectly identified as negative. The Average Precision (AP) for a class is computed as the area under the Precision-Recall curve:

$$AP = \int_{0}^{1} P(R) \, dR$$

where $P(R)$ is the precision at recall $R$. The mean Average Precision (mAP) is the average of AP over all classes. In this study, I report two mAP variants: $mAP_{0.5}$, computed at an Intersection over Union (IoU) threshold of 0.5, and $mAP_{0.5:0.95}$, which averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The latter gives a more comprehensive assessment of robustness across localization accuracies, which matters for UAV applications.
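These definitions translate directly into code; the sketch below computes precision, recall, and AP by trapezoidal integration of a toy precision-recall curve (the counts and curve points are made up for illustration):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via the trapezoidal rule.
    recalls must be sorted in increasing order."""
    dr = np.diff(recalls)
    mid = (precisions[1:] + precisions[:-1]) / 2.0
    return float(np.sum(dr * mid))

p, r = precision_recall(tp=90, fp=10, fn=20)           # p = 0.9
ap = average_precision(np.array([0.0, 0.5, 1.0]),
                       np.array([1.0, 0.8, 0.4]))      # ap = 0.75
# mAP_0.5 averages AP over classes at IoU 0.5; mAP_0.5:0.95 additionally
# averages over the IoU thresholds 0.5, 0.55, ..., 0.95.
```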

Ablation Study Results

To analyze the individual contributions of each proposed module, I conduct an ablation study by incrementally adding components to the baseline YOLOv8n model. The experiments are designed as follows:

  • Process Algorithm 1: YOLOv8n + C2f-LRSA module.
  • Process Algorithm 2: YOLOv8n + C2f-LRSA + MEUM module.
  • YOLOv8n-LMS: YOLOv8n + C2f-LRSA + MEUM + P2 detection layer.

The results are summarized in the table below, demonstrating the performance improvements at each stage.

| Model               | C2f-LRSA | MEUM | P2 | P (%) | R (%) | $mAP_{0.5}$ (%) | $mAP_{0.5:0.95}$ (%) |
|---------------------|----------|------|----|-------|-------|-----------------|----------------------|
| YOLOv8n             |          |      |    | 88.7  | 85.2  | 91.6            | 56.9                 |
| Process Algorithm 1 | ✓        |      |    | 89.7  | 85.9  | 91.9            | 58.1                 |
| Process Algorithm 2 | ✓        | ✓    |    | 89.6  | 85.4  | 92.2            | 58.3                 |
| YOLOv8n-LMS         | ✓        | ✓    | ✓  | 89.1  | 85.9  | 92.4            | 58.5                 |

The ablation study shows that each module contributes positively to overall performance. Process Algorithm 1, which incorporates the C2f-LRSA module, raises $mAP_{0.5}$ by 0.3% and $mAP_{0.5:0.95}$ by 1.2% over the baseline YOLOv8n, indicating that the local-aware attention mechanism enhances feature extraction, which benefits vehicle detection in UAV infrared imagery. Process Algorithm 2, with the addition of the MEUM module, lifts $mAP_{0.5}$ by 0.6% and $mAP_{0.5:0.95}$ by 1.4% over the baseline; its multi-scale edge enhancement enriches feature representations and improves detection accuracy. Finally, the complete YOLOv8n-LMS algorithm, which adds the P2 small-target detection layer, achieves the largest gains, with $mAP_{0.5}$ up 0.8% and $mAP_{0.5:0.95}$ up 1.6% over the baseline. This underscores the significance of the P2 layer for small-target detection, a common challenge in UAV imagery.

Comparative Experimental Results and Analysis

To further validate the efficacy of YOLOv8n-LMS, I compare it with several state-of-the-art object detectors on the same infrared UAV vehicle detection dataset. The compared algorithms include YOLOv5n, YOLOv6n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and RTDETR-L, a representative Transformer-based architecture known for end-to-end detection efficiency. The comparative results are presented in the table below.

| Model       | P (%) | R (%) | $mAP_{0.5}$ (%) | $mAP_{0.5:0.95}$ (%) |
|-------------|-------|-------|-----------------|----------------------|
| YOLOv5n     | 89.9  | 84.1  | 91.4            | 57.4                 |
| YOLOv6n     | 87.4  | 84.8  | 91.6            | 57.4                 |
| YOLOv8n     | 88.7  | 85.2  | 91.6            | 56.9                 |
| YOLOv10n    | 89.0  | 82.6  | 90.7            | 55.8                 |
| YOLOv11n    | 88.9  | 83.7  | 91.5            | 57.7                 |
| YOLOv12n    | 90.0  | 84.1  | 91.8            | 57.3                 |
| RTDETR-L    | 84.5  | 81.6  | 87.9            | 53.3                 |
| YOLOv8n-LMS | 89.1  | 85.9  | 92.4            | 58.5                 |

The comparative results show that YOLOv8n-LMS outperforms all other algorithms on both $mAP_{0.5}$ and $mAP_{0.5:0.95}$. Specifically, YOLOv8n-LMS achieves a $mAP_{0.5}$ of 92.4%, which is 1.0% higher than YOLOv5n, 0.8% higher than YOLOv6n and YOLOv8n, 1.7% higher than YOLOv10n, 0.9% higher than YOLOv11n, 0.6% higher than YOLOv12n, and 4.5% higher than RTDETR-L. In terms of $mAP_{0.5:0.95}$, YOLOv8n-LMS reaches 58.5%, an improvement of 1.1% over YOLOv5n and YOLOv6n, 1.6% over YOLOv8n, 2.7% over YOLOv10n, 0.8% over YOLOv11n, 1.2% over YOLOv12n, and 5.2% over RTDETR-L. These results highlight the superior detection accuracy and robustness of YOLOv8n-LMS for target vehicle detection in UAV infrared imagery. The enhancements in local feature extraction, multi-scale edge awareness, and small-target detection collectively enable the model to handle challenges such as small object sizes, lighting variations, and occlusion more effectively than existing methods.

Visual comparisons further illustrate the advantages of YOLOv8n-LMS. On sample infrared images from the test set, YOLOv8n-LMS produces fewer false positives and missed detections than the baseline YOLOv8n. In scenes with complex backgrounds or small vehicles, for instance, it accurately identifies targets that other algorithms miss, demonstrating its enhanced ability to discern vehicles under the difficult conditions UAVs encounter. This gain is attributable to the synergy of the proposed modules, which improve feature representation and detection sensitivity across scales.

Conclusion

In this paper, I address the challenges of detecting target vehicles in UAV infrared imagery, where small target sizes, weak feature signatures, and insufficient detection accuracy are prevalent. I propose an improved algorithm, YOLOv8n-LMS, which builds on YOLOv8n with three key modifications: a C2f-LRSA module to enhance local feature extraction, a MEUM module to improve multi-scale edge perception, and a P2 small-target detection layer to exploit high-resolution shallow features. These enhancements collectively boost detection performance, particularly for small targets, while keeping the model lightweight enough for deployment on resource-constrained UAV platforms.

Experimental results on a public infrared UAV vehicle dataset confirm the effectiveness of YOLOv8n-LMS. The algorithm achieves a $mAP_{0.5}$ of 92.4% and a $mAP_{0.5:0.95}$ of 58.5%, outperforming several state-of-the-art detectors, including other YOLO variants and a Transformer-based model. Ablation studies validate the individual contribution of each proposed module, demonstrating progressive improvements in the detection metrics.

Future work will focus on further optimizing the algorithm for UAV applications. Potential directions include multi-modal information fusion (e.g., combining infrared with visible-light or LiDAR data), further lightweight architecture refinements to reduce computational overhead, and adaptive feature enhancement mechanisms for robustness in extreme environmental conditions. Extending the algorithm to other target types and scenarios, such as pedestrian detection or urban traffic analysis, could also broaden its applicability. Continued advances in UAV detection technology can unlock new possibilities in surveillance, reconnaissance, and autonomous operations, contributing to the growing field of aerial intelligence systems.
