Real-time Detection of Ground Targets in Complex Scenes Using Deep Learning for UAV Infrared Remote Sensing

In recent years, unmanned aerial vehicles (UAVs), commonly referred to as drones, have become indispensable tools in fields such as urban traffic monitoring, emergency mapping, and military reconnaissance. Integrating infrared remote sensing systems on UAVs enables the capture of thermal images that remain usable under adverse conditions such as low illumination, fog, and haze. These infrared images provide a continuous, multi-angle view of ground targets, making them well suited to all-weather surveillance. However, detecting targets in infrared imagery poses significant challenges due to inherent limitations such as low signal-to-noise ratio, blurred textures, and small target sizes, which are consequences of the diffraction limit in infrared imaging. Traditional detection methods often struggle in complex scenes, leading to low detection rates and high false alarm rates. Moreover, existing deep learning models for infrared target detection are typically computationally heavy and require substantial resources, rendering them unsuitable for real-time deployment on resource-constrained UAV platforms. To address these issues, I propose an optimized deep learning model based on MobileNetv3 for real-time ground target detection in UAV infrared remote sensing. This model enhances feature extraction through cross-layer shortcuts and an efficient channel attention mechanism, improves multi-scale feature fusion with an expanded path aggregation network, and employs a decoupled detection head for accurate localization and classification. The goal is to achieve high detection accuracy while maintaining low computational cost, enabling real-time performance on UAVs.

The use of UAVs for infrared remote sensing has gained traction due to their ability to operate in challenging environments. Infrared cameras mounted on UAVs capture thermal radiation emitted by objects, which is less affected by visible light conditions. This makes infrared imaging valuable for applications like search and rescue, where targets may be obscured in darkness or smoke. However, infrared images often suffer from low contrast and a lack of detailed features, especially for small or distant targets. Deep learning approaches, particularly convolutional neural networks (CNNs), have shown promise in automating target detection in such imagery. Yet most state-of-the-art models, such as Faster R-CNN and YOLO variants, are designed for visible-light images and may not generalize well to infrared data without modifications. Additionally, their large model sizes and high computational demands hinder deployment on UAVs, which typically have limited processing power and battery life. Therefore, there is a pressing need for lightweight, efficient models tailored for UAV infrared target detection. In this work, I focus on optimizing MobileNetv3, a compact CNN architecture, to strike a balance between accuracy and speed. By incorporating attention mechanisms and enhanced feature fusion, the model can better handle the characteristics of infrared imagery, such as weak target signatures and complex backgrounds. The proposed improvements are validated on a public UAV infrared dataset, demonstrating significant gains in detection performance while retaining real-time capability.

To provide context, infrared target detection for UAVs involves processing sequences of thermal images to identify and localize objects like pedestrians, vehicles, and bicycles. The complexity arises from factors like scale variations, occlusions, and clutter. Traditional methods rely on handcrafted features, but they often fail in dynamic scenarios. Deep learning, with its ability to learn hierarchical features from data, has revolutionized this field. However, adapting these models for UAV-based infrared sensing requires careful design. For instance, models must be robust to thermal noise and able to detect small targets that occupy few pixels in the image. My approach builds upon MobileNetv3, which is known for its efficiency, and introduces modifications to address these specific challenges. The core innovations include: (1) adding cross-layer shortcut connections in the backbone network to preserve fine-grained details from shallow layers; (2) replacing the standard squeeze-and-excitation module with an efficient channel attention mechanism to reduce computational overhead; (3) extending the path aggregation network to output four feature maps for better small-target detection; and (4) using a decoupled detection head to separate localization and classification tasks. These enhancements are guided by the need for real-time processing on UAVs, so that the model can be deployed in practical scenarios where latency is critical.

In the following sections, I detail the methodology, including the architectural modifications and mathematical formulations. I then describe the experimental setup, dataset, and evaluation metrics. Results from ablation studies and comparisons with state-of-the-art models are presented, followed by a discussion of the implications. Finally, I conclude with insights and future directions for improving UAV-based infrared target detection.

Methodology

The proposed model is designed for real-time ground target detection in UAV infrared remote sensing. It consists of three main components: an optimized MobileNetv3 backbone for feature extraction, an enhanced path aggregation network (PANet) for multi-scale feature fusion, and a decoupled detection head for output prediction. I will explain each component in detail, emphasizing the modifications made to address the challenges of infrared imagery and UAV deployment constraints.

Optimized MobileNetv3 Backbone

MobileNetv3 is a lightweight CNN architecture that uses depthwise separable convolutions and bottleneck structures to reduce computational cost. However, for UAV infrared images, where targets are often small and features are subtle, the standard MobileNetv3 may lose important details due to aggressive downsampling. To mitigate this, I introduce cross-layer shortcut connections between feature extraction layers. Specifically, in addition to the standard sequential flow, I add direct links from shallow layers to deeper layers. This allows fine-grained information from early layers, which contain more spatial details, to be propagated directly to later stages, enriching the feature maps with target-specific cues. The shortcut connection uses a strided 3×3 convolution (with a stride of 2) to downsample the shallow feature map, followed by concatenation with the deep feature map along the channel dimension. A 1×1 convolution is then applied to adjust the channel count. This process can be represented as:

Let $ F_s $ be the shallow feature map and $ F_d $ be the deep feature map. The cross-layer shortcut operation is:

$$ F_{down} = \text{Conv}_{3\times3, \text{stride}=2}(F_s) $$
$$ F_{concat} = \text{Concat}(F_d, F_{down}) $$
$$ F_{out} = \text{Conv}_{1\times1}(F_{concat}) $$

where $ \text{Conv}_{k\times k, \text{stride}=s} $ denotes a convolution with kernel size $ k $ and stride $ s $, and $ \text{Concat} $ denotes channel-wise concatenation. This ensures that the deep features retain both high-level semantics and low-level details, crucial for detecting small targets in UAV infrared imagery.
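As a rough illustration of the three steps above, the following sketch traces tensor shapes through the shortcut rather than running real convolutions. The padding of 1, the choice to keep the shallow channel count in the strided convolution, and the example shapes are assumptions made for illustration; the text does not specify them.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def cross_layer_shortcut_shape(shallow, deep, out_channels):
    """Trace (C, H, W) shapes through F_down -> F_concat -> F_out.

    Assumes padding=1 for the 3x3 stride-2 convolution and that this
    convolution keeps the shallow channel count (both are assumptions).
    """
    c_s, h_s, w_s = shallow
    c_d, h_d, w_d = deep
    f_down = (c_s, conv_out_size(h_s, 3, 2, 1), conv_out_size(w_s, 3, 2, 1))
    if f_down[1:] != (h_d, w_d):
        raise ValueError("downsampled shallow map must match the deep map spatially")
    f_concat = (c_d + f_down[0], h_d, w_d)  # channel-wise concatenation
    f_out = (out_channels, h_d, w_d)        # a 1x1 convolution only changes channels
    return f_down, f_concat, f_out


# Hypothetical shapes: a 24-channel 160x160 shallow map, a 40-channel 80x80 deep map.
f_down, f_concat, f_out = cross_layer_shortcut_shape((24, 160, 160), (40, 80, 80), 40)
print(f_down, f_concat, f_out)
```

The concatenated map carries the channels of both inputs, which is why the final 1×1 convolution is needed to restore the channel count expected by the next stage.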

Another key improvement is the replacement of the squeeze-and-excitation (SE) attention module with an efficient squeeze-and-excitation (ESE) channel attention mechanism. The SE module compresses the feature map through global average pooling and then applies two fully connected layers to compute channel weights; the first of these layers reduces the channel dimension, which can lead to information loss. In contrast, the ESE module avoids channel reduction by using a single fully connected layer, thereby preserving inter-channel dependencies more effectively. Given an input feature map $ F \in \mathbb{R}^{H \times W \times C} $, the ESE mechanism computes channel weights as follows:

$$ z = \text{GAP}(F) $$
$$ w = \sigma(\text{FC}(z)) $$
$$ F' = w \cdot F $$

where $ \text{GAP} $ is global average pooling reducing $ F $ to $ \mathbb{R}^{1 \times 1 \times C} $, $ \text{FC} $ is a fully connected layer, $ \sigma $ is the sigmoid activation function defined as $ \sigma(x) = \frac{1}{1 + e^{-x}} $, and $ \cdot $ denotes channel-wise multiplication. This lightweight attention mechanism enhances relevant features while suppressing background noise, which is particularly beneficial for infrared images where target signatures are weak.
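The three equations can be sketched in plain Python on a toy feature map. This is only a minimal illustration: the feature map is stored channel-first as nested lists, and the identity weights used for the fully connected layer are placeholders, not learned values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ese_attention(feature, fc_weight, fc_bias):
    """Apply z = GAP(F), w = sigmoid(FC(z)), F' = w * F.

    feature:   list of C channels, each an H x W list of floats
    fc_weight: C x C matrix (no channel reduction); fc_bias: length-C list
    """
    # Global average pooling: one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Single fully connected layer followed by a sigmoid gate.
    w = [sigmoid(sum(w_ij * z_j for w_ij, z_j in zip(w_row, z)) + b)
         for w_row, b in zip(fc_weight, fc_bias)]
    # Channel-wise rescaling of the input feature map.
    scaled = [[[w_c * v for v in row] for row in ch] for w_c, ch in zip(w, feature)]
    return w, scaled


# Toy input: two 2x2 channels; identity FC weights are illustrative only.
feature = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
weights, out = ese_attention(feature, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights)
```

Because the gate is a sigmoid, every channel weight lies strictly between 0 and 1, so the module can only attenuate channels relative to one another, never amplify them.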

Enhanced Path Aggregation Network

For multi-scale feature fusion, I employ a modified version of the path aggregation network (PANet). The standard PANet uses a top-down and bottom-up pathway to combine features from different scales, but it typically outputs only three feature maps for detection. In UAV infrared scenes, small targets are prevalent, and using more feature maps can improve their detection. Therefore, I extend the PANet to output four feature maps at different scales (e.g., 80×80, 40×40, 20×20, and 10×10 for an input size of 640×640). This allows the model to capture targets of varying sizes, from tiny objects to larger ones. To maintain efficiency, I replace all standard convolutions in the PANet with depthwise separable convolutions, which factorize convolutions into depthwise and pointwise operations, reducing the computational cost significantly. The depthwise separable convolution for an input $ X $ is defined as:

$$ Y_d = \text{DepthwiseConv}(X) $$
$$ Y = \text{PointwiseConv}(Y_d) $$

where $ \text{DepthwiseConv} $ applies a single filter per input channel, and $ \text{PointwiseConv} $ (a 1×1 convolution) combines the outputs. This modification helps keep the model lightweight and suitable for real-time inference on UAVs.
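To make the saving concrete, the sketch below counts weights for a single layer under each factorization, ignoring biases. The kernel size and channel counts are hypothetical examples. The per-layer ratio it prints should not be confused with any whole-network estimate, since only some layers are replaced.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input and 128 output channels.
kernel, c_in, c_out = 3, 128, 128
std = standard_conv_params(kernel, c_in, c_out)
dws = depthwise_separable_params(kernel, c_in, c_out)
ratio = dws / std
print(std, dws, round(ratio, 4))
```

The ratio equals 1/c_out + 1/k², so for 3×3 kernels and wide layers the factorized form needs roughly one ninth of the weights.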

The enhanced PANet structure is summarized in Table 1, which compares the original and modified versions in terms of output scales and convolution types. The addition of a fourth output scale increases the number of anchor boxes from 9 to 12, better covering the range of target sizes in UAV infrared images.

Component	Original PANet	Enhanced PANet
Output Scales	3 (e.g., 80×80, 40×40, 20×20)	4 (e.g., 80×80, 40×40, 20×20, 10×10)
Convolution Type	Standard Convolution	Depthwise Separable Convolution
Anchor Boxes	9	12
Computational Cost	High	Reduced by ~30% (estimated)
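The output scales and anchor counts in Table 1 follow from simple arithmetic, sketched below. The strides (8, 16, 32, 64) are inferred from the listed grid sizes for a 640×640 input, and three anchor shapes per scale is an assumption consistent with the 9 and 12 totals quoted above.

```python
def output_grid_sizes(input_size, strides):
    """Side length of each square output feature map."""
    return [input_size // s for s in strides]


def anchor_shape_count(num_scales, anchors_per_scale=3):
    """Total number of distinct anchor shapes across all scales."""
    return num_scales * anchors_per_scale


def total_predictions(grids, anchors_per_scale=3):
    """Number of anchor boxes evaluated per image across all grid cells."""
    return sum(g * g for g in grids) * anchors_per_scale


original_grids = output_grid_sizes(640, [8, 16, 32])
enhanced_grids = output_grid_sizes(640, [8, 16, 32, 64])
print(original_grids, anchor_shape_count(3), total_predictions(original_grids))
print(enhanced_grids, anchor_shape_count(4), total_predictions(enhanced_grids))
```

Under these assumptions the fourth scale adds only 300 candidate boxes per image, so its cost comes mainly from the extra fusion layers rather than from the additional predictions.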

Decoupled Detection Head

In typical detection models, a single head predicts both bounding box coordinates and class labels. However, these tasks have different requirements: localization needs precise spatial features, while classification benefits from contextual semantics. To address this conflict, I use a decoupled detection head with two separate branches. One branch predicts the bounding box regression and Intersection over Union (IoU) confidence, and the other predicts class probabilities. This separation allows each branch to focus on its specific task, improving overall accuracy. The decoupled head first reduces the channel dimension of the input feature map using a 1×1 convolution, then splits into two streams. The localization stream uses a 3×3 convolution followed by a 1×1 convolution to output four coordinates (center x, y, width, height) and an IoU score for each anchor box. The classification stream uses a similar structure to output class scores. The architecture is illustrated in Figure 1 (conceptual representation).
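The output sizes of the two branches can be illustrated with a small shape calculation. It assumes three anchors per grid cell and the four object classes of the dataset used later in the experiments; the function name and channel layout are hypothetical and serve only to show how the branches differ.

```python
def decoupled_head_shapes(grid_size, num_classes, anchors_per_cell=3):
    """Return (localization, classification) output shapes as (C, H, W)."""
    box_and_iou = 4 + 1  # (center x, center y, width, height) plus one IoU score
    loc_shape = (anchors_per_cell * box_and_iou, grid_size, grid_size)
    cls_shape = (anchors_per_cell * num_classes, grid_size, grid_size)
    return loc_shape, cls_shape


# One pair of output shapes per detection scale (grid sizes for a 640x640 input).
head_shapes = {g: decoupled_head_shapes(g, num_classes=4) for g in (80, 40, 20, 10)}
for grid, (loc, cls) in head_shapes.items():
    print(grid, loc, cls)
```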

For training, I employ specialized loss functions tailored for infrared target detection. The bounding box regression loss is computed using the MPD-IoU (Minimum Point Distance IoU) metric, which considers the distances between corner points of predicted and ground-truth boxes. For a predicted box $ B $ and ground-truth box $ A $, MPD-IoU is defined as:

$$ \text{MPD-IoU} = \text{IoU} - \frac{(x_{B1} - x_{A1})^2 + (y_{B1} - y_{A1})^2}{w^2 + h^2} - \frac{(x_{B2} - x_{A2})^2 + (y_{B2} - y_{A2})^2}{w^2 + h^2} $$

where $ (x_{A1}, y_{A1}) $ and $ (x_{A2}, y_{A2}) $ are the top-left and bottom-right corners of $ A $, similarly for $ B $, and $ w $ and $ h $ are the width and height of the minimal enclosing rectangle of $ A $ and $ B $. IoU is the standard intersection over union. This loss encourages precise corner alignment, which is important for small targets in UAV images.
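A minimal sketch of this metric for two axis-aligned boxes follows, using the normalization by the minimal enclosing rectangle as defined above. Turning the metric into a loss as 1 − MPD-IoU is an assumption on my part, since the text only states that the loss is computed from the metric; the example boxes are hypothetical.

```python
def mpd_iou(box_a, box_b):
    """MPD-IoU for boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Standard IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Width and height of the minimal rectangle enclosing both boxes.
    w = max(ax2, bx2) - min(ax1, bx1)
    h = max(ay2, by2) - min(ay1, by1)
    norm = w * w + h * h
    # Squared distances between matching corners.
    d_top_left = (bx1 - ax1) ** 2 + (by1 - ay1) ** 2
    d_bottom_right = (bx2 - ax2) ** 2 + (by2 - ay2) ** 2
    return iou - d_top_left / norm - d_bottom_right / norm


def mpd_iou_loss(box_a, box_b):
    """Assumed loss form: 1 - MPD-IoU."""
    return 1.0 - mpd_iou(box_a, box_b)


gt = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 2.0, 12.0, 12.0)
print(mpd_iou(gt, pred), mpd_iou_loss(gt, pred))
```

For the shifted box in this example the plain IoU is 64/136, and the two corner penalties subtract a further 16/288, so a corner misalignment is penalized even when the overlap is large.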

For classification, I use the Generalized Focal Loss (GFL), which handles class imbalance and hard examples common in infrared datasets. Given a predicted probability $ t $ and ground-truth label $ y $ (0 or 1), GFL is defined as:

$$ \text{GFL}(t) = -|y - t|^\beta \left( (1 - y) \log(1 - t) + y \log(t) \right) $$

where $ \beta $ is a hyperparameter set to 1.2 in my experiments. This loss down-weights easy examples and focuses on hard ones, improving learning efficiency.
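The down-weighting effect can be checked numerically with a direct transcription of the formula. The clamping constant `eps` is an added safeguard against log(0) and is not part of the definition above; the probabilities in the example are arbitrary.

```python
import math


def gfl(t, y, beta=1.2, eps=1e-12):
    """Loss from the formula above: |y - t|^beta times binary cross-entropy."""
    t = min(max(t, eps), 1.0 - eps)  # keep the logarithms finite
    cross_entropy = -((1 - y) * math.log(1 - t) + y * math.log(t))
    return abs(y - t) ** beta * cross_entropy


easy = gfl(0.9, 1)  # confident, correct prediction
hard = gfl(0.3, 1)  # poorly predicted positive
print(easy, hard, -math.log(0.9))
```

The easy example contributes far less than its plain cross-entropy value, while the hard example keeps most of its weight, which is the behavior described above.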

Preprocessing with Adaptive Contrast Enhancement

To further enhance target visibility in infrared images, I apply an adaptive contrast enhancement algorithm before feeding images into the model. This preprocessing step adjusts the contrast locally based on the image statistics, making targets more distinguishable from the background. For an input infrared image $ I $, the enhanced image $ I' $ is computed as:

$$ I'(x,y) = \alpha(x,y) \cdot (I(x,y) - \mu(x,y)) + \mu(x,y) $$

where $ \mu(x,y) $ is the local mean intensity in a window around pixel $ (x,y) $, and $ \alpha(x,y) $ is a gain factor derived from the local standard deviation. This step is crucial for UAV imagery, where thermal contrasts can be subtle due to environmental factors.
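The text does not give the exact form of the gain factor, so the sketch below assumes one common choice: the global standard deviation divided by the local standard deviation, clipped to a maximum value. The window radius, the clipping limit, and the toy image are all hypothetical.

```python
import math


def adaptive_contrast_enhance(image, radius=1, max_alpha=3.0, eps=1e-6):
    """Compute I' = alpha * (I - mu) + mu with alpha = global_std / local_std (clipped).

    The gain formula is an assumption; windows are truncated at image borders.
    """
    h, w = len(image), len(image[0])
    pixels = [v for row in image for v in row]
    g_mean = sum(pixels) / len(pixels)
    g_std = math.sqrt(sum((v - g_mean) ** 2 for v in pixels) / len(pixels))
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mu = sum(window) / len(window)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in window) / len(window))
            alpha = min(max_alpha, g_std / (sigma + eps))
            out[y][x] = alpha * (image[y][x] - mu) + mu
    return out


# Toy image: a faint warm spot (12) on a flat background (10), with a bright edge (100).
image = [[10, 10, 10, 10, 10, 100],
         [10, 10, 10, 10, 10, 100],
         [10, 12, 10, 10, 10, 100]]
enhanced = adaptive_contrast_enhance(image)
print(round(enhanced[2][1], 3))
```

In flat regions the local standard deviation is small, so the gain saturates at its limit and the faint spot is pushed further from its local mean; the clipping keeps noise in such regions from being amplified without bound.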

Experimental Setup

To evaluate the proposed model, I conduct experiments on a public UAV infrared dataset and compare it with state-of-the-art methods. The experiments are designed to assess detection accuracy, speed, and model size, and thus suitability for real-time deployment on UAVs.

Dataset and Preprocessing

I use the HIT-UAV dataset, which contains 3,784 thermal infrared images captured by a Zenmuse XT2 camera mounted on a DJI Matrice M210 V2 UAV. The images cover various scenes such as schools and roads, with annotations for four object classes: pedestrian, bicycle, car, and other vehicle. The dataset is split into training (70%), validation (15%), and test (15%) sets. All images are resized to 640×640 pixels and normalized. The adaptive contrast enhancement is applied as a preprocessing step to improve target saliency, which is especially helpful for small objects in complex backgrounds.

Training Parameters

The model is trained using the Adam optimizer with an initial learning rate of 0.001 and a momentum term of 0.9 (for Adam, this corresponds to the first-moment decay rate $ \beta_1 $). Training runs for 300 epochs with a batch size determined automatically via an Auto Batch mechanism. Data augmentation techniques such as random flipping and rotation are used to increase robustness. Training is performed on a workstation with an NVIDIA RTX 3060 GPU, an Intel Xeon E5-2680 CPU, and 64 GB RAM, using PyTorch 1.12.1 and CUDA 11.7. The key training parameters are summarized in Table 2.

Parameter	Value
Optimizer	Adam
Initial Learning Rate	0.001
Momentum	0.9
Epochs	300
Batch Size	Auto-determined (typically 16-32)
Loss Functions	MPD-IoU for regression, GFL for classification
Data Augmentation	Random flip, rotation, contrast enhancement

Evaluation Metrics

I use standard detection metrics to evaluate performance: precision (P), recall (R), average precision (AP) per class, and mean average precision (mAP) over all classes. For real-time capability, I measure inference time per frame (in seconds) and model size (in megabytes). Precision and recall are defined as:

$$ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$

where $ TP $ is true positives, $ FP $ is false positives, and $ FN $ is false negatives. AP is computed as the area under the precision-recall curve, and mAP is the average of AP across classes. These metrics are calculated at an IoU threshold of 0.5. The inference time is measured on the test set using the same hardware, and model size is the file size of the saved model.
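The sketch below shows one way to compute these quantities for a single class. It assumes detections have already been matched to ground truth at the IoU threshold of 0.5, so each detection arrives with a true-positive flag. The area under the curve uses all-point interpolation with a monotone precision envelope, which is one common convention and not necessarily the one used in my evaluation; the scores are made up.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep the confidence threshold from high to low
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Made-up detections for one class with four ground-truth objects.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
p, r = precision_recall(tp=8, fp=2, fn=2)
print(ap, p, r)
```

mAP is then the arithmetic mean of the per-class AP values.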

Results and Analysis

The experiments include ablation studies to validate individual components and comparisons with baseline and state-of-the-art models. All results are based on the test set of the HIT-UAV dataset, focusing on the challenges of UAV-based infrared target detection.

Ablation Study

To assess the contribution of each modification, I conduct ablation experiments by adding different combinations of the improvements to the baseline MobileNetv3-PANet model. The improvements are: (A) cross-layer shortcuts and ESE attention in the backbone, (B) enhanced PANet with four output scales, and (C) decoupled detection head. The results are shown in Table 3, where “√” indicates the inclusion of a component.

Model	A	B	C	P (%)	R (%)	mAP (%)	Inference Time (s)
Baseline (MobileNetv3-PANet)	–	–	–	80.5	82.4	83.7	0.016
Model 1	√	√	–	83.3	87.5	86.3	0.018
Model 2	–	√	√	84.2	88.6	89.7	0.019
Model 3	√	–	√	85.4	89.8	88.5	0.020
Proposed Model	√	√	√	89.2	90.7	91.5	0.019

The baseline model achieves an mAP of 83.7% with an inference time of 0.016 seconds per frame. Model 1, which combines the backbone changes (A) with the enhanced PANet (B), improves mAP to 86.3%, showing the value of feature enrichment, attention, and the additional output scale. Model 2, with the enhanced PANet and decoupled head (B and C), reaches 89.7% mAP, highlighting the benefits of multi-scale fusion and task separation. Model 3 combines A and C, yielding 88.5% mAP. The full proposed model integrates all three improvements and achieves the best performance: 91.5% mAP, with precision at 89.2% and recall at 90.7%. Its inference time rises from 0.016 to 0.019 seconds per frame, which is still within real-time limits (typically <0.1 seconds per frame for UAVs). The modifications therefore raise mAP by 7.8 percentage points at a cost of 0.003 seconds per frame.
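For readers who think in frame rates, the per-frame times in Table 3 convert to frames per second as follows. The snippet only restates the table values; it measures nothing new.

```python
def frames_per_second(seconds_per_frame):
    """Throughput implied by a per-frame inference time."""
    return 1.0 / seconds_per_frame


# Inference times (seconds per frame) copied from Table 3.
table3_times = {
    "Baseline": 0.016,
    "Model 1": 0.018,
    "Model 2": 0.019,
    "Model 3": 0.020,
    "Proposed Model": 0.019,
}
fps = {name: frames_per_second(t) for name, t in table3_times.items()}
for name, value in fps.items():
    print(f"{name}: {value:.1f} FPS")
```

The proposed model therefore runs at roughly 52.6 frames per second on the test hardware, compared with 62.5 for the baseline and 10 for the stated real-time limit of 0.1 seconds per frame.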

Comparison with State-of-the-Art Models

I compare the proposed model with several popular detection models: YOLOv7, YOLOX, MobileNet-SSD, ShuffleNetv2, and the baseline MobileNetv3-PANet. All models are trained and tested on the same HIT-UAV dataset under identical conditions. The results are presented in Table 4, including AP per class, mAP, inference time, and model size.

Model	AP Pedestrian (%)	AP Bicycle (%)	AP Car (%)	AP Other Vehicle (%)	mAP (%)	Inference Time (s)	Model Size (MB)
YOLOv7	83.2	81.5	85.4	82.7	83.2	0.013	12.8
YOLOX	85.4	84.6	89.7	85.4	86.3	0.021	20.1
MobileNet-SSD	73.5	74.2	78.3	76.8	75.7	0.034	37.5
ShuffleNetv2	79.8	80.7	84.6	81.2	81.6	0.025	29.6
MobileNetv3-PANet	82.9	83.3	85.1	83.6	83.7	0.016	15.3
Proposed Model	89.7	90.8	94.1	91.4	91.5	0.019	18.9

The proposed model outperforms all competitors in mAP, achieving 91.5%, which is 8.3 percentage points higher than YOLOv7, 5.2 higher than YOLOX, 15.8 higher than MobileNet-SSD, 9.9 higher than ShuffleNetv2, and 7.8 higher than the baseline. Notably, it excels in detecting small targets like pedestrians and bicycles, with APs of 89.7% and 90.8%, respectively. This is attributed to the enhanced feature fusion and attention mechanisms tailored for UAV infrared imagery. In terms of speed, the inference time of 0.019 seconds is comparable to YOLOX (0.021 s) and faster than ShuffleNetv2 (0.025 s) and MobileNet-SSD (0.034 s), though slower than YOLOv7 (0.013 s) and the baseline (0.016 s); it remains suitable for real-time applications on UAVs. The model size is 18.9 MB, larger than YOLOv7 (12.8 MB) and the baseline (15.3 MB) but smaller than YOLOX (20.1 MB) and much smaller than MobileNet-SSD (37.5 MB). This balance between accuracy and efficiency is critical for deployment on resource-constrained UAV platforms.
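Since mAP is the mean of the per-class AP values, the mAP column and the gaps quoted above can be reproduced from Table 4 with a few lines of arithmetic. The numbers below are copied from the table; small differences from the reported mAP values come from rounding to one decimal place.

```python
# Per-class AP (%) from Table 4: pedestrian, bicycle, car, other vehicle.
table4_ap = {
    "YOLOv7": [83.2, 81.5, 85.4, 82.7],
    "YOLOX": [85.4, 84.6, 89.7, 85.4],
    "MobileNet-SSD": [73.5, 74.2, 78.3, 76.8],
    "ShuffleNetv2": [79.8, 80.7, 84.6, 81.2],
    "MobileNetv3-PANet": [82.9, 83.3, 85.1, 83.6],
    "Proposed Model": [89.7, 90.8, 94.1, 91.4],
}


def mean_ap(per_class_ap):
    """mAP as the arithmetic mean of per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


maps = {name: mean_ap(aps) for name, aps in table4_ap.items()}
gaps = {name: maps["Proposed Model"] - value
        for name, value in maps.items() if name != "Proposed Model"}
for name, gap in gaps.items():
    print(f"{name}: mAP {maps[name]:.1f}, gap {gap:.1f} percentage points")
```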

Qualitative Results

Visual inspection of the detection outputs confirms the model’s effectiveness in complex scenes. For instance, in cluttered urban environments, the proposed model successfully detects pedestrians and vehicles even when they are partially occluded or small in size. In contrast, the baseline model often misses such targets or produces false positives. The adaptive contrast enhancement preprocessing also helps in highlighting thermal signatures, reducing confusion with background elements. These qualitative observations align with the quantitative metrics, reinforcing the model’s robustness for UAV-based surveillance.

Discussion

The results demonstrate that the proposed modifications collectively address the challenges of UAV infrared target detection. The cross-layer shortcuts in the backbone mitigate information loss from downsampling, which is vital for preserving small-target details. The ESE attention mechanism efficiently highlights relevant channels without excessive computation, improving feature discriminability. The enhanced PANet with four output scales captures multi-scale contexts, boosting recall for objects of different sizes. The decoupled detection head reduces task interference, leading to more accurate localization and classification. Together, these innovations enable the model to achieve high accuracy while maintaining real-time performance.

However, there are limitations. The model’s performance may degrade in extreme conditions, such as heavy rain or snow, where thermal signatures are highly attenuated. Additionally, the dataset, though diverse, may not cover all scenarios encountered by UAVs in real-world missions. Future work could involve collecting more diverse infrared data, including adverse weather conditions, and exploring domain adaptation techniques to improve generalization. Another direction is to integrate temporal information from video sequences, as UAVs often capture continuous streams. By leveraging optical flow or recurrent networks, the model could track targets across frames, enhancing stability and reducing false alarms. Furthermore, optimizing the model for specific UAV hardware, such as NVIDIA Jetson modules, could improve efficiency for edge deployment.

From a practical perspective, the proposed model is well suited for integration into UAV systems for applications like border patrol, where detecting illicit activities in low-light conditions is crucial. The real-time capability allows for immediate alert generation, enabling rapid response. In disaster management, UAVs equipped with this model could locate survivors in rubble based on their heat signatures, even at night. The lightweight design reduces the onboard computational load and energy consumption, which are key considerations for UAV flight time.

Conclusion

In this work, I have presented a deep learning model for real-time ground target detection in complex scenes using UAV infrared remote sensing. The model builds upon MobileNetv3 with several key enhancements: cross-layer shortcut connections, an efficient channel attention mechanism, an expanded path aggregation network, and a decoupled detection head. These improvements are designed to tackle the unique challenges of infrared imagery, such as low contrast and small target sizes, while ensuring computational efficiency for UAV deployment. Experiments on the HIT-UAV dataset show that the proposed model achieves a mean average precision of 91.5%, outperforming the compared state-of-the-art methods, with an inference time of 0.019 seconds per frame and a model size of 18.9 MB. This balance of accuracy and speed makes it a viable solution for real-time applications like surveillance and emergency response. Future research will focus on incorporating temporal analysis and hardware-specific optimizations to further enhance the model’s utility for UAV-based infrared sensing. As UAVs continue to proliferate, detection algorithms of this kind will play an important role in safety and security missions.