A Robust Multimodal Fusion Approach for Unmanned Drone Detection

The proliferation of unmanned aerial vehicles (UAVs), or drones, has ushered in transformative applications across surveillance, agriculture, logistics, and cinematography. However, this rapid adoption concurrently poses significant security and privacy threats, particularly when drones operate in unauthorized or sensitive airspace near airports, critical infrastructure, or secure facilities. The reliable and timely detection of these unmanned drones is therefore a paramount challenge for modern security systems. Traditional detection methods, predominantly reliant on visible spectrum cameras or radar, struggle against the inherent characteristics of modern small unmanned drones: their diminutive size, low radar cross-section, high maneuverability, and ability to fly at low altitudes amidst cluttered backgrounds.

This work specifically addresses the critical challenge of detecting small unmanned drones under complex environmental conditions. In such scenarios, the unmanned drone presents as a weak, small, and often low-contrast target, easily confused with birds or obscured by background noise like clouds, terrain, or urban structures. To overcome these limitations, we propose a novel detection framework that synergistically fuses information from two complementary sensing modalities: infrared (IR) video and audio. The core premise is that while a single sensor may fail under certain conditions, a multimodal system can leverage the strengths of one modality to compensate for the weaknesses of the other, thereby achieving superior robustness and accuracy.

Our proposed methodology follows a decision-level fusion strategy. Initially, the unmanned drone target is independently detected within each modality using specialized pipelines optimized for the respective data type. For infrared video, we adopt a “tracking-then-detecting” paradigm to handle the weak and small target problem. This involves a Dynamic Saliency Difference Enhancement (DSDE) module to amplify the contrast between the potential unmanned drone and its background, followed by a Spatio-Temporal Trajectory Encoding and Association (STTEA) module to maintain continuous tracks, mitigating issues of occlusion and trajectory fragmentation. For audio, we process the acoustic signal by extracting its time-frequency representation (Log-Mel Spectrogram) and employing a Convolutional Neural Network (CNN) for classification. Finally, the probabilistic outputs from the infrared and audio detectors are fused at the decision level using a logistic regression model to yield the final, more confident classification of the presence or absence of an unmanned drone.

Challenges in Unmanned Drone Detection

The effective detection of a small unmanned drone is hindered by several intertwined factors that render traditional single-sensor approaches inadequate.

Low Signal-to-Clutter Ratio (SCR): In both visual and infrared domains, a small unmanned drone occupies only a few pixels. Its thermal signature can be weak and similar to background elements like warm rooftops or distant vehicles, leading to a low SCR.
Complex and Dynamic Backgrounds: Urban skies, cloud formations, and moving vegetation create high levels of background clutter that can generate false alarms or obscure the target unmanned drone.
Target Confusion: Birds are a primary source of confusion due to overlapping size, flight altitude, and sometimes similar motion patterns. Distinguishing an unmanned drone from a bird is a non-trivial classification task.
Sensor Limitations: Visible light cameras fail in low-light conditions and are sensitive to weather. Radar systems are often expensive, have limited resolution for very small targets, and may struggle with slow-moving or hovering unmanned drones near ground clutter. Acoustic sensors alone have limited range and are sensitive to ambient noise.

These challenges necessitate an intelligent fusion approach. Infrared sensors provide day/night capability and often better contrast for heat-emitting objects like a drone’s motors and battery. Acoustic sensors passively capture the distinctive harmonic and broadband noise generated by drone propellers, which is largely independent of visual obscuration like fog or partial occlusion.

Proposed Multimodal Detection Framework

The overall architecture of our proposed system is designed to be modular and robust. It independently processes infrared video streams and audio signals, extracting high-confidence detections from each before a final fusion stage.

Infrared Video Processing Pipeline

The infrared processing branch is engineered to tackle the weak and small target problem through enhanced contrast and robust tracking. The pipeline consists of several key stages.

1. Dynamic Saliency Difference Enhancement (DSDE) Module

This module is the first critical step for enhancing the visibility of a potential unmanned drone. Its objective is to generate a saliency map $S_{final}(x,y)$ where the target region is markedly highlighted against its background. The DSDE operates through a multi-feature fusion mechanism that combines gradient, local-contrast, and temporal cues extracted from the infrared frames.

Gradient-Grayscale Fusion: For an input IR frame $I_t$, we compute gradient magnitude maps to emphasize edges. The x and y directional gradients, $G_x$ and $G_y$, are computed using Sobel or similar operators. The gradient magnitude is:
$$G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$$
This gradient map $G$ is then adaptively fused with the original grayscale image $I_t$ using a weight $\alpha_{local}$ that depends on local texture complexity, forming an initial fused feature map $F_{base}$.
$$F_{base}(x,y) = \alpha_{local}(x,y) \cdot I_t(x,y) + (1 - \alpha_{local}(x,y)) \cdot G(x,y)$$
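The gradient-grayscale fusion can be illustrated with a minimal NumPy sketch. It assumes 3×3 Sobel kernels with edge-replicated padding, and it takes the weight map `alpha_local` as an input because the text does not specify how that weight is derived from local texture complexity. The function names are hypothetical and are not taken from the evaluated implementation.

```python
import numpy as np

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude G = sqrt(Gx^2 + Gy^2) from 3x3 Sobel kernels."""
    # Edge-replicated padding (an assumption) keeps the output the same size.
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    # Horizontal derivative: right column minus left column, rows weighted 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical derivative: bottom row minus top row, columns weighted 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

def fuse_gradient_grayscale(img: np.ndarray, alpha_local: np.ndarray) -> np.ndarray:
    """F_base = alpha_local * I_t + (1 - alpha_local) * G, evaluated per pixel."""
    g = sobel_magnitude(img)
    return alpha_local * img + (1.0 - alpha_local) * g

# Toy frame: a single warm pixel on a cold background.
frame = np.zeros((5, 5))
frame[2, 2] = 10.0
f_base = fuse_gradient_grayscale(frame, np.full((5, 5), 0.5))
```

With a constant weight of 0.5, the isolated pixel keeps half of its intensity while its neighbours pick up the edge response, which is the intended effect of blending intensity with gradient information.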

Local Contrast Calculation with Entropy-Guided Windowing: To quantify the contrast between a potential target and its immediate surround, we employ a dual-window approach. An inner window $W_{in}$ encompasses the target area, and an outer window $W_{out}$ covers the local background. The local contrast $C(x,y)$ is often computed as a difference of means or a more robust statistical measure. A key innovation is scaling the window sizes based on local image entropy $E_{local}$, which measures randomness or texture complexity.
$$E_{local} = -\sum_{i=0}^{L-1} p_i \log_2 p_i$$
where $p_i$ is the probability of gray level $i$ in the local window. If $E_{local} > \tau_{entropy}$ (e.g., 1.8), indicating complex texture, the background window size is reduced to prevent contamination. Otherwise, a larger window is used to capture sufficient background context. This yields a multi-scale contrast map $C_{ms}(x,y)$.
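The entropy-guided window selection can be sketched as follows. The threshold of 1.8 is the example value from the text; the two window sizes (9 and 15 pixels) and the function names are hypothetical placeholders chosen only for illustration.

```python
import math
from collections import Counter

def local_entropy(patch):
    """Shannon entropy in bits of the gray levels in a flat list of pixels."""
    counts = Counter(patch)
    n = len(patch)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def background_window_size(patch, tau_entropy=1.8, small=9, large=15):
    """Return the outer-window side length for this neighbourhood.

    A complex texture (entropy above tau_entropy) gets the smaller window to
    limit contamination; a smooth one gets the larger window for more context.
    The sizes `small` and `large` are assumptions, not values from the text.
    """
    return small if local_entropy(patch) > tau_entropy else large

smooth_patch = [5] * 16               # a single gray level: entropy 0 bits
textured_patch = [0, 1, 2, 3] * 4     # four equally likely levels: entropy 2 bits
sizes = (background_window_size(smooth_patch),
         background_window_size(textured_patch))
```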

Temporal Motion Verification: To suppress transient noise (e.g., sensor noise, brief cloud gaps), we enforce temporal consistency across a short sequence of frames ($I_t, I_{t-1}, I_{t-2}$). The saliency response must be persistent. We compute a temporal saliency measure:
$$S_{temp}(x,y) = \frac{1}{3} \sum_{i=0}^{2} S_{t-i}(x,y)$$
where $S_{t-i}$ is the saliency from the contrast stage at frame $t-i$. A dynamic threshold $T_{motion} = \mu_S + 2\sigma_S$ (mean plus two standard deviations of $S_{temp}$) is applied. The final enhanced saliency map is:
$$
S_{final}(x,y) =
\begin{cases}
S_{temp}(x,y), & \text{if } S_{temp}(x,y) > T_{motion} \\
0, & \text{otherwise}
\end{cases}
$$
This process robustly highlights true moving targets like an unmanned drone while effectively suppressing false alarms.
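A minimal sketch of the temporal verification step is given below. It assumes the three saliency maps are already registered to one another and uses the population standard deviation (NumPy's default) for $\sigma_S$, which the text does not specify. The function name is hypothetical.

```python
import numpy as np

def temporal_saliency(s_t, s_t1, s_t2):
    """Average three consecutive saliency maps and keep only pixels above
    the dynamic threshold T_motion = mean + 2 * std of the averaged map."""
    s_temp = (s_t + s_t1 + s_t2) / 3.0
    t_motion = s_temp.mean() + 2.0 * s_temp.std()
    return np.where(s_temp > t_motion, s_temp, 0.0)

# Toy example: one persistent response and one single-frame flicker.
frames = [np.zeros((10, 10)) for _ in range(3)]
for f in frames:
    f[4, 4] = 9.0          # present in all three frames
frames[0][1, 1] = 3.0      # present in the current frame only
s_final = temporal_saliency(*frames)
```

In this toy case the persistent response survives while the single-frame flicker is averaged down below the threshold and set to zero.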

2. Target Detection and Feature Extraction

The enhanced saliency map $S_{final}$ is thresholded to generate binary candidate regions. Connected component analysis extracts blobs corresponding to potential targets. For each candidate blob $j$, we extract its bounding box coordinates $(x_j, y_j, w_j, h_j)$ and a feature vector $\phi(z_j)$. This feature vector combines handcrafted features (intensity, shape descriptors) and a deep feature extracted by a lightweight CNN (e.g., a truncated ResNet34) from the original image patch. This multi-feature representation is crucial for the subsequent data association step.
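The blob extraction step can be illustrated with a small dependency-free sketch. It assumes 8-connectivity and reports $(x_j, y_j)$ as the top-left corner of each bounding box; neither convention is stated in the text, and the function name is hypothetical. Feature extraction is not shown.

```python
from collections import deque

def extract_blobs(mask):
    """Return bounding boxes (x, y, w, h) of 8-connected blobs in a binary mask.

    `mask` is a list of rows of 0/1 values; x indexes columns and y indexes rows.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            rmin = rmax = r0
            cmin = cmax = c0
            while queue:
                r, c = queue.popleft()
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((cmin, rmin, cmax - cmin + 1, rmax - rmin + 1))
    return boxes

binary = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
]
candidates = extract_blobs(binary)
```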

3. Spatio-Temporal Trajectory Encoding and Association (STTEA) Module

This module manages the “tracking” part of the “tracking-then-detecting” paradigm. Its goal is to associate new detections (measurements $z_j$) with existing target tracks $T_t$, and to initiate or terminate tracks, thereby forming continuous trajectories for each unmanned drone.

Short-Term Trajectory Feature Extraction: For each active track $T_t$, we maintain a short-term history of its last $K$ states (e.g., $K=5$), including position and velocity $[x, y, \dot{x}, \dot{y}]$. This sequence is fed into a Long Short-Term Memory (LSTM) network to encode the motion pattern into a compact feature vector $h_t$.
$$h_t = \text{LSTM}([s_{t-K}, \dots, s_{t-1}])$$
where $s_i$ is the state vector at time $i$. The LSTM’s gating mechanisms allow it to learn and remember consistent motion patterns typical of an unmanned drone (e.g., smooth curves, hover, linear flight) while filtering out noise.
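The sketch below shows one way the short-term state window fed to the LSTM could be maintained. It is an illustration under assumptions: velocities are estimated by finite differences between consecutive positions, the first state is given zero velocity, and the class name `TrackHistory` is hypothetical. The LSTM encoder itself is not shown.

```python
from collections import deque

class TrackHistory:
    """Keeps the last K states [x, y, vx, vy] of one track."""

    def __init__(self, k=5):
        self.states = deque(maxlen=k)   # older states are dropped automatically
        self._last_pos = None

    def update(self, x, y, dt=1.0):
        """Append a new state; velocity is a finite difference (assumption)."""
        if self._last_pos is None:
            vx = vy = 0.0
        else:
            vx = (x - self._last_pos[0]) / dt
            vy = (y - self._last_pos[1]) / dt
        self._last_pos = (x, y)
        self.states.append((x, y, vx, vy))

    def sequence(self):
        """Return the stored states, oldest first, as the encoder input."""
        return list(self.states)

track = TrackHistory(k=5)
for i in range(7):                      # straight-line motion for 7 frames
    track.update(2.0 * i, 1.0 * i)
window = track.sequence()
```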

Trajectory-Measurement Matching: To associate a new measurement $z_j$ with feature $\phi(z_j)$ to an existing track $T_t$ with encoded feature $h_t$, we compute a comprehensive matching score. This score combines:

Motion Similarity (Trajectory Match – TM): The cosine similarity between the LSTM-encoded track feature $h_t$ and a projected measurement feature $\phi(z_j)$.
$$TM_{jt} = \frac{h_t \cdot \phi(z_j)}{\|h_t\| \|\phi(z_j)\|}$$
Spatial Proximity: The Mahalanobis distance $D_M(j,t)$ between the predicted position of track $t$ and the measurement $z_j$, which accounts for estimation uncertainty.
Appearance Similarity: The cosine similarity between the deep appearance features of the track’s last confirmed target and the current measurement.
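The two basic measures used by these cues, cosine similarity and Mahalanobis distance, can be sketched in plain Python as follows. The example assumes a two-dimensional position with a 2×2 covariance matrix supplied by the track's state estimator; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mahalanobis_2d(predicted, measured, cov):
    """Mahalanobis distance between a predicted and a measured 2D position.

    `cov` is the 2x2 covariance ((sxx, sxy), (syx, syy)) of the prediction.
    """
    dx = measured[0] - predicted[0]
    dy = measured[1] - predicted[1]
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    # d^T * inverse(cov) * d, written out for the 2x2 case.
    q = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

d_tight = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), ((1.0, 0.0), (0.0, 1.0)))
d_loose = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), ((9.0, 0.0), (0.0, 16.0)))
```

The same displacement yields a smaller distance when the predicted position is more uncertain, which is how the measure accounts for estimation uncertainty.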

Dynamic Spatio-Temporal Fusion (DSTF): The final association cost is a weighted fusion of spatial and trajectory-based cues. We define a novel Dynamic Spatio-Temporal Fusion Factor $\lambda$ that adapts based on target maneuverability.
$$ \text{Association\_Cost}_{jt} = \lambda \cdot \frac{D_E(j,t)}{D_M(j,t)} + (1 - \lambda) \cdot TM_{jt} $$
Here, $D_E$ is the Euclidean distance. The factor $\lambda$ is adjusted dynamically: it decreases (e.g., from 0.5 to 0.3) when high maneuverability is detected, giving more weight to the trajectory matching score $TM_{jt}$, which is more robust during sudden turns or occlusions. This adaptive mechanism is key to maintaining track continuity for an evasive unmanned drone. The Hungarian algorithm is then used for optimal global data association.
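The fused quantity can be evaluated as in the sketch below, which implements the formula exactly as written and uses the example values 0.5 and 0.3 for $\lambda$. How high maneuverability is detected is not specified in the text, so it is represented here by a boolean flag per track. The sketch stops at building the matrix that would be handed to the Hungarian algorithm; all function names are hypothetical.

```python
def fusion_factor(high_maneuverability, lam_steady=0.5, lam_maneuver=0.3):
    """Pick lambda: lower when the track is maneuvering, so TM weighs more."""
    return lam_maneuver if high_maneuverability else lam_steady

def association_cost(d_e, d_m, tm, lam):
    """lambda * D_E / D_M + (1 - lambda) * TM, as given in the text."""
    return lam * (d_e / d_m) + (1.0 - lam) * tm

def cost_matrix(d_e, d_m, tm, maneuvering):
    """Build the measurement-by-track matrix (rows j, columns t).

    d_e, d_m and tm are nested lists indexed [j][t]; `maneuvering` holds one
    boolean per track (an assumed interface).
    """
    lams = [fusion_factor(flag) for flag in maneuvering]
    return [
        [association_cost(d_e[j][t], d_m[j][t], tm[j][t], lams[t])
         for t in range(len(lams))]
        for j in range(len(d_e))
    ]

# One measurement against a steady track and a maneuvering track.
costs = cost_matrix(
    d_e=[[5.0, 4.0]],
    d_m=[[2.5, 2.0]],
    tm=[[0.8, 0.5]],
    maneuvering=[False, True],
)
```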

Once stable trajectories are established over multiple frames, the sequence of image patches along a track is fed into a dedicated classifier (a deeper CNN) for final confirmation that the tracked object is indeed an unmanned drone and not a bird or other artifact. This “detection-after-tracking” leverages the aggregated temporal information, leading to a more reliable classification than single-frame analysis.

Audio Signal Processing Pipeline

Concurrently, the audio stream is analyzed to detect the characteristic acoustic signature of an unmanned drone.

Feature Extraction – Log-Mel Spectrogram: The raw audio signal is divided into short, overlapping frames. For each frame, a Short-Time Fourier Transform (STFT) is applied to obtain the power spectrum. This spectrum is then warped onto the Mel scale, which approximates human auditory perception and emphasizes frequencies relevant to drone noise. A bank of triangular Mel filters is applied, and the logarithm of the filter bank energies is computed to create a Log-Mel Spectrogram, a 2D time-frequency representation where the distinctive harmonic lines and broadband noise of an unmanned drone’s rotors become visible features.
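The feature extraction chain can be sketched with NumPy alone. The frame length (1024 samples), hop size (512), number of Mel bands (64), Hann window, natural logarithm, and the common 2595·log10(1 + f/700) Mel formula are all assumptions made for this illustration; the text does not state the values used in the experiments. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        # The triangle is the lower of the two slopes, clipped at zero.
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=512, n_mels=64, eps=1e-10):
    """Return a (n_mels, n_frames) log-Mel spectrogram of a 1D signal."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Short, overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + eps).T                     # eps avoids log(0)

sr = 44100                                 # sampling rate of the dataset audio
t = np.arange(sr) / sr                     # one second of samples
low_tone = log_mel_spectrogram(np.sin(2 * np.pi * 200.0 * t), sr)
high_tone = log_mel_spectrogram(np.sin(2 * np.pi * 5000.0 * t), sr)
```

A pure tone shows up as a single bright horizontal line in the spectrogram, and a higher tone lands in a higher Mel band. This is the mechanism by which rotor harmonics become visible as line features.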

Classification with CNN: The generated spectrogram image (typically resized to 128×128 pixels) is treated as an input image to a Convolutional Neural Network. The CNN architecture, comprising several convolutional, pooling, and fully-connected layers, learns to distinguish the pattern of an unmanned drone from other sounds like helicopter noise, bird calls, or general background noise. The output is a probability $P_{audio}(Drone)$ indicating the likelihood that an unmanned drone is present in the current audio segment.

Decision-Level Multimodal Fusion

The final stage integrates the evidence from both independent detectors. Let $P_{ir}(Drone)$ be the probability output from the infrared video classifier (based on tracked sequences), and $P_{audio}(Drone)$ be the probability from the audio classifier. Simple yet effective fusion rules include weighted averaging and logistic regression. We employ a logistic regression model that learns the bias $w_0$ and the weights $w_{ir}$ and $w_{audio}$ from training data:
$$ P_{fusion} = \sigma(w_0 + w_{ir} \cdot P_{ir}(Drone) + w_{audio} \cdot P_{audio}(Drone)) $$
where $\sigma(\cdot)$ is the sigmoid function. The model is trained to output a high probability only when both modalities provide supporting evidence, thereby significantly reducing false alarms that might occur in a single sensor (e.g., a bird in IR, a distant motorcycle in audio). A final decision is made by thresholding $P_{fusion}$.
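The fusion rule can be illustrated numerically with the sketch below. The weights ($w_0 = -6$, $w_{ir} = w_{audio} = 5$) and the decision threshold of 0.5 are hypothetical values chosen so that a single confident modality is not enough on its own; in practice these parameters are learned from training data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(p_ir, p_audio, w0=-6.0, w_ir=5.0, w_audio=5.0):
    """P_fusion = sigma(w0 + w_ir * P_ir + w_audio * P_audio).

    The default weights are illustrative placeholders, not learned values.
    """
    return sigmoid(w0 + w_ir * p_ir + w_audio * p_audio)

def is_drone(p_ir, p_audio, threshold=0.5):
    """Final decision by thresholding the fused probability."""
    return fuse(p_ir, p_audio) > threshold

both_agree = fuse(0.9, 0.9)     # both detectors confident
ir_only = fuse(0.9, 0.1)        # e.g. a bird tracked in IR, no rotor sound
audio_only = fuse(0.1, 0.9)     # e.g. a distant motorcycle, nothing in IR
```

With these placeholder weights, the fused probability exceeds 0.5 only when both inputs are high, which mirrors the false-alarm suppression behaviour described above.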

Experimental Evaluation and Discussion

To validate the efficacy of our proposed multimodal framework, we conducted extensive experiments on a publicly available multisensor drone detection dataset. The dataset contains synchronized infrared video (320×256 at 30 Hz) and audio (44.1 kHz) sequences featuring drones (Inspire2, Mavic3, Phantom4), confounding objects (airplanes, helicopters, birds), and various background scenes.

Ablation Study on Infrared Pipeline Components

We first dissect the contribution of each major component within our infrared video processing pipeline. The baseline (Method 1) is a standard detection-by-tracking approach without our proposed DSDE and STTEA modules. The results clearly demonstrate the incremental and significant performance gains provided by each module.

Method	Configuration	Precision (%)	Recall (%)	F1-Score	AP (%)
1	Baseline (Tracking-only)	75.2 ± 0.13	68.1 ± 0.15	0.71 ± 0.0014	60.1 ± 0.28
2	Baseline + DSDE Module	79.2 ± 0.18	75.2 ± 0.19	0.77 ± 0.0016	62.4 ± 0.26
3	Baseline + STTEA Module	78.7 ± 0.21	77.2 ± 0.15	0.78 ± 0.0018	63.1 ± 0.29
4	Baseline + DSDE + STTEA (Our Full IR Pipeline)	86.7 ± 0.17	82.1 ± 0.22	0.84 ± 0.0019	70.5 ± 0.34

Analysis: The DSDE module (Method 2) primarily boosts Recall, by 7.1 percentage points (68.1% to 75.2%), meaning it retrieves more true unmanned drone targets that were previously lost in low-contrast backgrounds. The STTEA module (Method 3) also improves Recall and offers the best F1-score among the partial upgrades (0.78), indicating its strength in maintaining consistent tracks and reducing identity switches. The full integration (Method 4) achieves the best performance across all metrics, with gains of 11.5 percentage points in Precision and 14.0 percentage points in Recall over the baseline. AP rises by 10.4 points, more than the sum of the gains obtained from the two modules individually (2.3 and 3.0 points). This synergy confirms that enhancing target saliency and enforcing robust spatio-temporal association are complementary and both essential for reliable unmanned drone detection in IR video.
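The F1-scores and the gains over the baseline can be re-derived from the Precision and Recall columns of the ablation table. The short script below is a consistency check on the reported means only; it ignores the reported standard deviations.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (Precision %, Recall %) copied from the ablation table.
ablation = {
    "Baseline": (75.2, 68.1),
    "Baseline + DSDE": (79.2, 75.2),
    "Baseline + STTEA": (78.7, 77.2),
    "Baseline + DSDE + STTEA": (86.7, 82.1),
}

base_p, base_r = ablation["Baseline"]
summary = {}
for name, (p, r) in ablation.items():
    summary[name] = {
        "f1": round(f1_score(p, r) / 100.0, 2),   # table reports F1 on a 0-1 scale
        "delta_precision": round(p - base_p, 1),  # percentage points
        "delta_recall": round(r - base_r, 1),     # percentage points
    }
    print(name, summary[name])
```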

Comparative Evaluation with State-of-the-Art Methods

We compare our complete multimodal system (infrared-audio fusion) against several state-of-the-art object detection methods applied solely to the infrared video data, as well as against alternative fusion strategies. The results underscore the superiority of our approach.

Method	Modality / Fusion	Precision (%)	Recall (%)	F1-Score	AP (%)
Faster R-CNN	IR Video Only	73.2 ± 0.21	60.3 ± 0.17	0.66 ± 0.0018	58.7 ± 0.23
RetinaNet	IR Video Only	70.1 ± 0.15	62.4 ± 0.26	0.66 ± 0.0021	60.4 ± 0.21
YOLOv5	IR Video Only	80.3 ± 0.17	75.1 ± 0.20	0.78 ± 0.0018	67.2 ± 0.20
CenterNet	IR Video Only	56.4 ± 0.18	70.2 ± 0.13	0.63 ± 0.0017	52.4 ± 0.19
RT-DETR	IR Video Only	82.4 ± 0.16	78.7 ± 0.19	0.81 ± 0.0017	68.1 ± 0.18
Our Method (Visual Pipeline Only)	IR Video Only	86.7 ± 0.17	82.1 ± 0.22	0.84 ± 0.0019	70.5 ± 0.34
YOLOv5 + Feature Fusion	Early Fusion (IR+Audio)	82.4 ± 0.23	76.8 ± 0.21	0.80 ± 0.0029	68.9 ± 0.22
YOLOv5 + Weighted Fusion	Decision Fusion	84.7 ± 0.19	79.5 ± 0.17	0.82 ± 0.0023	70.3 ± 0.21
RT-DETR + Weighted Fusion	Decision Fusion	86.1 ± 0.18	82.4 ± 0.15	0.84 ± 0.0019	71.2 ± 0.18
Our Method (Full Multimodal Fusion)	Decision Fusion (IR+Audio)	89.5 ± 0.18	85.7 ± 0.19	0.88 ± 0.0018	75.4 ± 0.23

Analysis: Several key observations can be made. First, our visual-only pipeline already outperforms all other single-modality IR detectors, including the highly regarded RT-DETR, in Precision, Recall, F1-score, and AP. This validates the effectiveness of the DSDE and STTEA modules tailored for the unmanned drone detection task. Second, simply adding audio fusion to other detectors (YOLOv5, RT-DETR) improves their performance, confirming the value of multimodal information. However, our full multimodal system achieves the highest scores on every metric. Its 89.5% Precision and 85.7% Recall exceed our visual-only pipeline by 2.8 and 3.6 percentage points, and the strongest fused baseline (RT-DETR + Weighted Fusion) by 3.4 and 3.3 percentage points, respectively. Its 75.4% AP, which summarizes the precision-recall trade-off across confidence thresholds, is 4.2 points above that baseline. The decision-level fusion strategy, trained via logistic regression, effectively combines the strengths of both sensing streams: the IR system’s ability to locate and track the unmanned drone, and the audio system’s ability to confirm its identity based on acoustic signature, leading to a more robust and reliable final detection outcome.

Conclusion

In this work, we have presented a comprehensive and robust multimodal framework for detecting small unmanned drones in challenging environments. The core innovation lies in the synergistic fusion of infrared video and audio sensing, coupled with advanced processing modules within each pipeline. For infrared video, the Dynamic Saliency Difference Enhancement module successfully amplifies the low-contrast signature of a distant unmanned drone, while the Spatio-Temporal Trajectory Encoding and Association module, with its adaptive fusion factor, ensures robust tracking continuity despite occlusion or complex motion. The audio pipeline provides an orthogonal source of evidence through the discriminative analysis of acoustic spectrograms. By fusing the probabilistic outputs of these two streams at the decision level, our system achieves a high degree of confidence in its detections, significantly reducing false alarms caused by birds, environmental clutter, or sensor-specific noise.

Experimental results on a multisensor dataset demonstrate the superior performance of our approach compared to state-of-the-art single-modality detectors and alternative fusion methods. The proposed system offers a practical and effective solution for security applications requiring reliable unmanned drone surveillance in complex, real-world scenarios. Future work may involve integrating additional modalities like RF signal detection and exploring end-to-end trainable architectures for the entire multimodal pipeline to further optimize performance.