A High-Precision Detection Algorithm for Low-Altitude Unmanned Drones Based on Cross-Attention Modality Fusion

The rapid development of the low-altitude economy, fueled by the liberalization of low-altitude airspace, has positioned unmanned drones as pivotal actors. Their agility, vertical take-off and landing capability, and cost-effectiveness make them ideally suited for complex, low-altitude operations. However, this proliferation introduces significant safety challenges, notably the risk of conflicts between unmanned drones themselves and with manned aircraft. “Rogue flights” of unmanned drones in sensitive areas like airports underscore the urgent need for robust, reliable low-altitude surveillance. The cornerstone of such a system is a high-precision, stable, and reliable algorithm for detecting the presence of unmanned drones within a monitored airspace.

Current detection methodologies often rely on single-modal sensors, each with inherent limitations. Radar detection, while long-range and weather-resistant, suffers from high false alarm rates for low-altitude targets and is costly and immobile. Visual/electro-optical detection provides rich information but is severely hampered by short range, weather conditions, and occlusion. Acoustic detection is highly susceptible to environmental noise. Radio frequency (RF) detection offers long range and low cost but is typically passive and can be confused by other signals in the crowded electromagnetic spectrum. A single sensor cannot reliably adapt to the complex and variable low-altitude environment, leading to missed detections and false alarms.

Therefore, this paper proposes a high-precision detection framework that fuses dual-modal information: the RF signals emitted by unmanned drones and visual imagery. The core insight is that when one modality is degraded (e.g., poor visibility for cameras, or RF interference), the other can provide compensatory evidence, improving overall robustness and accuracy. The proposed algorithm is validated on public datasets. It achieves a classification accuracy exceeding 97% for RF signals, over 92% for image-based detection, and 100% accuracy on the test set after fusing the two modalities. This work provides a technical foundation for safeguarding low-altitude airspace and supporting the sustainable development of the low-altitude economy.

The complete architecture of the proposed low-altitude unmanned drone detection framework is illustrated below. It processes RF signals and images in parallel streams, extracts deep features from each, and employs a cross-attention-based fusion network to make a final, joint decision on the presence of an unmanned drone.

1. RF Signal Processing and Feature Extraction Based on Residual Learning

This section details the preprocessing pipeline and feature extraction network for low-altitude unmanned drone RF signals. We begin by modeling the noisy received signal. The signal then undergoes denoising via Ensemble Empirical Mode Decomposition (EEMD), dimensionality reduction via Compressed Sensing (CS) theory, and time-frequency analysis via the Short-Time Fourier Transform (STFT) to generate 2D representations. Finally, a Residual Network (ResNet) is designed to extract discriminative features from these representations.

1.1 Noisy Signal Model and Assumptions

The RF signal received in a low-altitude environment is inherently noisy. The general reception model can be expressed as:

$$ f(t) = s(t) + \sigma n(t) $$

where $ f(t) $ is the received modulated signal, $ s(t) $ is the original transmitted signal from the unmanned drone, $ \sigma $ is the noise amplitude, and $ n(t) $ is standard Gaussian white noise. The modulated RF signal can be represented as:

$$ s(t) = \text{Re} \{ s_b(t) \cdot \exp \{ j(2\pi f_c t + \phi_0) \} \} $$

where $ \phi_0 $ is the initial phase, and $ f_c $ is the carrier frequency, typically 2.4 GHz or 5.8 GHz for consumer unmanned drones. The baseband signal $ s_b(t) $ is a convolution of a symbol sequence $ a_k $ and a shaping filter $ g(t) $. After propagation through the channel, the complete received signal $ r(t) $ is:

$$ r(t) = [ s(t) * h(t) ] + n_{\text{AWGN}}(t) + n_{\text{inter}}(t) $$

where $ * $ denotes convolution, $ h(t) $ is the channel impulse response, $ n_{\text{AWGN}}(t) $ is additive white Gaussian noise, and $ n_{\text{inter}}(t) $ represents independent co-channel interference from other devices or unmanned drones.
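The reception model above can be simulated numerically. The following is a minimal sketch, not the authors' simulation: the sample rate, the scaled-down carrier, the QPSK symbols, the rectangular shaping filter, the two-tap channel, and the single-tone interferer are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 1.0e6      # sample rate in Hz (assumed, scaled down for illustration)
fc = 1.0e5      # stand-in carrier in Hz (real links use 2.4 GHz or 5.8 GHz)
sps = 20        # samples per symbol (assumed)
n_sym = 200     # number of symbols (assumed)
phi0 = 0.3      # initial phase in radians (assumed)

# Baseband s_b(t): QPSK symbols a_k convolved with a rectangular shaping filter g(t).
a_k = (rng.choice([-1.0, 1.0], n_sym) + 1j * rng.choice([-1.0, 1.0], n_sym)) / np.sqrt(2)
upsampled = np.zeros(n_sym * sps, dtype=complex)
upsampled[::sps] = a_k
g = np.ones(sps)
s_b = np.convolve(upsampled, g)[: n_sym * sps]

# Passband s(t) = Re{ s_b(t) * exp(j(2*pi*fc*t + phi0)) }.
t = np.arange(s_b.size) / fs
s = np.real(s_b * np.exp(1j * (2 * np.pi * fc * t + phi0)))

# Channel h(t): direct path plus one weak echo (assumed two-tap model).
h = np.zeros(8)
h[0], h[5] = 1.0, 0.3

# r(t) = [s(t) * h(t)] + n_AWGN(t) + n_inter(t).
sigma = 0.1
n_awgn = sigma * rng.standard_normal(s.size)
n_inter = 0.05 * np.cos(2 * np.pi * 1.3e5 * t)   # narrowband co-channel interferer
r = np.convolve(s, h)[: s.size] + n_awgn + n_inter

snr_db = 10 * np.log10(np.mean(s**2) / np.mean((n_awgn + n_inter) ** 2))
print(f"samples: {r.size}, approximate SNR: {snr_db:.1f} dB")
```

Changing `sigma` or the interferer amplitude gives a quick way to generate test signals at different noise levels for the denoising stage.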

To facilitate analysis, the following assumptions are made regarding low-altitude unmanned drone RF signals:

Propagation is approximately line-of-sight within the relevant low-altitude range.
The primary communication band for consumer unmanned drones is 2.4 GHz.
Background noise characteristics are relatively stable over short periods.
The unmanned drone’s communication bandwidth is stable over very short intervals and does not conflict with other critical services.

1.2 Signal Denoising via Ensemble Empirical Mode Decomposition (EEMD)

Traditional denoising filters (mean, Fourier, wavelet) rely on user-defined parameters, which may not align with the signal’s intrinsic structure, especially for non-linear, non-stationary signals like those from unmanned drones. EEMD overcomes the mode-mixing problem of standard Empirical Mode Decomposition (EMD) by repeatedly decomposing the signal with added white noise and averaging the results. The added white noise standard deviation $ \epsilon $ is set according to:

$$ \epsilon = \frac{A}{\sqrt{N}} $$

where $ A $ is the noise amplitude and $ N $ is the number of ensemble trials. EEMD adaptively decomposes the signal into Intrinsic Mode Functions (IMFs), allowing for the separation and removal of noise-dominated components (typically high-order and very low-order IMFs), thereby isolating the cleaner unmanned drone signal component.
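One reading of the relation above is that white noise of amplitude $ A $, added independently in each of $ N $ trials, leaves a residual of about $ A/\sqrt{N} $ after the ensemble average. The sketch below checks only that averaging step numerically; it does not implement the EMD sifting itself, and the amplitude and sample count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

def residual_noise_std(amplitude, n_trials, n_samples=20_000):
    """Std of what remains after averaging n_trials independent white-noise realizations."""
    noise = amplitude * rng.standard_normal((n_trials, n_samples))
    return noise.mean(axis=0).std()

A = 0.2  # amplitude of the white noise added in each trial (assumed value)
for N in (1, 10, 100):
    measured = residual_noise_std(A, N)
    predicted = A / np.sqrt(N)
    print(f"N={N:4d}  measured={measured:.4f}  predicted={predicted:.4f}")
```

The measured and predicted columns should agree closely, which is why a larger ensemble lets the added noise cancel while the decomposition still benefits from it in each individual trial.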

1.3 Signal Compression via Compressed Sensing (CS)

Low-altitude unmanned drone RF signals are often approximately sparse and low-rank within a specific bandwidth. Analysis confirms that signal energy is concentrated in a small fraction of coefficients when projected onto a wavelet basis. This sparsity enables the application of CS for efficient dimensionality reduction before feature extraction. According to CS theory, if a signal $ \mathbf{X} \in \mathbb{R}^N $ is sparse in some basis $ \mathbf{\Psi} $ (i.e., $ \mathbf{X} = \mathbf{\Psi S} $, where $ \mathbf{S} $ has only $ K \ll N $ non-zero coefficients), it can be recovered from a small number of linear measurements $ \mathbf{Y} \in \mathbb{R}^M $:

$$ \mathbf{Y} = \mathbf{\Phi} \mathbf{X} $$

where $ \mathbf{\Phi} $ is an $ M \times N $ measurement matrix ($ M \ll N $) that is incoherent with $ \mathbf{\Psi} $. In this framework, we project the denoised signal onto a wavelet domain to preserve time-frequency features and use a partial Fourier matrix for measurement, achieving a compression ratio of approximately 10%. This drastically reduces the data volume for subsequent processing without significant information loss.
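The measurement step $ \mathbf{Y} = \mathbf{\Phi X} $ with a partial Fourier matrix can be illustrated on a toy signal. This is a hedged sketch under simplifying assumptions: the sparsifying basis is taken to be the identity rather than a wavelet basis, the sizes `N`, `M`, `K` are small illustrative values (not the roughly 10% ratio reported above), and Orthogonal Matching Pursuit is used as one common recovery method, which the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 512, 128, 4   # signal length, measurements, sparsity (assumed toy sizes)

# K-sparse vector; the sparsifying basis Psi is the identity here for simplicity.
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.choice([-1.0, 1.0], K) * rng.uniform(1.0, 2.0, K)

# Partial Fourier measurement matrix: M random rows of the unitary DFT matrix,
# rescaled so that every column has unit norm.
rows = rng.choice(N, M, replace=False)
dft = np.fft.fft(np.eye(N)) / np.sqrt(N)
Phi = dft[rows] * np.sqrt(N / M)
y = Phi @ x             # Y = Phi X: only M complex measurements are kept

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: greedily pick k columns, refit by least squares."""
    residual, chosen = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        scores = np.abs(Phi.conj().T @ residual)
        if chosen:
            scores[chosen] = 0.0          # never pick the same column twice
        chosen.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(Phi[:, chosen], y, rcond=None)
        residual = y - Phi[:, chosen] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[chosen] = coef.real
    return x_hat

x_hat = omp(Phi, y, K)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"kept {M}/{N} measurements, relative recovery error {rel_err:.2e}")
```

In the pipeline described here the compressed measurements feed the time-frequency stage, so explicit recovery is shown only to make the "without significant information loss" claim concrete on a toy case.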

1.4 Time-Frequency Analysis using Short-Time Fourier Transform (STFT)

To capture the non-stationary characteristics of unmanned drone RF signals, STFT is employed for time-frequency analysis. STFT divides the signal into short, overlapping segments and applies the Fourier Transform to each, revealing how the frequency content evolves over time:

$$ \text{STFT}_x(t, f) = \int_{-\infty}^{\infty} x(\tau) w(\tau - t) e^{-j2\pi f \tau} d\tau $$

where $ x(\tau) $ is the RF signal and $ w(\tau - t) $ is a window function centered at time $ t $. From the STFT, we generate two distinct 2D image representations for feature extraction:

Spectrogram: A visualization of the signal’s power spectral density over time, showing energy distribution across frequencies.
Persistence Spectrum (or Percentile Spectrum): A statistical representation that highlights the most persistent frequency components over time, effectively denoising the spectrogram and emphasizing stable signal features characteristic of an unmanned drone’s control and telemetry links.

These images encapsulate the unique “fingerprint” of the unmanned drone’s RF activity.
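Both representations can be sketched with a discrete STFT. The example below is illustrative only: the Hann window, window length, hop size, test tone, and the histogram-based definition of the persistence spectrum (fraction of frames each frequency bin spends at each power level) are assumptions, not the authors' settings.

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Discrete STFT: Hann-windowed, overlapping frames followed by an FFT per frame."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)          # shape: (n_frames, win_len // 2 + 1)

def persistence_spectrum(power_db, n_levels=64):
    """For each frequency bin, the fraction of frames observed at each power level."""
    edges = np.linspace(power_db.min(), power_db.max(), n_levels + 1)
    hist = np.stack([np.histogram(col, bins=edges)[0] for col in power_db.T], axis=1)
    return hist / power_db.shape[0]             # shape: (n_levels, n_freq_bins)

rng = np.random.default_rng(3)
fs, f0 = 8000.0, 1000.0                          # assumed sample rate and test tone
t = np.arange(16384) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(t.size)

spec = np.abs(stft(x)) ** 2                      # spectrogram: power per frame and bin
power_db = 10 * np.log10(spec + 1e-12)
persist = persistence_spectrum(power_db)

peak_bin = int(spec.mean(axis=0).argmax())
print("strongest frequency bin:", peak_bin, "of", spec.shape[1])
```

A steady tone occupies a narrow band of power levels in its frequency bin, so it appears as a concentrated ridge in `persist`, while transient noise is spread thinly across levels.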

1.5 Feature Extraction with Residual Network (ResNet)

To classify the time-frequency images (spectrogram and persistence), a deep convolutional neural network is required. We employ a Residual Network (ResNet-50), renowned for its ability to train very deep networks by mitigating the vanishing gradient problem. The key innovation is the residual block, which learns a residual mapping $ \mathcal{F}(x) $ instead of the desired underlying mapping $ \mathcal{H}(x) $. The block’s output is $ \mathcal{H}(x) = \mathcal{F}(x) + x $, where $ x $ is the input from a skip connection. This structure, shown below, allows gradients to flow directly through the network, enabling effective training and extraction of highly complex features from the unmanned drone’s RF signatures.
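The residual mapping $ \mathcal{H}(x) = \mathcal{F}(x) + x $ can be made concrete with a toy block. This sketch replaces the convolutional layers with two dense layers and omits batch normalization and the final activation, so it shows only the skip-connection arithmetic, not the ResNet-50 bottleneck itself.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F(x) = relu(x @ w1) @ w2 as a dense stand-in for conv layers."""
    return relu(x @ w1) @ w2 + x

rng = np.random.default_rng(4)
x = rng.standard_normal((2, 16))
w1 = 0.1 * rng.standard_normal((16, 16))

# With the residual branch switched off (w2 = 0), the block reduces to the identity.
identity_out = residual_block(x, w1, np.zeros((16, 16)))

# With a non-zero branch, the block learns a correction on top of the input.
w2 = 0.1 * rng.standard_normal((16, 16))
corrected_out = residual_block(x, w1, w2)
print(np.allclose(identity_out, x), np.allclose(corrected_out, x))
```

Because the identity path is always present, a block that has nothing useful to add can fall back to passing its input through, which is what keeps very deep stacks trainable.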

Table 1: ResNet-50 Architecture Used for RF Image Feature Extraction
Layer Name	Output Size	ResNet-50 Configuration
conv1	112×112	7×7, 64, stride 2
conv2_x	56×56	$ \begin{bmatrix} 1\times1, & 64 \\ 3\times3, & 64 \\ 1\times1, & 256 \end{bmatrix} \times 3 $
conv3_x	28×28	$ \begin{bmatrix} 1\times1, & 128 \\ 3\times3, & 128 \\ 1\times1, & 512 \end{bmatrix} \times 4 $
conv4_x	14×14	$ \begin{bmatrix} 1\times1, & 256 \\ 3\times3, & 256 \\ 1\times1, & 1024 \end{bmatrix} \times 6 $
conv5_x	7×7	$ \begin{bmatrix} 1\times1, & 512 \\ 3\times3, & 512 \\ 1\times1, & 2048 \end{bmatrix} \times 3 $
Output	1×1	Average Pool, Fully-Connected Layer, Softmax

2. Cross-Attention Based Multimodal Fusion Network

This section describes the vision-based feature extraction pipeline and the core cross-attention mechanism designed to fuse information from the RF and visual modalities for a final, robust decision on unmanned drone presence.

2.1 Visual Feature Extraction Pipeline

The visual stream must perform two critical tasks: classifying an image as containing an unmanned drone or not, and precisely locating the unmanned drone within the image. We employ a two-stage approach using EfficientNet and EfficientDet.

2.1.1 EfficientNet for Image Classification

We use EfficientNet-B1 as a feature extractor and classifier. Its design philosophy is based on compound scaling, which optimally balances network depth $d$, width $w$, and input resolution $r$:

$$ \text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi $$

subject to $ \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 $. Here, $ \phi $ is a user-defined coefficient. This scaling allows EfficientNet to achieve high accuracy with remarkable computational efficiency. For the task of detecting small unmanned drones, we replace the standard Squeeze-and-Excitation (SE) attention in its MBConv blocks with Coordinate Attention (CA). CA decomposes global pooling into one-dimensional horizontal and vertical operations, enabling the network to capture long-range dependencies with precise positional information, which is crucial for identifying small, distant unmanned drones.
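As a worked instance of compound scaling, the sketch below evaluates the three multipliers for a few values of $ \phi $. The constants $ \alpha = 1.2 $, $ \beta = 1.1 $, $ \gamma = 1.15 $ are the grid-search values reported in the original EfficientNet paper; the text above does not state which values were used here, so treat them as an assumption.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # values from the EfficientNet paper (assumed here)

def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """Approximate FLOPs growth: (alpha * beta^2 * gamma^2) ** phi, roughly 2 ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

for phi in (0, 1, 2):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.3f}, width x{w:.3f}, resolution x{r:.3f}, "
          f"FLOPs x{flops_multiplier(phi):.3f}")
```

With these constants the product $ \alpha \cdot \beta^2 \cdot \gamma^2 $ is about 1.92, so each unit increase of $ \phi $ roughly doubles the compute budget.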

2.1.2 EfficientDet for Object Detection and Localization

For precise unmanned drone localization, we employ EfficientDet-D1. Its backbone is an EfficientNet, and it uses a weighted Bidirectional Feature Pyramid Network (BiFPN) for multi-scale feature fusion. The fused feature $ \tilde{U} $ is computed as a normalized weighted sum:

$$ \tilde{U} = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} U_i $$

where $ U_i $ are input features at different scales, $ w_i $ are learnable weights, and $ \epsilon $ is a small constant that keeps the denominator away from zero. This allows the model to adaptively emphasize the most useful features for detecting unmanned drones at various sizes and distances. Furthermore, to improve detection of small unmanned drones, we employ Wise-IoU (WIoU) as the bounding box regression loss. WIoU uses a dynamic non-monotonic focusing mechanism to reduce the harmful impact of low-quality anchor boxes, which are common when dealing with small targets like unmanned drones. The WIoU loss $ \mathcal{L}_{WIoU} $ is defined with a distance-based regulating factor $ R $, in which $ (x, y) $ and $ (x_{gt}, y_{gt}) $ are the centers of the predicted and ground-truth boxes, $ W_g $ and $ H_g $ are the width and height of their smallest enclosing box, and the superscript $ * $ marks a term that is detached from the computational graph during backpropagation:

$$ R = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{(W_g^2 + H_g^2)^*}\right) $$

This design prevents the gradient of $ R $ from hindering convergence when the predicted and ground-truth boxes do not overlap.
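The two formulas in this subsection can be evaluated directly. The sketch below is a plain NumPy illustration under stated assumptions: weights are clipped to be non-negative before normalization (as in the EfficientDet paper), boxes are given as `(x1, y1, x2, y2)`, and $ W_g $, $ H_g $ are taken to be the width and height of the smallest box enclosing both boxes. In an autograd framework the denominator of $ R $ would additionally be detached; plain NumPy has no gradient graph, so that step does not appear.

```python
import math
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted fusion: sum_i w_i / (eps + sum_j w_j) * U_i, with non-negative weights."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))

def wiou_distance_factor(pred, gt):
    """Regulating factor R for two boxes given as (x1, y1, x2, y2)."""
    x, y = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    x_gt, y_gt = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])   # enclosing-box width
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])   # enclosing-box height
    return math.exp(((x - x_gt) ** 2 + (y - y_gt) ** 2) / (w_g ** 2 + h_g ** 2))

# Two feature maps at the same (already resized) resolution, equal weights.
u1, u2 = np.ones((4, 4)), 3.0 * np.ones((4, 4))
fused = fast_normalized_fusion([u1, u2], [1.0, 1.0])

r_shifted = wiou_distance_factor((0, 0, 2, 2), (1, 1, 3, 3))
r_aligned = wiou_distance_factor((1, 1, 3, 3), (1, 1, 3, 3))
print(fused[0, 0], r_shifted, r_aligned)
```

The factor equals 1 when the box centers coincide and grows as they move apart, so poorly centered predictions receive a larger loss weight.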

2.2 Cross-Attention Feature Fusion Network

The core innovation of this work is the fusion mechanism. Simply concatenating features from the RF and visual modalities ignores the complex, non-linear relationships between them. We propose a cross-attention module where one modality’s features guide the aggregation of information from the other.

Let $ \mathbf{F}_r \in \mathbb{R}^{D_r} $ be the feature vector from the RF ResNet and $ \mathbf{F}_i \in \mathbb{R}^{D_i} $ be the combined feature vector from the vision pipeline (classification and detection features). The cross-attention mechanism treats $ \mathbf{F}_r $ as the Query ($ \mathbf{Q} $) and $ \mathbf{F}_i $ as both the Key ($ \mathbf{K} $) and Value ($ \mathbf{V} $) after linear projection:

$$ \mathbf{Q} = \mathbf{W}_q \mathbf{F}_r + \mathbf{b}_q, \quad \mathbf{K} = \mathbf{W}_k \mathbf{F}_i + \mathbf{b}_k, \quad \mathbf{V} = \mathbf{W}_v \mathbf{F}_i + \mathbf{b}_v $$

The attention weights $ \mathbf{A} $ are computed as the softmax over the scaled dot-product of $ \mathbf{Q} $ and $ \mathbf{K} $, where $ D_h $ is the dimension of the projected query and key vectors:

$$ \mathbf{A} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_h}}\right) $$

The output is a fused feature vector that incorporates visual information weighted by its relevance to the current RF signature:

$$ \mathbf{F}_{\text{fusion}} = \text{LayerNorm}(\mathbf{F}_r + \mathbf{A} \mathbf{V}) $$

To capture diverse aspects of the relationship, we employ a Multi-Head Cross-Attention (MHCA) mechanism with 8 heads. Each head performs the above operation in a different learned projection subspace. The outputs of all heads are concatenated and linearly projected to produce the final multimodal feature representation. This representation is then fed into a simple fully-connected classification layer to make the final “unmanned drone present/absent” decision.
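The fusion step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' network: it assumes each modality is represented as a short sequence of feature tokens (for example, pooled regions of a feature map) so that the softmax has more than one key to weigh, that the RF feature width equals the model width so the residual sum is defined, and that weights are random and untrained. It uses the row-vector convention `x @ W`, the transpose of the $ \mathbf{W}\mathbf{F} $ notation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def split_heads(t, n_heads):
    """(tokens, d_model) -> (heads, tokens, d_model // heads)."""
    n_tokens, d_model = t.shape
    return t.reshape(n_tokens, n_heads, d_model // n_heads).transpose(1, 0, 2)

def multi_head_cross_attention(f_r, f_i, p, n_heads=8):
    """RF tokens f_r act as queries; visual tokens f_i supply keys and values."""
    q = split_heads(f_r @ p["Wq"] + p["bq"], n_heads)
    k = split_heads(f_i @ p["Wk"] + p["bk"], n_heads)
    v = split_heads(f_i @ p["Wv"] + p["bv"], n_heads)
    d_h = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))    # (heads, n_r, n_i)
    heads = attn @ v                                            # (heads, n_r, d_h)
    concat = heads.transpose(1, 0, 2).reshape(f_r.shape[0], -1)
    fused = layer_norm(f_r + concat @ p["Wo"] + p["bo"])        # residual + LayerNorm
    return fused, attn

rng = np.random.default_rng(5)
d_model, d_i, n_r, n_i = 64, 48, 4, 10       # assumed toy dimensions
p = {
    "Wq": 0.1 * rng.standard_normal((d_model, d_model)), "bq": np.zeros(d_model),
    "Wk": 0.1 * rng.standard_normal((d_i, d_model)), "bk": np.zeros(d_model),
    "Wv": 0.1 * rng.standard_normal((d_i, d_model)), "bv": np.zeros(d_model),
    "Wo": 0.1 * rng.standard_normal((d_model, d_model)), "bo": np.zeros(d_model),
}
f_r = rng.standard_normal((n_r, d_model))    # RF feature tokens
f_i = rng.standard_normal((n_i, d_i))        # visual feature tokens
fused, attn = multi_head_cross_attention(f_r, f_i, p)
print(fused.shape, attn.shape)
```

Each row of `attn` sums to 1, so every RF token receives a convex combination of visual values per head before the heads are concatenated and projected.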

3. Experimental Results and Analysis

3.1 Datasets

The framework is evaluated on two public datasets:

DroneRF Dataset: Contains RF recordings of background noise and three unmanned drone models under four different flight modes (e.g., hovering, flying). We create classification tasks: 2-class (drone vs. background), 4-class (flight modes), and 10-class (models + flight modes).
Drone (UAV) Detection Dataset: Contains images with bounding box annotations for unmanned drones in various backgrounds, used for training and evaluating the visual detection pipeline.

3.2 RF Signal Processing and Classification Results

The RF signal from an unmanned drone is successfully denoised via EEMD, with noise-dominated IMFs identified and removed. The clean signal is compressed and transformed into spectrogram and persistence spectrum images. The ResNet-50 model is trained on these images. The classification accuracy and F1-Score are summarized below:

Table 2: RF Signal Image Classification Performance using ResNet-50
Image Type	Classification Task	Accuracy	F1-Score
Spectrogram	2-Class	100%	100%
Spectrogram	4-Class	97.78%	98.89%
Spectrogram	10-Class	86.67%	93.24%
Persistence Spectrum	2-Class	100%	100%
Persistence Spectrum	4-Class	100%	100%
Persistence Spectrum	10-Class	95.55%	97.24%

The persistence spectrum generally yields superior performance, especially for fine-grained classification, due to its robustness to transient noise. The model demonstrates a very high capability in distinguishing unmanned drone RF signatures from background noise.
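For reference, accuracy and F1-score can both be derived from a confusion matrix. The sketch below assumes macro-averaged F1 (the averaging scheme is not stated in the text) and uses a small made-up confusion matrix purely to exercise the function; it is not data from these experiments.

```python
import numpy as np

def accuracy_and_macro_f1(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / np.maximum(conf.sum(axis=0), 1e-12)
    recall = tp / np.maximum(conf.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return tp.sum() / conf.sum(), f1.mean()

# Hypothetical 2-class example: one of ten class-0 samples is misclassified.
toy_confusion = [[9, 1],
                 [0, 10]]
acc, macro_f1 = accuracy_and_macro_f1(toy_confusion)
print(f"accuracy={acc:.4f}, macro F1={macro_f1:.4f}")
```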

3.3 Visual Detection Results

The EfficientDet-D1 model, with the Coordinate Attention-enhanced EfficientNet-B1 backbone and WIoU loss, is trained on the drone image dataset. It achieves an Average Precision (AP) of 37.4% and an AP@50 of 48.1% on the test set, demonstrating competent detection performance for the challenging task of locating small unmanned drones in images. The image classification branch of the visual pipeline achieves an accuracy and F1-score of 99.8% on the binary task of determining if an image contains an unmanned drone.

Table 3: EfficientDet-D1 Object Detection Performance
Model	Test Set AP	Test Set AP@50	Val Set AP
EfficientDet-D1	37.4	48.1	36.9
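AP@50 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching criterion, assuming boxes in `(x1, y1, x2, y2)` form, is given below; the full AP computation (ranking by confidence and integrating the precision-recall curve) is omitted.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """AP@50-style match: the prediction counts if IoU >= threshold."""
    return iou(pred, gt) >= threshold

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(is_true_positive((0, 0, 10, 10), (1, 1, 10, 10)))
```

For small targets a shift of a few pixels changes the IoU sharply, which is one reason detection scores for small drones are much lower than the image-level classification accuracy.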

3.4 Multimodal Fusion Results

The proposed 8-head cross-attention fusion network integrates the features from the RF and visual pipelines. After training on synchronized multimodal data samples, the fused model achieves a perfect classification score on the test set: 100% accuracy, 100% recall, and an F1-score of 100%. This indicates that the fusion mechanism successfully leverages the complementary strengths of both modalities, correcting errors that might occur in a single-modality analysis and providing an exceptionally reliable decision on the presence of an unmanned drone.

3.5 Complexity and Efficiency Analysis

The introduction of Compressed Sensing significantly reduces computational load in the RF pipeline. Comparing a traditional processing chain (EEMD + STFT on full signal) with the proposed one (EEMD + CS + STFT on compressed signal), we observe a reduction in time complexity. For a signal length $ N = 10^6 $, EEMD iterations $ M=10 $, decomposition steps $ K=10 $, and STFT window $ W=100 $:

Traditional: $ O_{\text{traditional}} = N \cdot M \cdot K + N \cdot W \approx 2 \times 10^8 $ operations.

Proposed: $ O_{\text{proposed}} = N \cdot M \cdot K + N \log N + N \log N + (0.1N) \cdot W \approx 1.22 \times 10^8 $ operations, with logarithms taken to base 10.

This represents a complexity reduction of approximately 39% for the RF front-end. The FLOPs for the main networks are: RF ResNet-50 (~4.12 GFLOPs), Visual EfficientDet-D1 (~5.1 GFLOPs), and the Cross-Attention Fusion Network (~0.068 GFLOPs). The total FLOPs for the inference pipeline is approximately $ 9.3 \times 10^9 $, which is feasible for deployment on edge computing platforms.
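The figures in this subsection can be checked with a few lines of arithmetic. The quoted totals are reproduced when the logarithms are base 10 and the STFT cost is counted as one operation per sample per window tap ($ N \cdot W $ for the full signal, $ 0.1N \cdot W $ for the compressed one); both conventions are inferred from the numbers rather than stated in the text.

```python
import math

N, M, K, W = 10**6, 10, 10, 100   # signal length, EEMD iterations, steps, STFT window
ratio = 0.1                        # compression ratio after CS

eemd = N * M * K
traditional = eemd + N * W
proposed = eemd + 2 * N * math.log10(N) + ratio * N * W
reduction = 1 - proposed / traditional

# Network FLOPs in GFLOPs: RF ResNet-50, EfficientDet-D1, cross-attention fusion.
total_gflops = 4.12 + 5.1 + 0.068

print(f"traditional = {traditional:.3g}, proposed = {proposed:.3g}")
print(f"reduction = {reduction:.0%}, total = {total_gflops:.3f} GFLOPs")
```

The EEMD term $ N \cdot M \cdot K $ is common to both chains and dominates the proposed total, so the saving comes from running the STFT on the compressed signal.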

3.6 Robustness Validation

To test generalization, components were evaluated on unseen datasets. The RF feature extractor (ResNet-50) tested on the DroneRFa dataset achieved 97.5% accuracy and 98.7% F1-score. The visual detector tested on a separate drone detection dataset maintained an accuracy of 92.3% and an F1-score of 93.4%. The full multimodal framework, when tested on new synchronized data, retained a high performance of 99.8% accuracy and ~100% recall. These results confirm the robustness of the individual modules and the superior reliability gained through multimodal fusion for unmanned drone detection.

4. Conclusion

This paper has presented a novel, high-precision framework for detecting the presence of unmanned drones in low-altitude airspace. The core contribution is a cross-attention-based multimodal fusion algorithm that intelligently combines features from unmanned drone RF signatures and visual imagery. The RF processing chain employs EEMD for denoising, Compressed Sensing for efficient dimensionality reduction, and STFT to generate informative time-frequency images, which are classified by a ResNet-50. The visual pipeline uses an enhanced EfficientDet model with Coordinate Attention and Wise-IoU loss for effective unmanned drone localization and classification.

Experimental results demonstrate that the individual modalities perform strongly, with RF classification exceeding 97% accuracy and visual detection over 92% accuracy. Crucially, the fusion of these modalities via the proposed cross-attention network elevates the overall system performance to near-perfect levels (100% accuracy in controlled tests), effectively minimizing false alarms and missed detections. The analysis confirms the framework’s computational efficiency and robustness across different datasets.

This work provides a practical and reliable technical solution for low-altitude surveillance, addressing a critical need for safety in the rapidly expanding low-altitude economy. Future work will focus on leveraging the RF signal characteristics to estimate the approximate location of the unmanned drone, thereby providing coarse guidance to pan-tilt-zoom (PTZ) cameras for precise visual tracking and identification, creating a fully integrated, passive detection and tracking system for unmanned drones.