In recent years, the rapid advancement of drone technology has led to widespread applications of Unmanned Aerial Vehicles (UAVs) in fields including surveillance, delivery, and environmental monitoring. However, the proliferation of UAVs also poses security risks, such as unauthorized intrusions and potential threats to critical infrastructure. Traditional methods for drone identification, including visual, acoustic, radar, and radio frequency (RF) signal-based approaches, have been explored extensively. Among these, RF signal-based methods offer advantages in stealth and anti-jamming capability, making them suitable for complex environments such as urban areas. In particular, frequency-hopping (FH) signals, commonly used in the uplink control channels of many commercial UAVs, combine strong resistance to interference with low implementation complexity. Despite these benefits, existing deep learning-based classification methods that operate directly on time-frequency domain features often suffer from high computational complexity and insufficient real-time performance, limiting their practicality in resource-constrained scenarios.

To address these challenges, this study proposes a novel method for extracting multi-scale time-frequency features from drone frequency-hopping signals using two-dimensional wavelet decomposition. By leveraging the physical characteristics of FH signals, which exhibit energy concentration at specific time-frequency scales and uniform hopping patterns, our approach optimizes the feature space of time-frequency representations. This not only reduces data dimensionality but also preserves discriminative features essential for accurate classification. The core of our method involves applying 2D discrete wavelet transform (2D-DWT) to Short-Time Fourier Transform (STFT)-generated time-frequency images, extracting approximation coefficients at multiple decomposition levels, and selecting the optimal coefficients as inputs to a deep convolutional neural network (CNN) for classification. We employ the ResNet-18 architecture, known for its residual learning blocks that mitigate gradient vanishing issues in deep networks, to achieve high classification accuracy while maintaining computational efficiency.
The mathematical formulation of a typical frequency-hopping signal in drone communication can be expressed as:
$$f_T(t) = A \sum_{k=0}^{N-1} W_T(t - kT_h) \cdot \cos[2\pi f_k (t - kT_h) + \phi]$$
where \( A \) represents the modulation amplitude, \( W_T \) is a window function of duration \( T_h \), \( f_k \) denotes the discrete frequency points, and \( \phi \) is the phase. Modern FH systems in UAVs often employ uniform frequency hopping, where the carrier frequency switches periodically among a set of predefined frequencies, leading to energy concentration in specific time-frequency regions. The STFT is used to generate time-frequency images from the raw signals, providing a visual representation of local spectral features. The continuous form of the STFT is defined as:
$$STFT_x(t, f) = \int_{-\infty}^{\infty} x(\tau) \omega(\tau - t) e^{-j2\pi f \tau} d\tau$$
where \( x(\tau) \) is the input signal, and \( \omega(\tau - t) \) is the window function. For discrete implementation, we use:
$$STFT_x(n, k) = \sum_{m=0}^{N-1} x(n + m) \omega(m) e^{-j2\pi mk / N}$$
with a Hamming window function to minimize spectral leakage:
$$\omega(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$$
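To make the signal model concrete, the sketch below (illustrative only, not the paper's code: the sampling rate, hop set, hop duration, and hop count are arbitrary choices) synthesizes a toy FH signal following the expression for \( f_T(t) \) above and computes its Hamming-windowed STFT magnitude with SciPy:

```python
# Minimal sketch: synthesize a toy FH signal per f_T(t) and compute its STFT
# time-frequency image. All numeric parameters below are illustrative.
import numpy as np
from scipy.signal import stft

fs = 1e6                                          # sampling rate (Hz), assumed
T_h = 1e-3                                        # hop duration T_h (s), assumed
hop_set = np.array([50e3, 120e3, 200e3, 310e3])   # discrete frequencies f_k, assumed
N_hops = 40
A, phi = 1.0, 0.0

samples_per_hop = int(T_h * fs)
t_hop = np.arange(samples_per_hop) / fs           # time within one hop, (t - k*T_h)
rng = np.random.default_rng(0)
f_k = rng.choice(hop_set, size=N_hops)            # uniform hopping over the set

# Rectangular window W_T of duration T_h: each hop is a pure tone segment.
signal = np.concatenate(
    [A * np.cos(2 * np.pi * f * t_hop + phi) for f in f_k]
)

# STFT with a Hamming window; the magnitude |STFT| is the time-frequency image.
f_axis, t_axis, Z = stft(signal, fs=fs, window="hamming", nperseg=256, noverlap=128)
tf_image = np.abs(Z)
print(tf_image.shape)                             # (frequency bins, time frames)
```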
However, directly using these time-frequency images in neural networks results in high-dimensional data, increasing storage requirements and computational load. To overcome this, we apply 2D-DWT to decompose the time-frequency images into approximation and detail coefficients. The wavelet decomposition captures multi-scale features by recursively applying filters in horizontal and vertical directions. For a 2D signal \( f(x, y) \), the decomposition can be represented as:
$$\begin{aligned}
\phi(x, y) &= \phi(x) \phi(y) \\
\psi^H(x, y) &= \psi(x) \phi(y) \\
\psi^V(x, y) &= \phi(x) \psi(y) \\
\psi^D(x, y) &= \psi(x) \psi(y)
\end{aligned}$$
where \( \phi(x, y) \) is the two-dimensional scaling function, whose projections yield the approximation coefficients, and \( \psi^H, \psi^V, \psi^D \) are the wavelet functions that yield the horizontal, vertical, and diagonal detail coefficients, respectively. The approximation coefficients at decomposition level \( j \) are computed as:
$$W_{\phi}(j, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \phi_{j, m, n}(x, y)$$
We use the Haar wavelet basis due to its compact support, orthogonality, and symmetry, which aligns well with the stepped energy distribution of FH signals. The Haar scaling and wavelet functions are:
$$\phi = \frac{1}{\sqrt{2}}[1, 1], \quad \psi = \frac{1}{\sqrt{2}}[1, -1]$$
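As a sketch of this step, the snippet below uses PyWavelets to run a multi-level Haar 2D-DWT on an STFT magnitude image (`tf_image` is assumed to be the array from the previous sketch) and keep only the deepest approximation subband:

```python
# Minimal sketch: multi-level Haar 2D-DWT of the time-frequency image,
# retaining only the level-j approximation coefficients as the CNN input.
import pywt

level = 3
coeffs = pywt.wavedec2(tf_image, wavelet="haar", level=level)
cA3 = coeffs[0]   # approximation coefficients at the deepest level (cA_level)

# Each level roughly halves both dimensions, so cA3 is about 1/64 the size of
# the original image; this is where the dimensionality reduction comes from.
print(tf_image.shape, "->", cA3.shape)

# Alternatively, a single-level step that also exposes the detail subbands:
cA1, (cH1, cV1, cD1) = pywt.dwt2(tf_image, "haar")
```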
By comparing the classification performance of approximation coefficients at different decomposition levels, we identify the optimal level that balances accuracy and computational efficiency. The selected coefficients are then fed into the ResNet-18 model, which utilizes residual blocks to facilitate training of deep networks. The residual block can be expressed as:
$$x_{m+1} = x_m + \mathcal{F}(x_m, \mathcal{W}_m)$$
where \( x_m \) is the input to the \( m \)-th block, \( \mathcal{F} \) represents the residual function, and \( \mathcal{W}_m \) denotes the weights. This structure helps in learning identity mappings, ensuring that deeper networks do not degrade in performance.
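A minimal PyTorch rendering of this residual mapping, together with an adaptation of torchvision's stock `resnet18` to single-channel coefficient images and 25 classes (both adjustments are our assumptions about the setup), might look like:

```python
# Minimal sketch of the residual mapping x_{m+1} = x_m + F(x_m, W_m), in the
# style of the basic block used by ResNet-18.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # F(x, W): two 3x3 convolutions with batch norm, as in ResNet-18.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.f(x))   # identity shortcut plus residual

model = resnet18(num_classes=25)
# Replace the RGB stem with a 1-channel stem for wavelet-coefficient inputs
# (an assumption about how single-channel inputs are handled).
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
```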
For experimental validation, we use the DroneRFa dataset, which includes 25 signal classes: background noise plus 24 types of UAV signals. The dataset captures RF signals from drones operating in dual-band communication modes, with a sampling rate of 100 MS/s and a segment duration of 0.01 s. We split the data into training, validation, and test sets in a 6:2:2 ratio, yielding 11,379, 3,792, and 3,792 samples, respectively. Training uses the Adam optimizer with an initial learning rate of \( 1 \times 10^{-3} \), a batch size of 32, and cross-entropy loss. Early stopping is employed to prevent overfitting, and model performance is evaluated by classification accuracy and inference time.
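A bare-bones version of this training loop (with `train_set` and `val_set` as hypothetical datasets yielding coefficient-image/label pairs, `model` the adapted ResNet-18 from the previous sketch, and a patience of 10 epochs as an assumed early-stopping criterion) could look like:

```python
# Minimal sketch of the described training setup: Adam at lr 1e-3, batch size
# 32, cross-entropy loss, early stopping on validation accuracy.
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

best_acc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    # Validation accuracy drives early stopping.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    acc = correct / total
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # stop once validation accuracy plateaus
```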
The results demonstrate that our method achieves high accuracy while significantly reducing computational overhead. The following table summarizes the performance comparison across different decomposition levels:
| Method | Accuracy (%) | Classification Time (s) | Training Time |
|---|---|---|---|
| STFT (Baseline) | 98.80 | 183.9 | N/A |
| Wavelet Level 1 Approximation | 98.07 | 8.2094 | 80 min 23 s |
| Wavelet Level 2 Approximation | 97.28 | 2.2925 | 35 min 50 s |
| Wavelet Level 3 Approximation | 97.60 | 0.7279 | 31 min 38 s |
| Wavelet Level 4 Approximation | 96.39 | 0.4105 | 29 min 59 s |
As shown, the third-level approximation coefficients provide the best trade-off, reaching 97.60% accuracy with a classification time of only 0.7279 s, roughly 252 times faster than the STFT baseline. This highlights the efficacy of optimizing the time-frequency feature space for UAV signal classification. The confusion matrix for the third-level coefficients reveals strong class separability, with most entries concentrated along the diagonal, indicating robust performance across different UAV types and environmental conditions.
In conclusion, the proposed method effectively addresses the computational cost of time-frequency domain features in drone signal classification. By combining multi-scale wavelet decomposition with deep residual networks, it achieves high accuracy and real-time performance, making it suitable for embedded systems and real-time monitoring. Future work will extend this approach to other types of RF signals and explore adaptive wavelet bases to further enhance feature extraction as drone communication protocols evolve. These advances will contribute to safer and more efficient UAV operations across a range of domains.
