Multiscale Spatiotemporal Attention Network for Unmanned Drone Audio Detection

The proliferation of unmanned drone technology has ushered in transformative applications across civilian and commercial sectors, including infrastructure inspection, precision agriculture, aerial photography, and logistics. However, the increasing accessibility of these unmanned drone platforms has concurrently escalated significant security and privacy concerns. The threat of unauthorized unmanned drone flights near sensitive areas such as airports, military installations, power plants, and government facilities necessitates the development of robust, automated, and continuous surveillance systems. Effective detection of these unmanned drone intrusions is a critical component of modern low-altitude airspace security.

To counter the unmanned drone threat, a variety of detection modalities have been explored, each with inherent strengths and limitations. Radio Frequency (RF) methods analyze communication links or radar reflections, but they can be jammed or evaded by unmanned drones operating autonomously. Video and infrared sensors provide rich data but are severely hampered by weather and lighting conditions and require a direct line of sight. Multisensor fusion networks improve reliability but at the cost of system complexity and high deployment overhead. In this context, acoustic-based detection presents a compelling alternative. Audio sensors are passive, cost-effective, work in all lighting conditions, and possess a degree of omnidirectionality. The fundamental challenge lies in reliably distinguishing the characteristic acoustic signature of an unmanned drone from a cacophony of environmental background noise, a task where traditional signal processing techniques often struggle due to low signal-to-noise ratios (SNR) and the non-stationary nature of real-world sounds.

Recent advances in deep learning have revolutionized audio pattern recognition. Inspired by the success of Convolutional Neural Networks (CNNs) in image processing, researchers have effectively adapted these architectures for spectral image analysis, treating time-frequency representations like spectrograms as two-dimensional inputs. For sequential audio data, Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have proven effective in modeling temporal dynamics. The integration of attention mechanisms further enhances model performance by enabling the network to focus adaptively on the most discriminative parts of the input sequence, both in time and across feature channels. This capability is paramount for unmanned drone audio detection, where the salient acoustic features may be brief or obscured by intermittent noise.

Building upon these foundations, we propose a novel deep learning architecture specifically designed for the robust acoustic detection of unmanned drones. Our model, a Multiscale Spatiotemporal Attention Network, is engineered to address the core challenges of this task: extracting discriminative features from low-SNR signals, capturing both local spectral patterns and long-range temporal dependencies, and dynamically suppressing irrelevant background information. The proposed system operates by first converting raw audio into Mel-Frequency Cepstral Coefficients (MFCCs), a perceptually relevant feature representation. A hierarchical CNN then learns multiscale spatial (spectral) features, which are subsequently processed by a bidirectional LSTM to model contextual temporal relationships. Crucially, a dedicated attention module placed between the CNN and the bidirectional LSTM weights critical spatiotemporal features, allowing the model to concentrate on the essence of the unmanned drone sound amidst interference.

The remainder of this article is organized as follows. First, we detail the extraction process of the acoustic features used as input. We then provide a comprehensive description of the proposed Multiscale Spatiotemporal Attention Network, breaking down its architectural components. Following this, we present our experimental setup, including the datasets used, evaluation metrics, and implementation details. We subsequently report and analyze the results, comparing our model’s performance against established baseline methods and conducting ablation studies to validate the contribution of each core component. Finally, we conclude with a summary of our findings and potential directions for future work.

Acoustic Feature Extraction: Mel-Frequency Cepstral Coefficients (MFCCs)

The performance of any audio classification system, including unmanned drone detection, is fundamentally tied to the quality and relevance of its input features. We employ Mel-Frequency Cepstral Coefficients (MFCCs), a feature set renowned for its effectiveness in speech and sound recognition. MFCCs are designed to approximate the human auditory system’s response, which is more sensitive to changes in lower frequencies. The extraction pipeline involves several stages of signal processing.

Given a raw audio signal $x[n]$, the process begins with pre-processing. Pre-emphasis is applied using a high-pass filter to amplify higher frequencies, compensating for the natural attenuation of sound. The signal is then divided into short, overlapping frames (e.g., 20-40 ms), within which it can be treated as approximately stationary. Each frame is multiplied by a window function (typically a Hamming window) to minimize spectral discontinuities at the frame edges.
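As a minimal sketch of the pre-processing stage, the following plain-Python snippet applies pre-emphasis, framing, and Hamming windowing. The pre-emphasis coefficient 0.97 is a common default and an assumption here, since the text does not state it; the 25 ms frame length, 10 ms shift, and 16 kHz rate are the values reported in the experimental setup. The function names are illustrative only.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and window each frame."""
    window = hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop  # trailing partial frame is dropped
    return [
        [signal[i * hop + n] * window[n] for n in range(frame_len)]
        for i in range(n_frames)
    ]

sample_rate = 16000
frame_len = sample_rate * 25 // 1000  # 25 ms -> 400 samples
hop = sample_rate * 10 // 1000        # 10 ms -> 160 samples

# One second of a 200 Hz test tone stands in for a real recording.
one_second = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(pre_emphasis(one_second), frame_len, hop)
```

With these settings a one-second clip yields 98 frames of 400 samples each, which gives a feel for the sequence length $T$ that the later network stages operate on.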

For each framed signal $x_i[n]$, the Discrete Fourier Transform (DFT) is computed to obtain its magnitude spectrum $|X_i[k]|$ in the frequency domain. The core psychoacoustic transformation follows, where this linear-frequency spectrum is warped onto the Mel scale. The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The mapping from physical frequency $f$ in Hertz to Mel frequency $m$ is given by:

$$m = \text{Mel}(f) = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right)$$
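For reference, the Hertz-to-Mel mapping and its inverse can be sketched as follows, using the widely used constants 2595 and 700 Hz. The function names are illustrative and not taken from any particular library.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (2595 / 700 Hz convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

mel_at_1khz = hz_to_mel(1000.0)
nyquist_mel = hz_to_mel(8000.0)  # upper band edge for 16 kHz audio
```

Under this convention 1000 Hz maps to approximately 1000 Mel, which is the anchor point the scale is built around; above that point equal steps in Mel correspond to progressively wider steps in Hz.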

A bank of triangular band-pass filters spaced equally on the Mel scale is constructed. The energy within each filter $E_m$ is calculated by summing the weighted spectrum. This step produces a compact representation that emphasizes perceptually meaningful frequencies. The logarithm of these filterbank energies is then taken, $\log(E_m)$. This compresses the dynamic range and simulates the logarithmic sensitivity of human hearing. Finally, the Discrete Cosine Transform (DCT) is applied to the log filterbank energies. The DCT decorrelates the energies, and the lower-order coefficients represent the spectral envelope’s broad shape, while higher-order coefficients represent finer details. Typically, the first 13-40 coefficients are retained, forming the MFCC feature vector for each frame. For an audio clip, this results in a two-dimensional feature matrix $\mathbf{X} \in \mathbb{R}^{T \times F}$, where $T$ is the number of time frames and $F$ is the number of MFCC coefficients. This matrix serves as the foundational input to our deep learning model for discriminating unmanned drone sounds.
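The last two stages, log compression and the DCT, can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the exact pipeline used in the experiments: it uses an unnormalized DCT-II, a small energy floor to avoid taking the logarithm of zero, and hypothetical function names.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II: c[k] = sum_n v[n] * cos(pi * k * (n + 0.5) / N)."""
    n_total = len(values)
    return [
        sum(v * math.cos(math.pi * k * (n + 0.5) / n_total) for n, v in enumerate(values))
        for k in range(n_total)
    ]

def mfcc_from_filterbank(energies, n_coeffs, floor=1e-10):
    """Log-compress Mel filterbank energies, then keep the first n_coeffs DCT terms."""
    log_energies = [math.log(max(e, floor)) for e in energies]  # floor is an assumption
    return dct_ii(log_energies)[:n_coeffs]

# A perfectly flat spectrum: every filter reports the same energy.
flat_energies = [math.e] * 8
flat_coeffs = mfcc_from_filterbank(flat_energies, n_coeffs=4)
```

For a flat spectrum all the information lands in the zeroth coefficient and the remaining coefficients vanish, which illustrates why low-order coefficients are said to describe the broad spectral envelope.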

Proposed Methodology: Multiscale Spatiotemporal Attention Network

The core of our detection system is the Multiscale Spatiotemporal Attention Network, a hierarchical architecture designed for precise and robust feature learning from unmanned drone audio signals. The model follows a structured pipeline: feature preprocessing, multiscale convolutional feature extraction, spatiotemporal attention-based feature enhancement, and bidirectional sequential classification.

Overall Architecture and Forward Propagation

The model processes a batch of input MFCC features $\mathbf{X} \in \mathbb{R}^{B \times T \times F}$, where $B$ is the batch size, $T$ is the sequence length (time frames), and $F$ is the feature dimension (MFCC coefficients). The goal is to learn a mapping $f: \mathbb{R}^{T \times F} \rightarrow \mathbb{R}^{C}$, where $C$ is the number of output classes (e.g., ‘unmanned drone’ vs. ‘non-drone’). The forward propagation is formally described by the following sequence of transformations, which we will elaborate on in subsequent sections.

Let the learnable parameters of the CNN, Attention, LSTM, and Classifier modules be denoted by $\theta$, $\phi$, $\psi$, and $\omega$, respectively. The forward pass is:

$$
\begin{aligned}
\mathbf{Z}^{(0)} &= \text{Permute}(\mathbf{X}), \quad \mathbf{Z}^{(0)} \in \mathbb{R}^{B \times F \times T} \\
\mathbf{Z}^{(1)} &= \text{CNN}_\theta(\mathbf{Z}^{(0)}), \quad \mathbf{Z}^{(1)} \in \mathbb{R}^{B \times C_{L} \times \frac{T}{2^L}} \\
\mathbf{Z}^{(1)}_{2D} &= \text{ExpandDim}(\mathbf{Z}^{(1)}), \quad \mathbf{Z}^{(1)}_{2D} \in \mathbb{R}^{B \times C_{L} \times \frac{T}{2^L} \times 1} \\
\mathbf{Z}^{(2)}_{2D} &= \text{MSTANet}_\phi(\mathbf{Z}^{(1)}_{2D}), \quad \mathbf{Z}^{(2)}_{2D} \in \mathbb{R}^{B \times C_{L} \times \frac{T}{2^L} \times 1} \\
\mathbf{Z}^{(3)} &= \text{Squeeze}(\mathbf{Z}^{(2)}_{2D}), \quad \mathbf{Z}^{(3)} \in \mathbb{R}^{B \times C_{L} \times \frac{T}{2^L}} \\
\mathbf{Z}^{(4)} &= \text{BiLSTM}_\psi(\mathbf{Z}^{(3)}) \\
\hat{\mathbf{Y}} &= \text{Classifier}_\omega(\mathbf{Z}^{(4)})
\end{aligned}
$$

The initial permutation aligns the feature dimension for 1D convolution. The CNN module extracts hierarchical features while downsampling the temporal dimension. The feature map is then expanded to a 2D structure to be processed by the multiscale spatiotemporal attention module (denoted $\text{MSTANet}_\phi$ in the equations above). After attention-based refinement, the feature is squeezed back to 1D and passed through a Bidirectional LSTM for temporal context modeling before final classification.
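The tensor shapes in the forward pass can be traced with a few lines of plain Python. The channel widths (64, 128, 256) are those of the three-block CNN described in this article; the floor division for odd lengths is an assumption that mirrors the usual behavior of stride-2 max pooling, and the function name is illustrative.

```python
def trace_shapes(batch, frames, n_mfcc, channels=(64, 128, 256)):
    """Return the (name, shape) sequence from the input X to Z^(3)."""
    shapes = [("X", (batch, frames, n_mfcc)),
              ("Z0 permute", (batch, n_mfcc, frames))]
    t = frames
    for _ in channels:
        t //= 2  # each block halves the time axis (floor for odd lengths, assumed)
    c_last = channels[-1]
    shapes.append(("Z1 cnn", (batch, c_last, t)))
    shapes.append(("Z1_2D expand", (batch, c_last, t, 1)))
    shapes.append(("Z2_2D attention", (batch, c_last, t, 1)))
    shapes.append(("Z3 squeeze", (batch, c_last, t)))
    return shapes

trace = dict(trace_shapes(batch=32, frames=96, n_mfcc=40))
```

For a batch of 32 clips with 96 frames and 40 coefficients, the CNN output has shape (32, 256, 12), so the BiLSTM sees a sequence of only 12 steps per clip.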

Multiscale Convolutional Feature Extraction Network

To capture spectral patterns at various granularities inherent in unmanned drone sounds, we design a progressive 1D CNN with increasing channel depth. The network consists of a sequence of convolutional blocks, each performing a convolution, batch normalization, ReLU activation, dropout for regularization, and max-pooling for downsampling. This structure allows the model to learn from local fine-grained features (e.g., harmonic components) in early layers to more abstract, broad spectral shapes in deeper layers.

Let the input feature after permutation be $\mathbf{Z}^{(0)} = \mathbf{M}^{(0)}$. The operation at the $l$-th convolutional block is defined as:

$$
\mathbf{M}^{(l)} = \text{MaxPool}\left( \text{Dropout}\left( \text{ReLU}\left( \text{BN}\left( \mathbf{W}^{(l)} \ast \mathbf{M}^{(l-1)} + \mathbf{b}^{(l)} \right) \right) \right) \right)
$$

where $\ast$ denotes the 1D convolution operation, $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the kernel weights and bias for layer $l$, BN is Batch Normalization, and MaxPool uses a stride of 2, halving the temporal dimension. The parameters for our three-block architecture are summarized below.

Layer	Input Shape	Output Shape	Kernel Size / Pool	Composition
Input	$(B, T, F)$	$(B, F, T)$	–	Permute(1, 2)
Conv Block 1	$(B, F, T)$	$(B, 64, T/2)$	Kernel=3, Pool=2	Conv1D+BN+ReLU+Dropout+MaxPool
Conv Block 2	$(B, 64, T/2)$	$(B, 128, T/4)$	Kernel=3, Pool=2	Conv1D+BN+ReLU+Dropout+MaxPool
Conv Block 3	$(B, 128, T/4)$	$(B, 256, T/8)$	Kernel=3, Pool=2	Conv1D+BN+ReLU+Dropout+MaxPool
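A minimal PyTorch sketch of the three-block extractor in the table is given below. The padding of 1 (so that only the pooling changes the sequence length) and the dropout rate of 0.2 are assumptions, since neither is specified in the text; class and function names are illustrative.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """Conv1D + BN + ReLU + Dropout + MaxPool, as listed in the table."""

    def __init__(self, in_channels, out_channels, dropout=0.2):  # dropout rate assumed
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),  # padding assumed
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.MaxPool1d(kernel_size=2, stride=2),  # halves the time axis
        )

    def forward(self, x):
        return self.block(x)

def build_cnn(n_mfcc=40):
    """Three blocks with channel widths 64, 128, 256."""
    return nn.Sequential(ConvBlock(n_mfcc, 64), ConvBlock(64, 128), ConvBlock(128, 256))

cnn = build_cnn().eval()
x = torch.randn(4, 96, 40)            # (B, T, F)
with torch.no_grad():
    z1 = cnn(x.transpose(1, 2))       # (B, F, T) -> (B, 256, T/8)
```

Here `z1` has shape (4, 256, 12), matching the output shape of Conv Block 3 for $T = 96$.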

Multiscale Spatiotemporal Attention Enhancement Layer

The feature maps $\mathbf{Z}^{(1)}$ from the CNN are rich but may contain redundant or noise-correlated information. To refine these features specifically for the task of unmanned drone identification, we introduce a dedicated attention module. This module treats the time axis as a spatial dimension and handles the feature channels separately, applying a two-pronged strategy: channel recalibration followed by multiscale temporal filtering.

First, the 1D feature tensor $\mathbf{Z}^{(1)} \in \mathbb{R}^{B \times C_L \times T'}$ (where $T' = T/2^L$) is expanded to a 2D representation $\mathbf{Z}^{(1)}_{2D} \in \mathbb{R}^{B \times C_L \times T' \times 1}$. This allows the application of 2D spatial operations along the time axis. The core of this layer is a Squeeze-and-Excitation with Skip Connection Channel Attention (S-C Attention) block followed by a Multiscale Depthwise Convolution Fusion block.

S-C Attention Module: For an input 2D feature map $\mathbf{S}_{IN}$, we split it along the channel dimension into two subsets: $\mathbf{S}_1, \mathbf{S}_2 = \text{Split}(\mathbf{S}_{IN})$, where $\mathbf{S}_1, \mathbf{S}_2 \in \mathbb{R}^{B \times \frac{C}{2} \times H \times W}$ ($H=T'$, $W=1$). We then apply global average pooling (GAP) to $\mathbf{S}_1$ and global max pooling (GMP) to $\mathbf{S}_2$, each followed by its own two-layer perceptron (weights $\mathbf{W}_1, \mathbf{W}_2$ for the average branch and $\mathbf{W}_3, \mathbf{W}_4$ for the max branch) to generate channel-wise attention weights.

$$
\begin{aligned}
\mathbf{A}_{avg} &= \sigma(\mathbf{W}_2 \cdot \delta(\mathbf{W}_1 \cdot \text{GAP}(\mathbf{S}_1))) \\
\mathbf{A}_{max} &= \sigma(\mathbf{W}_4 \cdot \delta(\mathbf{W}_3 \cdot \text{GMP}(\mathbf{S}_2)))
\end{aligned}
$$

where $\delta$ is the ReLU activation, $\sigma$ is the sigmoid function, and $\mathbf{W}_*$ are learnable weight matrices. The final channel-refined feature is the concatenation of the recalibrated subsets: $\mathbf{S}_{OUT} = \text{Concat}(\mathbf{A}_{avg} \odot \mathbf{S}_1, \: \mathbf{A}_{max} \odot \mathbf{S}_2)$, where $\odot$ denotes channel-wise multiplication.
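The split, pool, and recalibrate steps can be sketched in PyTorch as follows. The sketch follows the equations literally, with separate bias-free weights for the two branches; the bottleneck reduction ratio of 4 is an assumption, as the hidden width of the perceptrons is not given, and the class name is illustrative.

```python
import torch
from torch import nn

class SplitChannelAttention(nn.Module):
    """GAP branch on the first half of the channels, GMP branch on the second half."""

    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        assert channels % 2 == 0, "channel count must be even to split in half"
        half = channels // 2
        hidden = max(half // reduction, 1)

        def mlp():
            return nn.Sequential(
                nn.Linear(half, hidden, bias=False), nn.ReLU(),
                nn.Linear(hidden, half, bias=False), nn.Sigmoid(),
            )

        self.mlp_avg = mlp()  # plays the role of W1, W2
        self.mlp_max = mlp()  # plays the role of W3, W4

    def forward(self, s_in):                        # s_in: (B, C, H, W)
        s1, s2 = torch.chunk(s_in, 2, dim=1)
        a_avg = self.mlp_avg(s1.mean(dim=(2, 3)))   # GAP -> (B, C/2)
        a_max = self.mlp_max(s2.amax(dim=(2, 3)))   # GMP -> (B, C/2)
        s1 = s1 * a_avg[:, :, None, None]           # channel-wise recalibration
        s2 = s2 * a_max[:, :, None, None]
        return torch.cat([s1, s2], dim=1)

attention = SplitChannelAttention(256).eval()
s_in = torch.randn(2, 256, 12, 1)
with torch.no_grad():
    s_out = attention(s_in)
```

Because the sigmoid weights lie between 0 and 1, the module can only attenuate channels, never amplify them; the output keeps the input shape.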

Multiscale Depthwise Convolution Fusion: The output $\mathbf{S}_{OUT}$ is then processed in parallel by multiple depthwise separable convolutions with different kernel sizes to capture temporal patterns at multiple scales. This is crucial for unmanned drone sounds, which may exhibit both rapid rotor harmonics and slower engine modulations.

$$
\mathbf{Y} = \text{DConv}_{5\times5}(\mathbf{S}_{OUT}) + \text{DConv}_{7\times7}(\mathbf{S}_{OUT}) + \text{DConv}_{11\times11}(\mathbf{S}_{OUT}) + \text{DConv}_{21\times21}(\mathbf{S}_{OUT})
$$

The outputs are summed, allowing the network to integrate information from short, medium, and long temporal contexts, effectively increasing the receptive field and focusing on temporally salient regions indicative of an unmanned drone.
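One way to realize the parallel branches is sketched below in PyTorch. It implements only the depthwise stage (one filter per channel via `groups=channels`) and omits any pointwise 1x1 convolution, which is an assumption since the equation shows just the summed depthwise outputs. Same-size padding of `k // 2` is also assumed. Note that with $W = 1$ only the center column of each $k \times k$ kernel ever meets non-padded input, so each branch acts as a length-$k$ filter along time.

```python
import torch
from torch import nn

class MultiScaleDepthwise(nn.Module):
    """Sum of depthwise convolutions with kernel sizes 5, 7, 11 and 21."""

    def __init__(self, channels, kernel_sizes=(5, 7, 11, 21)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])

    def forward(self, s_out):                   # s_out: (B, C, T', 1)
        y = self.branches[0](s_out)
        for branch in self.branches[1:]:
            y = y + branch(s_out)               # fuse scales by summation
        return y

fusion = MultiScaleDepthwise(256).eval()
features = torch.randn(2, 256, 12, 1)
with torch.no_grad():
    fused = fusion(features)
```

The fused output has the same shape as the input, so it can be squeezed back to $(B, C_L, T')$ and handed to the BiLSTM.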

Multilevel Bidirectional Temporal Classification Layer

After attention-based enhancement, the feature sequence $\mathbf{Z}^{(3)}$ contains refined spatiotemporal information. To model the full contextual dependencies in the audio sequence—how the sound of an unmanned drone evolves over time—we employ a deep Bidirectional LSTM (BiLSTM) network.

Given the input sequence $\mathbf{Z}^{(3)} = (\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_{T'})$, a standard LSTM computes a hidden state $\overrightarrow{\mathbf{h}_t}$ using information from past time steps. A BiLSTM runs two separate LSTMs: one forward ($\overrightarrow{\mathbf{h}_t}$) and one backward ($\overleftarrow{\mathbf{h}_t}$). The final representation at each time step is the concatenation of both directions:

$$
\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t}; \overleftarrow{\mathbf{h}_t}]
$$

We stack three such BiLSTM layers. This hierarchy allows the model to learn temporal patterns at different levels of abstraction: lower layers capture local, short-term dynamics (e.g., individual sound events), while higher layers integrate these into longer-term, global structures that define the entire unmanned drone acoustic signature.

The final step is classification based on the learned sequential features. Instead of using only the last time step’s output, we employ a multilevel feature fusion strategy. We extract the final hidden states from all layers of the BiLSTM stack, resulting in a rich feature tensor $\mathbf{H} \in \mathbb{R}^{2L \times B \times D_h}$, where $L$ here denotes the number of BiLSTM layers (three in our model, not the CNN block count that appears in $C_L$ and $T/2^L$) and $D_h$ is the hidden dimension. This tensor is flattened and passed through a two-stage classifier with dimensionality reduction to prevent overfitting:

$$
\begin{aligned}
\mathbf{a}^{(1)} &= \delta(\mathbf{W}_1 \cdot \text{Flatten}(\mathbf{H}) + \mathbf{b}^{(1)}), \quad \mathbf{a}^{(1)} \in \mathbb{R}^{256} \\
\hat{\mathbf{y}} &= \text{Softmax}(\mathbf{W}_2 \cdot \mathbf{a}^{(1)} + \mathbf{b}^{(2)}), \quad \hat{\mathbf{y}} \in \mathbb{R}^{C}
\end{aligned}
$$

where $\mathbf{W}_1, \mathbf{W}_2$ are weight matrices, $\mathbf{b}^{(1)}, \mathbf{b}^{(2)}$ are bias vectors, and $\delta$ is a ReLU activation function. The final output $\hat{\mathbf{y}}$ represents the predicted probability distribution over the $C$ classes.
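The stacked BiLSTM and the two-stage classifier can be sketched in PyTorch as below. The hidden size $D_h = 128$ is an assumption, as the text does not report it; the three layers and the 256-unit intermediate layer follow the description above. The class name is illustrative.

```python
import torch
from torch import nn

class BiLSTMHead(nn.Module):
    """Three-layer BiLSTM whose final hidden states from every layer feed the classifier."""

    def __init__(self, in_features=256, hidden=128, layers=3, n_classes=2):  # hidden assumed
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * layers * hidden, 256)
        self.fc2 = nn.Linear(256, n_classes)

    def forward(self, z3):                                   # z3: (B, C_L, T')
        _, (h_n, _) = self.lstm(z3.transpose(1, 2))          # h_n: (2L, B, D_h)
        flat = h_n.transpose(0, 1).reshape(z3.size(0), -1)   # (B, 2L * D_h)
        a1 = torch.relu(self.fc1(flat))
        return torch.softmax(self.fc2(a1), dim=1)            # class probabilities

head = BiLSTMHead().eval()
with torch.no_grad():
    y_hat = head(torch.randn(4, 256, 12))
```

Each row of `y_hat` is a probability distribution over the $C$ classes. When training with a cross-entropy loss that expects raw scores, the softmax would normally be dropped and applied only at inference time.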

Experiments and Results

Datasets and Experimental Setup

We evaluate the proposed model on two distinct datasets to assess its performance and generalization capability for unmanned drone audio detection.

1. Anti-UAV-Sound (Self-Built Dataset): This dataset is specifically curated for the binary classification task of distinguishing unmanned drone sounds from other sounds. Unmanned drone audio clips were collected from real-world recordings of various quadcopter models in indoor and outdoor settings. Non-drone sounds were sourced from public environmental sound datasets (ESC-50) and noise samples, including wind, vehicle traffic, human speech, and machinery. We also added silent segments to balance the dataset. The audio files were standardized to a single channel with a 16 kHz sampling rate.

2. UrbanSound8K (Public Dataset): To demonstrate the model’s versatility beyond the specific unmanned drone task, we also test it on this widely-used benchmark for general urban sound classification. It contains 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. While not exclusively focused on unmanned drones, this dataset presents a challenging multi-class acoustic environment, testing the model’s ability to learn diverse audio patterns.

For both datasets, we split the data into training and testing sets using a 90%-10% ratio. All experiments are conducted on an NVIDIA RTX 4090 GPU using PyTorch. We use the AdamW optimizer with an initial learning rate of $1\times10^{-3}$, a batch size of 32, and train for 100 epochs. The input MFCC features are computed with 40 coefficients, 25 ms frame length, and 10 ms frame shift.

Parameter	Anti-UAV-Sound Dataset	UrbanSound8K Dataset
Training Samples	11,480	7,858
Testing Samples	1,277	874
Number of Classes	2	10
Primary Target	Unmanned Drone vs. Non-Drone	General Urban Sounds

Evaluation Metrics

We employ standard classification metrics to quantitatively evaluate model performance. For binary classification (Anti-UAV-Sound), we report Accuracy, Precision, Recall, and the F1-Score. For multi-class classification (UrbanSound8K), we report overall Accuracy and the macro-averaged F1-Score. Let TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.

$$
\begin{aligned}
\text{Accuracy } (\eta) &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision } (P) &= \frac{TP}{TP + FP} \\
\text{Recall } (R) &= \frac{TP}{TP + FN} \\
\text{F1-Score } (F1) &= 2 \times \frac{P \times R}{P + R}
\end{aligned}
$$

In the context of unmanned drone detection, a high Precision is critical to minimize false alarms (e.g., mistaking a car for a drone), while a high Recall is vital to avoid missing actual unmanned drone threats. The F1-Score provides a balanced measure of these two competing objectives.
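The four metrics, together with the macro-averaged F1 used for the multi-class case, can be computed as in the following sketch. The confusion counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not results from the experiments.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def macro_f1(per_class_counts):
    """Unweighted mean of one-vs-rest F1 scores; each entry is (tp, fp, fn)."""
    scores = []
    for tp, fp, fn in per_class_counts:
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Hypothetical counts: 100 drone clips and 100 non-drone clips.
accuracy, precision, recall, f1 = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

With these illustrative counts the detector misses 10 of 100 drones (recall 0.90) and raises 15 false alarms (precision about 0.857), giving an F1 of about 0.878.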

Comparative Performance Analysis

We compare our proposed Multiscale Spatiotemporal Attention Network against several state-of-the-art audio classification models from the literature. The results are presented in the table below.

Model	Reference / Year	Anti-UAV-Sound Acc. (%)	Anti-UAV-Sound F1 (%)	UrbanSound8K Acc. (%)	UrbanSound8K F1 (%)
ECAPA-TDNN	Interspeech 2020	93.25	93.71	88.75	88.40
PANNs	IEEE/ACM TASLP 2020	92.71	93.22	95.56	96.76
TDNN	Computer Speech & Language 2022	94.22	93.47	95.93	96.74
CAM++	arXiv 2022	93.89	94.12	92.73	93.99
ERes2Net	ICASSP 2022	93.87	94.32	95.62	96.56
SVM with MFCC/LPCC	IEEE TVT 2019	90.12	89.93	88.10	87.63
CNN with MFCC+GFCC	Chinese Journal 2020	91.22	93.01	88.13	88.11
Proposed Model	–	95.20	95.47	97.34	98.36

On the Anti-UAV-Sound dataset, our model achieves the highest accuracy of 95.20% and the highest F1-score of 95.47%, outperforming all compared models. This indicates a superior ability to balance the detection of true unmanned drone sounds while rejecting false positives from background noise. The improvement over strong baselines like TDNN and ERes2Net validates the effectiveness of our integrated multiscale and attention design for this specific task.

On the more diverse UrbanSound8K dataset, our model also achieves the best results among the compared models, with 97.34% accuracy and 98.36% F1-score. This suggests that the architectural principles of multiscale feature learning and spatiotemporal attention are not task-specific but contribute to robust general-purpose audio classification. The model has approximately 4.93 million parameters, which is compact relative to several high-performing counterparts.

Ablation Study

To deconstruct the contribution of each core component in our proposed network for unmanned drone detection, we conduct a systematic ablation study on the Anti-UAV-Sound dataset. We create several model variants by removing or simplifying key modules.

Model Variant	Configuration	Purpose	Acc. (%)	F1 (%)
M1: Full Model	CNN + BiLSTM + Attention + Classifier	Performance Baseline	95.20	95.47
M2: No Attention	CNN + BiLSTM + Classifier	Assess Attention Contribution	94.45	95.84
M3: No BiLSTM	CNN + Attention + Classifier	Assess Temporal Modeling Contribution	93.10	94.13
M4: Shallow CNN	2-Layer CNN + BiLSTM + Attention + Classifier	Assess CNN Depth Impact	95.54	96.82
M5: CNN Baseline	CNN + Classifier	Baseline Comparison	92.57	93.89

The results largely support the design, with one exception. Removing the attention mechanism (M2) lowers accuracy by 0.75 percentage points (95.20% to 94.45%), which suggests that the module helps the network focus on discriminative spatiotemporal features of the unmanned drone sound; the reported F1-score of M2 (95.84%), however, is marginally above that of the full model (95.47%), so the gain appears in accuracy rather than in F1. Removing the BiLSTM (M3) causes a larger degradation (2.10 points in accuracy and 1.34 points in F1), underscoring the importance of modeling long-range temporal context: the evolving nature of the unmanned drone’s acoustic signature is a key discriminant. The simple CNN baseline (M5) performs worst on both metrics, indicating that the sequential modeling and attention components together account for a substantial share of the full model's performance. The exception is the shallower CNN (M4), which is reported at 95.54% accuracy and 96.82% F1, slightly above the full model on both metrics; on these figures, the third convolutional block does not improve detection on this dataset.
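The per-variant differences relative to the full model can be recomputed directly from the ablation table. The short sketch below uses the accuracy and F1 values exactly as tabulated; the dictionary layout and function name are illustrative.

```python
# (accuracy %, F1 %) for each variant, copied from the ablation table.
results = {
    "M1": (95.20, 95.47),
    "M2": (94.45, 95.84),
    "M3": (93.10, 94.13),
    "M4": (95.54, 96.82),
    "M5": (92.57, 93.89),
}

def deltas_vs_full(table, reference="M1"):
    """Difference of each variant from the reference, in percentage points."""
    ref_acc, ref_f1 = table[reference]
    return {
        name: (round(acc - ref_acc, 2), round(f1 - ref_f1, 2))
        for name, (acc, f1) in table.items()
        if name != reference
    }

deltas = deltas_vs_full(results)
for name, (d_acc, d_f1) in sorted(deltas.items()):
    print(f"{name}: accuracy {d_acc:+.2f}, F1 {d_f1:+.2f}")
```

Reporting both columns side by side makes it easy to see where the accuracy and F1 trends agree and where they diverge.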

Conclusion

In this work, we have presented a novel deep learning architecture, the Multiscale Spatiotemporal Attention Network, for the critical task of unmanned drone audio detection. The model is designed to tackle the central challenges in this domain: low signal-to-noise ratios, the complex temporal evolution of unmanned drone sounds, and the need to suppress irrelevant environmental audio. By combining a hierarchical CNN for multiscale spectral feature extraction, a bidirectional LSTM for contextual temporal modeling, and a dedicated attention mechanism for adaptive feature refinement, the network learns a highly discriminative representation of the target acoustic signature.

Experiments on a dedicated unmanned drone dataset (Anti-UAV-Sound) and a general urban sound benchmark (UrbanSound8K) show that the proposed model outperforms the established audio classification models we compared against on both accuracy and F1-score. The ablation study indicates that the BiLSTM and the attention module each contribute to accuracy, with temporal modeling having the largest effect. These results make the model a promising technical solution for real-world low-altitude security applications aimed at mitigating the risks posed by unauthorized unmanned drone operations.

Future work will focus on further optimizing the model for embedded deployment, exploring its integration with other sensor modalities (e.g., RF, video) in a fusion framework, and expanding the dataset to include a wider variety of unmanned drone models and more challenging acoustic scenarios.