In recent years, Unmanned Aerial Vehicles (UAVs), particularly quadrotor UAVs, have become indispensable in various applications such as logistics, search and rescue, surveillance, and agricultural monitoring. The autonomous navigation and control of these UAVs heavily rely on the Global Navigation Satellite System (GNSS) for precise positioning and timing. However, the vulnerability of GNSS signals to spoofing attacks poses significant security risks, potentially leading to mission failure, loss of control, or even catastrophic incidents. Among these threats, ramp-style slowly varying spoofing attacks are particularly insidious, as they introduce gradual deviations that are challenging to detect using conventional methods. This paper addresses this critical issue by proposing a novel detection framework based on Gated Cross-Attention (GCA) for multi-feature fusion, specifically designed to enhance the detection accuracy and robustness of UAVs under such attacks.
The proliferation of Unmanned Aerial Vehicle technology has enabled advancements in autonomous operations, but it also exposes these systems to cyber-physical threats. GNSS spoofing attacks, where malicious actors broadcast counterfeit signals to mislead the UAV’s navigation system, can cause gradual drift in position estimates, evading traditional detection mechanisms that rely on abrupt changes. For instance, in a quadrotor UAV, the integration of GNSS with an Inertial Navigation System (INS) through filters like the Extended Kalman Filter (EKF) is common. The innovation sequence generated during this fusion process serves as a key indicator of consistency between sensor measurements and predictions. Under ramp-style attacks, the innovation sequence exhibits subtle temporal patterns and statistical anomalies that can be leveraged for detection. However, existing approaches often struggle with delayed detection or low accuracy due to inadequate feature extraction and fusion techniques.

Our work introduces a GCA-based LSTM-CNN network (GCA-LSTM-CNN) that synergistically combines temporal and local anomaly features from innovation sequences and Mahalanobis distance metrics. The Long Short-Term Memory (LSTM) network captures long-term dependencies and cumulative effects induced by slow attacks, while the 1D Convolutional Neural Network (1D-CNN) extracts localized statistical deviations. The core innovation lies in the gated cross-attention mechanism, which dynamically adjusts the contribution of each feature branch based on global context, overcoming limitations of static fusion methods. Additionally, we incorporate a weighted label smoothing loss function to handle class imbalance and improve generalization. Through extensive simulations in a ROS-Gazebo environment with a PX4-controlled quadrotor UAV, we demonstrate that our model achieves superior performance compared to baseline methods, with an accuracy of 98.83% and reduced false negatives across various attack scenarios.
The remainder of this paper is structured as follows: First, we describe the problem formulation and the characteristics of ramp-style GNSS spoofing attacks on Unmanned Aerial Vehicles. Next, we detail the proposed GCA-LSTM-CNN architecture, including feature extraction, fusion mechanisms, and loss design. We then present experimental results, including ablation studies and comparisons with state-of-the-art methods. Finally, we conclude with insights and future directions for enhancing UAV security.
Problem Description: GNSS Spoofing in Unmanned Aerial Vehicles
In a typical quadrotor UAV, the navigation system fuses GNSS measurements with IMU data using an EKF to estimate the vehicle’s state, including position, velocity, and attitude. The measurement equation in the presence of a spoofing attack can be modeled as:
$$ \mathbf{y}_k = h(\mathbf{x}_k) + \mathbf{v}_k + \lambda \mathbf{y}^a_k $$
where $\mathbf{y}_k$ is the sensor measurement vector, $h(\cdot)$ is the observation function, $\mathbf{x}_k$ is the system state, $\mathbf{v}_k$ is the measurement noise, $\lambda$ is a diagonal attack selection matrix indicating which states are compromised, and $\mathbf{y}^a_k$ represents the spoof-induced deviation. For ramp-style attacks, this deviation, written $\mathbf{s}(t)$ in continuous time, grows linearly over time:
$$ \mathbf{s}(t) = \frac{t - \tau}{T_s} \times \mathbf{s}_{\text{total}} $$
where $\tau$ is the attack start time, $T_s$ is the transition duration, and $\mathbf{s}_{\text{total}}$ is the total offset vector. This results in a slow divergence in the innovation sequence $\mathbf{r}_k$, defined as the difference between actual measurements and EKF predictions:
$$ \mathbf{r}_k = \mathbf{y}_k - h(\hat{\mathbf{x}}_{k|k-1}) $$
For GNSS position measurements, the innovation becomes:
$$ \mathbf{r}_{\text{pos},k} = \mathbf{p}_{\text{gnss,true},k} + \mathbf{s}(t) – \hat{\mathbf{p}}_{k|k-1} $$
where $\mathbf{p}_{\text{gnss,true},k}$ is the true position, and $\hat{\mathbf{p}}_{k|k-1}$ is the predicted position. The gradual nature of the attack means that the innovation sequence exhibits slow drifts rather than abrupt changes, making detection challenging for conventional methods like innovation tests or residual monitoring. This underscores the need for advanced machine learning techniques that can capture these subtle patterns in Unmanned Aerial Vehicle systems.
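The ramp model and the resulting position innovation can be sketched numerically. This is a minimal illustration rather than the simulation code used in the experiments: the function names, the clamping of the ramp outside $[\tau, \tau + T_s]$, and the perfect-prediction stand-in for the EKF are all assumptions made for the example.

```python
import numpy as np

def ramp_offset(t, tau, T_s, s_total):
    """Ramp spoofing offset s(t) = (t - tau) / T_s * s_total.

    Assumption: the offset is zero before tau and held at s_total after
    tau + T_s; the text only specifies the ramp itself.
    """
    frac = np.clip((np.asarray(t, dtype=float) - tau) / T_s, 0.0, 1.0)
    return frac[..., None] * np.asarray(s_total, dtype=float)

def position_innovation(p_true, s, p_pred):
    """r_pos = p_gnss_true + s(t) - p_pred."""
    return p_true + s - p_pred

# Hypothetical scenario: a 4 m offset on the first axis reached over 40 s,
# i.e. a drift of 0.1 m/s, sampled once per second.
t = np.arange(0.0, 60.0, 1.0)
s = ramp_offset(t, tau=10.0, T_s=40.0, s_total=[4.0, 0.0])

# Stand-in for the EKF (assumption): a prediction that tracks the true
# position exactly, so the innovation reduces to the injected offset.
p_true = np.zeros((t.size, 2))
r_pos = position_innovation(p_true, s, p_pred=p_true)

print(r_pos[[0, 30, 59]])
```

In a real filter the prediction is itself pulled toward the spoofed measurements, so the observed innovation is typically smaller than $\mathbf{s}(t)$, which is part of what makes slow ramps hard to detect.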
Proposed GCA-LSTM-CNN Detection Framework
Our proposed framework leverages a dual-branch architecture to extract complementary features from preprocessed innovation data. The overall workflow involves data preprocessing, feature extraction via LSTM and CNN, adaptive fusion using gated cross-attention, and classification through fully connected layers. We describe each component in detail below.
Data Preprocessing
To construct input samples, we first compute the innovation sequence $\mathbf{r}_k \in \mathbb{R}^{5}$ for each time step, including velocity and position innovations in the North-East-Down (NED) frame. We enhance this with statistical features by computing sliding window means and covariances, padded with zeros where necessary, resulting in a 15-dimensional vector $\mathbf{s}_k \in \mathbb{R}^{15}$. Additionally, we calculate the Mahalanobis distance $\mathbf{d}_k$ between the current innovation and the distribution of normal data:
$$ \mathbf{d}_k = \sqrt{(\mathbf{r}_k - \mu_n)^T \Sigma_n^{-1} (\mathbf{r}_k - \mu_n)} $$
where $\mu_n$ and $\Sigma_n$ are the mean and covariance of normal innovations. The data is then normalized using Z-score normalization and segmented into overlapping sliding windows of size $m$ to form input samples $\mathbf{Z}_k \in \mathbb{R}^{m \times 15}$ for the LSTM branch and $\mathbf{D}_k \in \mathbb{R}^{m}$ for the CNN branch.
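The preprocessing steps above can be sketched with NumPy. The helper names, the synthetic data, the window length of 20, and the choice to compute Z-score statistics over the whole sequence are assumptions for illustration; the construction of the 15-dimensional statistical feature vector is not reproduced here.

```python
import numpy as np

def mahalanobis_distance(r, mu_n, sigma_n):
    """d_k = sqrt((r_k - mu_n)^T Sigma_n^{-1} (r_k - mu_n)) for each row of r."""
    diff = np.atleast_2d(r) - mu_n
    sigma_inv = np.linalg.inv(sigma_n)
    return np.sqrt(np.einsum("ki,ij,kj->k", diff, sigma_inv, diff))

def zscore(x, mean, std, eps=1e-8):
    """Z-score normalization; eps guards against a zero standard deviation."""
    return (x - mean) / (std + eps)

def sliding_windows(x, m, stride=1):
    """Stack overlapping windows of length m along a new leading axis."""
    starts = range(0, x.shape[0] - m + 1, stride)
    return np.stack([x[i:i + m] for i in starts])

rng = np.random.default_rng(0)

# Synthetic stand-in for attack-free innovations (5 channels).
normal_r = rng.normal(size=(500, 5))
mu_n = normal_r.mean(axis=0)
sigma_n = np.cov(normal_r, rowvar=False)

# Distance sequence for a new stretch of data, normalized and windowed
# into samples for the CNN branch.
r = rng.normal(size=(100, 5))
d = mahalanobis_distance(r, mu_n, sigma_n)
d_norm = zscore(d, d.mean(), d.std())
D = sliding_windows(d_norm, m=20)

print(D.shape)
```

The same `sliding_windows` helper applied to an array of shape `(n, 15)` yields samples of shape `(m, 15)`, matching the LSTM branch input.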
LSTM-based Innovation Sequence Feature Extraction
The LSTM branch processes the windowed innovation sequence $\mathbf{Z}_k$ to capture temporal dependencies. We employ a two-layer LSTM network with dropout and batch normalization to prevent overfitting and accelerate convergence. The LSTM equations for each time step are:
$$ \begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i) \\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f) \\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o) \\
\tilde{\mathbf{C}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c) \\
\mathbf{C}_t &= \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t \\
\mathbf{H}_t &= \mathbf{O}_t \odot \tanh(\mathbf{C}_t)
\end{aligned} $$
where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $\mathbf{W}$ and $\mathbf{b}$ are learnable parameters. The output from the second LSTM layer is passed through batch normalization and a linear layer to produce features $\mathbf{X} \in \mathbb{R}^{B \times d_L}$, where $B$ is the batch size and $d_L$ is the feature dimension.
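To make the gate equations concrete, here is a single-layer LSTM step written directly from them in NumPy. It is a sketch only: the sizes are hypothetical, the weights are random rather than trained, and dropout, batch normalization, the second LSTM layer, and the final linear layer are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, W_x, W_h, b):
    """One LSTM time step; W_x, W_h, b are dicts keyed by gate: i, f, o, c."""
    I = sigmoid(X_t @ W_x["i"] + H_prev @ W_h["i"] + b["i"])   # input gate
    F = sigmoid(X_t @ W_x["f"] + H_prev @ W_h["f"] + b["f"])   # forget gate
    O = sigmoid(X_t @ W_x["o"] + H_prev @ W_h["o"] + b["o"])   # output gate
    C_tilde = np.tanh(X_t @ W_x["c"] + H_prev @ W_h["c"] + b["c"])
    C = F * C_prev + I * C_tilde          # cell state update
    H = O * np.tanh(C)                    # hidden state
    return H, C

rng = np.random.default_rng(0)
B, m, n_in, n_hidden = 4, 20, 15, 8       # hypothetical sizes

W_x = {g: rng.normal(scale=0.1, size=(n_in, n_hidden)) for g in "ifoc"}
W_h = {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for g in "ifoc"}
b = {g: np.zeros(n_hidden) for g in "ifoc"}

# Run one window of shape (B, m, 15) through the cell, step by step.
Z = rng.normal(size=(B, m, n_in))
H = np.zeros((B, n_hidden))
C = np.zeros((B, n_hidden))
for t in range(m):
    H, C = lstm_step(Z[:, t, :], H, C, W_x, W_h, b)

print(H.shape)
```

Because the cell state is carried forward through the forget gate at every step, a small but persistent bias in the innovations can accumulate in $\mathbf{C}_t$ over the window, which is the property the LSTM branch relies on.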
CNN-based Mahalanobis Distance Feature Extraction
In parallel, the CNN branch processes the Mahalanobis distance sequence $\mathbf{D}_k$ to detect local anomalies. We use two 1D convolutional layers with ReLU activation and batch normalization:
$$ \begin{aligned}
\mathbf{C}_1 &= \text{ReLU}(\text{BN}(\mathbf{D}_{\text{in}} * \mathbf{W}_1 + \mathbf{b}_1)) \\
\mathbf{C}_2 &= \text{ReLU}(\text{BN}(\mathbf{C}_1 * \mathbf{W}_2 + \mathbf{b}_2))
\end{aligned} $$
where $*$ denotes convolution, and $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, $\mathbf{b}_2$ are convolutional filters and biases. The features are then pooled and flattened, followed by a fully connected layer to yield $\mathbf{Y} \in \mathbb{R}^{B \times d_C}$.
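As a toy illustration of how a 1D convolution responds to a local change in the distance sequence, the sketch below applies one hand-picked kernel followed by ReLU and max pooling. The kernel, the input values, and the pooling size are assumptions chosen for readability; batch normalization and learned multi-channel filters are omitted.

```python
import numpy as np

def conv1d_relu(d, w, b=0.0):
    """Single-channel 1D convolution followed by ReLU.

    As in most deep learning libraries, the operation is implemented as a
    cross-correlation (the kernel is not flipped).
    """
    return np.maximum(np.correlate(d, w, mode="valid") + b, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling; samples that do not fill a window are dropped."""
    n = (x.size // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

# Hypothetical distance window with a step change at index 3.
d_window = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0])

# Hand-picked difference kernel (not learned): responds to upward jumps.
edge_kernel = np.array([-1.0, 1.0])

c1 = conv1d_relu(d_window, edge_kernel)
pooled = max_pool1d(c1, size=2)

print(c1, pooled)
```

The activation is nonzero only where the distance jumps, which is the kind of localized deviation the CNN branch is meant to pick up; a trained network would learn its own filters.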
Gated Cross-Attention Fusion Mechanism
To effectively fuse the LSTM and CNN features, we propose a gated cross-attention mechanism. First, the CNN features $\mathbf{Y}$ are aligned to the LSTM feature space using a 1D convolution:
$$ \mathbf{Y}' = \text{ReLU}(\text{BN}(\mathbf{Y} * \mathbf{W}_a + \mathbf{b}_a)) $$
Then, cross-attention is computed where the LSTM features $\mathbf{X}$ serve as queries, and the aligned CNN features $\mathbf{Y}'$ provide keys and values:
$$ \begin{aligned}
\mathbf{Q} &= \mathbf{X} \mathbf{W}_q, \quad \mathbf{K} = \text{Conv1d}(\mathbf{Y}'), \quad \mathbf{V} = \text{Conv1d}(\mathbf{Y}') \\
\mathbf{A} &= \text{Softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_C}}\right), \quad \tilde{\mathbf{Y}} = \mathbf{A} \mathbf{V}
\end{aligned} $$
The attention output $\tilde{\mathbf{Y}}$ is combined with the original features using a gating mechanism. Global average pooling is applied to both $\mathbf{X}$ and $\tilde{\mathbf{Y}}$ to obtain context vectors, which are concatenated and passed through a linear layer with Softmax to generate gating coefficients $\alpha$ and $\beta$:
$$ \begin{aligned}
\mathbf{z} &= [\text{Pool}(\mathbf{X}); \text{Pool}(\tilde{\mathbf{Y}})] \\
[\alpha, \beta] &= \text{Softmax}(\mathbf{W}_g \mathbf{z} + \mathbf{b}_g)
\end{aligned} $$
The features are weighted as $\mathbf{X}' = \alpha \mathbf{X}$ and $\tilde{\mathbf{Y}}' = \beta \tilde{\mathbf{Y}}$, and then fused with a residual connection:
$$ \mathbf{F} = \mathbf{X}' + \tilde{\mathbf{Y}}' \mathbf{W}_r $$
where $\mathbf{W}_r$ is a projection matrix. This adaptive fusion allows the model to emphasize the most relevant features for each input sample.
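The fusion steps can be traced end to end in a small NumPy sketch for a single sample. Several simplifications are assumed here and are not taken from the paper: both branches are treated as short token sequences, the alignment step and the `Conv1d` projections for keys and values are replaced by plain matrix products (equivalent to kernel size 1), the attention scores are scaled by the key dimension, and all weights are random.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(X, Y, W_q, W_k, W_v, W_g, b_g, W_r):
    """X: (T_x, d_L) LSTM-branch tokens, Y: (T_y, d_C) CNN-branch tokens."""
    Q = X @ W_q                               # queries from the LSTM branch
    K = Y @ W_k                               # keys from the CNN branch
    V = Y @ W_v                               # values from the CNN branch
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    Y_att = A @ V                             # attended CNN features, (T_x, d)

    # Gate: pool both feature sets, concatenate, map to two coefficients.
    z = np.concatenate([X.mean(axis=0), Y_att.mean(axis=0)])
    alpha, beta = softmax(W_g @ z + b_g)

    # Weighted fusion: F = alpha * X + beta * Y_att @ W_r.
    F = alpha * X + beta * (Y_att @ W_r)
    return F, A, (alpha, beta)

rng = np.random.default_rng(0)
T_x, T_y, d_L, d_C, d = 6, 5, 8, 4, 8         # hypothetical sizes
X = rng.normal(size=(T_x, d_L))
Y = rng.normal(size=(T_y, d_C))
params = dict(
    W_q=rng.normal(size=(d_L, d)),
    W_k=rng.normal(size=(d_C, d)),
    W_v=rng.normal(size=(d_C, d)),
    W_g=rng.normal(size=(2, d_L + d)),
    b_g=np.zeros(2),
    W_r=rng.normal(size=(d, d_L)),
)

F, A, (alpha, beta) = gated_cross_attention(X, Y, **params)
print(F.shape, alpha + beta)
```

Because $\alpha$ and $\beta$ come from a Softmax over two logits, they always sum to one, so the gate trades off the two branches per sample rather than scaling them independently.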
Classification and Loss Function
The fused features $\mathbf{F}$ are passed through a fully connected network with two hidden layers and ReLU activation, followed by a Softmax output layer for classification into four classes: normal, latitude attack, longitude attack, and dual-channel attack. To address class imbalance, we use a weighted label smoothing loss:
$$ \mathcal{L} = -\frac{1}{B} \sum_{j=1}^{B} \sum_{k=1}^{K} \omega_{jk} u_{jk} \log(p_{jk}) $$
where $\omega_{jk}$ is the class weight, $u_{jk}$ is the smoothed label, and $p_{jk}$ is the predicted probability. The model is optimized using Adam with a OneCycleLR scheduler for efficient training.
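A direct NumPy transcription of this loss is given below as a sketch. Two details are assumptions, since the text does not spell them out: the smoothed label is taken as $u_{jk} = (1 - \epsilon)\,\mathbb{1}[k = y_j] + \epsilon / K$, and the weight $\omega_{jk}$ is taken to depend only on the class index $k$. The probabilities and class weights in the example are made up.

```python
import numpy as np

def weighted_label_smoothing_loss(p, y, class_weights, eps=0.1):
    """L = -(1/B) * sum_j sum_k omega_jk * u_jk * log(p_jk).

    p: (B, K) predicted probabilities, y: (B,) integer labels.
    Assumptions: u = (1 - eps) * one_hot + eps / K, and omega_jk = w_k.
    """
    B, K = p.shape
    u = np.full((B, K), eps / K)
    u[np.arange(B), y] += 1.0 - eps
    omega = np.broadcast_to(np.asarray(class_weights, dtype=float), (B, K))
    return -np.sum(omega * u * np.log(p)) / B

# Hypothetical batch: 2 samples, 4 classes
# (normal, latitude, longitude, dual-channel).
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.05, 0.05, 0.80, 0.10]])
y = np.array([0, 2])

# Hypothetical weights that down-weight the majority (normal) class.
w = np.array([0.5, 1.5, 1.5, 1.5])

print(weighted_label_smoothing_loss(p, y, w, eps=0.1))
```

With `eps=0` and unit weights the function reduces to the ordinary cross-entropy, which is a convenient sanity check when experimenting with the smoothing factor.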
Experimental Evaluation
We evaluated our proposed GCA-LSTM-CNN framework using simulations in ROS/Gazebo with a PX4-controlled quadrotor UAV. The UAV was subjected to ramp-style spoofing attacks on latitude, longitude, and both channels simultaneously, with a constant velocity offset of 0.1 m/s. Data from 40 flight trajectories (approximately 480,000 points) were collected and split 4:1 for training and testing. We compared our model against baseline methods including SVM with RBF kernel, Fully Connected Network (FCN), and a serial CNN-LSTM model. Performance metrics included accuracy, precision, recall, and F1-score.
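For reference, the per-class metrics named above can be computed from a confusion matrix as sketched below. The convention that rows are true classes and columns are predicted classes is an assumption, and the counts are invented for illustration; they are not the experimental results.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, F1 per class and overall accuracy.

    Assumption: cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)       # TP / (TP + FP)
    recall = tp / cm.sum(axis=1)          # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Invented 4-class confusion matrix
# (normal, latitude, longitude, dual-channel).
cm = np.array([
    [95,  2,  2,  1],
    [ 1, 48,  0,  1],
    [ 3,  0, 46,  1],
    [ 0,  1,  1, 48],
])

precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(recall, 4), round(accuracy, 4))
```

For a spoofing detector, recall on the attack classes is the figure to watch, since a missed attack (false negative) is generally costlier than a false alarm.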
Results and Analysis
The proposed GCA-LSTM-CNN model achieved an overall accuracy of 98.83%, outperforming all baselines. The following table summarizes the detection performance across different attack types:
| Model | Attack Type | Precision | Recall | F1-Score | Overall Accuracy |
|---|---|---|---|---|---|
| SVM | Normal | 0.9398 | 0.9927 | 0.9655 | 0.9699 |
| SVM | Latitude Attack | 0.9855 | 0.9827 | 0.9841 | |
| SVM | Longitude Attack | 0.9828 | 0.9296 | 0.9555 | |
| SVM | Dual-Channel Attack | 0.9950 | 0.9630 | 0.9788 | |
| FCN | Normal | 0.9257 | 0.9963 | 0.9597 | 0.9675 |
| FCN | Latitude Attack | 0.9985 | 0.9827 | 0.9905 | |
| FCN | Longitude Attack | 0.9811 | 0.9134 | 0.9460 | |
| FCN | Dual-Channel Attack | 0.9833 | 0.9646 | 0.9812 | |
| CNN-LSTM | Normal | 0.9482 | 0.9917 | 0.9695 | 0.9748 |
| CNN-LSTM | Latitude Attack | 0.9985 | 0.9827 | 0.9905 | |
| CNN-LSTM | Longitude Attack | 0.9750 | 0.9499 | 0.9623 | |
| CNN-LSTM | Dual-Channel Attack | 0.9983 | 0.9662 | 0.9820 | |
| GCA-LSTM-CNN | Normal | 0.9723 | 1.0000 | 0.9860 | 0.9883 |
| GCA-LSTM-CNN | Latitude Attack | 1.0000 | 0.9829 | 0.9914 | |
| GCA-LSTM-CNN | Longitude Attack | 0.9942 | 0.9801 | 0.9871 | |
| GCA-LSTM-CNN | Dual-Channel Attack | 0.9968 | 0.9840 | 0.9904 | |
Our model improved recall on the attack classes, with the largest gain on the longitude attack (0.9801 versus 0.9499 for CNN-LSTM, the strongest baseline, a difference of about 3.0 percentage points), which corresponds to fewer false negatives. The gated fusion mechanism effectively balanced the contributions of temporal and statistical features, as evidenced by the high F1-scores across all classes. Ablation studies confirmed that both the weighted label smoothing loss and the GCA module contributed positively to performance, with the combined approach yielding the best results.
Furthermore, we tested the model’s robustness under different attack rates, including 0.4 m/s, 0.2 m/s, and 0.05 m/s. The results, summarized below, show that the GCA-LSTM-CNN model maintains high accuracy even under extremely slow attacks, highlighting its suitability for real-world Unmanned Aerial Vehicle applications:
| Attack Rate (m/s) | Accuracy | F1-Score |
|---|---|---|
| 0.4 | 0.9921 | 0.9920 |
| 0.2 | 0.9913 | 0.9914 |
| 0.1 | 0.9883 | 0.9887 |
| 0.05 | 0.9758 | 0.9756 |
Dynamic detection on continuous sequences revealed that the model successfully identified all attack types with reasonable delays (e.g., 3-7 time steps), demonstrating its practical utility for real-time monitoring in UAV systems.
Conclusion
In this paper, we presented a novel GCA-LSTM-CNN framework for detecting ramp-style slowly varying GNSS spoofing attacks in Unmanned Aerial Vehicles. By integrating LSTM-based temporal analysis and CNN-based local anomaly detection through an adaptive gated cross-attention mechanism, our model achieves high detection accuracy and robustness. Experimental results confirm its superiority over the baseline methods, with significant reductions in false negatives and improved performance across various attack scenarios. Future work will focus on extending the model to handle diverse attack patterns and optimizing it for resource-constrained embedded systems on UAV platforms, further enhancing the security and reliability of autonomous UAV operations.
