A Novel UAV Trajectory Prediction Model Integrating Channel-Temporal and Cross-Attention Mechanisms

Trajectory prediction is a cornerstone technology for Unmanned Aerial Vehicle (UAV) Traffic Management (UTM) systems, enabling critical functions such as real-time conflict detection and proactive warning of abnormal behavior. As the density of UAV operations in low-altitude airspace continues to rise, ensuring safe and efficient coexistence with manned aviation and other airspace users becomes paramount. Accurate and reliable short-term trajectory prediction is therefore essential for next-generation UTM frameworks. However, predicting the trajectory of a UAV is inherently challenging because of its high maneuverability, its sensitivity to external disturbances such as wind, and its often unpredictable flight intent, all of which lead to highly non-linear and irregular flight paths.

Traditional mathematical modeling approaches struggle to capture these complex dynamics. Consequently, data-driven deep learning methods have gained significant traction. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks and their bidirectional variants, have been widely adopted for this task. These models process sequential data step by step, learning temporal dependencies from historical UAV flight data. While these methods have shown promise, they have inherent limitations. Their sequential computation hinders parallel processing and limits training efficiency. More critically, they typically process all input feature channels (e.g., position, velocity, attitude) and all historical time steps uniformly, so they cannot dynamically weigh how much each feature and each key historical moment contributes to the future trajectory. This shortcoming is particularly acute for the highly dynamic and irregular flight patterns typical of UAVs.

To overcome these challenges, this paper introduces a novel trajectory prediction model named TCTC-Net. The core innovation lies in integrating a Temporal Convolutional Network (TCN) backbone with a dedicated Channel-Temporal Attention (CTA) module and a Cross-Attention (CA) module for feature fusion. The TCN serves as a parallelizable feature extractor that avoids the sequential bottleneck of RNNs. The proposed CTA module dynamically highlights important feature channels and crucial historical time steps. Finally, the CA module performs interactive fusion between the low-level features from the TCN and the semantically enriched features from the CTA module. Experiments on a simulated UAV flight dataset show that TCTC-Net outperforms four RNN-based baselines at every tested prediction horizon.

1. The TCTC-Net Architecture

The overall architecture of the proposed TCTC-Net follows a three-stage design: foundational feature extraction, channel-temporal enhancement, and interactive feature fusion. It comprises three core modules that work in concert: the TCN Module, the Channel-Temporal Attention (CTA) Module, and the Cross-Attention (CA) Module. The workflow processes multidimensional UAV trajectory sequences to predict future positions.

1.1 TCN Module for Foundational Feature Extraction

The input to the model is a historical trajectory sequence. We employ the TCN module as the primary feature extractor. A TCN uses causal dilated convolutions, ensuring that the output at time t depends only on inputs up to time t. This architecture provides a large receptive field to capture long-range dependencies while retaining the parallel computation advantages of standard CNNs, which makes it efficient for processing UAV time-series data. The core component is a residual block. The output of a dilated causal convolution for a single layer can be expressed as:

$$ y_t^j = \sum_{c=0}^{K-1} \sum_{i=0}^{C_{in}-1} W_{c,i}^{j} \cdot x_{t - d \cdot c}^i $$

where $ y_t^j $ is the value of the $ j $-th output channel at time step $ t $, $ K $ is the kernel size, $ C_{in} $ is the number of input channels, $ W_{c,i}^{j} $ is the weight parameter from input channel $ i $ to output channel $ j $ at kernel position $ c $, $ x_{t - d \cdot c}^i $ is the input value at time step $ t - d \cdot c $ for channel $ i $, and $ d $ is the dilation rate. The TCN module in our model consists of two such TCN layers stacked sequentially to extract robust primary features $ F_{tcn} \in \mathbb{R}^{B \times T \times C} $, where $ B $ is batch size, $ T $ is sequence length, and $ C $ is the number of feature channels.
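As a minimal illustration of the single-layer formula above, the following pure-Python sketch evaluates a dilated causal convolution on a toy sequence. The function name `dilated_causal_conv`, the zero padding for time steps before the start of the sequence, and the toy weights are assumptions made for illustration; the residual connections, activations, and dropout of a full TCN block are not included.

```python
def dilated_causal_conv(x, weights, dilation):
    """Evaluate y_t^j = sum_c sum_i W[c][i][j] * x[t - d*c][i].

    x:       list of T time steps, each a list of C_in values.
    weights: weights[c][i][j] for kernel position c, input channel i,
             output channel j (hypothetical layout chosen for this sketch).
    Time steps before the start of the sequence are treated as zeros.
    """
    T = len(x)
    K = len(weights)
    c_in = len(weights[0])
    c_out = len(weights[0][0])
    y = [[0.0] * c_out for _ in range(T)]
    for t in range(T):
        for c in range(K):
            src = t - dilation * c
            if src < 0:
                # Left zero padding: nothing from the future is ever read.
                continue
            for i in range(c_in):
                for j in range(c_out):
                    y[t][j] += weights[c][i][j] * x[src][i]
    return y


# Toy case: one input channel, one output channel, K = 2, d = 2, all weights 1,
# so each output is x_t + x_{t-2}.
x = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
w = [[[1.0]], [[1.0]]]
y = dilated_causal_conv(x, w, dilation=2)
print(y)  # [[1.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
```

Because the loop only reads indices $ t - d \cdot c \le t $, changing a later input never alters an earlier output, which is the causality property described above.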

1.2 Channel-Temporal Attention (CTA) Module

The CTA module is designed to address the uniform processing limitation of standard models. It consists of two synergistic sub-modules: the Frequency Enhanced Channel Attention (FECA) and the Multi-Head Attention (MHA).

Frequency Enhanced Channel Attention (FECA): Different UAV features (e.g., 3D coordinates, roll/pitch/yaw angles, wind speed components) contribute unequally to future motion. FECA aims to model these channel-wise dependencies dynamically in the frequency domain, which can reveal global characteristics often obscured in the time domain. It takes the TCN output $ F_{tcn} $ as input. For each feature channel $ c $, a 1D Discrete Cosine Transform (DCT) is applied:

$$ X_c(k) = \sum_{t=0}^{T-1} x_c(t) \cdot \cos\left[\frac{\pi}{T}\left(t+\frac{1}{2}\right)k\right] $$

Where $ X_c(k) $ is the $ k $-th frequency component for channel $ c $, and $ x_c(t) $ is the time-domain input for channel $ c $. A global frequency pooling is then performed on the transformed representation. Subsequently, channel-wise attention weights $ \omega_c $ are generated via a small neural network with a bottleneck structure, typically involving two fully-connected layers with a ReLU activation in between and a Sigmoid output:

$$ \omega = \sigma(W_2 \cdot \delta(W_1 \cdot \text{freq\_pool}(X))) $$

Here, $ \sigma $ is the Sigmoid function, $ \delta $ is the ReLU function, $ W_1 $ and $ W_2 $ are weight matrices, and $ \text{freq\_pool} $ denotes the frequency pooling operation. The frequency-domain representation is then enhanced by these weights and transformed back to the time domain via the Inverse DCT (IDCT):

$$ x'_c(t) = \sum_{k=0}^{T-1} (\omega_c \cdot X_c(k)) \cdot \cos\left[\frac{\pi}{T}\left(t+\frac{1}{2}\right)k\right] $$

The output of this process is the channel-enhanced feature $ F_{feca} $.
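The FECA steps (DCT, frequency pooling, bottleneck weighting, inverse DCT) can be sketched in pure Python as follows. Several details are assumptions, since the text does not specify them: frequency pooling is taken to be the mean absolute DCT coefficient per channel, the weight matrices `W1` and `W2` are hand-picked toy values, and the inverse transform includes the $ 1/T $ and $ 2/T $ scaling factors needed to invert the unnormalized forward DCT exactly.

```python
import math


def dct(x):
    """Unnormalized DCT-II of one channel, matching the forward formula."""
    T = len(x)
    return [sum(x[t] * math.cos(math.pi / T * (t + 0.5) * k) for t in range(T))
            for k in range(T)]


def idct(X):
    """Exact inverse of dct(): a DCT-III with 1/T and 2/T scaling."""
    T = len(X)
    return [X[0] / T + 2.0 / T * sum(X[k] * math.cos(math.pi / T * (t + 0.5) * k)
                                     for k in range(1, T))
            for t in range(T)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def feca(channels, W1, W2):
    """channels: C time series of length T. W1: (C/r) x C, W2: C x (C/r)."""
    spectra = [dct(ch) for ch in channels]
    # Assumed pooling: mean magnitude of the frequency coefficients.
    pooled = [sum(abs(v) for v in s) / len(s) for s in spectra]
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in W1]
    omega = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W2]
    enhanced = [idct([om * v for v in s]) for om, s in zip(omega, spectra)]
    return omega, enhanced


channels = [[0.0, 1.0, 2.0, 3.0], [1.0, -1.0, 1.0, -1.0]]  # C = 2, T = 4
W1 = [[0.5, 0.5]]      # toy bottleneck: 2 channels -> 1 hidden unit
W2 = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel weights
omega, enhanced = feca(channels, W1, W2)
print(omega)
print(enhanced)
```

One property worth noting in this sketch: because $ \omega_c $ is a single scalar per channel and the DCT is linear, scaling the spectrum by $ \omega_c $ and inverting is equivalent to scaling the time-domain channel by $ \omega_c $.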

Multi-Head Attention (MHA) for Temporal Modeling: To capture long-range dependencies and identify key historical moments (e.g., the initiation of a turn or a sudden acceleration), we employ a standard Multi-Head Attention mechanism on the time dimension. It treats the sequence of features across time steps as tokens. The primary feature $ F_{tcn} $ is projected into Query (Q), Key (K), and Value (V) matrices. The scaled dot-product attention for head $ h $ is computed as:

$$ \text{Attention}(Q_h, K_h, V_h) = \text{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right) V_h $$

where $ d_k $ is the dimensionality of the key vectors. The outputs of all $ H $ heads are concatenated and linearly projected to form the temporally-enhanced feature $ F_{mha} $:

$$ \text{MHA}(F) = \text{Concat}(\text{head}_1, \dots, \text{head}_H) W^O $$
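The head-splitting and concatenation logic can be illustrated with a small pure-Python sketch that treats each time step as a token. To keep it short, the learned projections for Q, K, and V and the output projection $ W^O $ are replaced by identities, which is an assumption of this sketch rather than part of the model; the function names are likewise illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over T tokens (one token per time step)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(weights[s] * V[s][j] for s in range(len(V)))
                    for j in range(len(V[0]))])
    return out


def multi_head(Q, K, V, num_heads):
    """Split channels into heads, attend per head, then concatenate.

    The output projection W^O is taken as the identity in this sketch.
    """
    C = len(Q[0])
    if C % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    d = C // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * d, (h + 1) * d)
        heads.append(attention([q[sl] for q in Q],
                               [k[sl] for k in K],
                               [v[sl] for v in V]))
    return [[val for head in heads for val in head[t]] for t in range(len(Q))]


# Toy feature map: T = 3 time steps, C = 4 channels, 2 heads of width 2.
F = [[1.0, 0.0, 0.5, 0.2],
     [0.0, 1.0, 0.1, 0.9],
     [1.0, 1.0, 0.3, 0.4]]
F_mha = multi_head(F, F, F, num_heads=2)
print(F_mha)
```

Each output row is a weighted average of the value rows, so time steps that score highly against the query contribute more to the result.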

The final output of the CTA module is the element-wise sum of the channel-enhanced and temporally-enhanced features, producing a unified representation that emphasizes both important features and critical time steps:

$$ F_{cta} = F_{feca} + F_{mha} $$

1.3 Cross-Attention (CA) Module for Feature Fusion

The TCN module extracts primary features $ F_{tcn} $ with rich local temporal patterns, while the CTA module produces semantically higher-level features $ F_{cta} $ with global channel and temporal awareness. To dynamically integrate information from these two distinct yet complementary feature spaces, we introduce a Cross-Attention module. In this setup, the primary features $ F_{tcn} $ serve as the Query (Q), and the channel-temporal features $ F_{cta} $ serve as the Key (K) and Value (V). This allows the model to selectively retrieve and combine relevant information from the enhanced global context ($ F_{cta} $) based on the local context provided by the primary features ($ F_{tcn} $). The computation follows the standard attention formula:

$$ F_{ca} = \text{softmax}\left(\frac{Q (K)^T}{\sqrt{d_k}}\right) V, \quad \text{where } Q = F_{tcn}W^Q, \quad K=F_{cta}W^K, \quad V=F_{cta}W^V $$

The output $ F_{ca} $ is a deeply fused feature representation that captures complex interdependencies. This representation is then passed through a final fully-connected layer to generate the predicted trajectory points for future time steps.
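A minimal pure-Python sketch of the cross-attention formula is given below, with queries taken from one feature map and keys and values from another. The projection matrices are set to the identity and the feature values are toy numbers; both are assumptions for illustration. The final fully-connected prediction layer is not included.

```python
import math


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(F_tcn, F_cta, Wq, Wk, Wv):
    """Queries come from F_tcn; keys and values come from F_cta."""
    Q = matmul(F_tcn, Wq)
    K = matmul(F_cta, Wk)
    V = matmul(F_cta, Wv)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # T x T similarity between query and key steps
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


identity = [[1.0, 0.0], [0.0, 1.0]]            # toy projections (assumption)
F_tcn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3, C = 2
F_cta = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
F_ca, attn = cross_attention(F_tcn, F_cta, identity, identity, identity)
print(F_ca)
print(attn)
```

Row $ t $ of `attn` shows how strongly the primary feature at time step $ t $ draws on each time step of the enhanced features, and each such row sums to 1.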

2. UAV Data Collection and Preprocessing

Acquiring large-scale, high-quality real-world UAV trajectory data with controlled environmental variables is costly and risky. To build a comprehensive dataset, we developed a simulation platform using AirSim and Unreal Engine 4. A FrSky Taranis X9D Plus radio controller was used to pilot the virtual drone, capturing realistic flight maneuvers including take-off, landing, straight flight, sharp turns, curved paths, and hovering. To enhance model generalization, we simulated four common commercial UAV models (DJI Mavic Air, Mavic 3 Pro, Phantom 4 Pro, Inspire 3) with differing performance parameters, reflecting the diversity of aircraft in low-altitude airspace. 3D coordinates, attitude angles (roll, pitch, yaw), and simulated wind speed and direction were logged at 1-second intervals.

The raw trajectory data was segmented using a sliding time window. For each sample, the past 10 time steps were used as model input to predict the next 1, 3, 5, 7, and 9 time steps. To focus on relative motion patterns rather than absolute positions, coordinates were transformed to be relative to the first point in each input sequence. The data was then normalized using Min-Max scaling:

$$ x_{norm} = \frac{x_i - x_{min}}{x_{max} - x_{min}} $$
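The windowing, relative-coordinate, and scaling steps can be sketched in pure Python as below. The helper names are illustrative, and two details are assumptions not stated in the text: the target points are expressed relative to the same first input point as the inputs, and the scaling bounds are passed in by the caller rather than derived here.

```python
def make_windows(track, n_in=10, n_out=3):
    """Slide a window over one trajectory of (x, y, z) points.

    Returns (inputs, targets) pairs whose coordinates are relative to the
    first point of the input window.
    """
    samples = []
    for start in range(len(track) - n_in - n_out + 1):
        origin = track[start]
        window = track[start:start + n_in + n_out]
        rel = [tuple(p[a] - origin[a] for a in range(3)) for p in window]
        samples.append((rel[:n_in], rel[n_in:]))
    return samples


def min_max_scale(values, lo, hi):
    """Apply x_norm = (x - x_min) / (x_max - x_min) with given bounds."""
    return [(v - lo) / (hi - lo) for v in values]


# Toy trajectory: 15 points on a straight line, starting away from the origin.
track = [(100.0 + t, 50.0 + 2.0 * t, 5.0) for t in range(15)]
samples = make_windows(track, n_in=10, n_out=3)
print(len(samples))                 # 3 windows fit into 15 points
print(samples[0][1][0])             # first target of the first window
print(min_max_scale([-2.0, 0.0, 2.0], lo=-2.0, hi=2.0))  # [0.0, 0.5, 1.0]
```

Because every window is re-centered on its own first point, two windows taken from different places along the same straight segment yield identical inputs, which is the intended focus on relative motion.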

The processed datasets were split into training, validation, and test sets in an 8:1:1 ratio. The details of the constructed datasets are summarized in Table 1.

Table 1: Details of the constructed UAV trajectory datasets.
Dataset (Prediction Steps)	Train Size	Val Size	Test Size	Normalization Range (X, Y, Z)
1-step	35,127	4,391	4,391	[-233.32,239.35], [-232.91,232.17], [-77.70,71.14]
3-step	35,044	4,381	4,381	[-280.08,285.79], [-279.84,278.57], [-93.25,85.24]
5-step	34,964	4,370	4,370	[-326.64,329.40], [-326.52,324.93], [-108.19,99.34]
7-step	34,886	4,361	4,361	[-372.98,371.07], [-372.29,371.27], [-118.58,113.13]
9-step	34,808	4,351	4,351	[-419.21,415.09], [-419.55,417.33], [-134.16,127.09]

3. Experimental Results and Analysis

All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3090 GPU. Models were implemented using PyTorch. We used Mean Absolute Error (MAE) as the training loss and evaluated performance using MAE and Root Mean Square Error (RMSE) on the test set.

$$ \text{MAE} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i| $$
$$ \text{RMSE} = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 } $$
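For reference, the two metrics can be computed directly from their definitions. The sketch below uses plain Python lists and toy values; it is not the evaluation code used in the experiments.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error over m paired values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error over m paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


y_true = [0.0, 0.5, 1.0, 0.25]
y_pred = [0.1, 0.5, 0.7, 0.25]
print(round(mae(y_true, y_pred), 4))   # 0.1
print(round(rmse(y_true, y_pred), 4))  # 0.1581
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are much larger than the rest, which is why the two metrics are reported together.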

3.1 Hyperparameter Optimization

Bayesian Optimization with a Gaussian Process surrogate model and Expected Improvement acquisition function was employed to efficiently search the hyperparameter space over 100 iterations for TCTC-Net and 50 for baseline models. The optimal hyperparameters are listed in Table 2.

Table 2: Optimal hyperparameters for TCTC-Net and baseline models.
Model	Hyperparameter	Optimal Value
TCTC-Net	TCN Layer 1 Channels (channel_1)	128
	TCN Layer 2 Channels (channel_2)	256
	TCN Kernel Size 1 (kernel_size_1)	2
	TCN Kernel Size 2 (kernel_size_2)	4
	FECA Reduction Ratio (reduction_ratio)	8
	MHA Number of Heads (num_heads)	16
	Dropout Rate	0.1
LSTM / GRU	Hidden Size	512
	Number of Layers	1
	Dropout Rate	0.1 / 0.2
BiLSTM / BiGRU	Hidden Size	512
	Number of Layers	1
	Dropout Rate	0.2 / 0.1
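To make the Expected Improvement criterion used in the hyperparameter search more concrete, the sketch below evaluates its standard closed form for a minimization objective, given a Gaussian Process posterior mean and standard deviation at a candidate point. The candidate names and their posterior values are hypothetical, and the Gaussian Process fit itself is not shown.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for minimization.

    mu, sigma: posterior mean and standard deviation at a candidate.
    best:      lowest objective value observed so far.
    xi:        optional exploration margin (assumed 0 here).
    """
    improvement = best - mu - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


best_so_far = 0.20  # hypothetical best validation loss so far
candidates = {      # hypothetical posterior (mean, std) per candidate setting
    "A": (0.19, 0.01),
    "B": (0.22, 0.05),
    "C": (0.25, 0.001),
}
scores = {name: expected_improvement(mu, sd, best_so_far)
          for name, (mu, sd) in candidates.items()}
chosen = max(scores, key=scores.get)
print(scores)
print(chosen)
```

In this toy case the candidate with a slightly worse predicted mean but much larger uncertainty scores highest, which illustrates how the criterion trades off exploitation against exploration.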

3.2 Comparative Experiments

We compared TCTC-Net against four prevalent RNN-based baseline models: LSTM, GRU, BiLSTM, and BiGRU. All models were trained and evaluated on the same datasets. The results, presented in Table 3, show that the proposed TCTC-Net achieves lower errors than all four baselines for UAV trajectory prediction.

Table 3: Performance comparison (MAE / RMSE) on UAV trajectory prediction tasks.
Model	Predict 1 Step	Predict 3 Steps	Predict 5 Steps	Predict 7 Steps	Predict 9 Steps
LSTM	0.00255 / 0.00393	0.00558 / 0.01002	0.00906 / 0.01669	0.02189 / 0.02354	0.01653 / 0.02932
GRU	0.00273 / 0.00421	0.00589 / 0.01065	0.00936 / 0.01725	0.01308 / 0.02393	0.01560 / 0.02852
BiLSTM	0.00221 / 0.00371	0.00581 / 0.01005	0.00926 / 0.01649	0.01320 / 0.02366	0.01649 / 0.02889
BiGRU	0.00201 / 0.00334	0.00499 / 0.00902	0.00814 / 0.01509	0.01202 / 0.02226	0.01558 / 0.02797
TCTC-Net	0.00158 / 0.00307	0.00420 / 0.00826	0.00748 / 0.01447	0.01130 / 0.02149	0.01458 / 0.02661

TCTC-Net achieves the lowest MAE and RMSE across all prediction horizons. Notably, for 1-step prediction, TCTC-Net reduces MAE by 21.3% and RMSE by 8.1% compared to the best baseline (BiGRU). The performance advantage persists for longer-term predictions, with MAE reductions of 6.4% and RMSE reductions of 4.8% for 9-step prediction, confirming the model’s robustness and stability.

3.3 Ablation Studies

Ablation studies were conducted to validate the contribution of each proposed module. We trained and tested five model variants: 1) TCN only; 2) TCN + FECA; 3) TCN + MHA; 4) TCN + FECA + MHA (i.e., without CA); 5) The full TCTC-Net (TCN + FECA + MHA + CA). The results are summarized in Table 4.

Table 4: Ablation study results (MAE / RMSE) for the UAV prediction model.
Model Configuration	Predict 1 Step	Predict 3 Steps	Predict 5 Steps	Predict 7 Steps	Predict 9 Steps
TCN only	0.00218 / 0.00384	0.00514 / 0.00952	0.00873 / 0.01595	0.01293 / 0.02312	0.01662 / 0.02900
TCN + FECA	0.00196 / 0.00325	0.00456 / 0.00863	0.00795 / 0.01496	0.01159 / 0.02166	0.01513 / 0.02739
TCN + MHA	0.00223 / 0.00409	0.00483 / 0.00917	0.00822 / 0.01541	0.01167 / 0.02179	0.01510 / 0.02730
TCN+FECA+MHA	0.00220 / 0.00403	0.00471 / 0.00900	0.00782 / 0.01484	0.01147 / 0.02160	0.01502 / 0.02711
Full TCTC-Net	0.00158 / 0.00307	0.00420 / 0.00826	0.00748 / 0.01447	0.01130 / 0.02149	0.01458 / 0.02661

The results show that FECA improves on the base TCN model at every horizon, with particularly strong RMSE gains in short-term prediction. MHA improves on the base TCN model from 3 steps onward but is slightly worse at 1 step (MAE 0.00223 vs. 0.00218), which indicates that temporal attention is more beneficial for longer horizons. Combining FECA and MHA without CA gives the best results among the partial variants at 5, 7, and 9 steps, although at 1 and 3 steps it does not match TCN + FECA. Introducing the Cross-Attention (CA) module for feature fusion yields the lowest MAE and RMSE at all prediction lengths, underscoring its critical role in integrating multi-level features for accurate UAV trajectory forecasting.

3.4 Generalization Experiments

To assess the generalization capability of TCTC-Net on unseen flight patterns, we selected four distinct UAV trajectory segments (with lengths of 61, 55, 59, and 67 time steps) featuring different motion characteristics (straight flight, gentle curves, sharp turns). We compared TCTC-Net against BiLSTM and BiGRU for 1-step-ahead prediction along each entire trajectory. We report two key metrics: Average Displacement Error (ADE), the average L2 distance between all predicted points and their corresponding ground truth points, and Maximum Displacement Error (MDE), the worst-case prediction error along the trajectory.
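The two metrics follow directly from their definitions, as the short pure-Python sketch below shows on toy points. The function name and the sample coordinates are illustrative only.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, MDE) for paired 3D points along one trajectory."""
    dists = [math.dist(p, q) for p, q in zip(pred, truth)]
    return sum(dists) / len(dists), max(dists)


truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0), (2.0, 0.0, 1.0)]
ade, mde = displacement_errors(pred, truth)
print(ade, mde)  # 2.0 5.0
```

ADE summarizes typical accuracy along the path, while MDE exposes the single worst point, which matters most for conflict detection.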

Table 5: Generalization performance on diverse UAV trajectories (ADE / MDE in meters).
Trajectory	BiLSTM	BiGRU	TCTC-Net
Trajectory 1 (Complex Maneuvers)	2.11 / 12.12	1.43 / 6.04	1.17 / 3.77
Trajectory 2 (Mixed Patterns)	1.13 / 3.47	1.07 / 3.04	0.74 / 2.77
Trajectory 3 (Curved Path)	1.48 / 6.68	1.11 / 3.12	0.78 / 2.48
Trajectory 4 (Straight with Turns)	1.32 / 3.45	1.18 / 3.24	0.70 / 1.86

TCTC-Net consistently achieves the lowest ADE and MDE on all four trajectories. On average, it reduces ADE by 42.8% compared to BiLSTM and by 29.9% compared to BiGRU. The reduction in MDE relative to BiLSTM is even more pronounced, averaging 49.5%; relative to BiGRU, the per-trajectory MDE values in Table 5 correspond to an average reduction of about 27.4%. This demonstrates the strong generalization ability of our model to diverse and unseen UAV flight behaviors, a crucial requirement for practical UTM applications.
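The average reductions can be recomputed from the entries of Table 5 with the short sketch below. It assumes that the average is the mean of the per-trajectory relative reductions, which the text does not state explicitly, and it works from the rounded table values, so its output may differ slightly from figures derived from unrounded errors.

```python
# (ADE, MDE) in meters for Trajectories 1-4, copied from Table 5.
TABLE5 = {
    "BiLSTM": [(2.11, 12.12), (1.13, 3.47), (1.48, 6.68), (1.32, 3.45)],
    "BiGRU": [(1.43, 6.04), (1.07, 3.04), (1.11, 3.12), (1.18, 3.24)],
    "TCTC-Net": [(1.17, 3.77), (0.74, 2.77), (0.78, 2.48), (0.70, 1.86)],
}
METRIC_INDEX = {"ADE": 0, "MDE": 1}


def mean_reduction(baseline, model, metric):
    """Mean of per-trajectory relative reductions, in percent."""
    idx = METRIC_INDEX[metric]
    reductions = [(b[idx] - m[idx]) / b[idx] * 100.0
                  for b, m in zip(TABLE5[baseline], TABLE5[model])]
    return sum(reductions) / len(reductions)


for metric in ("ADE", "MDE"):
    for baseline in ("BiLSTM", "BiGRU"):
        value = mean_reduction(baseline, "TCTC-Net", metric)
        print(f"{metric} reduction vs {baseline}: {value:.1f}%")
```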

4. Conclusion

This paper presented TCTC-Net, a novel deep learning model for accurate and robust short-term UAV trajectory prediction. To address the limitations of existing RNN-based methods, specifically poor parallelization and the inability to differentiate feature and temporal importance, we proposed a unified architecture combining Temporal Convolutional Networks (TCN), a dedicated Channel-Temporal Attention (CTA) module, and a Cross-Attention (CA) fusion module.

1) Superior Performance: Comprehensive comparative experiments demonstrated that TCTC-Net significantly outperforms state-of-the-art RNN baselines like BiGRU across multiple prediction horizons (1 to 9 steps). For instance, it achieved a 21.3% reduction in MAE for 1-step prediction and maintained a 6.4% reduction for 9-step prediction, highlighting its accuracy and stability.

2) Effective Module Design: Ablation studies confirmed the individual and combined contributions of the proposed modules. The Frequency Enhanced Channel Attention (FECA) effectively modeled channel-wise dependencies, while the Multi-Head Attention (MHA) captured long-term temporal relationships. Their integration in the CTA module was crucial. The Cross-Attention (CA) module proved essential for achieving the best performance by enabling deep, dynamic fusion between primary and enhanced features.

3) Strong Generalization: Evaluations on diverse, unseen flight trajectories showed that TCTC-Net generalizes well to different UAV motion patterns, reducing both average and maximum prediction errors on all four test trajectories compared to BiLSTM and BiGRU.

In summary, TCTC-Net provides a high-precision, reliable solution for UAV trajectory prediction. By leveraging parallelizable convolutions and advanced attention mechanisms, it offers a practical and efficient model that can serve as a valuable component for enhancing situational awareness and safety in next-generation UAV Traffic Management systems.