3D-UDiT: A Diffusion Transformer for China Drone Video Super-Resolution

Video super-resolution (VSR) is a critical task for enhancing the spatial resolution of low-resolution (LR) videos, enabling more reliable object detection and scene understanding in various applications, including China drone surveillance, mapping, and reconnaissance. China drone platforms, such as those featured in the VisDrone and UAVDT benchmarks, often operate under challenging conditions: fast-moving small targets, cluttered backgrounds, and varying illumination in both visible and thermal infrared spectrums. Existing methods, both discriminative and generative, face difficulties in simultaneously preserving high-frequency details, maintaining temporal consistency, and achieving computational efficiency suitable for edge deployment. In this work, we propose a novel generative VSR framework named 3D-UDiT (3D U-shaped Diffusion Transformer), which is, to our knowledge, the first diffusion transformer architecture explicitly designed for China drone video super-resolution.

Our approach introduces three key innovations: a 3D Spatio-Temporal Positional Encoding (STPE-3D) to dynamically model inter-frame motions without explicit optical flow, a U-shaped Diffusion Transformer that fuses multi-scale features for robust detail recovery, and a Dual-Branch Cross-Attention Prompting (DBCAP) encoder that leverages both image-space and feature-space priors from the LR input to guide the diffusion generation process. Experiments on three challenging China drone VSR datasets, covering visible light (UrbanDroneVSR, VisDroneVSR) and thermal infrared (TIRDroneVSR), show that 3D-UDiT achieves the best perceptual quality (LPIPS, DISTS) on VisDroneVSR, the best temporal coherence (tLP) on UrbanDroneVSR, and perceptual quality second only to StableVSR on the remaining benchmarks, while reducing computational cost by 60% and improving inference speed by nearly 6× compared to the diffusion-based competitor StableVSR. The proposed method is also robust to varying motion magnitudes and Gaussian blur, making it suitable for real-world China drone applications.

1. Introduction

China drone technology has advanced rapidly in recent years, enabling aerial platforms to perform high-precision tasks such as agricultural monitoring, disaster assessment, and security surveillance. The VisDrone dataset, a large-scale benchmark collected by the AISKYEYE team in China, contains diverse urban and rural scenes captured by China drone cameras. However, due to hardware constraints—limited payload capacity, sensor size, and transmission bandwidth—raw drone footage often suffers from low spatial resolution. Super-resolution (SR) techniques can recover high-resolution (HR) details algorithmically, offering a cost-effective alternative to hardware upgrades.

Video super-resolution (VSR) is more challenging than single-image SR because it must maintain temporal consistency across frames while enhancing spatial details. Early VSR methods employed recurrent neural networks (RNNs) or 3D convolutions to model temporal dependencies, but they often produce over-smoothed results lacking high-frequency textures. Generative models, particularly generative adversarial networks (GANs), can produce more realistic textures, yet they suffer from training instability, mode collapse, and visual artifacts. Recently, denoising diffusion probabilistic models (DDPMs) have emerged as a powerful alternative for image and video generation, offering stable training and superior perceptual quality. For example, the StableVSR method integrates optical flow alignment with a latent diffusion model to achieve temporally consistent detail synthesis for general video SR. Nevertheless, optical flow estimation is computationally expensive and can fail under large motions or low-contrast thermal infrared conditions, which are common in China drone scenarios.

To address these limitations, we propose 3D-UDiT, a diffusion transformer that operates directly on spatio-temporal tokens, avoiding explicit motion compensation. Our main contributions are as follows:

We introduce STPE-3D, a content-adaptive 3D positional encoding that distinguishes static backgrounds from fast-moving small objects in China drone videos, enhancing temporal awareness without requiring optical flow.
We design a U-shaped Diffusion Transformer (U-DiT) that integrates multi-scale feature fusion with global attention, improving detail recovery in both visible and thermal infrared modalities.
We propose DBCAP, a dual-branch cross-attention prompting mechanism that fuses LR structural priors in both image and feature spaces, providing explicit guidance to the diffusion process and improving perceptual quality especially for low-contrast thermal targets and blurred visible objects.
We conduct extensive experiments on three China drone VSR datasets, showing that 3D-UDiT achieves the best LPIPS and DISTS on VisDroneVSR, outperforms all discriminative and any-scale methods in these perceptual metrics on all three datasets, and is significantly more efficient than the diffusion-based StableVSR.

2. Related Work

Discriminative VSR. Methods such as EDVR, BasicVSR, and BasicVSR++ employ deformable convolution or optical flow to align frames and then fuse them via recurrent or attention modules. While they achieve high PSNR and SSIM, they tend to produce overly smooth results lacking high-frequency realism. For China drone videos containing tiny moving objects (e.g., pedestrians, vehicles), these methods often fail to recover crisp edges.

Generative VSR. GAN-based approaches (e.g., HOR-GAN) improve perceptual quality but face unstable training. Diffusion models have recently been adopted for VSR. StableVSR uses optical flow to warp latent features and conditions a latent diffusion model for temporally consistent SR. However, the flow estimation step introduces significant computational overhead and can be inaccurate in thermal infrared or low-light China drone scenes.

Transformers for Video. Vision transformers (ViTs) have been applied to VSR via temporal attention or 3D convolution. However, most designs do not explicitly address the unique challenges of China drone data: large viewpoint changes, motion blur from fast flight, and small object dominance. Our 3D-UDiT adapts the diffusion transformer architecture to this setting by introducing spatio-temporal positional encoding and dual-branch prompting.

3. Proposed Method: 3D-UDiT

3.1 Overall Architecture

The overall pipeline of 3D-UDiT is illustrated in the conceptual diagram (not shown here due to referencing restrictions). It consists of three main components: the STPE-3D module, the U-DiT backbone, and the DBCAP encoder. Given an LR video sequence $\{ \mathbf{I}_t^{LR} \}_{t=1}^{T}$ (typically T=8 or 16 frames), we first encode each frame into a latent space using a variational autoencoder (VAE) to obtain spatially-aligned features. The features are then patchified and combined with STPE-3D to form 3D tokens. These tokens are processed by the U-DiT, which progressively denoises from a Gaussian noise map to the HR latent. The DBCAP injects cross-scale structural priors from the LR input at both image and feature levels, ensuring that the generation process respects original details. Finally, a decoder reconstructs the HR video frames.

3.2 Spatio-Temporal Positional Encoding (STPE-3D)

Standard 2D positional encodings (e.g., sinusoidal or rotary) treat each frame independently, causing temporal confusion when objects move rapidly between frames. STPE-3D introduces a content-aware temporal modulation mechanism. As formulated in Equation (1), we first extract residual features $\mathbf{F}_t$ from each LR frame using a shallow CNN. For a batch of T frames, we concatenate these feature maps along the height and width axes separately to compute scale and shift parameters:

$$
\begin{aligned}
\mathbf{F}_{\text{scale}} &= \text{Proj}(\text{Layer}_3[\text{HConcat}(\mathbf{F}_{1}, \dots, \mathbf{F}_{T})]),\\
\mathbf{F}_{\text{shift}} &= \text{Proj}(\text{Layer}_4[\text{VConcat}(\mathbf{F}_{1}, \dots, \mathbf{F}_{T})]).
\end{aligned}
\tag{1}
$$

Here, $\text{HConcat}$ and $\text{VConcat}$ denote concatenation along the height and width dimensions, respectively, and $\text{Proj}$ is an MLP that projects features to modulation signals. The scale $\mathbf{F}_{\text{scale}}$ and shift $\mathbf{F}_{\text{shift}}$ are then applied to the rotary 2D positional encoding of each token, producing a temporally modulated 3D position embedding. This allows the model to distinguish between the same spatial location at different times, effectively handling fast-moving China drone targets such as vehicles on highways.

STPE-3D eliminates the need for optical flow estimation. In contrast to StableVSR that requires explicit flow warping, our method can process multiple frames in parallel, enabling more efficient inference.
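The modulation step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not give the exact way scale and shift act on the rotary encoding, so the sketch assumes a per-frame affine transform of the rotary angles, `(1 + scale) * angle + shift`, and treats the outputs of Eq. (1) as given arrays. All function names are hypothetical.

```python
import numpy as np

def rotary_angles_2d(height, width, dim, base=10000.0):
    """Rotary angles for an H x W token grid: half of the frequencies
    encode the row index, the other half the column index."""
    assert dim % 4 == 0, "dim must be divisible by 4 (two axes, paired channels)"
    n_freq = dim // 4
    freqs = base ** (-np.arange(n_freq) / n_freq)           # (n_freq,)
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    row_ang = rows[..., None] * freqs                        # (H, W, n_freq)
    col_ang = cols[..., None] * freqs                        # (H, W, n_freq)
    return np.concatenate([row_ang, col_ang], axis=-1)       # (H, W, dim // 2)

def modulate_angles(angles, scale, shift):
    """ASSUMED modulation: per-frame affine transform of the shared 2D angles.
    angles: (H, W, D); scale, shift: (T, H, W, D); returns (T, H, W, D)."""
    return (1.0 + scale) * angles[None] + shift

def apply_rotary(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
T, H, W, dim = 4, 6, 8, 16
angles = rotary_angles_2d(H, W, dim)
# Stand-ins for F_scale and F_shift from Eq. (1); random values for illustration.
scale = 0.1 * rng.standard_normal((T, H, W, dim // 2))
shift = 0.1 * rng.standard_normal((T, H, W, dim // 2))
angles_3d = modulate_angles(angles, scale, shift)
tokens = rng.standard_normal((T, H, W, dim))
encoded = apply_rotary(tokens, angles_3d)
```

With zero scale and shift every frame receives the same 2D encoding, which is the frame-independent behavior the section argues against; non-zero, content-dependent values make the encoding of a given spatial location differ from frame to frame.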

3.3 U-shaped Diffusion Transformer (U-DiT)

Standard DiT operates on a single-scale token sequence and therefore lacks the multi-scale inductive bias crucial for recovering both global structures (e.g., buildings) and local details (e.g., license plates). We adopt a U-shaped encoder-decoder architecture with skip connections and integrated DiT blocks at the bottleneck. The overall denoising path is:

Encoder: three down-sampling stages with convolution and attention layers.
Bottleneck: a stack of DiT blocks (denoted DiT-Block) performs full self-attention across all tokens, modeling long-range dependencies.
Decoder: symmetric up-sampling stages with cross-layer fusion units (CFU) that combine the upsampled feature with the corresponding skip-connected encoder feature.

The CFU operation is defined as:

$$
\mathbf{F}_{\text{fused}} = \text{Conv}\left( \text{Concat}\left[ \text{Up}(\mathbf{F}_{\text{dec}}), \text{Attn}(\mathbf{F}_{\text{skip}}) \right] \right).
\tag{2}
$$

Here, $\text{Attn}(\cdot)$ applies adaptive modulation weights generated from the LR encoder features, enabling the network to focus on regions with rich details.
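A minimal NumPy sketch of Eq. (2) is given below. The concrete operators are assumptions made for illustration, since the text does not specify them: nearest-neighbour upsampling for $\text{Up}$, a sigmoid gate computed from the LR features for $\text{Attn}$, and 1×1 convolutions throughout. The helper names and the `params` dictionary are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight, bias):
    """1x1 convolution, i.e. a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_layer_fusion(f_dec, f_skip, f_lr, params):
    """Eq. (2) with ASSUMED choices for Up, Attn and Conv."""
    up = upsample_nearest(f_dec)                                        # Up(F_dec)
    gate = sigmoid(conv1x1(f_lr, params["w_gate"], params["b_gate"]))   # weights from LR features
    attended = gate * f_skip                                            # Attn(F_skip)
    fused_in = np.concatenate([up, attended], axis=0)                   # Concat over channels
    return conv1x1(fused_in, params["w_out"], params["b_out"])          # Conv

rng = np.random.default_rng(0)
f_dec = rng.standard_normal((8, 4, 4))    # coarse decoder feature
f_skip = rng.standard_normal((4, 8, 8))   # encoder feature at the target resolution
f_lr = rng.standard_normal((3, 8, 8))     # LR encoder feature used to build the gate
params = {
    "w_gate": rng.standard_normal((4, 3)), "b_gate": np.zeros(4),
    "w_out": rng.standard_normal((6, 12)), "b_out": np.zeros(6),
}
fused = cross_layer_fusion(f_dec, f_skip, f_lr, params)
```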

Each DiT-Block contains a Downsampled Self-Attention (DSAttn) layer and an MLP, both modulated by the conditioning signals $\mathbf{C}_1, \mathbf{C}_2, \mathbf{C}_3$ derived from the LR prompts and timestep embedding. DSAttn computes Q, K, V after convolution, and applies rotary 2D positional encoding within the attention to preserve spatial information. The U-DiT structure reduces computational cost compared to a pure DiT by allowing the deeper stages, including the full self-attention at the bottleneck, to operate at lower spatial resolutions, and the skip connections help recover fine details.
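The saving from running full self-attention only at the bottleneck can be illustrated with simple arithmetic. The sketch below assumes a hypothetical 64×64 token grid at full resolution and a 2× spatial reduction in each of the three down-sampling stages; neither figure is stated in the text, and the function names are illustrative.

```python
def attention_pairs(height, width):
    """Number of query-key pairs for full self-attention on an H x W token grid."""
    n = height * width
    return n * n

def stage_grids(height, width, num_stages=3, factor=2):
    """Token grid after each of `num_stages` down-sampling steps."""
    grids = []
    for _ in range(num_stages):
        height, width = height // factor, width // factor
        grids.append((height, width))
    return grids

full = attention_pairs(64, 64)                 # attention at full resolution
grids = stage_grids(64, 64)                    # [(32, 32), (16, 16), (8, 8)]
bottleneck_h, bottleneck_w = grids[-1]
reduced = attention_pairs(bottleneck_h, bottleneck_w)
ratio = full // reduced                        # 16x fewer pairs per stage, 16**3 overall
print(grids, ratio)
```

Under these assumptions each 2× reduction cuts the token count by 4× and the number of attention pairs by 16×, so three stages reduce the pair count by a factor of 4096. Actual FLOPs depend on the convolution and MLP layers as well, which this count ignores.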

3.4 Dual-Branch Cross-Attention Prompting (DBCAP)

The DBCAP module explicitly conditions the diffusion process on the LR input, providing strong structural priors that are especially beneficial for China drone scenes where contrast may be low (thermal) or motion blur is severe. As shown in the conceptual diagram (not displayed), DBCAP contains two branches:

Image-space branch (single-modal): The LR frame is directly concatenated with the current noisy feature, and a cross-attention mechanism updates the feature. The formulation is:

$$
\mathbf{F}'_{\text{feats}} = \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{F}_{\text{feats}}, \mathbf{I}^{LR}) \right] \right\}.
\tag{3}
$$

Here, $\mathbf{I}^{LR}$ is the original LR frame (spatially upsampled to match token resolution), providing pixel-level spatial cues.

Feature-space branch (multi-modal): We encode the LR frame using an OpenCLIP image encoder (or a lightweight alternative) to obtain semantic embeddings. This branch generates Q, K, V separately for the noisy features and the semantic prompts, then performs concatenated attention:

$$
\begin{aligned}
\mathbf{F}_{\text{feats}} &= \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{Q}_1, \mathbf{Q}_2), \text{Concat}(\mathbf{K}_1, \mathbf{K}_2), \text{Concat}(\mathbf{V}_1, \mathbf{V}_2) \right] \otimes \mathbf{V}_1 \right\},\\
\mathbf{F}_{\text{encode}} &= \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{Q}_1, \mathbf{Q}_2), \text{Concat}(\mathbf{K}_1, \mathbf{K}_2), \text{Concat}(\mathbf{V}_1, \mathbf{V}_2) \right] \otimes \mathbf{V}_2 \right\}.
\end{aligned}
\tag{4}
$$

The two outputs are fused via additive modulation. By jointly attending to both LR visual features and high-level semantics, DBCAP helps the model infer plausible textures even in heavily blurred or low-contrast regions, such as thermal silhouettes of China drone vehicles.
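One possible reading of Eq. (4) is a joint attention over the concatenated feature tokens and prompt tokens, after which the output is split back into a feature part and a prompt part. The NumPy sketch below follows that reading; it is an interpretation rather than the authors' implementation, it omits the Modulate step and multi-head splitting, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q1, k1, v1, q2, k2, v2):
    """Attend over the concatenation of feature tokens (index 1)
    and semantic prompt tokens (index 2), then split the result."""
    n1 = q1.shape[0]
    q = np.concatenate([q1, q2], axis=0)
    k = np.concatenate([k1, k2], axis=0)
    v = np.concatenate([v1, v2], axis=0)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N1 + N2, N1 + N2)
    out = weights @ v
    return out[:n1], out[n1:], weights                   # feature part, prompt part

rng = np.random.default_rng(0)
n1, n2, d = 12, 5, 8                                     # feature tokens, prompt tokens, width
q1, k1, v1 = (rng.standard_normal((n1, d)) for _ in range(3))
q2, k2, v2 = (rng.standard_normal((n2, d)) for _ in range(3))
feat_out, prompt_out, weights = joint_attention(q1, k1, v1, q2, k2, v2)
```

Because the keys and values of both streams sit in one sequence, every noisy feature token can draw on the semantic prompts and vice versa in a single attention pass.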

3.5 Loss Function

We adopt a combination of the mean squared error (MSE) loss and the variational lower bound (VLB) loss, as used in DDPMs:

$$
\mathcal{L}_{\text{mse}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\mathbf{x}_t, t, \mathbf{c}) \|^2 \right],
\tag{5}
$$

$$
\mathcal{L}_{\text{vb}} = \mathbb{E}_q \left[ -\log p_\theta(\mathbf{x}_T) - \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})} \right],
\tag{6}
$$

The total loss is $\mathcal{L} = \mathcal{L}_{\text{mse}} + \lambda \mathcal{L}_{\text{vb}}$, with $\lambda=1.0$. The MSE loss ensures accurate mean prediction, while the VLB loss aligns the learned reverse process with the true posterior, improving generation stability.
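The noise-prediction term in Eq. (5) can be illustrated with a short NumPy sketch. It assumes the standard DDPM forward process and the cosine schedule of Nichol and Dhariwal with offset s = 0.008; the offset and the function names are assumptions, and the conditioning $\mathbf{c}$ and the network itself are replaced by stand-in predictions.

```python
import numpy as np

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar(t) of a cosine schedule,
    normalised so that alpha_bar(0) = 1."""
    def f(u):
        return np.cos((u / num_steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f(t) / f(0)

def q_sample(x0, t, eps, num_steps):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    ab = cosine_alpha_bar(t, num_steps)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def noise_mse(eps, eps_pred):
    """Single-sample estimate of Eq. (5), using the mean instead of the
    sum over elements (a constant factor)."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
num_steps = 1000
x0 = rng.standard_normal((4, 8, 8))        # stand-in for a clean HR latent
eps = rng.standard_normal(x0.shape)        # noise the network should predict
t = 500
x_t = q_sample(x0, t, eps, num_steps)

zero_guess_loss = noise_mse(eps, np.zeros_like(eps))   # a network that predicts no noise
perfect_loss = noise_mse(eps, eps)                     # an ideal network
```

In training, `eps_pred` would come from the denoiser evaluated on `x_t`, the timestep, and the LR conditioning; the two stand-in predictions only bracket the range of the loss.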

4. Experiments

4.1 Datasets

We evaluate our method on three China drone VSR datasets to cover both visible and thermal modalities:

UrbanDroneVSR: Derived from UAVDT, which contains 177 sequences (100 frames each) of urban scenarios (squares, highways, etc.) captured by a China drone. Following PhaseVSRnet, we crop to 640×480 and apply 4× bicubic downscaling.
VisDroneVSR: Built from the VisDrone dataset (collected by AISKYEYE team in 14 cities across China). It includes 212 sequences (100 frames each) with more diverse densities and weather conditions.
TIRDroneVSR: Constructed from LSOTB-TIR (thermal infrared object tracking), containing 59 training sequences (27,545 frames) and 15 test sequences (1,500 frames). All are cropped to 640×480 and 4× downscaled, representing typical China drone thermal imaging.

Table 1: Quantitative comparison on UrbanDroneVSR (China drone visible dataset). ↑/↓ indicates higher/lower is better. Bold: best; underline: second best.
Method	PSNR(↑)	SSIM(↑)	LPIPS(↓)	DISTS(↓)	tLP(↓)	tOF(↓)
Bicubic	22.81	0.604	0.637	0.260	30.60	0.777
TDAN	23.13	0.628	0.572	0.238	23.20	0.931
EDVR-L	25.13	0.738	0.328	0.179	18.08	0.485
BasicVSR	24.77	0.726	0.335	0.192	16.79	0.526
BasicVSR++	24.58	0.718	0.351	0.199	16.91	0.523
MSDTGP	23.71	0.669	0.429	0.216	17.88	0.689
RASVSR	24.68	0.721	0.342	0.195	16.67	0.521
LGTD	24.63	0.719	0.352	0.190	16.96	0.540
CSVSR	24.49	0.713	0.368	0.202	19.00	0.532
AnyTSR	23.94	0.682	0.405	0.213	15.08	0.696
AnyTSR++	23.40	0.652	0.525	0.227	14.28	0.742
IBRN	24.92	0.733	0.329	0.187	15.74	0.497
StableVSR	23.74	0.697	0.199	0.148	16.01	0.574
3D-UDiT (Ours)	23.50	0.671	0.272	0.167	12.49	0.585

Table 2: Quantitative comparison on VisDroneVSR (China drone visible dataset).
Method	PSNR(↑)	SSIM(↑)	LPIPS(↓)	DISTS(↓)	tLP(↓)	tOF(↓)
Bicubic	26.17	0.727	0.492	0.224	21.60	0.823
TDAN	27.02	0.757	0.405	0.190	15.24	0.909
EDVR-L	29.62	0.827	0.245	0.145	12.67	0.578
BasicVSR	29.27	0.819	0.256	0.155	11.92	0.574
BasicVSR++	29.10	0.816	0.265	0.159	10.35	0.569
MSDTGP	27.63	0.776	0.325	0.178	12.74	0.786
RASVSR	29.22	0.819	0.260	0.158	11.00	0.569
LGTD	28.07	0.790	0.309	0.166	12.20	0.689
CSVSR	28.99	0.814	0.272	0.162	12.97	0.602
AnyTSR	28.38	0.796	0.293	0.173	12.28	0.705
AnyTSR++	28.34	0.795	0.300	0.172	12.50	0.713
IBRN	29.36	0.821	0.252	0.153	10.96	0.565
StableVSR	27.78	0.784	0.178	0.131	9.00	0.623
3D-UDiT (Ours)	27.66	0.778	0.146	0.107	17.32	0.685

4.2 Implementation Details

We train all models on an NVIDIA RTX 4090 (24 GB). Input video clips consist of 8 consecutive frames (T=8). The VAE encoder compresses the LR frames to a latent scale of 1/8 of the original resolution. Patch size is 2×2. We use the Adam optimizer with a learning rate of 5×10^{-5} and a total of 1 million training steps. Batch size is 4. The diffusion process employs 1000 timesteps with a cosine noise schedule, but at inference we only use 20 steps (as we found that 20 steps achieve near-optimal quality). For the U-DiT, we use 3 down/up stages with channel dimensions [64, 128, 256] at each stage. The bottleneck contains 4 DiT blocks.
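Running 20 inference steps on a model trained with 1000 timesteps requires choosing which timesteps to visit. The text does not name the sampler, so the sketch below assumes an evenly spaced, descending subset of the kind used by DDIM-style samplers; the function name is hypothetical.

```python
def strided_timesteps(num_train_steps=1000, num_inference_steps=20):
    """Evenly spaced, descending subset of the training timesteps (ASSUMED spacing)."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("num_inference_steps must be in (0, num_train_steps]")
    stride = num_train_steps // num_inference_steps
    last = num_train_steps - 1
    return [last - i * stride for i in range(num_inference_steps)]

timesteps = strided_timesteps()
print(timesteps[:3], timesteps[-1], len(timesteps))
```

With the default arguments the sampler starts at timestep 999 and moves down in strides of 50, so the denoiser is evaluated 20 times instead of 1000.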

4.3 Comparison with State-of-the-Art

We compare our 3D-UDiT with 12 VSR methods, including discriminative (TDAN, EDVR-L, BasicVSR, BasicVSR++, MSDTGP, RASVSR, LGTD, CSVSR, IBRN), any-scale (AnyTSR, AnyTSR++), and generative (StableVSR) approaches, plus bicubic interpolation as a reference. On UrbanDroneVSR (Table 1), our method achieves the best tLP (12.49) among all methods and the second-best LPIPS (0.272) and DISTS (0.167), behind only StableVSR (0.199 and 0.148). Although PSNR (23.50) and SSIM (0.671) are moderately lower than those of discriminative methods like EDVR-L, perceptual metrics are more aligned with human visual quality, which is critical for China drone downstream tasks such as target recognition. On the larger VisDroneVSR dataset (Table 2), our LPIPS=0.146 and DISTS=0.107 surpass all other methods, including StableVSR (LPIPS=0.178, DISTS=0.131), an 18% relative improvement in LPIPS. However, our tLP on VisDroneVSR (17.32) is worse than that of every learned baseline. Since tLP measures temporal fluctuation, we attribute this to subtle flickering caused by the stochasticity of the diffusion process, which may be mitigated by increasing inference steps or using ensemble techniques. Nevertheless, the frame-level perceptual quality remains superior.

Table 3: Quantitative comparison on TIRDroneVSR (thermal infrared China drone dataset) + complexity analysis.
Method	PSNR(↑)	SSIM(↑)	LPIPS(↓)	DISTS(↓)	tLP(↓)	tOF(↓)	GFLOPs	Params (M)	FPS	Mem (MB)
Bicubic	37.18	0.955	0.163	0.126	11.62	7.487	–	–	–	–
TDAN	40.68	0.974	0.092	0.082	4.69	6.429	39.34	2.14	14.71	8.16
EDVR-L	42.74	0.980	0.078	0.076	6.84	6.341	154.27	3.15	8.33	12.02
BasicVSR	42.41	0.980	0.075	0.074	5.99	6.453	363.95	6.29	19.19	23.99
BasicVSR++	42.16	0.979	0.080	0.076	6.25	6.164	372.97	7.03	14.91	26.82
MSDTGP	40.47	0.973	0.097	0.086	5.00	6.976	3599.00	14.14	5.09	53.94
RASVSR	41.80	0.977	0.085	0.080	5.95	6.702	274.72	5.24	18.87	19.99
LGTD	40.76	0.974	0.092	0.081	5.41	7.309	545.52	24.03	10.67	91.67
CSVSR	42.46	0.980	0.079	0.077	6.81	6.815	248.66	6.44	6.49	24.57
AnyTSR	41.04	0.974	0.087	0.079	5.27	7.39	566.75	1.87	6.79	7.12
AnyTSR++	41.14	0.975	0.087	0.082	5.26	8.295	364.10	1.50	11.67	5.72
IBRN	42.35	0.981	0.071	0.072	5.69	6.393	1151.64	7.83	9.20	29.85
StableVSR	40.70	0.971	0.056	0.060	3.94	8.419	153.46	303.88	0.10	1159.21
3D-UDiT (Ours)	39.27	0.966	0.066	0.069	6.31	7.004	61.57	254.32	0.57	970.15

On the thermal infrared dataset TIRDroneVSR (Table 3), 3D-UDiT achieves competitive perceptual quality (LPIPS=0.066, DISTS=0.069), second only to StableVSR. However, our computational cost is dramatically lower: 61.57 GFLOPs per frame vs. StableVSR’s 153.46 GFLOPs, and our inference speed is 0.57 FPS vs. 0.10 FPS, an improvement of nearly 6×. This is because StableVSR requires optical flow estimation before each diffusion step, while ours directly processes tokens without flow. The parameter count is also lower than StableVSR’s (254.32M vs. 303.88M), although both diffusion models remain far larger and slower than the discriminative baselines. The lower FLOPs make our method a more practical candidate for China drone edge deployment than prior diffusion-based VSR, even though 0.57 FPS is still short of real-time operation.
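For reference, the efficiency figures quoted against StableVSR follow directly from the GFLOPs and FPS columns of Table 3:

$$
1 - \frac{61.57}{153.46} \approx 0.599 \quad (\text{about } 60\% \text{ fewer GFLOPs}), \qquad \frac{0.57}{0.10} = 5.7 \quad (\text{nearly } 6\times \text{ faster}).
$$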

We also analyze robustness to varying motion magnitude and Gaussian blur on VisDroneVSR (Table 4). Our method achieves the lowest LPIPS under all four conditions, including high motion (LPIPS=0.162 vs. StableVSR’s 0.177), which is consistent with STPE-3D being effective for dynamic China drone scenes. StableVSR retains the lowest tLP under medium motion, high motion, and Gaussian blur.

Table 4: Robustness analysis on VisDroneVSR under different motion magnitudes and Gaussian blur.
Method	Low Motion			Medium Motion			High Motion			Gaussian Blur
Method	PSNR	LPIPS	tLP	PSNR	LPIPS	tLP	PSNR	LPIPS	tLP	PSNR	LPIPS	tLP
TDAN	28.14	0.376	11.06	25.39	0.411	15.01	27.25	0.429	19.62	27.02	0.405	15.24
EDVR-L	30.70	0.219	8.71	28.06	0.232	9.85	29.84	0.282	18.98	29.62	0.245	12.67
BasicVSR	29.70	0.247	9.93	27.03	0.261	9.79	28.92	0.307	20.21	28.63	0.272	13.52
BasicVSR++	29.43	0.264	8.01	27.07	0.262	7.85	28.88	0.309	18.06	28.54	0.279	11.51
MSDTGP	28.32	0.312	10.99	25.68	0.332	13.01	27.58	0.364	17.64	27.28	0.336	13.93
RASVSR	29.54	0.258	8.46	27.14	0.258	8.83	28.97	0.304	19.01	28.63	0.274	12.29
LGTD	28.63	0.297	10.13	26.22	0.309	12.50	27.84	0.358	17.64	27.64	0.322	13.47
CSVSR	29.91	0.248	8.83	27.50	0.257	9.99	29.32	0.307	19.59	28.99	0.272	12.97
AnyTSR	28.86	0.283	10.60	26.23	0.307	12.13	28.11	0.347	17.26	27.82	0.313	13.40
AnyTSR++	28.86	0.286	10.55	26.20	0.315	12.23	28.13	0.354	17.58	27.82	0.319	13.52
IBRN	29.69	0.247	9.15	27.14	0.257	9.11	28.94	0.303	18.50	28.67	0.270	12.44
StableVSR	28.91	0.162	10.54	20.26	0.117	7.15	27.60	0.177	6.23	27.52	0.164	8.52
3D-UDiT (Ours)	29.76	0.123	9.55	20.55	0.109	8.36	28.16	0.162	7.45	28.14	0.141	8.99

4.4 Ablation Study

We perform step-wise ablation on UrbanDroneVSR to verify each component (Table 5). Starting from a baseline DiT (no STPE-3D, no U-DiT, no DBCAP), we incrementally add STPE-3D (Model-1), then U-DiT (Model-2), and finally DBCAP (Model-3 = full 3D-UDiT). Quantitative results show clear improvements in perceptual metrics.

Table 5: Ablation study on UrbanDroneVSR (China drone dataset).
Model	STPE-3D	U-DiT	DBCAP	PSNR(↑)	SSIM(↑)	LPIPS(↓)	DISTS(↓)	GFLOPs	Params (M)
Baseline (DiT)	✗	✗	✗	19.82	0.473	0.563	0.339	104.69	134.72
Model-1 (+STPE-3D)	✓	✗	✗	20.35	0.487	0.528	0.325	149.41	145.79
Model-2 (+U-DiT)	✓	✓	✗	21.86	0.562	0.358	0.221	21.76	156.31
Model-3 (Full)	✓	✓	✓	23.50	0.671	0.272	0.167	61.57	254.32

Specifically, adding STPE-3D improves LPIPS by 6.2% (0.563→0.528) and reduces temporal artifacts. Introducing the U-DiT structure improves PSNR by 1.51 dB and LPIPS by 32.2% (0.528→0.358), while simultaneously reducing GFLOPs from 149.41 to 21.76 due to the multi-resolution approach. Finally, DBCAP yields an additional LPIPS improvement of 24.0% (0.358→0.272) and lowers DISTS from 0.221 to 0.167, demonstrating the effectiveness of explicit spatial and cross-modal guidance. DBCAP also raises the cost from 21.76 to 61.57 GFLOPs and from 156.31M to 254.32M parameters because of the dual-branch attention, but the perceptual gain justifies the added complexity.
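The step-wise changes quoted above can be recomputed from the rows of Table 5. The short script below copies the PSNR, LPIPS, and GFLOPs columns from the table; the function and field names are illustrative.

```python
# (model, PSNR in dB, LPIPS, GFLOPs), copied from Table 5
rows = [
    ("Baseline (DiT)", 19.82, 0.563, 104.69),
    ("Model-1 (+STPE-3D)", 20.35, 0.528, 149.41),
    ("Model-2 (+U-DiT)", 21.86, 0.358, 21.76),
    ("Model-3 (Full)", 23.50, 0.272, 61.57),
]

def stepwise_changes(rows):
    """Change of each model relative to the row directly above it."""
    changes = []
    for (_, psnr0, lpips0, gf0), (name, psnr1, lpips1, gf1) in zip(rows, rows[1:]):
        changes.append({
            "model": name,
            "psnr_gain_db": round(psnr1 - psnr0, 2),
            "lpips_drop_pct": round(100 * (lpips0 - lpips1) / lpips0, 1),
            "gflops_ratio": round(gf1 / gf0, 2),
        })
    return changes

changes = stepwise_changes(rows)
for change in changes:
    print(change)
```

The GFLOPs ratio makes the cost side of each step explicit: STPE-3D and DBCAP both add compute, while the U-shaped backbone is the only step that reduces it.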

We also examine the impact of inference steps (Figure 6 in original paper). Our 3D-UDiT is robust to step count: using only 20 steps yields LPIPS of 0.152 on VisDroneVSR, comparable to 100 steps (0.146). This allows fast inference without sacrificing quality, critical for China drone real-time applications.

5. Conclusion

In this paper, we presented 3D-UDiT, a diffusion transformer framework tailored for China drone video super-resolution. By introducing STPE-3D for content-adaptive temporal encoding, U-DiT for multi-scale feature fusion, and DBCAP for dual-branch prior injection, our method achieves perceptual quality superior to state-of-the-art discriminative VSR methods and competitive with generative methods like StableVSR, while reducing computational costs by ~60% and improving inference speed by ~6×. Extensive experiments on visible and thermal infrared China drone datasets (UrbanDroneVSR, VisDroneVSR, TIRDroneVSR) demonstrate the robustness and efficiency of our approach. Future work will explore sparse attention mechanisms and motion-adaptive constraints to further reduce memory footprint and enhance temporal smoothness, making 3D-UDiT more suitable for deployment on onboard China drone processors.