3D-UDiT: A Diffusion Transformer for China Drone Video Super-Resolution

Video super-resolution (VSR) is a critical task for enhancing the spatial resolution of low-resolution (LR) videos, enabling more reliable object detection and scene understanding in applications such as China drone surveillance, mapping, and reconnaissance. China drone platforms, such as those featured in the VisDrone and UAVDT benchmarks, often operate under challenging conditions: fast-moving small targets, cluttered backgrounds, and varying illumination in both the visible and thermal infrared spectra. Existing methods, both discriminative and generative, struggle to simultaneously preserve high-frequency details, maintain temporal consistency, and achieve the computational efficiency needed for edge deployment. In this work, we propose a novel generative VSR framework named 3D-UDiT (3D U-shaped Diffusion Transformer), which is, to our knowledge, the first diffusion transformer architecture explicitly designed for China drone video super-resolution.

Our approach introduces three key innovations: a 3D Spatio-Temporal Positional Encoding (STPE-3D) that dynamically models inter-frame motion without explicit optical flow, a U-shaped Diffusion Transformer that fuses multi-scale features for robust detail recovery, and a Dual-Branch Cross-Attention Prompting (DBCAP) encoder that leverages both image-space and feature-space priors from the LR input to guide the diffusion generation process. Experiments on three challenging China drone VSR datasets, two visible-light (UrbanDroneVSR, VisDroneVSR) and one thermal infrared (TIRDroneVSR), show that 3D-UDiT achieves the best or second-best perceptual quality (LPIPS, DISTS) on every dataset and the best temporal coherence (tLP) on UrbanDroneVSR, while reducing computational cost by 60% and improving inference speed by nearly 6× compared to the diffusion-based competitor StableVSR. The proposed method is also robust to varying motion magnitudes and Gaussian blur, making it suitable for real-world China drone applications.

1. Introduction

China drone technology has advanced rapidly in recent years, enabling aerial platforms to perform high-precision tasks such as agricultural monitoring, disaster assessment, and security surveillance. The VisDrone dataset, a large-scale benchmark collected by the AISKYEYE team in China, contains diverse urban and rural scenes captured by China drone cameras. However, due to hardware constraints—limited payload capacity, sensor size, and transmission bandwidth—raw drone footage often suffers from low spatial resolution. Super-resolution (SR) techniques can recover high-resolution (HR) details algorithmically, offering a cost-effective alternative to hardware upgrades.

Video super-resolution (VSR) is more challenging than single-image SR because it must maintain temporal consistency across frames while enhancing spatial details. Early VSR methods employed recurrent neural networks (RNNs) or 3D convolutions to model temporal dependencies, but they often produce over-smoothed results lacking high-frequency textures. Generative models, particularly generative adversarial networks (GANs), can produce more realistic textures, yet they suffer from training instability, mode collapse, and visual artifacts. Recently, denoising diffusion probabilistic models (DDPMs) have emerged as a powerful alternative for image and video generation, offering stable training and superior perceptual quality. For example, StableVSR integrates optical flow alignment with a latent diffusion model to achieve temporally consistent detail synthesis for general video SR. Nevertheless, optical flow estimation is computationally expensive and can fail under large motions or low-contrast thermal infrared conditions, both of which are common in China drone scenarios.

To address these limitations, we propose 3D-UDiT, a diffusion transformer that operates directly on spatio-temporal tokens, avoiding explicit motion compensation. Our main contributions are as follows:

  • We introduce STPE-3D, a content-adaptive 3D positional encoding that distinguishes static backgrounds from fast-moving small objects in China drone videos, enhancing temporal awareness without requiring optical flow.
  • We design a U-shaped Diffusion Transformer (U-DiT) that integrates multi-scale feature fusion with global attention, improving detail recovery in both visible and thermal infrared modalities.
  • We propose DBCAP, a dual-branch cross-attention prompting mechanism that fuses LR structural priors in both image and feature spaces, providing explicit guidance to the diffusion process and improving perceptual quality especially for low-contrast thermal targets and blurred visible objects.
  • We conduct extensive experiments on three China drone VSR datasets, showing that 3D-UDiT outperforms state-of-the-art discriminative methods in perceptual metrics (LPIPS, DISTS) and is competitive with the generative method StableVSR, while being significantly more efficient than that diffusion-based approach.

2. Related Work

Discriminative VSR. Methods such as EDVR, BasicVSR, and BasicVSR++ employ deformable convolution or optical flow to align frames and then fuse them via recurrent or attention modules. While they achieve high PSNR and SSIM, they tend to produce overly smooth results lacking high-frequency realism. For China drone videos containing tiny moving objects (e.g., pedestrians, vehicles), these methods often fail to recover crisp edges.

Generative VSR. GAN-based approaches (e.g., HOR-GAN) improve perceptual quality but face unstable training. Diffusion models have recently been adopted for VSR. StableVSR uses optical flow to warp latent features and conditions a latent diffusion model for temporally consistent SR. However, the flow estimation step introduces significant computational overhead and can be inaccurate in thermal infrared or low-light China drone scenes.

Transformers for Video. Vision transformers (ViTs) have been applied to VSR via temporal attention or 3D convolution. However, most designs do not explicitly address the unique challenges of China drone data: large viewpoint changes, motion blur from fast flight, and small object dominance. Our 3D-UDiT adapts the diffusion transformer architecture to this setting by introducing spatio-temporal positional encoding and dual-branch prompting.

3. Proposed Method: 3D-UDiT

3.1 Overall Architecture

3D-UDiT consists of three main components: the STPE-3D module, the U-DiT backbone, and the DBCAP encoder (the pipeline diagram is not reproduced here). Given an LR video sequence \(\{ \mathbf{I}_t^{LR} \}_{t=1}^{T}\) (typically T=8 or 16 frames), we first encode each frame into a latent space using a variational autoencoder (VAE) to obtain spatially aligned features. The features are then patchified and combined with STPE-3D to form 3D tokens. These tokens are processed by the U-DiT, which progressively denoises a Gaussian noise map into the HR latent. The DBCAP encoder injects structural priors from the LR input at both the image and feature levels, so that the generation process respects the original details. Finally, a decoder reconstructs the HR video frames.

3.2 Spatio-Temporal Positional Encoding (STPE-3D)

Standard 2D positional encodings (e.g., sinusoidal or rotary) treat each frame independently, causing temporal confusion when objects move rapidly between frames. STPE-3D introduces a content-aware temporal modulation mechanism. As formulated in Equation (1), we first extract residual features \(\mathbf{F}_t\) from each LR frame using a shallow CNN. For a batch of T frames, we concatenate these feature maps along the height and width axes separately to compute scale and shift parameters:

$$
\begin{aligned}
\mathbf{F}_{\text{scale}} &= \text{Proj}(\text{Layer}_3[\text{HConcat}(\mathbf{F}_{1}, \dots, \mathbf{F}_{T})]),\\
\mathbf{F}_{\text{shift}} &= \text{Proj}(\text{Layer}_4[\text{VConcat}(\mathbf{F}_{1}, \dots, \mathbf{F}_{T})]).
\end{aligned}
\tag{1}
$$

Here, \(\text{HConcat}\) and \(\text{VConcat}\) denote concatenation along the height and width dimensions, respectively, and \(\text{Proj}\) is an MLP that projects features to modulation signals. The scale \(\mathbf{F}_{\text{scale}}\) and shift \(\mathbf{F}_{\text{shift}}\) are then applied to the rotary 2D positional encoding of each token, producing a temporally modulated 3D position embedding. This allows the model to distinguish the same spatial location at different times, which helps with fast-moving targets in China drone footage such as vehicles on highways.

STPE-3D eliminates the need for optical flow estimation. Unlike StableVSR, which requires explicit flow warping, our method can process multiple frames in parallel, enabling more efficient inference.
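To make the idea of a temporally modulated positional encoding concrete, the following minimal sketch applies a per-frame scale and shift to rotary-style angles of a single token position. It is an illustration under stated assumptions, not the paper's implementation: the modulation form `theta * (1 + scale) + shift`, the function names, and the scalar per-frame values are all hypothetical, whereas in STPE-3D the scale and shift are feature maps predicted from the LR frames via Equation (1).

```python
import math

def rotary_angles(row, col, n_freq=2, base=10000.0):
    """Rotary-style angles for a 2D token position: row angles followed by column angles."""
    freqs = [base ** (-i / n_freq) for i in range(n_freq)]
    return [row * f for f in freqs] + [col * f for f in freqs]

def modulate(angles, scale, shift):
    """Assumed modulation form (not from the paper): theta' = theta * (1 + scale) + shift."""
    return [a * (1.0 + scale) + shift for a in angles]

def embed(angles):
    """Turn each angle into a (cos, sin) pair, as rotary encodings do."""
    return [(math.cos(a), math.sin(a)) for a in angles]

# Hypothetical per-frame modulation signals for a 3-frame clip.
frame_scale = [0.00, 0.05, 0.10]
frame_shift = [0.0, 0.3, 0.6]

# The same spatial location (row 3, column 5) observed in three frames.
base_angles = rotary_angles(row=3, col=5)
embeddings = [
    embed(modulate(base_angles, s, b))
    for s, b in zip(frame_scale, frame_shift)
]

for t, emb in enumerate(embeddings):
    print(t, [(round(c, 3), round(s, 3)) for c, s in emb])
```

With zero scale and shift the encoding reduces to the plain 2D rotary encoding, and any non-zero modulation gives the same spatial location a different embedding in each frame, which is the property the section relies on.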

3.3 U-shaped Diffusion Transformer (U-DiT)

A standard DiT operates on a single-scale token sequence and therefore lacks the multi-scale inductive bias that is crucial for recovering both global structures (e.g., buildings) and local details (e.g., license plates). We adopt a U-shaped encoder-decoder architecture with skip connections and DiT blocks integrated at the bottleneck. The overall denoising path is:

  • Encoder: three down-sampling stages with convolution and attention layers.
  • Bottleneck: a stack of DiT blocks (denoted DiT-Block) that performs full self-attention across all tokens, modeling long-range dependencies.
  • Decoder: symmetric up-sampling stages with cross-layer fusion units (CFU) that combine the upsampled feature with the corresponding skip-connected encoder feature.

The CFU operation is defined as:

$$
\mathbf{F}_{\text{fused}} = \text{Conv}\left( \text{Concat}\left[ \text{Up}(\mathbf{F}_{\text{dec}}), \text{Attn}(\mathbf{F}_{\text{skip}}) \right] \right).
\tag{2}
$$

Here, \(\text{Attn}(\cdot)\) applies adaptive modulation weights generated from the LR encoder features, enabling the network to focus on regions with rich details.

Each DiT-Block contains a Downsampled Self-Attention (DSAttn) layer and an MLP, both modulated by the conditioning signals \(\mathbf{C}_1, \mathbf{C}_2, \mathbf{C}_3\) derived from the LR prompts and the timestep embedding. DSAttn computes Q, K, and V after a convolution and applies rotary 2D positional encoding within the attention to preserve spatial information. The U-DiT structure reduces computational cost compared to a pure DiT because the deeper stages, including the bottleneck DiT blocks, operate at lower spatial resolutions, while the skip connections help recover fine details.
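The cost argument can be illustrated with simple token counting. The sketch below assumes that each down-sampling stage halves the height and width of the token grid (the text does not state the factor) and uses a hypothetical 32×32 grid with 8 frames; it counts the query-key pairs that full self-attention would evaluate at each depth.

```python
def attention_pairs(height, width, frames):
    """Return (token count, query-key pair count) for full spatio-temporal self-attention."""
    tokens = height * width * frames
    return tokens, tokens * tokens

def stage_costs(height, width, frames, num_down=3):
    """Token and pair counts at each depth, assuming every stage halves height and width."""
    costs = []
    for stage in range(num_down + 1):
        h, w = height >> stage, width >> stage
        costs.append((stage, *attention_pairs(h, w, frames)))
    return costs

# Hypothetical token grid: 32x32 tokens per frame, 8 frames, 3 down-sampling stages.
costs = stage_costs(height=32, width=32, frames=8, num_down=3)
for stage, tokens, pairs in costs:
    print(f"stage {stage}: {tokens:>5} tokens, {pairs:>10} attention pairs")
```

Under these assumptions the bottleneck sees 64 times fewer tokens than the input resolution and therefore 4096 times fewer attention pairs, which is why placing the full self-attention at the bottleneck is much cheaper than running it at full resolution.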

3.4 Dual-Branch Cross-Attention Prompting (DBCAP)

The DBCAP module explicitly conditions the diffusion process on the LR input, providing strong structural priors that are especially beneficial for China drone scenes where contrast may be low (thermal) or motion blur is severe. As shown in the conceptual diagram (not displayed), DBCAP contains two branches:

Image-space branch (single-modal): The LR frame is directly concatenated with the current noisy feature, and a cross-attention mechanism updates the feature. The formulation is:

$$
\mathbf{F}'_{\text{feats}} = \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{F}_{\text{feats}}, \mathbf{I}^{LR}) \right] \right\}.
\tag{3}
$$

Here, \(\mathbf{I}^{LR}\) is the original LR frame (spatially upsampled to match token resolution), providing pixel-level spatial cues.

Feature-space branch (multi-modal): We encode the LR frame using an OpenCLIP image encoder (or a lightweight alternative) to obtain semantic embeddings. This branch generates Q, K, V separately for the noisy features and the semantic prompts, then performs concatenated attention:

$$
\begin{aligned}
\mathbf{F}_{\text{feats}} &= \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{Q}_1, \mathbf{Q}_2), \text{Concat}(\mathbf{K}_1, \mathbf{K}_2), \text{Concat}(\mathbf{V}_1, \mathbf{V}_2) \right] \otimes \mathbf{V}_1 \right\},\\
\mathbf{F}_{\text{encode}} &= \text{Modulate}\left\{ \text{Attn}\left[ \text{Concat}(\mathbf{Q}_1, \mathbf{Q}_2), \text{Concat}(\mathbf{K}_1, \mathbf{K}_2), \text{Concat}(\mathbf{V}_1, \mathbf{V}_2) \right] \otimes \mathbf{V}_2 \right\}.
\end{aligned}
\tag{4}
$$

The two outputs are fused via additive modulation. By jointly attending to both LR visual features and high-level semantics, DBCAP helps the model infer plausible textures even in heavily blurred or low-contrast regions, such as thermal silhouettes of vehicles in China drone footage.
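The concatenated attention in Equation (4) can be sketched with a toy example. This is one possible reading, not the authors' implementation: the queries, keys, and values of the two streams are concatenated along the token axis, a single softmax attention is computed, and the result is split back into a noisy-feature part and a prompt part. The split by token stream, the function names, and the toy values are assumptions; the `Modulate` step and the \(\otimes \mathbf{V}_1\) / \(\otimes \mathbf{V}_2\) terms are not reproduced.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors; returns outputs and weights."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[d] for wj, v in zip(w, values)) for d in range(len(values[0]))]
        )
    return outputs, weights

def joint_attention(q1, k1, v1, q2, k2, v2):
    """Concatenate both streams along the token axis, attend once, then split the output."""
    out, weights = attention(q1 + q2, k1 + k2, v1 + v2)
    n1 = len(q1)
    return out[:n1], out[n1:], weights

# Toy data: 3 noisy-feature tokens (stream 1) and 2 semantic-prompt tokens (stream 2).
q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q2 = [[0.5, -0.5], [-1.0, 0.5]]
k2 = [[0.2, 0.8], [-0.3, 0.1]]
v2 = [[2.0, 2.0], [-1.0, 0.0]]

f_feats, f_encode, weights = joint_attention(q1, k1, v1, q2, k2, v2)
print(len(f_feats), len(f_encode), len(weights[0]))
```

Because every query attends over the keys of both streams, each noisy token can draw on the semantic prompts and vice versa, which is the cross-modal interaction the feature-space branch is meant to provide.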

3.5 Loss Function

We adopt a combination of the mean squared error (MSE) loss and the variational lower bound (VLB) loss, as used in DDPMs:

$$
\mathcal{L}_{\text{mse}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\mathbf{x}_t, t, \mathbf{c}) \|^2 \right],
\tag{5}
$$

$$
\mathcal{L}_{\text{vb}} = \mathbb{E}_q \left[ -\log p_\theta(\mathbf{x}_T) - \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})} \right].
\tag{6}
$$

The total loss is \(\mathcal{L} = \mathcal{L}_{\text{mse}} + \lambda \mathcal{L}_{\text{vb}}\), with \(\lambda=1.0\). The MSE loss ensures accurate mean prediction, while the VLB loss aligns the learned reverse process with the true posterior, improving generation stability.
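As a minimal sketch of the noise-prediction term in Equation (5), the snippet below uses the standard DDPM forward process \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\) on a toy 4-element "latent". The function names and the toy values are hypothetical, and the network \(\epsilon_\theta\) is replaced by two stand-in predictors (an oracle and an all-zeros guess) purely to show how the loss behaves.

```python
import math
import random

def q_sample(x0, alpha_bar_t, noise):
    """Standard DDPM forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]

def noise_mse(noise, predicted_noise):
    """Squared L2 distance between true and predicted noise, as in Equation (5)."""
    return sum((n - p) ** 2 for n, p in zip(noise, predicted_noise))

def predict_x0(x_t, alpha_bar_t, predicted_noise):
    """Invert the forward process given a noise estimate."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * n) / a for x, n in zip(x_t, predicted_noise)]

rng = random.Random(0)                      # fixed seed for reproducibility
x0 = [0.2, -0.5, 0.9, 0.0]                  # toy clean latent (hypothetical values)
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # epsilon ~ N(0, I)
alpha_bar_t = 0.5                           # cumulative signal level at some timestep t

x_t = q_sample(x0, alpha_bar_t, noise)
loss_oracle = noise_mse(noise, noise)               # perfect prediction
loss_zero = noise_mse(noise, [0.0] * len(noise))    # uninformed prediction
x0_recovered = predict_x0(x_t, alpha_bar_t, noise)

print(loss_oracle, loss_zero)
```

A perfect noise estimate drives the loss to zero and recovers the clean latent exactly, which is why minimizing this term trains the denoiser.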

4. Experiments

4.1 Datasets

We evaluate our method on three China drone VSR datasets to cover both visible and thermal modalities:

  • UrbanDroneVSR: Derived from UAVDT, which contains 177 sequences (100 frames each) of urban scenarios (squares, highways, etc.) captured by a China drone. Following PhaseVSRnet, we crop to 640×480 and apply 4× bicubic downscaling.
  • VisDroneVSR: Built from the VisDrone dataset (collected by the AISKYEYE team in 14 cities across China). It includes 212 sequences (100 frames each) with more diverse object densities and weather conditions.
  • TIRDroneVSR: Constructed from LSOTB-TIR (thermal infrared object tracking), containing 59 training sequences (27,545 frames) and 15 test sequences (1,500 frames). All are cropped to 640×480 and 4× downscaled, representing typical China drone thermal imaging.

Table 1: Quantitative comparison on UrbanDroneVSR (visible-light China drone dataset). ↑/↓ indicates higher/lower is better. Bold: best; underline: second best.

| Method | PSNR(↑) | SSIM(↑) | LPIPS(↓) | DISTS(↓) | tLP(↓) | tOF(↓) |
|---|---|---|---|---|---|---|
| Bicubic | 22.81 | 0.604 | 0.637 | 0.260 | 30.60 | 0.777 |
| TDAN | 23.13 | 0.628 | 0.572 | 0.238 | 23.20 | 0.931 |
| EDVR-L | 25.13 | 0.738 | 0.328 | 0.179 | 18.08 | 0.485 |
| BasicVSR | 24.77 | 0.726 | 0.335 | 0.192 | 16.79 | 0.526 |
| BasicVSR++ | 24.58 | 0.718 | 0.351 | 0.199 | 16.91 | 0.523 |
| MSDTGP | 23.71 | 0.669 | 0.429 | 0.216 | 17.88 | 0.689 |
| RASVSR | 24.68 | 0.721 | 0.342 | 0.195 | 16.67 | 0.521 |
| LGTD | 24.63 | 0.719 | 0.352 | 0.190 | 16.96 | 0.540 |
| CSVSR | 24.49 | 0.713 | 0.368 | 0.202 | 19.00 | 0.532 |
| AnyTSR | 23.94 | 0.682 | 0.405 | 0.213 | 15.08 | 0.696 |
| AnyTSR++ | 23.40 | 0.652 | 0.525 | 0.227 | 14.28 | 0.742 |
| IBRN | 24.92 | 0.733 | 0.329 | 0.187 | 15.74 | 0.497 |
| StableVSR | 23.74 | 0.697 | 0.199 | 0.148 | 16.01 | 0.574 |
| 3D-UDiT (Ours) | 23.50 | 0.671 | 0.272 | 0.167 | 12.49 | 0.585 |

Table 2: Quantitative comparison on VisDroneVSR (China drone visible dataset).

| Method | PSNR(↑) | SSIM(↑) | LPIPS(↓) | DISTS(↓) | tLP(↓) | tOF(↓) |
|---|---|---|---|---|---|---|
| Bicubic | 26.17 | 0.727 | 0.492 | 0.224 | 21.60 | 0.823 |
| TDAN | 27.02 | 0.757 | 0.405 | 0.190 | 15.24 | 0.909 |
| EDVR-L | 29.62 | 0.827 | 0.245 | 0.145 | 12.67 | 0.578 |
| BasicVSR | 29.27 | 0.819 | 0.256 | 0.155 | 11.92 | 0.574 |
| BasicVSR++ | 29.10 | 0.816 | 0.265 | 0.159 | 10.35 | 0.569 |
| MSDTGP | 27.63 | 0.776 | 0.325 | 0.178 | 12.74 | 0.786 |
| RASVSR | 29.22 | 0.819 | 0.260 | 0.158 | 11.00 | 0.569 |
| LGTD | 28.07 | 0.790 | 0.309 | 0.166 | 12.20 | 0.689 |
| CSVSR | 28.99 | 0.814 | 0.272 | 0.162 | 12.97 | 0.602 |
| AnyTSR | 28.38 | 0.796 | 0.293 | 0.173 | 12.28 | 0.705 |
| AnyTSR++ | 28.34 | 0.795 | 0.300 | 0.172 | 12.50 | 0.713 |
| IBRN | 29.36 | 0.821 | 0.252 | 0.153 | 10.96 | 0.565 |
| StableVSR | 27.78 | 0.784 | 0.178 | 0.131 | 9.00 | 0.623 |
| 3D-UDiT (Ours) | 27.66 | 0.778 | 0.146 | 0.107 | 17.32 | 0.685 |

4.2 Implementation Details

We train all models on an NVIDIA RTX 4090 (24 GB). Input video clips consist of 8 consecutive frames (T=8). The VAE encoder compresses the LR frames to a latent scale of 1/8 of the original resolution, and the patch size is 2×2. We use the Adam optimizer with a learning rate of \(5\times10^{-5}\), a batch size of 4, and a total of 1 million training steps. The diffusion process employs 1000 timesteps with a cosine noise schedule, but at inference we use only 20 steps, which we found to achieve near-optimal quality. The U-DiT uses 3 down/up stages with channel dimensions [64, 128, 256], and the bottleneck contains 4 DiT blocks.
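The schedule and step reduction described above can be sketched as follows. The text does not specify which cosine schedule or which sampler is used, so this example assumes the widely used cosine form \(\bar{\alpha}_t = f(t)/f(0)\) with \(f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right)\) and offset \(s = 0.008\), together with evenly strided timesteps at inference; the function names are hypothetical.

```python
import math

def cosine_alpha_bar(t, total_steps=1000, s=0.008):
    """Cumulative signal level under the assumed cosine schedule, for t in [0, total_steps]."""
    def f(u):
        return math.cos((u / total_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def strided_timesteps(total_steps=1000, num_inference_steps=20):
    """Evenly spaced timesteps from most noisy to least noisy (assumed sampling strategy)."""
    stride = total_steps // num_inference_steps
    return list(range(total_steps - 1, -1, -stride))

timesteps = strided_timesteps(total_steps=1000, num_inference_steps=20)
alpha_bars = [cosine_alpha_bar(t) for t in timesteps]

print(timesteps[:3], timesteps[-3:])
print([round(a, 4) for a in alpha_bars[:3]], [round(a, 4) for a in alpha_bars[-3:]])
```

With a stride of 50, the denoiser is evaluated 20 times instead of 1000, visiting progressively less noisy timesteps as the signal level rises back toward 1.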

4.3 Comparison with State-of-the-Art

We compare 3D-UDiT with 12 VSR methods, covering discriminative (TDAN, EDVR-L, BasicVSR, BasicVSR++, MSDTGP, RASVSR, LGTD, CSVSR, IBRN), any-scale (AnyTSR, AnyTSR++), and generative (StableVSR) approaches, with bicubic interpolation as an additional baseline. On UrbanDroneVSR (Table 1), our method achieves the best tLP (12.49) among all methods and the second-best LPIPS (0.272) and DISTS (0.167), behind StableVSR (0.199 and 0.148). Although PSNR (23.50) and SSIM (0.671) are lower than those of discriminative methods such as EDVR-L (25.13 and 0.738), perceptual metrics are better aligned with human visual quality, which is critical for China drone downstream tasks such as target recognition. On the larger VisDroneVSR dataset (Table 2), our LPIPS of 0.146 and DISTS of 0.107 are the best among all methods, including StableVSR (LPIPS=0.178), an 18% relative reduction in LPIPS. Our tLP on VisDroneVSR (17.32) is, however, worse than that of StableVSR (9.00) and of every other learned method. Because tLP measures temporal fluctuation, we attribute this to subtle flickering introduced by the stochasticity of the diffusion process, which may be mitigated by increasing the number of inference steps or by ensembling. Nevertheless, the overall perceptual quality remains superior.

Table 3: Quantitative comparison on TIRDroneVSR (thermal infrared China drone dataset) + complexity analysis.

| Method | PSNR(↑) | SSIM(↑) | LPIPS(↓) | DISTS(↓) | tLP(↓) | tOF(↓) | GFLOPs | Params (M) | FPS | Mem (MB) |
|---|---|---|---|---|---|---|---|---|---|---|
| Bicubic | 37.18 | 0.955 | 0.163 | 0.126 | 11.62 | 7.487 | - | - | - | - |
| TDAN | 40.68 | 0.974 | 0.092 | 0.082 | 4.69 | 6.429 | 39.34 | 2.14 | 14.71 | 8.16 |
| EDVR-L | 42.74 | 0.980 | 0.078 | 0.076 | 6.84 | 6.341 | 154.27 | 3.15 | 8.33 | 12.02 |
| BasicVSR | 42.41 | 0.980 | 0.075 | 0.074 | 5.99 | 6.453 | 363.95 | 6.29 | 19.19 | 23.99 |
| BasicVSR++ | 42.16 | 0.979 | 0.080 | 0.076 | 6.25 | 6.164 | 372.97 | 7.03 | 14.91 | 26.82 |
| MSDTGP | 40.47 | 0.973 | 0.097 | 0.086 | 5.00 | 6.976 | 3599.00 | 14.14 | 5.09 | 53.94 |
| RASVSR | 41.80 | 0.977 | 0.085 | 0.080 | 5.95 | 6.702 | 274.72 | 5.24 | 18.87 | 19.99 |
| LGTD | 40.76 | 0.974 | 0.092 | 0.081 | 5.41 | 7.309 | 545.52 | 24.03 | 10.67 | 91.67 |
| CSVSR | 42.46 | 0.980 | 0.079 | 0.077 | 6.81 | 6.815 | 248.66 | 6.44 | 6.49 | 24.57 |
| AnyTSR | 41.04 | 0.974 | 0.087 | 0.079 | 5.27 | 7.39 | 566.75 | 1.87 | 6.79 | 7.12 |
| AnyTSR++ | 41.14 | 0.975 | 0.087 | 0.082 | 5.26 | 8.295 | 364.10 | 1.50 | 11.67 | 5.72 |
| IBRN | 42.35 | 0.981 | 0.071 | 0.072 | 5.69 | 6.393 | 1151.64 | 7.83 | 9.20 | 29.85 |
| StableVSR | 40.70 | 0.971 | 0.056 | 0.060 | 3.94 | 8.419 | 153.46 | 303.88 | 0.10 | 1159.21 |
| 3D-UDiT (Ours) | 39.27 | 0.966 | 0.066 | 0.069 | 6.31 | 7.004 | 61.57 | 254.32 | 0.57 | 970.15 |

On the thermal infrared dataset TIRDroneVSR (Table 3), 3D-UDiT achieves competitive perceptual quality (LPIPS=0.066, DISTS=0.069), second only to StableVSR. Its computational cost is substantially lower: 61.57 GFLOPs per frame vs. StableVSR's 153.46 GFLOPs (about 60% fewer), and its inference speed is 0.57 FPS vs. 0.10 FPS, an improvement of nearly 6×. This is because StableVSR requires optical flow estimation before each diffusion step, while ours processes tokens directly without flow. The parameter count (254.32M vs. 303.88M) and memory footprint (970.15 MB vs. 1159.21 MB) are also smaller than StableVSR's, although both diffusion models remain far larger and slower than the discriminative baselines. The lower FLOPs make our method a more practical diffusion-based candidate for China drone edge deployment, even though 0.57 FPS is still short of real-time operation.
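The efficiency figures quoted above follow directly from the two diffusion rows of Table 3. The short check below recomputes them; the dictionary layout and variable names are only for illustration.

```python
# Complexity figures copied from Table 3 (TIRDroneVSR).
stable_vsr = {"gflops": 153.46, "params_m": 303.88, "fps": 0.10, "mem_mb": 1159.21}
ours = {"gflops": 61.57, "params_m": 254.32, "fps": 0.57, "mem_mb": 970.15}

# Relative reductions (fraction of StableVSR's cost that is saved).
flops_reduction = 1.0 - ours["gflops"] / stable_vsr["gflops"]
param_reduction = 1.0 - ours["params_m"] / stable_vsr["params_m"]
mem_reduction = 1.0 - ours["mem_mb"] / stable_vsr["mem_mb"]

# Throughput ratio (how many times faster in frames per second).
speedup = ours["fps"] / stable_vsr["fps"]

print(f"GFLOPs reduction: {flops_reduction:.1%}")
print(f"Parameter reduction: {param_reduction:.1%}")
print(f"Memory reduction: {mem_reduction:.1%}")
print(f"Speedup: {speedup:.1f}x")
```

The GFLOPs reduction comes out at roughly 60% and the speedup at 5.7×, matching the "60%" and "nearly 6×" figures used throughout the paper.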

We also analyze robustness to varying motion magnitude and Gaussian blur on VisDroneVSR (Table 4). Our method achieves the lowest LPIPS under all four conditions, including high motion (LPIPS=0.162 vs. StableVSR's 0.177), which supports the effectiveness of STPE-3D for dynamic China drone scenes.

Table 4: Robustness analysis on VisDroneVSR under different motion magnitudes and Gaussian blur.

| Method | Low Motion PSNR | Low Motion LPIPS | Low Motion tLP | Medium Motion PSNR | Medium Motion LPIPS | Medium Motion tLP | High Motion PSNR | High Motion LPIPS | High Motion tLP | Gaussian Blur PSNR | Gaussian Blur LPIPS | Gaussian Blur tLP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TDAN | 28.14 | 0.376 | 11.06 | 25.39 | 0.411 | 15.01 | 27.25 | 0.429 | 19.62 | 27.02 | 0.405 | 15.24 |
| EDVR-L | 30.70 | 0.219 | 8.71 | 28.06 | 0.232 | 9.85 | 29.84 | 0.282 | 18.98 | 29.62 | 0.245 | 12.67 |
| BasicVSR | 29.70 | 0.247 | 9.93 | 27.03 | 0.261 | 9.79 | 28.92 | 0.307 | 20.21 | 28.63 | 0.272 | 13.52 |
| BasicVSR++ | 29.43 | 0.264 | 8.01 | 27.07 | 0.262 | 7.85 | 28.88 | 0.309 | 18.06 | 28.54 | 0.279 | 11.51 |
| MSDTGP | 28.32 | 0.312 | 10.99 | 25.68 | 0.332 | 13.01 | 27.58 | 0.364 | 17.64 | 27.28 | 0.336 | 13.93 |
| RASVSR | 29.54 | 0.258 | 8.46 | 27.14 | 0.258 | 8.83 | 28.97 | 0.304 | 19.01 | 28.63 | 0.274 | 12.29 |
| LGTD | 28.63 | 0.297 | 10.13 | 26.22 | 0.309 | 12.50 | 27.84 | 0.358 | 17.64 | 27.64 | 0.322 | 13.47 |
| CSVSR | 29.91 | 0.248 | 8.83 | 27.50 | 0.257 | 9.99 | 29.32 | 0.307 | 19.59 | 28.99 | 0.272 | 12.97 |
| AnyTSR | 28.86 | 0.283 | 10.60 | 26.23 | 0.307 | 12.13 | 28.11 | 0.347 | 17.26 | 27.82 | 0.313 | 13.40 |
| AnyTSR++ | 28.86 | 0.286 | 10.55 | 26.20 | 0.315 | 12.23 | 28.13 | 0.354 | 17.58 | 27.82 | 0.319 | 13.52 |
| IBRN | 29.69 | 0.247 | 9.15 | 27.14 | 0.257 | 9.11 | 28.94 | 0.303 | 18.50 | 28.67 | 0.270 | 12.44 |
| StableVSR | 28.91 | 0.162 | 10.54 | 20.26 | 0.117 | 7.15 | 27.60 | 0.177 | 6.23 | 27.52 | 0.164 | 8.52 |
| 3D-UDiT (Ours) | 29.76 | 0.123 | 9.55 | 20.55 | 0.109 | 8.36 | 28.16 | 0.162 | 7.45 | 28.14 | 0.141 | 8.99 |

4.4 Ablation Study

We perform step-wise ablation on UrbanDroneVSR to verify each component (Table 5). Starting from a baseline DiT (no STPE-3D, no U-DiT, no DBCAP), we incrementally add STPE-3D (Model-1), then U-DiT (Model-2), and finally DBCAP (Model-3 = full 3D-UDiT). Quantitative results show clear improvements in perceptual metrics.

Table 5: Ablation study on UrbanDroneVSR (China drone dataset).

| Model | STPE-3D | U-DiT | DBCAP | PSNR(↑) | SSIM(↑) | LPIPS(↓) | DISTS(↓) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (DiT) | - | - | - | 19.82 | 0.473 | 0.563 | 0.339 | 104.69 | 134.72 |
| Model-1 (+STPE-3D) | ✓ | - | - | 20.35 | 0.487 | 0.528 | 0.325 | 149.41 | 145.79 |
| Model-2 (+U-DiT) | ✓ | ✓ | - | 21.86 | 0.562 | 0.358 | 0.221 | 21.76 | 156.31 |
| Model-3 (Full) | ✓ | ✓ | ✓ | 23.50 | 0.671 | 0.272 | 0.167 | 61.57 | 254.32 |

Specifically, adding STPE-3D improves LPIPS by 6.2% (0.563→0.528) and reduces temporal artifacts. Introducing the U-DiT structure improves PSNR by 1.51 dB and LPIPS by 32.2% (0.528→0.358), while simultaneously reducing GFLOPs from 149.41 to 21.76 thanks to the multi-resolution design. Finally, DBCAP yields an additional LPIPS improvement of 24.0% (0.358→0.272) and lowers DISTS from 0.221 to 0.167, demonstrating the effectiveness of explicit spatial and cross-modal guidance. The dual-branch attention increases the parameter count (156.31M→254.32M) and GFLOPs (21.76→61.57), but the perceptual gain justifies the added complexity.
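The step-wise gains quoted above can be reproduced from the PSNR and LPIPS columns of Table 5. In the short check below, the relative LPIPS improvement is computed as (before − after) / before; the tuple layout and helper name are only for illustration.

```python
# (model, PSNR in dB, LPIPS) copied from Table 5.
ablation = [
    ("Baseline (DiT)", 19.82, 0.563),
    ("Model-1 (+STPE-3D)", 20.35, 0.528),
    ("Model-2 (+U-DiT)", 21.86, 0.358),
    ("Model-3 (Full)", 23.50, 0.272),
]

def relative_drop(before, after):
    """Relative reduction of a lower-is-better metric."""
    return (before - after) / before

# Compare each model with the one immediately before it.
steps = []
for prev, cur in zip(ablation, ablation[1:]):
    psnr_gain = round(cur[1] - prev[1], 2)
    lpips_gain_pct = round(100.0 * relative_drop(prev[2], cur[2]), 1)
    steps.append((cur[0], psnr_gain, lpips_gain_pct))

for name, psnr_gain, lpips_gain_pct in steps:
    print(f"{name}: PSNR +{psnr_gain:.2f} dB, LPIPS -{lpips_gain_pct:.1f}%")
```

The three LPIPS improvements come out at 6.2%, 32.2%, and 24.0%, and the PSNR gain from adding U-DiT at 1.51 dB, in agreement with the text.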

We also examine the impact of inference steps (Figure 6 in original paper). Our 3D-UDiT is robust to step count: using only 20 steps yields LPIPS of 0.152 on VisDroneVSR, comparable to 100 steps (0.146). This allows fast inference without sacrificing quality, critical for China drone real-time applications.

5. Conclusion

In this paper, we presented 3D-UDiT, a diffusion transformer framework tailored for China drone video super-resolution. By introducing STPE-3D for content-adaptive temporal encoding, U-DiT for multi-scale feature fusion, and DBCAP for dual-branch prior injection, our method achieves perceptual quality superior to state-of-the-art discriminative VSR methods and competitive with generative methods like StableVSR, while reducing computational costs by ~60% and improving inference speed by ~6×. Extensive experiments on visible and thermal infrared China drone datasets (UrbanDroneVSR, VisDroneVSR, TIRDroneVSR) demonstrate the robustness and efficiency of our approach. Future work will explore sparse attention mechanisms and motion-adaptive constraints to further reduce memory footprint and enhance temporal smoothness, making 3D-UDiT more suitable for deployment on onboard China drone processors.
