Multi-Stage Distillation for Incremental Detection of Time-Sensitive Targets in UAV Images

Surveying drones generate rapidly growing volumes of aerial imagery, which makes time-sensitive target detection in complex environments increasingly difficult. Current deep learning-based detectors suffer from catastrophic forgetting when they learn new target categories and from overfitting when they adapt to environmental interference such as occlusion and lighting variation. This work introduces a multi-stage distillation framework for incremental detection of time-sensitive targets in surveying UAV imagery; on the SIMD and MAR20 datasets it outperforms the compared incremental detectors.

Our architecture employs a teacher-student framework with three complementary modules. Wasserstein-based Inter-Class Distillation (WICD) mitigates catastrophic forgetting by quantifying distributional differences between categories. Prototype-Guided Intra-Class Consistency Distillation (PGICD) preserves intra-class invariance against environmental interference. Cross-head Adaptive Distillation (CAD) dynamically balances knowledge transfer between the classification and regression heads. For surveying drone applications, this approach enables continuous adaptation to new targets while maintaining high precision on previously learned military vehicles, aircraft, and ships under challenging conditions.

Technical Framework

The system processes surveying UAV images through a transformer-based backbone network. Multi-scale features from the Feature Pyramid Network (FPN) feed into three specialized distillation modules:

1. Wasserstein-based Inter-Class Distillation (WICD)

WICD addresses catastrophic forgetting by modeling feature distributions with Gaussian statistics and measuring the divergence between them with the Wasserstein distance. For class $i$ features $F_i$ and semantic queries $Q_i$, the mean and covariance are estimated as follows (shown for $F_i$):

$$F_{\mu} = \frac{1}{C \times H \times W} \sum_{k=1}^{C \times H \times W} F_i[k]$$
$$F_{\Sigma} = \frac{1}{C \times H \times W} \sum_{k=1}^{C \times H \times W} (F_i[k] - F_{\mu})(F_i[k] - F_{\mu})^T$$

For Gaussian distributions, the squared 2-Wasserstein distance between the teacher ($T$) and student ($S$) distributions has the closed form:

$$D_{W}^F = \underbrace{||\mu_T - \mu_S||^2}_{\text{Mean term}} + \underbrace{\text{tr}(\Sigma_T + \Sigma_S - 2(\Sigma_T^{1/2}\Sigma_S\Sigma_T^{1/2})^{1/2})}_{\text{Covariance term}}$$
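
As a minimal sketch of the two steps above, the snippet below estimates a mean and variance per feature dimension and then evaluates the closed-form distance. It assumes diagonal covariances, a simplification that is not stated in the text; under that assumption the covariance term reduces to a sum of squared differences of standard deviations. The function names `gaussian_stats` and `wasserstein2_sq_diag` are hypothetical.

```python
import math


def gaussian_stats(samples):
    """Return the per-dimension mean and variance of equal-length feature vectors."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n for d in range(dim)]
    return mean, var


def wasserstein2_sq_diag(mean_t, var_t, mean_s, var_s):
    """Squared 2-Wasserstein distance between two diagonal-covariance Gaussians.

    Assumption of this sketch: the covariances are diagonal, so
    tr(S_T + S_S - 2 (S_T^{1/2} S_S S_T^{1/2})^{1/2})
    becomes sum_d (sqrt(var_t[d]) - sqrt(var_s[d]))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_t, mean_s))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_t, var_s))
    return mean_term + cov_term


# Toy teacher and student features (two samples, two dimensions each).
teacher_mean, teacher_var = gaussian_stats([[0.0, 0.0], [2.0, 2.0]])
student_mean, student_var = gaussian_stats([[1.0, 0.0], [1.0, 4.0]])
distance = wasserstein2_sq_diag(teacher_mean, teacher_var, student_mean, student_var)
print(distance)  # mean term 1.0 + covariance term 2.0 = 3.0
```

With full covariance matrices the covariance term requires a matrix square root (for example via `scipy.linalg.sqrtm`) in place of the element-wise square roots used here.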

Table 1 compares the Wasserstein distance used in WICD with conventional distance metrics on the two surveying UAV datasets; it yields the highest AP on both:

Distance Metric	SIMD AP (%)	MAR20 AP (%)
Euclidean	65.3	56.8
Cosine	56.4	48.1
KL Divergence	62.1	53.2
Manhattan	58.2	50.4
Wasserstein (Ours)	69.2	58.3

2. Prototype-Guided Intra-Class Consistency Distillation (PGICD)

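PGICD combats overfitting by aligning teacher and student class prototypes using a Gaussian kernel similarity. For feature prototypes $p^F$ and semantic prototypes $p^Q$, the distance for class $i$ is given below for $p^F$; $d_i^Q$ in the loss is its counterpart for $p^Q$: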

$$d_i^F = 1 - \exp\left(-\frac{||p_i^{F,T} - p_i^{F,S}||^2}{2\sigma \cdot (p_i^{F,T} \cdot p_i^{F,S})}\right)$$

The consistency loss enforces intra-class stability critical for surveying drones operating in variable conditions:

$$\mathcal{L}_{PGICD} = \lambda_1 \sum_{i=1}^k (d_i^F)^2 + \lambda_2 \sum_{i=1}^k (d_i^Q)^2$$
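
The prototype distance and the consistency loss can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it assumes $d_i^Q$ uses the same formula as $d_i^F$, that prototypes are plain lists of floats, and that the teacher-student dot product in the denominator is positive. The names `prototype_distance` and `pgicd_loss` are hypothetical.

```python
import math


def prototype_distance(p_teacher, p_student, sigma=1.0):
    """d_i = 1 - exp(-||p_T - p_S||^2 / (2 * sigma * (p_T . p_S)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(p_teacher, p_student))
    dot = sum(a * b for a, b in zip(p_teacher, p_student))
    if dot <= 0.0:
        # Assumption of this sketch: the formula is only evaluated for
        # prototypes with a positive inner product.
        raise ValueError("prototype dot product must be positive")
    return 1.0 - math.exp(-sq_dist / (2.0 * sigma * dot))


def pgicd_loss(feat_pairs, query_pairs, lambda1=1.0, lambda2=1.0, sigma=1.0):
    """lambda1 * sum_i (d_i^F)^2 + lambda2 * sum_i (d_i^Q)^2.

    Each argument is a list of (teacher_prototype, student_prototype) pairs,
    one pair per class.
    """
    feat_term = sum(prototype_distance(t, s, sigma) ** 2 for t, s in feat_pairs)
    query_term = sum(prototype_distance(t, s, sigma) ** 2 for t, s in query_pairs)
    return lambda1 * feat_term + lambda2 * query_term


# Identical teacher and student prototypes give zero loss.
aligned = [([1.0, 2.0], [1.0, 2.0])]
print(pgicd_loss(aligned, aligned))  # 0.0
```

The distance is 0 when the student prototype matches the teacher and approaches 1 as the prototypes move apart, so the squared terms penalize drift within a class.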

3. Cross-head Adaptive Distillation (CAD)

CAD dynamically balances knowledge transfer between the classification and regression heads based on the current WICD and PGICD losses:

$$w_{cls} = \frac{1}{1 + \exp(-\alpha_{WICD} \cdot \mathcal{L}_{WICD} + \alpha_{PGICD} \cdot \mathcal{L}_{PGICD})}$$
$$w_{reg} = \frac{1}{1 + \exp(\beta_{WICD} \cdot \mathcal{L}_{WICD} - \beta_{PGICD} \cdot \mathcal{L}_{PGICD})}$$
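
To make the behaviour of the two weights concrete, here is a small sketch that evaluates them directly from the formulas above. The function name `cad_weights` and the default coefficient values of 1.0 are assumptions made for illustration; the text does not give the coefficient values.

```python
import math


def cad_weights(l_wicd, l_pgicd,
                alpha_wicd=1.0, alpha_pgicd=1.0,
                beta_wicd=1.0, beta_pgicd=1.0):
    """Return (w_cls, w_reg) as sigmoid-shaped functions of the two losses."""
    w_cls = 1.0 / (1.0 + math.exp(-alpha_wicd * l_wicd + alpha_pgicd * l_pgicd))
    w_reg = 1.0 / (1.0 + math.exp(beta_wicd * l_wicd - beta_pgicd * l_pgicd))
    return w_cls, w_reg


# Equal losses give an even split between the two heads.
print(cad_weights(0.0, 0.0))  # (0.5, 0.5)

# A larger WICD loss shifts weight towards the classification head.
w_cls, w_reg = cad_weights(2.0, 1.0)
print(w_cls > w_reg)  # True
```

When the $\alpha$ and $\beta$ coefficients are equal, the two exponents are negatives of each other, so $w_{cls} + w_{reg} = 1$ and the weights act as a soft switch between the two heads.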

The adaptive distillation loss becomes:

$$\mathcal{L}_{CAD} = \frac{w_{cls}}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \phi(r) \cdot \delta(p_{cls}^S(r), p_{cls}^T(r)) + \frac{w_{reg}}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \phi(r) \cdot \delta(p_{reg}^S(r), p_{reg}^T(r))$$
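
The loss above can be sketched as a weighted average over regions. The text does not define $\phi$ or $\delta$, so this example treats $\phi(r)$ as a scalar weight stored per region and $\delta$ as a squared-error distance; both are assumptions, as are the names `cad_loss` and `squared_error` and the dictionary layout of a region.

```python
def squared_error(student, teacher):
    """Hypothetical choice for delta: sum of squared differences."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher))


def cad_loss(regions, w_cls, w_reg, delta=squared_error):
    """Weighted classification and regression distillation averaged over regions.

    Each region is a dict with keys:
      "phi"   : scalar region weight (assumed meaning of phi(r))
      "cls_s" : student classification output, "cls_t": teacher counterpart
      "reg_s" : student regression output,     "reg_t": teacher counterpart
    """
    n = len(regions)
    cls_term = sum(r["phi"] * delta(r["cls_s"], r["cls_t"]) for r in regions) / n
    reg_term = sum(r["phi"] * delta(r["reg_s"], r["reg_t"]) for r in regions) / n
    return w_cls * cls_term + w_reg * reg_term


regions = [
    {"phi": 1.0, "cls_s": [0.5], "cls_t": [1.0], "reg_s": [0.0, 0.0], "reg_t": [1.0, 1.0]},
    {"phi": 0.5, "cls_s": [1.0], "cls_t": [1.0], "reg_s": [2.0], "reg_t": [0.0]},
]
print(cad_loss(regions, w_cls=0.5, w_reg=0.5))  # 0.5 * 0.125 + 0.5 * 2.0 = 1.0625
```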

Experimental Validation

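We evaluate on the SIMD (15 categories) and MAR20 (20 aircraft categories) surveying UAV datasets under incremental scenarios, where "8+7" denotes 8 categories learned first followed by 7 new categories, and "10+10" is defined likewise: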

Method	SIMD 8+7 AP (%)	MAR20 10+10 AP (%)
Faster ILOD	54.2	40.2
ERD	58.1	46.8
ABR-IOD	57.7	47.6
Efficient-IOD	66.7	53.7
Ours	70.8	60.2
Upper Bound	72.5	62.5

Our method comes within 1.7 AP points of the upper bound on SIMD and within 2.3 AP points on MAR20, the smallest gaps among the compared methods. Ablation studies confirm each module's contribution:

Components	Step 3 AP (%)
Baseline	56.2
+WICD	61.6
+WICD+PGICD	64.8
+WICD+CAD	65.3
Full Model	68.9
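
The reported numbers can be checked with a few lines of arithmetic. The snippet below only re-derives differences from the values in the two tables above; the variable and function names are illustrative.

```python
# Values copied from the comparison and ablation tables (AP, %).
ours = {"SIMD": 70.8, "MAR20": 60.2}
upper_bound = {"SIMD": 72.5, "MAR20": 62.5}
ablation = {
    "Baseline": 56.2,
    "+WICD": 61.6,
    "+WICD+PGICD": 64.8,
    "+WICD+CAD": 65.3,
    "Full Model": 68.9,
}


def gain(after, before):
    """AP improvement between two ablation rows, rounded to one decimal."""
    return round(ablation[after] - ablation[before], 1)


gaps = {name: round(upper_bound[name] - ours[name], 1) for name in ours}
print(gaps)                               # {'SIMD': 1.7, 'MAR20': 2.3}
print(gain("+WICD", "Baseline"))          # 5.4
print(gain("+WICD+PGICD", "+WICD"))       # 3.2
print(gain("+WICD+CAD", "+WICD"))         # 3.7
print(gain("Full Model", "+WICD"))        # 7.3
print(gain("Full Model", "Baseline"))     # 12.7
```

Adding PGICD and CAD together on top of WICD gives 7.3 points, slightly more than the 6.9 points obtained by summing their individual gains.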

Conclusion

Our multi-stage distillation framework improves incremental detection of time-sensitive targets for surveying drones by addressing catastrophic forgetting through Wasserstein-based distribution alignment and reducing overfitting through prototype-guided consistency. The adaptive knowledge transfer mechanism is designed to keep performance robust across complex operational scenarios involving occlusion, scale variation, and environmental interference. Future work will optimize deployment efficiency for real-time surveying UAV applications and extend the approach to video-based incremental detection.