In recent years, drone technology has become increasingly vital in low-altitude production and service scenarios such as logistics, surveillance, agricultural monitoring, and emergency communication. These applications demand real-time, low-power, and lightweight visual perception systems. However, Unmanned Aerial Vehicles (UAVs) are typically equipped with system-on-chip (SoC) or embedded AI chips with limited computational resources and battery life, which poses significant challenges for deploying complex deep learning models that incur high memory and computational overhead. Vision Transformers (ViTs) and their hybrid counterparts, which combine convolutional neural networks (CNNs) with self-attention mechanisms, have demonstrated superior performance in tasks such as image classification and object detection. Yet their large parameter counts and high computational costs hinder real-time inference on embedded platforms. To address this, we propose an efficient post-training quantization (PTQ) method tailored for CNN-Transformer hybrid models, enabling high-accuracy and fast inference on drone platforms without introducing additional operators.

The integration of drone technology into critical operations necessitates robust and efficient AI models. Unmanned Aerial Vehicles often operate in dynamic environments where latency and power consumption are paramount. Hybrid models, such as MobileViT and EfficientFormer, leverage the strengths of CNNs for local feature extraction and Transformers for global context, striking a balance between accuracy and efficiency. However, these models still face deployment bottlenecks due to their computational intensity. Quantization techniques reduce model size and accelerate inference by converting floating-point weights and activations to low-bit integers. While PTQ methods like FQ-ViT and PTQ4ViT have shown promise for standard Transformers, they struggle with hybrid architectures due to long-tailed activation distributions from depthwise separable convolutions. Our approach, named Outlier-Aware Quantization (OAQ), tackles this issue through activation noise compensation and adaptive difficulty migration, achieving near-lossless quantization with significant speedups on edge devices.
In this paper, we first review the evolution of model architectures and quantization techniques, highlighting the unique challenges posed by drone applications. We then detail our OAQ method, which optimizes activation distributions and transfers quantization difficulty to weights. Experimental results on ImageNet-1K demonstrate that OAQ maintains top-1 accuracy within 1% degradation for 8-bit quantization across various hybrid models, while inference speed on Jetson AGX Xavier improves by over 200% in some cases. Ablation studies confirm the contributions of each component, and activation distribution analysis visualizes the effectiveness of our approach. This work provides a practical solution for deploying advanced vision models on resource-constrained Unmanned Aerial Vehicle platforms, advancing the adoption of drone technology in real-world scenarios.
Related Work
The development of deep learning models has been driven by the need for efficiency in edge devices like drones. Initially, CNNs dominated computer vision with architectures like AlexNet, VGG, and ResNet, which excel at local feature extraction but lack long-range dependency modeling. The advent of Vision Transformers, such as ViT and Swin Transformer, introduced self-attention mechanisms for global context, achieving state-of-the-art results on large-scale datasets. However, their high computational complexity limits deployment on Unmanned Aerial Vehicles. To bridge this gap, hybrid models like MobileViT, EdgeNeXT, and EfficientFormer combine CNNs and Transformers, offering a trade-off between accuracy and efficiency. These models are designed for mobile and edge devices, making them suitable for drone technology applications where real-time processing is critical.
Quantization has emerged as a key technique for model compression. It can be categorized into quantization-aware training (QAT) and post-training quantization (PTQ). QAT involves fine-tuning with quantization simulated during training, preserving accuracy but requiring extensive data and computation. PTQ, on the other hand, directly quantizes pre-trained models without retraining, making it more practical for drone deployments where data access may be limited. Early PTQ methods, such as those by Jacob et al., focused on CNNs using integer arithmetic for efficient inference. For Transformers, FQ-ViT and PTQ4ViT introduced specialized techniques like rank preservation and dual uniform quantization to handle unique activation distributions. Recent work on hybrid models, including Q-HyViT and HyQ, addresses challenges like high dynamic range and outlier activations through bridge block reconstruction and distribution scaling. However, these methods often rely on custom operators or mixed precision, which may not be supported on all hardware. Our OAQ method avoids these limitations by using uniform quantization and mathematical equivalences, ensuring compatibility with standard edge platforms for Unmanned Aerial Vehicle operations.
Methodology
Our OAQ method targets the quantization of CNN-Transformer hybrid models for drone technology, focusing on mitigating the impact of outlier activations. The approach consists of two main components: Activation Noise Compensation (ANC) and Adaptive Difficulty Migration (ADM). These are optimized during a calibration phase and then integrated into the quantizer without altering the model structure. We assume a pre-trained model with weights and activations in floating-point format, and we aim to quantize them to 8-bit integers using asymmetric uniform quantization. The quantization of a floating-point value $X$ to a $b$-bit integer $Q(X|b)$ is defined as:
$$ Q(X|b) = \text{clip} \left( \left\lfloor \frac{X}{s} \right\rfloor + z_p, 0, 2^b - 1 \right) $$
where $s$ is the scale factor, $z_p$ is the zero-point, and $\text{clip}$ constrains values to the integer range. The scale and zero-point are determined from the minimum and maximum values of $X$:
$$ l = \min(X), \quad u = \max(X) $$
$$ s = \frac{u - l}{2^b - 1}, \quad z_p = \text{clip} \left( \left\lfloor -\frac{l}{s} \right\rfloor, 0, 2^b - 1 \right) $$
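For reference, this quantizer can be written compactly in PyTorch. The following is a minimal sketch of the formulas above; the function names are illustrative and not part of our released implementation:

```python
import torch

def quantize(x: torch.Tensor, b: int = 8):
    """Asymmetric uniform quantization of a tensor to b-bit integers,
    following the scale and zero-point formulas above."""
    l, u = x.min(), x.max()
    s = (u - l).clamp(min=1e-8) / (2 ** b - 1)                 # scale factor
    z_p = torch.clamp(torch.floor(-l / s), 0, 2 ** b - 1)      # zero-point
    q = torch.clamp(torch.floor(x / s) + z_p, 0, 2 ** b - 1)   # quantized integers
    return q, s, z_p

def dequantize(q: torch.Tensor, s: torch.Tensor, z_p: torch.Tensor) -> torch.Tensor:
    """Map quantized integers back to floating point for error analysis."""
    return (q - z_p) * s
```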
In hybrid models, depthwise separable convolutions produce activations with long-tailed distributions, where most values are near zero but extremes stretch the dynamic range. This leads to significant quantization error if not handled properly.
Activation Noise Compensation (ANC)
ANC applies a per-channel noise vector $\beta$ to the input activations $X$ of a convolution layer, which smooths their distribution and suppresses outliers. For a convolution with weight $W \in \mathbb{R}^{C_{out} \times C_{in}}$, bias $B \in \mathbb{R}^{C_{out}}$, and input $X$, the output $Y$ is rewritten as:
$$ Y = W X + B = W (X - \beta) + (B + W \beta) $$
Here, $\beta \in \mathbb{R}^{C_{in}}$ is a per-channel noise vector initialized as the mean activation of each channel:
$$ \beta_i = \mu_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij} $$
where $x_{ij}$ is the $j$-th activation in the $i$-th channel and $N$ is the number of activations collected per channel during calibration. During calibration, $\beta$ is optimized to minimize a loss function that combines quantization error and distribution smoothness. The quantization error loss $L_{\text{quant}}(\beta)$ measures the difference between the quantized outputs with and without the noise compensation:
$$ L_{\text{quant}}(\beta) = \frac{1}{N} \sum_{j=1}^{N} \left| Q\left(Y_{\text{noisy}}^{(j)}\right) - Q\left(Y_{\text{orig}}^{(j)}\right) \right|^2 $$
The smoothness loss $L_{\text{smooth}}(\beta)$ controls the dynamic range and variance of activations:
$$ L_{\text{smooth}}(\beta) = \lambda_1 \cdot (X_{\max} - X_{\min}) + \lambda_2 \cdot \sum_{i=1}^{C_{in}} \sigma_i^2 $$
where $\lambda_1$ and $\lambda_2$ are balancing hyperparameters, $X_{\max} - X_{\min}$ is the dynamic range of the activations, and $\sigma_i^2$ is the variance of the activations in channel $i$. The total loss is:
$$ L_{\text{total}}(\beta) = L_{\text{quant}}(\beta) + L_{\text{smooth}}(\beta) $$
After optimization, $\beta$ is fixed and folded into the quantizer, and the layer bias is updated to $B + W\beta$. This makes the activations more quantization-friendly without adding any computational overhead during inference on Unmanned Aerial Vehicle platforms.
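To make the calibration procedure concrete, the following is a minimal sketch of ANC for a single layer. The convolution is modeled as a matrix product over flattened calibration activations, and a straight-through estimator keeps $\beta$ trainable; the helper names (fake_quant, anc_calibrate) and hyperparameter defaults are illustrative assumptions rather than our exact implementation:

```python
import torch

def fake_quant(x: torch.Tensor, b: int = 8) -> torch.Tensor:
    """Quantize-dequantize with a straight-through estimator (STE) so that
    gradients still reach the calibration parameters."""
    l, u = x.min().detach(), x.max().detach()
    s = (u - l).clamp(min=1e-8) / (2 ** b - 1)
    z = torch.clamp(torch.floor(-l / s), 0, 2 ** b - 1)
    q = torch.clamp(torch.floor(x / s) + z, 0, 2 ** b - 1)
    x_q = (q - z) * s                        # dequantized value
    return x + (x_q - x).detach()            # STE: quantized forward, identity backward

def anc_calibrate(W, B, X, steps=100, lr=1e-3, lam1=0.1, lam2=0.1):
    """Optimize the ANC vector beta for one layer. X holds calibration
    activations of shape (N, C_in); W has shape (C_in, C_out); B has shape (C_out,).
    lam1/lam2 stand for lambda_1/lambda_2, and their defaults are placeholders."""
    beta = X.mean(dim=0).clone().requires_grad_(True)        # init: per-channel mean
    y_ref = fake_quant(X) @ fake_quant(W) + B                # quantized output without noise
    opt = torch.optim.Adam([beta], lr=lr)
    for _ in range(steps):
        x_s = X - beta                                       # shifted activations
        y = fake_quant(x_s) @ fake_quant(W) + (B + beta @ W) # bias compensation term
        l_quant = (y - y_ref).pow(2).mean()                  # quantization error loss
        l_smooth = lam1 * (x_s.max() - x_s.min()) + lam2 * x_s.var(dim=0).sum()
        loss = l_quant + l_smooth
        opt.zero_grad(); loss.backward(); opt.step()
    return beta.detach()                                     # folded into quantizer and bias
```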
Adaptive Difficulty Migration (ADM)
ADM shifts the quantization difficulty from activations to weights by applying a migration factor $\delta$. For the same convolution layer, the operation becomes:
$$ Y = W X + B = \big(W \operatorname{diag}(\delta)\big)\left( \operatorname{diag}(\delta)^{-1} X \right) + B $$
where $\delta \in \mathbb{R}^{C_{in}}$ is a per-channel factor initialized as the standard deviation of the activations in each channel:
$$ \delta_i = \sigma_{X_i} = \sqrt{ \frac{1}{N} \sum_{j=1}^{N} (x_{ij} - \mu_i)^2 } $$
$\delta$ is optimized to minimize the mean squared error (MSE) between the quantized output $\hat{Y}$ and the original output $Y$:
$$ L_{\text{quant}} = \mathbb{E}\left[ (\hat{Y} – Y)^2 \right] $$
This allows activations to have a reduced dynamic range, making them easier to quantize, while weights absorb the complexity. Similar to ANC, ADM parameters are fixed after calibration and integrated into the quantizer. The combination of ANC and ADM ensures that activations are well-behaved for quantization, crucial for maintaining accuracy in drone technology applications where model robustness is essential.
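A corresponding sketch of ADM calibration for a single layer is given below, reusing the fake_quant helper from the ANC sketch; again the layer is modeled as a matrix product, and the function name and defaults are illustrative assumptions:

```python
import torch

def adm_calibrate(W, B, X, steps=100, lr=1e-3):
    """Optimize the ADM migration factor delta for one layer, reusing the
    fake_quant STE helper from the ANC sketch; X is (N, C_in), W is (C_in, C_out)."""
    delta = X.std(dim=0).clamp(min=1e-4).clone().requires_grad_(True)  # init: per-channel std
    y_fp = X @ W + B                                     # full-precision reference output
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_m = X / delta                                  # activations with reduced dynamic range
        w_m = W * delta.unsqueeze(1)                     # weights absorb the per-channel scale
        y_q = fake_quant(x_m) @ fake_quant(w_m) + B      # quantized output
        loss = (y_q - y_fp).pow(2).mean()                # MSE against the original output
        opt.zero_grad(); loss.backward(); opt.step()
    return delta.detach()                                # fixed and folded into the quantizer
```

In both routines, the optimized vectors are fixed after calibration and absorbed into the quantization parameters, weights, and bias, so no extra operators appear in the deployed INT8 graph.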
Experiments and Analysis
We evaluate OAQ on the ImageNet-1K dataset, which includes 1.2 million training images and 50,000 validation images across 1,000 classes. For calibration, we use 1,000 unlabeled images from the training set. We test on pre-trained CNN-Transformer hybrid models from the timm library and open-source repositories, including MobileViT, MobileViTv2, and EfficientFormer. The baseline accuracy is measured using full-precision (FP32) models. Our experiments focus on 8-bit uniform quantization for both weights and activations, including all linear and convolutional layers. We compare OAQ with state-of-the-art PTQ methods: FQ-ViT, PTQ4ViT, Q-HyViT, HyQ, EasyQuant, and RepQ-ViT. All experiments are conducted on an Ubuntu 22.04 system with an Intel i9-13900K CPU and NVIDIA RTX 4090 GPU, using PyTorch 2.5.0 and CUDA 12.4. The learning rate for calibration is set to 0.001, and we use layer-wise quantization for linear modules and channel-wise for normalization layers.
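For reproducibility, the calibration setup can be assembled in a few lines. The sketch below is illustrative only (the model name, dataset path, and sampling are placeholder assumptions); it loads a pre-trained hybrid model from timm and draws 1,000 unlabeled calibration images from the ImageNet-1K training split:

```python
import timm
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets

# Hypothetical calibration setup: model name and dataset path are placeholders.
model = timm.create_model("mobilevit_s", pretrained=True).eval()
data_cfg = timm.data.resolve_data_config({}, model=model)
preprocess = timm.data.create_transform(**data_cfg)

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=preprocess)
calib_idx = torch.randperm(len(train_set))[:1000]          # 1,000 unlabeled calibration images
calib_loader = DataLoader(Subset(train_set, calib_idx), batch_size=32, num_workers=4)

# Per-layer activations are then collected with forward hooks and fed to the
# ANC/ADM calibration routines (learning rate 1e-3) before the 8-bit quantizers
# are finalized for deployment.
```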
Performance Comparison
Tables 1, 2, and 3 summarize the top-1 accuracy after quantization for MobileViT, MobileViTv2, and EfficientFormer models, respectively. OAQ consistently outperforms other methods, with accuracy drops of less than 1% in most cases. For example, on MobileViT-s, OAQ achieves 78.09% accuracy compared to the FP32 baseline of 78.4%, while RepQ-ViT drops to 50.01%. This demonstrates the effectiveness of our approach in handling hybrid architectures. On EfficientFormer models, OAQ maintains high accuracy, with EfficientFormer-L1 at 80.09% (baseline 80.20%), showcasing its robustness across different model sizes. These results are critical for drone technology, where reliable performance is necessary for tasks like object detection in Unmanned Aerial Vehicle operations.

Table 1. Top-1 accuracy (%) of PTQ methods on MobileViT models (ImageNet-1K).

| Method | Weight/Activation (bits) | MobileViT-xxs | MobileViT-xs | MobileViT-s |
|---|---|---|---|---|
| Full Precision | 32/32 | 69.00 | 74.80 | 78.40 |
| FQ-ViT | 8/8 | 66.46 | 68.28 | 77.67 |
| PTQ4ViT | 8/8 | 37.75 | 65.52 | 68.19 |
| RepQ-ViT | 8/8 | 1.85 | 41.96 | 50.01 |
| Q-HyViT | 8/8 | 67.20 | 73.89 | 77.72 |
| HyQ | 8/8 | 68.15 | 73.99 | 77.93 |
| EasyQuant | 8/8 | 36.13 | 73.16 | 74.21 |
| OAQ (Ours) | 8/8 | 68.65 | 74.35 | 78.09 |

Table 2. Top-1 accuracy (%) of PTQ methods on MobileViTv2 models (ImageNet-1K).

| Method | Weight/Activation (bits) | MobileViTv2-50 | MobileViTv2-75 | MobileViTv2-100 |
|---|---|---|---|---|
| Full Precision | 32/32 | 70.20 | 75.60 | 78.10 |
| FQ-ViT | 8/8 | 67.66 | 69.56 | 77.15 |
| PTQ4ViT | 8/8 | 39.39 | 65.54 | 51.02 |
| RepQ-ViT | 8/8 | 26.60 | 55.52 | 40.85 |
| Q-HyViT | 8/8 | 69.89 | 75.29 | 77.63 |
| HyQ | 8/8 | 69.16 | 74.47 | 76.63 |
| EasyQuant | 8/8 | 66.80 | 62.91 | 69.34 |
| OAQ (Ours) | 8/8 | 69.97 | 75.31 | 77.86 |

Table 3. Top-1 accuracy (%) of PTQ methods on EfficientFormer models (ImageNet-1K).

| Method | Weight/Activation (bits) | EfficientFormer-L1 | EfficientFormer-L3 | EfficientFormer-L7 |
|---|---|---|---|---|
| Full Precision | 32/32 | 80.20 | 82.40 | 83.30 |
| FQ-ViT | 8/8 | 66.63 | 81.85 | 82.47 |
| PTQ4ViT | 8/8 | 79.14 | 80.31 | 81.56 |
| RepQ-ViT | 8/8 | 79.67 | 81.98 | 82.47 |
| HyQ | 8/8 | 78.55 | 82.26 | 82.66 |
| EasyQuant | 8/8 | 78.42 | 81.03 | 82.27 |
| OAQ (Ours) | 8/8 | 80.09 | 82.14 | 83.15 |
Inference speed is a critical metric for drone technology. We measure the frame rate on Jetson AGX Xavier for FP32 and INT8 models. As shown in Figure 1, quantization with OAQ significantly boosts inference speed, with improvements exceeding 200% for larger models like EfficientFormer-L7. This acceleration enables real-time processing on Unmanned Aerial Vehicles, enhancing their capability in time-sensitive applications.
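As a reference for how frame rate can be measured, the sketch below times repeated single-image forward passes on a CUDA device. It illustrates the measurement protocol only; the deployed INT8 models on Jetson AGX Xavier would typically run through an inference engine such as TensorRT:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, size: int = 224, iters: int = 200, warmup: int = 20) -> float:
    """Rough single-image frame-rate measurement on a CUDA device."""
    model = model.cuda().eval()
    x = torch.randn(1, 3, size, size, device="cuda")
    for _ in range(warmup):          # warm-up iterations to stabilize clocks and caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)   # frames per second
```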
Ablation Studies
We conduct ablation experiments on MobileViT and MobileViTv2 to evaluate the individual contributions of ANC and ADM. The baseline is quantization without these components. Results in Figure 2 indicate that removing ANC leads to accuracy drops of up to 6.94% on MobileViT-xxs, while removing ADM causes drops of up to 6.63% on MobileViT-xs. This underscores the importance of both components. The synergy between ANC and ADM is evident, as ANC smooths activations, allowing ADM to better migrate quantization difficulty. For instance, on MobileViT-s, the combined approach achieves 78.09% accuracy, whereas removing ANC or ADM reduces it to 75.41% and 77.33%, respectively. These findings validate our design choices for drone applications, where model efficiency and accuracy are balanced.
Activation Distribution Analysis
We analyze activation distributions in EfficientFormer to visualize the impact of OAQ. Figure 3 shows histograms of the convolutional inputs at intermediate and final stages. The original activations exhibit long tails with outliers, visible as wide distributions. After applying ANC, the distributions become smoother, with values more concentrated near zero. With ADM added, the activations are compressed further and the extremes are eliminated. This transformation reduces quantization error and improves model stability, which is essential for reliable performance in Unmanned Aerial Vehicle systems operating under varying conditions.
Conclusion
We present OAQ, a post-training quantization method for CNN-Transformer hybrid models that addresses outlier activations through noise compensation and difficulty migration. This approach achieves high accuracy with minimal degradation and significant inference speedups on edge devices. Experiments on multiple models demonstrate OAQ’s superiority over existing methods, making it suitable for drone technology deployments. Future work will explore lower-bit quantization and adaptation to other model architectures for broader applications in Unmanned Aerial Vehicle platforms.
