In recent years, drone technology has revolutionized low-altitude applications such as logistics, surveillance, agricultural monitoring, and emergency communication. Unmanned Aerial Vehicles (UAVs) are increasingly deployed in these scenarios owing to their flexibility and efficiency. However, the computational resources on UAV platforms are often limited, as they typically rely on system-on-chip (SoC) or embedded AI hardware with constrained memory and power budgets. This poses significant challenges for deploying deep learning models, which demand substantial compute and memory for real-time inference. Visual models, particularly those based on convolutional neural networks (CNNs) and transformers, have demonstrated exceptional performance in tasks such as image classification and object detection, yet their large parameter counts and complex operations hinder efficient deployment on edge devices such as drones. To address this, model compression techniques such as quantization have emerged as critical solutions. Quantization reduces the precision of model weights and activations from floating point to lower-bit integers, thereby decreasing storage and computational demands. While post-training quantization (PTQ) methods have shown promise for standard models, they struggle with hybrid CNN-Transformer architectures, where the long-tailed distribution of activation values leads to significant accuracy degradation. In this work, we propose an efficient PTQ method, termed Outlier-Aware Quantization (OAQ), which incorporates activation noise compensation and adaptive difficulty migration to mitigate the impact of outliers and improve quantization robustness. Our approach achieves near-lossless 8-bit quantization, with an accuracy drop of less than 1% across various models, while improving inference speed by over 200% on embedded platforms. This advancement supports the deployment of advanced vision models on UAVs, enabling real-time performance in resource-constrained environments.

The rapid adoption of drone technology in low-altitude economies has intensified the demand for efficient visual perception systems. Unmanned Aerial Vehicles must perform tasks like object recognition and scene analysis with high accuracy and low latency, despite operating on limited battery life and computational resources. Deep learning models, especially hybrid architectures that combine CNNs and transformers, offer a balance between local feature extraction and global context modeling. For instance, MobileViT and EfficientFormer integrate convolutional layers with self-attention mechanisms to achieve state-of-the-art performance while maintaining efficiency. However, these models still face challenges in edge deployment due to their substantial memory and computational requirements. Quantization addresses these issues by converting high-precision floating-point numbers to integers, reducing model size and accelerating inference. Traditional PTQ methods, such as FQ-ViT and PTQ4ViT, have been developed for pure transformer models but are inadequate for hybrid architectures. The presence of depthwise separable convolutions in these hybrids results in activation distributions with long tails, where a small number of extreme values dominate the dynamic range. This causes quantization errors, as the limited bit-width cannot adequately represent the wide range of values. Our OAQ method tackles this by introducing noise compensation to smooth activation distributions and migrating quantization difficulty to weights, which are more amenable to compression. Through extensive experiments, we demonstrate that OAQ outperforms existing methods in accuracy and speed, making it suitable for UAV applications where real-time processing is critical.
The demands of drone applications have also driven the evolution of model architectures toward lightweight designs. Early CNN-based models such as AlexNet and ResNet excelled at local feature extraction but struggled with long-range dependencies. Vision Transformers (ViTs) addressed this by leveraging self-attention for global context, as seen in models like DeiT and Swin Transformer; however, their high computational complexity limits their use on UAV platforms. Hybrid models, such as MobileViT and EdgeNeXt, combine the strengths of CNNs and transformers to achieve efficiency without compromising accuracy. These architectures use depthwise separable convolutions for local processing and lightweight self-attention modules for global reasoning, reducing overall complexity. Despite these advancements, the irregular activation distributions in hybrids pose challenges for quantization. For example, depthwise convolutions produce activations with high inter-channel variance, leading to outliers that distort quantization scales. This issue is exacerbated in UAV deployments, where models must handle diverse environmental conditions. Existing PTQ methods often require specialized operators or structural modifications, increasing deployment complexity. In contrast, our OAQ approach maintains the original network structure and leverages mathematical equivalences to optimize quantization, ensuring compatibility with standard hardware. This is particularly important for drone technology, as it allows seamless integration into existing UAV systems without additional overhead.
Quantization techniques have been extensively studied for deep learning models. Post-training quantization is preferred for its simplicity, as it does not require retraining. Uniform quantization, defined as $$Q(X|b) = \text{clip}\left(\left\lfloor \frac{X}{s} \right\rceil + z_p,\ 0,\ 2^b - 1\right)$$ where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer, $s$ is the scale factor, and $z_p$ is the zero point, is widely supported by hardware. Methods like AdaRound improve weight quantization by optimizing rounding strategies, while FQ-ViT and PTQ4ViT focus on vision transformers. However, these methods are not tailored for hybrid models, where activation outliers cause significant accuracy loss. Q-HyViT and HyQ propose solutions for hybrid architectures but introduce custom operators that hinder deployment on UAV platforms. Our OAQ method avoids this by using activation noise compensation and adaptive difficulty migration, which are applied during calibration without altering the network. The noise compensation adjusts activations to reduce outliers, while difficulty migration shifts the quantization burden to weights. This dual strategy ensures robust performance across various models, as validated in our experiments. The growing importance of drone technology underscores the need for such efficient quantization methods, enabling Unmanned Aerial Vehicles to perform complex visual tasks in real time.
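For concreteness, the following is a minimal NumPy sketch of this asymmetric uniform quantizer, using min/max calibration to pick the scale and zero point; the calibration rule, tensor shapes, and function names are illustrative assumptions rather than the exact procedure used in our implementation.

```python
import numpy as np

def uniform_quantize(x, bits=8):
    # Derive scale s and zero point z_p from the tensor's min/max
    # (min/max calibration is one common choice; other rules exist).
    qmax = 2 ** bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / qmax if x_max > x_min else 1.0
    z_p = int(round(-x_min / s))
    # Q(X|b) = clip(round(X / s) + z_p, 0, 2^b - 1)
    q = np.clip(np.round(x / s) + z_p, 0, qmax).astype(np.uint8)
    return q, s, z_p

def dequantize(q, s, z_p):
    # Map integers back to approximate floats: X_hat = (Q - z_p) * s
    return (q.astype(np.float32) - z_p) * s

# Round-trip example on a random activation tensor.
x = np.random.randn(4, 16).astype(np.float32)
q, s, z_p = uniform_quantize(x, bits=8)
max_err = np.abs(dequantize(q, s, z_p) - x).max()
print(f"max round-trip error: {max_err:.4f}")
```

Because the quantization grid must span the full min-to-max range, a handful of outliers inflates $s$ and wastes most of the 256 available 8-bit levels on rarely occurring values, which is exactly the failure mode OAQ targets.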
Our OAQ method consists of two key components: activation noise compensation (ANC) and adaptive difficulty migration (ADM). The ANC module addresses the long-tailed activation distributions by subtracting a per-channel compensation offset from the activations during the calibration phase. For a convolutional layer with input activations $X \in \mathbb{R}^{C_{in} \times H \times W}$ and weights $W \in \mathbb{R}^{C_{out} \times C_{in}}$, the output can be rewritten as: $$Y = WX + B = W(X - \beta) + (B + W\beta)$$ where $\beta \in \mathbb{R}^{C_{in}}$ is the noise compensation vector, broadcast over spatial positions. Initially, $\beta$ is set to the per-channel mean of the activations: $$\beta_i = \mu_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij}$$ where $x_{ij}$ is the $j$-th activation in the $i$-th channel and $N$ is the number of activations per channel. This initialization aligns the compensation with the activation distribution. During optimization, we minimize a loss function that combines quantization error and distribution smoothness: $$L_{\text{total}}(\beta) = L_{\text{quant}}(\beta) + L_{\text{smooth}}(\beta)$$ where $$L_{\text{quant}}(\beta) = \frac{1}{N} \sum_{j=1}^{N} \left| Q\left(Y_{\text{noisy}}^{(j)}\right) - Q\left(Y_{\text{orig}}^{(j)}\right) \right|^2$$ and $$L_{\text{smooth}}(\beta) = \lambda_1 \cdot |X_{\text{max}} - X_{\text{min}}| + \lambda_2 \cdot \sigma_i^2$$ Here, $\sigma_i^2$ is the per-channel variance of the compensated activations, and $\lambda_1$ and $\lambda_2$ are hyperparameters controlling the trade-off between dynamic-range and variance reduction. The ANC module effectively flattens the activation distribution, reducing the impact of outliers and improving quantization accuracy.
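To make the equivalence behind ANC concrete, here is a small NumPy sketch for a layer written as a matrix product over channels; the shapes and names are illustrative, only the per-channel mean initialization of $\beta$ is shown, and the subsequent optimization of $L_{\text{total}}$ is omitted.

```python
import numpy as np

def apply_anc(W, B, X, beta=None):
    # W: (C_out, C_in) weights, B: (C_out,) bias, X: (C_in, N) calibration activations.
    # Shift the activations by a per-channel offset beta and fold the shift into the
    # bias, so that W @ X + B == W @ (X - beta) + (B + W @ beta) holds exactly.
    if beta is None:
        beta = X.mean(axis=1, keepdims=True)   # beta_i = mu_i (per-channel mean init)
    X_comp = X - beta                          # compensated activations (to be quantized)
    B_comp = B + (W @ beta).ravel()            # bias absorbs W @ beta
    return X_comp, B_comp

# Equivalence check: the full-precision layer output is unchanged.
rng = np.random.default_rng(0)
C_in, C_out, N = 8, 4, 32
W, B = rng.normal(size=(C_out, C_in)), rng.normal(size=C_out)
X = rng.normal(size=(C_in, N)) + 5.0           # offset mimics a skewed channel distribution
X_comp, B_comp = apply_anc(W, B, X)
assert np.allclose(W @ X + B[:, None], W @ X_comp + B_comp[:, None])
```

Since the compensated activations are centered per channel, their dynamic range is smaller, so the same 8-bit grid represents them with finer resolution.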
The ADM module further enhances quantization by transferring the difficulty from activations to weights. For the same convolutional layer, we apply a per-input-channel difficulty migration factor $\delta \in \mathbb{R}^{C_{in}}$ to the weights and activations: $$Y = WX + B = (W\delta) \left(\frac{X}{\delta}\right) + B$$ The factor $\delta$ is initialized as the per-channel standard deviation of the activations: $$\delta_i = \sigma_{x_i} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_{ij} - \mu_{x_i})^2}$$ This initialization captures the variability in activations, allowing ADM to adaptively adjust the quantization difficulty. The optimization minimizes the mean squared error between quantized and original outputs: $$L_{\text{quant}} = \mathbb{E}\left[(\hat{Y} - Y)^2\right]$$ where $\hat{Y}$ is the quantized output. By migrating the quantization challenge to weights, which have a more uniform distribution, ADM simplifies the quantization of activations. This approach is particularly beneficial for drone technology, as it maintains the model structure and avoids the need for custom hardware support.
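A matching sketch of the ADM rescaling, under the same simplified matrix-product view of a layer; $\delta$ is shown only with its standard-deviation initialization, and the MSE optimization over $\delta$ is omitted.

```python
import numpy as np

def apply_adm(W, X, delta=None, eps=1e-8):
    # W: (C_out, C_in), X: (C_in, N) calibration activations.
    # Divide each activation channel by delta and multiply the matching weight column
    # by delta, so (W * delta) @ (X / delta) == W @ X holds exactly while the
    # activation dynamic range shrinks.
    if delta is None:
        delta = X.std(axis=1) + eps            # delta_i = sigma_{x_i} (per-channel std init)
    W_mig = W * delta[None, :]                 # quantization difficulty migrated into weights
    X_mig = X / delta[:, None]                 # easier-to-quantize activations
    return W_mig, X_mig

# Equivalence check on activations with high inter-channel variance.
rng = np.random.default_rng(0)
C_in, C_out, N = 8, 4, 32
W = rng.normal(size=(C_out, C_in))
X = rng.normal(size=(C_in, N)) * np.linspace(0.1, 10.0, C_in)[:, None]
W_mig, X_mig = apply_adm(W, X)
assert np.allclose(W @ X, W_mig @ X_mig)
```

After migration, each activation channel has roughly unit scale, while the rescaled weights remain comparatively easy to quantize, consistent with the burden shift described above.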
We evaluate OAQ on multiple hybrid models, including MobileViT, MobileViTv2, and EfficientFormer, using the ImageNet-1K dataset. The dataset comprises 1.2 million training images and 50,000 validation images across 1,000 classes. For calibration, we select 1,000 unlabeled images from the training set. Our experiments focus on 8-bit uniform quantization, comparing OAQ with state-of-the-art methods including FQ-ViT, PTQ4ViT, Q-HyViT, HyQ, EasyQuant, and RepQ-ViT. The results demonstrate that OAQ achieves superior accuracy with minimal degradation. For instance, on MobileViT-s, OAQ attains a Top-1 accuracy of 78.09%, compared to the full-precision baseline of 78.40%, a drop of only 0.31 percentage points. In contrast, other methods show significant declines, such as RepQ-ViT at 50.01% accuracy. The table below summarizes the quantization results (Top-1 accuracy, %) for MobileViT models:
| Method | Weight/Activation | MobileViT-xxs | MobileViT-xs | MobileViT-s |
|---|---|---|---|---|
| Full Precision | 32/32 | 69.00 | 74.80 | 78.40 |
| FQ-ViT | 8/8 | 66.46 | 68.28 | 77.67 |
| PTQ4ViT | 8/8 | 37.75 | 65.52 | 68.19 |
| RepQ-ViT | 8/8 | 1.85 | 41.96 | 50.01 |
| Q-HyViT | 8/8 | 67.20 | 73.89 | 77.72 |
| HyQ | 8/8 | 68.15 | 73.99 | 77.93 |
| EasyQuant | 8/8 | 36.13 | 73.16 | 74.21 |
| OAQ (Ours) | 8/8 | 68.65 | 74.35 | 78.09 |
Similarly, for MobileViTv2 models, OAQ maintains high accuracy, as shown in the following table (Top-1 accuracy, %):
| Method | Weight/Activation | MobileViTv2-50 | MobileViTv2-75 | MobileViTv2-100 |
|---|---|---|---|---|
| Full Precision | 32/32 | 70.20 | 75.60 | 78.10 |
| FQ-ViT | 8/8 | 67.66 | 69.56 | 77.15 |
| PTQ4ViT | 8/8 | 39.39 | 65.54 | 51.02 |
| RepQ-ViT | 8/8 | 26.60 | 55.52 | 40.85 |
| Q-HyViT | 8/8 | 69.89 | 75.29 | 77.63 |
| HyQ | 8/8 | 69.16 | 74.47 | 76.63 |
| EasyQuant | 8/8 | 66.80 | 62.91 | 69.34 |
| OAQ (Ours) | 8/8 | 69.97 | 75.31 | 77.86 |
For EfficientFormer models, OAQ consistently outperforms other methods, as detailed below (Top-1 accuracy, %):
| Method | Weight/Activation | EfficientFormer-L1 | EfficientFormer-L3 | EfficientFormer-L7 |
|---|---|---|---|---|
| Full Precision | 32/32 | 80.20 | 82.40 | 83.30 |
| FQ-ViT | 8/8 | 66.63 | 81.85 | 82.47 |
| PTQ4ViT | 8/8 | 79.14 | 80.31 | 81.56 |
| RepQ-ViT | 8/8 | 79.67 | 81.98 | 82.47 |
| HyQ | 8/8 | 78.55 | 82.26 | 82.66 |
| EasyQuant | 8/8 | 78.42 | 81.03 | 82.27 |
| OAQ (Ours) | 8/8 | 80.09 | 82.14 | 83.15 |
In terms of inference speed, quantization with OAQ significantly accelerates model execution on embedded platforms like Jetson AGX Xavier. For example, the EfficientFormer-L7 model shows a speedup of over 200% compared to full-precision inference. This improvement is crucial for drone technology, where real-time processing is essential for tasks such as autonomous navigation and object tracking. The efficiency gains are achieved without sacrificing accuracy, making OAQ a viable solution for Unmanned Aerial Vehicle deployments.
To validate the contributions of ANC and ADM, we conduct ablation studies on MobileViT and MobileViTv2 models. Relative to the quantized baseline, removing ANC leads to accuracy drops of up to 6.94 percentage points on MobileViT-xxs, while removing ADM causes declines of up to 6.63 percentage points on MobileViT-xs. The synergistic effect of both components is evident, as they jointly reduce quantization error by smoothing activations and migrating difficulty. The table below summarizes the ablation results (Top-1 accuracy, %) for MobileViT models:
| Model | Baseline | Without ANC | Without ADM | With OAQ |
|---|---|---|---|---|
| MobileViT-xxs | 62.50 | 55.56 | 59.47 | 68.65 |
| MobileViT-xs | 68.20 | 62.20 | 61.57 | 74.35 |
| MobileViT-s | 75.80 | 73.12 | 75.04 | 78.09 |
Similarly, for MobileViTv2 models, the ablation results confirm the importance of both modules:
| Model | Baseline | Without ANC | Without ADM | With OAQ |
|---|---|---|---|---|
| MobileViTv2-50 | 67.30 | 64.45 | 63.67 | 69.97 |
| MobileViTv2-75 | 73.10 | 70.22 | 69.47 | 75.31 |
| MobileViTv2-100 | 76.50 | 74.12 | 74.23 | 77.86 |
We also analyze the activation distributions before and after applying OAQ. For instance, in EfficientFormer models, the original activations exhibit long-tailed distributions with outliers. After ANC, the distributions become flatter, and with ADM, they are further concentrated around zero. This transformation reduces the dynamic range and minimizes quantization errors. The effectiveness of OAQ in handling activation outliers makes it particularly suitable for drone technology, where models must operate reliably under varying conditions.
In conclusion, our OAQ method provides an efficient solution for quantizing CNN-Transformer hybrid models on UAV platforms. By incorporating activation noise compensation and adaptive difficulty migration, we achieve high accuracy and significant speed improvements. The approach is hardware-friendly and requires no structural changes, facilitating deployment on resource-constrained drones. As drone technology continues to evolve, methods like OAQ will play a pivotal role in enabling advanced visual perception for Unmanned Aerial Vehicles. Future work could explore lower-bit quantization and adaptation to dynamic environments, further enhancing the capabilities of UAV-based applications.
