An Efficient UAV Image Stitching Method via Superpixel Segmentation

In recent years, unmanned aerial vehicles (UAVs) have been widely deployed in fields such as environmental monitoring, agricultural surveying, disaster response, and infrastructure inspection. However, due to hardware limitations, a single aerial image captured by a UAV often covers only a limited area, so image stitching is needed to obtain a comprehensive view. Traditional stitching methods suffer from low efficiency and poor visual quality, especially on large-scale UAV imagery. In this work, we propose a stitching framework that uses superpixel segmentation to restrict feature extraction and matching to the overlapping region, combined with an improved SIFT algorithm and optimized seam selection. Experimental results show that the proposed method shortens stitching time and also achieves better alignment and visual consistency than the compared approaches.

1. Introduction

The rapid advancement of UAVs has transformed remote sensing and aerial photography. Nevertheless, the limited field of view of onboard cameras makes it necessary to fuse multiple overlapping images into a seamless panorama. The core challenges in UAV image stitching include: (i) efficient detection of overlapping regions, (ii) robust feature extraction and matching under varying illumination and viewpoint, and (iii) elimination of visible seams and ghosting artifacts. Most existing algorithms process the entire image indiscriminately, extracting many redundant features that increase computational overhead and the risk of mismatches. To address these issues, we introduce a superpixel-based preprocessing step that localizes the overlapping areas, thereby reducing the search space for feature correspondences. Subsequently, we employ an enhanced SIFT descriptor with fast approximate nearest-neighbor search and RANSAC-based outlier rejection. Finally, an improved seam-cutting technique combined with weighted averaging fusion produces the stitched image.

2. Proposed Method Overview

The overall pipeline of our UAV image stitching method is as follows. The input images are first preprocessed using bilateral filtering to smooth noise while preserving edge details. Then, superpixel segmentation is applied to partition each image into perceptually meaningful regions. By comparing color and texture histograms of superpixels across images, we identify the most similar region pairs, which correspond to the overlapping area. Feature extraction is performed only within these estimated overlapping zones using an improved SIFT algorithm. After obtaining initial matches via FLANN (Fast Library for Approximate Nearest Neighbors), RANSAC is used to remove outliers and compute the homography matrix. The transformed images are then aligned, and an optimized seam finding algorithm based on HSV color space and gradient differences determines the optimal cutting path. Finally, weighted averaging fusion produces the stitched panorama. The entire process is summarized in Table 1.

**Table 1: Key Steps of the Proposed UAV Image Stitching Method**

| Step | Description | Key Formula or Operation |
|---|---|---|
| 1 | Bilateral filtering preprocessing | $\bar{I}(p) = \frac{1}{W_p}\sum_{q\in S} G_{\sigma_s}(\lVert p-q\rVert)\, G_{\sigma_r}(\lvert I(p)-I(q)\rvert)\, I(q)$ |
| 2 | Superpixel segmentation (SLIC) | $D' = \sqrt{ \left( \frac{d_c}{N_c} \right)^2 + \left( \frac{d_s}{N_s} \right)^2 }$, where $d_c$ is the color distance and $d_s$ is the spatial distance |
| 3 | Overlap region estimation | Compute CIELAB color histograms and texture features for each superpixel; find the most similar pairs via Euclidean distance. |
| 4 | Improved SIFT feature extraction | Build scale-space: $L(x,y,\sigma)=G(x,y,\sigma)\otimes I(x,y)$; DoG: $D(x,y,\sigma)=L(x,y,k\sigma)-L(x,y,\sigma)$ |
| 5 | Feature matching and refinement | FLANN for coarse matches; RANSAC for homography estimation. |
| 6 | Optimized seam finding | Energy function: $E = \sum_{p\in \Omega} E_d(p,l_p) + \sum_{(p,q)\in \Omega} E_s(p,q,l_p,l_q)$ with $E_s = \omega E_{HSV} + (1-\omega)E_g$ |
| 7 | Weighted averaging fusion | $f(x,y) = \begin{cases} f_1(x,y), & (x,y) \in f_1 \\ \omega_1 f_1(x,y) + \omega_2 f_2(x,y), & (x,y) \in f_1 \cap f_2 \\ f_2(x,y), & (x,y) \in f_2 \end{cases}$ |

3. Detailed Methodology

3.1 Bilateral Filtering Preprocessing

To suppress noise while maintaining sharp edges, which is critical for subsequent feature detection, we apply bilateral filtering to each UAV image. The filtered intensity $\bar{I}(p)$ at pixel $p$ is computed as a weighted average of neighboring pixels $q$ within a spatial window $S$:

$$ \bar{I}(p) = \frac{1}{W_p} \sum_{q\in S} G_{\sigma_s}(\|p-q\|) \, G_{\sigma_r}(|I(p)-I(q)|) \, I(q) $$

where $G_{\sigma_s}$ and $G_{\sigma_r}$ are Gaussian kernels for spatial and radiometric domains, respectively. The normalization factor $W_p$ ensures the sum of weights equals one. This operation preserves edges by assigning lower weights to pixels with large intensity differences.
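
The formula above can be sketched directly in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the image is a small grayscale grid stored as nested lists, the window $S$ is a square of assumed radius 1 clipped at the image border, and the values of `sigma_s` and `sigma_r` are illustrative. In practice a library routine such as OpenCV's `cv2.bilateralFilter` would normally be used instead.

```python
import math

def bilateral_filter(img, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Bilateral filter for a grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = 0.0  # weighted sum of neighbor intensities
            w_p = 0.0  # normalization factor W_p
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue  # the window S is clipped at the border
                    # Spatial Gaussian on the pixel distance ||p - q||
                    g_s = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    # Range Gaussian on the intensity difference |I(p) - I(q)|
                    diff = img[y][x] - img[qy][qx]
                    g_r = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    num += g_s * g_r * img[qy][qx]
                    w_p += g_s * g_r
            out[y][x] = num / w_p
    return out

# A vertical step edge (10 -> 200) with one slightly noisy pixel on the dark side.
image = [
    [10, 10, 200, 200],
    [10, 14, 200, 200],
    [10, 10, 200, 200],
]
smoothed = bilateral_filter(image)
```

In this toy example the noisy pixel is pulled toward its dark neighbors, while pixels on the bright side of the edge are left essentially unchanged because the range kernel gives near-zero weight to neighbors across the edge.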

3.2 Superpixel Segmentation and Overlap Estimation

Superpixel segmentation groups pixels into compact, homogeneous regions. We adopt the SLIC (Simple Linear Iterative Clustering) algorithm. Each pixel is represented by a 5-dimensional vector $[l, a, b, x, y]$, where $(l,a,b)$ are the CIELAB color components and $(x,y)$ are the spatial coordinates. The distance between a pixel $j$ and a cluster center $\omega$ combines a color distance $d_c$ and a spatial distance $d_s$:

$$ d_c = \sqrt{(l_j - l_\omega)^2 + (a_j - a_\omega)^2 + (b_j - b_\omega)^2} $$
$$ d_s = \sqrt{(x_j - x_\omega)^2 + (y_j - y_\omega)^2} $$
$$ D' = \sqrt{ \left( \frac{d_c}{N_c} \right)^2 + \left( \frac{d_s}{N_s} \right)^2 } $$

Here, $N_c$ is a fixed compactness constant (typically in $[1, 40]$), and $N_s = \sqrt{N/K}$ is the grid interval, with $N$ the total number of pixels and $K$ the desired number of superpixels. In standard SLIC, the initial cluster centers are moved to the lowest-gradient position in a 3×3 neighborhood before clustering, so that they do not fall on an edge or a noisy pixel.
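
As a worked illustration of the combined distance, the following sketch evaluates $D'$ for one pixel and one cluster center. The image size, the value of $K$, the choice $N_c = 10$, and the two feature vectors are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def slic_distance(pixel, center, n_c, n_s):
    """Combined distance D' between a pixel and a cluster center.

    Both arguments are (l, a, b, x, y) tuples.
    """
    d_c = math.dist(pixel[:3], center[:3])  # CIELAB color distance
    d_s = math.dist(pixel[3:], center[3:])  # spatial distance
    return math.sqrt((d_c / n_c) ** 2 + (d_s / n_s) ** 2)

# Hypothetical setup: a 400x300 image split into K = 300 superpixels.
N, K = 400 * 300, 300
n_s = math.sqrt(N / K)  # grid interval, 20 pixels here
n_c = 10.0              # assumed compactness constant within [1, 40]

pixel = (50.0, 0.0, 0.0, 10.0, 10.0)
center = (53.0, 4.0, 0.0, 22.0, 26.0)
d = slic_distance(pixel, center, n_c, n_s)
```

Here $d_c = 5$ and $d_s = 20$, so $D' = \sqrt{0.5^2 + 1^2} = \sqrt{1.25}$. A larger $N_c$ shrinks the color term and therefore favors more compact, grid-like superpixels.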

For overlap estimation, we compute color histograms (in CIELAB space) and texture features (e.g., Local Binary Patterns) for each superpixel in both images. The Euclidean distance between the feature vectors of every pair of superpixels is calculated, and the pair with the smallest distance is taken as the overlapping region. This step reduces the search space for feature matching, as only the pixels within these matched superpixels are processed.
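
The pairwise comparison can be sketched as a brute-force search over superpixel feature vectors. The three-bin vectors below are made-up stand-ins for the concatenated color histogram and texture features, and `most_similar_pair` is a hypothetical helper name; the sketch only illustrates the nearest-pair rule.

```python
import math
from itertools import product

def most_similar_pair(features_a, features_b):
    """Return (i, j, distance) for the closest pair of superpixel features."""
    pairs = product(range(len(features_a)), range(len(features_b)))
    i, j = min(pairs, key=lambda ij: math.dist(features_a[ij[0]], features_b[ij[1]]))
    return i, j, math.dist(features_a[i], features_b[j])

# Illustrative normalized feature vectors, one per superpixel.
features_a = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
features_b = [[0.1, 0.25, 0.65], [0.4, 0.4, 0.2]]

i, j, dist = most_similar_pair(features_a, features_b)
```

For these values, superpixel 1 of the first image and superpixel 0 of the second are the closest pair. The search costs one distance evaluation per pair, which stays small because the number of superpixels is far below the number of pixels.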

3.3 Improved SIFT Feature Extraction

Our feature extraction builds upon the classic SIFT algorithm. We construct a scale-space representation $L(x,y,\sigma)$ by convolving the image with a variable-scale Gaussian $G(x,y,\sigma)$:

$$ L(x,y,\sigma) = G(x,y,\sigma) \otimes I(x,y) $$
$$ G(x,y,\sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2+y^2)/(2\sigma^2)} $$

To detect scale-invariant keypoints, we use the Difference of Gaussians (DoG):

$$ D(x,y,\sigma) = (G(x,y,k\sigma) - G(x,y,\sigma)) \otimes I(x,y) = L(x,y,k\sigma) - L(x,y,\sigma) $$

Keypoints are identified as local extrema in the DoG pyramid. Each keypoint is assigned an orientation based on a local gradient histogram, and a 128-dimensional descriptor is formed from gradient magnitudes and orientations in a 16×16 neighborhood around the keypoint (4×4 subregions with 8 orientation bins each). This descriptor is invariant to scale and rotation and partially robust to affine distortion and illumination changes, making it suitable for UAV imagery with noticeable viewpoint changes.
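
To make the scale-space construction concrete, the sketch below computes a single DoG layer $D = L(k\sigma) - L(\sigma)$ for a tiny image. It is an illustration under stated assumptions: the Gaussian is sampled on a truncated window of radius 3 and renormalized to sum to one (which replaces the analytic factor $1/(2\pi\sigma^2)$), borders are handled by clamping, and $k = \sqrt{2}$ is an assumed scale step. A full SIFT implementation builds many such layers per octave and searches them for extrema.

```python
import math

def gaussian_kernel(sigma, radius):
    """Sampled 2-D Gaussian, renormalized so the weights sum to one."""
    k = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
          for x in range(-radius, radius + 1)]
         for y in range(-radius, radius + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def blur(img, sigma, radius=3):
    """L(x, y, sigma): Gaussian-smoothed image with clamped borders."""
    h, w = len(img), len(img[0])
    k = gaussian_kernel(sigma, radius)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += k[dy + radius][dx + radius] * img[yy][xx]
            out[y][x] = acc
    return out

def dog(img, sigma, k=math.sqrt(2)):
    """D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    wide, narrow = blur(img, k * sigma), blur(img, sigma)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(wide, narrow)]

# A 7x7 image with a single bright pixel in the center.
blob = [[0.0] * 7 for _ in range(7)]
blob[3][3] = 100.0
response = dog(blob, sigma=1.0)
```

The response at the center of the bright spot is negative, since the wider Gaussian spreads the peak more than the narrower one, and a constant image yields a zero response everywhere. Blob-like structures therefore appear as local extrema of $D$.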

3.4 Feature Matching and Outlier Rejection

After extracting features from the estimated overlapping superpixels, we perform coarse matching using the FLANN algorithm, which efficiently finds approximate nearest neighbors in high-dimensional space. The Euclidean distance between descriptors is used as the similarity metric. To remove false matches, we apply RANSAC (Random Sample Consensus) to estimate the homography matrix $\mathbf{H}$ that aligns the two images. RANSAC iteratively selects random subsets of matches, computes a candidate homography, and counts the number of inliers consistent with that transformation. The homography with the largest inlier set is retained. The estimated matrix is a 3×3 transformation:

$$ \mathbf{H} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} $$

Once $\mathbf{H}$ is determined, one image is warped into the coordinate system of the other.
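
The inlier-counting step of RANSAC and the effect of $\mathbf{H}$ on a point can be sketched as follows. The matrix entries, the three matches, and the 3-pixel reprojection threshold are hypothetical values for illustration; estimating a candidate $\mathbf{H}$ from four sampled matches is omitted here and would normally be delegated to a library routine.

```python
import math

def apply_homography(H, x, y):
    """Map (x, y) through the 3x3 matrix H, including the projective division."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w

def find_inliers(H, matches, threshold=3.0):
    """Keep matches whose reprojection error is below the threshold (pixels)."""
    inliers = []
    for src, dst in matches:
        px, py = apply_homography(H, *src)
        if math.hypot(px - dst[0], py - dst[1]) < threshold:
            inliers.append((src, dst))
    return inliers

# Hypothetical candidate homography: a shift plus a slight perspective term.
H = [[1.0, 0.0, 100.0],
     [0.0, 1.0, 20.0],
     [0.001, 0.0, 1.0]]

matches = [
    ((0.0, 0.0), (100.0, 20.0)),     # consistent with H
    ((100.0, 50.0), (181.8, 63.6)),  # consistent with H up to rounding
    ((10.0, 10.0), (300.0, 300.0)),  # false match
]
inliers = find_inliers(H, matches)
```

RANSAC repeats this count for each candidate homography and retains the candidate with the most inliers; here two of the three matches agree with `H`.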

3.5 Optimized Seam Finding and Fusion

Even after geometric alignment, direct blending often produces visible seams. We propose an energy-based seam finding approach that integrates color similarity in HSV space and gradient differences. The color cost between a pixel $p$ in image 0 and pixel $q$ in image 1 is:

$$ E_c(p,q) = \frac{1}{2} \| I_{RGB}(p) - I_{RGB}(q) \| \cdot \| I_{HSV}(p) - I_{HSV}(q) \| $$

where the HSV distance is computed as:

$$ I_{HSV}(p) = \sqrt{ \omega_0 (V_0(p)-V_1(p))^2 + \omega_1 (S_0(p)-S_1(p))^2 + (1-\omega_1-\omega_0) (H_0(p)-H_1(p))^2 } $$

Typically, $\omega_0 = 0.75$ and $\omega_1 = 0.2$. The gradient difference energy is:

$$ E_g(p,q) = E_g(p) + E_g(q) $$
$$ E_g(p) = |\nabla I_0(p) - \nabla I_1(p)| $$
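
The weighted HSV distance and the color cost can be sketched as below. This reflects one reading of the formulas, stated as an assumption: the HSV factor in $E_c$ is taken to be the weighted distance with $\omega_0 = 0.75$ and $\omega_1 = 0.2$, RGB values are 8-bit, HSV channels are scaled to $[0, 1]$ by the standard library's `colorsys`, and hue is differenced directly as in the formula rather than on the color circle.

```python
import colorsys
import math

def hsv_distance(rgb0, rgb1, w0=0.75, w1=0.2):
    """Weighted HSV distance between the same pixel in image 0 and image 1."""
    h0, s0, v0 = colorsys.rgb_to_hsv(*(c / 255 for c in rgb0))
    h1, s1, v1 = colorsys.rgb_to_hsv(*(c / 255 for c in rgb1))
    return math.sqrt(w0 * (v0 - v1) ** 2
                     + w1 * (s0 - s1) ** 2
                     + (1 - w1 - w0) * (h0 - h1) ** 2)

def color_cost(rgb0, rgb1):
    """E_c: half the RGB distance multiplied by the weighted HSV distance."""
    return 0.5 * math.dist(rgb0, rgb1) * hsv_distance(rgb0, rgb1)

white, black = (255, 255, 255), (0, 0, 0)
d_wb = hsv_distance(white, black)  # only the value channel differs
cost_same = color_cost(white, white)
```

With these weights, brightness differences dominate the cost: white versus black gives $\sqrt{0.75}$, while identical pixels cost zero, so the seam is steered through regions where the two aligned images agree.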

The total energy for seam determination combines these terms:

$$ E = \sum_{p \in \Omega} E_d(p, l_p) + \sum_{(p,q) \in \Omega} E_s(p,q, l_p, l_q) $$
$$ E_s(p,q, l_p, l_q) = \omega E_{HSV} + (1-\omega) E_g $$

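where $E_d$ is the data term for assigning label $l_p$ to pixel $p$, $E_s$ is the smoothness term between neighboring pixels $p$ and $q$, and $\omega \in [0,1]$ balances the color and gradient costs. The optimal seam is found by dynamic programming that minimizes the cumulative energy along the cut.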
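
A minimal sketch of the dynamic-programming search is given below. It makes simplifying assumptions that are not taken from the method itself: the energy is already collapsed into a single per-pixel cost map over the overlap, the seam runs from top to bottom with one pixel per row, and consecutive seam pixels may shift by at most one column.

```python
def find_seam(energy):
    """Return (seam, total), where seam[y] is the seam column in row y."""
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]  # cumulative minimum energy
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest end point in the last row.
    x = min(range(w), key=lambda i: cost[-1][i])
    total = cost[-1][x]
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        x = min(range(lo, hi + 1), key=lambda i: cost[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam, total

# Illustrative cost map: low values mark pixels where the two images agree.
energy = [
    [9, 1, 9, 9],
    [9, 9, 1, 9],
    [9, 9, 1, 9],
    [9, 1, 9, 9],
]
seam, total = find_seam(energy)
```

For this map the seam follows the low-cost pixels through columns 1, 2, 2, 1 with a cumulative energy of 4. The search visits each pixel once and inspects at most three predecessors, so its cost is linear in the size of the overlap.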

After the seam is determined, we apply weighted averaging fusion within the overlapping region to achieve smooth intensity transition:

$$ f(x,y) = \begin{cases} f_1(x,y), & (x,y) \in f_1 \\ \omega_1 f_1(x,y) + \omega_2 f_2(x,y), & (x,y) \in f_1 \cap f_2 \\ f_2(x,y), & (x,y) \in f_2 \end{cases} $$

Here, $\omega_1$ decreases linearly from 1 to 0 across the overlap, while $\omega_2$ increases from 0 to 1, so that $\omega_1 + \omega_2 = 1$ at every pixel and the blend transitions smoothly from one image to the other.
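
The linear weighting can be illustrated on a single image row. In this sketch, both rows are assumed to be already warped into panorama coordinates, the overlap is an assumed column range `start..end-1`, and `blend_rows` is a hypothetical helper name.

```python
def blend_rows(f1, f2, start, end):
    """Blend two aligned rows; columns start..end-1 form the overlap.

    f1 is valid left of `end`, f2 is valid from `start` onward.
    """
    n = end - start
    out = []
    for x in range(len(f1)):
        if x < start:
            out.append(f1[x])  # only image 1 covers this column
        elif x >= end:
            out.append(f2[x])  # only image 2 covers this column
        else:
            # w2 rises linearly from 0 to 1 across the overlap; w1 = 1 - w2
            w2 = (x - start) / (n - 1) if n > 1 else 0.5
            w1 = 1.0 - w2
            out.append(w1 * f1[x] + w2 * f2[x])
    return out

# Image 1 is slightly darker (100) than image 2 (120); zeros mark invalid pixels.
f1 = [100, 100, 100, 100, 100, 0, 0]
f2 = [0, 0, 120, 120, 120, 120, 120]
row = blend_rows(f1, f2, start=2, end=5)
```

The blended row steps through 100, 110, and 120 across the three overlap columns, replacing the abrupt 20-level jump that a hard cut would produce.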

4. Experimental Results

We conducted extensive experiments on the public UAV-image-mosaicing-dataset, which contains 283 high-resolution 4K aerial images of forest and industrial areas captured by UAVs. The hardware platform is an Intel i5-7200U CPU with 8 GB RAM, running Python 3.10 and OpenCV. We compared our method against two state-of-the-art algorithms: AAPAP (An improved APAP image matching algorithm) and IDM (Image stitching by disparity-guided multi-plane alignment).

4.1 Objective Metrics

We evaluate stitching quality using three standard metrics: Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Absolute Error (MAE). Higher SSIM and PSNR indicate closer resemblance to the ideal aligned result; lower MAE indicates less distortion. Processing time (in seconds) is also recorded. The results for three representative image pairs are shown below.
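
For reference, MAE and PSNR can be computed as in the sketch below, using their standard definitions. The peak value of 255 assumes 8-bit images, and the two tiny arrays are illustrative only; the exact image regions compared in the experiments are not specified here, so this should not be read as the evaluation protocol. SSIM is omitted because it requires windowed local statistics.

```python
import math

def mae(a, b):
    """Mean absolute error between two equally sized grayscale images."""
    diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    sq = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    mse = sum(sq) / len(sq)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / mse)

reference = [[100, 100], [100, 100]]
stitched = [[110, 90], [100, 100]]

err = mae(reference, stitched)     # (10 + 10 + 0 + 0) / 4 = 5
quality = psnr(reference, stitched)  # MSE = 50
```

With a mean squared error of 50, the PSNR is $10 \log_{10}(255^2 / 50) \approx 31.1$ dB.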

**Table 2: Comparative Results on Three UAV Image Pairs**

| Image Pair | Algorithm | SSIM | MAE | PSNR (dB) | Time (s) |
|---|---|---|---|---|---|
| Pair 1 | AAPAP | 0.9641 | 140.29 | 26.63 | 6.87 |
| Pair 1 | IDM | 0.9971 | 136.70 | 31.73 | 6.77 |
| Pair 1 | Ours | 0.9978 | 133.20 | 32.45 | 5.32 |
| Pair 2 | AAPAP | 0.9742 | 94.34 | 29.62 | 25.37 |
| Pair 2 | IDM | 0.9991 | 90.05 | 33.34 | 26.43 |
| Pair 2 | Ours | 0.9996 | 87.28 | 33.83 | 13.27 |
| Pair 3 | AAPAP | 0.9851 | 117.93 | 30.24 | 23.31 |
| Pair 3 | IDM | 0.9992 | 156.50 | 34.14 | 26.41 |
| Pair 3 | Ours | 0.9991 | 114.74 | 35.74 | 10.21 |

The results show that our method achieves the highest PSNR, the lowest MAE, and the shortest processing time for all three pairs. It also achieves the highest SSIM for Pairs 1 and 2; for Pair 3, IDM is marginally higher (0.9992 versus 0.9991). For example, in Pair 1, our method improves PSNR by 0.72 dB over IDM and reduces runtime by 21.4%. In Pair 2, the runtime is nearly halved compared to both competitors while all three quality metrics improve. These results support the effectiveness of superpixel-based overlap estimation in reducing redundant computation.

4.2 Computational Efficiency Analysis

To further quantify the efficiency gain, we compared the number of extracted SIFT features and matching time between the conventional full-image approach and our superpixel-guided method. The results averaged over 20 random image pairs are presented below.

**Table 3: Feature Extraction and Matching Efficiency**

| Method | Avg. Number of Features | Avg. Matching Time (ms) | Avg. Successful Match Ratio |
|---|---|---|---|
| Full-image SIFT | 12,450 | 345 | 72.3% |
| Superpixel-guided (Ours) | 3,210 | 89 | 88.6% |

Our method reduces the number of extracted features by 74.2% and the matching time by 74.2%, while increasing the successful match ratio by 16.3 percentage points (from 72.3% to 88.6%). This indicates that focusing on overlapping superpixels not only accelerates processing but also improves matching robustness.
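
The relative reductions quoted above follow directly from the table entries, as the short check below shows. The helper name `pct_reduction` is introduced here for illustration; all input numbers are copied from Tables 2 and 3.

```python
def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Table 3: full-image SIFT versus the superpixel-guided method.
feature_cut = pct_reduction(12450, 3210)  # fewer extracted features
time_cut = pct_reduction(345, 89)         # shorter matching time (ms)
ratio_gain = 88.6 - 72.3                  # percentage points, not percent

# Table 2, Pair 1: runtime of IDM versus the proposed method.
pair1_time_cut = pct_reduction(6.77, 5.32)
```

Rounded to one decimal place, these evaluate to 74.2, 74.2, 16.3, and 21.4 respectively. The match ratio gain is a difference between two percentages, which is why it is stated in percentage points.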

5. Conclusion

We have presented an efficient, high-quality image stitching method designed for UAV imagery. By using superpixel segmentation to estimate overlapping regions, the algorithm significantly reduces the computational burden of feature extraction and matching. The improved SIFT algorithm combined with FLANN and RANSAC provides reliable correspondences. An optimized seam finding technique that uses HSV color distance and gradient differences, followed by weighted averaging fusion, produces visually seamless panoramas. Experimental evaluations on a real UAV dataset show that our method outperforms the compared approaches in computational speed and in nearly all image quality metrics. This is a step toward near-real-time stitching on resource-constrained UAV platforms in practical applications.

Future work could explore the integration of GPU acceleration and deep learning-based superpixel refinement to further enhance performance. Additionally, extending the method to handle large parallax and dynamic scenes captured by multiple UAVs would broaden its applicability.