Autonomous Aerial Refueling for UAV Drones: A Vision-Based Detection and Localization Methodology

In the realm of modern aviation, UAV drones have revolutionized missions requiring extended endurance and operational range. Autonomous aerial refueling (AAR) stands as a critical technology to enhance the capabilities of UAV drones, enabling them to undertake longer-duration tasks without landing. However, the close-range docking phase in probe-and-drogue refueling presents significant challenges, including robust drogue detection under dynamic conditions and high-precision relative localization. This paper addresses these issues by proposing a novel near-infrared vision-based approach for drogue detection and localization, specifically tailored for UAV drones. Our method integrates traditional image processing with machine learning techniques to achieve reliable performance in various environmental scenarios.

The core of our system involves equipping the refueling drogue with eight near-infrared light-emitting diodes (LEDs) as fiducial markers and deploying a visual system on the receiver UAV drone, comprising cameras fitted with near-infrared filters. We detail a comprehensive pipeline encompassing image preprocessing, ellipse feature fitting via a shape-prior-driven neighborhood search, relative pose estimation using the perspective-n-point (PnP) algorithm, and multi-source information fusion. Experimental results demonstrate that our approach offers high detection success rates, efficiency, and centimeter-level localization accuracy, making it suitable for real-world AAR applications involving UAV drones.

The increasing adoption of UAV drones in military and civilian domains has heightened the demand for autonomous capabilities, including aerial refueling. For UAV drones, AAR eliminates the need for human intervention, reducing operational risks and enabling persistent missions. Vision-based navigation has emerged as a promising solution due to its high accuracy, low cost, and immunity to electromagnetic interference. However, existing methods often struggle with robustness against occlusions, lighting variations, and high-speed motions. Our work aims to overcome these limitations by leveraging near-infrared imaging and advanced algorithms, ensuring that UAV drones can perform docking reliably even in challenging conditions.

We begin by outlining the overall system design. The visual system on the UAV drone consists of two industrial cameras symmetrically mounted under the wings, each equipped with an 850 nm near-infrared filter to isolate LED emissions. This dual-camera setup enhances field-of-view coverage and redundancy, crucial for UAV drones operating in turbulent airflows. The drogue features eight LEDs arranged uniformly on a 0.4 m diameter ring, providing sufficient feature points for accurate ellipse fitting. The processing pipeline includes modules for image preprocessing, ellipse feature fitting, relative pose computation, and fusion filtering, all optimized for real-time execution on UAV drone onboard computers.
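To make the drogue geometry concrete, the short sketch below generates the eight LED positions on the 0.4 m diameter ring; these serve later as the 3D object points for pose estimation. It is a minimal illustration in Python/NumPy, assuming the drogue frame is centered on the ring with its z-axis along the ring normal; the function and variable names are ours, not from the original system.

```python
import numpy as np

def drogue_led_model(num_leds: int = 8, ring_diameter: float = 0.4) -> np.ndarray:
    """3D coordinates of the LEDs in the drogue frame (meters).

    Assumes the drogue frame is centered on the LED ring, with the
    x-y plane containing the ring and the z-axis along its normal.
    """
    radius = ring_diameter / 2.0
    angles = 2.0 * np.pi * np.arange(num_leds) / num_leds  # uniform spacing
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.zeros(num_leds)], axis=1)          # shape (8, 3)

# Example: the eight object points P_1..P_8 used later by the PnP step.
LED_POINTS_3D = drogue_led_model()
```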

Image preprocessing is critical for extracting LED features from near-infrared images captured by UAV drones. The raw grayscale image undergoes several steps: edge detection using the Sobel operator, morphological dilation to enhance edges, adaptive thresholding via Otsu’s method, bitwise AND operation to combine results, and connected component extraction. The centroid of each connected region is computed using the gray-scale centroid method, and sub-pixel refinement is achieved through Gaussian model fitting. The equations for centroid calculation are:

$$ x_c = \frac{\sum_{i \in \Omega} I_i\, x_i}{\sum_{i \in \Omega} I_i} $$

$$ y_c = \frac{\sum_{i \in \Omega} I_i\, y_i}{\sum_{i \in \Omega} I_i} $$

where \( (x_c, y_c) \) is the centroid coordinate, \( \Omega \) is the set of pixels in the connected region, \( (x_i, y_i) \) are the coordinates of pixel \( i \), and \( I_i \) is its grayscale value. The Gaussian model for sub-pixel fitting is:

$$ M(x,y) = A \cdot \exp\left[ -\frac{(x - x_c)^2}{2\sigma_x^2} - \frac{(y - y_c)^2}{2\sigma_y^2} \right] $$

Minimizing the error \( \epsilon = \sum_{i=1}^N [I(x_i, y_i) - M(x_i, y_i)]^2 \) yields refined LED positions. This preprocessing robustly handles noise and varying illumination, which are common in UAV drone operations.
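The sketch below illustrates this preprocessing chain with OpenCV and NumPy. It is a simplified illustration rather than the exact implementation: the function name, kernel sizes, and area thresholds are assumptions, and the Gaussian sub-pixel refinement is noted afterwards.

```python
import cv2
import numpy as np

def extract_led_centroids(gray: np.ndarray, min_area: int = 4, max_area: int = 400):
    """Candidate LED centers from a near-infrared frame.

    Follows the preprocessing chain described above: Sobel edges,
    morphological dilation, Otsu thresholding, a bitwise AND of the two
    masks, and connected-component extraction with gray-scale centroids.
    The area limits are illustrative values, not those of the original system.
    """
    # 1. Edge detection (Sobel magnitude) followed by dilation.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=1)
    _, edge_mask = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 2. Otsu threshold on the raw intensity, combined with the edge mask.
    _, bright_mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.bitwise_and(edge_mask, bright_mask)

    # 3. Connected components and gray-scale centroids (x_c, y_c above).
    num, labels = cv2.connectedComponents(mask)
    centroids = []
    for k in range(1, num):                      # label 0 is the background
        ys, xs = np.nonzero(labels == k)
        if not (min_area <= xs.size <= max_area):
            continue
        w = gray[ys, xs].astype(np.float64)      # grayscale weights I_i
        centroids.append((float((w * xs).sum() / w.sum()),
                          float((w * ys).sum() / w.sum())))
    return centroids
```

Sub-pixel refinement would then fit the 2-D Gaussian model \( M(x,y) \) to the pixels of each retained region, for example with a nonlinear least-squares solver, and replace the coarse centroid by the fitted \( (x_c, y_c) \).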

Ellipse feature fitting is performed using a shape-prior-driven neighborhood search algorithm. Given the extracted feature points, we construct a KD-tree for efficient neighbor queries. For each point, we search within radii from \( r_{\text{min}} \) to \( r_{\text{max}} \) with a step length, and fit an ellipse to neighboring points using the conic equation:

$$ A x^2 + B xy + C y^2 + D x + E y + 1 = 0 $$

subject to the ellipse constraint \( B^2 - 4AC < 0 \). Inliers are determined based on a distance threshold, and ellipses are scored using an angular distribution consistency metric. The algorithm prioritizes ellipses with uniformly distributed points, ensuring accurate drogue representation. Additionally, tracking is implemented by predicting the drogue region in the current frame based on the previous ellipse’s bounding box, expanded by factors \( e_u \) and \( e_v \):

$$ d_u = 2\sqrt{(a \cos \theta)^2 + (b \sin \theta)^2}, \quad d_v = 2\sqrt{(a \sin \theta)^2 + (b \cos \theta)^2} $$

$$ l_u = (1 + e_u) \cdot d_u, \quad l_v = (1 + e_v) \cdot d_v $$

where \( a \) and \( b \) are the semi-major and semi-minor axes and \( \theta \) is the ellipse’s rotation angle. To enhance robustness, we integrate a deep learning-based drogue detection model, trained on a dataset of near-infrared images from UAV drone scenarios. The model, based on the RTMDet framework, provides confidence scores for candidate ellipses, resolving cases with multiple candidate fits or outright fitting failures. This hybrid approach ensures reliable detection for UAV drones even under partial occlusions.
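A compact sketch of the two geometric pieces of this stage is given below: the algebraic conic fit with the ellipse condition check, and the expanded search box predicted from the previous frame’s ellipse. The KD-tree neighborhood search (e.g. with scipy.spatial.cKDTree), inlier selection, angular-consistency scoring, and the RTMDet confidence check are omitted, and the default expansion factors are illustrative.

```python
import numpy as np

def fit_conic(points: np.ndarray):
    """Least-squares fit of A x^2 + B xy + C y^2 + D x + E y + 1 = 0.

    points: (N, 2) candidate LED centers.
    Returns the coefficient vector [A, B, C, D, E] and whether it
    satisfies the ellipse condition B^2 - 4AC < 0.
    """
    x, y = points[:, 0], points[:, 1]
    M = np.stack([x * x, x * y, y * y, x, y], axis=1)
    coeffs, *_ = np.linalg.lstsq(M, -np.ones(len(points)), rcond=None)
    A, B, C, _, _ = coeffs
    return coeffs, (B * B - 4.0 * A * C) < 0.0

def predicted_search_box(center, a, b, theta, e_u=0.3, e_v=0.3):
    """Axis-aligned search region for the next frame, computed from the
    previous ellipse (center, semi-axes a/b, rotation theta) and expanded
    by the factors e_u and e_v (illustrative defaults)."""
    d_u = 2.0 * np.hypot(a * np.cos(theta), b * np.sin(theta))
    d_v = 2.0 * np.hypot(a * np.sin(theta), b * np.cos(theta))
    l_u, l_v = (1.0 + e_u) * d_u, (1.0 + e_v) * d_v
    cx, cy = center
    return (cx - l_u / 2.0, cy - l_v / 2.0, l_u, l_v)  # x, y, width, height
```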

Relative pose estimation is achieved via the PnP algorithm. From the fitted ellipse, eight equally spaced points \( p_1 \) to \( p_8 \) are sampled, corresponding to the 3D coordinates \( P_1 \) to \( P_8 \) of the LEDs in the drogue coordinate system. The transformation between the camera and drogue is represented by \( T = \{R, t\} \), where \( R \) is rotation and \( t \) is translation. Using the direct linear transformation (DLT) method, we formulate:

$$ s \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = \begin{bmatrix} t_1 & t_2 & t_3 & t_4 \\ t_5 & t_6 & t_7 & t_8 \\ t_9 & t_{10} & t_{11} & t_{12} \end{bmatrix} \begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \\ 1 \end{bmatrix} $$

Eliminating the scale factor \( s \) yields two linear constraints per point. Stacking the constraints from all eight points produces a homogeneous linear system, which we solve via singular value decomposition (SVD) to obtain \( T \). This provides the relative pose between the UAV drone’s camera and the drogue, which is then transformed to the UAV drone body coordinate system using the known extrinsic calibration.
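The sketch below shows the DLT construction and SVD solution for a single camera, assuming the pixel coordinates have already been normalized with the camera intrinsics; the scale/sign fix and the re-orthogonalization of the rotation block are standard post-processing steps, and the function name is ours.

```python
import numpy as np

def dlt_pose(object_pts: np.ndarray, normalized_pts: np.ndarray):
    """Direct linear transform for the drogue-to-camera pose.

    object_pts:     (N, 3) LED coordinates in the drogue frame.
    normalized_pts: (N, 2) image points in normalized camera coordinates
                    (pixel coordinates premultiplied by K^{-1}).
    Returns (R, t), with R projected onto the nearest rotation matrix.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(object_pts, normalized_pts):
        P = [X, Y, Z, 1.0]
        rows.append(P + [0.0] * 4 + [-u * p for p in P])   # u-constraint
        rows.append([0.0] * 4 + P + [-v * p for p in P])   # v-constraint
    A = np.asarray(rows)                                    # (2N, 12)

    # Homogeneous least squares: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    T = Vt[-1].reshape(3, 4)

    # Fix the overall scale and sign so the 3x3 block is a proper rotation.
    T /= np.linalg.norm(T[:, :3], axis=1).mean()
    if np.linalg.det(T[:, :3]) < 0:
        T = -T
    U, _, Vt2 = np.linalg.svd(T[:, :3])
    R = U @ Vt2                                             # nearest rotation matrix
    t = T[:, 3]
    return R, t
```

In practice, the same correspondences can also be handed to an off-the-shelf solver such as cv2.solvePnP to refine or cross-check the DLT estimate.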

Multi-source information fusion combines data from the two cameras on the UAV drone. Timestamp alignment synchronizes measurements within a threshold interval, and coordinate alignment transforms poses to a common body frame. A Kalman filter is applied for noise reduction and motion smoothing, employing a constant velocity model. The state vector is \( X = [p, \dot{p}]^T \), where \( p = [x, y, z] \) is position and \( \dot{p} \) is velocity. The state transition and measurement equations are:

$$ X_{k+1} = A X_k + w_k, \quad A = \begin{bmatrix} I & I \Delta T \\ 0 & I \end{bmatrix} $$

$$ Z_k = H X_k + v_k, \quad H = \begin{bmatrix} I & 0 \end{bmatrix} $$

where \( \Delta T \) is the sampling time, \( w_k \) and \( v_k \) are process and measurement noises with covariances \( Q \) and \( R \), respectively. The Kalman filter equations for prediction and update are:

Prediction: \( \hat{X}_k^- = A \hat{X}_{k-1} \), \( P_k^- = A P_{k-1} A^T + Q \)

Update: \( K_k = P_k^- H^T (H P_k^- H^T + R)^{-1} \), \( \hat{X}_k = \hat{X}_k^- + K_k (Z_k - H \hat{X}_k^-) \), \( P_k = (I - K_k H) P_k^- \)

This fusion ensures stable and accurate pose outputs for UAV drone control during docking.
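A minimal constant-velocity Kalman filter matching the equations above might look as follows; the process and measurement noise levels are illustrative tuning values, the class name is ours, and timestamp and coordinate alignment are assumed to have been performed before each update.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter over the fused relative position.

    State X = [p, p_dot] with p = [x, y, z]; measurements are the relative
    positions produced by the PnP step from either camera, already expressed
    in the common body frame.
    """

    def __init__(self, dt: float, q: float = 0.05, r: float = 0.02):
        I3 = np.eye(3)
        self.A = np.block([[I3, dt * I3],
                           [np.zeros((3, 3)), I3]])   # state transition (6x6)
        self.H = np.hstack([I3, np.zeros((3, 3))])    # measurement model (3x6)
        self.Q = q * np.eye(6)                        # process noise (illustrative)
        self.R = r * np.eye(3)                        # measurement noise (illustrative)
        self.x = np.zeros(6)
        self.P = np.eye(6)

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def update(self, z: np.ndarray):
        # z: measured relative position [x, y, z] in the body frame.
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                             # filtered relative position
```

A typical loop would call predict() at each control tick and update() whenever a time-synchronized measurement from either camera becomes available.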

Extensive experiments were conducted to validate our methodology for UAV drones. The setup included a drogue with eight 850 nm LEDs, two cameras with near-infrared filters, and an onboard computing platform. We collected 3,500 near-infrared images under various conditions: different distances, angles, lighting, backgrounds, occlusions, and simulated smoke. The detection success rates are summarized in Table 1, demonstrating high robustness across scenarios for UAV drone applications.

Table 1: Detection Success Rates for UAV Drones in Various Scenarios

| Scenario | Frames | Ellipse fitting failures | Recovered by deep learning | Total failures | Success rate (%) |
|---|---|---|---|---|---|
| No occlusion | 2000 | 17 | 7 | 10 | 99.5 |
| 1 LED occluded | 300 | 14 | 9 | 5 | 98.3 |
| 2 LEDs occluded | 300 | 20 | 8 | 12 | 96.0 |
| 3 LEDs occluded | 300 | 21 | 4 | 17 | 94.3 |
| Smoke simulation | 300 | 12 | 2 | 10 | 96.7 |
| High-speed motion | 300 | 24 | 10 | 14 | 95.3 |

The efficiency of our algorithm is critical for real-time UAV drone operations. As shown in Table 2, the average detection time per frame is 24.6 ms, with a complete pipeline time of 28.9 ms, enabling a frame rate over 30 Hz. This meets the stringent real-time requirements for UAV drones during aerial refueling.

Table 2: Processing Efficiency for UAV Drone Vision System

| Metric | Average time (ms) | Maximum time (ms) | Minimum time (ms) | Standard deviation (ms) |
|---|---|---|---|---|
| Detection only | 24.6 | 30.7 | 19.6 | 1.42 |
| Full pipeline | 28.9 | 32.2 | 23.1 | 1.56 |

Localization accuracy was evaluated using ultra-wideband (UWB) measurements as ground truth. The UAV drone platform was fixed, and the drogue was moved along x, y, and z axes. The relative position errors are summarized in Table 3, showing centimeter-level accuracy across ranges up to 16.91 m. This precision is essential for safe docking of UAV drones.

Table 3: Localization Accuracy for UAV Drones in Different Ranges

| Distance range (m) | Frames | Average error in x (cm) | Average error in y (cm) | Average error in z (cm) |
|---|---|---|---|---|
| [0.00, 5.00) | 1933 | 6.40 | 2.15 | 2.72 |
| [5.00, 10.00) | included in total | 8.42 | 3.98 | 4.03 |
| [10.00, 16.91) | included in total | 9.34 | 5.52 | 5.17 |

Field tests with actual UAV drones further validated our approach. Two UAV drones were deployed as tanker and receiver, flying at speeds around 45 m/s. The visual system successfully detected and localized the drogue under backlight, headlight, and occlusion conditions, enabling autonomous docking. These results underscore the practicality of our method for UAV drones in dynamic aerial environments.

In conclusion, we have presented a robust vision-based drogue detection and localization method for autonomous aerial refueling of UAV drones. By integrating near-infrared LED markers, dual-camera systems, advanced image processing, and machine learning, our approach achieves high detection success rates, real-time efficiency, and centimeter-level positioning accuracy. The hybrid algorithm ensures reliability under occlusions, lighting variations, and high-speed motions, which are common challenges for UAV drones. Future work may involve extending this to multi-UAV drone scenarios or incorporating deep learning for end-to-end pose estimation. This methodology holds significant promise for enhancing the autonomy and operational capabilities of UAV drones in critical refueling missions.

The deployment of UAV drones in complex missions necessitates continuous innovation in supporting technologies. Our vision-based system contributes to this by providing a reliable solution for autonomous aerial refueling, enabling UAV drones to achieve greater endurance and flexibility. As UAV drones evolve, such advancements will be pivotal in expanding their roles across military, commercial, and scientific domains.
