Multi-UAV Collaborative Mission Planning for Aircraft Skin Coverage Detection

In the context of aircraft skin coverage detection, traditional manual inspection methods face significant bottlenecks, such as low operational efficiency and stringent timeliness constraints. To address these challenges, multi-unmanned aerial vehicle (UAV) collaborative operations have emerged as a promising solution. The Multi-UAV Cooperative Mission Planning (MCMP) problem models this collaborative detection task, and existing algorithms for it often rely on heuristic approaches. However, these methods struggle to meet practical requirements in terms of solution speed and quality. This paper models the MCMP problem as a Capacitated Vehicle Routing Problem (CVRP) and proposes a Two-Stage Deep Reinforcement Learning (TSDRL) framework. The first stage determines the optimal number of drones based on node count using an attention-based policy network, while the second stage designs a novel encoder-decoder policy network to construct paths for each UAV. Trained via policy gradients, this model rapidly generates high-quality paths. To satisfy collision constraints in 3D environments, paths are refined with the RRT* algorithm. Simulation results demonstrate that the proposed model outperforms existing deep reinforcement learning methods and heuristic algorithms in computational efficiency and solution quality, with strong generalization across different aircraft models.

The increasing demand for efficient and timely aircraft skin inspection in airline operations necessitates automated solutions. Skin damage such as impact dents, lightning strikes, and cracks poses critical safety risks. Line maintenance inspections, including pre-flight, transit, and post-flight checks, are often constrained by tight schedules. In real-world scenarios, inspection tasks must be allocated dynamically due to varying operational environments, such as the presence of ground support equipment on the apron. Multi-UAV simultaneous localization and mapping (SLAM) technology can rapidly generate environmental maps from which coverage nodes for skin inspection are derived. Prolonged task allocation times can extend the overall inspection cycle, potentially delaying flight departures. Thus, airlines require solutions that minimize the coverage time of multiple drones to ensure punctuality. Traditional manual inspections suffer from inefficiencies, safety concerns, low coverage, and inadequate precision. Multi-UAV technologies, leveraging individual drone flexibility, have shown promise in enhancing the efficiency and accuracy of line maintenance inspections.

The core of automated detection lies in Multi-UAV Cooperative Mission Planning (MCMP), which involves viewpoint generation, viewpoint assignment, and path planning. Viewpoints are derived from 3D aircraft models through decomposition methods like cell decomposition or grid partitioning, optimized under constraints such as overlap ratio, shooting distance, and camera angle to ensure full coverage. Viewpoint assignment and path planning entail allocating viewpoints to drones and generating collision-free flight paths. These aspects align with the Vehicle Routing Problem (VRP), a classic combinatorial optimization problem focused on efficiently assigning vehicles to serve customers under constraints while minimizing total travel distance. The key research challenge is rapidly obtaining high-quality solutions, which resonates with the need for swift solutions in aircraft skin coverage detection. Specifically, this study models the 3D viewpoint assignment and path planning problem as an extension of VRP, generating high-quality path solutions under multiple constraints.

Extensive research has been conducted on various VRP variants in transportation and robotics. In logistics, truck-drone collaborative delivery is often modeled as the Vehicle Routing Problem with Drones or the Traveling Salesman Problem with Drones. These problems are typically formulated using Mixed Integer Linear Programming (MILP) and solved via multi-level strategies. Although MILP provides theoretical optimality, its computational complexity escalates with problem scale. Hence, heuristic algorithms are commonly employed to find suboptimal solutions within reasonable timeframes. For instance, some studies base their models on TSP for truck-drone coordination, aiming to minimize total delivery time or optimize time and energy consumption. Evolutionary multi-tasking algorithms have been proposed for multi-UAV emergency delivery, treating the original problem as the primary task and an unconstrained version as an auxiliary task. However, these heuristics are often problem-specific, limiting their applicability.

Recently, learning-based methods have emerged as alternatives for solving VRP and other combinatorial optimization problems. Neural network models introduced deep learning to TSP, solving it end-to-end. Subsequent work incorporated Transformer-based encoder-decoder structures, surpassing traditional heuristics in solution quality and computational efficiency for multiple COPs. However, these methods often overlook vehicle states and partially constructed paths, potentially compromising solution quality. Deep reinforcement learning combined with attention mechanisms has successfully addressed heterogeneous capacitated VRP, outperforming non-learning benchmarks. Multi-agent DRL frameworks with local search strategies have further improved solutions. Despite these advances, learning-based methods still face challenges with complex constraints, leaving room for improvement compared to heuristics.

Current MCMP research predominantly models the problem as VRP variants, solved using heuristic methods. While DRL for COPs is evolving, it faces limitations in airline inspection contexts: (1) Heuristic methods rely on expert domain knowledge, suffer from high computational complexity, and lack adaptability, failing to meet airline inspection efficiency demands. (2) Existing DRL encoders struggle with high-dimensional feature extraction, losing node semantic information and resulting in poor node embeddings; decoders simplistically use node embedding averages for graph representation, failing to dynamically perceive path state changes. These limitations significantly impact path planning solution quality and efficiency.

To address these issues, this paper proposes a Two-Stage Deep Reinforcement Learning (TSDRL) approach for MCMP with flight distance constraints. Contributions include: (1) Designing a TSDRL model where the first stage determines the optimal drone count based on node quantity, serving the second stage; the second stage’s encoder introduces dual batch normalization and gated aggregation to enhance node embedding quality; the decoder incorporates UAV selection and graph aggregation modules to dynamically perceive each drone’s path state and node changes, selecting appropriate nodes. (2) Simulations on various airline inspection problem scales show TSDRL outperforms heuristics and other DRL methods in speed and solution quality; evaluation across aircraft models validates its generalization capability.

The MCMP problem for aircraft skin detection involves multiple UAVs starting from a common task origin, each equipped with a camera, assigned to nodes covering the skin area. Visiting all nodes ensures full coverage. Nodes are generated using existing viewpoint generation algorithms. In the TSDRL model, drones correspond to “vehicles” in CVRP, nodes to customer points, maximum flight distance to vehicle capacity, and traveled distance to collision-free flight distance between nodes. The study aims to allocate a certain number of UAVs and solve CVRP efficiently to minimize the maximum flight distance across all UAVs. Assumptions include: (1) collision-free distance between any two nodes is less than the maximum flight distance; (2) each node is serviced by one drone; (3) drones have mutual collision avoidance programs, ignoring additional time for avoidance; (4) all drones have the same maximum flight distance, used as an energy constraint; (5) after coverage, each drone returns to the origin.

Given $n+1$ nodes (one task origin plus $n$ inspection nodes) denoted as $X = \{x_i\}_{i=0}^n$, where $x_0$ is the origin, the inspection-node set is defined as $X' = X \setminus \{x_0\}$. Each node $x_i = (s_i, d_i) \in \mathbb{R}^4$, where $s_i$ contains the node's 3D coordinates and $d_i$ represents the node's demand. Unlike typical CVRP, the demand $d_i$ is dynamic: it is the collision-free distance from the previously visited node to $x_i$, computed via the RRT* algorithm as $d_i = \text{RRT}^*(x_{i-1}, x_i)$. The drone fleet is denoted as $V = \{v_i\}_{i=1}^{U_{\text{num}}}$, where $v_i = (Q_i)$, with $Q_i$ being the maximum flight distance. The binary variable $y_{ij}^v$ equals 1 if drone $v$ travels directly from node $x_i$ to $x_j$, and 0 otherwise. $L_{ij}^v$ denotes the remaining flight distance of drone $v$ before reaching $x_j$. The mathematical model is formulated as follows:

$$
\min \max_{v \in V} \left( \sum_{i \in X} \sum_{j \in X} \text{RRT}^*(x_i, x_j) \, y_{ij}^v \right)
$$

Subject to:

$$
\sum_{v \in V} \sum_{j \in X} y_{ij}^v = 1, \quad i \in X'
$$

$$
\sum_{i \in X} y_{ij}^v - \sum_{k \in X} y_{jk}^v = 0, \quad v \in V, \; j \in X'
$$

$$
\sum_{v \in V} \sum_{i \in X} L_{ij}^v - \sum_{v \in V} \sum_{k \in X} L_{jk}^v = d_j, \quad j \in X'
$$

$$
d_j y_{ij}^v \leq L_{ij}^v \leq (Q_v - d_i) y_{ij}^v, \quad v \in V, \; i,j \in X
$$

$$
y_{ij}^v \in \{0, 1\}, \quad v \in V, \; i,j \in X
$$

$$
L_{ij}^v \geq 0, \quad d_i \geq 0, \quad v \in V, i,j \in X
$$

Equation (1) minimizes the maximum flight distance across drones. Constraints (2) and (3) ensure each node is serviced once and routes are continuous per drone. Constraint (4) balances flight distance before and after servicing a node. Constraint (5) enforces capacity limits. Constraints (6) and (7) define binary and non-negative variables.
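The min-max objective in Equation (1) can be evaluated for a candidate set of routes as follows. This is a minimal sketch: `dist` is any callable returning the collision-free distance between two nodes (computed by RRT* in the paper; plain Euclidean distance stands in for it in the toy example below).

```python
def route_length(route, dist):
    """Total collision-free length of one drone's closed tour.

    route: node indices starting and ending at the origin 0.
    dist:  dist(i, j) -> collision-free distance between nodes i and j.
    """
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def makespan(routes, dist):
    """Objective (1): the maximum flight distance over all drones."""
    return max(route_length(r, dist) for r in routes)

# Toy example with Euclidean distance standing in for RRT*.
coords = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 4.0)}
euclid = lambda i, j: ((coords[i][0] - coords[j][0]) ** 2
                       + (coords[i][1] - coords[j][1]) ** 2) ** 0.5
routes = [[0, 1, 0], [0, 2, 0]]
print(makespan(routes, euclid))  # 8.0: drone 2 flies 0 -> 2 -> 0 = 8 m
```

Minimizing this quantity over all feasible assignments is what constraints (2)-(7) make hard: the demands themselves depend on the visiting order via RRT*.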

The proposed TSDRL model is detailed through its two stages, each with a Deep Reinforcement Learning (DRL) model, including Markov Decision Process (MDP) definitions, encoder-decoder policy networks, and training via policy gradients. Different action selection strategies are used to obtain high-quality solutions. The first stage plans the number of drones based on node scale, passed to the second stage, where the policy network selects a drone and assigns tasks iteratively, followed by path planning.

In the first stage, the MDP defines state $S_1 = \{S_g, S_d\}$, where $S_g$ is the global graph feature from the encoder, and $S_d = \{U_{\text{num}}, E_i\}$ includes the current drone count $U_{\text{num}}$ and path points $E_i$ for each drone. Action $A_1 = \{a_{1,t}\} = \{x_i^t\}$ selects node $x_i$ at step $t$. State transition $\tau_1$ updates $S_d$ based on actions, such as incrementing $U_{\text{num}}$ if capacity is exceeded or appending nodes to paths. Reward $R$ is the negative maximum flight distance. The policy $\pi_{\theta_1}$ outputs a probability distribution over nodes, with the joint probability of a solution $\pi$ given by $p(\pi|s) = \prod_{t=1}^T p_{\theta_1}(\pi_t|s_{t-1}, \pi_{t-1})$.

The first-stage policy network uses an encoder with an embedding layer and $N$ attention modules. Initial node embedding $h_i^{(0)} = W^x \times x_i + b^x$ with dimension 128 is processed through multi-head attention (MHA) and feed-forward (FF) layers with skip connections and batch normalization (BN). The graph embedding $h_g^{(N)}$ is the mean of node embeddings. The decoder constructs context embedding $h_c^{(t)}$ using $h_g^{(N)}$, the last node’s embedding, and remaining distance, then applies glimpse attention to compute compatibility scores, followed by softmax for node probabilities.
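One encoder layer of the kind described above can be sketched in NumPy. This is a simplification under stated assumptions: single-head attention instead of the 8-head MHA, and no batch normalization or feed-forward sublayer; it only illustrates the compatibility-softmax-aggregate pattern and the mean graph embedding.

```python
import numpy as np

def attention_layer(H, Wq, Wk, Wv):
    """Single-head self-attention with a skip connection.
    H: (n, d) node embeddings; Wq/Wk/Wv: (d, d) projection matrices."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(H.shape[1])            # (n, n) compatibilities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return H + weights @ V                            # residual (skip) connection

rng = np.random.default_rng(0)
n, d = 5, 8
H0 = rng.normal(size=(n, d))      # stands in for h_i^(0) = W^x x_i + b^x
H1 = attention_layer(H0, *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
h_graph = H1.mean(axis=0)         # graph embedding: mean of node embeddings
print(H1.shape, h_graph.shape)    # (5, 8) (8,)
```

In the actual model the embedding dimension is 128 and $N$ such modules are stacked, with the FF and BN sublayers omitted here applied after each attention step.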

In the second stage, the MDP defines state $S_{2,t} = (V_t, X_t)$, where $V_t = \{v_1^t, \ldots, v_{U_{\text{num}}}^t\} = \{(O_1^t, G_1^t), \ldots, (O_{U_{\text{num}}}^t, G_{U_{\text{num}}}^t)\}$ includes the remaining flight distance $O_i^t$ and partial path $G_i^t$ of each drone, and $X_t = \{(p_0, d_0^t), \ldots, (p_n, d_n^t)\}$ includes node positions and demands. Action $A_{2,t} = \{a_{2,t}\} = \{v_i^t, x_j^t\}$ selects drone $v_i$ and node $x_j$ at step $t$. State transition $\tau_2$ updates $V_t$ and $X_t$ by reducing the remaining distance and appending nodes to paths. Reward $R$ is the negative maximum flight distance, so maximizing reward minimizes the makespan. The policy $\pi_{\theta_2}$ is trained to maximize the expected reward, with joint probability $P(S_{2,\Upsilon}|S_{2,0}) = \prod_{t=1}^{\Upsilon} p_{\theta_2}(a_{2,t}|S_{2,t})$.
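The transition $\tau_2$ amounts to a plain state update. The sketch below uses hypothetical field names (lists for $O_i^t$ and $G_i^t$, a dict for demands); the real demands are recomputed by RRT* as the path grows.

```python
def transition(remaining, paths, demands, drone, node):
    """One step of tau_2: the selected drone serves the selected node.

    remaining: remaining flight distances O_i^t, one per drone.
    paths:     partial paths G_i^t, one per drone.
    demands:   current demand d_j^t per node (collision-free leg length).
    """
    remaining = list(remaining)            # copy so the old state is untouched
    paths = [list(p) for p in paths]
    remaining[drone] -= demands[node]      # spend flight distance on this leg
    paths[drone].append(node)              # append node to this drone's path
    return remaining, paths

# Two drones with 400 m capacity, both at the origin; drone 0 serves node 1.
rem, paths = transition([400.0, 400.0], [[0], [0]], {1: 35.0, 2: 50.0}, 0, 1)
print(rem, paths)  # [365.0, 400.0] [[0, 1], [0]]
```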

The second-stage policy network features an encoder with dual batch normalization and gated aggregation modules. Node embeddings are computed through MHA and FF layers with BN and gating: $e_{\text{MHA}}^{(l)} = G(e^{(l-1)}, \text{MHA}^{(l)}(\text{BN}(e^{(l-1)})))$ and $e_{\text{node}} = G(e_{\text{MHA}}^{(l)}, \text{FF}(\text{BN}(e_{\text{MHA}}^{(l)})))$, where $G(e_{\text{in}}, e_{\text{out}}) = e_{\text{in}} + \sigma(W e_{\text{in}} + b) \odot e_{\text{out}}$, with $\sigma$ the sigmoid function and $\odot$ element-wise multiplication. The decoder includes a UAV selection module that constructs a feature context $C_V^t$ for each drone, including its last node's coordinates and remaining distance, then uses linear projection and FF layers to compute the drone feature embedding $H_V^t$. The path feature embedding $H_R^t$ is derived from max-pooled path contexts. The concatenated features are processed to output drone selection probabilities $p_V^t$ via softmax.
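The gate $G$ replaces the plain residual connection $e_{\text{in}} + e_{\text{out}}$ with a learned, element-wise blend. A minimal NumPy sketch (with $W$ and $b$ as explicit arguments rather than learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_aggregation(e_in, e_out, W, b):
    """G(e_in, e_out) = e_in + sigmoid(W e_in + b) * e_out.
    The sigmoid gate decides, per dimension, how much of the
    sublayer output e_out to add onto the residual stream e_in."""
    return e_in + sigmoid(e_in @ W + b) * e_out

d = 4
e_in = np.ones(d)
e_out = np.full(d, 2.0)
# With zero weights the gate is sigmoid(0) = 0.5 everywhere:
out = gated_aggregation(e_in, e_out, np.zeros((d, d)), np.zeros(d))
print(out)  # [2. 2. 2. 2.]  (1 + 0.5 * 2)
```

When the gate saturates at 1 this reduces to the standard skip connection used in the first-stage encoder; near 0 it can suppress an unhelpful sublayer output.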

A graph aggregation module dynamically updates graph embeddings based on visited and unvisited nodes. For visited nodes, embedding $e_{\text{node}}$ is weighted by a mask $mask_v^t$, then summed and max-pooled to form $H_v^t$. Similarly, unvisited node embedding $H_{\text{unv}}^t$ is computed. Context embedding $e_{\text{context}}^t$ concatenates $H_v^t$, $H_{\text{unv}}^t$, and the first and last node embeddings. This context is used in MHA with node embeddings to compute attention scores, followed by softmax for node probabilities. Masks are updated after each action.
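The dynamic context construction above can be sketched as masked pooling over the static node embeddings. Function and variable names here are illustrative, not the paper's; sum and max pooling are concatenated for each of the visited and unvisited partitions.

```python
import numpy as np

def graph_context(e_node, visited_mask, first_idx, last_idx):
    """Dynamic context from visited/unvisited node embeddings.

    e_node: (n, d) node embeddings from the encoder.
    visited_mask: (n,) boolean, True for already-visited nodes.
    first_idx/last_idx: first and most recent node on the current path.
    """
    d = e_node.shape[1]

    def pool(E):
        # Sum + max pooling of a node subset; zeros if the subset is empty.
        if E.size == 0:
            return np.zeros(2 * d)
        return np.concatenate([E.sum(axis=0), E.max(axis=0)])

    H_v = pool(e_node[visited_mask])       # visited-node summary
    H_unv = pool(e_node[~visited_mask])    # unvisited-node summary
    return np.concatenate([H_v, H_unv, e_node[first_idx], e_node[last_idx]])

e = np.arange(12.0).reshape(4, 3)          # 4 nodes, dimension 3
mask = np.array([True, True, False, False])
ctx = graph_context(e, mask, first_idx=0, last_idx=1)
print(ctx.shape)  # (18,): 2d + 2d + d + d
```

Because the mask changes after every action, this context (unlike a fixed mean graph embedding) tracks how the remaining problem shrinks as paths are built.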

Both stages are trained using REINFORCE with a rollout baseline. The policy network parameters $\theta = (\theta_1, \theta_2)$ are updated via gradient ascent on the expected reward, with the baseline network $\theta_{\text{bl}} = (\theta_{\text{bl}}^1, \theta_{\text{bl}}^2)$ updated whenever the policy network significantly outperforms it. The gradient update is:

$$
\nabla_\theta L(\theta|B) = -\mathbb{E}_{p_\theta(B)}\left[(R(\pi) - R(\pi_{\text{bl}})) \nabla_\theta \log p_\theta(\pi|B)\right]
$$

where $\pi$ is sampled from $p_\theta$, and $\pi_{\text{bl}}$ is greedily selected from $p_{\theta_{\text{bl}}}$. The Adam optimizer is used for updates.
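The advantage-weighted surrogate whose gradient matches this update can be sketched numerically. This is a minimal illustration: in practice autograd differentiates through the policy network, whereas here scalar log-probabilities stand in for $\log p_\theta(\pi|B)$.

```python
import numpy as np

def reinforce_surrogate(rewards, baseline_rewards, log_probs):
    """Mean of (R(pi) - R(pi_bl)) * log p(pi) over a batch.

    Differentiating this surrogate w.r.t. the policy parameters
    (via autograd) yields the REINFORCE gradient with a greedy
    rollout baseline; the negation is the loss that is minimized.
    """
    advantage = np.asarray(rewards) - np.asarray(baseline_rewards)
    return float((advantage * np.asarray(log_probs)).mean())

# Two sampled solutions vs. the greedy baseline rollout:
surrogate = reinforce_surrogate(
    rewards=[1.0, 2.0],            # R(pi), negated makespans
    baseline_rewards=[1.5, 1.5],   # R(pi_bl) from the frozen baseline net
    log_probs=[0.1, 0.2],          # stand-ins for log p_theta(pi|B)
)
print(round(surrogate, 6))  # 0.025
```

The baseline subtraction only shifts the expectation's variance, not its mean, so solutions better than the greedy rollout are reinforced and worse ones suppressed.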

Simulations use a Boeing 737-300 model, with node datasets generated via existing methods for scales 30-1 (30 nodes, 1 origin, max flight distance 400 m), 50-1, and 90-1. Each scale has 20,000 training instances and 100 test instances. Nodes include 3D coordinates and RRT*-computed collision-free distances. The origin is randomly generated outside node coordinate ranges. Implementation uses PyTorch on a GPU, with testing on an Intel Core i9 CPU. Training involves 100 epochs, 1,000 instances per epoch, batch size 100, and Adam optimizer with learning rate $1 \times 10^{-4}$. Evaluation metrics include average drone count (#UAVs), maximum flight distance (obj), solving time (Time), and gap from the best solution.

Comparative models include: SISRs, an advanced heuristic for VRP; SA, an improved simulated annealing method; AM, an attention-based DRL method; and GAT, a graph attention network approach. TSDRL is evaluated under greedy and sample-1,000 strategies. The gap is computed relative to SISRs as $\text{Gap} = \frac{L - L_{\text{SISRs}}}{L_{\text{SISRs}}} \times 100\%$, where $L$ is the evaluated method's objective; positive values indicate a longer maximum flight distance than SISRs.
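The gap metric is a one-liner; the sign convention is chosen so that positive values mean worse than the SISRs reference, which is how the tables report it.

```python
def gap_percent(obj, obj_sisrs):
    """Relative gap to the SISRs reference objective, in percent.
    Positive means a longer (worse) maximum flight distance."""
    return (obj - obj_sisrs) / obj_sisrs * 100.0

# TSDRL(Sample 1,000) at the 90-1 scale vs. the SISRs reference:
print(round(gap_percent(352.63, 329.16), 2))  # 7.13
```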

Learning curves for 90-1 scale show TSDRL converges faster and achieves better rewards than AM and GAT. The graph aggregation module dynamically perceives path states and node changes, enabling efficient node selection. Path planning illustrations demonstrate TSDRL’s effectiveness in generating optimal drone formation paths. Tables 1-3 summarize results for different scales, indicating TSDRL’s superiority in solution quality and time efficiency over AM and GAT, and competitive performance against SISRs with faster computation. As problem scale increases, SISRs and SA times grow exponentially, while TSDRL scales linearly, making it suitable for airline inspection.

Table 1: Performance comparison for 30-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| SISRs | 209.45 | 334.68 | 1 | 0.00% |
| SA | 259.52 | 133.45 | 2 | 14.36% |
| AM(Greedy) | 362.83 | 0.59 | 1 | 73.22% |
| AM(Sample 1,000) | 285.46 | 5.80 | 1 | 36.29% |
| GAT(Greedy) | 302.74 | 0.96 | 1 | 44.54% |
| GAT(Sample 1,000) | 248.16 | 20.15 | 1 | 28.37% |
| TSDRL(Greedy) | 296.43 | 0.57 | 1 | 41.52% |
| TSDRL(Sample 1,000) | 248.16 | 5.80 | 1 | 18.45% |
Table 2: Performance comparison for 50-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| SISRs | 257.12 | 589.46 | 2 | 0.00% |
| SA | 348.62 | 317.23 | 3 | 35.58% |
| AM(Greedy) | 371.96 | 0.62 | 2 | 44.66% |
| AM(Sample 1,000) | 350.20 | 6.60 | 2 | 36.20% |
| GAT(Greedy) | 367.87 | 0.88 | 2 | 43.07% |
| GAT(Sample 1,000) | 344.94 | 24.48 | 2 | 34.15% |
| TSDRL(Greedy) | 354.21 | 0.59 | 2 | 38.20% |
| TSDRL(Sample 1,000) | 334.59 | 6.40 | 2 | 30.12% |
Table 3: Performance comparison for 90-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| SISRs | 329.16 | 1108.47 | 4 | 0.00% |
| SA | 366.55 | 784.62 | 5 | 11.3% |
| AM(Greedy) | 389.45 | 0.66 | 4 | 18.31% |
| AM(Sample 1,000) | 367.25 | 8.80 | 4 | 11.57% |
| GAT(Greedy) | 370.86 | 1.33 | 4 | 12.66% |
| GAT(Sample 1,000) | 364.83 | 32.15 | 4 | 10.83% |
| TSDRL(Greedy) | 368.32 | 0.61 | 4 | 11.89% |
| TSDRL(Sample 1,000) | 352.63 | 8.60 | 4 | 7.14% |

Ablation studies evaluate the impact of the dual batch normalization with gated aggregation and the graph aggregation modules. Comparisons include TSDRL-N1 (without graph aggregation), TSDRL-N2 (without dual BN and gating), TSDRL-N3 (without both modules), and the full TSDRL. Training curves for the 90-1 scale show TSDRL-N1 converges fastest, TSDRL-N3 slowest, and the full TSDRL balances stability and convergence. Tables 4-6 detail the results, indicating synergistic effects between the modules, with the full TSDRL achieving the lowest gaps. For instance, at the 90-1 scale, TSDRL(Sample 1,000) reduces the gap to 7.13%, outperforming the ablated versions.

Table 4: Ablation results for 30-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL-N1(Greedy) | 320.55 | 0.52 | 1 | 53.04% |
| TSDRL-N1(Sample 1,000) | 278.43 | 5.40 | 1 | 32.93% |
| TSDRL-N2(Greedy) | 304.61 | 0.55 | 1 | 45.43% |
| TSDRL-N2(Sample 1,000) | 266.40 | 5.28 | 1 | 27.19% |
| TSDRL-N3(Greedy) | 362.83 | 0.59 | 1 | 73.23% |
| TSDRL-N3(Sample 1,000) | 285.46 | 5.80 | 1 | 36.29% |
| TSDRL(Greedy) | 296.43 | 0.57 | 1 | 41.53% |
| TSDRL(Sample 1,000) | 248.16 | 5.80 | 1 | 18.48% |
Table 5: Ablation results for 50-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL-N1(Greedy) | 370.11 | 0.55 | 2 | 43.94% |
| TSDRL-N1(Sample 1,000) | 345.25 | 5.23 | 2 | 34.28% |
| TSDRL-N2(Greedy) | 360.10 | 0.58 | 2 | 40.05% |
| TSDRL-N2(Sample 1,000) | 340.55 | 5.92 | 2 | 32.45% |
| TSDRL-N3(Greedy) | 371.96 | 0.62 | 2 | 44.66% |
| TSDRL-N3(Sample 1,000) | 350.20 | 6.60 | 2 | 36.20% |
| TSDRL(Greedy) | 354.21 | 0.59 | 2 | 37.76% |
| TSDRL(Sample 1,000) | 334.59 | 6.40 | 2 | 30.13% |
Table 6: Ablation results for 90-1 scale

| Algorithm | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL-N1(Greedy) | 381.43 | 0.59 | 4 | 15.88% |
| TSDRL-N1(Sample 1,000) | 362.71 | 7.82 | 4 | 10.19% |
| TSDRL-N2(Greedy) | 370.98 | 0.63 | 4 | 12.71% |
| TSDRL-N2(Sample 1,000) | 358.64 | 7.82 | 4 | 8.96% |
| TSDRL-N3(Greedy) | 389.45 | 0.66 | 4 | 18.32% |
| TSDRL-N3(Sample 1,000) | 367.25 | 8.80 | 4 | 11.57% |
| TSDRL(Greedy) | 368.32 | 0.61 | 4 | 11.90% |
| TSDRL(Sample 1,000) | 352.63 | 8.60 | 4 | 7.13% |

For computational complexity, let $n$ be the number of nodes and $D$ the embedding dimension. The first-stage encoder costs $O(n^2 \cdot D)$ for MHA and $O(n \cdot D^2)$ for the FF layer per attention module; with $N = 6$ modules the total encoder complexity is $O(N(n^2 D + n D^2))$. The decoder costs $O(n)$ per construction step, giving $O(n^2)$ over a full solution. For the second stage, the dual batch normalization and gated aggregation add $O(n \cdot D)$ per module, for an encoder complexity of $O(N(n^2 D + n D^2 + n D))$. The decoder additionally costs $O(n \cdot D^2)$ for graph aggregation and $O(U_{\text{num}} \cdot D^2)$ for UAV selection per step, resulting in $O(n^2 \cdot D^2)$ overall.

Key parameters like attention heads $M$, embedding dimension $dim$, and encoder layers $N$ are analyzed. For 90-1 scale, optimal values are $M=8$, $dim=128$, $N=6$, achieving the best trade-off between solution quality and time. Higher values increase computation without gains, as shown in Tables 7-9.

Table 7: Impact of attention heads (M) for 90-1 scale

| Parameters | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL(Greedy, M=4) | 379.40 | 0.55 | 4 | 15.26% |
| TSDRL(Sample 1,000, M=4) | 369.12 | 7.2 | 4 | 12.14% |
| TSDRL(Greedy, M=8) | 368.32 | 0.61 | 4 | 11.90% |
| TSDRL(Sample 1,000, M=8) | 352.63 | 8.60 | 4 | 7.13% |
| TSDRL(Greedy, M=16) | 377.43 | 0.72 | 4 | 14.66% |
| TSDRL(Sample 1,000, M=16) | 367.29 | 9.4 | 4 | 11.58% |
Table 8: Impact of embedding dimension (dim) for 90-1 scale

| Parameters | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL(Greedy, dim=128) | 369.22 | 0.66 | 4 | 11.90% |
| TSDRL(Sample 1,000, dim=128) | 352.63 | 8.30 | 4 | 7.13% |
| TSDRL(Greedy, dim=256) | 370.47 | 0.61 | 4 | 12.55% |
| TSDRL(Sample 1,000, dim=256) | 360.12 | 8.60 | 4 | 9.41% |
| TSDRL(Greedy, dim=512) | 375.90 | 0.74 | 4 | 14.20% |
| TSDRL(Sample 1,000, dim=512) | 367.19 | 9.35 | 4 | 11.55% |
Table 9: Impact of encoder layers (N) for 90-1 scale

| Parameters | obj. (m) | Time (s) | #UAVs | Gap |
|---|---|---|---|---|
| TSDRL(Greedy, N=4) | 373.09 | 0.53 | 4 | 13.35% |
| TSDRL(Sample 1,000, N=4) | 368.43 | 7.89 | 4 | 11.93% |
| TSDRL(Greedy, N=6) | 368.32 | 0.61 | 4 | 11.90% |
| TSDRL(Sample 1,000, N=6) | 352.63 | 8.60 | 4 | 7.13% |
| TSDRL(Greedy, N=8) | 369.73 | 0.79 | 4 | 12.33% |
| TSDRL(Sample 1,000, N=8) | 365.87 | 10.41 | 4 | 11.16% |

Generalization tests on Boeing 737-100 models with scales 20-1, 40-1, and 80-1 show TSDRL’s adaptability. Using policies trained on 90-1 scale, TSDRL outperforms AM and GAT in solution quality across scales, with TSDRL(Sample 1,000) showing the best generalization. As scale increases, the advantage of TSDRL becomes more pronounced, demonstrating its robustness for different aircraft models and drone formation tasks.

In conclusion, for rapid airline inspection scenarios, this paper proposes a deep reinforcement learning-based solution model. The first stage determines the optimal number of drones, and the second stage's encoder and decoder enhancements improve node embedding quality and dynamic path-state perception. Once trained offline, TSDRL solves MCMP quickly, outperforming heuristics and other DRL methods in both speed and solution quality. Future work will explore multi-origin problems and larger scales, designing more efficient models for collaborative path planning in autonomous drone formations.
