Multi-Agent Reinforcement Learning for UAV Collaborative Path Planning

In recent years, rapid advances in aviation technology have led to the widespread adoption of unmanned aerial vehicles (UAVs), commonly referred to as drones, in fields such as surveillance, transportation, and rescue operations. UAVs offer high flexibility, low cost, and rapid deployment, making them indispensable in complex environments. However, as their application scope expands, challenges in collaborative operation, particularly path planning, have become increasingly prominent. Traditional UAV path-planning methods often rely on centralized control or simple rule-based scheduling, both of which become inefficient and inflexible in large swarms. Centralized approaches require a single processor to manage every vehicle, leading to computational bottlenecks and slow response times as the swarm grows. Rule-based methods struggle to handle dynamic environmental changes, often producing suboptimal paths or mission failures. This underscores the need for more intelligent and adaptive solutions to improve the performance of UAV swarms in collaborative tasks.

To address these limitations, researchers have explored various approaches to collaborative UAV path planning. Some studies employ deterministic policy search to optimize UAV paths by constructing environmental models and defining objective functions. Although these methods can handle high-dimensional state and action spaces, they are prone to local optima, which leads to low path coverage and incomplete task execution. Random-tree-based algorithms improve search efficiency but often generate paths with poor smoothness and redundant nodes, reducing overall coverage and efficiency. Nature-inspired optimization algorithms have also been proposed, yet they likewise struggle with path smoothness and coverage. In general, these methods achieve low path coverage: the UAVs fail to comprehensively scan or monitor the target area, so critical information can be missed and mission outcomes compromised. This is particularly damaging in applications such as disaster relief or environmental monitoring, where thorough area coverage is essential. There is therefore a pressing demand for techniques that substantially improve the path coverage of UAV swarms in complex scenarios.

In this context, multi-agent reinforcement learning (MARL) emerges as a promising solution for collaborative UAV path planning. MARL enables multiple intelligent agents, each controlling one UAV, to learn effective strategies through cooperation and competition in dynamic environments. By leveraging MARL, UAVs can adapt to real-time changes, avoid obstacles, and coordinate their movements to maximize area coverage. The approach overcomes the drawbacks of traditional methods by distributing the computational load among agents and increasing flexibility. In this article, we propose a MARL-based method for collaborative UAV path planning that aims at high path coverage and efficient task execution. We formulate the problem, design the state and action spaces, define the reward functions, and implement a path evaluation mechanism that selects the optimal flight path. Extensive experiments demonstrate the advantages of our method in path coverage, planning time, and path length, contributing to the intelligent development of UAV technologies.

The core of our approach is to model collaborative UAV path planning within a MARL framework. We begin by describing the task. Consider a swarm of UAVs operating in a complex environment, where each UAV is represented as a point mass moving at constant velocity. Let $P(x, y, z, \theta, \alpha)$ denote the initial pose of a UAV, with $(x, y, z)$ its coordinates and $\theta$ and $\alpha$ its roll and pitch angles, respectively. The starting position is $P$ and the target mission position is $S$. The mathematical model for multi-UAV collaborative path planning can then be expressed as $P \xrightarrow{r(q)} S$, where $r(q)$ is the generated path parameterized by $q$. To simplify the analysis, we assume each UAV flies at a constant speed $v$. The motion equation of the $i$-th UAV at time $k$ is:

$$ \begin{bmatrix} x_i(k+1) \\ y_i(k+1) \end{bmatrix} = \begin{bmatrix} x_i(k) \\ y_i(k) \end{bmatrix} + v \Delta u \begin{bmatrix} \cos \phi(k) \\ \sin \phi(k) \end{bmatrix} $$

Here, $[x_i(k), y_i(k)]$ is the position of the $i$-th UAV at time $k$, $\Delta u$ is the time step, and $\phi(k)$ is the heading angle at time $k$. This formulation lets us propagate each UAV's trajectory over time, which is the basis of the path planner.
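As a concrete illustration, the discrete-time update above can be implemented in a few lines; the function name, speed, and step size below are illustrative placeholders, not values taken from the experiments.

```python
import numpy as np

def step_position(pos, v, dt, phi):
    """One step of the constant-speed point-mass model:
    p(k+1) = p(k) + v * dt * [cos(phi), sin(phi)]."""
    return np.asarray(pos, dtype=float) + v * dt * np.array([np.cos(phi), np.sin(phi)])

# Example: advance a UAV flying at 20 m/s with a 0.1 s step and a 45-degree heading
p_next = step_position([0.0, 0.0], v=20.0, dt=0.1, phi=np.pi / 4)
```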

Next, we construct a two-dimensional map of the UAV flight environment by simplifying the three-dimensional terrain. The map is defined as:

$$ Z(x, y) = a \sin(y + b) + c \sin x + d \cos(e \sqrt{x^2 + y^2}) + f \cos y + g \cos(g \sqrt{x^2 + y^2}) $$

In this equation, $a$, $b$, $c$, $d$, $e$, $f$, and $g$ are custom constants that shape the terrain features, and $Z$ represents the elevation or environmental constraint at each point of the 2D map. The simplification reduces computational complexity while retaining the environmental detail needed for UAV navigation.
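A minimal sketch of sampling this synthetic terrain follows; the constant values and grid extent are arbitrary choices for illustration, not the constants used in our experiments.

```python
import numpy as np

def terrain_height(x, y, a=1.0, b=0.5, c=1.0, d=0.8, e=0.3, f=0.6, g=0.2):
    """Synthetic elevation Z(x, y); the constants a..g shape the terrain."""
    r = np.sqrt(x**2 + y**2)
    return (a * np.sin(y + b) + c * np.sin(x)
            + d * np.cos(e * r) + f * np.cos(y) + g * np.cos(g * r))

# Sample the map on a 100 x 100 grid covering a 10 m x 10 m area
xs, ys = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
Z = terrain_height(xs, ys)
```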

We also incorporate constraints derived from UAV kinematics. Assuming a fixed-wing tilt-rotor model, we calculate the minimum turning radius $R_{\text{min}}$ to ensure paths remain feasible. It is derived from the flight speed and the maximum steering angle:

$$ R_{\text{min}} = \frac{v^2}{h \tan(\beta/2)} $$

Here, $v$ is the flight speed, $h$ is the gravitational acceleration, and $\beta$ is the maximum steering angle. A smaller turning radius allows more agile maneuvers, but excessive turning increases energy consumption. To discourage unnecessary turns and promote straight-line flight, we introduce a turning penalty $X$:

$$ X = \frac{M_s}{\omega_{\text{max}} \tau} $$

where $M_s$ is the turning angle of path $s$, $\omega_{\text{max}}$ is the maximum turning angle of the UAV, and $\tau$ is the maximum penalty value for path $s$. This penalty is integrated into the path evaluation process to improve energy efficiency.
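Both kinematic quantities can be computed directly from the formulas above; this sketch uses the standard value of gravitational acceleration and an illustrative 60° steering limit.

```python
import numpy as np

G = 9.81  # gravitational acceleration (the h term in the text), m/s^2

def min_turning_radius(v, beta):
    """Minimum turning radius R_min = v^2 / (g * tan(beta / 2))."""
    return v**2 / (G * np.tan(beta / 2))

def turning_penalty(turn_angle, omega_max, tau):
    """Turning penalty X = M_s / (omega_max * tau) for a candidate path."""
    return turn_angle / (omega_max * tau)

# Example: a UAV flying at 20 m/s with a 60-degree maximum steering angle
R_min = min_turning_radius(20.0, np.deg2rad(60.0))
```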

With the task described, we generate flyable candidate paths using MARL. The state space of each UAV agent includes its current position, velocity, and heading, together with environmental information such as obstacle locations. The action space consists of discrete or continuous actions such as heading changes or speed increments. To compute the risk value $f$ at the current position, we combine the global state with the environmental constraints:

$$ f = \varepsilon w + (1 - w) X(Z, R_{\text{min}}) $$

Here, $\varepsilon$ represents the global state space, $w$ is the action-space weight, and $X(Z, R_{\text{min}})$ is the path-planning term that accounts for the environment $Z$ and the minimum turning radius $R_{\text{min}}$. The risk value lets each UAV assess the safety of its current location.
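The risk value is a straightforward weighted combination; the helper below simply mirrors the equation, with all argument names chosen for illustration.

```python
def risk_value(epsilon, w, x_env):
    """Risk f = epsilon * w + (1 - w) * x_env, where x_env is X(Z, R_min)
    evaluated at the UAV's current position and w lies in [0, 1]."""
    return epsilon * w + (1.0 - w) * x_env
```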

The reward function for each candidate path guides the UAV toward the target while avoiding obstacles. Using a deep Q-network (DQN), we compute the reward $q(n)$ of the $n$-th path as:

$$ q(n) = j + \upsilon \times f $$

where $j$ is the number of hidden layers in the neural network and $\upsilon$ is the penalty applied when the target is not reached. This reward encourages the UAV to minimize risk while making progress toward the target.
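As a hedged sketch of how such a Q-network could be set up, the network below uses the hidden-layer and neuron counts from Table 2 (three hidden layers of 128 neurons); the class name and layer arrangement are assumptions, and the reward helper simply restates the equation above.

```python
import torch.nn as nn

class PathQNetwork(nn.Module):
    """Small MLP Q-network; sizes follow Table 2 (3 hidden layers, 128 neurons each)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

def path_reward(j, upsilon, f):
    """Candidate-path reward q(n) = j + upsilon * f, as defined in the text."""
    return j + upsilon * f
```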

To handle obstacle avoidance, we employ a grid-based map representation in which the environment is divided into cells marked as free space or obstacle. We train the UAV agents with policy gradient methods, with the gradient $J$ defined as:

$$ J = -\frac{1}{q(n)} \sum_{i=1}^m Q(\omega, \delta, \gamma) $$

Here, $m$ is the number of accumulated reward terms, $\omega$ is the state value, $\delta$ is the action value, and $\gamma$ is the policy value. This enables the UAV agents to learn effective policies through trial and error.
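The sketch below illustrates the two ingredients of this step under simplifying assumptions: a binary occupancy grid for the 10 m × 10 m area, and a textbook REINFORCE-style gradient estimate rather than the exact $J$ above; the function names and grid resolution are placeholders.

```python
import numpy as np

def occupancy_grid(obstacles, size=10.0, resolution=1.0):
    """Binary grid map of the flight area: 1 marks an obstacle cell, 0 a free cell."""
    cells = int(size / resolution)
    grid = np.zeros((cells, cells), dtype=np.int8)
    for ox, oy in obstacles:
        grid[int(oy / resolution), int(ox / resolution)] = 1
    return grid

def policy_gradient_estimate(log_prob_grads, returns):
    """Return-weighted average of log-policy gradients (standard REINFORCE form)."""
    return -np.mean([G_t * g for g, G_t in zip(log_prob_grads, returns)], axis=0)
```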

Furthermore, we predict each UAV's future positions using a dynamic window method. Based on the current state, we estimate the next possible position $\xi(k+1)$ at time $k+1$:

$$ \xi(k+1) = J + \chi \cos \mu(k) $$

where $\chi$ is a feature vector of the flight environment and $\mu(k)$ is the tilt angle at time $k$. By iterating this prediction, we generate a set of flyable candidate paths for each UAV, which are then evaluated so that the best one can be selected for collaborative planning.
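The sketch below shows one simplified way to roll such candidate paths out of the constant-speed motion model; it holds each heading fixed over a short horizon, which is an assumption, since a full dynamic-window planner would also sample speeds and re-plan at every step.

```python
import numpy as np

def candidate_paths(pos, v, dt, headings, horizon=10):
    """Roll out the constant-speed motion model for a fan of candidate headings."""
    paths = []
    for phi in headings:
        p = np.asarray(pos, dtype=float)
        path = [p.copy()]
        for _ in range(horizon):
            p = p + v * dt * np.array([np.cos(phi), np.sin(phi)])
            path.append(p.copy())
        paths.append(np.array(path))
    return paths

# Example: nine candidate headings spanning -45 to +45 degrees around the current course
cands = candidate_paths([0.0, 0.0], v=20.0, dt=0.1,
                        headings=np.deg2rad(np.linspace(-45, 45, 9)))
```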

The path evaluation process involves computing a coverage reward function and combining it with the turning penalty. For the $\beta$-th candidate path, the coverage reward $E_\beta$ is given by:

$$ E_\beta = \frac{1}{\xi(k+1)} \sum l_0 $$

where $l_0$ represents the detectable area in the flight environment. A higher $E_\beta$ indicates better coverage of the target region. We then construct a comprehensive path evaluation function $Q_\kappa$ for the $\kappa$-th path:

$$ Q_\kappa = \omega_1 E_\beta + \omega_2 X $$

Here, $\omega_1$ and $\omega_2$ are weights that balance coverage against the turning penalty. The path with the highest $Q_\kappa$ value is selected as the UAV's flight path. This evaluation ensures that the chosen path maximizes area coverage while limiting the energy cost of turning.
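A minimal sketch of the selection step follows; the weight values are examples only, and in practice $\omega_2$ would be chosen (for instance, negative) so that a larger turning penalty lowers the score rather than raising it.

```python
def evaluate_path(coverage_reward, turn_penalty, w1=0.7, w2=-0.3):
    """Comprehensive score Q = w1 * E_beta + w2 * X (illustrative weights)."""
    return w1 * coverage_reward + w2 * turn_penalty

def select_best_path(paths, coverage_rewards, turn_penalties, w1=0.7, w2=-0.3):
    """Return the candidate path with the highest evaluation score."""
    scores = [evaluate_path(e, x, w1, w2)
              for e, x in zip(coverage_rewards, turn_penalties)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return paths[best]
```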

To validate our MARL-based method, we conducted extensive experiments in a simulated environment. The experimental platform combined a Crazyflie UAV control system implemented in Python with a Unity3D simulation for visualizing the flights. We designed a 10 m × 10 m flight area divided into grid cells, with randomly placed static obstacles to mimic complex air traffic. The key UAV parameters are summarized in Table 1.

Table 1: UAV Flight Parameters
Parameter Value
Takeoff Coordinates (0, 0, 5)
Target Coordinates (9, 9, 5)
Number of Obstacles 5
Safety Distance 10 m
Maximum Flight Speed 20 m/s
Maximum Flight Height 100 m
Minimum Flight Height 5 m
Flight Time 45 s

We configured the MARL algorithm with the parameters listed in Table 2 to train the UAV agents; a configuration sketch in code follows the table.

Table 2: Multi-Agent Reinforcement Learning Algorithm Parameters
Parameter Value
Learning Rate 0.001
Discount Factor 0.900
Experience Replay Pool Size 10,000
Batch Size 32
Number of Agents 10
Hidden Layers 3
Neurons per Hidden Layer 128
Training Episodes 100
Exploration Rate 0.100
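For readers wiring these values into code, the dictionary below is one hypothetical way the Table 2 settings might be collected; the key names are assumptions, not identifiers from our implementation.

```python
# Hypothetical training configuration mirroring Table 2
marl_config = {
    "learning_rate": 1e-3,
    "discount_factor": 0.9,
    "replay_buffer_size": 10_000,
    "batch_size": 32,
    "num_agents": 10,
    "hidden_layers": 3,
    "neurons_per_layer": 128,
    "training_episodes": 100,
    "exploration_rate": 0.1,
}
```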

We evaluated performance using three metrics: path coverage, planning time, and path length. Path coverage $C$ is the ratio of the area covered by the UAV's path to the total target area:

$$ C = \frac{S'}{S} \times 100\% $$

where $S$ is the total target area and $S'$ is the covered area. Planning time $T_{\text{plan}}$ measures computational efficiency as the average time from receiving the planning inputs to producing a path:

$$ T_{\text{plan}} = \frac{1}{M} \sum_{j=1}^M (t_{j,\text{end}} - t_{j,\text{start}}) $$

Here, $t_{j,\text{end}}$ and $t_{j,\text{start}}$ are the end and start timestamps of the $j$-th run, and $M$ is the number of runs. Path length $L$ is the total distance traveled by the UAV:

$$ L = \sum_{i=1}^{N-1} \sqrt{(x_{i+1} – x_i)^2 + (y_{i+1} – y_i)^2} $$

where $N$ is the number of discrete sampling points along the path, and $(x_i, y_i)$ are the coordinates of the $i$-th point.
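These three metrics are simple to compute; the helpers below are a sketch that assumes the covered area and the path waypoints are already available from the simulation.

```python
import numpy as np

def path_coverage(covered_area, total_area):
    """Coverage C = S' / S * 100%."""
    return covered_area / total_area * 100.0

def mean_planning_time(start_times, end_times):
    """Average planning time over M runs, given matching start/end timestamps."""
    return float(np.mean(np.asarray(end_times) - np.asarray(start_times)))

def path_length(points):
    """Total Euclidean length of a path given as an (N, 2) array of waypoints."""
    pts = np.asarray(points, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))
```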

We compared our method with three existing approaches: Method 1 (deterministic policy search), Method 2 (random-tree-based algorithm), and Method 3 (nature-inspired optimization). The maximum path coverage results are shown in Figure 2: our method achieved 92% coverage, significantly higher than the 66%, 70%, and 75% of the other methods. This demonstrates that the MARL-based approach lets the UAVs cover more of the target area, reducing the risk of missed spots in missions such as surveillance or search and rescue.

We also analyzed planning time and path length to assess efficiency and path quality. As summarized in Table 3, our method outperformed the others with a planning time of 120 ms and a path length of 28.5 m, indicating faster computation and shorter, more efficient flight paths. In contrast, Method 1 required 200 ms and 35.2 m, while Methods 2 and 3 produced intermediate values. The advantage stems from the coordination among UAV agents, which allows real-time adaptation and path selection based on the evaluation function.

Table 3: Comprehensive Comparison of Planning Time and Path Length
Method Planning Time (ms) Path Length (m)
Our Method 120 28.5
Method 1 200 35.2
Method 2 180 32.8
Method 3 150 30.1

The high path coverage achieved by our method is crucial in applications where comprehensive area monitoring is essential. In disaster response, UAVs must scan large regions to locate survivors or assess damage, and low coverage could leave areas unsearched and delay rescue efforts. In environmental monitoring, incomplete coverage can yield inaccurate data. Our MARL-based approach addresses these issues by having the UAVs collaboratively plan paths that maximize coverage, even in complex environments with obstacles.

Another advantage of our method is scalability. As the number of UAVs increases, the multi-agent system distributes the computational load, avoiding the bottlenecks of centralized control. Each agent learns independently while cooperating with the others, enabling efficient path planning for large swarms. This makes the method suitable for real-world scenarios involving dozens or hundreds of UAVs, such as agricultural monitoring or traffic management.

We also examined the effect of different parameters on performance. Varying the learning rate or discount factor changes the convergence speed and solution quality. Ablation studies showed that the parameters in Table 2 provide a good balance between exploration and exploitation, allowing the agents to learn effective policies without excessive training time. Adjusting the weights $\omega_1$ and $\omega_2$ in the evaluation function can further tailor the paths to specific mission requirements, for example prioritizing coverage over energy efficiency or vice versa.

Despite the promising results, our approach has limitations. The simulation environment, while realistic, does not capture all real-world complexities, such as wind disturbances or communication delays between UAVs. Future work should include tests on physical UAV swarms to validate robustness. Moreover, the current method relies on a 2D simplification of the environment; extending it to 3D would improve applicability in mountainous or urban terrain where altitude variations matter.

In conclusion, we have presented a multi-agent reinforcement learning method for collaborative UAV path planning that significantly improves path coverage. By formulating the problem with environmental constraints, designing appropriate state and action spaces, and implementing a path evaluation mechanism, the method enables UAVs to generate and select effective flight paths. Experiments show a maximum path coverage of 92%, together with reduced planning time and path length compared with existing methods, contributing to the advancement of intelligent UAV technologies in surveillance, rescue, and beyond. Future research will focus on improving algorithmic efficiency, integrating real-time sensor data, and exploring hybrid approaches with other machine learning techniques to further enhance UAV swarms.

Looking ahead, integrating MARL with emerging technologies such as 5G communication and edge computing could enable even faster and more reliable UAV operations; real-time data exchange between UAVs and ground stations, for example, could improve coordination in dynamic environments. Incorporating human-in-the-loop control might also allow interactive path planning in complex missions. As UAV usage continues to grow, methods like ours will play a vital role in ensuring safe, efficient, and effective collaborative operations, opening new possibilities for autonomous systems.
