The rapid advancement of UAV drone technology has catalyzed widespread adoption of these platforms across military, industrial, and civilian domains. The emergence of low-altitude economies as a strategic frontier further accelerates this trend, pushing operational scales and mission complexity to unprecedented levels. While single-UAV operations are well-studied, the superior capabilities of UAV swarms—encompassing distributed perception, cooperative task execution, and inherent redundancy—are indispensable for large-scale, complex missions such as urban logistics, disaster response, and precision agriculture. The core enabler for these swarms is intelligent, real-time trajectory planning, which must simultaneously optimize for global mission objectives while adhering to stringent safety, dynamic, and resource constraints. Traditional planning methods, including graph-search, sampling-based, and numerical optimization algorithms, often falter in the face of the high-dimensional, dynamic, and partially observable environments characteristic of UAV drone swarm operations. Their reliance on precise models and their computational intensity limit scalability and real-time adaptability.

Reinforcement Learning (RL) has emerged as a powerful, model-free paradigm to address these challenges. By learning optimal policies through continuous trial-and-error interaction with an environment, RL agents can develop adaptive strategies for complex decision-making tasks. For UAV drone swarm trajectory planning, RL offers a promising framework to generate cooperative paths that balance multiple objectives like time efficiency, energy consumption, and collision avoidance, all without requiring an explicit, accurate dynamical model of the world. This article provides a comprehensive, first-person review of RL-driven methods for UAV swarm trajectory planning. I begin by formalizing the problem and its core challenges, followed by a systematic overview of foundational and advanced RL algorithms tailored for this domain. I then dissect their application across key sub-problems, synthesize current achievements and limitations, and conclude with a perspective on future research directions.
1. Problem Formalization and Core Challenges in UAV Swarm Trajectory Planning
UAV Swarm Trajectory Planning can be formally defined as the process of generating a set of spatially and temporally coordinated paths for a group of UAV drone agents, denoted as $\mathcal{A} = \{1, 2, \ldots, N\}$, from their initial states to designated goal regions, while satisfying a set of mission-specific objectives $\mathcal{J}$ and constraints $\mathcal{C}$ over a time horizon $T$.
The state of the $i$-th UAV drone at time $t$ is typically represented as $\mathbf{s}_i^t = [\mathbf{p}_i^t, \mathbf{v}_i^t, \boldsymbol{\psi}_i^t, e_i^t, \ldots]$, where $\mathbf{p}$ denotes position, $\mathbf{v}$ velocity, $\boldsymbol{\psi}$ attitude, and $e$ remaining energy. The joint state of the swarm is $\mathbf{S}^t = \{\mathbf{s}_1^t, \mathbf{s}_2^t, \ldots, \mathbf{s}_N^t\}$. The environment state includes static ($\mathcal{O}_{static}$) and dynamic obstacles ($\mathcal{O}_{dynamic}^t$). The collective action at time $t$ is $\mathbf{A}^t = \{\mathbf{a}_1^t, \mathbf{a}_2^t, \ldots, \mathbf{a}_N^t\}$, where $\mathbf{a}_i^t$ could be a discrete command or a continuous control input like acceleration or angular rate.
The optimization objective is to find a joint policy $\pi^*(\mathbf{A}^t | \mathbf{S}^t)$ that maximizes the expected cumulative reward $R$:
$$
\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T} \gamma^t R(\mathbf{S}^t, \mathbf{A}^t, \mathbf{S}^{t+1}) \right]
$$
where $\gamma \in [0, 1)$ is a discount factor. The reward function $R$ is designed to encapsulate mission goals such as reaching targets, minimizing time or energy, and penalizing constraint violations like collisions.
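As a concrete illustration of the discounted-return objective above, the following minimal sketch computes $\sum_t \gamma^t r_t$ for one completed episode by backward accumulation; the reward values used are purely illustrative, not from any particular mission:

```python
def discounted_return(rewards, gamma=0.99):
    """Backward accumulation of the discounted return
    G = sum_t gamma^t * r_t for a completed episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g
```

The backward recursion avoids recomputing powers of $\gamma$ and is the standard way returns are computed in practice.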
The primary constraints $\mathcal{C}$ include:
1. Dynamical Feasibility: Trajectories must respect the UAV drone's motion constraints: $|\mathbf{v}_i| \leq v_{max}$, $|\mathbf{a}_i| \leq a_{max}$, $\kappa_i \leq \kappa_{max}$ (curvature).
2. Collision Avoidance: Maintain a minimum safe distance $d_{safe}$ between all agents and obstacles: $||\mathbf{p}_i^t - \mathbf{p}_j^t||_2 \geq d_{safe}, \forall i \neq j$, and $||\mathbf{p}_i^t - \mathbf{o}_k^t||_2 \geq d_{safe}, \forall k$.
3. Connectivity Maintenance: For cooperative tasks, the communication graph $\mathcal{G}^t$ must remain connected or possess a minimum algebraic connectivity $\lambda_2(\mathcal{G}^t) > 0$ to ensure information flow.
4. Formation & Cooperative Constraints: Specific relative positions or task allocation must be maintained: $\mathbf{p}_i^t = \mathbf{p}_{lead}^t + \mathbf{d}_i$, where $\mathbf{d}_i$ is the desired offset in a formation.
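The first two constraints above reduce to simple per-step predicates. A minimal sketch, with illustrative threshold values (`v_max`, `d_safe` are placeholders, not values from the text):

```python
import math

def feasible(p_i, p_j, v_i, v_max=15.0, d_safe=2.0):
    """Check the velocity bound |v_i| <= v_max and the pairwise
    separation ||p_i - p_j|| >= d_safe from the constraint list."""
    speed_ok = math.sqrt(sum(v * v for v in v_i)) <= v_max
    separation = math.dist(p_i, p_j)
    return speed_ok and separation >= d_safe
```

In a full planner, such predicates would be evaluated for every agent pair and every obstacle at each step, or enforced implicitly through reward penalties as discussed below.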
The operational scenarios for UAV drone swarms define the specific instantiation of $\mathcal{J}$ and $\mathcal{C}$. The table below summarizes key scenarios and their associated planning challenges.
| Mission Scenario | Primary Objectives ($\mathcal{J}$) | Key Constraints & Challenges ($\mathcal{C}$) |
|---|---|---|
| Urban Air Logistics | Minimize delivery time, maximize throughput. | Dynamic urban obstacles (buildings, vehicles), strict no-fly zones, high-density traffic management, energy limits for long routes. |
| Area Search & Surveillance | Maximize coverage rate, minimize time-to-detect targets. | Efficient coverage path generation, obstacle avoidance in complex terrain, dynamic re-planning upon target discovery, limited endurance. |
| Precision Agriculture / Power Inspection | Complete coverage of area/linear asset, minimize operational time. | Following complex geometric patterns (crop rows, power lines), avoiding static structures (pylons, trees), maintaining sensor orientation. |
| Dynamic Show / Aerial Light Shows | Precise tracking of spatiotemporal trajectories for artistic patterns. | Extremely tight relative positioning, strict synchronization, smooth and continuous motion, high safety assurance in dense formations. |
| Military Covert Ops / Strike | Maximize mission success probability, minimize exposure/detection risk. | Adversarial dynamic threats, electronic warfare (denied GPS/communication), stealth constraints (low observability paths), complex multi-target assignment. |
The core challenges that exacerbate the problem complexity are:
• Curse of Dimensionality: The joint state-action space grows exponentially with the number of UAV drones $N$, making centralized planning intractable for large swarms.
• Partial Observability & Non-Stationarity: In decentralized settings, each agent has a limited local view. Furthermore, from a single agent’s perspective, the environment is non-stationary because other learning agents are simultaneously changing their policies.
• Multi-Objective Trade-offs: The reward function must carefully balance often competing goals, e.g., speed vs. energy, aggressiveness vs. safety, individual vs. team reward.
• Sim-to-Real Gap: Policies trained in simulation frequently degrade in the real world due to modeling inaccuracies in dynamics, sensors, and environment.
2. A Taxonomy of Reinforcement Learning Methods for UAV Planning
RL provides a suite of algorithms to learn the optimal policy $\pi^*$. I categorize them based on their underlying principles and suitability for different aspects of the UAV drone swarm planning problem.
2.1 Foundational Single-Agent RL Algorithms
These algorithms form the basis, often applied to single-UAV problems or used as building blocks within multi-agent frameworks.
Value-Based Methods (e.g., DQN and variants): These algorithms learn an action-value function $Q(\mathbf{s}, \mathbf{a})$, representing the expected return of taking action $\mathbf{a}$ in state $\mathbf{s}$. The Deep Q-Network (DQN) uses a neural network to approximate $Q$ in high-dimensional state spaces. It employs experience replay and a target network for stability. The policy is derived by selecting the action with the highest Q-value: $\pi(\mathbf{s}) = \arg\max_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a})$. DQN is inherently designed for discrete action spaces. Extensions like Double DQN and Dueling DQN address overestimation bias and improve learning efficiency. For a UAV drone with discrete movement commands (e.g., move north, east, south, west, hover), DQN can be directly applied. The update rule for Double DQN is:
$$
y = r + \gamma Q_{\hat{\theta}} \left( \mathbf{s}', \arg\max_{\mathbf{a}'} Q_{\theta}(\mathbf{s}', \mathbf{a}') \right)
$$
where $\theta$ are the parameters of the online network and $\hat{\theta}$ are the parameters of the target network.
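In code, the Double DQN target decouples action selection (online network) from evaluation (target network). A minimal sketch operating on per-action Q-value lists rather than full networks:

```python
def double_dqn_target(r, gamma, q_online_next, q_target_next, done=False):
    """Double DQN target: the online network picks the greedy next
    action, the target network scores it, reducing overestimation."""
    if done:
        return r  # no bootstrap past a terminal state
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]
```

Note how the two networks can disagree: the online network may select an action whose target-network value is modest, which is exactly the mechanism that damps overestimation bias.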
Policy Gradient Methods (e.g., REINFORCE, PPO): These methods parameterize the policy directly as $\pi_{\theta}(\mathbf{a}|\mathbf{s})$ and optimize the parameters $\theta$ to maximize the expected reward $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$. The gradient is estimated as:
$$
\nabla_{\theta} J(\theta) \approx \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t | \mathbf{s}_t) \hat{A}_t \right]
$$
where $\hat{A}_t$ is an estimate of the advantage function. Proximal Policy Optimization (PPO) is a highly popular policy gradient variant that constrains policy updates to prevent destructively large steps, using a clipped objective:
$$
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$
where $r_t(\theta) = \frac{\pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)}{\pi_{\theta_{old}}(\mathbf{a}_t|\mathbf{s}_t)}$. PPO is stable, robust, and relatively easy to tune, making it excellent for training UAV drones with continuous control outputs (e.g., thrust and torque), though as an on-policy method it can demand many environment samples.
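The clipped surrogate can be written per sample as follows; the outer `min` keeps the update pessimistic for both signs of the advantage:

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped objective L^CLIP:
    min(r*A, clip(r, 1-eps, 1+eps)*A) with r = pi_new/pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the clip caps how much the probability of a good action can grow in one update; with a negative advantage the `min` leaves the unclipped (larger) penalty in place, so bad actions are never shielded.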
Actor-Critic Methods (e.g., A3C, DDPG): These hybrid methods combine a policy network (the Actor, $\pi_{\theta}(\mathbf{s})$) with a value function network (the Critic, $V_{\phi}(\mathbf{s})$ or $Q_{\phi}(\mathbf{s}, \mathbf{a})$). The Critic evaluates the Actor’s actions, providing a lower-variance estimate of the advantage for policy updates. Deep Deterministic Policy Gradient (DDPG) is designed for continuous action spaces. The Actor outputs a deterministic action $\mathbf{a} = \pi_{\theta}(\mathbf{s})$, and the Critic learns the Q-value. The Actor is updated by applying the chain rule to the expected return with respect to the actor parameters, effectively performing gradient ascent:
$$
\nabla_{\theta} J(\theta) \approx \mathbb{E}_{\mathbf{s} \sim \rho^{\pi}} \left[ \nabla_{\theta} \pi_{\theta}(\mathbf{s}) \nabla_{\mathbf{a}} Q_{\phi}(\mathbf{s}, \mathbf{a})|_{\mathbf{a}=\pi_{\theta}(\mathbf{s})} \right]
$$
DDPG also uses experience replay and target networks. It is particularly well-suited for UAV drone trajectory planning where smooth, continuous control is required.
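DDPG's target networks are typically refreshed by Polyak averaging rather than hard copies. A sketch over flat parameter lists (the `tau` value is an illustrative default, not prescribed by the text):

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging for DDPG target networks:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * o + (1.0 - tau) * t
            for o, t in zip(online_params, target_params)]
```

Small `tau` makes the target drift slowly behind the online network, which is a key stabilizer for off-policy continuous control.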
2.2 Multi-Agent Reinforcement Learning (MARL) Frameworks
For UAV drone swarms, MARL paradigms are essential. They can be categorized based on the learning structure and information sharing.
Centralized Training with Decentralized Execution (CTDE): This is the most prevalent paradigm for cooperative swarms. During training, algorithms have access to global information (states/actions of all agents) to learn complex cooperative strategies. However, during execution, each UAV drone agent uses only its local observations to make decisions. This combines the benefits of centralized learning with the practicality of decentralized deployment.
• MADDPG: A direct extension of DDPG to multi-agent settings under CTDE. Each agent has its own Actor network (decentralized policy) but is trained with a Critic that takes the joint state and joint action of all agents as input. The critic for agent $i$ is $Q_i^{\phi}(\mathbf{s}_1, \ldots, \mathbf{s}_N, \mathbf{a}_1, \ldots, \mathbf{a}_N)$. This allows the critic to understand the interactive nature of the environment during training, guiding individual actors toward cooperative behavior.
• Value Decomposition Networks (VDN) & QMIX: These methods aim to learn a joint action-value function $Q_{jt}(\boldsymbol{\tau}, \mathbf{u})$ by factorizing it into individual agent utilities $Q_i(\tau_i, u_i)$. QMIX imposes a monotonicity constraint between the joint and individual Q-values via a mixing network, ensuring that global argmax can be achieved by individual argmax operations. This is highly effective for tightly-cooperative UAV swarm tasks.
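VDN's additive factorization is the simplest instance of this idea: because the joint value is a sum of per-agent utilities, each agent's local argmax recovers the joint greedy action. A minimal sketch (QMIX replaces the plain sum with a learned monotonic mixing network):

```python
def vdn_greedy(per_agent_qs):
    """VDN: Q_jt = sum_i Q_i, so per-agent greedy actions are
    jointly greedy. Returns the joint action and its joint value."""
    actions = [max(range(len(q)), key=lambda a: q[a]) for q in per_agent_qs]
    q_jt = sum(q[a] for q, a in zip(per_agent_qs, actions))
    return actions, q_jt
```

This decentralized-argmax property is exactly what makes CTDE execution tractable: each UAV evaluates only its own utility at run time.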
Decentralized Training & Execution (DTDE): Each agent learns solely based on its local experience, treating other agents as part of the environment. This is simpler and has no communication overhead during training but suffers from the non-stationarity problem, often leading to unstable training and less sophisticated cooperation. Independent PPO (IPPO) or Independent DQN (I-DQN) fall under this category.
Communication-Based MARL: These methods explicitly learn communication protocols between UAV drone agents. Agents are equipped with communication channels, and their policies learn what message to send and how to interpret received messages to improve coordination, such as in the work on CommNet or ATOC.
The following table provides a comparative summary of key RL algorithms relevant to UAV swarm planning.
| Algorithm Class | Core Principle | Action Space | Key Advantages for UAV Swarms | Primary Limitations |
|---|---|---|---|---|
| DQN (Value-Based) | Learn optimal Q-function; $\epsilon$-greedy policy. | Discrete | Good for high-dimensional state inputs (e.g., images); foundational for discrete decision-making. | Discretization of continuous control leads to loss of precision; poor sample efficiency. |
| PPO (Policy Gradient) | Direct policy optimization with a clipped objective for stability. | Continuous | High stability, robust performance, easy to tune. Ideal for training continuous flight controllers. | Can be sample hungry; may converge to sub-optimal policies in complex multi-modal reward landscapes. |
| DDPG (Actor-Critic) | Deterministic policy gradient using an off-policy critic. | Continuous | Sample efficient, suitable for precise continuous control (velocity, acceleration commands). | Hyperparameter sensitive; can be brittle and prone to performance collapse. |
| MADDPG (MARL/CTDE) | CTDE extension of DDPG; centralized critics for decentralized actors. | Continuous | Enables learning of complex cooperative and competitive behaviors. Natural fit for multi-UAV drone teams. | Scalability issues; training complexity grows with agent count. Requires global info for training. |
| QMIX (MARL/CTDE) | Factorizes joint Q-value under monotonic constraint. | Discrete | Guarantees consistency between local and global greedy actions. Excellent for tightly-coordinated discrete swarm decisions. | Limited to discrete action spaces. Mixing network complexity. |
3. Application Domains and Algorithmic Adaptations
RL methods are adapted and applied to tackle specific sub-problems within the broader UAV drone swarm trajectory planning challenge.
3.1 Global Path Planning in Complex Environments
This involves finding a feasible, roughly optimal path from start to goal in a large, often partially known, environment with static obstacles. RL replaces traditional search or sampling. The UAV drone's state includes its global position and goal location, often fused with a local occupancy grid from sensors. DQN-based methods can treat movement between grid cells as discrete actions. More advanced approaches use PPO or DDPG to output continuous heading and speed commands. A key reward shaping technique is to use a potential field-inspired reward:
$$
R_{global} = R_{goal} + w_{obs} \cdot R_{repulsion} + w_{progress} \cdot R_{progress}
$$
where $R_{goal}$ is a large positive reward upon reaching the goal, $R_{repulsion} = -\sum_{o \in \mathcal{O}} \frac{1}{||\mathbf{p}_{uav} - \mathbf{p}_o||}$ penalizes proximity to obstacles, and $R_{progress} = \Delta d_{to\_goal}$, the per-step reduction in distance to the goal, encourages moving closer to the target.
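A sketch of this shaped reward, with illustrative weights, using the previous step's goal distance to form the progress term:

```python
import math

def global_reward(p_uav, goal, obstacles, prev_goal_dist, reached,
                  r_goal=100.0, w_obs=0.1, w_progress=1.0):
    """Potential-field-style shaping: terminal goal bonus, repulsion
    penalty near obstacles, and per-step progress toward the goal."""
    d_goal = math.dist(p_uav, goal)
    repulsion = -sum(1.0 / max(math.dist(p_uav, o), 1e-6) for o in obstacles)
    progress = prev_goal_dist - d_goal  # positive when moving closer
    return (r_goal if reached else 0.0) + w_obs * repulsion + w_progress * progress
```

The `1e-6` floor guards against division by zero when the agent grazes an obstacle center; in practice the collision penalty would dominate well before that point.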
3.2 Local Reactive Obstacle Avoidance and Dynamic Interaction
This focuses on real-time, reactive maneuvers to avoid unexpected static obstacles or other dynamic agents (like other UAV drones or moving vehicles). The state is highly localized, including high-resolution lidar/vision data and relative velocities of nearby objects. This is a continuous control problem, making PPO and DDPG the go-to choices. The policy learns to map raw sensor inputs directly to low-level control commands. MARL frameworks like MADDPG are crucial here when avoiding other intelligent agents, as they can learn implicit collision-avoidance protocols. The reward function is critical and often includes:
$$
R_{local} = -\mathbb{I}_{collision} \cdot C_{col} - w_{jerk} \cdot ||\mathbf{j}||^2 + w_{clearance} \cdot d_{min}
$$
where $\mathbb{I}_{collision}$ is an indicator for collision, $C_{col}$ is a large penalty, $\mathbf{j}$ is jerk (to encourage smoothness), and $d_{min}$ is the minimum distance to any obstacle during the step (encouraging safe clearance).
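A sketch of this local reward with illustrative weight values; `jerk` here stands for the finite-difference jerk vector over the step:

```python
def local_reward(collided, jerk, d_min,
                 c_col=100.0, w_jerk=0.01, w_clearance=0.5):
    """Reactive-avoidance reward: large collision penalty, squared-jerk
    smoothness penalty, and a bonus for keeping clearance d_min."""
    jerk_sq = sum(j * j for j in jerk)
    return (-c_col if collided else 0.0) - w_jerk * jerk_sq + w_clearance * d_min
```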
3.3 Multi-Agent Cooperative Planning and Task Allocation
This is the heart of swarm intelligence, where the objective is system-wide, such as covering an area, surrounding a target, or dividing search regions. CTDE MARL algorithms excel here. The global reward must incentivize cooperation. For area coverage, a global reward could be the increase in total covered area. For target assignment, it could be the sum of individual task completion rewards. Graph Neural Networks (GNNs) are increasingly integrated into MARL actors/critics to explicitly model the swarm’s communication topology, improving performance in large-scale UAV teams. The policy for agent $i$ may depend on the aggregated features of its neighbors $\mathcal{N}_i$ in the communication graph:
$$
\mathbf{h}_i^{l+1} = \sigma \left( \mathbf{W}^{l} \mathbf{h}_i^{l} + \sum_{j \in \mathcal{N}_i} \mathbf{\Theta}^{l} \mathbf{h}_j^{l} \right)
$$
where $\mathbf{h}_i^l$ is the feature of agent $i$ at layer $l$, and this graph convolution output is fed into the policy network.
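With scalar features and scalar weights for brevity, one such graph-convolution step looks like the following, with ReLU standing in for $\sigma$:

```python
def gnn_layer(h, neighbors, W, Theta):
    """One message-passing step: h_i' = relu(W*h_i + sum_{j in N_i} Theta*h_j).
    h is a list of scalar agent features; neighbors[i] lists agent i's
    neighbor indices in the communication graph."""
    return [max(0.0, W * h[i] + sum(Theta * h[j] for j in neighbors[i]))
            for i in range(len(h))]
```

Because the neighbor sum is order-invariant and indifferent to the number of neighbors, the same weights apply to any swarm size, which is exactly the permutation-invariance property that makes GNN policies scale.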
3.4 Energy- and Communication-Aware Trajectory Optimization
For long-endurance missions, trajectories must optimize energy consumption, often linked to aerodynamic efficiency and maneuver smoothness. Communication-aware planning ensures the swarm maintains a connected network for data sharing. RL can jointly optimize these by including relevant terms in the reward and state. The state can include the agent’s remaining battery $e_i$ and the link quality to neighbors. The reward function incorporates an energy cost proportional to control effort and a penalty for dropped connections:
$$
R_{e2e} = R_{task} - w_{energy} \cdot (||\mathbf{a}_t||^2 + k \cdot v_t^2) - w_{comm} \cdot \sum_{j \in \mathcal{N}_i} \mathbb{I}_{(LQ_{ij} < LQ_{thresh})}
$$
where $R_{task}$ is the primary mission reward, $||\mathbf{a}_t||^2$ penalizes control effort, $v_t^2$ penalizes high speed (related to drag), and $\mathbb{I}_{(LQ_{ij} < LQ_{thresh})}$ indicates a poor communication link.
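A sketch combining these terms; the weights, drag coefficient `k`, and link-quality threshold are illustrative placeholders:

```python
def e2e_reward(r_task, accel, speed, link_qualities,
               lq_thresh=0.3, w_energy=0.01, w_comm=1.0, k=0.05):
    """Energy- and communication-aware reward: task reward minus a
    control-effort/drag energy cost and a penalty per weak link."""
    effort = sum(a * a for a in accel)
    weak_links = sum(1 for lq in link_qualities if lq < lq_thresh)
    return r_task - w_energy * (effort + k * speed * speed) - w_comm * weak_links
```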
The table below summarizes the alignment between RL algorithms and specific UAV drone swarm planning challenges.
| Planning Challenge | Suitable RL Algorithm Families | Critical Reward Components & State Design |
|---|---|---|
| Global Path in Cluttered Static Maps | DQN (discrete), PPO/DDPG (continuous) | State: Agent & goal pose, local occupancy grid. Reward: Goal reward, distance-to-goal progress, obstacle proximity penalty. |
| Dynamic Obstacle & Inter-Agent Avoidance | PPO, DDPG, MADDPG | State: Raw/downsampled lidar/depth data, relative velocities of nearby objects/agents. Reward: Large collision penalty, smoothness penalty (jerk), clearance bonus. |
| Cooperative Area Coverage / Search | QMIX, MADDPG, MAPPO (CTDE frameworks) | State: Agent pose, shared map of coverage/exploration status. Reward: Global coverage increase, penalty for overflight of covered areas. |
| Formation Flight & Shape Maintenance | MADDPG, PPO with shared reward | State: Agent pose, desired relative poses to neighbors. Reward: Negative sum of squared formation tracking errors, collision penalty. |
| Energy-Efficient Long-Duration Flight | DDPG, PPO (continuous, precise control) | State: Pose, velocity, remaining battery, wind estimate. Reward: Negative weighted sum of control effort squared and velocity squared, task completion bonus. |
| Communication-Constrained Swarm Planning | MADDPG with GNN, CommNet-style algorithms | State: Agent pose, neighbor poses, link quality indicators. Reward: Task reward, penalty for loss of connectivity to a critical number of neighbors. |
4. Synthesis, Open Challenges, and Future Outlook
The integration of RL into UAV drone swarm trajectory planning has demonstrated remarkable potential. Key achievements include: (1) The development of end-to-end policies that can react to complex, dynamic environments directly from sensor data, bypassing the need for explicit feature engineering and fragile perception-planning-control pipelines. (2) The successful application of CTDE frameworks like MADDPG and QMIX to learn sophisticated, emergent cooperative behaviors such as flocking, distributed search, and adversarial maneuvering. (3) The ability to handle multi-objective optimization naturally through reward shaping, balancing safety, efficiency, and cooperation in a single learned policy.
However, significant open challenges remain before widespread real-world deployment. The table below outlines these challenges and potential research directions.
| Core Challenge | Description | Promising Research Directions |
|---|---|---|
| Sample Inefficiency & Training Scalability | Training sophisticated swarm policies requires millions of environment interactions, which is slow and expensive even in simulation and infeasible in the real world. | • Advanced Simulators: Development of high-fidelity, accelerated simulators (e.g., NVIDIA Isaac Sim). • Transfer & Meta-Learning: Training policies on a distribution of tasks/simulators for fast adaptation to new scenarios. • Model-Based RL: Learning an approximate dynamics model to supplement real experience with simulated rollouts. |
| The Sim-to-Real Transfer Gap | Policies often fail to generalize from simulation to physical drones due to discrepancies in dynamics, actuation noise, sensor models, and environmental conditions. | • Domain Randomization: Varying simulator parameters (mass, inertia, friction, sensor noise) during training to create robust policies. • System Identification & Dynamics-Aware RL: Explicitly modeling and compensating for sim-to-real discrepancies within the RL framework. • Real-World Online Fine-Tuning: Using safe, on-policy algorithms to adapt pre-trained sim policies with limited real-flight data. |
| Safety and Verifiability | RL policies are “black boxes.” It is difficult to provide formal guarantees that they will never violate critical safety constraints (e.g., collide, enter a no-fly zone). | • Safe RL: Integrating formal methods (Shielded RL, Control Barrier Functions) to override or guide RL actions to ensure safety. • Interpretability & Explainability: Developing methods to understand why a policy made a certain decision, crucial for certification and human trust. • Hybrid Architectures: Using RL for high-level strategy while relying on proven, verifiable low-level controllers for stabilization and basic obstacle avoidance. |
| Generalization to Unseen Scenarios & Swarm Sizes | Policies trained for a specific number of UAV drones in a specific environment often cannot handle variations in swarm size or radically new obstacle layouts. | • Permutation-Invariant Architectures: Using GNNs or attention mechanisms that generalize across different numbers and orderings of agents. • Curriculum & Progressive Training: Starting training with simple scenarios and few agents, gradually increasing complexity and swarm size. • Foundation Models for Embodied AI: Exploring large pre-trained models that can provide common-sense reasoning and be fine-tuned for specific swarm tasks. |
| Integrated Task & Motion Planning | Most work separates high-level task allocation from low-level trajectory generation. True autonomy requires tight coupling. | • Hierarchical RL: A high-level RL policy selects sub-tasks or goals, while a low-level RL policy executes the motion to achieve them. • End-to-End Multi-Task Learning: Training a single policy on a diverse set of tasks (search, deliver, inspect) to enable flexible mission execution. |
Looking forward, I envision several converging trends. First, the development of UAV drone swarm "foundation models": large-scale policies pre-trained on vast, diverse datasets of simulated swarm behaviors that can be efficiently fine-tuned for specific applications. Second, the rise of heterogeneous swarms, where RL must manage agents with different capabilities (e.g., quadrotors, fixed-wings, ground vehicles). Third, the integration with cloud-edge-device computing architectures, where heavy training occurs in the cloud, policy distillation happens at the edge, and lightweight execution runs on the UAV drone itself. Finally, the creation of standardized benchmarks and simulation platforms will be crucial for fair comparison and accelerated progress in the field.
In conclusion, Reinforcement Learning has fundamentally expanded the toolbox for UAV drone swarm trajectory planning, offering a path toward adaptive, scalable, and intelligent collective autonomy. While substantial hurdles in safety, efficiency, and generalization remain, the ongoing synthesis of RL with advances in simulation, verification, and distributed systems promises to unlock the full potential of UAV swarms for transformative applications across society.
