Intelligent Obstacle Avoidance Control Method for Drone Formation in Unknown Environments

The advancement of drone technology has propelled their widespread application across various fields, including large-scale aerial surveying, precision agriculture, and complex operational scenarios. While single drones are effective for specific tasks, their limitations in coverage area, task complexity, and redundancy become apparent for missions requiring extensive spatial operations or enhanced robustness. Consequently, the coordination of multiple drones into a cohesive drone formation has emerged as a critical research area. A drone formation leverages the synergies between individual units to accomplish tasks more efficiently and reliably than a single agent could. However, operating a drone formation in an unknown, obstacle-populated environment introduces significant challenges. The core problem extends beyond simple pathfinding; it requires the simultaneous resolution of two key conflicts: the avoidance of environmental static obstacles and the prevention of intra-formation collisions, all without prior knowledge of the environment’s layout. Ensuring the safe, coordinated, and intelligent navigation of a drone formation under these conditions remains a pivotal and demanding problem in autonomous systems.

Existing strategies for collision avoidance in multi-agent systems can be broadly categorized. Traditional methods, such as those based on mathematical optimization, heuristic algorithms, or geometric guidance, often depend on pre-constructed global or local maps and involve computationally intensive path-search algorithms. These approaches are generally unsuitable for real-time navigation in truly unknown environments due to their offline nature and high computational load. The Artificial Potential Field (APF) method is notable for its simplicity and real-time performance. It models the target as a source of attractive force and obstacles as sources of repulsive force, and guides the agent along the resultant force, that is, along the negative gradient of the total potential. Despite its appeal, the classical APF method suffers from well-documented issues such as local minima (where attractive and repulsive forces cancel out, trapping the agent) and oscillations in narrow passages. Although numerous improvements have been proposed, tuning the field parameters for a cohesive drone formation often leads to congestion or collisions when facing multiple obstacles, as the repulsive forces from both the environment and neighboring drones become difficult to balance.

In recent years, machine learning, particularly Deep Reinforcement Learning (DRL), has offered a promising alternative. DRL agents learn optimal policies through direct interaction with the environment, eliminating the need for explicit environment modeling. This “end-to-end” learning capability is highly advantageous for unknown environments. While early DRL successes like Deep Q-Networks (DQN) excelled in discrete action spaces, they were inapplicable to the continuous control domain of drone flight. The Deep Deterministic Policy Gradient (DDPG) algorithm, an actor-critic method designed for continuous action spaces, addressed this limitation. However, standard DDPG can be sample-inefficient, suffer from training instability due to value function overestimation, and exhibit poor generalization. Therefore, enhancing DDPG for robust and efficient policy learning is crucial for its application in guiding a drone formation leader.

This paper proposes a novel hierarchical control architecture for intelligent obstacle avoidance in a fixed-wing drone formation operating in an unknown static environment. The core innovation lies in the fusion of a centralized, improved reinforcement learning module for the leader drone with a decentralized, APF-based consensus control protocol for the follower drones. For the leader, we introduce a Greedy-DDPG algorithm. This improvement modifies the exploration strategy by incorporating a greedy selection mechanism from a sampled action set, which accelerates early-stage training and mitigates value overestimation, leading to a more stable and generalizable obstacle avoidance policy. For the follower drones, we design a distributed control law that integrates the consensus protocol—ensuring formation shape maintenance and velocity alignment—with a repulsive potential field for both environmental obstacles and other drones. This dual-layer approach allows the intelligent leader to chart a safe and smooth global path, while the followers autonomously maintain the formation geometry and avoid collisions with each other and local obstacles, ensuring the holistic safety and cohesion of the entire drone formation.

The subsequent sections are organized as follows. First, we formally define the problem, including the drone kinematic model, environmental assumptions, and control objectives for the drone formation. Next, we detail the Greedy-DDPG algorithm for leader drone control, covering the state/action design, reward function formulation, and network architecture. Following that, we present the follower drone control strategy based on artificial potential field and consensus theory, along with a theoretical stability analysis. Finally, we provide comprehensive simulation results and a concluding summary.

1. Problem Formulation

1.1 Preliminaries and Communication Graph

We consider a drone formation consisting of \(n+1\) fixed-wing drones operating in a two-dimensional plane, comprising one leader (denoted as agent 0) and \(n\) followers. A centralized communication structure is assumed for the leader-follower links: all follower drones can receive the leader’s position and velocity information reliably and without delay. However, communication among follower drones is limited by their onboard sensors. Each follower can only perceive the states of other followers within its limited detection range. This interaction topology can be modeled as an undirected graph \(G=(V, E)\).

Let \(V=\{1, 2, \ldots, n\}\) be the set of follower nodes. An edge \((i, j)\), with \(i, j \in V\) and \(i \neq j\), belongs to the edge set \(E\) if and only if drones \(i\) and \(j\) are within mutual detection range. The adjacency matrix \(B(G)=[b_{ij}]\) and the Laplacian matrix \(L=D(G)-B(G)\) are defined as follows, where \(D(G)=\text{diag}(d(i))\) is the degree matrix with \(d(i)=\sum_{j=1}^{n} b_{ij}\).

$$b_{ij} = \begin{cases} 1, & \text{if } R_{ij} \leq R_s \\ 0, & \text{if } R_{ij} > R_s \end{cases}$$
where \(R_{ij}\) is the Euclidean distance between drones \(i\) and \(j\), and \(R_s\) is the constant sensor detection radius.
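The adjacency and Laplacian definitions above can be illustrated with a minimal sketch. The follower positions and the sensor radius below are hypothetical values chosen only for illustration, and the diagonal entries \(b_{ii}\) are assumed to be zero, consistent with the requirement \(i \neq j\) in the edge set.

```python
import math

def adjacency_matrix(positions, sensor_radius):
    """b_ij = 1 if followers i and j (i != j) are within sensor_radius, else 0."""
    n = len(positions)
    b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: no self-loops, so b_ii stays 0.
            if i != j and math.dist(positions[i], positions[j]) <= sensor_radius:
                b[i][j] = 1
    return b

def laplacian_matrix(b):
    """L = D - B, with D the diagonal degree matrix, d(i) = sum_j b_ij."""
    n = len(b)
    return [[(sum(b[i]) if i == j else 0) - b[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical follower positions (metres) and sensor radius R_s.
followers = [(0.0, 0.0), (30.0, 0.0), (60.0, 0.0), (200.0, 0.0)]
B = adjacency_matrix(followers, sensor_radius=40.0)
L = laplacian_matrix(B)
```

In this example the fourth follower is outside every other follower's detection range, so its row in both matrices is zero and the follower graph is disconnected.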

1.2 Drone Kinematic Model

For control design and simulation, we employ a simplified double-integrator kinematic model for each drone in the drone formation. The model for the \(k\)-th drone is:

$$
\begin{align}
\dot{\mathbf{x}}_k &= \mathbf{v}_k, \\
\dot{\mathbf{v}}_k &= \mathbf{u}_k, \quad k=0, 1, \ldots, n.
\end{align}
$$

where \(\mathbf{x}_k=(x_k, y_k)^T \in \mathbb{R}^2\) is the position vector, \(\mathbf{v}_k=(v_{k_x}, v_{k_y})^T \in \mathbb{R}^2\) is the velocity vector, and \(\mathbf{u}_k \in \mathbb{R}^2\) is the acceleration control input. Fixed-wing drones have inherent motion constraints compared to rotorcraft. They cannot hover and have limitations on turning rate. These are modeled as:

$$
\begin{align}
0 < v_{\text{min}} \leq \|\mathbf{v}_k\|_2 \leq v_{\text{max}}, \\
|\dot{\theta}_k| \leq \dot{\theta}_{\text{max}}.
\end{align}
$$

Here, \(v_{\text{min}}\) and \(v_{\text{max}}\) are the minimum and maximum allowable speeds, \(\dot{\theta}_k\) is the heading angular rate, and \(\dot{\theta}_{\text{max}}\) is its maximum allowable value.
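As a rough illustration of how the double-integrator model and the fixed-wing constraints interact, the sketch below performs one Euler integration step and then projects the velocity onto the speed and turn-rate limits. The projection scheme, the function name `step`, and the numerical values in the example are assumptions made for illustration; the source does not specify how the constraints are enforced.

```python
import math

def step(pos, vel, accel, dt, v_min, v_max, turn_rate_max):
    """One Euler step of x' = v, v' = u, followed by constraint projection."""
    old_heading = math.atan2(vel[1], vel[0])
    new_vx = vel[0] + accel[0] * dt
    new_vy = vel[1] + accel[1] * dt
    speed = math.hypot(new_vx, new_vy)
    new_heading = math.atan2(new_vy, new_vx) if speed > 0.0 else old_heading

    # Wrap the heading change into [-pi, pi), then limit |dtheta/dt| <= turn_rate_max.
    dtheta = (new_heading - old_heading + math.pi) % (2.0 * math.pi) - math.pi
    max_dtheta = turn_rate_max * dt
    dtheta = max(-max_dtheta, min(max_dtheta, dtheta))
    heading = old_heading + dtheta

    # Enforce 0 < v_min <= ||v|| <= v_max (a fixed-wing drone cannot hover).
    speed = max(v_min, min(v_max, speed))
    vel = (speed * math.cos(heading), speed * math.sin(heading))

    # Position is advanced with the constrained velocity.
    pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    return pos, vel

# Hypothetical example: a large lateral acceleration is clipped by both limits.
p, v = step((0.0, 0.0), (20.0, 0.0), (0.0, 1000.0), dt=0.1,
            v_min=10.0, v_max=30.0, turn_rate_max=0.5)
```

Here the commanded acceleration would turn the drone by more than 1 rad in a single step, but the result is limited to a 0.05 rad heading change at the maximum speed.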

1.3 Environment and Control Objectives

The unknown environment contains \(m\) static obstacles. Each obstacle \(p\) is modeled as a circular region with a center \((x_p, y_p)\) and radius \(r_p\):
$$c_p = (x_p, y_p, r_p), \quad p=1,2,\ldots,m.$$
The primary control objectives for the drone formation are defined as follows:

  1. Obstacle Avoidance: Every drone must maintain a safe distance from all obstacles.
    $$ \min(\| (x_l, y_l) - (x_p, y_p) \|_2) > r_p, \quad \forall p. $$
    $$ \min(\| (x_i, y_i) - (x_p, y_p) \|_2) > r_p, \quad \forall p, \forall i. $$
  2. Inter-Agent Collision Avoidance: Drones within the drone formation must maintain a minimum safe distance from each other.
    $$ \min(\| (x_i, y_i) - (x_j, y_j) \|_2) > r_{\text{safe}}, \quad \forall i \neq j. $$
  3. Formation Convergence and Target Reaching: The leader must reach the designated target position, and the followers must achieve and maintain a desired geometric formation relative to the leader while matching velocities.
    $$ \lim_{t \to \infty} \left( \mathbf{x}_l(t) - \mathbf{x}_{\text{target}}(t) \right) = 0. $$
    $$ \lim_{t \to \infty} \left( \mathbf{v}_i(t) - \mathbf{v}_j(t) \right) = 0, \quad \forall i, j. $$
    $$ \lim_{t \to \infty} \left( \mathbf{r}_i(t) - \mathbf{r}'_i(t) \right) = 0, \quad \forall i. $$
    Here, \(\mathbf{r}_i(t)\) is the actual relative position of follower \(i\) to the leader, and \(\mathbf{r}'_i(t)\) is its desired relative position in the formation geometry.

2. Leader Drone Obstacle Avoidance via Enhanced Reinforcement Learning

The leader’s role is to navigate towards the global target while intelligently avoiding unknown obstacles, creating a feasible path for the entire drone formation. We formulate this as a continuous control problem solved by an enhanced DDPG algorithm, termed Greedy-DDPG.

2.1 State, Action, and Reward Design

State Space \(s_t\): The state must provide sufficient information for decision-making without relying on explicit obstacle coordinates (which are unknown a priori). We design a 5-dimensional state vector leveraging potential field concepts:
$$ s_t = \left( \frac{d_{\text{target}}}{k_r},\ \tanh(\| \mathbf{F}_{\text{rep}_u} \|_2),\ \theta_{\text{target}},\ \theta_{F},\ \theta_l \right)^T. $$

  • \(d_{\text{target}}\): Distance from the leader to the target position.
  • \(\mathbf{F}_{\text{rep}_u}\): The total repulsive force from detected obstacles, calculated using a standard APF repulsive function \( \mathbf{F}_{\text{rep}} = \mathbf{F}_{\text{rep1}} + \mathbf{F}_{\text{rep2}} \). Because the force magnitude is non-negative, the \(\tanh\) function maps it into \([0,1)\), which improves training stability.
  • \(\theta_{\text{target}}\): The bearing angle to the target relative to the leader’s frame.
  • \(\theta_{F}\): The direction angle of the combined force (repulsive + attractive).
  • \(\theta_l\): The leader’s current heading angle.
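The five state components listed above could be assembled as in the following sketch. It assumes that the repulsive and attractive force vectors are computed elsewhere and passed in, that \(k_r\) is a fixed distance-scaling constant, and that \(\theta_F\) is expressed in the world frame; none of these details are fixed by the text, and the function names are hypothetical.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def leader_state(leader_pos, leader_heading, target_pos, f_rep, f_att, k_r):
    """Build s_t = (d_target/k_r, tanh(||F_rep||), theta_target, theta_F, theta_l)."""
    dx = target_pos[0] - leader_pos[0]
    dy = target_pos[1] - leader_pos[1]
    d_target = math.hypot(dx, dy)
    # Bearing to the target, relative to the leader's current heading.
    theta_target = wrap_angle(math.atan2(dy, dx) - leader_heading)
    # Direction of the combined (repulsive + attractive) force; world frame assumed.
    theta_f = math.atan2(f_rep[1] + f_att[1], f_rep[0] + f_att[0])
    f_rep_norm = math.tanh(math.hypot(f_rep[0], f_rep[1]))
    return (d_target / k_r, f_rep_norm, theta_target, theta_f, leader_heading)

# Hypothetical example: no obstacle in range, target 500 m away, k_r = 500.
example_state = leader_state((0.0, 0.0), 0.0, (300.0, 400.0),
                             f_rep=(0.0, 0.0), f_att=(3.0, 4.0), k_r=500.0)
```

Note that the state contains no obstacle coordinates: obstacles enter only through the sensed repulsive force, which is what allows the policy to be used in layouts it has never seen.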

Action Space \(a_t\): Aligned with the drone’s kinematic constraints, the action is defined as the heading angular rate, a continuous scalar value.
$$ a_t = \dot{\theta}_l \in [-\dot{\theta}_{\text{max}}, \dot{\theta}_{\text{max}}]. $$
The speed control is managed separately by a simple proportional controller based on the attractive potential field.

Reward Function \(r_t\): A well-shaped reward is critical for learning. We combine dense shaping rewards with a sparse terminal reward.

Dense Rewards:

  • Action Penalty \(R_a\): Penalizes aggressive turning to encourage smooth flight. \( R_a = -\left( \frac{|a|}{a_{\text{max}}} - 0.5 \right) \) if \( \frac{|a|}{a_{\text{max}}} > 0.3 \), else 0.
  • Obstacle Force Penalty \(R_F\): Guides the drone away from obstacles based on the magnitude of repulsive force. \( R_F = \text{clip}(-4 \| \mathbf{F}_{\text{rep}_u} \|_2, -4, 4) \).
  • Heading Error Penalty \(R_{\theta}\): Applied when far from obstacles to encourage direct movement toward the target. \( R_{\theta} = -2|\theta_l - \theta_{\text{target}}| + 0.4 \) if \( \|\mathbf{F}_{\text{rep}_u}\|_2 < 0.1 \) and \( |\theta_l - \theta_{\text{target}}| > 0.2 \).
  • Progress Reward \(R_r\): Encourages reducing distance to target and moving towards it. \( R_r = \exp\left(-\frac{d_{\text{target}}}{200}\right) + \frac{\dot{d}_{\text{target}}}{5} \), where \(\dot{d}_{\text{target}}\) denotes the rate at which the distance to the target decreases (positive when approaching, so that progress is rewarded).

Terminal Reward \(R_{\text{end}}\): A significant reward (or penalty) given upon episode termination, which occurs on collision, excessive deviation from the target, or success (reaching very close to the target).

The total return for an episode is: \( R_{\text{epi}} = R_{\text{end}} + \sum_{t=0}^{T_{\text{end}}} \left[ R_a(t) + R_F(t) + R_{\theta}(t) + R_r(t) \right] \).
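The dense reward terms and the episode return can be written out as a short sketch. It assumes that each dense term is zero when its condition is not met, that the heading error \(|\theta_l - \theta_{\text{target}}|\) is supplied directly, and that the rate term is positive when the leader approaches the target; the function and argument names are hypothetical.

```python
import math

def dense_reward(action, a_max, f_rep_mag, heading_error, d_target, closing_rate):
    """Sum of R_a + R_F + R_theta + R_r for a single timestep."""
    # Action penalty: active only for turn commands above 30% of the maximum.
    ratio = abs(action) / a_max
    r_a = -(ratio - 0.5) if ratio > 0.3 else 0.0

    # Obstacle force penalty: clip(-4 * ||F_rep||, -4, 4).
    r_f = max(-4.0, min(4.0, -4.0 * f_rep_mag))

    # Heading error penalty: only far from obstacles and for errors above 0.2 rad.
    if f_rep_mag < 0.1 and heading_error > 0.2:
        r_theta = -2.0 * heading_error + 0.4
    else:
        r_theta = 0.0

    # Progress reward; closing_rate > 0 is assumed to mean "approaching the target".
    r_r = math.exp(-d_target / 200.0) + closing_rate / 5.0
    return r_a + r_f + r_theta + r_r

def episode_return(step_rewards, terminal_reward):
    """R_epi = R_end + sum of the dense rewards over the episode."""
    return terminal_reward + sum(step_rewards)

# Hypothetical example: hard turn near an obstacle, 200 m from the target.
r_example = dense_reward(action=1.0, a_max=1.0, f_rep_mag=2.0,
                         heading_error=1.0, d_target=200.0, closing_rate=5.0)
```

In the example the obstacle penalty saturates at -4 and dominates the step reward, which reflects the intended priority of clearance over progress.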

2.2 Greedy-DDPG Algorithm

The standard DDPG algorithm employs an actor network \(\mu(s|\theta^\mu)\) (policy) and a critic network \(Q(s,a|\theta^Q)\) (value function), along with their target networks. Exploration is typically achieved by adding correlated Ornstein-Uhlenbeck (OU) noise to the actor’s output. While effective, this can lead to slow initial learning and value overestimation.

Our Greedy-DDPG modification alters the exploration mechanism. At each timestep \(t\), instead of adding noise to a single action, the actor network generates a deterministic action \(\mu(s_t)\). We then create a candidate action set \(A_t\) by sampling \(N_{\text{cand}}\) noise values \(\{\epsilon_i\}\) from the OU process:
$$ A_t = \{ a_i = \text{clip}(\mu(s_t) + \epsilon_i, a_{\text{low}}, a_{\text{high}}) \mid i=1,\ldots,N_{\text{cand}} \}. $$
Each candidate action \(a_i\) is evaluated by the current critic network to estimate its Q-value: \(Q(s_t, a_i)\). The final action \(a_t\) is selected from \(A_t\) based on a decaying greedy-soft policy:

$$
a_t =
\begin{cases}
\arg\max_{a \in A_t} Q(s_t, a), & \text{with probability } \epsilon_1, \\
\arg\min_{a \in A_t} Q(s_t, a), & \text{with probability } \epsilon_2(t), \\
\text{a random action from } A_t, & \text{with probability } 1-\epsilon_1-\epsilon_2(t).
\end{cases}
$$

Here, \(\epsilon_1\) is a constant probability of exploiting the current critic’s best guess. \(\epsilon_2(t)\), the probability of choosing the worst-valued action, decays over episodes (e.g., \(\epsilon_2(\Delta+1) = \epsilon_2(\Delta) \cdot r_{\epsilon_2}\) with \(r_{\epsilon_2} < 1\), where \(\Delta\) is the episode index). This mechanism provides three key benefits for training the drone formation leader: 1) Accelerated Early Learning: In early episodes, the critic is inaccurate. Choosing actions with extreme Q-values (both max and min) provides stronger, more varied gradients for updating both actor and critic, speeding up initial policy improvement. 2) Mitigated Overestimation Bias: Periodically selecting low-value actions helps correct the Q-function overestimation common in DDPG. 3) Guided Exploration: The exploration is not purely random but is biased by the critic’s current understanding, making it more efficient than naive random noise.
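The candidate-set selection rule can be sketched as follows. The critic is represented by a plain callable `q_value(a)` standing in for \(Q(s_t, a)\), and the noise samples are passed in rather than drawn from an OU process; both are simplifications assumed here for illustration, and the function name is hypothetical.

```python
import random

def select_action(mu, noise_samples, q_value, a_low, a_high, eps1, eps2, rng=random):
    """Pick a_t from the clipped candidate set A_t using the greedy-soft rule."""
    # Candidate set: deterministic action plus each noise sample, clipped to bounds.
    candidates = [max(a_low, min(a_high, mu + e)) for e in noise_samples]
    u = rng.random()
    if u < eps1:
        # With probability eps1: candidate with the highest critic value.
        return max(candidates, key=q_value)
    if u < eps1 + eps2:
        # With probability eps2: candidate with the lowest critic value.
        return min(candidates, key=q_value)
    # Otherwise: a uniformly random candidate.
    return rng.choice(candidates)

# Hypothetical stand-in critic that prefers turn rates near 0.2 rad/s.
def toy_critic(a):
    return -(a - 0.2) ** 2

action = select_action(mu=0.0, noise_samples=[-0.3, 0.1, 0.25, 2.0],
                       q_value=toy_critic, a_low=-0.5, a_high=0.5,
                       eps1=0.3, eps2=0.1)
```

Evaluating every candidate costs \(N_{\text{cand}}\) critic forward passes per timestep, so the size of the candidate set trades exploration quality against computation.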

The networks are then updated using standard DDPG rules. The critic is updated by minimizing the Temporal Difference (TD) error loss, and the actor is updated using the deterministic policy gradient. Target networks are softly updated.
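For reference, the two scalar operations at the heart of the standard DDPG update are the TD target for the critic and the soft (Polyak) update of the target networks. The sketch below shows them on plain Python numbers and lists rather than on network tensors; the function names are hypothetical.

```python
def td_target(reward, next_q, gamma, done):
    """y = r + gamma * Q'(s', mu'(s')), with no bootstrap term on terminal steps."""
    return reward + (0.0 if done else gamma * next_q)

def soft_update(target_params, online_params, tau):
    """theta_target <- (1 - tau) * theta_target + tau * theta_online, elementwise."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Hypothetical numbers: next_q would come from the target critic and target actor.
y = td_target(reward=1.0, next_q=10.0, gamma=0.99, done=False)
new_target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.001)
```

The critic loss is then the mean squared difference between `y` and the critic's current estimate \(Q(s_t, a_t)\) over a minibatch.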

Table 1: Neural Network Architecture Parameters

| Layer | Actor Network | Critic Network |
| --- | --- | --- |
| Input | State \(s_t\) (dim 5) | State \(s_t\) (dim 5) and Action \(a_t\) (dim 1) |
| Hidden 1 | Fully Connected (60 neurons), ReLU | Fully Connected (80 neurons for state, 80 for action), ReLU |
| Hidden 2 | Fully Connected (80 neurons), ReLU | Fully Connected (60 neurons), ReLU |
| Output | Fully Connected (1 neuron), Tanh | Fully Connected (1 neuron), Linear |

3. Follower Drone Formation Control with Collision Avoidance

The followers’ control objective is dual: maintain the prescribed drone formation geometry relative to the moving leader, and avoid collisions with both static obstacles and other drones. We propose a control law that integrates a consensus protocol with an artificial potential field.

3.1 Integrated Control Law Design

For follower drone \(i\) in the drone formation, the control input \(\mathbf{u}_i\) is designed as follows:

$$
\mathbf{u}_i = -\sum_{j=1}^{n} b_{ij} \left[ \gamma_0 (\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j) + \gamma_1 (\hat{\mathbf{v}}_i - \hat{\mathbf{v}}_j) \right] - h_i \left[ \gamma_0 \hat{\mathbf{x}}_i + \gamma_1 \hat{\mathbf{v}}_i \right] + \mathbf{f}_i.
$$

Let’s define the components:

  • \(\hat{\mathbf{x}}_i = \mathbf{x}_i - (\mathbf{x}_l + \mathbf{r}'_i)\): Position error relative to its desired location in the formation (leader position \(\mathbf{x}_l\) plus desired offset \(\mathbf{r}'_i\)).
  • \(\hat{\mathbf{v}}_i = \mathbf{v}_i - \mathbf{v}_l\): Velocity error relative to the leader’s velocity \(\mathbf{v}_l\).
  • \(b_{ij}\): Adjacency element from the follower communication graph \(G\).
  • \(h_i\): Leader-follower connection weight. \(h_i=1\) as all followers receive leader info.
  • \(\gamma_0, \gamma_1 > 0\): Control gain parameters for position and velocity consensus, respectively.
  • \(\mathbf{f}_i\): The total repulsive force from obstacles and other drones, acting as a collision avoidance term.

The term \(-\sum b_{ij}[\gamma_0 (\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j) + \gamma_1 (\hat{\mathbf{v}}_i - \hat{\mathbf{v}}_j)]\) enforces consensus (position and velocity matching) among neighboring followers. The term \(-h_i[\gamma_0 \hat{\mathbf{x}}_i + \gamma_1 \hat{\mathbf{v}}_i]\) drives the follower’s state error relative to the leader to zero. The combined effect of these two terms ensures the drone formation achieves and maintains the desired geometry. The repulsive force \(\mathbf{f}_i\) is calculated similarly to the leader’s but includes other drones within the safe distance \(r_{\text{safe}}\) as additional repulsive sources:
$$ \mathbf{f}_i = -\nabla U_{\text{obs}}(\mathbf{x}_i) - \sum_{j \neq i} \nabla U_{\text{col}}(\|\mathbf{x}_i - \mathbf{x}_j\|), $$
where \(U_{\text{obs}}\) and \(U_{\text{col}}\) are repulsive potential functions for obstacles and collision, typically of the form \(U(r) = \frac{1}{2}k \left( \frac{1}{r} - \frac{1}{r_0} \right)^2\) for \(r < r_0\), and 0 otherwise, with \(r_0\) being the influence range.
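A minimal sketch of the follower control law is given below. It assumes that \(r\) in the potential is the distance to a point source (an obstacle's radius is not subtracted), that the error vectors are already computed, and that the gains and positions in the example are purely hypothetical.

```python
import math

def repulsive_force(pos, source, k, r0):
    """-grad U for U(r) = 0.5*k*(1/r - 1/r0)^2 when r < r0, and zero otherwise."""
    dx, dy = pos[0] - source[0], pos[1] - source[1]
    r = math.hypot(dx, dy)
    if r >= r0 or r == 0.0:
        # Outside the influence range (or degenerate overlap): no force.
        return (0.0, 0.0)
    # dU/dr = -k*(1/r - 1/r0)/r^2, so -grad U points away from the source.
    magnitude = k * (1.0 / r - 1.0 / r0) / (r * r)
    return (magnitude * dx / r, magnitude * dy / r)

def follower_control(i, x_hat, v_hat, b, h, gamma0, gamma1, f_i):
    """u_i = -sum_j b_ij[...] - h_i[gamma0*x_hat_i + gamma1*v_hat_i] + f_i."""
    u = []
    for axis in range(2):
        consensus = sum(
            b[i][j] * (gamma0 * (x_hat[i][axis] - x_hat[j][axis])
                       + gamma1 * (v_hat[i][axis] - v_hat[j][axis]))
            for j in range(len(b)))
        tracking = h[i] * (gamma0 * x_hat[i][axis] + gamma1 * v_hat[i][axis])
        u.append(-consensus - tracking + f_i[axis])
    return tuple(u)

# Hypothetical two-follower example: follower 0 is 1 m ahead of its slot.
b = [[0, 1], [1, 0]]
h = [1, 1]
x_hat = [(1.0, 0.0), (0.0, 0.0)]
v_hat = [(0.0, 0.0), (0.0, 0.0)]
f0 = repulsive_force((5.0, 0.0), (0.0, 0.0), k=1.0, r0=2.0)  # out of range
u0 = follower_control(0, x_hat, v_hat, b, h, gamma0=2.0, gamma1=1.0, f_i=f0)
```

With no obstacle in range, the consensus and leader-tracking terms both pull follower 0 back toward its slot, giving an acceleration of -4 along the x-axis.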

3.2 Stability Analysis

The stability of the follower subsystem under the proposed control law can be analyzed using Lyapunov theory. Consider the overall system state for followers, stacking position and velocity errors. The closed-loop dynamics can be written in matrix form. We consider a Lyapunov function candidate that captures the total energy of the formation system, including the potential energy from the formation errors and the kinetic energy:

$$
V(\mathbf{\hat{x}}, \mathbf{\hat{v}}) = \frac{1}{2} \gamma_0 \mathbf{\hat{x}}^T (L+H) \mathbf{\hat{x}} + \frac{1}{2} \mathbf{\hat{v}}^T \mathbf{\hat{v}} + P(\mathbf{x}).
$$

Here, \(L\) is the Laplacian of the follower graph, \(H = \text{diag}(h_i)\), and \(P(\mathbf{x})\) is the total potential energy from obstacle/collision avoidance (a non-negative function that vanishes outside the influence ranges). Taking the time derivative along the system trajectories, and using \(\dot{\mathbf{\hat{x}}} = \mathbf{\hat{v}}\) for constant desired offsets, we get:
$$ \dot{V} = \mathbf{\hat{v}}^T \left( \gamma_0 (L+H)\mathbf{\hat{x}} + \dot{\mathbf{\hat{v}}} \right) + \dot{P}. $$
With \(\dot{\mathbf{\hat{v}}} = \mathbf{u} - \mathbf{1} \otimes \dot{\mathbf{v}}_l\) and assuming the leader’s acceleration is bounded or zero for analysis, substituting the expression for \(\mathbf{u}\) yields:
$$ \dot{V} = -\gamma_1 \mathbf{\hat{v}}^T (L+H) \mathbf{\hat{v}} + \mathbf{\hat{v}}^T (-\nabla P) + \nabla P \cdot \mathbf{v}. $$
Note that \(\mathbf{\hat{v}}^T (-\nabla P) + \nabla P \cdot \mathbf{v} = \nabla P \cdot (\mathbf{v} - \mathbf{\hat{v}}) = \nabla P \cdot (\mathbf{1} \otimes \mathbf{v}_l)\), which is bounded and vanishes wherever the repulsive potentials are inactive (\(\nabla P = 0\)). The first term, \(-\gamma_1 \mathbf{\hat{v}}^T (L+H) \mathbf{\hat{v}}\), is negative semi-definite. Since the graph \(G\) is connected and \(H\) is positive definite, \((L+H)\) is positive definite. Therefore, by applying LaSalle’s invariance principle, we can conclude that the system states will converge to the largest invariant set where \(\mathbf{\hat{v}} = 0\). On this set, the dynamics force \(\mathbf{\hat{x}} \to 0\), provided the repulsive potentials \(P(\mathbf{x})\) are designed to have minima consistent with the desired formation spacing. Thus, the followers achieve velocity consensus with the leader and converge to their desired formation positions while avoiding collisions, ensuring the stability of the overall drone formation control strategy.
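The positive definiteness of \(L+H\) used in the argument above can be checked numerically for a given topology. The sketch below uses a hand-written Cholesky test on a hypothetical three-follower path graph with \(H = I\) (all \(h_i = 1\)); it is a sanity check under these assumptions, not a substitute for the proof.

```python
import math

def is_positive_definite(m):
    """Return True iff a Cholesky factorisation of symmetric matrix m succeeds."""
    n = len(m)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                c[i][i] = math.sqrt(s)
            else:
                c[i][j] = s / c[j][j]
    return True

def add_diagonal(m, diag):
    """Return m + diag(diag) without modifying m."""
    return [[m[i][j] + (diag[i] if i == j else 0.0) for j in range(len(m))]
            for i in range(len(m))]

# Laplacian of a hypothetical path graph 1 - 2 - 3 among three followers.
laplacian = [[1.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 1.0]]
h = [1.0, 1.0, 1.0]  # every follower receives the leader's state

laplacian_is_pd = is_positive_definite(laplacian)                 # False: L is singular
l_plus_h_is_pd = is_positive_definite(add_diagonal(laplacian, h)) # True
```

The Laplacian alone is only positive semi-definite (its rows sum to zero), so it is the leader connection matrix \(H\) that makes the quadratic form in the Lyapunov function strictly positive.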

4. Simulation Experiments and Results

We conducted extensive simulations to validate the proposed hierarchical control method for the drone formation.

4.1 Leader Training: Greedy-DDPG vs. DDPG

The leader’s Greedy-DDPG controller was trained in a stochastic environment with 5 randomly placed circular obstacles per episode. Key training parameters are listed below.

Table 2: Key Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Actor Learning Rate | 0.001 |
| Critic Learning Rate | 0.001 |
| Discount Factor (\(\gamma\)) | 0.99 |
| Replay Buffer Size | 50,000 |
| Minibatch Size | 128 |
| Soft Update Rate (\(\tau\)) | 0.001 |
| Greedy Prob. \(\epsilon_1\) | 0.3 |
| Initial Greedy Prob. \(\epsilon_2(0)\) | 0.1 |
| Decay Rate \(r_{\epsilon_2}\) | 0.95 |

Training Performance: The Greedy-DDPG algorithm reached the performance threshold (average episode reward > 8000) in approximately 170 episodes, which was 5.9% faster than the standard DDPG baseline. The learning curve showed a more rapid initial rise, confirming that the greedy exploration strategy accelerated early-phase learning.
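To give a feel for how quickly the worst-action probability fades under the Table 2 settings, the sketch below evaluates the geometric decay \(\epsilon_2(\Delta) = \epsilon_2(0) \cdot r_{\epsilon_2}^{\Delta}\), assuming one decay step per episode. The 0.01 threshold is an arbitrary illustrative choice, not a value from the experiments.

```python
def epsilon2_schedule(eps2_initial, decay_rate, episodes):
    """eps2 for episodes 0 .. episodes-1 under per-episode geometric decay."""
    return [eps2_initial * decay_rate ** k for k in range(episodes)]

def first_episode_below(schedule, threshold):
    """Index of the first episode whose eps2 is below threshold, or None."""
    for episode, eps2 in enumerate(schedule):
        if eps2 < threshold:
            return episode
    return None

# Table 2 values: eps2(0) = 0.1, decay rate 0.95; 170 episodes as in training.
schedule = epsilon2_schedule(0.1, 0.95, 170)
episode_under_1_percent = first_episode_below(schedule, 0.01)

# Probability of picking a uniformly random candidate in episode 0,
# given eps1 = 0.3 from Table 2.
random_prob_start = 1.0 - 0.3 - schedule[0]
```

Under these assumptions the probability of deliberately choosing the worst-valued candidate drops below 1% by episode 45, so the mechanism is active mainly in the early phase where the learning curve rises fastest.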

4.2 Single Leader Performance and Generalization

We first evaluated the trained leader policy in a fixed, unseen test scenario and compared it against a standard APF method and the baseline DDPG leader. Figure X (conceptual plot) shows the trajectories.

  • APF Method: The path was feasible but exhibited noticeable oscillations when encountering the first two obstacles, as the sudden change in the number of sensed obstacles caused a jump in the net repulsive force direction.
  • Baseline DDPG: The path was smoother but came dangerously close to one obstacle (minimum distance of approximately 9.1 m). This “riskier” path, while acceptable for a single drone, reduces the safety margin for the following drone formation.
  • Greedy-DDPG: Our method produced a smooth trajectory with larger clearances from all obstacles (minimum distance greater than 17 m), providing a safer corridor for the follower drones.

To test generalization, we performed Monte Carlo simulations in three different environments: 1) Random Obstacles (trained setting), 2) Large Obstacles, 3) Small, Dense Obstacles. Success was defined as reaching within 0.5 m of the target without collision.

Table 3: Monte Carlo Success Rates for Leader Navigation (%)

| Environment Scenario | Greedy-DDPG | Baseline DDPG |
| --- | --- | --- |
| Random Obstacles (Training-like) | 97.5 | 85.2 |
| Large Obstacles | 95.5 | 72.7 |
| Small, Dense Obstacles | 96.4 | 86.8 |

The results demonstrate that the Greedy-DDPG policy maintains high success rates across varied, unseen obstacle configurations, confirming its superior generalization capability compared to the baseline, which is crucial for operations in unknown environments.

4.3 Full Drone Formation Performance

We simulated a drone formation with 1 leader and 4 followers in a diamond-shaped configuration. The leader was controlled by the trained Greedy-DDPG policy, and followers used the proposed integrated control law of Section 3.1. The formation successfully navigated through the field of unknown obstacles. Key observations are summarized below.

Table 4: Formation Performance Metrics Comparison

| Metric | Proposed Method (Greedy-DDPG Leader + APF-Consensus Followers) | Baseline (APF for All) |
| --- | --- | --- |
| Max Formation Position Error | < 10 m | > 25 m |
| Behavior in Narrow Gaps | Smooth, coordinated passage without congestion. | Oscillation and temporary hovering, risking collision. |
| Leader Path Smoothness | High (learned policy avoids sharp turns). | Moderate (oscillations due to force discontinuities). |
| Overall Safety Margin | Larger clearance from obstacles. | Variable, often closer to obstacles. |

The simulation confirmed that the proposed architecture enables the drone formation to: 1) Maintain Cohesion: Followers effectively tracked the leader’s smooth path while adjusting their relative positions. 2) Avoid All Collisions: No collisions occurred between drones or with obstacles. 3) Adapt Shape: The formation temporarily deformed when necessary to pass through tight spaces and reliably re-formed afterwards. In contrast, a formation where all drones used only reactive APF control showed significant instability, with large formation errors and oscillatory behavior that could lead to intra-formation collisions in more constrained settings.

5. Conclusion

This paper presented an intelligent, hierarchical control method for a fixed-wing drone formation to navigate safely in an unknown environment littered with static obstacles. The core of the approach is a synergistic combination of deep reinforcement learning for high-level, intelligent guidance and artificial potential fields with consensus for low-level, reactive formation keeping and collision avoidance. The proposed Greedy-DDPG algorithm for the leader drone introduces a novel action selection mechanism during exploration that significantly improves training efficiency and the generalizability of the learned obstacle avoidance policy. For the follower drones, the integrated control law ensures robust formation maintenance and simultaneous collision avoidance with both obstacles and neighboring drones, backed by a formal stability analysis.

Comprehensive simulation results validate the effectiveness of the proposed framework. The Greedy-DDPG leader learns a safer and smoother navigation policy than baseline methods. The full drone formation successfully completes missions in complex unknown environments, maintaining cohesion and avoiding all collisions, outperforming a baseline where all agents use only reactive potential field methods. This work demonstrates a significant step towards reliable autonomous drone formation operations in uncertain real-world settings.

Future work will focus on extending the framework to account for more realistic drone dynamics, implementing it under a fully distributed communication architecture without a central leader, and testing the system’s robustness against dynamic obstacles and communication delays.
