The evolution of modern warfare increasingly gravitates towards the concept of swarm-on-swarm engagements, where coordinated groups of autonomous systems confront each other in complex, dynamic battlespaces. This paradigm shift necessitates the development of advanced, intelligent decision-making frameworks for unmanned aerial vehicle (UAV) clusters. This research addresses the critical challenge of enabling a team of defensive UAV drones to efficiently expel and subsequently encircle one or more adversarial UAV drones within a bounded, cluttered environment. Traditional control methods often struggle with the non-linear dynamics, partial observability, and need for emergent cooperation inherent in such multi-agent scenarios. To overcome these limitations, this article proposes and investigates a novel strategy based on the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) algorithm, a state-of-the-art deep reinforcement learning (DRL) technique tailored for continuous control in multi-agent systems.

The core objective is to empower a team of defensive UAV drones with the capability to autonomously learn collaborative behaviors. These behaviors must include: 1) coordinating their movements to drive an intruding UAV drone away from a high-value protected asset or zone, and 2) seamlessly transitioning into a surrounding formation to capture or neutralize the target once it is isolated. Success hinges on the algorithm’s ability to handle kinematic constraints, avoid inter-agent collisions, maintain safe distances from boundaries, and dynamically allocate roles based on the real-time situation—all while operating with only local observations. The proposed approach leverages a centralized-training-with-decentralized-execution (CTDE) paradigm. During training, a central critic network has access to global state information to better guide the learning of cooperative policies. During execution, each defensive UAV drone acts independently based solely on its own localized sensor inputs, ensuring scalability and robustness to communication failures.
The performance and efficacy of the proposed MATD3-based framework are rigorously validated through extensive simulation studies in scenarios involving 2-vs-1 and 3-vs-1 configurations (defenders vs. intruder). Comparative analysis against a baseline Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm demonstrates significant improvements in learning efficiency, task success rate, and the quality of the emergent cooperative strategies. The findings presented herein contribute a viable and advanced technical pathway for enhancing the autonomous collaborative efficacy of UAV drone swarms in defensive counter-air and perimeter security operations.
1. Problem Formulation and System Modeling
This section formally defines the cooperative expulsion and encirclement task, establishes the kinematic model for the UAV drones, and outlines the strategy governing the adversarial intruder’s behavior.
1.1 Task Scenario and Definitions
The operational environment is defined as a confined, two-dimensional rectangular area, representing a simplified but tactically relevant airspace. A critical stationary asset is located at the center of this area. At the initiation of an episode, one or more fast, agile offensive UAV drones (intruders) are spawned at random locations relatively distant from the center. A team of slower but more numerous defensive UAV drones (defenders) is initialized at random positions closer to the protected asset. The defenders’ collective mission is twofold:
- Expulsion Phase: Detect the intruder’s approach and collaboratively maneuver to drive it away from the central asset and towards the environment boundary.
- Encirclement/Capture Phase: Once the intruder is sufficiently displaced from the asset and near the boundary, the defenders should coordinate to surround it, maintaining specific geometric constraints to achieve a successful capture.
The task termination and success conditions are mathematically defined using key distance metrics. Let $d_{g,a}$ represent the distance between a defensive UAV drone $g$ and the offensive UAV drone $a$. Let $d_{cap}$ be a predefined capture radius. Let $d_{saf}$ denote the minimum safe distance required between any two UAV drones to avoid collision. A successful cooperative encirclement is deemed achieved when, for all defensive UAV drones $g_i$ in the team, the following condition holds simultaneously:
$$
d_{cap} \geq d_{g_i,a} \geq d_{saf} \quad \forall i
$$
This condition ensures the target is contained within the collective reach of the defenders while maintaining a safe operational formation. Violating the lower bound ($d_{g_i,a} < d_{saf}$) constitutes a collision, which is penalized. The asymmetric capabilities of the agents are summarized in Table 1.
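The containment condition can be expressed as a simple predicate. A minimal sketch, using the capture and safety radii from Table 3 as defaults (function and argument names are illustrative):

```python
import math

def is_encirclement_successful(defender_positions, intruder_position,
                               d_cap=8.0, d_saf=2.0):
    """Return True when every defender lies within the capture radius of the
    intruder while keeping at least the minimum safe separation."""
    for (x, y) in defender_positions:
        d = math.hypot(x - intruder_position[0], y - intruder_position[1])
        if not (d_saf <= d <= d_cap):
            return False
    return True
```

Note that the condition must hold for all defenders simultaneously: a single defender outside the band $[d_{saf}, d_{cap}]$ invalidates the capture.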
| Attribute | Offensive UAV Drones (Intruder) | Defensive UAV Drones (Defenders) |
|---|---|---|
| Primary Objective | Reach the central asset | Expel and encircle the intruder |
| Relative Speed | Higher maximum speed | Lower maximum speed |
| Relative Acceleration | Higher maximum acceleration | Lower maximum acceleration |
| Maneuverability | Higher angular rate limit | Lower angular rate limit |
| Team Size | Fewer in number (e.g., 1) | Greater in number (e.g., 2 or 3) |
1.2 Kinematic Model for UAV Drones
All UAV drones are modeled as point masses moving in a 2D plane. The kinematic state of the $i$-th UAV drone (denoted generically) at time $t$ is described by its position $(x_i(t), y_i(t))$, linear speed $v_i(t)$, heading angle $\psi_i(t)$, and angular velocity $\omega_i(t)$. The control inputs are the linear acceleration $a_i(t)$ and the angular acceleration $\alpha_i(t)$. The continuous-time kinematics are given by:
$$
\begin{aligned}
\dot{x}_i(t) &= v_i(t) \cos(\psi_i(t)) \\
\dot{y}_i(t) &= v_i(t) \sin(\psi_i(t)) \\
\dot{\psi}_i(t) &= \omega_i(t) \\
\dot{v}_i(t) &= a_i(t) \\
\dot{\omega}_i(t) &= \alpha_i(t)
\end{aligned}
$$
For implementation in a discrete-time simulation, this model is integrated using a suitable method (e.g., Euler integration) with time step $\Delta t$. Realistic physical constraints are imposed on each type of UAV drone:
$$
\begin{aligned}
\text{Speed:} &\quad 0 \leq v_i(t) \leq v_{i}^{max} \\
\text{Acceleration:} &\quad a_{i}^{min} \leq a_i(t) \leq a_{i}^{max} \\
\text{Angular Velocity:} &\quad |\omega_i(t)| \leq \omega_{i}^{max} \\
\text{Angular Acceleration:} &\quad |\alpha_i(t)| \leq \alpha_{i}^{max}
\end{aligned}
$$
The specific limits for $v_{i}^{max}$, $a_{i}^{max}$, $\omega_{i}^{max}$, and $\alpha_{i}^{max}$ differ for offensive and defensive UAV drones, as implied by Table 1.
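Under Euler integration, one simulation step of the model above reduces to a few clamped updates. A minimal sketch, with defaults matching the defender values in Table 3 where given ($v^{max}$, $a^{max}$, $\Delta t$); the minimum acceleration and angular limits are illustrative assumptions:

```python
import math

def kinematic_step(state, a_cmd, alpha_cmd, dt=0.05,
                   v_max=6.0, a_min=-1.5, a_max=1.5,
                   omega_max=1.0, alpha_max=1.0):
    """One Euler step of the 2D point-mass kinematics with saturation limits.
    state = (x, y, v, psi, omega); returns the successor state."""
    x, y, v, psi, omega = state
    a = min(max(a_cmd, a_min), a_max)                  # clamp linear acceleration
    alpha = min(max(alpha_cmd, -alpha_max), alpha_max) # clamp angular acceleration
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi += omega * dt
    v = min(max(v + a * dt, 0.0), v_max)               # speed stays in [0, v_max]
    omega = min(max(omega + alpha * dt, -omega_max), omega_max)
    return (x, y, v, psi, omega)
```

Position and heading are propagated with the pre-update speed and turn rate, then the speed and turn rate themselves are integrated and saturated, which is the conventional explicit-Euler ordering.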
1.3 Adversarial UAV Drone Strategy using Artificial Potential Fields
To create a dynamic and challenging training environment, the offensive UAV drone (intruder) is not passive. It employs an Artificial Potential Field (APF) strategy to navigate towards the central target while avoiding defenders. This strategy generates a resultant force $\vec{F}_{total}$ that guides its acceleration. The total force is a superposition of three components:
- Attractive Force to Target ($\vec{F}_{attr}$): Pulls the intruder towards the center.
$$ \vec{F}_{attr} = k_{attr} \cdot \frac{\vec{P}_{target} - \vec{P}_{intruder}}{\|\vec{P}_{target} - \vec{P}_{intruder}\|^2} $$
- Repulsive Force from Defenders ($\vec{F}_{rep}$): Pushes the intruder away from nearby defensive UAV drones to avoid immediate capture.
$$ \vec{F}_{rep} = \sum_{j} k_{rep} \cdot \frac{\vec{P}_{intruder} - \vec{P}_{defender,j}}{\|\vec{P}_{intruder} - \vec{P}_{defender,j}\|^3}, \quad \text{if } \|\vec{P}_{intruder} - \vec{P}_{defender,j}\| < d_{rep\_thresh} $$
- Repulsive Force from Boundary ($\vec{F}_{bound}$): Prevents the intruder from exiting the environment prematurely, ensuring the expulsion task is meaningful.
$$ \vec{F}_{bound} = k_{bound} \cdot \frac{\vec{P}_{intruder} - \vec{P}_{nearest\_boundary}}{\|\vec{P}_{intruder} - \vec{P}_{nearest\_boundary}\|^3} $$
The coefficients $k_{attr}$, $k_{rep}$, and $k_{bound}$ scale the influence of each force. The intruder’s acceleration command $\vec{a}_{cmd}$ is derived from this total force, subject to its own kinematic constraints. This intelligent adversary forces the defensive UAV drones to learn robust and adaptive cooperative strategies.
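The force superposition can be sketched as follows. The gains, the repulsion threshold, the small epsilon guards, and the nearest-boundary-point computation for a square arena are illustrative assumptions, not values from the text:

```python
import numpy as np

def apf_force(p_intruder, p_target, p_defenders,
              k_attr=1.0, k_rep=5.0, k_bound=2.0,
              d_rep_thresh=10.0, env_size=100.0):
    """Total APF force on the intruder: attraction to the target plus
    repulsion from nearby defenders and from the nearest boundary point."""
    p = np.asarray(p_intruder, dtype=float)
    # Attraction to the central target (inverse-square form from the text).
    diff = np.asarray(p_target, dtype=float) - p
    f = k_attr * diff / (np.linalg.norm(diff) ** 2 + 1e-9)
    # Repulsion from each defender inside the threshold radius.
    for pd in p_defenders:
        diff = p - np.asarray(pd, dtype=float)
        dist = np.linalg.norm(diff)
        if dist < d_rep_thresh:
            f = f + k_rep * diff / (dist ** 3 + 1e-9)
    # Repulsion from the nearest point on the rectangular boundary.
    dists = {'left': p[0], 'right': env_size - p[0],
             'bottom': p[1], 'top': env_size - p[1]}
    side = min(dists, key=dists.get)
    nearest = {'left':   np.array([0.0, p[1]]),
               'right':  np.array([env_size, p[1]]),
               'bottom': np.array([p[0], 0.0]),
               'top':    np.array([p[0], env_size])}[side]
    diff = p - nearest
    dist = np.linalg.norm(diff)
    f = f + k_bound * diff / (dist ** 3 + 1e-9)
    return f
```

The resulting vector is then converted into an acceleration command and clipped to the intruder's kinematic limits before being applied.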
2. Multi-Agent Deep Reinforcement Learning Framework
This section details the core MARL methodology, focusing on the MATD3 algorithm, and describes the specific design of the state/action spaces and the multi-dimensional reward function for the defensive UAV drone team.
2.1 Fundamentals of Multi-Agent Reinforcement Learning
The problem is formalized as a Markov Game, an extension of Markov Decision Processes (MDPs) to $N$ agents. Each defensive UAV drone is an agent $i$. At each time step $t$, agent $i$ receives a local observation $o_i^t$ from the environment, which is a partial reflection of the true global state $s^t$. Based on its policy $\pi_{\theta_i}(a_i^t | o_i^t)$ parameterized by $\theta_i$, it selects a continuous action $a_i^t$. The joint action of all defenders $\mathbf{a}^t = (a_1^t, \dots, a_N^t)$ causes the environment to transition to a new state $s^{t+1}$ according to the state transition function $P(s^{t+1}|s^t, \mathbf{a}^t)$. Each agent receives an individual reward $r_i^t(s^t, \mathbf{a}^t, s^{t+1})$. The goal for each agent is to learn a policy that maximizes its own expected discounted cumulative return $R_i = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k r_i^{t+k+1}]$, where $\gamma \in [0,1)$ is the discount factor. The CTDE paradigm is crucial: during training, algorithms can leverage global information $s^t$ to learn better centralized value functions or policy gradients, but during execution, each agent's policy must depend only on its local observation $o_i^t$.
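For reference, the discounted return over a finite reward trace can be computed by a backward recursion (a trivial sketch; $\gamma = 0.95$ follows Table 4):

```python
def discounted_return(rewards, gamma=0.95):
    """Discounted cumulative return R = sum_k gamma^k * r_k,
    accumulated backwards so each step is a single multiply-add."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```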
2.2 The MATD3 Algorithm for UAV Drone Teams
The proposed solution employs the MATD3 algorithm. MATD3 extends the Twin Delayed DDPG (TD3) algorithm—which addresses function approximation error and overestimation bias in the actor-critic framework—to multi-agent settings. Each defensive UAV drone $i$ maintains the following neural networks:
- Actor (Policy) Network ($\mu_{\theta_i}$): Takes the local observation $o_i$ and outputs a continuous action $a_i$.
- Two Critic (Q-value) Networks ($Q_{\phi_i^1}, Q_{\phi_i^2}$): Each takes the global state $s$ and the joint action $\mathbf{a}$ of all defenders, and outputs an estimate of agent $i$’s expected return. Having two critics helps reduce overestimation bias.
- Target Networks: Slow-moving copies ($\mu_{\theta_i'}, Q_{\phi_i^{1'}}, Q_{\phi_i^{2'}}$) of the above networks, used to stabilize training.
The key steps in the MATD3 training loop are:
- Experience Collection: Agents interact with the environment using their current policies (with added exploration noise) and store transition tuples $(s^t, \mathbf{a}^t, r^t, s^{t+1})$ in a shared replay buffer.
- Centralized Critic Update: A mini-batch of transitions is sampled. The target value $y_i$ for agent $i$ is computed using the minimum of the two target critic networks and the target actors, a technique known as clipped double Q-learning:
$$ y_i = r_i^t + \gamma \min_{k=1,2} Q_{\phi_i^{k'}}(s^{t+1}, \tilde{\mathbf{a}}') \big|_{\tilde{a}_j' = \mu_{\theta_j'}(o_j^{t+1}) + \epsilon,\; \epsilon \sim \text{clip}(\mathcal{N}(0,\sigma), -c, c)} $$
The critics are updated by minimizing the mean-squared error loss against this target:
$$ \mathcal{L}(\phi_i^k) = \mathbb{E}_{(s,\mathbf{a},r,s') \sim \mathcal{D}} \left[ \left( Q_{\phi_i^k}(s, \mathbf{a}) - y_i \right)^2 \right], \quad k=1,2 $$
- Delayed Actor Update via the Policy Gradient: The actor is updated less frequently than the critics (e.g., every $d$ steps) to allow the value estimates to stabilize. The gradient for agent $i$'s actor is:
$$ \nabla_{\theta_i} J(\mu_{\theta_i}) \approx \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_{\theta_i}(o_i) \, \nabla_{a_i} Q_{\phi_i^1}(s, \mathbf{a}) \big|_{a_i=\mu_{\theta_i}(o_i)} \right] $$
Note that the gradient is taken through the first critic $Q_{\phi_i^1}$ only.
- Soft Target Updates: All target networks are updated softly: $\theta_i' \leftarrow \tau \theta_i + (1-\tau)\theta_i'$ and $\phi_i^{k'} \leftarrow \tau \phi_i^k + (1-\tau)\phi_i^{k'}$, with $\tau \ll 1$.
This framework enables the team of UAV drones to learn coordinated expulsion and encirclement policies that are robust, stable, and free from the pathological overestimation common in simpler DRL methods.
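The three TD3-specific mechanisms, target policy smoothing, the clipped double-Q target, and Polyak-averaged soft updates, can be illustrated with stand-in scalar critics. This is a minimal numpy sketch under stated assumptions: the critic values are placeholders, not trained networks, and $\gamma$, $\tau$, $\sigma$, and $c$ follow Table 4 where given:

```python
import numpy as np

rng = np.random.default_rng(0)

def td_target(r, q1_next, q2_next, gamma=0.95):
    """Clipped double Q-learning target: bootstrap from the smaller of the
    two target-critic estimates to curb overestimation bias."""
    return r + gamma * min(q1_next, q2_next)

def smoothed_target_action(mu_next, sigma=0.2, c=0.5, low=-1.0, high=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target
    actor's action, then clip back into the valid action range."""
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(mu_next)), -c, c)
    return np.clip(np.asarray(mu_next, dtype=float) + noise, low, high)

def soft_update(target_params, params, tau=0.01):
    """Polyak averaging of target-network parameters:
    theta' <- (1 - tau) * theta' + tau * theta."""
    return [(1 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

In a full implementation, `td_target` would be evaluated per sampled transition with the joint smoothed actions of all defenders fed to both target critics.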
2.3 State and Action Space Design for Defensive UAV Drones
The design of the observation and action spaces is critical for learning efficiency. The local observation $o_i$ for a defensive UAV drone $i$ is a concatenated vector containing:
$$
o_i = [ \text{Self\_State}, \text{Relative\_Teammate\_Info}, \text{Relative\_Intruder\_Info}, \text{Boundary\_Info} ]
$$
- Self_State: Its own normalized position $(x_i, y_i)$, velocity $(v_{x_i}, v_{y_i})$, and heading $\psi_i$.
- Relative_Teammate_Info: For each teammate $j$, the relative distance $d_{ij}$, relative bearing $\beta_{ij}$, and relative speed.
- Relative_Intruder_Info: The relative distance $d_{i,a}$, relative bearing $\beta_{i,a}$, and the intruder’s velocity vector relative to the drone.
- Boundary_Info: The shortest distance to any of the four environment boundaries.
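Assembling these four blocks into a flat vector might look like the following. The normalization constants, field ordering, and state tuple layout are illustrative assumptions:

```python
import math

def build_observation(self_state, teammates, intruder,
                      env_size=100.0, v_max=8.0):
    """Concatenate self, teammate-relative, intruder-relative, and boundary
    features into one flat observation vector.
    self_state = (x, y, vx, vy, psi); teammates/intruder = (x, y, vx, vy)."""
    x, y, vx, vy, psi = self_state
    obs = [x / env_size, y / env_size, vx / v_max, vy / v_max, psi / math.pi]
    for (tx, ty, tvx, tvy) in teammates:           # relative teammate info
        d = math.hypot(tx - x, ty - y)
        bearing = math.atan2(ty - y, tx - x) - psi
        rel_speed = math.hypot(tvx - vx, tvy - vy)
        obs += [d / env_size, bearing / math.pi, rel_speed / v_max]
    ix, iy, ivx, ivy = intruder                    # relative intruder info
    d = math.hypot(ix - x, iy - y)
    bearing = math.atan2(iy - y, ix - x) - psi
    obs += [d / env_size, bearing / math.pi,
            (ivx - vx) / v_max, (ivy - vy) / v_max]
    obs.append(min(x, env_size - x, y, env_size - y) / env_size)  # boundary
    return obs
```

Normalizing every feature to a comparable scale is a standard practice that keeps the actor and critic inputs well conditioned.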
The action space for each UAV drone is continuous and two-dimensional, corresponding to the normalized control inputs:
$$ a_i = [ \tilde{a}_i, \tilde{\alpha}_i ] $$
where $\tilde{a}_i \in [-1, 1]$ maps to the linear acceleration range $[a_{i}^{min}, a_{i}^{max}]$, and $\tilde{\alpha}_i \in [-1, 1]$ maps to the angular acceleration range $[-\alpha_{i}^{max}, \alpha_{i}^{max}]$.
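The mapping from normalized actor outputs to physical commands is an affine transform. A sketch, with the defender's $a^{max}$ taken from Table 3; the symmetric $a^{min}$ and the angular limit are illustrative assumptions since the text does not state them:

```python
def denormalize_action(a_norm, alpha_norm,
                       a_min=-1.5, a_max=1.5, alpha_max=1.0):
    """Map actor outputs in [-1, 1] to physical acceleration commands."""
    # Affine map of [-1, 1] onto [a_min, a_max].
    a_cmd = a_min + (a_norm + 1.0) * 0.5 * (a_max - a_min)
    # Symmetric range just scales by the limit.
    alpha_cmd = alpha_norm * alpha_max
    return a_cmd, alpha_cmd
```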
2.4 Multi-Dimensional Reward Function Engineering
A carefully shaped reward function is the primary mechanism for guiding the UAV drone team towards the desired cooperative behavior. The individual reward $r_i^t$ for defender $i$ at time $t$ is a weighted sum of several components, blending dense shaping rewards with sparse terminal rewards. This multi-dimensional design addresses different facets of the task. The components are defined in Table 2.
| Component | Mathematical Formulation | Purpose (Guidance) |
|---|---|---|
| R1: Pursuit Reward | $r_{1,i} = -\lambda_1 \cdot d_{i,a}$ | Encourages closing the distance to the intruder via a dense penalty proportional to that distance. |
| R2: Expulsion Reward | $r_{2,i} = -\lambda_2 \cdot (d_{safe\_zone} - d_{a,center})$ if $d_{a,center} < d_{safe\_zone}$, else $0$. | Activates when the intruder is inside the protected zone around the asset; the penalty grows with penetration depth, encouraging expulsion. |
| R3: Boundary Drive Reward | $r_{3,i} = +\lambda_3 \cdot (1 / d_{a,boundary})$ when $d_{a,center} > d_{safe\_zone}$. | After expulsion, rewards defenders for minimizing the intruder’s distance to the boundary, driving it out. |
| R4: Formation/Spacing Reward | $r_{4,i} = -\lambda_4 \cdot \text{Var}(\{d_{1,a}, d_{2,a}, \dots\}) + \lambda_5 \cdot \min_j(d_{i,j})$ | Promotes cooperative encirclement. The first term encourages equal defender-to-intruder distances (an even surround). The second term rewards separation from the nearest teammate, preventing clustering. |
| R5: Sparse Success/Failure | $r_{5,i} = +R_{capture}$ if the encirclement condition is met for all $i$; $r_{5,i} = -R_{collision}$ if $d_{i,j} < d_{saf}$ or $d_{i,a} < d_{saf}$; $r_{5,i} = -R_{intrusion}$ if the intruder reaches the asset center. | Provides strong terminal signals for task success, safety violation, or mission failure. |
The total reward is: $r_i^t = \sum_{k=1}^{5} r_{k,i}^t$. The weights $\lambda_1$ through $\lambda_5$ and the sparse reward magnitudes $R_{*}$ are hyperparameters tuned to balance the sub-task objectives.
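The dense components R1 through R4 can be sketched for a single defender as follows. The weights are illustrative placeholders (the text leaves $\lambda_1$ to $\lambda_5$ as tuned hyperparameters); $d_{safe\_zone}$ follows Table 3, and the sparse R5 terms are omitted since they fire only at episode termination:

```python
import statistics

def shaped_reward(d_i_a, all_defender_intruder_dists, min_teammate_dist,
                  d_a_center, d_a_boundary,
                  lam=(0.01, 0.05, 0.5, 0.1, 0.02), d_safe_zone=15.0):
    """Dense reward components R1-R4 for one defender (sparse R5 omitted)."""
    l1, l2, l3, l4, l5 = lam
    r1 = -l1 * d_i_a                                        # pursuit: close the gap
    # Expulsion penalty grows with penetration into the protected zone.
    r2 = -l2 * (d_safe_zone - d_a_center) if d_a_center < d_safe_zone else 0.0
    # Once expelled, reward pushing the intruder toward the boundary.
    r3 = l3 / max(d_a_boundary, 1e-6) if d_a_center > d_safe_zone else 0.0
    # Encircle evenly (low variance) while keeping teammate spread.
    r4 = (-l4 * statistics.pvariance(all_defender_intruder_dists)
          + l5 * min_teammate_dist)
    return r1 + r2 + r3 + r4
```

Keeping each component on a comparable magnitude is the practical reason the weights must be tuned jointly rather than in isolation.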
3. Simulation Experiments and Performance Analysis
This section describes the experimental setup, presents the training results, and provides a comparative analysis of the proposed MATD3-based strategy against a baseline algorithm.
3.1 Simulation Setup and Parameters
The training environment was developed in Python, utilizing the PyTorch library for neural network implementation and a custom multi-agent simulation framework. Key environment and training parameters are summarized in Table 3 and Table 4, respectively.
| Parameter | Value |
|---|---|
| Environment Size | $100m \times 100m$ |
| Protected Asset Radius ($d_{safe\_zone}$) | $15m$ |
| Capture Radius ($d_{cap}$) | $8m$ |
| Collision/Safety Distance ($d_{saf}$) | $2m$ |
| Defender UAV Max Speed ($v_{def}^{max}$) | $6 m/s$ |
| Intruder UAV Max Speed ($v_{int}^{max}$) | $8 m/s$ |
| Defender UAV Max Acceleration ($a_{def}^{max}$) | $1.5 m/s^2$ |
| Intruder UAV Max Acceleration ($a_{int}^{max}$) | $3.0 m/s^2$ |
| Simulation Time Step ($\Delta t$) | $0.05s$ |
| Maximum Episode Length | $300$ steps ($15s$ simulated) |

| Hyperparameter | Value |
|---|---|
| Actor Network Architecture | FC(64) → ReLU → FC(64) → ReLU → FC(2) → Tanh |
| Critic Network Architecture | FC(128) → ReLU → FC(128) → ReLU → FC(1) |
| Replay Buffer Size | $1 \times 10^6$ transitions |
| Mini-batch Size | $512$ |
| Discount Factor ($\gamma$) | $0.95$ |
| Actor Learning Rate | $1 \times 10^{-4}$ |
| Critic Learning Rate | $1 \times 10^{-3}$ |
| Target Update Rate ($\tau$) | $0.01$ |
| Policy Update Delay ($d$) | $2$ |
| Exploration Noise ($\sigma$) | OU process with $\theta=0.15$, $\sigma=0.2$ |
| Total Training Episodes | $25,000$ |
3.2 Training Performance and Emergent Behavior
The MATD3 algorithm was trained in two distinct scenarios: a 2-vs-1 configuration (2 defenders, 1 intruder) and a 3-vs-1 configuration. The learning progress was monitored using the average undiscounted episode return for the defending team. Figure 1 conceptually illustrates the learning curve. Initially, the return is low and volatile as the UAV drones explore randomly. As training progresses, the return increases steadily, indicating the agents are learning valuable behaviors—first basic pursuit, then coordinated expulsion, and finally sophisticated encirclement. The curve eventually plateaus as the policy converges towards a near-optimal cooperative strategy.
Key Observed Emergent Behaviors:
- Role Specialization: In the 3-vs-1 case, defenders often implicitly divided roles. One UAV drone would engage the intruder head-on to block its path to the asset, while the others maneuvered to flank and drive it towards the boundary.
- Adaptive Formation: During the encirclement phase, the defenders dynamically adjusted their positions to maintain the capture condition $d_{cap} \geq d_{g_i,a} \geq d_{saf}$. They naturally formed an approximate equilateral triangle (in 3-vs-1) or a pincer movement (in 2-vs-1) around the target.
- Robustness to Adversary: The learned policies were robust to the APF-based maneuvers of the intruder. Defenders learned to anticipate its repulsion from them and used it to herd the intruder in the desired direction.
3.3 Comparative Analysis: MATD3 vs. MADDPG Baseline
To benchmark performance, the proposed MATD3 approach was compared against the foundational MADDPG algorithm under identical environmental conditions and hyperparameter settings (except for the MATD3-specific delayed update and clipped double Q-learning). The comparison focused on three metrics: 1) Learning Convergence Speed, 2) Final Task Success Rate over the last 1000 evaluation episodes, and 3) Average Episode Return at convergence. The results are quantified in Table 5.
| Metric | MATD3 (Proposed) | MADDPG (Baseline) | Improvement |
|---|---|---|---|
| Episodes to Convergence* | ~18,500 | ~21,500 | ~13.9% Faster |
| Final Success Rate | 92.4% | 84.7% | +7.7% |
| Avg. Episode Return at Convergence | +145.6 | +128.3 | +13.5% |
| Policy Stability (Std. Dev. of Last 1k Returns) | 18.2 | 31.7 | More Stable |
*Convergence defined as the point where the moving average return stabilizes within 5% of its final value.
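This convergence criterion can be made concrete as follows (a sketch; the moving-average window size is an assumption, and the tolerance is the 5% from the footnote):

```python
def episodes_to_convergence(returns, window=100, tol=0.05):
    """First episode index at which the moving-average return stays within
    tol (fractional) of its final value for the remainder of training."""
    n = len(returns)
    # Trailing moving average with a shortened window at the start.
    ma = [sum(returns[max(0, i - window + 1): i + 1]) /
          (i - max(0, i - window + 1) + 1) for i in range(n)]
    final = ma[-1]
    for i in range(n):
        if all(abs(m - final) <= tol * abs(final) for m in ma[i:]):
            return i
    return n - 1
```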
The analysis clearly demonstrates the advantages of MATD3. The twin critic and delayed update mechanisms effectively mitigate the Q-value overestimation problem prevalent in MADDPG, leading to more stable policy gradients. This results in faster and more reliable convergence. Furthermore, the higher success rate and return indicate that the policies learned by MATD3 are not only learned quicker but are also of higher quality, enabling the team of UAV drones to execute the complex expulsion-encirclement maneuver more consistently and efficiently. The lower standard deviation of returns confirms the improved training stability of MATD3.
4. Conclusion and Future Work
This research has presented a comprehensive framework for intelligent cooperative expulsion and encirclement of dynamic targets using a team of UAV drones. By formulating the problem within a multi-agent reinforcement learning context and employing the advanced MATD3 algorithm, the developed system enables a group of defensive UAV drones to learn complex, adaptive, and robust collaborative strategies autonomously. The core contributions include: 1) a detailed problem formulation incorporating realistic kinematics and an intelligent adversarial target model; 2) the design of a multi-dimensional reward function that effectively shapes the desired phased behavior (expulsion followed by encirclement); and 3) the successful application and validation of the MATD3 algorithm, which demonstrated superior performance in terms of convergence speed, success rate, and policy stability compared to a standard MADDPG baseline.
The simulation results in 2-vs-1 and 3-vs-1 scenarios confirm that the UAV drone team can learn to effectively coordinate their movements, herd an intruder away from a protected zone, and seamlessly transition into a capturing formation—all while respecting safety constraints and operating under partial observability. This work provides a significant step towards deployable autonomous systems for perimeter defense, asset protection, and swarm-based interception tasks.
Future research directions will focus on enhancing the framework’s practicality and scalability. Key areas include: 1) extending the model to 3D dynamics for more realistic UAV drone operations; 2) incorporating heterogeneous UAV drone teams with specialized capabilities (e.g., scouts, blockers, captors); 3) improving generalization to varying numbers of intruders and defenders without retraining, potentially using graph neural networks for policy representation; 4) addressing communication constraints and partial observability more explicitly; and 5) transitioning from simulation to physical platform validation using quadcopter swarms in controlled environments. The continued advancement of such intelligent multi-UAV systems is pivotal for the next generation of autonomous aerial operations.
