Multi-UAV Cooperative Pursuit Method Based on GRU-MAPPO

Recent advances in drone technology have expanded applications across military and civilian domains. Multi-Unmanned Aerial Vehicle cooperative pursuit is a critical capability in which pursuit drones collaborate to capture an evasive target, with strategic importance for defense systems, interception missions, and search-and-rescue operations. However, existing methods suffer from partial observability, inefficient coordination, and slow convergence in complex 3D environments. To address these challenges, we propose GRU-MAPPO, a novel framework that integrates gated recurrent units with multi-agent proximal policy optimization and significantly enhances cooperative pursuit performance for Unmanned Aerial Vehicles.

Problem Formulation and Modeling

The kinematics of Unmanned Aerial Vehicles in a 3D coordinate system follow:

$$
\begin{cases}
\dot{x}_i(t) = v_i(t) \cos \phi_i \cos \theta_i \\
\dot{y}_i(t) = v_i(t) \cos \phi_i \sin \theta_i \\
\dot{z}_i(t) = v_i(t) \sin \phi_i \\
\dot{\theta}_i(t) = \omega_i(t)
\end{cases}
$$

where \( (x_i, y_i, z_i) \) denotes position, \( v_i \) is the velocity, \( \theta_i \) the yaw (heading) angle, \( \phi_i \) the pitch angle, and \( \omega_i \) the yaw angular velocity. Physical constraints are defined as:

$$
-v_{\text{max}} < v_i < v_{\text{max}}, \quad -\omega_{\text{max}} < \omega_i < \omega_{\text{max}}
$$
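
As a concrete illustration, the sketch below advances a single UAV by one Euler integration step of this kinematic model; the step size `dt`, the velocity and turn-rate limits, and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

V_MAX = 20.0      # assumed speed limit (m/s); the paper leaves the value unspecified
OMEGA_MAX = 1.0   # assumed yaw-rate limit (rad/s)

def step_uav(state, v, omega, phi, dt=0.1):
    """One Euler step of the 3D kinematic model above.

    state: (x, y, z, theta); phi is the pitch (flight-path) angle.
    Controls are clipped to the physical constraints before integration.
    """
    x, y, z, theta = state
    v = np.clip(v, -V_MAX, V_MAX)
    omega = np.clip(omega, -OMEGA_MAX, OMEGA_MAX)
    x += v * np.cos(phi) * np.cos(theta) * dt
    y += v * np.cos(phi) * np.sin(theta) * dt
    z += v * np.sin(phi) * dt
    theta += omega * dt
    return np.array([x, y, z, theta])
```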

The pursuit mission involves \( N \) cooperative drones capturing a single evader \( M \) that possesses superior maneuverability. The key constraints, illustrated in the sketch after the list, are:

  1. Capture condition: \( \min_{i} D(U_i(T), M(T)) \leq d_{\text{capture}} \)
  2. Collision avoidance: \( D(U_i(t), U_j(t)) > d_{\text{collision}}, \forall i \neq j \)
  3. Boundary constraint: \( P_{U_i}(t) \in \Omega, \forall i \)
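
These conditions reduce to simple geometric predicates. A minimal sketch follows, assuming placeholder threshold values and a box-shaped arena \( \Omega \); both are assumptions, as the section does not fix them.

```python
import numpy as np

# Assumed thresholds (m); the paper's values are not given in this section.
D_CAPTURE, D_COLLISION = 2.0, 1.0

def capture_achieved(pursuers, evader):
    """Constraint 1: some pursuer is within the capture radius of the evader."""
    dists = np.linalg.norm(pursuers - evader, axis=1)
    return dists.min() <= D_CAPTURE

def collision_free(pursuers):
    """Constraint 2: all pairwise pursuer distances exceed the collision radius."""
    n = len(pursuers)
    return all(np.linalg.norm(pursuers[i] - pursuers[j]) > D_COLLISION
               for i in range(n) for j in range(i + 1, n))

def inside_arena(pursuers, lo, hi):
    """Constraint 3: every pursuer position stays inside the assumed box [lo, hi]^3."""
    return bool(np.all((pursuers >= lo) & (pursuers <= hi)))
```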

We model this as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) represented by the tuple \( (S, A, T, R, O, Z, \gamma) \):

$$
\begin{aligned}
T(s, a, s') &= P(s' \mid s, a) \\
Z(s, a, o) &= P(o \mid s, a)
\end{aligned}
$$

| Observation Component | Variables | Dimension |
| --- | --- | --- |
| Self-state | \( (v_i, \omega_i, \theta_i) \) | 3 |
| Target information | \( (d_{i,M}, \alpha_{i,M}, v_M, \omega_M, \theta_M) \) | 5 |
| Nearest ally | \( (d_{i,j}, \beta_{i,j}, \chi_{i,j}) \) | 3 |

Table 1: UAV observation space structure
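
To make the layout concrete, a minimal sketch of assembling this 11-dimensional observation vector follows; the helper name `build_observation` and the flat concatenation layout are assumptions, since the text does not specify ordering or normalization.

```python
import numpy as np

def build_observation(self_state, target_info, nearest_ally):
    """Concatenate the per-UAV observation of Table 1 (3 + 5 + 3 = 11 dims).

    self_state:   (v_i, omega_i, theta_i)
    target_info:  (d_iM, alpha_iM, v_M, omega_M, theta_M)
    nearest_ally: (d_ij, beta_ij, chi_ij)
    """
    obs = np.concatenate([np.asarray(self_state, dtype=np.float32),
                          np.asarray(target_info, dtype=np.float32),
                          np.asarray(nearest_ally, dtype=np.float32)])
    assert obs.shape == (11,)
    return obs
```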

GRU-MAPPO Framework

Our architecture addresses partial observability through gated recurrent units integrated with MAPPO:

$$
\begin{aligned}
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\
\tilde{h}_t &= \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$
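
The following NumPy transcription of these gate equations shows how a single recurrent step maps the observation \( x_t \) and hidden state \( h_{t-1} \) to \( h_t \). It is a sketch of the cell's arithmetic, not the trained actor-critic network, and the initialization scheme and layer sizes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Direct transcription of the gate equations above (hidden size H, input size D)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        shape = (hidden_dim, hidden_dim + input_dim)   # weights act on [h_{t-1}, x_t]
        self.W_r = rng.uniform(-scale, scale, shape)
        self.W_z = rng.uniform(-scale, scale, shape)
        self.W_h = rng.uniform(-scale, scale, shape)
        self.b_r = np.zeros(hidden_dim)
        self.b_z = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)

    def __call__(self, h_prev, x):
        hx = np.concatenate([h_prev, x])
        r = sigmoid(self.W_r @ hx + self.b_r)          # reset gate r_t
        z = sigmoid(self.W_z @ hx + self.b_z)          # update gate z_t
        h_tilde = np.tanh(self.W_h @ np.concatenate([r * h_prev, x]) + self.b_h)
        return (1.0 - z) * h_prev + z * h_tilde        # h_t

# Hidden size 64 is illustrative; input size 11 matches Table 1's observation.
cell = GRUCell(input_dim=11, hidden_dim=64)
h = cell(np.zeros(64), np.zeros(11))
```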

The hybrid reward function combines sparse and shaped components:

$$
\begin{aligned}
R_t &= \omega_p R_p + \omega_d R_d + \omega_f R_f + \omega_c R_c + \omega_s R_s \\
R_p &= \begin{cases} r_{\text{capture}} & D_{U,M} < d_{\text{capture}} \\ 0 & \text{otherwise} \end{cases} \\
R_d &= -D_{U,M} \\
R_f &= -\left| \chi_{i,j} - \frac{\pi}{N} \right| \\
R_c &= \begin{cases} r_{\text{collision}} & d_{\min} < d_{\text{collision}} \\ 0 & \text{otherwise} \end{cases} \\
R_s &= \begin{cases} \dfrac{d_{\min} - d_{\text{safe}}}{d_{\text{safe}} - d_{\text{collision}}} & d_{\min} < d_{\text{safe}} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$
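
A minimal sketch of this hybrid reward follows; all weights \( \omega \) and distance thresholds are placeholder values, since the paper's settings are not reproduced in this section.

```python
import numpy as np

# Placeholder weights and thresholds (the section does not list the actual values).
WEIGHTS = dict(p=1.0, d=0.01, f=0.1, c=1.0, s=0.1)
R_CAPTURE, R_COLLISION = 10.0, -10.0
D_CAPTURE, D_COLLISION, D_SAFE = 2.0, 1.0, 3.0

def hybrid_reward(d_um, d_min, chi_ij, n_uavs):
    """Hybrid reward combining the five terms above.

    d_um:   pursuer-to-evader distance D_{U,M}
    d_min:  minimum inter-pursuer distance
    chi_ij: angular spacing to the nearest ally
    """
    r_p = R_CAPTURE if d_um < D_CAPTURE else 0.0        # sparse capture bonus
    r_d = -d_um                                         # shaped approach term
    r_f = -abs(chi_ij - np.pi / n_uavs)                 # encirclement formation term
    r_c = R_COLLISION if d_min < D_COLLISION else 0.0   # hard collision penalty
    r_s = ((d_min - D_SAFE) / (D_SAFE - D_COLLISION)    # graded safety-margin penalty
           if d_min < D_SAFE else 0.0)
    return (WEIGHTS['p'] * r_p + WEIGHTS['d'] * r_d + WEIGHTS['f'] * r_f
            + WEIGHTS['c'] * r_c + WEIGHTS['s'] * r_s)
```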

Curriculum learning dynamically adjusts task difficulty through a linear regulator on the capture radius:

$$
d_{\text{capture}}^* = \begin{cases}
d_{\text{capture}} [\lambda_d + (1 – \lambda_d)k] & k < 1 \\
d_{\text{capture}} & \text{otherwise}
\end{cases}, \quad k = \frac{\text{step}}{\text{Maxstep}}
$$
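
The regulator reduces to a few lines of code; the sketch below mirrors the formula, with \( \lambda_d \) left as a tunable parameter since its value (and hence whether the effective radius starts enlarged or reduced) is not given in this section.

```python
def curriculum_capture_radius(d_capture, step, max_step, lambda_d):
    """Linear curriculum regulator for the effective capture radius.

    For k = step / max_step < 1 the radius is scaled by
    lambda_d + (1 - lambda_d) * k, annealing linearly to d_capture at k = 1.
    """
    k = step / max_step
    if k < 1.0:
        return d_capture * (lambda_d + (1.0 - lambda_d) * k)
    return d_capture
```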

Experimental Validation

Simulations in a ROS/Gazebo environment demonstrate the effectiveness of our approach:

| Algorithm | Success Rate (%) | Avg. Reward | Capture Time (s) |
| --- | --- | --- | --- |
| LSTM-MAPPO | 79 | 81 | 26.0 |
| MAPPO | 68 | 69 | 26.9 |
| A3C | 67 | 71 | 26.2 |
| MADDPG | 62 | 60 | 28.0 |
| IPPO | 40 | 47 | 30.4 |
| GRU-MAPPO (Ours) | 85 | 86 | 23.5 |

Table 2: Performance comparison in the standard environment

Key findings demonstrate the advantages of our approach:

  1. 12.6% reduction in capture time compared to MAPPO (23.5 s vs. 26.9 s)
  2. 17% relative improvement in success rate in noisy environments (80% vs. 68%)
  3. Faster convergence with curriculum learning (\( 6 \times 10^5 \) vs. \( 6.9 \times 10^5 \) steps)
  4. Effective generalization to random and hybrid evasion strategies

Conclusion

This research advances drone technology through a novel GRU-MAPPO framework for multi-Unmanned Aerial Vehicle cooperative pursuit. By integrating gated recurrent units with multi-agent reinforcement learning, the framework effectively addresses partial observability in 3D environments. The curriculum learning linear regulator accelerates training convergence, while the hybrid reward structure balances exploration and exploitation. Experimental validation confirms 12.6% faster target capture and a 17% relatively higher success rate than baseline methods, demonstrating substantial improvements in cooperative autonomy for Unmanned Aerial Vehicles. Future work will extend the framework to dynamic obstacle environments with enhanced sensor models.
