Multi-Agent Dueling Double Q-Network for Cooperative Collision Avoidance in Low-Altitude Drone Swarms

Recent advances in drone technology have accelerated the deployment of multi-UAV (Unmanned Aerial Vehicle) systems in urban surveillance, military reconnaissance, and logistics. Low-altitude operations must contend with spatial constraints imposed by buildings and with unpredictable interactions between drones, both of which create significant collision risk. Traditional approaches (rule-based methods, optimization-based controllers, and reactive algorithms) struggle with scalability and real-time adaptability in dense environments. This work introduces a Multi-Agent Dueling Double Deep Q-Network (MAD3QN) framework that addresses these challenges through decentralized decision-making with shared policy parameters and experience.

Problem Formulation

Consider \(N\) drones operating in a 2D airspace with static obstacles. Each UAV \(i\) has position \(\mathbf{P}_i(t)\), speed \(v_i(t)\), and heading \(\psi_i(t)\). The objective combines collision avoidance, target convergence, and kinematic feasibility:

$$
\begin{aligned}
\max_{\pi} & \quad \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t) \right] \\
\text{s.t.} & \quad \|\mathbf{P}_i(t) - \mathbf{P}_j(t)\| \geq R_c \quad \forall i \neq j, t \\
& \quad \|\mathbf{P}_i(t) - \mathbf{P}_o\| \geq R_c \quad \forall o \in \text{obstacles} \\
& \quad \|\mathbf{P}_i(T) - \mathbf{P}^g_i\| \leq R_g \\
& \quad |\Delta \psi_i(t)| \leq \Delta \psi_{\max}, |v_i(t)| \leq v_{\max}
\end{aligned}
$$

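where \(R_c\) is the safety radius, \(\mathbf{P}^g_i\) is the goal position of UAV \(i\), \(R_g\) is the radius within which the goal counts as reached, and \(\gamma\) is the discount factor. This Markov game requires distributed policies that are robust to partial observability, since each drone perceives only its local surroundings.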

Perception and Action Representation

Each UAV uses a sector-based observation model (Figure 2) with \(K\) detection zones. The observation vector \(\mathcal{O}_i\) combines the egocentric state with sector-specific threats:

$$
\mathcal{O}_i = \left\{ \underbrace{v_i, \psi_i, D^g_i, \phi^g_i}_{\text{Ego State}}; \bigcup_{k=1}^K \underbrace{ \left\{ v^{\text{int}}_k, \psi^{\text{int}}_k, D^{\text{int}}_k, \phi^{\text{int}}_k, D^{\text{obs}}_k, \phi^{\text{obs}}_k \right\}}_{\text{Sector } k} \right\}
$$
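As a rough illustration of how such a vector could be assembled, the sketch below flattens the ego state and the per-sector entries into one list. The function name `build_observation`, the dictionary-based inputs, and the choice of one intruder and one obstacle entry per sector are assumptions made for this example, not details taken from the article.

```python
from typing import Dict, List, Tuple

# Layout assumed from the observation definition:
# 4 ego features followed by 6 features for each of the K sectors.
EGO_DIM = 4
SECTOR_DIM = 6

Intruder = Tuple[float, float, float, float]  # (speed, heading, distance, bearing)
Obstacle = Tuple[float, float]                # (distance, bearing)


def build_observation(
    ego: Tuple[float, float, float, float],
    intruders: Dict[int, Intruder],
    obstacles: Dict[int, Obstacle],
    num_sectors: int,
) -> List[float]:
    """Flatten ego state and per-sector threats into a fixed-length list.

    Sectors missing from `intruders` or `obstacles` are filled with zeros,
    so the result always has EGO_DIM + SECTOR_DIM * num_sectors entries.
    """
    obs = list(ego)
    for k in range(num_sectors):
        obs.extend(intruders.get(k, (0.0, 0.0, 0.0, 0.0)))
        obs.extend(obstacles.get(k, (0.0, 0.0)))
    return obs


K = 8
obs = build_observation(
    ego=(20.0, 0.5, 350.0, 0.1),             # v_i, psi_i, D^g_i, phi^g_i
    intruders={2: (18.0, -1.2, 60.0, 0.8)},  # one intruder seen in sector 2
    obstacles={5: (40.0, -2.0)},             # one obstacle seen in sector 5
    num_sectors=K,
)
print(len(obs))  # 4 + 6 * K
```

Keeping the length independent of how many neighbors are visible is what lets a single fixed-input network serve every agent.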

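Unoccupied sectors are zero-padded, so the vector length stays fixed at \(4 + 6K\) regardless of how many neighbors or obstacles are in view. The discrete action space keeps per-step decisions cheap enough for real-time use: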

$$
\mathcal{A} = \left\{ (\Delta\psi, \Delta v) \mid \Delta\psi \in \{ -\delta_\psi, 0, +\delta_\psi \}, \Delta v \in \{ -\delta_v, 0, +\delta_v \} \right\}
$$
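The action set is the Cartesian product of three heading changes and three speed changes, giving nine actions. A minimal sketch follows, using the step sizes and speed limit reported in the experimental parameters; clipping the speed to \([0, v_{\max}]\) and wrapping the heading to \([-\pi, \pi)\) are assumptions of this example, as is the helper name `apply_action`.

```python
import itertools
import math

DELTA_PSI = math.radians(15.0)  # heading step (15 degrees)
DELTA_V = 3.0                   # speed step in m/s
V_MAX = 25.0                    # maximum speed in m/s

# All (heading change, speed change) pairs: 3 x 3 = 9 discrete actions.
ACTIONS = list(
    itertools.product((-DELTA_PSI, 0.0, DELTA_PSI), (-DELTA_V, 0.0, DELTA_V))
)


def apply_action(psi, v, action_index):
    """Return the new (heading, speed) after applying one discrete action.

    Assumptions of this sketch: heading is wrapped to [-pi, pi) and
    speed is clipped to [0, V_MAX].
    """
    d_psi, d_v = ACTIONS[action_index]
    new_psi = (psi + d_psi + math.pi) % (2.0 * math.pi) - math.pi
    new_v = min(max(v + d_v, 0.0), V_MAX)
    return new_psi, new_v


print(len(ACTIONS))
print(apply_action(0.0, 24.0, 8))  # turn by +delta_psi and accelerate, speed capped at V_MAX
```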

Multi-Factor Reward Design

The composite reward \(r_i(t)\) for each UAV combines five components:

| Component | Formula | Purpose |
| --- | --- | --- |
| Goal Reward | \( r_g = \begin{cases} C_g & \lVert \mathbf{P}_i - \mathbf{P}^g_i \rVert \leq R_g \\ 0 & \text{otherwise} \end{cases} \) | Target achievement |
| Distance Incentive | \( r_d = \frac{ D^g_i(t-1) - D^g_i(t) }{ v_{\max} } \) | Path efficiency |
| Heading Guidance | \( r_h = C_h \cos(\phi^g_i) \) | Directional alignment |
| Collision Penalty | \( r_c = \begin{cases} C_{\text{uav}} & \text{UAV conflict} \\ C_{\text{obs}} & \text{Obstacle strike} \\ 0 & \text{safe} \end{cases} \) | Safety enforcement |
| Smoothness Bonus | \( r_s = C_s \left( \lvert \Delta\psi_t - \Delta\psi_{t-1} \rvert + \lvert \Delta v_t - \Delta v_{t-1} \rvert \right) \) | Kinematic stability |

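Total reward: \( r_i = r_g + r_d + r_h + r_c + r_s \). The coefficients \(C_{\text{uav}}\), \(C_{\text{obs}}\), and \(C_s\) are presumably negative, so that conflicts and abrupt changes between successive commands lower the reward. This structure balances mission objectives against safety constraints.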
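The five terms can be combined in a few lines. In the sketch below every coefficient value and the goal radius are hypothetical placeholders, since the article does not report them; the penalty and smoothness coefficients are taken as negative so that collisions and jerky commands reduce the reward.

```python
import math

# Hypothetical coefficients: the article does not report these values.
C_G = 10.0     # goal reward
C_H = 0.1      # heading guidance weight
C_UAV = -10.0  # UAV conflict penalty (assumed negative)
C_OBS = -10.0  # obstacle strike penalty (assumed negative)
C_S = -0.05    # smoothness weight (assumed negative)
R_G = 10.0     # goal radius in metres (hypothetical)
V_MAX = 25.0   # maximum speed in m/s, from the reported parameters


def step_reward(d_goal_prev, d_goal, phi_goal, uav_conflict, obstacle_strike,
                d_psi, d_psi_prev, d_v, d_v_prev):
    """Sum the five reward terms for one UAV at one time step."""
    r_g = C_G if d_goal <= R_G else 0.0
    r_d = (d_goal_prev - d_goal) / V_MAX  # progress towards the goal
    r_h = C_H * math.cos(phi_goal)        # 1 when pointing at the goal
    if uav_conflict:
        r_c = C_UAV
    elif obstacle_strike:
        r_c = C_OBS
    else:
        r_c = 0.0
    r_s = C_S * (abs(d_psi - d_psi_prev) + abs(d_v - d_v_prev))
    return r_g + r_d + r_h + r_c + r_s


# Flying straight at the goal at full speed with unchanged commands.
print(step_reward(100.0, 75.0, 0.0, False, False, 0.0, 0.0, 0.0, 0.0))
```

Dividing the distance term by \(v_{\max}\) scales it by the maximum speed, which keeps it on a scale comparable to the other shaping terms.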

MAD3QN Architecture

Our framework (Figure 3) uses parameter-shared dueling networks to address multi-agent non-stationarity. Each agent’s Q-function decomposes as:

$$
Q(\mathbf{s}, \mathbf{a}; \theta) = V(\mathbf{s}; \theta_V) + \left( A(\mathbf{s}, \mathbf{a}; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{\mathbf{a}'} A(\mathbf{s}, \mathbf{a}'; \theta_A) \right)
$$

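where \(V\) estimates the state value and \(A\) measures the advantage of each action. The Double Q-learning target reduces the overestimation bias of standard Q-learning by selecting the next action with the online network \(\theta\) and evaluating it with the target network \(\theta^-\):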

$$
y^{\text{D3QN}} = r + \gamma Q\left( \mathbf{s}', \underset{\mathbf{a}'}{\arg\max} Q(\mathbf{s}', \mathbf{a}'; \theta); \theta^- \right)
$$
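Both formulas are short enough to write out directly. The following sketch works on plain Python lists rather than network outputs; the function names and the `done` flag for terminal transitions are assumptions of this example.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value and per-action advantages into Q-values.

    Subtracting the mean advantage makes the split between V and A
    identifiable: adding a constant to every advantage leaves Q unchanged.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_q_target(reward, gamma, q_next_online, q_next_target, done=False):
    """Double Q-learning target for one transition.

    The online network's Q-values choose the next action; the target
    network's Q-values supply the estimate for that action.
    """
    if done:  # assumed convention: no bootstrapping from terminal states
        return reward
    best = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[best]


print(dueling_q(2.0, [1.0, 2.0, 3.0]))
print(double_q_target(1.0, 0.5, [0.1, 0.9, 0.3], [4.0, 2.0, 8.0]))
```

In the second call the online values pick action 1, so the target uses the target network's estimate for that action (2.0), not its own maximum (8.0). That decoupling is what dampens overestimation.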

The network processes the fixed-length observation \(\mathcal{O}_i\) through the following layers:

| Layer | Nodes | Activation | Input Dim |
| --- | --- | --- | --- |
| Input | | | \(4 + 6K\) |
| FC1 | 128 | ReLU | |
| FC2 | 64 | ReLU | |
| Value Stream | 1 | Linear | |
| Advantage Stream | \(\lvert\mathcal{A}\rvert\) | Linear | |
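To get a feel for the model size, the layer widths above can be turned into a parameter count. This sketch assumes that both streams are single linear layers fed directly by FC2 and that every layer has a bias term; neither detail is stated in the article.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


K = 8                  # sectors, from the reported parameters
NUM_ACTIONS = 3 * 3    # |A|: three heading changes x three speed changes
INPUT_DIM = 4 + 6 * K  # ego features plus six features per sector

# Assumed wiring: Input -> FC1 -> FC2 -> {value head, advantage head}.
layer_params = {
    "FC1": dense_params(INPUT_DIM, 128),
    "FC2": dense_params(128, 64),
    "Value Stream": dense_params(64, 1),
    "Advantage Stream": dense_params(64, NUM_ACTIONS),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:>16}: {count}")
print(f"{'Total':>16}: {total_params}")
```

Under these assumptions the network has roughly sixteen thousand parameters, small enough for fast per-agent inference.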

At execution time each UAV acts on its own local observation, while centralized training stabilizes learning by pooling the experience of all agents to update the shared network.
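One common way to realize this kind of experience sharing is a single replay buffer that every agent writes to and from which training batches for the shared network are drawn. The class below is a minimal sketch of that idea; its name, the tuple layout, and uniform sampling are assumptions, not a description of the authors' implementation.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """A single FIFO buffer that pools transitions from every agent."""

    def __init__(self, capacity):
        # Oldest transitions are dropped automatically once capacity is reached.
        self._storage = deque(maxlen=capacity)

    def add(self, agent_id, obs, action, reward, next_obs, done):
        self._storage.append((agent_id, obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random batch without replacement."""
        return rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = SharedReplayBuffer(capacity=10_000)
for step in range(100):
    for agent_id in range(3):  # three agents feed the same buffer
        buffer.add(agent_id, [0.0] * 52, step % 9, 0.0, [0.0] * 52, False)

batch = buffer.sample(32, rng=random.Random(0))
print(len(buffer), len(batch))
```

Because all agents share one set of weights, a batch drawn this way mixes situations encountered by different drones.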

Experimental Validation

Tests used a Python/Unity simulation environment with parameters:

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| Learning Rate | 0.001 | Safety Radius \(R_c\) | 15 m |
| Discount \(\gamma\) | 0.995 | Max Speed \(v_{\max}\) | 25 m/s |
| Batch Size | 256 | Heading Change \(\delta_\psi\) | 15° |
| Sectors \(K\) | 8 | Speed Change \(\delta_v\) | 3 m/s |

Training convergence (Figure 5) shows MAD3QN outperforming single-agent D3QN in multi-UAV environments, measured by the average discounted return per agent:

$$
\text{Average Reward} = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t)
$$

After 5,000 episodes, MAD3QN settled at a stable average reward of about 43.5, whereas D3QN reached only about 20.5 and showed higher variance.
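The average-reward metric is the mean over agents of each agent's discounted return. A small sketch with made-up reward sequences (the numbers are illustrative only and unrelated to the reported results):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one agent's episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def average_reward(per_agent_rewards, gamma):
    """Mean discounted return across all agents in one episode."""
    returns = [discounted_return(r, gamma) for r in per_agent_rewards]
    return sum(returns) / len(returns)


# Two agents, three steps each; gamma = 0.5 keeps the arithmetic easy to follow.
episode = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
print(average_reward(episode, gamma=0.5))
```

With the \(\gamma = 0.995\) used in the experiments, a reward received 100 steps later is still weighted by about 0.61, so late goal rewards influence the metric strongly.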

Performance Metrics

Monte Carlo tests (200 trials per scenario) evaluated success rate, path length, and flight time:

| Drones (\(N\)) | Success Rate (%) | Avg. Distance (m) | Avg. Time (s) |
| --- | --- | --- | --- |
| 2 | 99.5 | 804.20 | 52.7 |
| 3 | 99.5 | 806.52 | 55.9 |
| 4 | 99.0 | 811.53 | 56.3 |
| 5 | 98.5 | 815.79 | 57.9 |
| 6 | 98.0 | 819.83 | 58.6 |
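With 200 trials per scenario, each trial accounts for 0.5 percentage points, so the reported rates correspond to whole numbers of successful runs. The sketch below recovers those counts and adds a binomial standard error as a rough indication of sampling noise; treating the trials as independent and using the normal-approximation formula are assumptions of this example, not part of the reported analysis.

```python
import math

TRIALS = 200


def success_rate_standard_error(rate_percent, trials):
    """Binomial standard error of a success rate, in percentage points."""
    p = rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / trials)


reported = [(2, 99.5), (3, 99.5), (4, 99.0), (5, 98.5), (6, 98.0)]
for n_drones, rate in reported:
    successes = round(rate / 100.0 * TRIALS)
    se = success_rate_standard_error(rate, TRIALS)
    print(f"N={n_drones}: {successes}/{TRIALS} successes, SE ~ {se:.2f} pp")
```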

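A comparison at \(N=6\) with 4 obstacles shows MAD3QN achieving the highest success rate, the shortest average distance, and the shortest average time among the tested methods: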

| Method | Success Rate (%) | Distance (m) | Time (s) |
| --- | --- | --- | --- |
| MAD3QN (Ours) | 98.0 | 819.83 | 58.6 |
| D3QN | 83.5 | 832.63 | 62.3 |
| MADQN | 86.5 | 839.42 | 63.1 |
| IDQN | 79.5 | 847.56 | 69.2 |
| Random | 3.0 | 4932.8 | 436.2 |

Deployment in a geospatial Unity simulation (Figures 7–8) built from Chengdu city data indicates that the approach carries over to realistic urban geometry.

Conclusion

This work presents MAD3QN, a scalable approach to multi-UAV collision avoidance in constrained airspace. Its key components are sector-based perception, a multi-objective reward, and dueling double Q-learning with parameter sharing. In simulation, the approach achieves a success rate of at least 98% with up to six drones while keeping path length close to that of the smallest scenario. Future work will extend the method to 3D conflict resolution and hardware-in-the-loop validation, with the aim of safely integrating autonomous drones into urban air mobility.
