Multi-Agent Dueling Double Q-Network for Cooperative Collision Avoidance in Low-Altitude Drone Swarms

Recent advancements in drone technology have accelerated the deployment of multi-Unmanned Aerial Vehicle (UAV) systems across urban surveillance, military reconnaissance, and logistics. These low-altitude operations face dynamic spatial constraints from buildings and unpredictable interactions between drones, creating significant collision risks. Traditional approaches—rule-based methods, optimization controllers, and reactive algorithms—struggle with scalability and real-time adaptability in dense environments. This work introduces a novel Multi-Agent Dueling Double Deep Q-Network (MAD3QN) framework to address these challenges through decentralized decision-making with shared intelligence.

Problem Formulation

Consider $N$ drones operating in a 2D airspace with static obstacles. Each UAV $i$ has position $\mathbf{P}_i(t)$, speed $v_i(t)$, and heading $\psi_i(t)$. The objective combines collision avoidance, target convergence, and kinematic feasibility:

$$
\begin{aligned}
\max_{\pi} & \quad \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t) \right] \\
\text{s.t.} & \quad \|\mathbf{P}_i(t) - \mathbf{P}_j(t)\| \geq R_c \quad \forall i \neq j, t \\
& \quad \|\mathbf{P}_i(t) - \mathbf{P}_o\| \geq R_c \quad \forall o \in \text{obstacles} \\
& \quad \|\mathbf{P}_i(T) - \mathbf{P}^g_i\| \leq R_g \\
& \quad |\Delta \psi_i(t)| \leq \Delta \psi_{\max}, |v_i(t)| \leq v_{\max}
\end{aligned}
$$

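where $R_c$ is the safety radius, $\mathbf{P}^g_i$ is the goal position, $R_g$ is the goal-arrival radius, $\Delta\psi_{\max}$ and $v_{\max}$ bound the per-step heading change and the speed, and $\gamma$ is the discount factor. This Markov game requires distributed policies that are robust to partial observability, since each UAV observes only its local surroundings.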
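The constraints above can be checked per time step with a few helper functions. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and obstacles are treated as points $\mathbf{P}_o$ exactly as the formulation writes them.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def separation_ok(positions, obstacles, r_c):
    """True if every UAV pair and every UAV-obstacle pair is at least r_c apart."""
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if dist(positions[i], positions[j]) < r_c:
                return False
        for obs in obstacles:  # assumption: obstacles are point-like
            if dist(positions[i], obs) < r_c:
                return False
    return True

def kinematics_ok(delta_psi, v, delta_psi_max, v_max):
    """True if the heading change and speed respect their bounds."""
    return abs(delta_psi) <= delta_psi_max and abs(v) <= v_max

def goal_reached(position, goal, r_g):
    """True if the UAV is within the goal-arrival radius r_g."""
    return dist(position, goal) <= r_g
```

For example, with $R_c = 15$ m two drones 20 m apart satisfy the separation constraint, while two drones 10 m apart violate it.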

Perception and Action Representation

Each UAV uses a sector-based observation model (Figure 2) with $K$ detection zones. The observation vector $\mathcal{O}_i$ combines the egocentric state with the threats detected in each sector:

$$
\mathcal{O}_i = \left\{ \underbrace{v_i, \psi_i, D^g_i, \phi^g_i}_{\text{Ego State}}; \bigcup_{k=1}^K \underbrace{ \left\{ v^{\text{int}}_k, \psi^{\text{int}}_k, D^{\text{int}}_k, \phi^{\text{int}}_k, D^{\text{obs}}_k, \phi^{\text{obs}}_k \right\}}_{\text{Sector } k} \right\}
$$

The observation length stays fixed at $4 + 6K$ because unoccupied sectors are zero-padded. The discrete action space keeps per-step decisions cheap enough for real-time use:

$$
\mathcal{A} = \left\{ (\Delta\psi, \Delta v) \mid \Delta\psi \in \{ -\delta_\psi, 0, +\delta_\psi \}, \Delta v \in \{ -\delta_v, 0, +\delta_v \} \right\}
$$
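The fixed-length observation and the nine-element action set can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based sector input, the clamping of speed to $[0, v_{\max}]$, and the use of degrees with wrap-around for the heading are assumptions not stated in the source.

```python
import itertools

K = 8                # number of sectors (value from the experiment parameters)
SECTOR_FEATURES = 6  # v_int, psi_int, D_int, phi_int, D_obs, phi_obs

def build_observation(ego, sectors, k=K):
    """Flatten the ego state plus k sector slots into a list of length 4 + 6k.

    ego: (v, psi, D_g, phi_g)
    sectors: dict mapping sector index -> 6-tuple of sector features;
             sectors missing from the dict are zero-padded.
    """
    obs = list(ego)
    for idx in range(k):
        obs.extend(sectors.get(idx, (0.0,) * SECTOR_FEATURES))
    return obs

def build_actions(delta_psi, delta_v):
    """Enumerate all (heading change, speed change) pairs: 3 x 3 = 9 actions."""
    return list(itertools.product((-delta_psi, 0.0, delta_psi),
                                  (-delta_v, 0.0, delta_v)))

def apply_action(v, psi, action, v_max):
    """Apply one action; speed clamp and heading wrap are assumptions."""
    dpsi, dv = action
    new_v = min(max(v + dv, 0.0), v_max)
    new_psi = (psi + dpsi) % 360.0  # heading in degrees
    return new_v, new_psi
```

With $K = 8$ the observation has 52 entries, and the action set has 9 entries regardless of the step sizes $\delta_\psi$ and $\delta_v$.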

Multi-Factor Reward Design

The composite reward $r_i(t)$ for each UAV combines five components:

Component	Formula	Purpose
Goal Reward	$ r_g = \begin{cases} C_g & \|\mathbf{P}_i - \mathbf{P}^g_i\| \leq R_g \\ 0 & \text{otherwise} \end{cases} $	Target achievement
Distance Incentive	$ r_d = \frac{ D^g_i(t-1) - D^g_i(t) }{ v_{\max} } $	Path efficiency
Heading Guidance	$ r_h = C_h \cos(\phi^g_i) $	Directional alignment
Collision Penalty	$ r_c = \begin{cases} C_{\text{uav}} & \text{UAV conflict} \\ C_{\text{obs}} & \text{Obstacle strike} \\ 0 & \text{safe} \end{cases} $	Safety enforcement
Smoothness Bonus	$ r_s = C_s \left( |\Delta\psi_t - \Delta\psi_{t-1}| + |\Delta v_t - \Delta v_{t-1}| \right) $	Kinematic stability

Total reward: $ r_i = r_g + r_d + r_h + r_c + r_s $. This structure balances mission objectives (goal arrival, progress, heading alignment) against safety and smooth control.
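The five components can be combined in a single function, sketched below. The coefficient values are placeholders chosen only for illustration, since the source does not list $C_g$, $C_h$, $C_{\text{uav}}$, $C_{\text{obs}}$, or $C_s$. The negative signs for the collision and smoothness coefficients, and the use of radians for $\phi^g_i$, are likewise assumptions.

```python
import math

# Placeholder coefficients (assumed; not given in the source).
C_G, C_H = 10.0, 0.1
C_UAV, C_OBS = -10.0, -10.0  # assumed negative so collisions are penalized
C_S = -0.01                  # assumed negative so abrupt changes are penalized

def composite_reward(d_prev, d_now, phi_g, r_g, v_max,
                     collision, dpsi, dpsi_prev, dv, dv_prev):
    """Sum of the five reward components for one UAV at one time step.

    collision: None, "uav", or "obstacle"
    phi_g: bearing to the goal relative to the heading, in radians
    """
    goal = C_G if d_now <= r_g else 0.0
    progress = (d_prev - d_now) / v_max
    heading = C_H * math.cos(phi_g)
    if collision == "uav":
        coll = C_UAV
    elif collision == "obstacle":
        coll = C_OBS
    else:
        coll = 0.0
    smooth = C_S * (abs(dpsi - dpsi_prev) + abs(dv - dv_prev))
    return goal + progress + heading + coll + smooth
```

Under these placeholder values, a drone that closes 25 m toward the goal at $v_{\max} = 25$ m/s while pointing straight at it, with no collision and no control change, receives $1.0 + 0.1 = 1.1$.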

MAD3QN Architecture

Our framework (Figure 3) uses parameter-shared dueling networks to address multi-agent non-stationarity. Each agent’s Q-function decomposes as:

$$
Q(\mathbf{s}, \mathbf{a}; \theta) = V(\mathbf{s}; \theta_V) + \left( A(\mathbf{s}, \mathbf{a}; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{\mathbf{a}'} A(\mathbf{s}, \mathbf{a}'; \theta_A) \right)
$$

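where $V$ estimates the state value and $A$ measures the advantage of each action. The Double Q-learning target reduces overestimation by selecting the next action with the online network ($\theta$) and evaluating it with the target network ($\theta^-$):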

$$
y^{\text{D3QN}} = r + \gamma Q\left( \mathbf{s}', \underset{\mathbf{a}'}{\arg\max}\, Q(\mathbf{s}', \mathbf{a}'; \theta); \theta^- \right)
$$
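The two formulas above can be illustrated numerically with plain Python lists. This is a minimal sketch with hypothetical function names. The `done` flag, which drops the bootstrap term at terminal states, is a common convention assumed here and does not appear in the source equation.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-advantage subtraction."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target.

    The online network's Q-values pick the next action; the target
    network's Q-values evaluate it.
    """
    if done:  # assumption: no bootstrapping from terminal states
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

For instance, with online values `[0.2, 0.9, 0.1]` and target values `[5.0, 4.0, 6.0]`, the online network selects action 1, so the target uses 4.0 instead of the target network's own maximum of 6.0. This decoupling is what curbs overestimation.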

The network processes the fixed-length observation $\mathcal{O}_i$ through two shared fully connected layers followed by separate value and advantage streams:

Layer	Nodes	Activation	Input Dim
Input	–	–	$4 + 6K$
FC1	128	ReLU	–
FC2	64	ReLU	–
Value Stream	1	Linear	–
Advantage Stream	$|\mathcal{A}|$	Linear	–

Execution is distributed: each agent acts on its local observation only. Training is centralized: all UAV agents share one set of network parameters and pool their experience, which stabilizes learning.
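One way to realize shared experience with decentralized action selection is a single replay buffer fed by every agent. The sketch below is illustrative only: the class name, `collect_step`, and the `env_step` callback (assumed to return one `(reward, next_obs, done)` tuple per agent) are hypothetical and not taken from the source.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """A single experience pool filled by all agents."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        """Uniformly sample up to batch_size transitions."""
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.buffer)

def collect_step(buffer, policy, observations, env_step):
    """Every agent applies the same policy to its own local observation;
    all resulting transitions are stored in the one shared buffer."""
    actions = [policy(o) for o in observations]
    results = env_step(actions)  # hypothetical environment callback
    for o, a, (r, o2, d) in zip(observations, actions, results):
        buffer.add(o, a, r, o2, d)
    return [o2 for _, o2, _ in results]
```

Because the policy is a single shared function, adding more drones changes only the number of transitions collected per step, not the number of parameters to train.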

Experimental Validation

Tests used a Python/Unity simulation environment with the following parameters:

Parameter	Value	Parameter	Value
Learning Rate	0.001	Safety Radius $R_c$	15 m
Discount $\gamma$	0.995	Max Speed $v_{\max}$	25 m/s
Batch Size	256	Heading Change $\delta_\psi$	15°
Sectors $K$	8	Speed Change $\delta_v$	3 m/s

Training convergence (Figure 5) is compared between MAD3QN and single-agent D3QN in multi-UAV environments, using the average discounted reward per agent:

$$
\text{Average Reward} = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t)
$$

After 5,000 episodes, MAD3QN's reward stabilized at ≈43.5, whereas D3QN reached only ≈20.5 and showed higher variance.
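The average-reward metric defined above can be computed from per-agent reward sequences as in the following sketch. The function names are hypothetical, and the numbers in the usage note are toy values, not experimental data.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def average_reward(per_agent_rewards, gamma):
    """Mean discounted return over all agents in one episode."""
    total = sum(discounted_return(r, gamma) for r in per_agent_rewards)
    return total / len(per_agent_rewards)
```

As a toy check with $\gamma = 0.5$, the sequences `[1, 1, 1]` and `[0, 0, 4]` have discounted returns 1.75 and 1.0, giving an average of 1.375.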

Performance Metrics

Monte Carlo tests (200 trials per scenario) evaluated success rate, path length, and flight time:

Drones ($N$)	Success Rate (%)	Avg. Distance (m)	Avg. Time (s)
2	99.5	804.20	52.7
3	99.5	806.52	55.9
4	99.0	811.53	56.3
5	98.5	815.79	57.9
6	98.0	819.83	58.6

In the comparative analysis ($N=6$, 4 obstacles), MAD3QN achieves the highest success rate together with the shortest distance and flight time:

Method	Success Rate (%)	Distance (m)	Time (s)
MAD3QN (Ours)	98.0	819.83	58.6
D3QN	83.5	832.63	62.3
MADQN	86.5	839.42	63.1
IDQN	79.5	847.56	69.2
Random	3.0	4932.8	436.2

Deployment in a geospatial Unity simulation (Figures 7–8) built from Chengdu city data indicates that the approach carries over to realistic urban layouts.

Conclusion

This work presents MAD3QN, a scalable approach to multi-UAV collision avoidance in constrained airspace. Key innovations include sector-based perception, multi-objective rewards, and dueling double Q-learning with parameter sharing. In simulation, the approach achieves at least 98% success with up to six drones while keeping path length and flight time below those of the compared baselines. Future work will extend to 3D conflict resolution and hardware-in-the-loop validation, advancing the safe integration of autonomous drones into urban air mobility.