Recent advances in drone technology have accelerated the deployment of multi-UAV (unmanned aerial vehicle) systems for urban surveillance, military reconnaissance, and logistics. These low-altitude operations face spatial constraints from buildings and unpredictable interactions between drones, both of which create significant collision risks. Traditional approaches (rule-based methods, optimization-based controllers, and reactive algorithms) struggle with scalability and real-time adaptability in dense environments. This work introduces a novel Multi-Agent Dueling Double Deep Q-Network (MAD3QN) framework that addresses these challenges through decentralized decision-making, with network parameters and experience shared across agents during training.

Problem Formulation
Consider \(N\) drones operating in a 2D airspace with static obstacles. Each UAV \(i\) has position \(\mathbf{P}_i(t)\), velocity \(v_i(t)\), and heading \(\psi_i(t)\). The objective combines collision avoidance, target convergence, and kinematic feasibility:
$$
\begin{aligned}
\max_{\pi} & \quad \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t) \right] \\
\text{s.t.} & \quad \|\mathbf{P}_i(t) - \mathbf{P}_j(t)\| \geq R_c \quad \forall i \neq j, t \\
& \quad \|\mathbf{P}_i(t) - \mathbf{P}_o\| \geq R_c \quad \forall o \in \text{obstacles} \\
& \quad \|\mathbf{P}_i(T) - \mathbf{P}^g_i\| \leq R_g \\
& \quad |\Delta \psi_i(t)| \leq \Delta \psi_{\max}, |v_i(t)| \leq v_{\max}
\end{aligned}
$$
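where \(R_c\) is the safety radius, \(\mathbf{P}^g_i\) is the goal position, \(R_g\) is the goal-arrival radius, \(\Delta \psi_{\max}\) and \(v_{\max}\) bound the heading change and speed, and \(\gamma\) is the discount factor. Because each UAV observes only its local surroundings, the problem is a Markov game under partial observability, which calls for distributed policies.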
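The separation constraints above can be checked per time step with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `violated_constraints` is hypothetical, and obstacles are treated as points, which is a simplifying assumption.

```python
import math
from itertools import combinations


def violated_constraints(positions, obstacles, r_c):
    """Return the separation violations for one time step.

    positions: list of (x, y) UAV positions.
    obstacles: list of (x, y) obstacle positions (point obstacles are an
               assumption of this sketch).
    r_c:       safety radius R_c.
    """
    violations = []
    # UAV-to-UAV separation: ||P_i - P_j|| >= R_c for all i != j
    for i, j in combinations(range(len(positions)), 2):
        if math.dist(positions[i], positions[j]) < r_c:
            violations.append(("uav", i, j))
    # UAV-to-obstacle separation: ||P_i - P_o|| >= R_c for all obstacles o
    for i, p in enumerate(positions):
        for o, q in enumerate(obstacles):
            if math.dist(p, q) < r_c:
                violations.append(("obstacle", i, o))
    return violations
```

An empty list means the step satisfies both separation constraints; the goal and kinematic constraints would need separate checks.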

Perception and Action Representation
Each UAV uses a sector-based observation model (Figure 2) with \(K\) detection zones. The observation vector \(\mathcal{O}_i\) integrates egocentric state and sector-specific threats:
$$
\mathcal{O}_i = \left\{ \underbrace{v_i, \psi_i, D^g_i, \phi^g_i}_{\text{Ego State}}; \bigcup_{k=1}^K \underbrace{ \left\{ v^{\text{int}}_k, \psi^{\text{int}}_k, D^{\text{int}}_k, \phi^{\text{int}}_k, D^{\text{obs}}_k, \phi^{\text{obs}}_k \right\}}_{\text{Sector } k} \right\}
$$
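One way to assemble such a fixed-length vector is sketched below. Several details are assumptions made for illustration and are not stated in the text: sectors are taken to be equal angular slices around the UAV, only the closest intruder and closest obstacle in each sector are kept, and the helper names are hypothetical. Empty slots are left at zero.

```python
import math


def sector_index(bearing, k):
    """Map a relative bearing (radians) to one of k equal sectors (assumed layout)."""
    width = 2 * math.pi / k
    # min() guards against floating-point round-up at the 2*pi boundary
    return min(int((bearing % (2 * math.pi)) // width), k - 1)


def build_observation(ego, intruders, obstacles, k):
    """Build a flat observation of length 4 + 6*k.

    ego:       (v, psi, d_goal, phi_goal)
    intruders: list of (v, psi, dist, bearing)
    obstacles: list of (dist, bearing)
    """
    sectors = [[0.0] * 6 for _ in range(k)]  # zero-padded by default
    nearest_int = [math.inf] * k
    nearest_obs = [math.inf] * k
    for v, psi, dist, bearing in intruders:
        s = sector_index(bearing, k)
        if dist < nearest_int[s]:  # keep only the closest intruder (assumption)
            nearest_int[s] = dist
            sectors[s][0:4] = [v, psi, dist, bearing]
    for dist, bearing in obstacles:
        s = sector_index(bearing, k)
        if dist < nearest_obs[s]:  # keep only the closest obstacle (assumption)
            nearest_obs[s] = dist
            sectors[s][4:6] = [dist, bearing]
    return list(ego) + [x for sector in sectors for x in sector]
```

With \(K = 8\) the vector has \(4 + 6 \cdot 8 = 52\) entries regardless of how many intruders or obstacles are in range.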
Dimensions remain fixed via zero-padding for unoccupied sectors. The discrete action space enables real-time decisions:
$$
\mathcal{A} = \left\{ (\Delta\psi, \Delta v) \mid \Delta\psi \in \{ -\delta_\psi, 0, +\delta_\psi \}, \Delta v \in \{ -\delta_v, 0, +\delta_v \} \right\}
$$
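The action set is the Cartesian product of three heading changes and three speed changes, giving nine discrete actions. The sketch below enumerates them and applies one to a UAV state. The function names are hypothetical, and clamping the speed to \([0, v_{\max}]\) is an assumption (the text only states \(|v_i(t)| \leq v_{\max}\)).

```python
import math
from itertools import product


def build_actions(delta_psi, delta_v):
    """Enumerate all (heading change, speed change) pairs."""
    return list(product((-delta_psi, 0.0, delta_psi), (-delta_v, 0.0, delta_v)))


def apply_action(v, psi, action, v_max):
    """Apply one discrete action and keep the state kinematically feasible."""
    dpsi, dv = action
    new_v = min(max(v + dv, 0.0), v_max)      # lower bound of 0 is an assumption
    new_psi = (psi + dpsi) % (2 * math.pi)    # wrap heading into [0, 2*pi)
    return new_v, new_psi
```

For example, `build_actions(math.radians(15), 3.0)` yields nine actions, so each heading change automatically respects \(\Delta\psi_{\max}\) as long as \(\delta_\psi \leq \Delta\psi_{\max}\).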

Multi-Factor Reward Design
The composite reward \(r_i(t)\) for each UAV combines five components:
| Component | Formula | Purpose |
|---|---|---|
| Goal Reward | \( r_g = \begin{cases} C_g & \lVert \mathbf{P}_i - \mathbf{P}^g_i \rVert \leq R_g \\ 0 & \text{otherwise} \end{cases} \) | Target achievement |
| Distance Incentive | \( r_d = \frac{ D^g_i(t-1) - D^g_i(t) }{ v_{\max} } \) | Path efficiency |
| Heading Guidance | \( r_h = C_h \cos(\phi^g_i) \) | Directional alignment |
| Collision Penalty | \( r_c = \begin{cases} C_{\text{uav}} & \text{UAV conflict} \\ C_{\text{obs}} & \text{Obstacle strike} \\ 0 & \text{safe} \end{cases} \) | Safety enforcement |
| Smoothness Bonus | \( r_s = C_s \left( \lvert \Delta\psi_t - \Delta\psi_{t-1} \rvert + \lvert \Delta v_t - \Delta v_{t-1} \rvert \right) \) | Kinematic stability |
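Total reward: \( r_i = r_g + r_d + r_h + r_c + r_s \). Because \(r_s\) scales with the change between consecutive actions, it encourages smooth control only if \(C_s\) is negative; likewise, \(C_{\text{uav}}\) and \(C_{\text{obs}}\) must be negative to act as penalties. This structure balances mission objectives against safety constraints.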
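The five terms can be combined in a single function, sketched below under stated assumptions. The function name and the dictionary keys are hypothetical, the coefficient values in the usage note are illustrative only (the text does not report them), and a UAV conflict is given precedence over an obstacle strike when both occur.

```python
import math


def composite_reward(d_prev, d_now, phi_goal, dpsi, dpsi_prev, dv, dv_prev,
                     uav_conflict, obstacle_strike, c):
    """Sum the five reward terms for one UAV at one time step.

    c is a dict of coefficients: C_g, R_g, v_max, C_h, C_uav, C_obs, C_s.
    """
    r_g = c["C_g"] if d_now <= c["R_g"] else 0.0            # goal reward
    r_d = (d_prev - d_now) / c["v_max"]                     # distance incentive
    r_h = c["C_h"] * math.cos(phi_goal)                     # heading guidance
    if uav_conflict:                                        # collision penalty
        r_c = c["C_uav"]
    elif obstacle_strike:
        r_c = c["C_obs"]
    else:
        r_c = 0.0
    r_s = c["C_s"] * (abs(dpsi - dpsi_prev) + abs(dv - dv_prev))  # smoothness
    return r_g + r_d + r_h + r_c + r_s
```

With illustrative coefficients `{"C_g": 10, "R_g": 20, "v_max": 25, "C_h": 0.1, "C_uav": -10, "C_obs": -10, "C_s": -0.01}`, a UAV that closes 25 m on its goal while heading straight at it and changing speed by 3 m/s receives \(1.0 + 0.1 - 0.03 = 1.07\).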

MAD3QN Architecture
Our framework (Figure 3) uses parameter-shared dueling networks to address multi-agent non-stationarity. Each agent’s Q-function decomposes as:
$$
Q(\mathbf{s}, \mathbf{a}; \theta) = V(\mathbf{s}; \theta_V) + \left( A(\mathbf{s}, \mathbf{a}; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{\mathbf{a}'} A(\mathbf{s}, \mathbf{a}'; \theta_A) \right)
$$
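where \(V\) estimates the state value and \(A\) measures the advantage of each action. The Double Q-learning target reduces overestimation bias by selecting the next action with the online parameters \(\theta\) and evaluating it with the target parameters \(\theta^-\):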
$$
y^{\text{D3QN}} = r + \gamma Q\left( \mathbf{s}', \underset{\mathbf{a}'}{\arg\max} Q(\mathbf{s}', \mathbf{a}'; \theta); \theta^- \right)
$$
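The target can be computed from two lists of next-state Q-values, one from the online network and one from the target network. The following is a minimal sketch; the function name is hypothetical, and the `done` flag for terminal transitions is an assumption that the equation above does not show.

```python
def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target for a single transition.

    q_online_next: Q(s', a; theta) for every action a (online network).
    q_target_next: Q(s', a; theta^-) for every action a (target network).
    """
    if done:
        # No bootstrapping from a terminal state (assumption of this sketch)
        return reward
    # The online network selects the action ...
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... and the target network evaluates it
    return reward + gamma * q_target_next[best]
```

For example, `double_q_target(1.0, 0.5, [0.1, 0.9, 0.3], [2.0, 4.0, 10.0])` returns `3.0`, whereas taking the maximum of the target network's own values would give `6.0`; this gap is the overestimation that the decoupling is meant to limit.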
The network processes the fixed-length observation \(\mathcal{O}_i\) with two fully connected layers followed by separate value and advantage streams:
| Layer | Units | Activation |
|---|---|---|
| Input | \(4 + 6K\) | – |
| FC1 | 128 | ReLU |
| FC2 | 64 | ReLU |
| Value Stream | 1 | Linear |
| Advantage Stream | \(\lvert\mathcal{A}\rvert\) | Linear |
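A forward pass through this architecture is sketched below in plain Python so that the example stays dependency-free. The class and helper names are hypothetical, the random weight initialization is an arbitrary choice, and training is omitted; a practical implementation would use a deep learning framework.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (initialization scheme is an assumption)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dueling_q(value, advantages):
    """Combine V and A with the mean-advantage baseline."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


class DuelingQNet:
    """Input (4 + 6K) -> FC1 (128) -> FC2 (64) -> value (1) and advantage (|A|)."""

    def __init__(self, k=8, n_actions=9, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_layer(4 + 6 * k, 128, rng)
        self.fc2 = init_layer(128, 64, rng)
        self.value = init_layer(64, 1, rng)
        self.advantage = init_layer(64, n_actions, rng)

    def q_values(self, obs):
        h = relu(linear(obs, *self.fc1))
        h = relu(linear(h, *self.fc2))
        v = linear(h, *self.value)[0]
        adv = linear(h, *self.advantage)
        return dueling_q(v, adv)
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the returned Q-values equals the value-stream output.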
During execution, each UAV acts on its local observation alone. During training, all UAV agents share one set of network parameters and pool their experience, which stabilizes learning.
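Experience sharing can be realized with a single replay buffer that every agent writes to and that one shared network samples from. The sketch below is one plausible arrangement, not the authors' code; the class name, the transition layout, and uniform sampling are all assumptions.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """One buffer for all agents; transitions are not tagged by agent
    because every agent uses the same network parameters."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        """Uniformly sample a batch (smaller if the buffer is not yet full)."""
        n = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), n)

    def __len__(self):
        return len(self.buffer)
```

In a training loop, each of the \(N\) agents would call `add` after every step, and a single gradient update would draw one batch from the pooled data.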

Experimental Validation
Tests used a Python/Unity simulation environment with parameters:
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Learning Rate | 0.001 | Safety Radius \(R_c\) | 15 m |
| Discount \(\gamma\) | 0.995 | Max Speed \(v_{\max}\) | 25 m/s |
| Batch Size | 256 | Heading Change \(\delta_\psi\) | 15° |
| Sectors \(K\) | 8 | Speed Change \(\delta_v\) | 3 m/s |
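These settings can be collected in one configuration object, from which the network's input and output sizes follow. The sketch below is illustrative; the class and field names are hypothetical, and only the values are taken from the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 0.001
    gamma: float = 0.995                         # discount factor
    batch_size: int = 256
    sectors: int = 8                             # K
    safety_radius_m: float = 15.0                # R_c
    v_max_mps: float = 25.0                      # v_max
    delta_psi_rad: float = math.radians(15.0)    # heading change step
    delta_v_mps: float = 3.0                     # speed change step

    @property
    def obs_dim(self) -> int:
        """Observation length: 4 ego features plus 6 per sector."""
        return 4 + 6 * self.sectors

    @property
    def n_actions(self) -> int:
        """Three heading changes times three speed changes."""
        return 3 * 3
```

With \(K = 8\), the input layer has 52 units and the advantage stream has 9 outputs.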
Figure 5 compares the training convergence of MAD3QN with that of single-agent D3QN in multi-UAV environments, measured by the average discounted reward:
$$
\text{Average Reward} = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \gamma^t r_i(t)
$$
After 5,000 episodes, MAD3QN's reward stabilized at ≈43.5, whereas D3QN reached only ≈20.5 and showed higher variance.
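The metric is the discounted return of each agent, averaged over the \(N\) agents. A minimal sketch follows; the function name and the nested-list layout of the rewards are assumptions.

```python
def average_discounted_return(rewards, gamma):
    """rewards[i][t] holds r_i(t) for agent i at step t."""
    total = 0.0
    for agent_rewards in rewards:
        # Discounted return of one agent: sum over t of gamma^t * r_i(t)
        total += sum((gamma ** t) * r for t, r in enumerate(agent_rewards))
    return total / len(rewards)
```

For two agents with rewards `[[1.0, 1.0], [2.0, 0.0]]` and `gamma = 0.5`, the returns are 1.5 and 2.0, so the average is 1.75.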

Performance Metrics
Monte Carlo tests (200 trials per scenario) evaluated success rate, path length, and flight time:
| Drones (\(N\)) | Success Rate (%) | Avg. Distance (m) | Avg. Time (s) |
|---|---|---|---|
| 2 | 99.5 | 804.20 | 52.7 |
| 3 | 99.5 | 806.52 | 55.9 |
| 4 | 99.0 | 811.53 | 56.3 |
| 5 | 98.5 | 815.79 | 57.9 |
| 6 | 98.0 | 819.83 | 58.6 |
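With 200 trials per scenario, each trial accounts for 0.5 percentage points, so the reported rates correspond to whole numbers of successful runs. The sketch below recovers those counts and attaches a binomial standard error; the function name is hypothetical, and treating trials as independent is an assumption that the text does not state.

```python
import math


def success_summary(rate_percent, trials=200):
    """Convert a reported success rate into a success count and a
    binomial standard error (in percentage points)."""
    successes = round(rate_percent / 100 * trials)
    p = successes / trials
    std_err = math.sqrt(p * (1 - p) / trials)
    return successes, 100 * std_err
```

For \(N = 6\), `success_summary(98.0)` gives 196 successes out of 200 with a standard error of about one percentage point.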
In a comparison with \(N=6\) drones and 4 obstacles, MAD3QN achieves the highest success rate together with the shortest average distance and flight time:
| Method | Success Rate (%) | Distance (m) | Time (s) |
|---|---|---|---|
| MAD3QN (Ours) | 98.0 | 819.83 | 58.6 |
| D3QN | 83.5 | 832.63 | 62.3 |
| MADQN | 86.5 | 839.42 | 63.1 |
| IDQN | 79.5 | 847.56 | 69.2 |
| Random | 3.0 | 4932.8 | 436.2 |
Deployment in a geospatial Unity simulation (Figures 7–8) built from Chengdu city data indicates that the approach carries over to realistic urban layouts.

Conclusion
This work presents MAD3QN, a scalable approach to multi-UAV collision avoidance in constrained airspace. Key contributions include sector-based perception, a multi-objective reward, and dueling double Q-learning with parameter sharing. In simulation, the approach achieves success rates of at least 98% with up to six UAVs while keeping path length and flight time below those of the compared baselines. Future work will extend the method to 3D conflict resolution and hardware-in-the-loop validation, supporting the safe integration of autonomous drones into urban air mobility.
