The application of Unmanned Aerial Vehicles (UAVs), or drones, has expanded dramatically across various sectors. In complex tasks such as large-scale infrastructure inspection, surveillance, and logistics, a single UAV often proves insufficient due to limitations in coverage, efficiency, and robustness. Consequently, multi-UAV systems, or UAV swarms, have emerged as a pivotal technology, promising enhanced task performance through cooperation. However, orchestrating a fleet of UAVs in a cluttered, dynamic environment so that it completes a mission efficiently and safely remains a significant challenge. The core of this challenge lies in cooperative path planning: each UAV must plan its trajectory not only to avoid static obstacles and reach its goal, but also to dynamically avoid collisions with the other UAVs in the swarm.
Traditional path planning algorithms, such as A* or Rapidly-exploring Random Trees (RRT), often struggle with the combinatorial complexity introduced by multiple interacting agents. While centralized optimization methods can provide solutions, they suffer from scalability issues and require perfect global information, which is impractical in real-world scenarios. Recent advancements in Deep Reinforcement Learning (DRL) offer a promising paradigm for decentralized, learning-based control. Among DRL algorithms, Deep Q-Networks (DQN) have shown great potential for solving sequential decision-making problems. Yet, applying standard DQN to multi-UAV cooperative path planning reveals several critical limitations that hinder performance and learning efficiency.
This paper identifies three fundamental shortcomings of standard DQN in this context. First, the commonly used ε-greedy exploration strategy is inefficient and undirected, leading to slow convergence as UAVs waste time on random actions in vast state spaces. Second, the reward signal in navigation tasks is typically sparse (e.g., a large reward only upon reaching the goal), making it difficult for the learning agent to associate its actions with long-term outcomes. Third, standard uniform experience replay treats all past transitions equally, potentially overlooking rare but crucial experiences, such as successful last-moment collision avoidance between UAVs, which are vital for learning robust cooperative behavior.
To address these challenges, this paper proposes HAP-DQN, a novel DQN variant enhanced with Hierarchical exploration, Adaptive prioritized replay, and Potential-based reward shaping. HAP-DQN is designed to enable a UAV swarm to learn efficient, collision-free cooperative paths autonomously. The hierarchical exploration strategy intelligently guides UAVs toward their goals when safe, switching to careful, policy-driven exploration near obstacles. The potential-based reward shaping provides a dense guiding signal by rewarding progress toward the goal, effectively mitigating the sparse-reward problem. Finally, the adaptive prioritized replay mechanism ensures that experiences critical for safety and cooperation, especially those involving interactions among multiple UAVs, are replayed more frequently, accelerating the learning of cooperative collision-avoidance policies.
Problem Formulation: Multi-UAV Cooperative Inspection as a Decentralized MDP
We consider a cooperative inspection task where a swarm of \( N \) UAVs, denoted as \( \mathcal{U} = \{U_1, U_2, \dots, U_N\} \), must navigate from their individual start positions \( S_i \) to designated target inspection points \( T_i \) within a shared 3D environment. The environment contains static obstacles \( \mathcal{O}_{\text{static}} \) such as terrain, buildings, and power lines. The primary challenge is that each UAV \( U_i \) must also treat every other UAV \( U_j \) (\( j \neq i \)) as a dynamic obstacle to avoid.
The objective is to find a set of optimal paths \( \{ \tau^*_1, \tau^*_2, \dots, \tau^*_N \} \) that minimize a global cost function \( J \), which balances efficiency and safety for the entire swarm:
$$
\min J = \sum_{i=1}^{N} \left( w_l \cdot \text{length}(\tau_i) + w_t \cdot \text{time}(\tau_i) + w_r \cdot \text{risk}(\tau_i) \right)
$$
where \( \text{length}(\tau_i) \) and \( \text{time}(\tau_i) \) measure the path length and completion time for UAV \( U_i \), \( \text{risk}(\tau_i) \) penalizes proximity to obstacles and to other UAVs, and \( w_l, w_t, w_r \) are weighting coefficients.
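As an illustration, the global objective can be evaluated numerically from candidate waypoint paths. The sketch below assumes constant-speed flight and a caller-supplied risk term; both are simplifications not fixed by the formulation above, and the function names are illustrative.

```python
import numpy as np

def path_cost(tau, w_l=1.0, w_t=1.0, w_r=1.0, speed=10.0, risk_fn=None):
    """Cost of one path tau (sequence of 3D waypoints):
    w_l * length + w_t * time + w_r * risk."""
    tau = np.asarray(tau, dtype=float)
    segments = np.diff(tau, axis=0)                  # waypoint-to-waypoint steps
    length = np.linalg.norm(segments, axis=1).sum()  # total path length
    time = length / speed                            # constant-speed flight time
    risk = risk_fn(tau) if risk_fn is not None else 0.0
    return w_l * length + w_t * time + w_r * risk

def swarm_cost(paths, **kwargs):
    """Global objective J: sum of per-UAV path costs."""
    return sum(path_cost(tau, **kwargs) for tau in paths)
```

For a straight 5 m segment flown at 10 m/s with unit weights and zero risk, the path cost is 5 + 0.5 = 5.5.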
All paths must satisfy the following constraints:
- Kinematic Constraints: \( \kappa(\tau_i) \leq \kappa_{\text{max}}, \phi(\tau_i) \leq \phi_{\text{max}} \), where \( \kappa \) and \( \phi \) are path curvature and climb angle.
- Boundary Constraints: All waypoints must lie within the operational airspace boundaries.
- Static Obstacle Avoidance: \( \min_{o \in \mathcal{O}_{\text{static}}} \| p - o \|_2 > d_{\text{static}}, \ \forall p \in \tau_i \).
- Cooperative Collision Avoidance: \( \| p_{i,t} - p_{j,t} \|_2 > d_{\text{safe}}, \ \forall i \neq j, \forall t \).
- Communication Constraint: A UAV \( U_i \) can only perceive the state of a neighbor \( U_j \) within a communication range \( d_{\text{comm}} \).
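As a sanity check of the cooperative collision-avoidance constraint, the minimum pairwise separation at one timestep can be computed with a short NumPy routine (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def min_pairwise_separation(positions):
    """Smallest inter-UAV distance ||p_i - p_j|| at one timestep;
    the cooperative constraint requires this to exceed d_safe."""
    P = np.asarray(positions, dtype=float)
    diffs = P[:, None, :] - P[None, :, :]   # all pairwise differences
    d = np.linalg.norm(diffs, axis=-1)      # N x N distance matrix
    np.fill_diagonal(d, np.inf)             # ignore self-distances
    return d.min()
```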
We model this as a decentralized Partially Observable Markov Decision Process (Dec-POMDP) with parameter sharing. Each UAV is an independent agent executing a shared policy based on its local observations. The per-agent decision process is defined by the tuple \( \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle \).
| Component | Description | Mathematical Definition / Composition |
|---|---|---|
| State Space \( \mathcal{S} \) | The local observation of the UAV. | \( s_{i,t} = [ \mathbf{p}_{i,t}, \mathbf{v}_{i,t}, \mathbf{p}_{\text{goal}_i}, \mathbf{o}_{i,t}, \mathbf{n}_{i,t} ] \). – \( \mathbf{p}_{i,t}, \mathbf{v}_{i,t} \in \mathbb{R}^3 \): Position & velocity. – \( \mathbf{p}_{\text{goal}_i} \in \mathbb{R}^3 \): Target location. – \( \mathbf{o}_{i,t} \in \mathbb{R}^m \): Sensor readings of nearby static obstacles. – \( \mathbf{n}_{i,t} \in \mathbb{R}^{k \times 6} \): States (relative position & velocity) of the \( k \) nearest neighbor UAVs within \( d_{\text{comm}} \). |
| Action Space \( \mathcal{A} \) | Discrete motion primitives. | \( \mathcal{A} = \{ \text{Forward, Turn Left, Turn Right, Climb, Descend,} \) \( \text{Climb-Left, Climb-Right} \} \). Each action corresponds to a fixed displacement. |
| Reward Function \( \mathcal{R} \) | Guides the UAV towards desired behavior. | \( R(s_{i,t}, a_{i,t}) = R_{\text{goal}} + R_{\text{collision}} + R_{\text{step}} \). – \( R_{\text{goal}} = +r_g \) if \( \|\mathbf{p}_i - \mathbf{p}_{\text{goal}_i}\|_2 < \epsilon_g \). – \( R_{\text{collision}} = -r_c \) (static obstacle), \( -r_d \) (near-miss with another UAV). – \( R_{\text{step}} = -r_s \) per timestep (encourages efficiency). |
| Transition \( \mathcal{P} \) & Discount \( \gamma \) | Environment dynamics and future reward importance. | \( \mathcal{P}(s' \mid s, a) \) defined by drone kinematics and the environment. Discount factor \( \gamma \in [0,1) \) (e.g., 0.99). |
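The local observation defined in the table can be assembled into a flat vector for the Q-network. The sketch below assumes zero-padding when fewer than \( k \) neighbors are in range, a common convention that the formulation does not spell out:

```python
import numpy as np

def local_observation(p, v, p_goal, obstacle_readings, neighbors, k=2):
    """Assemble s_{i,t} = [p, v, p_goal, o, n] for one UAV.
    neighbors: list of (relative_position, relative_velocity) pairs for
    UAVs inside d_comm; keep the k nearest and zero-pad the rest."""
    neighbors = sorted(neighbors, key=lambda rv: np.linalg.norm(rv[0]))[:k]
    n = np.zeros((k, 6))                       # k rows of (rel pos, rel vel)
    for idx, (rel_p, rel_v) in enumerate(neighbors):
        n[idx] = np.concatenate([rel_p, rel_v])
    return np.concatenate([p, v, p_goal, obstacle_readings, n.ravel()])
```

With \( m = 4 \) obstacle readings and \( k = 2 \) neighbors, the observation has 3 + 3 + 3 + 4 + 12 = 25 dimensions.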

The HAP-DQN Algorithm: Core Components
The proposed HAP-DQN algorithm builds upon the standard DQN architecture but integrates three key innovations to address the specific challenges of multi-UAV path planning.
1. Potential-based Reward Shaping (P)
To tackle the sparse reward problem, we employ Potential-Based Reward Shaping (PBRS). PBRS adds a shaping reward \( F(s, a, s') = \gamma \Phi(s') - \Phi(s) \) to the environmental reward \( R \), which is guaranteed not to alter the optimal policy of the original MDP. For our UAV navigation task, we define the potential \( \Phi(s_i) \) as the negative distance to the goal:
$$
\Phi(s_i) = -k \cdot \| \mathbf{p}_i - \mathbf{p}_{\text{goal}_i} \|_2
$$
where \( k \) is a scaling factor. The modified reward becomes:
$$
R'(s_i, a_i, s'_i) = R(s_i, a_i) + \gamma \Phi(s'_i) - \Phi(s_i)
$$
This formulation provides a dense learning signal: the UAV receives a positive shaping reward for every action that reduces its distance to the target and a penalty for actions that increase it. This guides the UAVs to learn goal-directed behavior much faster, forming a foundation upon which more complex obstacle avoidance can be built.
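In code, the shaping term reduces to a few lines. This is a minimal sketch using \( k = 0.5 \) and \( \gamma = 0.99 \) from the experiment settings:

```python
import numpy as np

GAMMA = 0.99  # discount factor
K = 0.5       # potential scaling factor k

def potential(p, p_goal, k=K):
    """Phi(s) = -k * ||p - p_goal||_2."""
    return -k * np.linalg.norm(np.asarray(p, float) - np.asarray(p_goal, float))

def shaped_reward(r_env, p, p_next, p_goal, gamma=GAMMA):
    """R' = R + gamma * Phi(s') - Phi(s)."""
    return r_env + gamma * potential(p_next, p_goal) - potential(p, p_goal)
```

For example, stepping from (0, 0, 0) to (1, 0, 0) toward a goal at (10, 0, 0) yields a shaping bonus of 0.99 · (−4.5) − (−5.0) = 0.545.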
2. Hierarchical Exploration Strategy (H)
The standard ε-greedy strategy is replaced with a two-layer hierarchical approach that dynamically adjusts exploration based on situational safety. Let \( d_{\text{min}} \) be the minimum distance from the UAV to any obstacle (static or dynamic).
- Macro-level (Goal-oriented Exploration): When \( d_{\text{min}} > d_{\text{safe}} \) (safe region), the UAV employs a goal-biased strategy with probability \( p_{\text{guidance}} \). It selects the action that minimizes the expected Euclidean distance to the goal:
$$ a^*_{\text{guidance}} = \arg \min_{a_k \in \mathcal{A}} \| f(\mathbf{p}_i, a_k) - \mathbf{p}_{\text{goal}_i} \|_2 $$
where \( f(\cdot) \) predicts the next position after taking action \( a_k \).
- Micro-level (Policy-driven Exploration): When \( d_{\text{min}} \leq d_{\text{safe}} \) (danger zone), the UAV switches to a standard ε-greedy strategy based on the current Q-network to perform careful, policy-driven exploration for obstacle and inter-UAV avoidance; with probability \( 1 - \varepsilon \) it takes the greedy action:
$$ a^*_{Q} = \arg \max_{a \in \mathcal{A}} Q(s, a; \theta) $$
This hierarchical mechanism prevents wasteful random exploration in open spaces, significantly accelerating convergence towards target regions while preserving the ability to learn delicate maneuvering in congested areas.
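The two-layer selection rule can be sketched as follows. The fallback branch when the goal-bias coin fails, and the concrete default probabilities, are illustrative choices not fixed by the text:

```python
import numpy as np

def select_action(p, p_goal, d_min, q_values, predict_next, n_actions,
                  d_safe=5.0, p_guidance=0.8, epsilon=0.1, rng=None):
    """Hierarchical exploration: goal-biased in open space,
    epsilon-greedy on the Q-network near obstacles.
    predict_next(p, a): kinematic model f(.) returning the position
    reached from p after discrete action a."""
    if rng is None:
        rng = np.random.default_rng()
    if d_min > d_safe:
        # Macro level: with probability p_guidance, pick the action that
        # minimizes the predicted distance to the goal.
        if rng.random() < p_guidance:
            dists = [np.linalg.norm(predict_next(p, a) - p_goal)
                     for a in range(n_actions)]
            return int(np.argmin(dists))
    elif rng.random() < epsilon:
        # Micro level: random exploratory action in the danger zone.
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values))  # greedy Q action otherwise
```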
3. Adaptive Prioritized Experience Replay (A)
While Prioritized Experience Replay (PER) prioritizes transitions with high temporal-difference (TD) error, it may undervalue experiences that are critical for learning safety and cooperation but, once learned, have low TD error. Our adaptive replay mechanism augments the priority calculation with a "critical experience" bonus. The priority \( P(i) \) for transition \( i \) is:
$$
P(i) = (|\delta_i| + \epsilon)^\alpha + \omega_c \cdot C(i)
$$
where \( \delta_i \) is the TD error, \( \alpha \) determines the prioritization strength, \( \omega_c \) is a weight for critical experiences, and \( C(i) \in \{0, 1\} \) is an indicator: \( C(i) = 1 \) if the transition is tagged as a critical experience, which includes:
- Successful avoidance of a static obstacle at close range.
- Successful maneuver that increases separation from another UAV below the safe distance.
- Discovery of a significantly shorter path segment to a known state.
This ensures that experiences fundamental to learning safe, cooperative interactions among UAVs are replayed more frequently, reinforcing these essential behaviors in the policy network.
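The priority rule and proportional sampling can be sketched in a few lines (the \( \alpha \) and \( \epsilon \) defaults are illustrative; the source fixes only \( \omega_c = 1.0 \)):

```python
import numpy as np

def priorities(td_errors, critical_flags, alpha=0.6, eps=1e-3, w_c=1.0):
    """P(i) = (|delta_i| + eps)^alpha + w_c * C(i), computed for a batch."""
    td = np.abs(np.asarray(td_errors, dtype=float))
    c = np.asarray(critical_flags, dtype=float)   # C(i) in {0, 1}
    return (td + eps) ** alpha + w_c * c

def sample_indices(prios, batch_size, rng=None):
    """Draw replay indices with probability proportional to priority."""
    if rng is None:
        rng = np.random.default_rng()
    prios = np.asarray(prios, dtype=float)
    return rng.choice(len(prios), size=batch_size, p=prios / prios.sum())
```

A transition tagged as critical thus keeps an elevated replay probability even after its TD error has decayed.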
Simulation Experiments and Performance Analysis
We evaluate HAP-DQN in a custom 3D simulation environment designed for UAV-based power grid inspection. The environment features terrain hills, tower structures, and transmission lines as static obstacles.
| Parameter Category | Parameter | Value |
|---|---|---|
| Environment/Task | Space Dimensions | 500 m × 500 m × 100 m |
| Environment/Task | Number of UAVs \( N \) | 5 |
| Environment/Task | UAV Speed | 10 m/s |
| Environment/Task | Safety Distance \( d_{\text{safe}} \) | 5 m |
| HAP-DQN Specific | Potential Scale \( k \) | 0.5 |
| HAP-DQN Specific | Critical Experience Weight \( \omega_c \) | 1.0 |
| HAP-DQN Specific | Observed Neighbors \( k \) | 2 |
| Training | Learning Rate | 0.0001 |
| Training | Discount Factor \( \gamma \) | 0.99 |
| Training | Batch Size | 128 |
Benchmark Algorithms and Evaluation Metrics
We compare HAP-DQN against four baseline algorithms:
- Improved A* (TWA*): A classical search algorithm with time windows for dynamic collision avoidance.
- I-DQN: Independent DQN, where each UAV learns with a standard DQN without sharing states or modeling the other agents.
- VDN: Value Decomposition Networks, a cooperative MARL algorithm that learns a centralized joint action-value function.
- MADDPG: Multi-Agent Deep Deterministic Policy Gradient, a state-of-the-art actor-critic MARL algorithm for continuous action spaces (discretized for fair comparison).
The performance is evaluated using the following metrics, averaged over 100 independent test episodes:
- Success Rate (SR): Percentage of UAVs that reach their goal without collision.
- Average Path Length (APL): Mean flight distance of successful UAVs.
- Cooperative Conflict Rate (CR): Average number of inter-UAV conflicts (\( d < d_{\text{safe}} \)) per episode.
- Average Completion Time (ACT): Mean mission time of successful UAVs.
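The conflict-rate metric, for example, can be computed directly from logged trajectories. This sketch assumes synchronized position logs of shape episodes × timesteps × N × 3 (an assumed logging format, not specified in the text):

```python
import numpy as np

def conflict_rate(trajectories, d_safe=5.0):
    """CR: mean number of pairwise events with distance < d_safe per episode.
    trajectories: array of shape (episodes, timesteps, N, 3)."""
    traj = np.asarray(trajectories, dtype=float)
    diffs = traj[:, :, :, None, :] - traj[:, :, None, :, :]
    d = np.linalg.norm(diffs, axis=-1)                     # (E, T, N, N)
    i, j = np.triu_indices(d.shape[2], k=1)                # unordered pairs only
    conflicts = (d[:, :, i, j] < d_safe).sum(axis=(1, 2))  # events per episode
    return float(conflicts.mean())
```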
Results and Discussion
The convergence curves during training demonstrate HAP-DQN's superior learning efficiency: its average episodic reward rises earlier and more steeply than that of I-DQN, VDN, and MADDPG. This is attributed to the dense guidance from potential-based shaping and the efficient hierarchical exploration strategy, which together allow the swarm to discover rewarding behaviors faster.
| Algorithm | Success Rate (SR %) ↑ | Avg. Path Length (APL m) ↓ | Cooperative Conflict Rate (CR) ↓ | Avg. Completion Time (ACT s) ↓ |
|---|---|---|---|---|
| Improved A* (TWA*) | 81.5 ± 6.8 | 325.1 ± 10.2 | 0.11 ± 0.03 | 35.1 ± 1.9 |
| I-DQN | 71.4 ± 5.2 | 352.6 ± 15.8 | 0.18 ± 0.04 | 38.5 ± 2.1 |
| VDN | 86.2 ± 4.1 | 331.8 ± 12.3 | 0.09 ± 0.02 | 35.6 ± 1.8 |
| MADDPG | 92.1 ± 3.5 | 320.7 ± 11.1 | 0.05 ± 0.01 | 34.2 ± 1.6 |
| HAP-DQN (Ours) | 97.6 ± 2.5 | 315.4 ± 9.7 | 0.02 ± 0.01 | 33.1 ± 1.5 |
The quantitative results in Table 3 highlight the advantages of HAP-DQN. It achieves the highest success rate (97.6%), significantly outperforming I-DQN (71.4%) and VDN (86.2%) and surpassing the strong MADDPG baseline (92.1%), indicating the superior robustness of the learned policy. HAP-DQN also plans the most efficient paths, yielding the shortest average path length and completion time. Most importantly, it demonstrates exceptional cooperative safety: its conflict rate (0.02) is nearly an order of magnitude lower than I-DQN's (0.18) and less than half of MADDPG's (0.05). This drastic reduction in conflicts between UAVs is a direct consequence of the adaptive prioritized replay mechanism, which reinforces successful collision-avoidance maneuvers.
Ablation Study
To validate the contribution of each component, we conduct an ablation study by removing one module at a time:
- AP-DQN: HAP-DQN without Hierarchical exploration (H). Uses standard ε-greedy.
- HP-DQN: HAP-DQN without Adaptive prioritized replay (A). Uses uniform replay.
- HA-DQN: HAP-DQN without Potential-based reward shaping (P). Uses original sparse reward.
The training curves of these variants confirm that each module is essential. HA-DQN (no shaping) shows extremely slow initial learning due to the sparse reward. AP-DQN (no hierarchical exploration) converges slower than the full HAP-DQN. HP-DQN (no adaptive replay) converges to a lower final performance and, in separate tests, exhibited a higher conflict rate, underscoring the importance of replaying critical cooperative experiences for the UAV swarm.
Conclusion
This paper presented the HAP-DQN algorithm for multi-UAV cooperative path planning. By integrating three synergistic enhancements (Hierarchical exploration, Adaptive prioritized replay, and Potential-based reward shaping) into the DQN framework, HAP-DQN effectively addresses the key challenges of inefficient exploration, sparse rewards, and difficult cooperative collision avoidance that hinder standard methods when applied to UAV swarms.
Extensive simulations in a complex 3D inspection environment demonstrate that HAP-DQN enables a fleet of UAVs to learn policies that are significantly more efficient, robust, and safe than those learned by benchmark algorithms such as I-DQN, VDN, and MADDPG. The algorithm achieves a high task success rate, plans shorter paths, and, most notably, drastically reduces the rate of conflicts between UAVs, which is paramount for the safe deployment of dense aerial swarms.
The proposed modules are conceptually general and could be integrated into other value-based or even policy-based multi-agent reinforcement learning frameworks for UAV coordination. Future work will focus on scaling HAP-DQN to larger numbers of UAVs, incorporating more realistic communication models with delays and packet loss, and transferring the learned policy to physical UAV platforms in real-world inspection scenarios.
