A Hierarchical Reinforcement Learning-Based Navigation Scheme for Unmanned Drones

In recent years, the continuous development of technology and markets has led to an ever-expanding range of applications for unmanned aerial vehicles (UAVs), with the low-altitude economy gradually becoming a significant component of emerging industries. As highly autonomous flying platforms, drones are now widely deployed in numerous fields such as logistics delivery, agricultural plant protection, environmental monitoring, urban management, emergency rescue, and military reconnaissance. Their efficient and flexible operational modes enhance work efficiency, reduce costs, and have a relatively low environmental impact. However, as unmanned drones undertake increasingly complex tasks, they face significant challenges, particularly in autonomous navigation and path planning within intricate environments. Therefore, the navigation and planning of unmanned drones has become a critical issue in UAV technology. How to compute an optimal or feasible flight route from a starting point to a target based on mission objectives, flight constraints, and obstacle environments, and successfully execute it, has attracted extensive research from scholars.

The complexity of unmanned drone path planning is mainly reflected in the following aspects. Firstly, obstacles in the environment include both static structures such as buildings and dynamic objects such as birds or other flying vehicles. Secondly, the flight process must satisfy multiple constraints, including maximum flight altitude, energy consumption, mission time limits, and flight safety. Thirdly, certain tasks require the unmanned drone to adapt flexibly, making decisions in real time in response to a dynamically changing environment. Consequently, path planning is not merely about finding a simple route; it also involves balancing multiple objectives and constraints to find an optimal or at least feasible solution.

For unmanned drone navigation planning, two primary strategies are commonly employed. The traditional one is a planning-and-control separation framework. Within this framework, a planner generates a trajectory that conforms to dynamic constraints and lies within safe space, while a controller produces control signals from the error between the state feedback and the trajectory reference so that the tracking error converges. This approach has two drawbacks. First, the trajectory optimization can fall into a local minimum, potentially leading to collisions. Second, disturbance rejection is difficult in such a decoupled framework. The other strategy is end-to-end planning and control based directly on real-time sensory data. This learning-based approach can directly handle dynamic obstacles and is suitable for complex and rapidly changing environments, but it is relatively difficult to debug and integrate, and it also faces issues with generalization and stability.

To address the above issues, this paper proposes a hierarchical reinforcement learning (HRL) framework to better tackle the navigation and planning problem for unmanned drones. The navigation scheme primarily consists of two stages. The first stage involves high-level goal planning: after autonomously exploring to build an environmental map, a global path is generated in a static, large-scale environment using a path search algorithm. Based on this global waypoint information, combined with the unmanned drone’s own state and point cloud information perceived by its LiDAR, a real-time goal-search reinforcement learning algorithm framework is designed. This framework can output an optimal temporary target point at a low frequency. The second stage employs a high-frequency controller, which converts the real-time target point into the inertial coordinate system. A low-level controller is trained to enable the unmanned drone to track the searched temporary target point. Finally, by integrating the low-level controller into the high-level planner, the complete unmanned drone navigation system is obtained. This multi-level approach can accelerate the convergence speed of deep reinforcement learning while also demonstrating good performance in collision prevention and path optimization, showcasing the strong robustness of unmanned drone navigation.

Problem Description and System Modeling

Problem Description

This paper addresses the navigation planning problem for unmanned drones by proposing a hierarchical reinforcement learning framework that moves beyond the planning-control separation paradigm. The core methodology involves autonomously exploring to build an environmental map, then using a pathfinding algorithm to generate a global path within a static obstacle environment. Based on this global waypoint information, the nearest relevant waypoint is selected in real-time. Subsequently, by utilizing the perception from the drone’s LiDAR to establish the kinematic relationships of obstacles and build a spatial obstacle model, a high-level deep reinforcement learning planner generates a temporary target position relative to the drone’s body coordinate frame, considering the drone’s own state and external information. The low-level controller then takes this target position and, using the drone’s own information, appropriately controls the drone’s attitude to allow it to smoothly reach the temporary target point. This process repeats until the unmanned drone reaches the final destination.

Unmanned Drone System Modeling

To establish an accurate mathematical model for the quadrotor unmanned drone, the following assumptions are made: The drone airframe is a rigid structure with complete geometric and mass distribution symmetry. The propellers are rigid, and the lift and drag they generate during flight are proportional to the square of their rotational speed. Ground effect and aerodynamic moments (except those generated by the propellers) are neglected. These assumptions hold within well-defined limits for the urban delivery scenario targeted in this paper: the drone's cruising altitude is typically greater than 5 m, far from the ground, making ground effect negligible; the flight speed is relatively low (< 10 m/s), so aerodynamic moments are much smaller than control moments and have a limited impact on attitude control precision. Simulation experiments show that controllers based on this simplified model maintain stable tracking (error < 0.1 m), which we attribute primarily to the adaptive capability of deep reinforcement learning. The controller can learn compensation strategies for unmodeled dynamics through interaction with the environment during training, which helps suppress errors introduced by model simplification.

The Earth-fixed frame and the body-fixed frame need to be defined. The Earth frame follows the North-East-Down (NED) convention and is considered the inertial frame. The body frame origin is at the drone’s center of mass (assumed to be the geometric center for a uniform rigid body drone), with $X_b$ pointing forward, $Y_b$ pointing to the right, and $Z_b$ pointing downward. The drone’s pose is correspondingly described by the transformation of the body frame relative to the Earth frame.

For a quadrotor unmanned drone, the pose is generally represented by a position and a set of Euler angles. Assuming the drone's position is $p = [x, y, z]^T$, attitude is $[\phi, \theta, \psi]^T$, and velocity is $v = [\dot{x}, \dot{y}, \dot{z}]^T$, the rotation matrix from the body frame to the Earth frame is:

$$
R = \begin{bmatrix}
\cos\theta \cos\psi & \cos\psi \sin\theta \sin\phi - \sin\psi \cos\phi & \cos\psi \sin\theta \cos\phi + \sin\psi \sin\phi \\
\cos\theta \sin\psi & \sin\psi \sin\theta \sin\phi + \cos\psi \cos\phi & \sin\psi \sin\theta \cos\phi - \cos\psi \sin\phi \\
-\sin\theta & \sin\phi \cos\theta & \cos\phi \cos\theta
\end{bmatrix}
$$

where $\phi, \theta, \psi$ are the roll, pitch, and yaw angles of the unmanned drone, respectively. The kinematic equations for the drone’s attitude can be expressed as:

$$
\begin{bmatrix} \dot{\phi} \\ \dot{\theta} \\ \dot{\psi} \end{bmatrix} = \begin{bmatrix} 1 & \sin\phi \tan\theta & \cos\phi \tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi / \cos\theta & \cos\phi / \cos\theta \end{bmatrix} \begin{bmatrix} p \\ q \\ r \end{bmatrix}
$$

This equation relates the rate of change of the Euler angles to the body angular velocities $(p, q, r)$.
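As an illustration, the rotation matrix and the Euler-rate mapping above can be written directly in NumPy. This is a minimal sketch rather than the implementation used in the experiments; the function names are hypothetical, and `body_to_earth_point` is an assumed helper showing how a point given in the body frame (such as a temporary target) could be expressed in the Earth frame.

```python
import numpy as np

def rotation_body_to_earth(phi, theta, psi):
    """Rotation matrix R (body frame -> NED Earth frame) for roll, pitch, yaw."""
    cphi, sphi = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cpsi, spsi = np.cos(psi), np.sin(psi)
    return np.array([
        [cth * cpsi, cpsi * sth * sphi - spsi * cphi, cpsi * sth * cphi + spsi * sphi],
        [cth * spsi, spsi * sth * sphi + cpsi * cphi, spsi * sth * cphi - cpsi * sphi],
        [-sth,       sphi * cth,                      cphi * cth],
    ])

def euler_rates(phi, theta, body_rates):
    """Map body angular velocity (p, q, r) to Euler angle rates."""
    cphi, sphi = np.cos(phi), np.sin(phi)
    cth, tth = np.cos(theta), np.tan(theta)
    W = np.array([
        [1.0, sphi * tth, cphi * tth],
        [0.0, cphi,       -sphi],
        [0.0, sphi / cth, cphi / cth],
    ])
    return W @ np.asarray(body_rates, dtype=float)

def body_to_earth_point(position, euler, point_body):
    """Assumed helper: express a body-frame point in the Earth frame."""
    R = rotation_body_to_earth(*euler)
    return np.asarray(position, dtype=float) + R @ np.asarray(point_body, dtype=float)
```

Note that the Euler-rate matrix contains $1/\cos\theta$ and is therefore singular at $\theta = \pm\pi/2$, so any use of this mapping has to keep the pitch angle away from that value.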

To model the dynamics of the unmanned drone, based on the Newton-Euler equations, a dynamic model is established to describe the motion of the drone under forces and moments, analyzing translational and rotational motion separately. Transforming the total thrust $T$ (acting along the negative body $Z_b$ axis) to the inertial frame using $R$ gives the thrust force vector: $F_{thrust} = R \cdot [0, 0, -T]^T$. Considering gravity $[0, 0, mg]^T$, the translational dynamics become:

$$
\begin{cases}
m\ddot{x} = -(\cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi) \cdot T \\
m\ddot{y} = -(\cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi) \cdot T \\
m\ddot{z} = mg - (\cos\phi \cos\theta) \cdot T
\end{cases}
$$
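The translational model can be checked numerically by evaluating the thrust vector $F_{thrust} = R \cdot [0, 0, -T]^T$ and gravity $[0, 0, mg]^T$ directly. The sketch below is illustrative only: the function name, the default $g = 9.81$ m/s², and the mass used in the usage note are assumptions, not values from this paper.

```python
import numpy as np

def translational_acceleration(phi, theta, psi, thrust, mass, g=9.81):
    """Acceleration in the NED Earth frame from total thrust and gravity.

    The thrust acts along the negative body Z axis, so its Earth-frame
    direction is minus the third column of the body-to-Earth rotation matrix.
    """
    cphi, sphi = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cpsi, spsi = np.cos(psi), np.sin(psi)
    # Third column of R: the body Z axis expressed in the Earth frame.
    body_z = np.array([
        cphi * sth * cpsi + sphi * spsi,
        cphi * sth * spsi - sphi * cpsi,
        cphi * cth,
    ])
    gravity = np.array([0.0, 0.0, g])  # NED: positive Z points down
    return gravity - (thrust / mass) * body_z
```

For example, with an assumed mass of 1.5 kg, a level drone commanded with `thrust = 1.5 * 9.81` hovers (zero acceleration), while zero thrust yields free fall along the positive (downward) $Z$ axis.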

For rotational dynamics, the equation in the body frame is: $I \dot{\omega} + \omega \times (I \omega) = \tau$, where $I$ is the inertia matrix, $\omega = [p, q, r]^T$ is the angular velocity vector, and $\tau = [\tau_x, \tau_y, \tau_z]^T$ is the control torque vector. The relationship between the total thrust $T$ and control torques $(\tau_x, \tau_y, \tau_z)$ and the squared rotor speeds $(\omega_1^2, \omega_2^2, \omega_3^2, \omega_4^2)$ is:

$$
\begin{bmatrix} T \\ \tau_x \\ \tau_y \\ \tau_z \end{bmatrix} = \begin{bmatrix} k_F & k_F & k_F & k_F \\ 0 & -k_F l & 0 & k_F l \\ -k_F l & 0 & k_F l & 0 \\ k_M & -k_M & k_M & -k_M \end{bmatrix} \begin{bmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{bmatrix}
$$

where $k_F$ is the thrust coefficient, $k_M$ is the torque coefficient, and $l$ is the arm length from the center of mass to each rotor. Substituting into the rotational dynamics yields the complete rotational model. These dynamic and kinematic models form the core theoretical foundation for unmanned drone flight control, fully describing its force-motion relationships in space.
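The allocation matrix above can be exercised with a few lines of NumPy. The sketch below is a hedged illustration: the function names are hypothetical, and the coefficient values in the usage note are placeholders chosen only to make the example run, not parameters of the drone studied here.

```python
import numpy as np

def mixer_matrix(k_f, k_m, arm):
    """Allocation matrix mapping squared rotor speeds to [T, tau_x, tau_y, tau_z]."""
    return np.array([
        [k_f,        k_f,        k_f,       k_f],
        [0.0,        -k_f * arm, 0.0,       k_f * arm],
        [-k_f * arm, 0.0,        k_f * arm, 0.0],
        [k_m,        -k_m,       k_m,       -k_m],
    ])

def rotor_speeds_to_wrench(omega, mixer):
    """Forward map: rotor speeds -> total thrust and body torques."""
    return mixer @ (np.asarray(omega, dtype=float) ** 2)

def wrench_to_rotor_speeds(wrench, mixer):
    """Inverse map: desired thrust and torques -> rotor speeds.

    Negative squared speeds (infeasible commands) are clipped to zero.
    """
    omega_sq = np.linalg.solve(mixer, np.asarray(wrench, dtype=float))
    return np.sqrt(np.clip(omega_sq, 0.0, None))
```

With placeholder values `mixer_matrix(1e-5, 1e-7, 0.2)`, four equal rotor speeds produce pure thrust and zero torque, and the two maps invert each other for feasible commands.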

The Low-Level Controller

For the proposed hierarchical reinforcement learning navigation system, the control policy of the low-level controller must be trained first. The implementation flow of the deep reinforcement learning-based low-level controller is as follows. First, the neural network weights are initialized, constructing the basic architecture of the policy network and the value function networks. The controller obtains the unmanned drone's own state information in real time via IMU sensors, including key motion parameters such as position, velocity, and attitude angles, and simultaneously receives the temporary target point provided by the high-level planner as a navigation reference. These state observations are fed into the policy network, which outputs the control action (desired body rates and collective thrust) through forward propagation. The immediate reward for this action is then evaluated, considering factors such as path tracking accuracy and motion smoothness. The system stores each interaction (current state, action taken, reward obtained, and next state) in an experience replay buffer, forming training samples for off-policy learning. Once the number of collected steps reaches a preset update interval, the controller randomly samples a batch from the buffer and updates the network parameters by backpropagating the temporal difference (TD) error, using gradient descent to gradually improve the policy and value networks. Training runs in episodes; an episode terminates when the unmanned drone reaches the target point or the maximum number of steps is exceeded.
The core advantage of this deep reinforcement learning controller is its ability to map directly from sensor inputs to control outputs in an end-to-end manner, without requiring complex mathematical modeling, and to adaptively learn the optimal control policy, making it particularly suitable for real-time control problems of unmanned drones in complex dynamic environments. Through continuous experience accumulation and network updates, the controller can progressively improve its robustness against noise interference, model uncertainty, and environmental changes.

State and Action Space Design

Based on the quadrotor unmanned drone model, the state input for the low-level controller includes the temporary target point $p_{temp}$, the drone’s own Euler angles and angular velocities, linear velocity, the previous action, and its position.

$$
s_t = [\phi, \theta, \psi, p, q, r, p_x, p_y, p_z, v_x, v_y, v_z, v_{exp}, p_{exp}, a_{t-1}]
$$

The action space consists of desired body angular rates and collective thrust: $a_t = [p_{des}, q_{des}, r_{des}, T_{des}]$. The generated action is fed into the unmanned drone's dynamic and kinematic equations to update its pose.

Reward Function Design

The reward function for the deep reinforcement learning-based unmanned drone low-level controller consists of four main parts:

$$
r = r_{event} + r_{goal} + r_{action} + r_{survive}
$$

The reward function includes an event-triggered reward, a goal-reaching reward, an action smoothness reward, and a survival reward. The event-triggered reward is sparse: a large positive reward is given when the unmanned drone reaches the target point; a large negative reward is given if it tips over (roll or pitch angle exceeds limits) or flies out of the map boundaries. The goal reward decreases with the distance to the target and increases with the alignment between the velocity direction and the target direction, guiding the drone towards the target. The action reward penalizes drastic action changes to ensure smooth control commands. The survival reward is a constant per-step penalty that encourages the unmanned drone to complete the task quickly. The specific formulations are:

$$
r_{event} = \begin{cases}
+30 & \text{if target reached} \\
-50 & \text{if } |\phi| > \pi/2 \text{ or } |\theta| > \pi/2 \\
-35 & \text{if out of map bounds}
\end{cases}
$$

$$
r_{goal} = -k_1 \cdot \Delta_{pos} + k_2 \cdot (v_{body} \cdot \Delta_{pos, unit})
$$

$$
r_{action} = -k_3 \cdot \| a_t - a_{t-1} \|^2
$$

$$
r_{survive} = -k_4
$$

where $\Delta_{pos}$ is the distance to target, $v_{body}$ is the velocity unit vector, $\Delta_{pos, unit}$ is the unit vector pointing from the drone to the target, and $k_1, k_2, k_3, k_4$ are tunable constants. The chosen values ($k_1=1.0, k_2=0.5, k_3=0.01, k_4=0.1$) achieve the best balance among the competing objectives in our experiments.
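To make the composition of the four terms concrete, the following sketch evaluates the reward for a single step. It is an illustration under stated assumptions, not the training code: the function signature is hypothetical, the event reward is assumed to be zero when no event occurs, the order in which simultaneous events are checked is an assumption, and a small `eps` is added to avoid division by zero when normalizing.

```python
import numpy as np

K1, K2, K3, K4 = 1.0, 0.5, 0.01, 0.1  # coefficient values reported in the text

def low_level_reward(pos, target, vel, action, prev_action, roll, pitch,
                     reached=False, out_of_bounds=False, eps=1e-8):
    """Sum of event, goal, action-smoothness, and survival terms for one step."""
    pos, target, vel = (np.asarray(v, dtype=float) for v in (pos, target, vel))
    action = np.asarray(action, dtype=float)
    prev_action = np.asarray(prev_action, dtype=float)

    delta = target - pos
    dist = np.linalg.norm(delta)                   # Delta_pos
    to_target = delta / (dist + eps)               # Delta_pos,unit
    vel_unit = vel / (np.linalg.norm(vel) + eps)   # v_body

    # Sparse event term; zero when no event fires (assumption).
    if reached:
        r_event = 30.0
    elif abs(roll) > np.pi / 2 or abs(pitch) > np.pi / 2:
        r_event = -50.0
    elif out_of_bounds:
        r_event = -35.0
    else:
        r_event = 0.0

    r_goal = -K1 * dist + K2 * float(vel_unit @ to_target)
    r_action = -K3 * float(np.sum((action - prev_action) ** 2))
    r_survive = -K4
    return r_event + r_goal + r_action + r_survive
```

For a drone 2 units from the target, flying straight at it with an unchanged action, the terms are approximately $-2.0 + 0.5 - 0 - 0.1 = -1.6$, which shows how the distance penalty dominates far from the target.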

Training Process

The Soft Actor-Critic (SAC) algorithm is employed to train the low-level controller. SAC, an advanced algorithm based on the Actor-Critic framework, introduces the core idea of “maximum entropy,” giving it outstanding performance in stability and sample efficiency. The training process involves interacting with a simulated environment, storing experiences $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer $R$, and periodically updating network parameters by sampling mini-batches from this buffer. SAC maintains two critic networks $Q_{\phi_1}, Q_{\phi_2}$, an actor network $\pi_{\theta}$, a state value network $V_{\psi}$, and a target value network $V_{\psi'}$. The key update steps from a sampled mini-batch are:

Critic Update: $y_i = r_i + \gamma V_{\psi'}(s_{i+1})$, minimize $\mathcal{L}(\phi_j) = \frac{1}{N}\sum_i (y_i - Q_{\phi_j}(s_i, a_i))^2$ for $j=1,2$.

Actor Update: Minimize $\mathcal{L}(\theta) = \frac{1}{N}\sum_i (\alpha \log \pi_{\theta}(a_i|s_i) - \min_{j=1,2} Q_{\phi_j}(s_i, a_i))$, where $a_i$ is sampled from $\pi_{\theta}(\cdot|s_i)$ and $\alpha$ is a learnable temperature parameter.

Value Network Update: $V_{target}(s_i) = \mathbb{E}_{a\sim\pi_{\theta}}[\min_{j=1,2} Q_{\phi_j}(s_i, a) - \alpha \log \pi_{\theta}(a|s_i)]$, minimize $\mathcal{L}(\psi) = \frac{1}{N}\sum_i (V_{target}(s_i) - V_{\psi}(s_i))^2$.

Target Network Soft Update: $\psi' \leftarrow \tau \psi + (1-\tau) \psi'$.
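The regression targets and the soft update above reduce to a few array operations. The sketch below shows only these pieces in NumPy, with network outputs passed in as plain arrays; it is not a full SAC implementation. The function names are hypothetical, and, like the formulas above, the critic target omits any special handling of terminal transitions.

```python
import numpy as np

GAMMA, TAU = 0.99, 0.005  # discount factor and target update coefficient (Table 1)

def critic_target(reward, v_target_next, gamma=GAMMA):
    """y_i = r_i + gamma * V_target(s_{i+1}) for a mini-batch."""
    return reward + gamma * v_target_next

def value_target(q1, q2, log_prob, alpha):
    """Single-sample estimate of min_j Q_j(s_i, a) - alpha * log pi(a | s_i)."""
    return np.minimum(q1, q2) - alpha * log_prob

def mse_loss(target, prediction):
    """Mean squared error used for both the critic and the value losses."""
    return float(np.mean((target - prediction) ** 2))

def soft_update(target_params, online_params, tau=TAU):
    """psi' <- tau * psi + (1 - tau) * psi', applied to each parameter array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

With $\tau = 0.005$, each update moves the target parameters only 0.5% of the way toward the online parameters, which is what keeps the critic target slowly varying during training.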

Table 1: Training Hyperparameters for Low-Level Controller
Parameter	Value
Discount factor ($\gamma$)	0.99
Target update coefficient ($\tau$)	0.005
Learning rate	$3 \times 10^{-4}$
Hidden layers / neurons	2 layers, 256 neurons each
Batch size	256
Target entropy $\mathcal{H}_0$	-dim($\mathcal{A}$) = -4

The High-Level Planner

The high-level planner is implemented based on deep reinforcement learning (DRL), aiming to provide decision support for the autonomous navigation of unmanned drones in complex dynamic environments. The planner starts with neural network parameter initialization and optimizes its action decision policy through continuous interaction with the environment. The system first acquires the reference path provided by the global path planning module and initializes the unmanned drone’s key state information, including real-time data such as position, velocity, and attitude. Utilizing a look-ahead mechanism, the planner extracts local waypoints based on the global path. These waypoints must adhere to the global heading requirements while satisfying real-time obstacle avoidance needs. The introduction of an obstacle safety margin function ensures the generated path has sufficient safety clearance; this function can dynamically adjust the safety boundary based on obstacle information perceived by sensors.

The planner observes and analyzes the current environmental state through a deep neural network and outputs an action command, expressed as a target point offset $(\Delta p_x, \Delta p_y, \Delta p_z)$ in the drone’s body coordinate frame. An immediate reward is computed to evaluate the action’s quality. This state-action-reward-next-state data is stored in an experience replay pool, constituting important training samples for reinforcement learning. The low-level controller is responsible for translating these action commands into specific control signals to drive the unmanned drone’s motion. The system continuously monitors training progress. When a preset update step is reached, a batch of high-quality experience data is sampled from the replay pool, and network parameters are updated via backpropagation to gradually optimize the policy network.

State Space and Observation Processing

The state input for the high-level planner includes the current look-ahead guidance point, the unmanned drone’s own position, and the point cloud observation information.

$$
s_t = [p_x, p_y, p_z, p_{exp}, Obs_{outer}]
$$

The action space is the target point in the body frame: $a_t = [p_x, p_y, p_z]_{body}$. The external observation $Obs_{outer}$ is represented as a polar-coordinate map of the perceived point cloud intensity. The processing of point cloud information involves four steps: 1) Raw Point Cloud Acquisition & Filtering: The LiDAR generates raw data at 30 Hz. A pass-through filter retains points within a 10 m radius, a $\pm 30^\circ$ vertical FOV, and a $120^\circ$ horizontal FOV ($\pm 60^\circ$ from the heading). 2) Polar Gridding: Filtered points are projected onto a polar grid. The horizontal range $[-60^\circ, +60^\circ]$ is divided into $N_\theta=12$ sectors ($10.0^\circ$ resolution), and the vertical range $[-30^\circ, +30^\circ]$ into $N_r=8$ rings ($7.5^\circ$ resolution). Each grid cell $(r_i, \theta_j)$ holds the maximum intensity value (normalized to $[0,1]$) of the points within it. Empty cells are set to 0. 3) Polar Feature Map: This yields a $12 \times 8$ 2D feature map representing obstacle occupancy intensity, converting the sparse point cloud into a dense matrix. 4) Network Input Adaptation: The $12 \times 8$ feature map is flattened into a 96-dimensional vector and concatenated with the drone state vector to form the complete observation input.
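The four processing steps can be sketched as a single NumPy function. This is a minimal illustration under several assumptions: points are given in the body frame (X forward, Y right, Z down), intensities are already normalized to $[0, 1]$, elevation is measured positive upward, and the function and argument names are hypothetical.

```python
import numpy as np

def polar_feature_map(points, intensity, max_range=10.0,
                      h_half_fov=60.0, v_half_fov=30.0, n_h=12, n_v=8):
    """Project a point cloud onto an n_h x n_v polar grid of maximum intensity."""
    grid = np.zeros((n_h, n_v))
    points = np.asarray(points, dtype=float)
    intensity = np.clip(np.asarray(intensity, dtype=float), 0.0, 1.0)
    if len(points) == 0:
        return grid  # empty cells stay at 0

    x, y, z = points.T
    rng = np.linalg.norm(points, axis=1)
    azimuth = np.degrees(np.arctan2(y, x))
    elevation = np.degrees(np.arctan2(-z, np.hypot(x, y)))  # Z points down

    # Step 1: pass-through filter on range and field of view.
    keep = ((rng > 0.0) & (rng <= max_range)
            & (np.abs(azimuth) <= h_half_fov)
            & (np.abs(elevation) <= v_half_fov))
    azimuth, elevation, intensity = azimuth[keep], elevation[keep], intensity[keep]

    # Step 2: map angles to cell indices (10 deg x 7.5 deg cells by default).
    h_idx = ((azimuth + h_half_fov) / (2 * h_half_fov) * n_h).astype(int)
    v_idx = ((elevation + v_half_fov) / (2 * v_half_fov) * n_v).astype(int)
    h_idx = np.minimum(h_idx, n_h - 1)  # points exactly on the upper edge
    v_idx = np.minimum(v_idx, n_v - 1)

    # Step 3: keep the maximum intensity per cell.
    np.maximum.at(grid, (h_idx, v_idx), intensity)
    return grid
```

Step 4 then amounts to `np.concatenate([drone_state, grid.ravel()])`, where `grid.ravel()` has 96 entries for the default $12 \times 8$ grid and `drone_state` stands for the drone state vector.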

The look-ahead guidance point $p_{exp}$ is selected from the global path. A circle is drawn centered on the nearest forward path point. If subsequent points lie outside this circle, that point is chosen. If a subsequent point lies inside, the midpoint between the current and next point is selected as the guidance point.

Reward Function Design

The high-level planner’s reward function consists of two parts:

$$
r = r_{obs} + r_{action}
$$

The observation-based reward $r_{obs}$ includes sparse rewards for reaching the goal or colliding, and a dense distance penalty. The action reward $r_{action}$ evaluates if the generated point is close to the optimal global path point and avoids obstacles.

$$
r_{obs} = \begin{cases}
+100 & \text{if final goal reached} \\
-60 & \text{if collision occurs} \\
-k \cdot dist(s, s_{goal}) & \text{per-step penalty}
\end{cases}
$$

$$
r_{action} = -k_1 \cdot \Delta_{goal}
$$

where $dist(s, s_{goal})$ is the distance to the final goal, and $\Delta_{goal}$ is the deviation of the generated temporary target from the ideal path-following point. The constants $k=0.1$ and $k_1=0.05$ provide a balance between goal convergence and obstacle avoidance flexibility.
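The two reward terms can be combined as in the sketch below. It is illustrative only: the function signature is hypothetical, and it assumes that the dense distance penalty applies on every step in which neither the goal nor a collision event occurs, and that $\Delta_{goal}$ is the Euclidean distance between the generated temporary target and the ideal path-following point.

```python
import numpy as np

K_DIST, K_ACTION = 0.1, 0.05  # k and k_1 as reported in the text

def high_level_reward(pos, goal, temp_target, ideal_point,
                      reached=False, collided=False):
    """r = r_obs + r_action for one high-level step."""
    pos, goal = np.asarray(pos, dtype=float), np.asarray(goal, dtype=float)
    temp_target = np.asarray(temp_target, dtype=float)
    ideal_point = np.asarray(ideal_point, dtype=float)

    if reached:
        r_obs = 100.0
    elif collided:
        r_obs = -60.0
    else:
        r_obs = -K_DIST * float(np.linalg.norm(goal - pos))

    # Deviation of the generated target from the ideal path-following point.
    r_action = -K_ACTION * float(np.linalg.norm(temp_target - ideal_point))
    return r_obs + r_action
```

For a drone 10 units from the goal whose temporary target deviates by 2 units from the ideal point, the reward is $-0.1 \cdot 10 - 0.05 \cdot 2 = -1.1$.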

Integrated Training Process

The training process for the high-level planner is similar to the low-level controller but requires calling the already trained low-level controller to participate in environment interaction. Therefore, the performance of the high-level planner is directly influenced by the accuracy of the low-level controller. When the low-level controller’s accuracy is insufficient, the temporary target points generated by the high-level planner are difficult to track accurately, causing state transitions to deviate from expectations and trapping the high-level planner in a vicious cycle of “planning-loss of control-replanning.” Consequently, in the training workflow, this paper adopts an order of training the low-level controller first to build a precise tracking foundation, then training the high-level planner based on this fixed policy, ensuring stable convergence and efficient collaboration within the hierarchical framework.

Table 2: Training Process of the High-Level Planner
Step	Description
1	Initialization: Initialize network parameters ($Q_{\phi_1}, Q_{\phi_2}, \pi_{\theta}, V_{\psi}, V_{\psi'}$), experience replay buffer $R$, and hyperparameters.
2	For each episode: Reset environment to initial state $s_t$.
3	For each high-level step: Generate action $a_t = [p_x, p_y, p_z]_{body}$ from $\pi_{\theta}(s_t)$.
4	Low-Level Execution Loop: For $k = 1$ to $(f_{low}/f_{high})$: Pass $a_t$ to the fixed low-level controller, which executes control for one low-level period, updating the unmanned drone state.
5	Observe new state $s_{t+1}$ and reward $r_t$.
6	Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$.
7	Sample mini-batch from $R$ and update all networks (Critic, Actor, Value, Target) using SAC updates as described in the low-level section, but with the high-level’s discount factor $\gamma=0.999$.
8	Repeat until the unmanned drone reaches the goal, collides, or times out.

This nested training structure allows the high-level planner to implicitly learn the capabilities and constraints of the low-level controller, as the observed state $s_{t+1}$ is the result of the low-level controller’s actions. The frequency matching mechanism $(f_{low} / f_{high} = 15)$ ensures the high-level planner has sufficient time for global observation and replanning while maintaining the low-level controller’s real-time tracking precision.
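The nesting of the two control loops (steps 3 to 5 of Table 2) can be sketched as follows. This is a schematic under stated assumptions: `planner_policy`, `low_level_policy`, and `env_step` are hypothetical callables standing in for the trained networks and the simulator, and early termination inside the inner loop is an assumed detail.

```python
def high_level_step(state, planner_policy, low_level_policy, env_step, substeps=15):
    """Run one high-level decision followed by `substeps` low-level control periods.

    planner_policy(state) -> temporary target in the body frame
    low_level_policy(state, target) -> control command
    env_step(state, command) -> (next_state, done)
    """
    target_body = planner_policy(state)  # one high-level action
    done = False
    for _ in range(substeps):            # f_low / f_high low-level periods
        command = low_level_policy(state, target_body)
        state, done = env_step(state, command)
        if done:                         # goal reached, collision, or timeout
            break
    return target_body, state, done
```

The state returned after the inner loop is what the high-level planner stores as $s_{t+1}$, which is why the planner's transitions already reflect how well the fixed low-level controller tracks its targets.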

Experimental Validation and Analysis

The deep reinforcement learning algorithms were used to train the low-level controller and the high-level planner separately. Comparative experiments and analysis were conducted in simulation environments by adjusting different algorithms and parameter combinations.

Simulation Environment and Parameters

To validate the effectiveness of the proposed hierarchical reinforcement learning framework, tests were conducted in different environmental maps. The field size was $200 \times 110 \times 110$ units, containing obstacles of various shapes arranged in different configurations. Different start and target positions were set. The LiDAR simulator generated point cloud data at 30 Hz with a maximum range of 10 m, a $120^\circ$ horizontal FOV ($10.0^\circ$ resolution), and a $60^\circ$ vertical FOV ($7.5^\circ$ resolution). Gaussian noise ($\sigma=0.05$ m) was added to the point cloud. The polar grid resolution was set to 12 (horizontal) $\times$ 8 (vertical).

Algorithm and Parameter Selection

To verify the feasibility of the proposed algorithm and select the most suitable DRL architecture and parameters, experiments were conducted with different algorithms. Key parameters like training steps, control frequency, and network structure were tested. The results for the low-level controller are summarized below:

Table 3: Low-Level Controller Training Results Comparison
Algorithm	Training Steps	Control Freq (Hz)	Position Error (mm)
SAC	2000	80	0.081, 0.074, 0.115
SAC	2500	100	0.069, 0.079, 0.107
DDPG	2000	100	0.094, 0.095, 0.123
SAC	3000	150	0.067, 0.071, 0.105

The results show that SAC, with its maximum entropy mechanism, outperforms DDPG. Controller precision improves with training steps and control frequency, with 150Hz being optimal under hardware constraints. For the integrated navigation system, the results are as follows:

Table 4: High-Level Planner Training Results Comparison
Planner Algorithm	Training Steps	Plan Freq (Hz)	Success Rate	Collision Avoidance Rate	Avg. Path Length
SAC	15000	20	100%	95.8%	1485
SAC	10000	10	98.3%	87.5%	1456
DDPG	15000	10	96.5%	74.4%	1487
SAC	15000	10	100%	95.1%	1401

SAC consistently shows higher success and avoidance rates than DDPG. With the same 15000 training steps, a planning frequency of 10 Hz yields a shorter average path than 20 Hz (1401 vs. 1485), which we attribute to less unnecessary re-planning. The impact of low-level controller precision on the high-level planner's performance was also tested:

Table 5: Impact of Low-Level Controller Precision
Low-Level Pos. Error (mm)	High-Level Success Rate	Training Episodes to Converge
0.12	82.1%	16700
0.09	94.5%	10000
0.07	97.7%	8200

This indicates a strong dependence: higher low-level precision provides a more stable state transition environment, raising the high-level success rate from 82.1% to 97.7% and roughly halving the number of episodes needed to converge (from 16700 to 8200).

Sensitivity Analysis and Hyperparameter Tuning

A systematic sensitivity analysis was conducted for key hyperparameters. For the high-level planner, varying the number of neurons in the hidden layers showed that 256 neurons provided the best balance between error and training time. Testing learning rates indicated that $3 \times 10^{-4}$ yielded the fastest convergence with stable training. Similar trends were observed for the low-level controller. This analysis supports the chosen parameters and provides a reference for tuning.

Table 6: Sensitivity Analysis of Key Parameters
Parameter Tested	Values Compared	Observation & Optimal Choice
Hidden Neurons	64, 128, 256, 512	256 neurons best balance error (0.067) and training time.
Learning Rate	$1\times10^{-4}, 3\times10^{-4}, 1\times10^{-3}$	$3\times10^{-4}$ gives fastest convergence and stable training.
Reward $k_1$ (Low-Level)	0.5, 1.0, 2.0, 3.0	Optimal range [0.8, 1.5]. $k_1=1.0$ chosen.
Reward $k$ (High-Level)	0.02, 0.05, 0.1, 0.2	$k=0.1$ balances path length (1401) and avoidance (95.1%).

Comparative Performance Evaluation

To evaluate our method, we compared it against three representative unmanned drone navigation approaches from recent literature under identical simulation conditions. The four methods in the comparison are: a traditional geometric search method (A* + MPC), a single-layer RL method (PPO), a baseline hierarchical RL method (HRL-DDPG), and our method (HRL-SAC).

Table 7: Performance Comparison with Existing Methods
Method	Navigation Success Rate	Collision Avoidance Rate	Avg. Path Length	Training Episodes to Converge
A* + MPC	85.3%	82.1%	1523	N/A (No training)
PPO (Single-Layer)	78.6%	71.3%	1587	~29000
HRL-DDPG (Baseline)	91.2%	83.5%	1468	2400 + 9800
Our HRL-SAC	97.7%	91.8%	1401	1500 + 8200

The results demonstrate the advantages of our proposed HRL-SAC framework. Compared to the traditional A*+MPC, our method improves the success rate (97.7% vs. 85.3%), the collision avoidance rate (91.8% vs. 82.1%), and the average path length (1401 vs. 1523), highlighting its adaptability in dynamic environments without relying on precise models. Against the single-layer PPO, our hierarchical approach mitigates the “curse of dimensionality” by decoupling the tasks, leading to a much higher success rate, better avoidance, and considerably faster convergence (1500 + 8200 episodes vs. roughly 29000). Compared to the HRL-DDPG baseline, our use of the maximum-entropy SAC algorithm provides better exploration, resulting in a higher avoidance rate, shorter paths, and fewer training episodes. The asynchronous, frequency-matched interaction between the high-level planner (10 Hz) and the low-level controller (150 Hz) is a key factor in this performance, enabling efficient and robust hierarchical control for the unmanned drone.

Conclusion

This paper proposes a hierarchical reinforcement learning (HRL) based framework for autonomous navigation control of unmanned drones. The framework adopts a two-layer structure: a high-level planner generates temporary target points based on global environment information, and a low-level controller executes high-precision trajectory tracking according to the current state, ensuring the unmanned drone smoothly reaches the designated positions. Through this hierarchical design, the algorithm can simultaneously address the needs of global path planning and local dynamic control. Notably, the designed asynchronous frequency matching mechanism helps mitigate the temporal coupling issues common in hierarchical reinforcement learning. Experimental results show that this collaborative mechanism allows the high-level planner to implicitly learn the dynamic constraints of the low-level controller, shortening the path while maintaining the collision avoidance rate, which supports the practicality and robustness of the hierarchical framework.

During training, the high-level planner and low-level controller obtain state, action, and reward information through interaction with a simulated environment, eliminating the need for precise dynamic modeling and demonstrating good model adaptability. Furthermore, the hierarchical reinforcement learning framework effectively mitigates the curse of dimensionality, accelerates training convergence, and reduces the risk of getting stuck in local optima.

Simulation tests verify that the designed HRL control framework can achieve stable navigation and attitude control for unmanned drones in complex environments. The systematic sensitivity analysis of key parameters (hidden-layer width, learning rate, and reward coefficients) supports the parameters selected for this task and provides a basis for the reproducibility and practical deployment of the algorithm. The simulation results also suggest that the proposed algorithm possesses a certain level of robustness against sensor noise and environmental uncertainties. Experimental results further indicate that the design of the hierarchical structure and of the reward function significantly affects training effectiveness and control performance. Future work will investigate the online learning and adaptive control capabilities of hierarchical reinforcement learning on real unmanned drone platforms to achieve a higher level of autonomous intelligent flight.