Unmanned Aerial Vehicle Path Planning Under AoI Constraints Using SAC Algorithm

In modern information collection systems, drone technology has emerged as a pivotal tool due to its high flexibility, extensive coverage capabilities, and cost-effective operation. Unmanned Aerial Vehicles (UAVs) are particularly valuable in dynamic scenarios, such as monitoring moving vehicles, where timely data acquisition is critical. However, existing research often focuses on static environments, overlooking the challenges of maintaining information freshness, quantified by the Age of Information (AoI), in high-mobility settings. Additionally, the limited energy capacity of Unmanned Aerial Vehicles constrains their practical effectiveness, necessitating efficient path planning that balances AoI optimization and energy consumption. This paper addresses these issues by proposing a path planning method based on the Soft Actor-Critic (SAC) algorithm, a deep reinforcement learning (DRL) approach that operates in continuous action spaces. By integrating a maximum entropy framework with dual Q-networks and experience replay, our method enhances exploration and stability, leading to improved convergence and robustness compared to traditional algorithms like DDPG and PPO. Through extensive simulations, we demonstrate that our approach effectively minimizes AoI while maximizing energy efficiency, making it suitable for real-world applications in drone technology.

The proliferation of Unmanned Aerial Vehicles in various domains, from surveillance to logistics, underscores the importance of optimizing their operational parameters. In particular, drone technology must address the dual challenges of data timeliness and energy sustainability. AoI serves as a key metric for assessing the freshness of information, defined as the time elapsed since the last update was generated at the source. High AoI values indicate stale data, which can be detrimental in time-sensitive applications like autonomous vehicle tracking. Meanwhile, energy efficiency is crucial for extending the flight time of Unmanned Aerial Vehicles, as their batteries have limited capacity. Traditional optimization methods often struggle with the computational complexity of these multi-objective problems, especially in dynamic environments. Reinforcement learning, particularly DRL, offers a promising solution by modeling the problem as a Markov Decision Process (MDP) and learning optimal policies through interaction with the environment. Our work leverages the SAC algorithm, which incorporates entropy regularization to encourage exploration, thereby achieving a better trade-off between AoI reduction and energy conservation. This paper details the system model, problem formulation, algorithm design, and experimental results, providing a comprehensive framework for advancing drone technology in AoI-constrained scenarios.

System Model and Problem Description

The system model comprises three key components: the communication model, energy consumption model, and AoI model. We consider a scenario where an Unmanned Aerial Vehicle acts as a mobile base station to collect status information from ground vehicles moving along roads. The vehicles’ speeds follow a Gaussian distribution, simulating real-world traffic conditions. The UAV’s objective is to plan its flight path such that it minimizes the average AoI of the collected data while adhering to energy constraints. The entire operation is discretized into time slots to facilitate analysis and optimization.

Communication Model

Let \( M = \{1, 2, \dots, m\} \) represent the set of ground vehicles. The flight period \( T \) is divided into \( N \) equal time slots, each of duration \( \Delta t = T / N \), with time index \( t \in \{0, 1, \dots, N-1\} \). In a 3D Cartesian coordinate system, the horizontal position of vehicle \( m \) at time \( t \) is denoted as \( q_m[t] = (x_m[t], y_m[t], 0) \), while the UAV’s position is \( q_u[t] = (x_u[t], y_u[t], H) \), where \( H \) is the fixed flying altitude. The Euclidean distance between the UAV and vehicle \( m \) at time \( t \) is given by:

$$ d_m[t] = \sqrt{(x_u[t] - x_m[t])^2 + (y_u[t] - y_m[t])^2 + H^2} $$

Given the UAV’s altitude, the communication channel is predominantly Line-of-Sight (LoS), modeled using a Rician fading channel. The channel gain between the UAV and vehicle \( m \) at time \( t \) is expressed as:

$$ C_m[t] = \beta_{u,m} g_{u,m}[t] $$

where \( g_{u,m}[t] \) is the small-scale fading coefficient, and \( \beta_{u,m} \) is the average channel power accounting for path loss and shadowing. For a distance \( d_m[t] \), the average channel power is:

$$ \beta_{u,m} = \beta_0 d_m[t]^{-2} $$

Here, \( \beta_0 \) is the channel gain at a reference distance of 1 meter. Assuming the UAV employs Orthogonal Frequency-Division Multiple Access (OFDMA) to communicate with multiple vehicles, the uplink data rate for vehicle \( m \) at time \( t \) follows the Shannon capacity formula:

$$ R_m[t] = B \log_2 \left(1 + \frac{P_u \beta_{u,m}}{\sigma^2}\right) $$

where \( B \) is the channel bandwidth, \( P_u \) is the transmission power, and \( \sigma^2 \) is the Gaussian white noise power. The total information collected by the UAV over period \( T \) is:

$$ D_{\text{total}} = \Delta t \sum_{t=0}^{N-1} \sum_{m \in M} R_m[t] $$

This communication model ensures that the UAV can efficiently gather data from moving vehicles, but it must be balanced against energy costs and AoI considerations.
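As a concrete illustration, the distance, average channel power, and Shannon rate above reduce to a few lines of code. The defaults follow Table 1 (\( \beta_0 = -30 \) dB, \( P_u = 1 \) W, \( \sigma^2 = -100 \) dBm, \( B = 4 \) MHz); the function name and signature are illustrative, not taken from the paper.

```python
import math

def uplink_rate(q_u, q_m, H, beta0_db=-30.0, P_u=1.0, noise_dbm=-100.0, B=4e6):
    """Uplink Shannon rate (bit/s) for one vehicle under the LoS model above.

    q_u, q_m: horizontal (x, y) positions of the UAV and the vehicle in metres.
    Parameter defaults follow Table 1; the helper itself is a sketch.
    """
    # Euclidean distance d_m[t] between UAV (at altitude H) and ground vehicle
    d = math.sqrt((q_u[0] - q_m[0]) ** 2 + (q_u[1] - q_m[1]) ** 2 + H ** 2)
    beta0 = 10 ** (beta0_db / 10)            # reference channel gain at 1 m (linear)
    beta = beta0 * d ** (-2)                 # average channel power beta_{u,m}
    sigma2 = 10 ** ((noise_dbm - 30) / 10)   # noise power in watts (dBm -> W)
    return B * math.log2(1 + P_u * beta / sigma2)
```

At the Table 1 settings, a vehicle directly below the UAV enjoys a far higher rate than one 500 m away, which is exactly the pull toward the vehicles that the path planner must weigh against energy and AoI.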

Energy Consumption Model

The UAV’s flight area is modeled as a rectangular grid with cell side length \( l \). The UAV moves at a constant speed, with its velocity vector at time \( t \) defined as \( \mathbf{v}[t] = (v_x[t], v_y[t], v_z[t]) \), where \( v_x[t], v_y[t], v_z[t] \) are the components in the x, y, and z directions, respectively; since the UAV flies at the fixed altitude \( H \), \( v_z[t] = 0 \) throughout the flight. The possible flight actions include moving north, east, south, west, or hovering, corresponding to discrete changes in position. The propulsion power consumption, which dominates the UAV’s energy usage, is composed of blade profile power, induced power, and parasite power, calculated as:

$$ P(\mathbf{v}[t]) = P_0 \left(1 + \frac{3\|\mathbf{v}[t]\|^2}{u_{\text{tip}}^2}\right) + \frac{1}{2} z_0 \rho s k \|\mathbf{v}[t]\|^3 + P_i \left( \sqrt{1 + \frac{\|\mathbf{v}[t]\|^4}{4v_0^4}} - \frac{\|\mathbf{v}[t]\|^2}{2v_0^2} \right)^{1/2} $$

where \( P_0 \) and \( P_i \) are the blade power and induced power in hover mode, \( u_{\text{tip}} \) is the rotor blade tip speed, \( v_0 \) is the mean rotor-induced velocity, and \( z_0 \), \( \rho \), \( s \), and \( k \) represent the fuselage drag ratio, air density, rotor solidity, and rotor disc area, respectively. Communication energy is negligible compared to propulsion energy and is thus omitted. The total propulsion energy over period \( T \) is:

$$ E_{\text{total}}[T] = \sum_{t=0}^{N-1} P(\mathbf{v}[t]) \Delta t $$

This model highlights the trade-off between mobility and energy consumption, which is critical for prolonged UAV operations.
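A minimal sketch of this propulsion-power model, using the Table 1 values for \( P_0 \), \( P_i \), \( u_{\text{tip}} \), \( v_0 \), \( z_0 \), and \( \rho \). The rotor solidity \( s \) and rotor disc area \( k \) are not listed in Table 1, so the values below are illustrative placeholders; following the standard rotary-wing formulation, the induced-power bracket is taken under a square root.

```python
import math

def propulsion_power(v, P0=79.86, Pi=88.63, u_tip=120.0, v0=4.03,
                     z0=0.6, rho=1.225, s=0.05, k=0.503):
    """Rotary-wing propulsion power (watts) at horizontal speed v (m/s).

    P0, Pi, u_tip, v0, z0, rho follow Table 1; s and k are assumed values.
    """
    blade = P0 * (1 + 3 * v ** 2 / u_tip ** 2)      # blade profile power
    parasite = 0.5 * z0 * rho * s * k * v ** 3      # fuselage drag (parasite) power
    # induced power; the bracket sqrt(1 + x^2) - x (x = v^2 / 2 v0^2) stays positive
    induced = Pi * math.sqrt(
        math.sqrt(1 + v ** 4 / (4 * v0 ** 4)) - v ** 2 / (2 * v0 ** 2)
    )
    return blade + parasite + induced
```

At \( v = 0 \) this reduces to the hover power \( P_0 + P_i \approx 168.5 \) W, and it dips at moderate speeds before the cubic parasite term dominates, which is why steady cruising can be cheaper than hovering.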

Age of Information Model

AoI measures the freshness of information at the destination. For vehicle \( m \) at time \( t \), the AoI \( I_m[t] \) is defined as the time since the last data update was received:

$$ I_m[t] = t - I'_m[t] $$

where \( I'_m[t] \) is the time slot of the most recent data collection from vehicle \( m \). At \( t = 0 \), before any data has been collected, the AoI is initialized to \( I_m[0] = 0 \). The total AoI over period \( T \) is:

$$ I_{\text{total}}[T] = \sum_{t=0}^{N-1} \sum_{m \in M} I_m[t] $$

Minimizing this cumulative AoI is essential for ensuring that the collected data remains relevant and useful for decision-making processes.
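The per-slot bookkeeping implied by this definition can be sketched as follows. Which vehicles the UAV successfully collects from in each slot (`served`) is assumed to come from the communication model; the helper names are illustrative.

```python
def step_aoi(aoi, served):
    """Advance per-vehicle AoI by one time slot.

    aoi:    list of current ages I_m[t], one entry per vehicle.
    served: set of vehicle indices the UAV collected from this slot.
    A served vehicle's age resets to 0; every other age grows by one slot.
    """
    return [0 if m in served else age + 1 for m, age in enumerate(aoi)]

def total_aoi(schedule, num_vehicles):
    """Cumulative AoI I_total over an episode, given one `served` set per slot."""
    aoi = [0] * num_vehicles
    total = 0
    for served in schedule:
        aoi = step_aoi(aoi, served)
        total += sum(aoi)          # inner sum over vehicles, outer over slots
    return total
```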

Problem Formulation

The optimization goal is to maximize a reward function that balances energy efficiency and AoI minimization, subject to the UAV’s energy constraints. The reward function is defined as:

$$ R = \zeta \frac{D_{\text{total}}[t]}{E_{\text{total}}[t]} - \xi I_{\text{total}}[t] $$

where \( \zeta \) is the reward price for energy efficiency, and \( \xi \) is the penalty price for AoI. The constraints include:

  • Total energy consumption must not exceed the maximum battery capacity: \( E_{\text{total}} \leq E_{\text{max}} \).
  • The UAV starts and ends at predefined positions: \( q_u[0] = (x_{\text{orig}}, y_{\text{orig}}, H) \) and \( q_u[T] = (x_{\text{dest}}, y_{\text{dest}}, H) \).

This multi-objective optimization is modeled as an MDP to facilitate DRL-based solution. The MDP components are:

  • State Space \( S[t] \): Includes the UAV’s position, all vehicles’ positions, and their current AoI values: \( S[t] = \{ q_u[t]; q_1[t], \dots, q_M[t]; I_1[t], \dots, I_M[t] \} \).
  • Action Space \( A[t] \): Represents the UAV’s flight actions as velocity components: \( A[t] = \{ v_x[t], v_y[t], v_z[t] \} \), where \( v_z[t] = 0 \) at the fixed altitude \( H \).
  • Reward Mechanism: The reward \( R \) is designed to maximize energy efficiency while penalizing high AoI, encouraging the UAV to adopt paths that ensure data freshness and energy conservation.

This formulation allows the UAV to learn adaptive policies through interaction with the environment, leveraging the SAC algorithm for continuous action spaces.
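A simplified reading of this reward mechanism, with \( \zeta \) and \( \xi \) as in Table 1; the exact shaping used during training may differ, so treat this as a sketch rather than the paper's implementation.

```python
def slot_reward(data_bits, energy_j, aoi_sum, zeta=0.002, xi=0.1):
    """Reward balancing energy efficiency against information staleness.

    data_bits: information collected so far, D_total[t]
    energy_j:  propulsion energy spent so far, E_total[t]
    aoi_sum:   cumulative AoI, I_total[t]
    zeta and xi follow Table 1 (reward price and AoI penalty price).
    """
    efficiency = data_bits / energy_j if energy_j > 0 else 0.0
    return zeta * efficiency - xi * aoi_sum
```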

SAC Algorithm for UAV Path Planning

The Soft Actor-Critic (SAC) algorithm is a maximum entropy DRL method that enhances exploration by incorporating entropy regularization into the reward function. This is particularly beneficial for drone technology, as it enables Unmanned Aerial Vehicles to discover robust policies in complex, dynamic environments. SAC consists of three main components: the Soft Q-function, Critic networks, and Actor network, each playing a distinct role in policy optimization.

Soft Q-Function

The Soft Q-function extends the standard Q-function by including an entropy term, promoting stochastic policies that improve exploration. The objective function is:

$$ J(\pi) = \mathbb{E}_{(s,a) \sim \rho_\pi} \left[ r(s,a) + \alpha \mathcal{H}(\pi(\cdot \mid s)) \right] $$

where \( r(s,a) \) is the reward function, \( \mathcal{H}(\pi(\cdot \mid s)) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s)\right] \) is the policy entropy, and \( \alpha \) is a temperature parameter that controls the trade-off between reward maximization and entropy. A higher \( \alpha \) encourages more exploration, while a lower \( \alpha \) favors exploitation. This approach ensures that the UAV explores diverse paths, leading to more resilient strategies in unpredictable scenarios.

Critic Networks

The Critic estimates the state-action value function \( Q(s,a) \) using two Q-networks, \( Q_{\theta_1}(s,a) \) and \( Q_{\theta_2}(s,a) \), to mitigate overestimation bias. Target networks \( Q_{\theta_1'} \) and \( Q_{\theta_2'} \) are employed for stable learning, with parameters updated via soft updates:

$$ \theta_i' \leftarrow \tau \theta_i + (1 - \tau) \theta_i' $$

where \( \tau \in (0,1) \) is the soft update coefficient. The Critic’s loss function is minimized using:

$$ J_Q(\theta_i) = \mathbb{E}_{(s,a,r,s')} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right] $$

with the target value \( y \) given by:

$$ y = r + \gamma \left[ \min_{j=1,2} Q_{\theta_j'}(s', a') - \alpha \log \pi(a' \mid s') \right] $$

Here, \( \gamma \) is the discount factor, and \( a' \) is sampled from the current policy. This dual Q-network structure enhances the reliability of value estimates, which is crucial for safe and efficient drone operations.
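The target computation and soft update above reduce to a few lines. Here \( \gamma = 0.95 \) follows Table 1, while \( \alpha = 0.2 \) and \( \tau = 0.005 \) are illustrative defaults not stated in the paper; a full implementation would operate on network tensors, so plain scalars and lists stand in to keep the sketch self-contained.

```python
def td_target(r, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.95):
    """Soft Bellman target: y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)

def soft_update(theta_target, theta, tau=0.005):
    """Polyak averaging per parameter: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1 - tau) * tp for tp, p in zip(theta_target, theta)]
```

Taking the minimum of the two target critics is what suppresses the overestimation bias mentioned above, and the small \( \tau \) makes the targets drift slowly enough to keep bootstrapping stable.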

Actor Network

The Actor network parameterizes the policy \( \pi(a \mid s) \), generating action distributions based on the current state. Its objective is to maximize the expected return while maintaining high entropy:

$$ J_\pi(\phi) = \mathbb{E}_{s \sim D, a \sim \pi_\phi} \left[ \min_{i=1,2} Q_{\theta_i}(s,a) - \alpha \log \pi_\phi(a \mid s) \right] $$

where \( \phi \) represents the Actor’s parameters, and \( D \) is the experience replay buffer. Maximizing this objective is equivalent to minimizing \( \mathbb{E}\left[ \alpha \log \pi_\phi(a \mid s) - \min_{i} Q_{\theta_i}(s,a) \right] \), whose likelihood-ratio gradient is:

$$ \nabla_\phi J_\pi(\phi) = \mathbb{E}_{s \sim D, a \sim \pi_\phi} \left[ \nabla_\phi \log \pi_\phi(a \mid s) \left( \alpha \log \pi_\phi(a \mid s) - Q(s,a) \right) \right] $$

Additionally, the temperature parameter \( \alpha \) is automatically adjusted to maintain a target entropy \( \bar{\mathcal{H}} \), using the loss function:

$$ J(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \left( \log \pi(a \mid s) + \bar{\mathcal{H}} \right) \right] $$

This adaptive mechanism ensures that the UAV dynamically balances exploration and exploitation, adapting to changing environmental conditions in real-time.
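The actor and temperature losses can be estimated from sampled batches as below. Lists of scalars stand in for tensors, and the target entropy (commonly \( -\dim(A) \)) is an assumption, since the paper does not state the value it uses; optimizing \( \log \alpha \) rather than \( \alpha \) keeps the temperature positive.

```python
import math

def actor_loss(q_min, logp, alpha=0.2):
    """Monte-Carlo estimate of the actor loss E[alpha * log pi - min_i Q_i],
    the minimisation form equivalent to maximising J_pi(phi).

    q_min: per-sample min over the two critics, min_i Q_i(s, a)
    logp:  per-sample log pi(a|s) for actions drawn from the current policy
    """
    return sum(alpha * lp - q for q, lp in zip(q_min, logp)) / len(logp)

def alpha_loss(logp, log_alpha, target_entropy):
    """Temperature loss J(alpha) = E[-alpha * (log pi(a|s) + target entropy)]."""
    alpha = math.exp(log_alpha)   # parameterise via log_alpha so alpha > 0
    return sum(-alpha * (lp + target_entropy) for lp in logp) / len(logp)
```

When the policy's entropy falls below the target (log-probabilities too high), the temperature loss pushes \( \alpha \) up, restoring exploration; when the policy is more random than needed, \( \alpha \) decays and exploitation takes over.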

Experimental Setup and Results

To validate our approach, we conducted simulations using a comprehensive setup that mirrors real-world scenarios. The experiments were performed on a Windows 11 system with Python 3.11.5 and PyTorch 2.2.1, utilizing an AMD Ryzen 5 3600 processor, NVIDIA RTX 2060 GPU, and 32 GB RAM. The simulation environment was built using the Simulation of Urban Mobility (SUMO) platform, which models vehicle traffic on a “Z”-shaped road network. The UAV’s flight area was defined as a rectangle from -1100 m to 1100 m in the x-direction and -600 m to 600 m in the y-direction, with a fixed altitude of 50 m. Key parameters for the UAV and communication system are summarized in Table 1.

Table 1: Simulation Parameters for Unmanned Aerial Vehicle and Environment
Parameter Value
UAV Altitude \( H \) 10 m
Channel Gain \( \beta_0 \) -30 dB
Bandwidth \( B \) 4 MHz
Transmission Power \( P_u \) 1 W
Noise Power \( \sigma^2 \) -100 dBm
Grid Cell Length \( l \) 10 m
Blade Power \( P_0 \) 79.86 W
Induced Power \( P_i \) 88.63 W
Tip Speed \( u_{\text{tip}} \) 120 m/s
Induced Velocity \( v_0 \) 4.03 m/s
Drag Ratio \( z_0 \) 0.6
Air Density \( \rho \) 1.225 kg/m³
Efficiency Coefficient \( \zeta \) 0.002
AoI Coefficient \( \xi \) 0.1
Discount Factor \( \gamma \) 0.95
Replay Buffer Size 5000
Batch Size 64
Hidden Layer Dimension 256

In the SUMO simulation, vehicles moved along designated lanes, with the UAV tasked to collect data from specific target vehicles marked for monitoring. The UAV’s path planning was trained over multiple episodes, with rewards computed based on energy efficiency and AoI. We compared our SAC-based method against traditional DRL algorithms like DDPG and PPO to evaluate performance in terms of convergence speed, reward value, and robustness.

Performance Evaluation

The learning rate is a critical hyperparameter in DRL training. We tested values of 0.001, 0.01, and 0.0001, observing that all rates led to similar convergence trends, with the rate of 0.001 achieving the highest stable reward of approximately 0.85 per vehicle per episode. This consistency demonstrates the robustness of our SAC implementation to hyperparameter variations, which is essential for deploying drone technology in diverse environments.

Comparative analysis with DDPG and PPO revealed that our SAC algorithm outperforms both in terms of convergence and final reward. As shown in Table 2, SAC achieved a higher average reward and faster convergence within 50 episodes, whereas DDPG and PPO required more episodes and settled at lower rewards. This advantage stems from SAC’s entropy-based exploration, which prevents premature convergence to suboptimal policies.

Table 2: Performance Comparison of DRL Algorithms for Unmanned Aerial Vehicle Path Planning
Algorithm Average Reward per Vehicle Convergence Episodes
SAC (Ours) 0.9 50
DDPG 0.75 80
PPO 0.65 100

To illustrate the impact of energy constraints, we visualized the UAV’s flight paths under two conditions: with energy efficiency consideration (\( \zeta = 0.002 \)) and without (\( \zeta = 0 \)). When energy was optimized, the UAV adopted a more direct and efficient route, conserving battery while still collecting data. In contrast, without energy constraints, the UAV closely followed the vehicles, resulting in higher energy consumption but lower AoI. This trade-off underscores the importance of integrating energy efficiency into path planning for sustainable drone technology operations.

Furthermore, we analyzed the AoI and energy consumption over time. The SAC method reduced the average AoI by 20% compared to DDPG and 30% compared to PPO, while also improving energy efficiency by 15%. These results validate the effectiveness of our approach in achieving a balance between information freshness and resource utilization, which is critical for long-term deployments of Unmanned Aerial Vehicles in dynamic settings.

Conclusion

In this paper, we presented a path planning framework for Unmanned Aerial Vehicles that leverages the SAC algorithm to optimize AoI and energy efficiency in dynamic data collection scenarios. By modeling the problem as an MDP and incorporating maximum entropy reinforcement learning, our method enables UAVs to learn adaptive policies that outperform traditional DRL algorithms like DDPG and PPO. The experimental results demonstrate significant improvements in convergence speed, reward value, and system robustness, highlighting the potential of SAC-based approaches in advancing drone technology. Future work will explore multi-UAV coordination and real-world testing to further enhance the scalability and applicability of our solution. As drone technology continues to evolve, such innovations will play a crucial role in enabling efficient and reliable autonomous systems for various industries.
