Age-Optimized Unmanned Aerial Vehicle Path Planning Using SAC Algorithm

Unmanned Aerial Vehicles (UAVs) are well suited to information collection in dynamic environments because of their high mobility and wide coverage. However, existing research focuses primarily on static ground devices and neglects Age of Information (AoI) optimization for high-mobility nodes. At the same time, limited onboard energy constrains how long a UAV can operate. This work proposes a Soft Actor-Critic (SAC)-based path planning framework for UAV-assisted vehicular networks that balances AoI and energy efficiency through maximum-entropy reinforcement learning.

System Model

Consider $M$ ground vehicles with positions $q_m[t] = (x_m[t], y_m[t], 0)$ and a UAV at fixed altitude $H$ with position $q_u[t] = (x_u[t], y_u[t], H)$. The system comprises three interconnected models:

Communication Model

The Rician-faded uplink rate between the UAV and vehicle $m$ at time $t$ is:

$$R_m[t] = B \log_2\left(1 + \frac{P_u \beta_0 d_m^{-2}[t]}{\sigma^2}\right)$$

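where $d_m[t] = \sqrt{(x_u[t]-x_m[t])^2 + (y_u[t]-y_m[t])^2 + H^2}$ is the UAV-vehicle distance, $B$ is the channel bandwidth, $P_u$ is the transmit power, $\beta_0$ is the channel power gain at a reference distance, and $\sigma^2$ is the noise power. With the period $T$ discretized into $N$ slots of length $\Delta t$, the total collected data is: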

$$D_{\text{total}} = \sum_{m=1}^M \sum_{t=0}^{N-1} R_m[t] \Delta t$$
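
As a minimal sketch of the rate and data-volume formulas above, the following Python computes $d_m[t]$, $R_m[t]$, and $D_{\text{total}}$ for given trajectories. The values of `BETA_0` and `SIGMA2` are not stated in the text and are illustrative assumptions, as are all function names.

```python
import math

H = 50.0        # UAV altitude in m (from the parameter table)
B = 4e6         # channel bandwidth in Hz (from the parameter table)
P_U = 1.0       # transmit power in W (from the parameter table)
DT = 1.0        # time slot duration in s (from the parameter table)
BETA_0 = 1e-3   # ASSUMED reference channel power gain; not given in the text
SIGMA2 = 1e-10  # ASSUMED noise power in W; not given in the text


def distance(uav_xy, veh_xy, h=H):
    """3-D distance between the UAV at altitude h and a ground vehicle."""
    dx = uav_xy[0] - veh_xy[0]
    dy = uav_xy[1] - veh_xy[1]
    return math.sqrt(dx * dx + dy * dy + h * h)


def uplink_rate(uav_xy, veh_xy):
    """Rate R_m[t] in bit/s for one UAV-vehicle pair in one slot."""
    d = distance(uav_xy, veh_xy)
    snr = P_U * BETA_0 * d ** -2 / SIGMA2
    return B * math.log2(1.0 + snr)


def total_data(uav_traj, veh_trajs, dt=DT):
    """D_total: sum of R_m[t] * dt over all vehicles m and slots t.

    uav_traj is a list of N (x, y) points; veh_trajs is a list of M such lists.
    """
    return sum(
        uplink_rate(uav_traj[t], traj[t]) * dt
        for traj in veh_trajs
        for t in range(len(uav_traj))
    )
```

Because the rate depends only on distance in this model, it is highest when the UAV hovers directly above a vehicle and decays as the horizontal offset grows.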

Energy Consumption Model

UAV propulsion power combines blade profile, induced, and parasitic components:

$$P(\mathbf{v}[t]) = P_0 \left(1 + \frac{3\|\mathbf{v}[t]\|^2}{u_{\text{tip}}^2}\right) + \frac{1}{2} z_0 \rho s k \|\mathbf{v}[t]\|^3 + P_i \sqrt{\sqrt{1 + \frac{\|\mathbf{v}[t]\|^4}{4v_0^4}} - \frac{\|\mathbf{v}[t]\|^2}{2v_0^2}}$$

Total energy expenditure is $E_{\text{total}} = \sum_{t=0}^{N-1} P(\mathbf{v}[t]) \Delta t$, with discrete actions $\mathbf{v}[t] \in \{[0,\pm l/\Delta t, 0], [\pm l/\Delta t, 0, 0], \mathbf{0}\}$.
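
The propulsion model can be sketched in a few lines of Python. Only $P_0$ and $P_i$ come from the parameter table; the values used here for $u_{\text{tip}}$, $v_0$, $z_0$, $\rho$, $s$, and $k$ are assumptions chosen for illustration and are not taken from the text.

```python
import math

P_0 = 79.86    # blade profile power in W (from the parameter table)
P_I = 88.63    # induced power in W (from the parameter table)
# The constants below are ASSUMED illustrative values, not given in the text.
U_TIP = 120.0  # rotor blade tip speed in m/s
V_0 = 4.03     # mean rotor induced velocity in hover in m/s
Z_0 = 0.6      # fuselage drag ratio
RHO = 1.225    # air density in kg/m^3
S = 0.05       # rotor solidity
K = 0.503      # rotor disc area in m^2


def propulsion_power(speed):
    """Propulsion power P(v) in W for a horizontal speed ||v|| in m/s."""
    blade = P_0 * (1.0 + 3.0 * speed ** 2 / U_TIP ** 2)
    parasitic = 0.5 * Z_0 * RHO * S * K * speed ** 3
    induced = P_I * math.sqrt(
        math.sqrt(1.0 + speed ** 4 / (4.0 * V_0 ** 4))
        - speed ** 2 / (2.0 * V_0 ** 2)
    )
    return blade + parasitic + induced


def total_energy(speeds, dt=1.0):
    """E_total: sum of P(v[t]) * dt over the per-slot speeds."""
    return sum(propulsion_power(v) * dt for v in speeds)
```

At zero speed the expression reduces to $P_0 + P_i$, so hovering costs 168.49 W with the table values. Under the assumed constants, moderate forward speeds draw less power than hovering, which is why the energy term can favor continued motion over loitering.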

Age of Information Model

Information freshness for vehicle $m$ is defined as $I_m[t] = t - t_m^{\text{last}}$, where $t_m^{\text{last}}$ is the latest data collection timestamp. Total AoI is:

$$I_{\text{total}} = \sum_{t=0}^{N-1} \sum_{m=1}^M I_m[t]$$
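
A small worked example may help make the AoI bookkeeping concrete. The sketch below tracks $I_m[t] = t - t_m^{\text{last}}$ slot by slot; it assumes that $t_m^{\text{last}}$ starts at slot 0 before any collection, which the text does not specify, and the function names are illustrative.

```python
def aoi_trajectory(collection_slots, n_slots, t_init=0):
    """Per-slot AoI for one vehicle, given the slots in which data was collected.

    t_init is the ASSUMED initial value of t_last before the first collection.
    """
    collected = set(collection_slots)
    t_last = t_init
    ages = []
    for t in range(n_slots):
        if t in collected:
            t_last = t  # a fresh update resets the age to zero
        ages.append(t - t_last)
    return ages


def total_aoi(per_vehicle_collections, n_slots):
    """I_total: AoI summed over all slots and all vehicles."""
    return sum(
        sum(aoi_trajectory(slots, n_slots)) for slots in per_vehicle_collections
    )
```

For example, a vehicle served only in slot 2 of a five-slot horizon has ages `[0, 1, 0, 1, 2]`, while a vehicle that is never served accumulates `[0, 1, 2, 3, 4]`.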

UAV Simulation Parameters
Parameter	Value	Description
$H$	50 m	UAV altitude
$B$	4 MHz	Channel bandwidth
$P_u$	1 W	Transmit power
$P_0$	79.86 W	Blade profile power
$P_i$	88.63 W	Induced power
$\Delta t$	1 s	Time slot duration

Problem Formulation

We optimize the trade-off between energy efficiency and information freshness:

$$\max \sum_{m=1}^M \left( \zeta \frac{D_{\text{total}}}{E_{\text{total}}} - \xi I_{\text{total}} \right)$$

subject to:

$$\begin{cases}
E_{\text{total}} \leq E_{\text{max}} \\
q_u[0] = (x_{\text{orig}}, y_{\text{orig}}, H) \\
q_u[N-1] = (x_{\text{dest}}, y_{\text{dest}}, H)
\end{cases}$$

This multi-objective optimization is modeled as a Markov Decision Process (MDP) with:

State space: $\mathcal{S}[t] = \{q_u[t]; q_1[t],\dots,q_M[t]; I_1[t],\dots,I_M[t]\}$
Action space: $\mathcal{A}[t] = \{\mathbf{v}_x[t], \mathbf{v}_y[t], \mathbf{v}_z[t]\}$
Reward: $r[t] = \sum_{m=1}^M \left( \zeta \frac{R_m[t]}{P(\mathbf{v}[t])} - \xi I_m[t] \right)$
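
The state and reward definitions above can be sketched as plain Python helpers. The weights $\zeta$ and $\xi$ are not given numerically in the text, so the defaults below are placeholders, and the flattened state layout is one possible encoding rather than the one used in the experiments.

```python
def build_state(uav_xy, vehicle_xys, ages):
    """Flatten (q_u; q_1..q_M; I_1..I_M) into one list.

    The fixed altitude H is omitted because it never changes (an assumption
    about the encoding, not stated in the text).
    """
    state = list(uav_xy)
    for xy in vehicle_xys:
        state.extend(xy)
    state.extend(ages)
    return state


def step_reward(rates, power, ages, zeta=1.0, xi=1.0):
    """r[t] = sum_m (zeta * R_m[t] / P(v[t]) - xi * I_m[t]).

    rates: per-vehicle rates R_m[t]; power: propulsion power P(v[t]);
    ages: per-vehicle AoI I_m[t]; zeta and xi are ASSUMED placeholder weights.
    """
    return sum(zeta * r / power - xi * a for r, a in zip(rates, ages))
```

With two vehicles, rates of 10 and 20, power 5, ages 1 and 2, and weights $\zeta = 1$, $\xi = 0.5$, the reward is $(2 - 0.5) + (4 - 1) = 4.5$.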

SAC Algorithm Framework

The UAV agent employs maximum-entropy reinforcement learning to enhance exploration in continuous action spaces.

Soft Q-Function

The objective combines expected return and policy entropy:

$$J(\pi) = \mathbb{E}_{(s,a) \sim \rho_\pi} \left[ r(s,a) + \alpha \mathcal{H}(\pi(\cdot|s)) \right]$$

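where $\mathcal{H}(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[-\log \pi(a|s)\right]$ is the policy entropy and $\alpha$ is the temperature parameter that weights it against the reward.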
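
To illustrate the entropy bonus, the sketch below evaluates the expected value of $-\log \pi(a|s)$ for a policy over a small discrete action set, such as the five motion primitives listed in the energy model. The function names and the use of a discrete distribution are illustrative assumptions.

```python
import math


def entropy(probs):
    """Entropy of a discrete policy: expected value of -log pi(a|s)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_reward(reward, probs, alpha):
    """Per-step term r(s, a) + alpha * H(pi(.|s)) of the objective J(pi)."""
    return reward + alpha * entropy(probs)
```

A uniform policy over five actions attains the maximum entropy $\log 5$, while a deterministic policy has zero entropy, so a larger $\alpha$ rewards policies that keep several actions in play.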

Critic Network

Twin Q-networks $Q_{\theta_1}, Q_{\theta_2}$ minimize the Bellman error:

$$J_Q(\theta_i) = \mathbb{E}_{(s,a,r,s')} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right]$$

with target:

$$y = r + \gamma \left( \min_{j=1,2} Q_{\bar{\theta}_j}(s',a') - \alpha \log \pi(a'|s') \right)$$

Here $a' \sim \pi(\cdot|s')$ is sampled from the current policy at the next state. The target networks use soft updates: $\bar{\theta}_j \leftarrow \tau \theta_j + (1-\tau)\bar{\theta}_j$.
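
The critic target and the soft update can be written out for scalar values as a sanity check. This is a sketch on plain Python numbers rather than network tensors; $\gamma = 0.95$ follows the experimental setup, while the defaults for `alpha` and `tau` are assumed for illustration.

```python
def soft_q_target(reward, q1_next, q2_next, log_prob_next, gamma=0.95, alpha=0.2):
    """y = r + gamma * (min_j Q_target_j(s', a') - alpha * log pi(a'|s')).

    alpha=0.2 is an ASSUMED default; the text does not give a value.
    """
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


def soft_update(target_params, online_params, tau=0.005):
    """theta_bar <- tau * theta + (1 - tau) * theta_bar, element by element.

    tau=0.005 is an ASSUMED default; the text does not give a value.
    """
    return [
        tau * w + (1.0 - tau) * w_bar
        for w, w_bar in zip(online_params, target_params)
    ]


def critic_loss(q_pred, y):
    """Squared Bellman error for a single transition."""
    return (q_pred - y) ** 2
```

Taking the minimum of the two target critics counteracts overestimation of the Q-values, and a small `tau` keeps the targets slowly moving so the regression problem stays stable.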

Actor Network

The policy network maximizes:

$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_\phi} \left[ \min_{j=1,2} Q_{\theta_j}(s,a) - \alpha \log \pi_\phi(a|s) \right]$$

with gradient:

$$\nabla_\phi J_\pi = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_\phi} \left[ \nabla_\phi \log \pi_\phi(a|s) \left( \min_{j=1,2} Q_{\theta_j}(s,a) - \alpha \log \pi_\phi(a|s) \right) \right]$$

The entropy coefficient $\alpha$ is tuned automatically by minimizing:

$$J(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \left( \log \pi(a|s) + \mathcal{H}_{\text{target}} \right) \right]$$
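
As a rough illustration of the actor and temperature objectives, the sketch below forms Monte Carlo estimates from a batch of sampled values. It operates on plain lists instead of network outputs, and the function names are hypothetical.

```python
def actor_objective(q1, q2, log_probs, alpha):
    """Batch estimate of J_pi: mean of min(Q1, Q2) - alpha * log pi(a|s)."""
    n = len(log_probs)
    return sum(min(a, b) - alpha * lp for a, b, lp in zip(q1, q2, log_probs)) / n


def temperature_loss(log_probs, alpha, target_entropy):
    """Batch estimate of J(alpha): mean of -alpha * (log pi(a|s) + H_target)."""
    n = len(log_probs)
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / n


def temperature_grad(log_probs, target_entropy):
    """dJ(alpha)/dalpha = -(mean log pi + H_target)."""
    n = len(log_probs)
    return -(sum(log_probs) / n + target_entropy)
```

When the empirical entropy (the negative mean log-probability) falls below $\mathcal{H}_{\text{target}}$, the gradient is negative, so a gradient-descent step increases $\alpha$ and pushes the policy back toward more exploration.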

Experimental Results

We implemented the UAV system in SUMO with real traffic patterns. Training used a 256-neuron network, a discount factor of $\gamma=0.95$, and a replay buffer of size 5,000.

Comparative Algorithm Performance
Metric	SAC	DDPG	PPO
Convergence speed	50 epochs	120 epochs	200 epochs
Avg. reward	0.92	0.85	0.78
Energy efficiency (kb/J)	18.7	15.2	12.9
Avg. AoI reduction	41.2%	32.7%	24.5%
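
The relative differences implied by the table can be checked with a few lines of arithmetic. The dictionary below simply restates the table values; the helper names are illustrative.

```python
# Values copied from the comparison table above.
results = {
    "SAC": {"epochs": 50, "reward": 0.92, "kb_per_joule": 18.7, "aoi_reduction": 41.2},
    "DDPG": {"epochs": 120, "reward": 0.85, "kb_per_joule": 15.2, "aoi_reduction": 32.7},
    "PPO": {"epochs": 200, "reward": 0.78, "kb_per_joule": 12.9, "aoi_reduction": 24.5},
}


def relative_gain(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100.0


def convergence_speedup(baseline):
    """How many times fewer epochs SAC needs than the given baseline."""
    return results[baseline]["epochs"] / results["SAC"]["epochs"]


for name in ("DDPG", "PPO"):
    gain = relative_gain(results["SAC"]["reward"], results[name]["reward"])
    print(f"SAC vs {name}: reward +{gain:.1f}%, {convergence_speedup(name):.1f}x fewer epochs")
```

From these figures, SAC needs 2.4 times fewer epochs than DDPG and 4 times fewer than PPO.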

Figure 1 shows that SAC converges faster than DDPG and PPO, reaching an average reward of 0.92 against 0.85 and 0.78, respectively. SAC also attains the highest energy efficiency (18.7 kb/J) and the largest average AoI reduction (41.2%).

Figure 2 contrasts flight paths with and without the energy-efficiency term (weight $\zeta$). Energy-aware routing reduces flight distance by 27.3% while keeping AoI below 3.5 s, indicating that the UAV can balance the two conflicting objectives.

Conclusion

Our SAC-based framework optimizes UAV path planning under AoI and energy constraints. Maximum-entropy learning enhances exploration in dynamic vehicular environments, outperforming DDPG and PPO in average reward (0.92 vs. 0.85 and 0.78), convergence speed (2.4× and 4× faster, respectively), and robustness. Future work will extend to multi-UAV coordination and non-terrestrial networks.