1. Introduction
Unmanned aerial vehicles (UAVs) revolutionize data acquisition by offering mobility, low cost, and rapid deployment in inaccessible terrains (e.g., disaster zones or forests). However, energy constraints critically limit UAV endurance. Traditional trajectory optimization often ignores energy limitations, leading to suboptimal performance. This work addresses this gap by proposing a reinforcement learning (RL) framework to maximize data throughput under strict energy constraints.

2. System Model
2.1 Network Architecture
- Components:
- Single rotary-wing UAV (hovering capability).
- $K$ ground data sources randomly distributed in a $X_{\text{max}} \times Y_{\text{max}}$ rectangular area.
- UAV Dynamics:
- Fixed altitude $H = 120 \text{ m}$.
- Discrete actions: $\mathcal{A} = \{\text{East, West, South, North, Hover}\}$.
- Constant velocity $V_0 = 12 \text{ m/s}$ during flight.
- Time discretized into slots $\Delta t = 0.5 \text{ s}$; spatial step $\Delta l = V_0 \Delta t = 6 \text{ m}$.
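As a concrete illustration of the discretization above, the following minimal Python sketch maps each movement primitive to a per-slot displacement of the UAV's ground projection. The names (`ACTIONS`, `move`, `STEP`) and the clipping to the service area are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the per-slot kinematics under the discretization above.
V0 = 12.0        # m/s, cruise speed
DT = 0.5         # s, slot duration
STEP = V0 * DT   # 6 m spatial step per slot

# Each action moves the UAV's ground projection by one spatial step (or not at all).
ACTIONS = {
    "East":  ( STEP, 0.0),
    "West":  (-STEP, 0.0),
    "North": ( 0.0,  STEP),
    "South": ( 0.0, -STEP),
    "Hover": ( 0.0,  0.0),
}

def move(z, action, x_max=600.0, y_max=600.0):
    """Apply an action and clip the resulting position to the service area."""
    dx, dy = ACTIONS[action]
    x = min(max(z[0] + dx, 0.0), x_max)
    y = min(max(z[1] + dy, 0.0), y_max)
    return (x, y)
```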
2.2 Communication Model
UAV-to-source $k$ channel gain at time $t$:
$$g_k(t) = \frac{\beta_0}{H^2 + \|\mathbf{z}(t) - \mathbf{z}_{s,k}\|^2}, \qquad \beta_0 = 41.6 \text{ dB},$$
where $\mathbf{z}(t) = [x(t), y(t)]^T$ (UAV ground projection) and $\mathbf{z}_{s,k}$ (source $k$ position).
Data rate for source $k$ (bandwidth $B = 1 \text{ MHz}$):
$$R_k(t) = B \log_2\!\left(1 + \frac{P_r\, g_k(t)}{\sigma_0^2}\right), \qquad \frac{P_r}{\sigma_0^2} = 10 \text{ dB}.$$
Total throughput per slot:
$$R_\Sigma(t) = \sum_{k=1}^{K} R_k(t).$$
Data collected in $[t, t+\Delta t]$:
$$d_A(t) = \frac{\Delta t}{2}\left[R_\Sigma(t) + R_\Sigma(t+\Delta t)\right].$$
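The per-slot rate and data computations can be transcribed directly from the formulas above. The snippet below is an illustrative Python sketch; converting $\beta_0$ and $P_r/\sigma_0^2$ from dB to linear scale before applying them is an assumption about how the parameters are used, not something stated in the paper.

```python
import numpy as np

B = 1e6                      # Hz, bandwidth per source
H = 120.0                    # m, UAV altitude
BETA0 = 10 ** (41.6 / 10)    # reference channel gain (41.6 dB -> linear), assumption
SNR_REF = 10 ** (10 / 10)    # Pr / sigma0^2 (10 dB -> linear), assumption

def channel_gain(z, z_k):
    """g_k(t) = beta0 / (H^2 + ||z - z_k||^2)."""
    d2 = (z[0] - z_k[0]) ** 2 + (z[1] - z_k[1]) ** 2
    return BETA0 / (H ** 2 + d2)

def sum_rate(z, sources):
    """R_Sigma(t): total rate over all K sources at UAV ground position z."""
    return sum(B * np.log2(1.0 + SNR_REF * channel_gain(z, zk)) for zk in sources)

def data_collected(z_now, z_next, sources, dt=0.5):
    """Trapezoidal approximation d_A(t) of the data collected over one slot."""
    return 0.5 * dt * (sum_rate(z_now, sources) + sum_rate(z_next, sources))
```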
2.3 Energy Model
Rotor power consumption [15, 16]:
$$P_f(\|\mathbf{v}\|) = P_0\left(1 + \frac{3\|\mathbf{v}\|^2}{V_{\text{tip}}^2}\right) + P_1\left(\sqrt{1 + \frac{\|\mathbf{v}\|^4}{4 v_0^4}} - \frac{\|\mathbf{v}\|^2}{2 v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A \|\mathbf{v}\|^3.$$
Parameters:
Symbol | Value | Description |
---|---|---|
$P_0$ | 7.2 W | Blade power (hover) |
$P_1$ | 4.8 W | Induced power (hover) |
$V_{\text{tip}}$ | 15 m/s | Blade tip speed |
$v_0$ | 4.8 m/s | Rotor induced velocity |
$d_0$ | 0.6 | Fuselage drag ratio |
$\rho$ | 1.225 kg/m³ | Air density |
$A$ | 0.0314 m² | Rotor disc area |
Power consumption:
- Hover ($\|\mathbf{v}\| = 0$): $P_{f}(0) = P_0 + P_1 = 12 \text{ W}$.
- Flight ($\|\mathbf{v}\| = V_0$): $P_{f}(V_0) \approx 25 \text{ W}$.
Energy consumed per slot:
$$\epsilon_A(t) = P_f(\|\mathbf{v}(t)\|)\,\Delta t.$$
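A direct transcription of the rotor power model is useful as a sanity check against the hover and cruise figures above. The rotor solidity $s$ does not appear in the parameter table, so a typical value is assumed here; with that assumption the computed cruise power lands near, but not exactly at, the quoted $\approx 25$ W.

```python
import math

# Rotor power parameters from the table above.
P0, P1 = 7.2, 4.8            # W, blade profile / induced power at hover
V_TIP, V0_IND = 15.0, 4.8    # m/s, blade tip speed and mean induced velocity at hover
D0, RHO, A = 0.6, 1.225, 0.0314  # drag ratio, air density (kg/m^3), rotor disc area (m^2)
S = 0.05                     # rotor solidity -- not listed in the table, assumed value

def rotor_power(v):
    """P_f(||v||): propulsion power at horizontal speed v (m/s)."""
    blade = P0 * (1.0 + 3.0 * v**2 / V_TIP**2)
    induced = P1 * math.sqrt(math.sqrt(1.0 + v**4 / (4.0 * V0_IND**4))
                             - v**2 / (2.0 * V0_IND**2))
    parasite = 0.5 * D0 * RHO * S * A * v**3
    return blade + induced + parasite

print(rotor_power(0.0))   # 12.0 W at hover (P0 + P1)
print(rotor_power(12.0))  # roughly 24 W at cruise with the assumed solidity
```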
2.4 Problem Formulation
Maximize total collected data under energy constraint $E_0$:
$$\max_{\{\mathbf{z}_n\}} \sum_{n=0}^{N_{\max}} d_{A,n} \quad \text{s.t.} \quad \sum_{n=0}^{N_{\max}} \epsilon_{A,n} \leq E_0, \quad \mathbf{z}_n \in [0, X_{\max}] \times [0, Y_{\max}].$$
3. Reinforcement Learning Framework
3.1 Markov Decision Process (MDP)
- State space $\mathcal{S}$: UAV position $\mathbf{z}_n = (x_n, y_n)$, discretized into $100 \times 100$ grid ($\Delta l = 6 \text{ m}$).
- Action space $\mathcal{A}$: 5 movement primitives.
- Reward function: Balances throughput gain and energy penalty:
$$r_{n+1} = \frac{w}{c_R}\underbrace{\left[R_\Sigma\big((n+1)\Delta t\big) - R_\Sigma(n\Delta t)\right]\Delta t}_{\text{Throughput gain}} - \frac{1-w}{c_E}\underbrace{\left[P_f\big(\|\mathbf{v}(n\Delta t)\|\big) - P_f(0)\right]\Delta t}_{\text{Energy penalty}}.$$
Normalization constants:
$$c_R = K B \Delta t, \qquad c_E = \left[P_f(V_0) - P_f(0)\right]\Delta t.$$
- Weight $w$: Trades off data rate ($w \rightarrow 1$) vs. energy efficiency ($w \rightarrow 0$).
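The normalized reward can be written down compactly. The sketch below assumes the `sum_rate` and `rotor_power` helpers from the earlier snippets and is only meant to make the weighting and normalization explicit, not to reproduce the authors' code.

```python
DT = 0.5   # s, slot duration
K = 6      # number of ground sources
B = 1e6    # Hz, bandwidth per source

C_R = K * B * DT                                    # throughput normalization c_R
C_E = (rotor_power(12.0) - rotor_power(0.0)) * DT   # energy normalization c_E

def reward(z_now, z_next, speed, sources, w):
    """r_{n+1}: weighted, normalized throughput gain minus energy penalty."""
    gain = (sum_rate(z_next, sources) - sum_rate(z_now, sources)) * DT
    penalty = (rotor_power(speed) - rotor_power(0.0)) * DT
    return (w / C_R) * gain - ((1.0 - w) / C_E) * penalty
```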
3.2 Q-Learning Algorithm
Algorithm 1: Optimal $w$ Search
- Initialize $w = 0$, step $c_w = 0.01$.
- While $w \leq 1$:
- Call Algorithm 2 to compute throughput $T_h(w)$.
- Update $w \leftarrow w + c_w$.
- Select $w^* = \arg \max_w T_h(w)$.
- Run Algorithm 2 with $w^*$ for optimal trajectory.
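Algorithm 1 is a plain grid search over $w$. A schematic version, assuming a `train_trajectory(w)` routine that implements Algorithm 2 and returns the achieved throughput:

```python
import numpy as np

def search_weight(train_trajectory, c_w=0.01):
    """Sweep w over [0, 1] in steps of c_w and keep the throughput-maximizing value."""
    weights = np.arange(0.0, 1.0 + c_w, c_w)
    throughputs = [train_trajectory(w) for w in weights]
    w_star = weights[int(np.argmax(throughputs))]
    return w_star, max(throughputs)
```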
Algorithm 2: Trajectory Optimization (Given $w$)
- Initialize Q-table $Q(\mathbf{z}, a) = 0$ $\forall \mathbf{z}, a$.
- Set $\epsilon = 0.9$, $\epsilon_{\min} = 0.05$, $\beta = 0.99$, $\gamma = 0.5$, $\alpha = 0.8$.
- For episode $= 1$ to $M$:
- $\mathbf{z}_0 = [0,0]^T$, $E = E_0$, $T_h = 0$.
- While $E > 0$:
- Choose action $a_n$ via $\epsilon$-greedy policy.
- Execute $a_n$, observe $\mathbf{z}_{n+1}$, $r_{n+1}$.
- Update Q-value:
$$Q(\mathbf{z}_n, a_n) \leftarrow (1-\alpha)\, Q(\mathbf{z}_n, a_n) + \alpha\left[r_{n+1} + \gamma \max_{a'} Q(\mathbf{z}_{n+1}, a')\right].$$
- Update $T_h \leftarrow T_h + d_{A,n}$, $E \leftarrow E - \epsilon_{A,n}$.
- Decay $\epsilon \leftarrow \beta \epsilon$.
- If $\epsilon < \epsilon_{\min}$: terminate.
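A minimal tabular Q-learning loop in the spirit of Algorithm 2 is sketched below. It assumes the `ACTIONS`, `move`, `data_collected`, `rotor_power`, and `reward` helpers from the earlier snippets, clips $\epsilon$ at $\epsilon_{\min}$ rather than terminating, and is a sketch under those assumptions rather than the authors' exact implementation.

```python
import random
from collections import defaultdict

def train_trajectory(w, sources, E0=1500.0, episodes=300,
                     alpha=0.8, gamma=0.5, eps=0.9, eps_min=0.05, beta=0.99,
                     dt=0.5, v0=12.0):
    """Tabular Q-learning over the discretized UAV positions for a given weight w."""
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    throughput = 0.0
    for _ in range(episodes):
        z, E, throughput = (0.0, 0.0), E0, 0.0
        while E > 0:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(list(ACTIONS))
            else:
                a = max(Q[z], key=Q[z].get)
            z_next = move(z, a)
            speed = 0.0 if a == "Hover" else v0
            r = reward(z, z_next, speed, sources, w)
            # Q-value update
            Q[z][a] = (1 - alpha) * Q[z][a] + alpha * (r + gamma * max(Q[z_next].values()))
            # Track collected data and remaining energy
            throughput += data_collected(z, z_next, sources, dt)
            E -= rotor_power(speed) * dt
            z = z_next
        eps = max(beta * eps, eps_min)
    return throughput
```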
4. Simulation Results
Parameters: $E_0 = 1500 \text{ J}$, $K=6$ sources, $600 \text{ m} \times 600 \text{ m}$ area.
4.1 Convergence & Weight Sensitivity
- Convergence: Episodic reward stabilizes after 200 episodes (Fig. 3).
- Optimal $w^*$: $w^* = 0.36$ maximizes throughput (Fig. 4):

$w$ | Throughput (Mbit) |
---|---|
0.0–0.2 | 1066 |
0.36 | 1317 (max) |
1.0 | 978 |
4.2 Trajectory Comparison
- Proposed ($w^*=0.36$): UAV moves to $(54, 156)$ then hovers.
- Greedy ($w=1.0$): UAV traverses to $(216, 342)$, exhausting energy faster.
Result: Proposed method collects 34.7% more data than greedy.
4.3 Energy Scalability
Throughput vs. initial energy $E_0$:
$E_0$ (J) | Proposed (Mbit) | Greedy (Mbit) | Static (Mbit) |
---|---|---|---|
500 | 520 | 490 | 510 |
1000 | 875 | 760 | 850 |
1500 | 1317 | 978 | 1066 |
2000 | 1750 | 1508 | 1420 |
Key insight: Proposed method outperforms baselines by 15.8–25.7% for $E_0 \geq 1000 \text{ J}$.
5. Conclusion
This work optimizes UAV trajectories for data acquisition under energy constraints using Q-learning. Key contributions include:
- Hybrid reward function balancing throughput and energy.
- Adaptive weight $w^*$ search maximizing throughput.
- 34.7% higher data collection vs. greedy baselines.
Future work: Extend to multi-UAV cooperative systems and dynamic source distributions.
Appendix: Key Symbols
Symbol | Description | Value/Unit |
---|---|---|
$H$ | UAV altitude | 120 m |
$\Delta t$ | Time slot duration | 0.5 s |
$V_0$ | UAV speed | 12 m/s |
$B$ | Bandwidth per source | 1 MHz |
$\beta_0$ | Reference channel gain | 41.6 dB |
$P_f(0)$ | Hover power | 12 W |
$P_f(V_0)$ | Cruise power | 25 W |
$w^*$ | Optimal weight | 0.36 |