Unmanned Aerial Vehicle Trajectory Optimization for Energy-Constrained Data Acquisition Systems

1. Introduction

Unmanned aerial vehicles (UAVs) are transforming data acquisition by offering mobility, low cost, and rapid deployment in hard-to-reach environments (e.g., disaster zones or forests). However, limited onboard energy critically constrains UAV endurance, and traditional trajectory optimization often ignores this limitation, leading to suboptimal performance. This work addresses that gap by proposing a reinforcement learning (RL) framework that maximizes data throughput under a strict energy constraint.


2. System Model

2.1 Network Architecture

  • Components:
    • Single rotary-wing UAV (hovering capability).
    • $K$ ground data sources randomly distributed in a $X_{\text{max}} \times Y_{\text{max}}$ rectangular area.
  • UAV Dynamics:
    • Fixed altitude $H = 120 \text{ m}$.
    • Discrete actions: $\mathcal{A} = \{\text{East, West, South, North, Hover}\}$.
    • Constant velocity $V_0 = 12 \text{ m/s}$ during flight.
    • Time discretized into slots $\Delta t = 0.5 \text{ s}$; spatial step $\Delta l = V_0 \Delta t = 6 \text{ m}$ (a position-update sketch follows this list).
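
A minimal sketch of this discretization in Python; the boundary clipping and helper names below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Spatial step per slot: Delta_l = V0 * Delta_t = 12 m/s * 0.5 s = 6 m.
V0, DT = 12.0, 0.5
DL = V0 * DT

# Discrete action set A = {East, West, South, North, Hover} as (dx, dy) offsets in metres.
ACTIONS = {
    "East":  ( DL, 0.0),
    "West":  (-DL, 0.0),
    "North": (0.0,  DL),
    "South": (0.0, -DL),
    "Hover": (0.0, 0.0),
}

X_MAX = Y_MAX = 600.0  # service area used in Section 4 (600 m x 600 m)

def step_position(z, action):
    """Apply one movement primitive over one slot; clip to the service area (assumed behaviour)."""
    dx, dy = ACTIONS[action]
    x = float(np.clip(z[0] + dx, 0.0, X_MAX))
    y = float(np.clip(z[1] + dy, 0.0, Y_MAX))
    return np.array([x, y])
```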

2.2 Communication Model

UAV-to-source $k$ channel gain at time $t$:

$$g_k(t) = \frac{\beta_0}{H^2 + \|\mathbf{z}(t) - \mathbf{z}_{s,k}\|^2}, \qquad \beta_0 = 41.6 \text{ dB},$$

where $\mathbf{z}(t) = [x(t), y(t)]^T$ is the UAV's ground projection and $\mathbf{z}_{s,k}$ is the position of source $k$.

Data rate for source $k$ (bandwidth $B = 1 \text{ MHz}$):

$$R_k(t) = B \log_2\left(1 + \frac{P_r\, g_k(t)}{\sigma_0^2}\right), \qquad \frac{P_r}{\sigma_0^2} = 10 \text{ dB}.$$

Total throughput per slot:

$$R_\Sigma(t) = \sum_{k=1}^{K} R_k(t).$$

Data collected in $[t, t+\Delta t]$:

$$d_A(t) = \frac{\Delta t}{2}\left[R_\Sigma(t) + R_\Sigma(t+\Delta t)\right].$$
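
A compact numerical sketch of this communication model, assuming $\beta_0$ and the reference SNR are converted from dB to linear scale and that $P_r/\sigma_0^2 = 10$ dB applies to every source; function names and the example positions are illustrative only:

```python
import numpy as np

B = 1e6                      # bandwidth per source, 1 MHz
H = 120.0                    # UAV altitude, m
DT = 0.5                     # slot duration, s
BETA0 = 10 ** (41.6 / 10)    # reference channel gain, 41.6 dB converted to linear scale
SNR_REF = 10 ** (10 / 10)    # P_r / sigma_0^2 = 10 dB converted to linear scale

def channel_gain(z, z_src):
    """g_k(t) = beta_0 / (H^2 + ||z(t) - z_{s,k}||^2)."""
    return BETA0 / (H ** 2 + np.sum((np.asarray(z) - np.asarray(z_src)) ** 2))

def sum_rate(z, sources):
    """R_Sigma(t): total instantaneous rate over all K sources, in bit/s."""
    return sum(B * np.log2(1.0 + SNR_REF * channel_gain(z, s)) for s in sources)

def data_per_slot(z_now, z_next, sources):
    """d_A(t): trapezoidal approximation of the data collected in [t, t + Delta_t]."""
    return 0.5 * DT * (sum_rate(z_now, sources) + sum_rate(z_next, sources))

# Illustrative usage with two dummy source positions (not from the paper):
srcs = [np.array([100.0, 200.0]), np.array([300.0, 450.0])]
print(data_per_slot(np.array([0.0, 0.0]), np.array([6.0, 0.0]), srcs) / 1e6, "Mbit in one slot")
```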

2.3 Energy Model

Rotor power consumption [15, 16]:

$$P_f(\|\mathbf{v}\|) = P_0\left(1 + \frac{3\|\mathbf{v}\|^2}{V_{\text{tip}}^2}\right) + P_1\left(\sqrt{1 + \frac{\|\mathbf{v}\|^4}{4 v_0^4}} - \frac{\|\mathbf{v}\|^2}{2 v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A \|\mathbf{v}\|^3,$$

where $s$ denotes the rotor solidity.

Parameters:

| Symbol | Value | Description |
| --- | --- | --- |
| $P_0$ | 7.2 W | Blade power (hover) |
| $P_1$ | 4.8 W | Induced power (hover) |
| $V_{\text{tip}}$ | 15 m/s | Blade tip speed |
| $v_0$ | 4.8 m/s | Rotor induced velocity (hover) |
| $d_0$ | 0.6 | Fuselage drag ratio |
| $\rho$ | 1.225 kg/m³ | Air density |
| $A$ | 0.0314 m² | Rotor disc area |

Power consumption:

  • Hover ($\|\mathbf{v}\| = 0$): $P_f(0) = P_0 + P_1 = 12 \text{ W}$.
  • Flight ($\|\mathbf{v}\| = V_0$): $P_f(12) \approx 25 \text{ W}$.

Energy consumed per slot (see the sketch below):

$$\epsilon_A(t) = P_f(\|\mathbf{v}(t)\|)\,\Delta t.$$
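
A sketch of the power and per-slot energy computation under the tabulated parameters; the rotor solidity $s$ is not listed in the table above, so the default used here is purely an illustrative placeholder:

```python
import numpy as np

# Parameters from the table in Section 2.3.
P0, P1 = 7.2, 4.8            # blade power and induced power at hover, W
V_TIP, V0_HOVER = 15.0, 4.8  # blade tip speed and rotor induced velocity, m/s
D0, RHO, A = 0.6, 1.225, 0.0314   # fuselage drag ratio, air density, rotor disc area
DT = 0.5                     # slot duration, s

def rotor_power(v, s=0.05):
    """P_f(||v||); the rotor solidity s is not given in the table, so this default
    is an illustrative placeholder only."""
    blade = P0 * (1.0 + 3.0 * v ** 2 / V_TIP ** 2)
    induced = P1 * np.sqrt(np.sqrt(1.0 + v ** 4 / (4.0 * V0_HOVER ** 4)) - v ** 2 / (2.0 * V0_HOVER ** 2))
    parasite = 0.5 * D0 * RHO * s * A * v ** 3
    return blade + induced + parasite

def energy_per_slot(v):
    """epsilon_A(t) = P_f(||v(t)||) * Delta_t."""
    return rotor_power(v) * DT

print(rotor_power(0.0))    # hover: P0 + P1 = 12 W exactly
print(rotor_power(12.0))   # cruise at V0 = 12 m/s: ~24 W with this s; the paper reports ~25 W
```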

2.4 Problem Formulation

Maximize total collected data under energy constraint $E_0$:

$$\begin{aligned}
\max_{\{\mathbf{z}_n\}} \quad & \sum_{n=0}^{N_{\max}} d_{A,n} \\
\text{s.t.} \quad & \sum_{n=0}^{N_{\max}} \epsilon_{A,n} \leq E_0, \\
& \mathbf{z}_n \in [0, X_{\max}] \times [0, Y_{\max}].
\end{aligned}$$
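
As a rough scale check using only the values already given: each hovering slot costs $P_f(0)\,\Delta t = 12 \text{ W} \times 0.5 \text{ s} = 6 \text{ J}$ and each flying slot costs $\approx 12.5 \text{ J}$, so a budget of $E_0 = 1500 \text{ J}$ permits at most about 250 hovering slots (125 s) or 120 flying slots (60 s). The constraint therefore strongly rewards reaching a high-rate position quickly and hovering there, which is consistent with the trajectory reported in Section 4.2.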


3. Reinforcement Learning Framework

3.1 Markov Decision Process (MDP)

  • State space $\mathcal{S}$: UAV position $\mathbf{z}_n = (x_n, y_n)$, discretized into a $100 \times 100$ grid ($\Delta l = 6 \text{ m}$).
  • Action space $\mathcal{A}$: 5 movement primitives.
  • Reward function: Balances throughput gain and energy penalty:

$$r_{n+1} = \underbrace{\frac{w}{c_R}\Big[R_\Sigma\big((n+1)\Delta t\big) - R_\Sigma(n\Delta t)\Big]\Delta t}_{\text{Throughput gain}} \;-\; \underbrace{\frac{1-w}{c_E}\Big[P_f\big(\|\mathbf{v}(n\Delta t)\|\big) - P_f(0)\Big]\Delta t}_{\text{Energy penalty}}.$$

Normalization constants:

$$c_R = K B \Delta t, \qquad c_E = \big[P_f(V_0) - P_f(0)\big]\Delta t.$$

  • Weight $w$: Trades off data rate ($w \rightarrow 1$) vs. energy efficiency ($w \rightarrow 0$); a reward-computation sketch follows this list.
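
A self-contained sketch of this reward, treating the sum rates and instantaneous power as precomputed inputs; the hover and cruise powers are the 12 W and 25 W values from Section 2.3, and the function name is illustrative:

```python
# Normalization constants for the reward (K = 6 sources, B = 1 MHz, Delta_t = 0.5 s).
K, B, DT = 6, 1e6, 0.5
P_HOVER, P_CRUISE = 12.0, 25.0           # P_f(0) and P_f(V_0) from Section 2.3
C_R = K * B * DT                         # c_R = K * B * Delta_t
C_E = (P_CRUISE - P_HOVER) * DT          # c_E = [P_f(V_0) - P_f(0)] * Delta_t

def reward(w, rate_next, rate_now, power_now):
    """r_{n+1}: normalized throughput gain weighted by w, minus normalized energy
    penalty weighted by (1 - w). rate_* are R_Sigma values in bit/s; power_now is P_f in W."""
    gain = (w / C_R) * (rate_next - rate_now) * DT
    penalty = ((1.0 - w) / C_E) * (power_now - P_HOVER) * DT
    return gain - penalty

# Example: moving closer to the sources raises the sum rate but costs cruise power.
print(reward(0.36, rate_next=4.0e6, rate_now=3.5e6, power_now=25.0))
```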

3.2 Q-Learning Algorithm

Algorithm 1: Optimal $w$ Search

  1. Initialize $w = 0$, step $c_w = 0.01$.
  2. While $w \leq 1$:
    • Call Algorithm 2 to compute throughput $T_h(w)$.
    • Update $w \leftarrow w + c_w$.
  3. Select $w^* = \arg \max_w T_h(w)$.
  4. Run Algorithm 2 with $w^*$ to obtain the optimal trajectory (a sketch of this sweep follows).
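
A compact sketch of this grid search; `run_algorithm_2` is a stand-in stub (returning a dummy score so the snippet runs on its own) for the Q-learning routine of Algorithm 2:

```python
import numpy as np

C_W = 0.01   # step size for the weight sweep

def run_algorithm_2(w):
    """Stand-in for Algorithm 2: should return the throughput T_h(w) achieved by the
    Q-learning trajectory optimizer for weight w. The dummy score keeps this runnable."""
    return -(w - 0.5) ** 2

weights = np.arange(0.0, 1.0 + C_W / 2, C_W)          # w = 0, 0.01, ..., 1.0
throughputs = [run_algorithm_2(w) for w in weights]   # T_h(w) for each candidate weight
w_star = weights[int(np.argmax(throughputs))]         # w* = argmax_w T_h(w)
run_algorithm_2(w_star)                               # final run with w* yields the trajectory
```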

Algorithm 2: Trajectory Optimization (Given $w$)

  1. Initialize Q-table $Q(\mathbf{z}, a) = 0$ $\forall \mathbf{z}, a$.
  2. Set $\epsilon = 0.9$, $\epsilon_{\min} = 0.05$, $\beta = 0.99$, $\gamma = 0.5$, $\alpha = 0.8$.
  3. For episode $= 1$ to $M$:
    • $\mathbf{z}_0 = [0,0]^T$, $E = E_0$, $T_h = 0$.
    • While $E > 0$:
      • Choose action $a_n$ via $\epsilon$-greedy policy.
      • Execute $a_n$, observe $\mathbf{z}_{n+1}$, $r_{n+1}$.
      • Update Q-value:
        $$Q(\mathbf{z}_n, a_n) \leftarrow (1-\alpha)\,Q(\mathbf{z}_n, a_n) + \alpha\left[r_{n+1} + \gamma \max_{a'} Q(\mathbf{z}_{n+1}, a')\right].$$
      • Update $T_h \leftarrow T_h + d_{A,n}$, $E \leftarrow E - \epsilon_{A,n}$.
    • Decay $\epsilon \leftarrow \beta \epsilon$.
    • If $\epsilon < \epsilon_{\min}$: terminate (a condensed code sketch of this loop follows).
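
A condensed, self-contained sketch of this training loop with the hyperparameters from step 2; the environment step (next position, reward, per-slot data and energy) is reduced to a placeholder that should be replaced by the models of Sections 2.2–2.3 and the reward of Section 3.1:

```python
import numpy as np

# Hyperparameters from Algorithm 2.
EPS0, EPS_MIN, BETA = 0.9, 0.05, 0.99   # initial epsilon, epsilon floor, per-episode decay
GAMMA, ALPHA = 0.5, 0.8                 # discount factor, learning rate
E0 = 1500.0                             # initial energy budget, J
GRID, N_ACTIONS = 100, 5                # 100 x 100 grid cells, 5 movement primitives
MOVES = [(1, 0), (-1, 0), (0, -1), (0, 1), (0, 0)]  # East, West, South, North, Hover

rng = np.random.default_rng(0)
Q = np.zeros((GRID, GRID, N_ACTIONS))   # Q(z, a) initialised to zero

def placeholder_step(z, a):
    """Stand-in environment step: returns next cell, reward r_{n+1}, data d_A, energy eps_A.
    Replace with the rate model (Sec. 2.2), power model (Sec. 2.3) and reward (Sec. 3.1)."""
    x = int(np.clip(z[0] + MOVES[a][0], 0, GRID - 1))
    y = int(np.clip(z[1] + MOVES[a][1], 0, GRID - 1))
    eps_A = 6.0 if a == 4 else 12.5     # J per slot: 12 W hover / 25 W cruise over 0.5 s
    d_A = rng.uniform(0.0, 1.0)         # dummy per-slot data
    return (x, y), d_A, d_A, eps_A      # dummy reward equals the dummy data term

eps = EPS0
while eps >= EPS_MIN:                   # stop once epsilon has decayed below its floor
    z, E, T_h = (0, 0), E0, 0.0         # start at the origin with a full energy budget
    while E > 0:
        # epsilon-greedy action selection
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[z[0], z[1]]))
        z_next, r, d_A, eps_A = placeholder_step(z, a)
        # Q-learning update
        target = r + GAMMA * np.max(Q[z_next[0], z_next[1]])
        Q[z[0], z[1], a] = (1 - ALPHA) * Q[z[0], z[1], a] + ALPHA * target
        T_h += d_A
        E -= eps_A
        z = z_next
    eps *= BETA                         # decay exploration after each episode
```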

4. Simulation Results

Parameters: $E_0 = 1500 \text{ J}$, $K=6$ sources, $600 \text{ m} \times 600 \text{ m}$ area.

4.1 Convergence & Weight Sensitivity

  • Convergence: Episodic reward stabilizes after 200 episodes (Fig. 3).
  • Optimal $w^*$: $w^* = 0.36$ maximizes throughput (Fig. 4):

| $w$ | Throughput (Mbit) |
| --- | --- |
| 0.0–0.2 | 1066 |
| 0.36 | 1317 (max) |
| 1.0 | 978 |

4.2 Trajectory Comparison

  • Proposed ($w^*=0.36$): UAV moves to $(54, 156)$ then hovers.
  • Greedy ($w=1.0$): UAV traverses to $(216, 342)$, exhausting energy faster.
  • Result: The proposed method collects 34.7% more data than the greedy baseline.

4.3 Energy Scalability

Throughput vs. initial energy $E_0$:

| $E_0$ (J) | Proposed (Mbit) | Greedy (Mbit) | Static (Mbit) |
| --- | --- | --- | --- |
| 500 | 520 | 490 | 510 |
| 1000 | 875 | 760 | 850 |
| 1500 | 1317 | 978 | 1066 |
| 2000 | 1750 | 1508 | 1420 |

Key insight: Proposed method outperforms baselines by 15.8–25.7% for $E_0 \geq 1000 \text{ J}$.


5. Conclusion

This work optimizes unmanned aerial vehicle trajectories for data acquisition under energy constraints using Q-learning. Key innovations include:

  1. Hybrid reward function balancing throughput and energy.
  2. Adaptive weight $w^*$ search maximizing throughput.
  3. 34.7% higher data collection vs. greedy baselines.

Future work: Extend to multi-unmanned aerial vehicle cooperative systems and dynamic source distributions.


Appendix: Key Symbols

| Symbol | Description | Value/Unit |
| --- | --- | --- |
| $H$ | UAV altitude | 120 m |
| $\Delta t$ | Time slot duration | 0.5 s |
| $V_0$ | UAV speed | 12 m/s |
| $B$ | Bandwidth per source | 1 MHz |
| $\beta_0$ | Reference channel gain | 41.6 dB |
| $P_f(0)$ | Hover power | 12 W |
| $P_f(V_0)$ | Cruise power | 25 W |
| $w^*$ | Optimal weight | 0.36 |