1. Introduction
Unmanned aerial vehicles (UAVs) revolutionize data acquisition by offering mobility, low cost, and rapid deployment in inaccessible terrains (e.g., disaster zones or forests). However, energy constraints critically limit UAV endurance. Traditional trajectory optimization often ignores energy limitations, leading to suboptimal performance. This work addresses this gap by proposing a reinforcement learning (RL) framework to maximize data throughput under strict energy constraints.

2. System Model
2.1 Network Architecture
- Components:
- Single rotary-wing UAV (hovering capability).
- $K$ ground data sources randomly distributed in a $X_{\text{max}} \times Y_{\text{max}}$ rectangular area.
- UAV Dynamics:
- Fixed altitude $H = 120 \text{ m}$.
- Discrete actions: $\mathcal{A} = \{\text{East, West, South, North, Hover}\}$.
- Constant velocity $V_0 = 12 \text{ m/s}$ during flight.
- Time discretized into slots $\Delta t = 0.5 \text{ s}$; spatial step $\Delta l = V_0 \Delta t = 6 \text{ m}$.
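As a concrete illustration of the discretization above, the following minimal Python sketch maps each movement primitive to a per-slot displacement of the UAV's ground projection. The names (`ACTIONS`, `move`, `STEP`) and the clipping to the service area are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the per-slot kinematics under the discretization above.
V0 = 12.0        # m/s, cruise speed
DT = 0.5         # s, slot duration
STEP = V0 * DT   # 6 m spatial step per slot

# Each action moves the UAV's ground projection by one spatial step (or not at all).
ACTIONS = {
    "East":  ( STEP, 0.0),
    "West":  (-STEP, 0.0),
    "North": ( 0.0,  STEP),
    "South": ( 0.0, -STEP),
    "Hover": ( 0.0,  0.0),
}

def move(z, action, x_max=600.0, y_max=600.0):
    """Apply an action and clip the resulting position to the service area."""
    dx, dy = ACTIONS[action]
    x = min(max(z[0] + dx, 0.0), x_max)
    y = min(max(z[1] + dy, 0.0), y_max)
    return (x, y)
```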
2.2 Communication Model
UAV-to-source $k$ channel gain at time $t$:
$$g_k(t) = \frac{\beta_0}{H^2 + \|\mathbf{z}(t) - \mathbf{z}_{s,k}\|^2}, \qquad \beta_0 = 41.6 \text{ dB},$$
where $\mathbf{z}(t) = [x(t), y(t)]^T$ (UAV ground projection) and $\mathbf{z}_{s,k}$ (source $k$ position).
Data rate for source $k$ (bandwidth $B = 1 \text{ MHz}$):
$$R_k(t) = B \log_2\!\left(1 + \frac{P_r\, g_k(t)}{\sigma_0^2}\right), \qquad \frac{P_r}{\sigma_0^2} = 10 \text{ dB}.$$
Total throughput per slot:
$$R_\Sigma(t) = \sum_{k=1}^{K} R_k(t).$$
Data collected in $[t, t+\Delta t]$:
$$d_A(t) = \frac{\Delta t}{2}\left[R_\Sigma(t) + R_\Sigma(t+\Delta t)\right].$$
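The per-slot rate and data computations can be transcribed directly from the formulas above. The snippet below is an illustrative Python sketch; converting $\beta_0$ and $P_r/\sigma_0^2$ from dB to linear scale before applying them is an assumption about how the parameters are used, not something stated in the paper.

```python
import numpy as np

B = 1e6                      # Hz, bandwidth per source
H = 120.0                    # m, UAV altitude
BETA0 = 10 ** (41.6 / 10)    # reference channel gain (41.6 dB -> linear), assumption
SNR_REF = 10 ** (10 / 10)    # Pr / sigma0^2 (10 dB -> linear), assumption

def channel_gain(z, z_k):
    """g_k(t) = beta0 / (H^2 + ||z - z_k||^2)."""
    d2 = (z[0] - z_k[0]) ** 2 + (z[1] - z_k[1]) ** 2
    return BETA0 / (H ** 2 + d2)

def sum_rate(z, sources):
    """R_Sigma(t): total rate over all K sources at UAV ground position z."""
    return sum(B * np.log2(1.0 + SNR_REF * channel_gain(z, zk)) for zk in sources)

def data_collected(z_now, z_next, sources, dt=0.5):
    """Trapezoidal approximation d_A(t) of the data collected over one slot."""
    return 0.5 * dt * (sum_rate(z_now, sources) + sum_rate(z_next, sources))
```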
2.3 Energy Model
Rotor power consumption [15, 16]:
$$P_f(\|\mathbf{v}\|) = P_0\left(1 + \frac{3\|\mathbf{v}\|^2}{V_{\text{tip}}^2}\right) + P_1\left(\sqrt{1 + \frac{\|\mathbf{v}\|^4}{4 v_0^4}} - \frac{\|\mathbf{v}\|^2}{2 v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A \|\mathbf{v}\|^3.$$
Parameters:
Symbol | Value | Description |
---|---|---|
$P_0$ | 7.2 W | Blade power (hover) |
$P_1$ | 4.8 W | Induced power (hover) |
$V_{\text{tip}}$ | 15 m/s | Blade tip speed |
$v_0$ | 4.8 m/s | Rotor induced velocity |
$d_0$ | 0.6 | Fuselage drag ratio |
$\rho$ | 1.225 kg/m³ | Air density |
$A$ | 0.0314 m² | Rotor disc area |
Power consumption:
- Hover ($\|\mathbf{v}\| = 0$): $P_{f}(0) = P_0 + P_1 = 12 \text{ W}$.
- Flight ($\|\mathbf{v}\| = V_0$): $P_{f}(V_0) \approx 25 \text{ W}$.
Energy consumed per slot:
$$\epsilon_A(t) = P_f(\|\mathbf{v}(t)\|)\,\Delta t.$$
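A direct transcription of the rotor power model is useful as a sanity check against the hover and cruise figures above. The rotor solidity $s$ does not appear in the parameter table, so a typical value is assumed here; with that assumption the computed cruise power lands near, but not exactly at, the quoted $\approx 25$ W.

```python
import math

# Rotor power parameters from the table above.
P0, P1 = 7.2, 4.8            # W, blade profile / induced power at hover
V_TIP, V0_IND = 15.0, 4.8    # m/s, blade tip speed and mean induced velocity at hover
D0, RHO, A = 0.6, 1.225, 0.0314  # drag ratio, air density (kg/m^3), rotor disc area (m^2)
S = 0.05                     # rotor solidity -- not listed in the table, assumed value

def rotor_power(v):
    """P_f(||v||): propulsion power at horizontal speed v (m/s)."""
    blade = P0 * (1.0 + 3.0 * v**2 / V_TIP**2)
    induced = P1 * math.sqrt(math.sqrt(1.0 + v**4 / (4.0 * V0_IND**4))
                             - v**2 / (2.0 * V0_IND**2))
    parasite = 0.5 * D0 * RHO * S * A * v**3
    return blade + induced + parasite

print(rotor_power(0.0))   # 12.0 W at hover (P0 + P1)
print(rotor_power(12.0))  # roughly 24 W at cruise with the assumed solidity
```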
2.4 Problem Formulation
Maximize total collected data under energy constraint $E_0$:
$$\max_{\{\mathbf{z}_n\}} \sum_{n=0}^{N_{\max}} d_{A,n} \quad \text{s.t.} \quad \sum_{n=0}^{N_{\max}} \epsilon_{A,n} \leq E_0, \quad \mathbf{z}_n \in [0, X_{\max}] \times [0, Y_{\max}].$$
3. Reinforcement Learning Framework
3.1 Markov Decision Process (MDP)
- State space $\mathcal{S}$: UAV position $\mathbf{z}_n = (x_n, y_n)$, discretized into $100 \times 100$ grid ($\Delta l = 6 \text{ m}$).
- Action space $\mathcal{A}$: 5 movement primitives.
- Reward function: Balances throughput gain and energy penalty:
$$r_{n+1} = \frac{w}{c_R}\underbrace{\left[R_\Sigma\big((n+1)\Delta t\big) - R_\Sigma(n\Delta t)\right]\Delta t}_{\text{Throughput gain}} - \frac{1-w}{c_E}\underbrace{\left[P_f\big(\|\mathbf{v}(n\Delta t)\|\big) - P_f(0)\right]\Delta t}_{\text{Energy penalty}}.$$
Normalization constants:
$$c_R = K B \Delta t, \qquad c_E = \left[P_f(V_0) - P_f(0)\right]\Delta t.$$
- Weight $w$: Trades off data rate ($w \rightarrow 1$) vs. energy efficiency ($w \rightarrow 0$).
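The normalized reward can be written down compactly. The sketch below assumes the `sum_rate` and `rotor_power` helpers from the earlier snippets and is only meant to make the weighting and normalization explicit, not to reproduce the authors' code.

```python
DT = 0.5   # s, slot duration
K = 6      # number of ground sources
B = 1e6    # Hz, bandwidth per source

C_R = K * B * DT                                    # throughput normalization c_R
C_E = (rotor_power(12.0) - rotor_power(0.0)) * DT   # energy normalization c_E

def reward(z_now, z_next, speed, sources, w):
    """r_{n+1}: weighted, normalized throughput gain minus energy penalty."""
    gain = (sum_rate(z_next, sources) - sum_rate(z_now, sources)) * DT
    penalty = (rotor_power(speed) - rotor_power(0.0)) * DT
    return (w / C_R) * gain - ((1.0 - w) / C_E) * penalty
```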
3.2 Q-Learning Algorithm
Algorithm 1: Optimal $w$ Search
- Initialize $w = 0$, step $c_w = 0.01$.
- While $w \leq 1$:
- Call Algorithm 2 to compute throughput $T_h(w)$.
- Update $w \leftarrow w + c_w$.
- Select $w^* = \arg \max_w T_h(w)$.
- Run Algorithm 2 with $w^*$ for optimal trajectory.
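Algorithm 1 is a plain grid search over $w$. A schematic version, assuming a `train_trajectory(w)` routine that implements Algorithm 2 and returns the achieved throughput:

```python
import numpy as np

def search_weight(train_trajectory, c_w=0.01):
    """Sweep w over [0, 1] in steps of c_w and keep the throughput-maximizing value."""
    weights = np.arange(0.0, 1.0 + c_w, c_w)
    throughputs = [train_trajectory(w) for w in weights]
    w_star = weights[int(np.argmax(throughputs))]
    return w_star, max(throughputs)
```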
Algorithm 2: Trajectory Optimization (Given $w$)
- Initialize Q-table $Q(\mathbf{z}, a) = 0$ $\forall \mathbf{z}, a$.
- Set $\epsilon = 0.9$, $\epsilon_{\min} = 0.05$, $\beta = 0.99$, $\gamma = 0.5$, $\alpha = 0.8$.
- For episode $= 1$ to $M$:
- $\mathbf{z}_0 = [0,0]^T$, $E = E_0$, $T_h = 0$.
- While $E > 0$:
- Choose action $a_n$ via $\epsilon$-greedy policy.
- Execute $a_n$, observe $\mathbf{z}_{n+1}$, $r_{n+1}$.
- Update Q-value:
$$Q(\mathbf{z}_n, a_n) \leftarrow (1-\alpha)\, Q(\mathbf{z}_n, a_n) + \alpha\left[r_{n+1} + \gamma \max_{a'} Q(\mathbf{z}_{n+1}, a')\right].$$
- Update $T_h \leftarrow T_h + d_{A,n}$, $E \leftarrow E - \epsilon_{A,n}$.
- Decay $\epsilon \leftarrow \beta \epsilon$.
- If $\epsilon < \epsilon_{\min}$: terminate.
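A minimal tabular Q-learning loop in the spirit of Algorithm 2 is sketched below. It assumes the `ACTIONS`, `move`, `data_collected`, `rotor_power`, and `reward` helpers from the earlier snippets, clips $\epsilon$ at $\epsilon_{\min}$ rather than terminating, and is a sketch under those assumptions rather than the authors' exact implementation.

```python
import random
from collections import defaultdict

def train_trajectory(w, sources, E0=1500.0, episodes=300,
                     alpha=0.8, gamma=0.5, eps=0.9, eps_min=0.05, beta=0.99,
                     dt=0.5, v0=12.0):
    """Tabular Q-learning over the discretized UAV positions for a given weight w."""
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    throughput = 0.0
    for _ in range(episodes):
        z, E, throughput = (0.0, 0.0), E0, 0.0
        while E > 0:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(list(ACTIONS))
            else:
                a = max(Q[z], key=Q[z].get)
            z_next = move(z, a)
            speed = 0.0 if a == "Hover" else v0
            r = reward(z, z_next, speed, sources, w)
            # Q-value update
            Q[z][a] = (1 - alpha) * Q[z][a] + alpha * (r + gamma * max(Q[z_next].values()))
            # Track collected data and remaining energy
            throughput += data_collected(z, z_next, sources, dt)
            E -= rotor_power(speed) * dt
            z = z_next
        eps = max(beta * eps, eps_min)
    return throughput
```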
4. Simulation Results
Parameters: $E_0 = 1500 \text{ J}$, $K=6$ sources, $600 \text{ m} \times 600 \text{ m}$ area.
4.1 Convergence & Weight Sensitivity
- Convergence: Episodic reward stabilizes after 200 episodes (Fig. 3).
- Optimal $w^*$: $w^* = 0.36$ maximizes throughput (Fig. 4):

$w$ | Throughput (Mbit) |
---|---|
0.0–0.2 | 1066 |
0.36 | 1317 (max) |
1.0 | 978 |
4.2 Trajectory Comparison
- Proposed ($w^*=0.36$): UAV moves to $(54, 156)$ then hovers.
- Greedy ($w=1.0$): UAV traverses to $(216, 342)$, exhausting energy faster.
Result: Proposed method collects 34.7% more data than greedy.
4.3 Energy Scalability
Throughput vs. initial energy $E_0$:
$E_0$ (J) | Proposed (Mbit) | Greedy (Mbit) | Static (Mbit) |
---|---|---|---|
500 | 520 | 490 | 510 |
1000 | 875 | 760 | 850 |
1500 | 1317 | 978 | 1066 |
2000 | 1750 | 1508 | 1420 |
Key insight: Proposed method outperforms baselines by 15.8–25.7% for $E_0 \geq 1000 \text{ J}$.
5. Conclusion
This work optimizes UAV trajectories for data acquisition under energy constraints using Q-learning. Key contributions include:
- Hybrid reward function balancing throughput and energy.
- Adaptive weight $w^*$ search maximizing throughput.
- 34.7% higher data collection vs. greedy baselines.
Future work: Extend to multi-UAV cooperative systems and dynamic source distributions.
Appendix: Key Symbols
Symbol | Description | Value/Unit |
---|---|---|
$H$ | UAV altitude | 120 m |
$\Delta t$ | Time slot duration | 0.5 s |
$V_0$ | UAV speed | 12 m/s |
$B$ | Bandwidth per source | 1 MHz |
$\beta_0$ | Reference channel gain | 41.6 dB |
$P_f(0)$ | Hover power | 12 W |
$P_f(V_0)$ | Cruise power | 25 W |
$w^*$ | Optimal weight | 0.36 |