Trajectory Optimization for Energy-Constrained Unmanned Aerial Vehicle Data Acquisition Systems Using Reinforcement Learning

In recent years, drone technology has revolutionized data collection in fields such as environmental monitoring, disaster response, and Internet of Things (IoT) networks. Unmanned Aerial Vehicles (UAVs) offer high mobility, flexibility, and cost-effectiveness compared to traditional ground-based methods. However, a critical challenge in UAV-aided systems is limited onboard energy, which constrains flight duration and data acquisition capability. This paper addresses the problem of optimizing the flight trajectory of an energy-constrained UAV serving as an aerial base station to collect data from ground-based sources. Our goal is to maximize the total system throughput, i.e., the total amount of data collected, before the UAV's energy is depleted. We model the problem as a Markov Decision Process (MDP) and propose a reinforcement learning approach based on Q-Learning to plan the UAV's path dynamically. By balancing communication rates against energy consumption, the proposed algorithm outperforms baseline strategies.

The proliferation of drone technology has enabled innovative applications in data acquisition, where UAVs act as mobile base stations. In scenarios such as remote sensing or emergency communications, UAVs can efficiently gather data from distributed sources. However, their energy constraints necessitate intelligent trajectory planning to maximize efficiency. Traditional methods often assume unlimited energy, but in practice UAVs have finite battery life, making energy-aware optimization essential. This work develops a reinforcement learning framework that optimizes a UAV's flight path by accounting for both data throughput and energy expenditure. We formulate the problem mathematically, design a reward function that incorporates these factors, and implement a Q-Learning algorithm to learn optimal policies. Simulation results demonstrate the effectiveness of the approach in enhancing system throughput under energy constraints.

To establish a foundation for our work, we first describe the system model. Consider a rectangular data collection area of dimensions $X_{\text{max}} \times Y_{\text{max}}$, with $K$ ground data sources randomly distributed within this region. A single UAV, flying at a constant altitude $H$, serves as an aerial base station that collects data from these sources. The UAV's motion is discretized into time slots of length $\Delta t$, during which it can perform one of five actions: move east, west, north, south, or hover. The flight speed is a constant $V_0$ meters per second, resulting in a displacement of $\Delta l = V_0 \Delta t$ per time slot. The ground area is divided into a grid of cells of size $\Delta l \times \Delta l$, and the UAV's position is represented by its ground-projected coordinates $\mathbf{z} = [x, y]^T$.
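
To fix notation, the following Python sketch encodes this discretized motion model. The numerical values are taken from the simulation section later in the paper and are used here purely for illustration; the helper names are our own.

```python
import numpy as np

# Illustrative grid-world encoding of the system model (values from the simulation section).
V0 = 12.0              # flight speed V_0 in m/s
DT = 0.5               # time-slot length Delta t in s
DL = V0 * DT           # displacement per slot, i.e. the grid cell size Delta l
X_MAX, Y_MAX = 600.0, 600.0   # data-collection area dimensions in m

# The five actions: east, west, north, south, hover (unit displacement vectors).
ACTIONS = {
    "east":  np.array([ 1,  0]),
    "west":  np.array([-1,  0]),
    "north": np.array([ 0,  1]),
    "south": np.array([ 0, -1]),
    "hover": np.array([ 0,  0]),
}

def step_position(z, action):
    """Advance the UAV's ground-projected position by one time slot,
    clipping to the boundary of the data-collection area."""
    z_next = z + DL * ACTIONS[action]
    return np.clip(z_next, [0.0, 0.0], [X_MAX, Y_MAX])
```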

The communication model accounts for the channel gain between the UAV and each data source. For data source $k$ at time $t$, the channel gain $g_k(t)$ is given by:

$$g_k(t) = \frac{\beta_0}{H^2 + \|\mathbf{z} - \mathbf{z}_{S,k}\|^2}$$

where $\beta_0$ is the reference channel gain at 1 meter, $\mathbf{z}_{S,k} = [x_{S,k}, y_{S,k}]^T$ is the position of data source $k$, and $\|\cdot\|$ denotes the Euclidean norm. The data sources use frequency division multiple access, each allocated a bandwidth $B$. The communication rate $R_k(t)$ between the UAV and data source $k$ at time $t$ is:

$$R_k(t) = B \log_2 \left(1 + \frac{P_T g_k(t)}{\sigma_0^2}\right)$$

where $P_T$ is the transmit power of the data sources, and $\sigma_0^2$ is the noise power. The total communication rate $R_{\Sigma}(t)$ is the sum over all $K$ sources:

$$R_{\Sigma}(t) = \sum_{k=1}^{K} R_k(t)$$

The data collected in a time slot starting at time $t$, denoted $\Delta d(t)$, depends on the UAV's action. If the UAV hovers, $\Delta d(t) = R_{\Sigma}(t) \Delta t$. If it moves, we approximate the data collected over the slot by averaging the rates at the two slot boundaries (a trapezoidal approximation):

$$\Delta d(t) = \frac{R_{\Sigma}(t) + R_{\Sigma}(t + \Delta t)}{2} \Delta t$$
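
To make the communication model concrete, the following sketch evaluates the channel gain, the sum rate, and the per-slot data volume. The parameter values mirror those quoted later in the simulation section (converted from dB to linear scale), while the function names are our own.

```python
import numpy as np

# Illustrative parameters (simulation-section values, converted from dB where needed).
H = 120.0                         # UAV altitude in m
B = 1e6                           # bandwidth per source in Hz
BETA_0 = 10 ** (41.6 / 10)        # reference channel gain at 1 m (linear scale)
PT_OVER_N0 = 10 ** (10 / 10)      # transmit SNR P_T / sigma_0^2 (linear scale)

def sum_rate(z, sources):
    """Total rate R_Sigma at ground position z, for a (K, 2) array of source positions."""
    dist_sq = np.sum((sources - z) ** 2, axis=1)           # squared ground distance to each source
    gain = BETA_0 / (H ** 2 + dist_sq)                     # channel gain g_k
    return np.sum(B * np.log2(1.0 + PT_OVER_N0 * gain))    # sum of per-source rates R_k

def data_per_slot(z_now, z_next, sources, dt=0.5, hovering=False):
    """Data collected in one slot: exact when hovering, trapezoidal approximation when moving."""
    if hovering:
        return sum_rate(z_now, sources) * dt
    return 0.5 * (sum_rate(z_now, sources) + sum_rate(z_next, sources)) * dt
```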

Energy consumption is the other critical component of the system model. For a rotary-wing UAV, the flight power $P_{\text{fly}}(v)$ as a function of speed $v$ is modeled as:

$$P_{\text{fly}}(v) = P_0 + P_1 \left( \sqrt{1 + \frac{v^4}{4v_0^4}} - \frac{v^2}{2v_0^2} \right) + \frac{1}{2} d_0 \rho s A v^3$$

where $P_0$ and $P_1$ are the blade profile and induced powers in hover, $v_0$ is the induced velocity in hover, $d_0$ is the fuselage drag ratio, $\rho$ is air density, $s$ is rotor solidity, and $A$ is rotor disc area. The hover power is $P_{\text{fly}}(0) = P_0 + P_1$, and the power at speed $V_0$ is $P_{\text{fly}}(V_0)$. The energy consumed in a time slot, $\Delta e(t)$, is:

$$\Delta e(t) = P_{\text{fly}}(v(t)) \Delta t$$

where $v(t)$ is the UAV’s speed at time $t$, either $V_0$ (moving) or 0 (hovering).
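
A direct transcription of this power model into Python might look as follows. The numerical constants are placeholders chosen for illustration only and are not the values used in our simulations.

```python
import math

# Placeholder rotary-wing parameters (illustrative only, not the simulation values).
P0, P1 = 8.0, 4.0       # blade profile and induced power in hover, W
V_IND = 4.0             # mean rotor induced velocity in hover v_0, m/s
D0, RHO, S, A = 0.6, 1.225, 0.05, 0.03   # fuselage drag ratio, air density, rotor solidity, disc area

def p_fly(v):
    """Propulsion power at horizontal speed v (W), per the rotary-wing model above."""
    induced = P1 * (math.sqrt(1.0 + v ** 4 / (4.0 * V_IND ** 4)) - v ** 2 / (2.0 * V_IND ** 2))
    parasite = 0.5 * D0 * RHO * S * A * v ** 3
    return P0 + induced + parasite

def energy_per_slot(v, dt=0.5):
    """Energy consumed in one time slot at speed v (either V_0 when moving or 0 when hovering)."""
    return p_fly(v) * dt
```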

The optimization problem is to maximize the total data collected until the UAV’s initial energy $E_0$ is depleted, subject to boundary constraints. Let $T_{\text{max}}$ be the maximum flight time, and discretize time as $t = n \Delta t$ for $n = 0, 1, 2, \dots, N_{\text{max}}$ with $N_{\text{max}} = T_{\text{max}} / \Delta t$. The problem formulation is:

$$\begin{aligned}
\text{P1:} \quad & \max_{\{\mathbf{z}_n\}} \sum_{n=0}^{N} \Delta d_n \\
\text{s.t.} \quad & \sum_{n=0}^{N} \Delta e_n \leq E_0 \\
& x_n \in [0, X_{\text{max}}], \quad y_n \in [0, Y_{\text{max}}]
\end{aligned}$$

where $\mathbf{z}_n = [x_n, y_n]^T$ is the UAV’s position at time step $n$, $\Delta d_n$ and $\Delta e_n$ are the data collected and energy consumed in time slot $n$, and $N$ is the last time step before energy depletion.

To solve P1, we model it as an MDP. The state space $\mathcal{S}$ consists of all possible UAV positions on the grid, where $I_x = X_{\text{max}} / \Delta l$ and $I_y = Y_{\text{max}} / \Delta l$ are the numbers of grid intervals in the x and y directions, respectively. Thus, $\mathcal{S} = \{ \mathbf{z} = [i \Delta l, j \Delta l]^T : i = 0, 1, \dots, I_x; j = 0, 1, \dots, I_y \}$. The action space $\mathcal{A}$ comprises the five movements: east, west, north, south, and hover. The reward function is designed to balance data throughput against energy consumption:

$$r_{n+1} = w \left( \frac{R_{\Sigma}((n+1)\Delta t) - R_{\Sigma}(n \Delta t)}{c_R} \right) - (1 - w) \left( \frac{P_{\text{fly}}(v(n \Delta t)) - P_{\text{fly}}(0)}{c_E} \right)$$

where $w \in [0,1]$ is a weight factor, and $c_R$ and $c_E$ are normalization constants that balance the scales of the two terms. Specifically, $c_R$ is set to the maximum possible rate difference between adjacent positions, and $c_E$ to the power difference between flying and hovering. This reward encourages the UAV to move toward areas of high data rate while penalizing excessive energy use.
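
As a concrete sketch, this reward can be computed from the quantities defined earlier; `sum_rate` and `p_fly` refer to the illustrative helpers in the previous snippets, and the normalization constants are passed in explicitly.

```python
def reward(z_now, z_next, v, w, sources, c_r, c_e):
    """Weighted reward: normalized rate improvement minus normalized extra flight power.
    Reuses sum_rate() and p_fly() from the earlier sketches."""
    rate_term = (sum_rate(z_next, sources) - sum_rate(z_now, sources)) / c_r
    energy_term = (p_fly(v) - p_fly(0.0)) / c_e
    return w * rate_term - (1.0 - w) * energy_term
```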

We employ Q-Learning, a model-free reinforcement learning algorithm, to learn the optimal policy. The Q-function $Q(\mathbf{z}, a)$ represents the expected cumulative reward for taking action $a$ in state $\mathbf{z}$. The algorithm updates Q-values iteratively based on the Bellman equation:

$$Q(\mathbf{z}_n, a_n) \leftarrow (1 - \alpha) Q(\mathbf{z}_n, a_n) + \alpha \left[ r_{n+1} + \gamma \max_{a'} Q(\mathbf{z}_{n+1}, a') \right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_{n+1}$ is the immediate reward. We use an $\epsilon$-greedy policy for exploration: with probability $\epsilon$, choose a random action; otherwise, choose the action that maximizes $Q(\mathbf{z}, a)$. Over time, $\epsilon$ decays to favor exploitation.
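
A minimal tabular sketch of these two ingredients follows. The Q-table is assumed to be a NumPy array indexed by grid cell and action index, and the default hyperparameter values simply mirror the ones quoted later in the simulation section.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action index with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[-1]))
    return int(np.argmax(Q[state]))

def q_update(Q, state, action, reward_value, next_state, alpha=0.8, gamma=0.5):
    """One tabular Q-Learning update following the Bellman equation above.
    state and next_state are (i, j) grid-index tuples; Q has shape (I_x + 1, I_y + 1, 5)."""
    target = reward_value + gamma * np.max(Q[next_state])
    Q[state][action] = (1.0 - alpha) * Q[state][action] + alpha * target
```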

The proposed algorithm consists of two parts: an outer loop that searches for the optimal weight $w$, and an inner loop that implements Q-Learning for a given $w$. The inner algorithm initializes parameters, simulates episodes from the start until energy depletion, and updates Q-values. The outer algorithm varies $w$ from 0 to 1, selects the value that maximizes total data collected, and returns the corresponding trajectory.
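
The two-level structure can be sketched as follows. Here `run_q_learning` is a hypothetical stand-in for the inner routine (Q-Learning episodes for a fixed $w$ until energy depletion, assumed to return the total data collected and the resulting trajectory), and the weight grid is an assumed discretization, not one specified in the text.

```python
import numpy as np

def optimize_weight(run_q_learning, weights=np.linspace(0.0, 1.0, 26)):
    """Outer search: run the inner Q-Learning routine for each candidate weight w
    and keep the weight and trajectory that collect the most data.
    run_q_learning(w) is assumed to return (total_data, trajectory)."""
    best = (-np.inf, None, None)
    for w in weights:
        total_data, trajectory = run_q_learning(w)    # inner loop: Q-Learning for fixed w
        if total_data > best[0]:
            best = (total_data, w, trajectory)
    return best   # (best data volume, best weight w, corresponding trajectory)
```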

We now present simulation results to validate our approach. The UAV starts at the origin $(0,0)$, flying at $H = 120$ m over a $600 \times 600$ m area. Time slots are $\Delta t = 0.5$ s and the speed is $V_0 = 12$ m/s, giving $\Delta l = 6$ m and a $100 \times 100$ grid. The UAV's weight is 2 N (corresponding to a mass of roughly 204 g), its rotor radius is 0.1 m, and the remaining propulsion parameters follow standard rotary-wing models, yielding a hover power $P_{\text{fly}}(0) = 12$ W and a flight power $P_{\text{fly}}(12) \approx 25$ W. There are $K = 6$ randomly placed data sources, each with bandwidth $B = 1$ MHz, reference gain $\beta_0 = 41.6$ dB, and transmit SNR $P_T / \sigma_0^2 = 10$ dB. The reinforcement learning parameters are a discount factor $\gamma = 0.5$, a learning rate $\alpha = 0.8$, and a decay factor $\beta = 0.99$ for $\epsilon$.

We compare our method to two baselines: (1) Static Hovering: The UAV remains at the initial position ($w=0$); (2) Greedy Strategy: The UAV always moves toward the position with the highest immediate communication rate ($w=1$).

The convergence of our Q-Learning algorithm is shown in Figure 3. For weights $w = 0.3$ and $w = 0.8$, the cumulative reward per episode increases with training and stabilizes after a number of episodes, indicating convergence. This demonstrates that the learning approach adapts effectively to the environment.

Table 1 summarizes the system parameters used in simulations:

| Parameter | Value |
| --- | --- |
| Area dimensions | $600 \times 600$ m |
| UAV altitude $H$ | 120 m |
| Time slot $\Delta t$ | 0.5 s |
| UAV speed $V_0$ | 12 m/s |
| Grid cell size $\Delta l$ | 6 m |
| Hover power $P_{\text{fly}}(0)$ | 12 W |
| Flight power $P_{\text{fly}}(12)$ | 25 W |
| Number of data sources $K$ | 6 |
| Bandwidth per source $B$ | 1 MHz |
| Reference gain $\beta_0$ | 41.6 dB |
| Transmit SNR $P_T / \sigma_0^2$ | 10 dB |
| Learning rate $\alpha$ | 0.8 |
| Discount factor $\gamma$ | 0.5 |

The impact of the weight $w$ on data collection is illustrated in Figure 4. For $w \leq 0.2$, the data collected is constant at 1066 Mbit. As $w$ increases from 0.2 to 0.36, data collection rises to a maximum of 1317 Mbit. Beyond $w=0.36$, it decreases, reaching 978 Mbit at $w=1$. This highlights the importance of optimizing $w$ to balance rate and energy.

Figure 5 depicts example trajectories for our algorithm (with optimal $w=0.36$) and the greedy strategy. Both start from $(0,0)$ and move to $(54,156)$, but our algorithm hovers there to conserve energy, while the greedy strategy continues to $(216,342)$, consuming more energy for marginal gains. This visualizes how our method achieves better efficiency.

The effect of initial energy $E_0$ on data collection is shown in Figure 6. As $E_0$ increases, the data collected grows for all methods. For low $E_0$ (e.g., $\leq 1000$ J), our algorithm performs similarly to static hovering. At $E_0 = 1500$ J, our method collects 1317 Mbit versus 978 Mbit for the greedy strategy, which therefore gathers about 25.7% less data. At $E_0 = 2000$ J the gap narrows to 15.8%, but our approach consistently outperforms both baselines, demonstrating its robustness across energy levels.

In conclusion, we have developed a reinforcement learning-based trajectory optimization algorithm for energy-constrained UAV data acquisition systems. By formulating the problem as an MDP and using Q-Learning with a weighted reward function, we effectively maximize system throughput. Simulations confirm the algorithm's convergence and its superiority over baseline strategies. Future work will extend this framework to multi-UAV scenarios and incorporate real-time environmental adaptation. As UAV platforms continue to evolve, such energy-aware algorithms will broaden their applicability to diverse data collection tasks.

The integration of UAVs into data acquisition systems represents a significant step forward. They offer unparalleled flexibility, but energy constraints remain a pivotal challenge. Our results demonstrate that intelligent trajectory planning, powered by reinforcement learning, can substantially improve performance. As UAV capabilities expand, further research into energy-efficient algorithms will be crucial for sustainable and effective deployments.
