In recent years, the rapid advancement of information technologies such as the Internet of Things and 5G has led to an exponential increase in the demand for real-time data processing and transmission. Mobile edge computing (MEC) has emerged as a pivotal architecture that moves computational tasks and data sources from the cloud to edge nodes near users, significantly enhancing real-time data handling. However, traditional ground-based MEC deployments lack flexibility and are susceptible to signal attenuation caused by obstacles, which can compromise service stability and user experience. To address these challenges, unmanned aerial vehicles (UAVs) have gained prominence in wireless communications owing to their high mobility, rapid deployment, and line-of-sight (LoS) transmission advantages. A UAV can be equipped with an edge server to provide computational offloading services, overcoming the inflexibility of terrestrial MEC networks. Nonetheless, UAV-assisted MEC systems introduce critical issues, such as the additional energy consumed by carrying and powering the edge server, heightened computational energy demands, and elevated complexity and cost. Moreover, ground-based interference sources, such as radio towers and high-power base stations, impose significant constraints on UAV operation. Optimizing the energy efficiency (EE) of UAV-assisted MEC systems in interference-prone environments is therefore imperative. This paper proposes a novel approach that maximizes system EE by jointly optimizing the UAV trajectory and user scheduling, leveraging deep reinforcement learning (DRL) to enhance performance under ground interference.
The system model considers a UAV-assisted MEC communication scenario with ground interference. Let $\mathcal{T}$ denote the set of $K$ fixed ground terminal users, with user $k$ located at $q_k = [x_k, y_k, 0]$ for $k \in \mathcal{T}$. Similarly, $\mathcal{M}$ denotes the set of $M$ fixed ground interferers, with interferer $m$ positioned at $q_m = [x_m, y_m, 0]$ for $m \in \mathcal{M}$. The UAV, equipped with an onboard MEC server, provides computational offloading services to the ground users. The task data size of user $k$ is denoted $D_k$; once all of its task data have been processed, the user stops requesting service. The UAV departs from the initial horizontal position $q_0 = [x_0, y_0]$ and flies at a fixed altitude $H_u$, so its position in time slot $n$ is $q_u[n] = [x_u[n], y_u[n], H_u]$. The total task execution time $T$ is divided into $N$ short time slots of length $t_n$, $n \in \mathcal{N} \triangleq \{1, \dots, N\}$, chosen small enough that all positions remain approximately constant within each slot.
The communication link model adopts free-space path loss for the air-to-ground channels. Let $d_{k,u}[n] = \|[x_k, y_k] - [x_u[n], y_u[n]]\|$ denote the horizontal distance between user $k$ and the UAV in time slot $n$, and define $d_{m,u}[n]$ analogously for interferer $m$. The channel power gain between user $k$ and the UAV is $g_{k,u}[n] = \beta_0 \left(d_{k,u}^2[n] + H_u^2\right)^{-1}$, where $\beta_0$ is the reference channel gain at a distance of 1 m. Similarly, the channel gain from interferer $m$ to the UAV is $j_{m,u}[n] = \beta_0 \left(d_{m,u}^2[n] + H_u^2\right)^{-1}$, assuming worst-case LoS conditions. The achievable uplink rate from user $k$ to the UAV is given by:
$$ r_{k,u}[n] = B \log_2 \left(1 + \frac{P_k g_{k,u}[n]}{\sum_{m=1}^{M} P_m j_{m,u}[n] + \sigma^2}\right) $$
where $B$ is the bandwidth allocated per user, $P_k$ and $P_m$ are the transmit powers of user $k$ and interferer $m$, respectively, and $\sigma^2$ is the additive white Gaussian noise (AWGN) power. The offloaded data bits received by the UAV in time slot $n$ are $R_{k,u}[n] = b_k[n] t_n r_{k,u}[n]$, where $b_k[n]$ is the user scheduling variable ($b_k[n] = 1$ if user $k$ is served, else $0$).
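For concreteness, the following sketch (Python/NumPy) evaluates the channel gains, the interference-limited uplink rate, and the bits offloaded in one slot. The helper names and the example positions are illustrative assumptions; the constants match the simulation values listed in Table 1.

```python
import numpy as np

B = 1e6                            # bandwidth per user B (Hz)
beta0 = 10 ** (-40 / 10)           # reference channel gain at 1 m (-40 dB)
sigma2 = 10 ** ((-160 - 30) / 10)  # AWGN power (-160 dBm converted to W)
P_k, P_m = 0.1, 0.1                # user / interferer transmit power (W)
H_u = 50.0                         # UAV altitude (m)
t_n = 1.0                          # slot length (s)

def channel_gain(ground_xy, uav_xy):
    """Free-space gain beta0 / (horizontal_distance^2 + H_u^2)."""
    d2 = np.sum((np.asarray(ground_xy) - np.asarray(uav_xy)) ** 2)
    return beta0 / (d2 + H_u ** 2)

def uplink_rate(user_xy, interferer_xy_list, uav_xy):
    """Achievable rate r_{k,u}[n] of the scheduled user under interference (bits/s)."""
    g = channel_gain(user_xy, uav_xy)
    interference = sum(P_m * channel_gain(q_m, uav_xy) for q_m in interferer_xy_list)
    return B * np.log2(1.0 + P_k * g / (interference + sigma2))

# Bits offloaded in one slot when user k is scheduled (b_k[n] = 1); positions are examples.
r = uplink_rate([200.0, 300.0], [[50.0, 450.0], [700.0, 150.0]], [150.0, 250.0])
R_bits = 1 * t_n * r
```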
The energy consumption model accounts for computational and flight energy; communication energy is neglected because its contribution is minor for small task sizes. The flight energy $E_{u,f}[n]$ follows a rotary-wing UAV power model for horizontal flight, capturing induced power and blade drag, multiplied by the slot length. It is expressed as:
$$ E_{u,f}[n] = t_n \left[ \frac{W^2 / (\sqrt{2}\rho A)}{\sqrt{\|(v_x[n], v_y[n])\|^2 + \sqrt{\|(v_x[n], v_y[n])\|^4 + 4V_h^4}}} + \frac{C_{D0} \rho A}{8} \|(v_x[n], v_y[n])\|^3 \right] $$
where $W = mg$ is the UAV weight, $\rho$ is air density, $A$ is the rotor disk area, $v_x[n]$ and $v_y[n]$ are horizontal velocities, $V_h = \sqrt{W/(2\rho A)}$ is the hover-induced velocity, and $C_{D0}$ is the drag coefficient. The computational energy for offloaded tasks is $E_{u,c}[n] = \sum_{k \in \mathcal{T}} \gamma_u R_{k,u}[n] (f_{k,u})^2$, where $\gamma_u$ is the effective capacitance coefficient and $f_{k,u}$ is the CPU frequency allocated to user $k$’s tasks. The total energy consumption is $E_{\text{total}} = \sum_{n=1}^{N} E_{u,f}[n] + \sum_{n=1}^{N} E_{u,c}[n]$, and the system EE is defined as:
$$ E_{\text{EE}} = \frac{\sum_{n=1}^{N} \sum_{k \in \mathcal{T}} R_{k,u}[n]}{E_{\text{total}}} $$
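The per-slot energy bookkeeping and the EE metric can be summarized as in the sketch below (Python/NumPy); the airframe and computation constants follow Table 1, while the function names and structure are an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

m_uav, g0 = 4.0, 9.8              # UAV mass (kg) and gravitational acceleration (m/s^2)
W = m_uav * g0                    # UAV weight (N)
rho, A = 1.225, 0.18              # air density (kg/m^3), rotor disk area (m^2)
C_D0 = 0.08                       # drag coefficient
gamma_u = 1e-22                   # effective capacitance coefficient
f_ku = 1e9                        # CPU frequency allocated to user k (Hz)
V_h = np.sqrt(W / (2 * rho * A))  # hover-induced velocity (m/s)

def flight_power(vx, vy):
    """Induced-power and blade-drag terms of the rotary-wing model (W)."""
    v2 = vx ** 2 + vy ** 2
    induced = (W ** 2 / (np.sqrt(2) * rho * A)) / np.sqrt(v2 + np.sqrt(v2 ** 2 + 4 * V_h ** 4))
    drag = (C_D0 * rho * A / 8.0) * v2 ** 1.5
    return induced + drag

def slot_energy(vx, vy, offloaded_bits, t_n=1.0):
    """E_{u,f}[n] + E_{u,c}[n] for one time slot (J)."""
    e_fly = t_n * flight_power(vx, vy)
    e_comp = gamma_u * offloaded_bits * f_ku ** 2
    return e_fly + e_comp

def energy_efficiency(bits_per_slot, energy_per_slot):
    """System EE: total offloaded bits divided by total consumed energy (bits/J)."""
    return np.sum(bits_per_slot) / np.sum(energy_per_slot)
```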
The optimization problem aims to maximize $E_{\text{EE}}$ by jointly optimizing the UAV trajectory $\mathbf{Q} \triangleq \{q_u[n]\}$ and user scheduling $\mathbf{B} \triangleq \{b_k[n]\}$, subject to constraints: (1) at most one user served per slot ($\sum_{k \in \mathcal{T}} b_k[n] \leq 1, b_k[n] \in \{0,1\}$), (2) all user tasks completed ($\sum_{n=1}^{N} R_{k,u}[n] \geq D_k$), and (3) UAV movement within boundaries ($x_{\text{min}} \leq x_u[n] \leq x_{\text{max}}, y_{\text{min}} \leq y_u[n] \leq y_{\text{max}}$). This problem is non-convex and complex, necessitating a DRL-based solution.
We formulate this as a Markov Decision Process (MDP) in which the UAV acts as the agent. The state $s[n]$ includes the UAV's position, the user and interferer locations, the remaining task data, the residual time, and the area boundaries. The action $a[n]$ comprises the UAV movement (e.g., forward, backward, left, right, hover) and the user scheduling decision. The reward $r[n]$ combines the per-slot EE with penalties for boundary violations, for scheduling users whose tasks are already complete, and for flying too close to interferers (governed by thresholds $S_0$ and $S_1$), plus a reward for completing all tasks. Specifically:
$$ r[n] = E_{\text{EE}}[n] - P_0 - P_1 - P_2 - P_3 + R_0 $$
where $P_0$ penalizes boundary violations, $P_1$ penalizes scheduling a user whose task is already complete, $P_2$ and $P_3$ penalize violations of the interferer-proximity thresholds $S_0$ and $S_1$, respectively, and $R_0$ rewards completing all user tasks.
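As a hedged illustration of this reward shaping, the sketch below applies the penalties and the completion bonus to the per-slot EE. The boolean trigger flags are assumptions about how the conditions above would be detected by the environment; the constants are those listed in Table 1.

```python
# Penalty/bonus constants from Table 1.
P0, P1, P2, P3, R0 = 50.0, 50.0, 10.0, 5.0, 10.0

def step_reward(ee_slot, out_of_bounds, served_finished_user,
                violates_S0, violates_S1, all_tasks_done):
    """Per-slot reward r[n]; the flag names are illustrative."""
    r = ee_slot
    if out_of_bounds:          # UAV left the allowed flight area
        r -= P0
    if served_finished_user:   # scheduled a user whose task is already complete
        r -= P1
    if violates_S0:            # condition tied to interferer-proximity threshold S_0
        r -= P2
    if violates_S1:            # condition tied to interferer-proximity threshold S_1
        r -= P3
    if all_tasks_done:         # bonus once every user's task is finished
        r += R0
    return r
```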
To solve this, we employ the Dueling Double Deep Q-Network (D3QN) algorithm, which combines double Q-learning, to reduce overestimation bias, with a dueling architecture that separates the state value from the action advantages. The Q-value is computed as:
$$ Q(s[n], a[n]; \theta, \alpha, \beta) = V(s[n]; \theta, \beta) + A(s[n], a[n]; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s[n], a'; \theta, \alpha) $$
where $\theta$ denotes the shared network parameters, $\beta$ and $\alpha$ parameterize the value and advantage streams, respectively, and $|\mathcal{A}|$ is the number of available actions. The target Q-value is:
$$ Q_n^{\text{target}} = \begin{cases}
r[n] + \gamma Q(s[n+1], a_{\text{max}}; \theta^{-}, \alpha^{-}, \beta^{-}) & \text{if } d[n] = 0 \\
r[n] & \text{if } d[n] = 1
\end{cases} $$
with $a_{\text{max}} = \arg\max_a Q(s[n+1], a; \theta, \alpha, \beta)$, discount factor $\gamma$, and terminal-state indicator $d[n]$. The loss function is $L(\theta, \alpha, \beta) = \mathbb{E}[(Q_n^{\text{target}} - Q(s[n], a[n]; \theta, \alpha, \beta))^2]$. Prioritized experience replay and Dropout regularization enhance stability and convergence.
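A minimal sketch of the resulting double-Q target and loss (PyTorch) is given below, assuming `online_net` and `target_net` map states to per-action Q-values; prioritized-replay importance weighting is omitted for brevity.

```python
import torch
import torch.nn.functional as F

gamma = 0.9  # discount factor

def d3qn_loss(online_net, target_net, batch):
    """Double-DQN temporal-difference loss for one sampled mini-batch."""
    s, a, r, s_next, done = batch  # states, actions (long), rewards, next states, done flags (float)
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double Q-learning: the online network selects the action,
        # the target network evaluates it.
        a_max = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_max).squeeze(1)
        q_target = r + gamma * q_next * (1.0 - done)
    return F.mse_loss(q_sa, q_target)
```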

In simulations, we compare our D3QN-based joint optimization scheme with three benchmarks: (1) D3QN with diagonal UAV trajectory, (2) D3QN with user-directed UAV trajectory, and (3) DDQN-based joint optimization. The environment parameters are summarized in Table 1.
| Parameter | Value |
|---|---|
| Number of time slots $N$ | 400 |
| Time slot length $t_n$ (s) | 1 |
| Flight grid size (m×m) | 10×10 |
| Bandwidth $B$ (MHz) | 1 |
| User/interferer power (W) | 0.1 |
| Reference channel gain $\beta_0$ (dB) | -40 |
| AWGN power $\sigma^2$ (dBm) | -160 |
| Effective capacitance $\gamma_u$ | 1×10^{-22} |
| CPU frequency $f_{k,u}$ (GHz) | 1 |
| Rotor disk area $A$ (m²) | 0.18 |
| Air density $\rho$ (kg/m³) | 1.225 |
| UAV mass (kg) | 4 |
| Drag coefficient $C_{D0}$ | 0.08 |
| Penalties $P_0, P_1, P_2, P_3$ | 50, 50, 10, 5 |
| Thresholds $S_0, S_1$ | -1.5, -0.5 |
| Reward $R_0$ | 10 |
The UAV operates in a 1000 m × 1000 m area at an altitude of 50 m, starting above the ground point [100, 100, 0]. Interferers are located at [50, 450, 0], [700, 150, 0], and [400, 880, 0]. Users are randomly distributed, with task sizes $D_k$ between 86 and 128 Mbits. The D3QN network consists of an input layer, four hidden layers of 128 neurons each, and an output layer, using Leaky ReLU activations and the Adam optimizer. Training runs for 9000 episodes with an experience replay buffer of 1,000,000 transitions, a mini-batch size of 2000, a learning rate of 0.001, and a discount factor of 0.9.
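One possible realization of this architecture is sketched below (PyTorch); the dropout rate and the `state_dim`/`num_actions` placeholders are assumptions, since the description fixes the layer widths and activation but not these values. The optimizer would then be `torch.optim.Adam(net.parameters(), lr=1e-3)`, with the discount factor 0.9 used in the target computation shown earlier.

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Four 128-neuron hidden layers with Leaky ReLU and Dropout, split into dueling streams."""
    def __init__(self, state_dim, num_actions, p_drop=0.2):  # p_drop is an assumed value
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 128), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.LeakyReLU(),
        )
        self.value = nn.Linear(128, 1)                # state-value stream V(s)
        self.advantage = nn.Linear(128, num_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, adv = self.value(h), self.advantage(h)
        # Dueling aggregation with the mean-advantage baseline.
        return v + adv - adv.mean(dim=1, keepdim=True)
```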
Simulation results demonstrate the superiority of our approach. The convergence analysis shows that D3QN achieves higher cumulative rewards than DDQN, stabilizing after approximately 12,000 iterations. Trajectory comparisons reveal that our scheme enables the UAV to intelligently avoid interferers while efficiently serving users, whereas the benchmarks suffer from unnecessary movements or poor interference handling. The system EE, evaluated via its cumulative distribution function (CDF), is significantly higher under our scheme; Table 2 summarizes the average EE values over multiple runs.
| Scheme | Average EE (bits/J) |
|---|---|
| D3QN with diagonal trajectory | 1.2 × 10^5 |
| D3QN with user-directed trajectory | 1.5 × 10^5 |
| DDQN joint optimization | 1.8 × 10^5 |
| Proposed D3QN joint optimization | 2.4 × 10^5 |
These results highlight how advanced DRL can optimize UAV operations in challenging environments. Integrating UAV-assisted MEC with intelligent trajectory planning and user scheduling ensures robust performance against ground interference, paving the way for applications in dynamic scenarios. Future work could explore multi-UAV collaboration or real-time adaptive interference mitigation to enhance scalability and resilience.
