Cooperative Trajectory Optimization and Task Offloading in Multi-UAV Assisted Edge Computing for Enhanced Fairness and Efficiency

The rapid evolution of the Internet of Things (IoT) has led to an exponential increase in computational demands from user devices (UDs), which often face limitations in processing power, battery life, and data security. To address these challenges, Unmanned Aerial Vehicle (UAV) technology has emerged as a promising solution, particularly when integrated with Mobile Edge Computing (MEC). By deploying UAVs equipped with MEC servers, it is possible to provide flexible and reliable computational offloading services, especially in remote or disaster-stricken areas where traditional infrastructure is unavailable. However, single-UAV systems often struggle with resource constraints in dense task environments, necessitating multi-UAV collaboration. This paper explores the joint optimization of UAV trajectories and task offloading decisions in a cooperative multi-UAV MEC system, focusing on minimizing energy consumption while maximizing user fairness via a federated reinforcement learning approach.

In typical IoT systems, UDs generate computation-intensive tasks that require timely processing. However, due to hardware limitations, these devices may offload tasks to nearby MEC servers. While ground-based MEC servers are common, they are ineffective in areas with poor connectivity. Here, drone technology offers a viable alternative, as UAVs can dynamically position themselves to serve UDs efficiently. The integration of multiple UAVs allows for load balancing and improved coverage, but it introduces complexities in trajectory planning, resource allocation, and fairness among users. Existing research often employs binary offloading, where tasks are entirely processed either locally or on a UAV, leading to suboptimal resource utilization. In contrast, this work adopts partial offloading, where tasks are partitioned into smaller units, enabling more granular resource management. Additionally, task migration between UAVs is introduced to handle load imbalances, ensuring that no single UAV is overwhelmed.

Data privacy and overhead are critical concerns in IoT networks. Federated learning (FL) has been widely adopted to address these issues by allowing UDs to train machine learning models locally without sharing raw data. However, traditional FL methods may suffer from slow convergence and inefficiencies in heterogeneous environments. To overcome these limitations, this paper proposes a Federated Reinforcement Learning algorithm combined with an Actor-Critic Network (FRLACN). This approach leverages the strengths of reinforcement learning (RL) to optimize decisions in a dynamic environment while preserving privacy through federated aggregation. The algorithm jointly optimizes UAV trajectories and offloading decisions to reduce total energy consumption and enhance user fairness, measured via the Jain fairness index.

The system model considers a discrete-time scenario where multiple UAVs fly at a fixed altitude, serving ground UDs. Each UAV’s position is updated based on its flight direction and distance, constrained to avoid collisions and ensure coverage. The communication model accounts for path loss between UDs and UAVs, incorporating Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) probabilities. Tasks are partitioned into equal segments, allowing partial offloading to UAVs or local execution. Task migration enables high-load UAVs to transfer tasks to underutilized ones, improving overall system efficiency. The energy consumption model includes components for transmission and computation, while the fairness model ensures equitable resource distribution among UDs.

The optimization problem aims to maximize a weighted sum of user fairness and energy efficiency, subject to constraints on UAV trajectories, resource capacities, and task deadlines. This problem is formulated as a non-convex mixed-integer nonlinear program, which is NP-hard and challenging to solve with traditional methods. Thus, the FRLACN algorithm is developed, modeling the problem as a Markov Decision Process (MDP). Each UAV acts as an intelligent agent, observing environmental states (e.g., positions, task loads, distances) and taking actions (e.g., flight angles and distances) to maximize cumulative rewards. The Actor-Critic networks facilitate policy learning, with federated aggregation ensuring privacy and convergence.

Simulations demonstrate that the proposed FRLACN algorithm outperforms baseline approaches like centralized DDPG, multi-agent DDPG, and traditional federated learning in terms of convergence speed, energy efficiency, and fairness. For instance, with 20 UDs, FRLACN reduces total energy consumption by over 11% and data transmission costs by 8.7% compared to FedAvg, while improving user fairness by up to 24% in certain scenarios. The results underscore the effectiveness of cooperative drone technology in edge computing systems, highlighting the importance of integrated optimization in real-world applications.

System Model and Problem Formulation

The system comprises a cloud server, multiple UDs, and a fleet of UAVs equipped with MEC servers, all operating in a discrete-time framework. Let there be $N$ UDs, indexed by $n \in \{1, 2, \dots, N\}$, and $M$ UAVs, indexed by $m \in \{1, 2, \dots, M\}$. Time is divided into slots $t \in \{1, 2, \dots, T\}$. Each UAV flies at a fixed altitude $H$, with its position in slot $t$ given by coordinates $(X_{m,t}, Y_{m,t}, H)$. The initial position is $(X_{m,0}, Y_{m,0}, H)$, and movement is governed by a flight angle $\alpha_{m,t} \in [0, 2\pi)$ and a flight distance $d_{m,t} \in [0, d_{\text{max}}]$, where $d_{\text{max}}$ is the maximum distance a UAV can cover in one slot. Thus, the coordinates evolve as:

$$X_{m,t} = X_{m,0} + \sum_{t'=1}^{t} d_{m,t'} \cos(\alpha_{m,t'})$$

$$Y_{m,t} = Y_{m,0} + \sum_{t'=1}^{t} d_{m,t'} \sin(\alpha_{m,t'})$$

Constraints ensure that UAVs remain within a square area of side length $l_{\text{max}}$:

$$0 \leq X_{m,t} \leq l_{\text{max}}, \quad 0 \leq Y_{m,t} \leq l_{\text{max}}$$

To prevent collisions, the distance between any two UAVs $m$ and $m'$ must satisfy:

$$d_{m,m',t} = \sqrt{(X_{m,t} - X_{m',t})^2 + (Y_{m,t} - Y_{m',t})^2} \geq d_{\text{min}}$$
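As a concrete illustration of the mobility model, the per-slot position update and the boundary and collision checks can be vectorized over all UAVs. The NumPy sketch below follows the equations above; the function and variable names are ours, not the paper's.

```python
import numpy as np

def step_uav_positions(pos, angles, dists, l_max, d_min):
    """Advance all UAV (x, y) positions by one slot and flag violations.

    pos:    (M, 2) array of current horizontal coordinates
    angles: (M,) flight angles alpha_{m,t} in [0, 2*pi)
    dists:  (M,) flight distances d_{m,t} in [0, d_max]
    """
    new_pos = pos + np.stack(
        [dists * np.cos(angles), dists * np.sin(angles)], axis=1
    )
    # Boundary constraint: 0 <= X, Y <= l_max
    out_of_bounds = np.any((new_pos < 0.0) | (new_pos > l_max), axis=1)
    # Collision constraint: every pairwise distance must be >= d_min
    diff = new_pos[:, None, :] - new_pos[None, :, :]
    pair_dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(pair_dist, np.inf)  # ignore self-distances
    too_close = np.any(pair_dist < d_min, axis=1)
    return new_pos, out_of_bounds, too_close
```

The violation flags feed naturally into the boundary and collision penalties used by the reward function in the algorithm section.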

Each UD $n$ has a fixed position $(x_n, y_n)$, and the horizontal distance to UAV $m$ is:

$$d_{n,m,t} = \sqrt{(X_{m,t} - x_n)^2 + (Y_{m,t} - y_n)^2}$$

For successful communication, this distance must not exceed the UAV’s coverage radius $d_{\text{cov}}$ (distinct from the per-slot flight distance bound $d_{\text{max}}$):

$$d_{n,m,t} \leq d_{\text{cov}}$$

The channel model incorporates LoS and NLoS components. The LoS probability is given by:

$$P_{\text{LoS}} = \frac{1}{1 + a \exp(-b(\theta – a))}$$

where $a$ and $b$ are environment-dependent parameters, and $\theta$ is the elevation angle in degrees, computed from the altitude and the horizontal distance:

$$\theta = \frac{180}{\pi} \arctan\left(\frac{H}{d_{n,m,t}}\right)$$

The path loss $L$ is calculated as:

$$L = L_{\text{FS}} + P_{\text{LoS}} \eta_{\text{LoS}} + (1 - P_{\text{LoS}}) \eta_{\text{NLoS}}$$

where $L_{\text{FS}}$ is the free-space path loss:

$$L_{\text{FS}} = 20 \log_{10}\left(\frac{4\pi f_c \sqrt{H^2 + d_{n,m,t}^2}}{c}\right)$$

Here, $f_c$ is the carrier frequency, $c$ is the speed of light, and $\eta_{\text{LoS}}$ and $\eta_{\text{NLoS}}$ are the mean additional losses of the LoS and NLoS links, respectively.
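The channel equations above compose into a single path-loss routine. The following NumPy sketch is a direct transcription of the stated model, with the elevation angle derived from the altitude and horizontal distance; all parameter names are illustrative.

```python
import numpy as np

def average_path_loss_db(h, d_horiz, f_c, a, b, eta_los_db, eta_nlos_db):
    """Probabilistic LoS/NLoS path loss (dB) between a UD and a UAV.

    h:       UAV altitude H (m)
    d_horiz: horizontal UD-UAV distance d_{n,m,t} (m)
    f_c:     carrier frequency (Hz)
    a, b:    environment-dependent LoS parameters
    """
    c = 3.0e8                                    # speed of light (m/s)
    d_3d = np.sqrt(h**2 + d_horiz**2)            # 3D UD-UAV distance
    theta = np.degrees(np.arctan2(h, d_horiz))   # elevation angle (degrees)
    p_los = 1.0 / (1.0 + a * np.exp(-b * (theta - a)))
    l_fs = 20.0 * np.log10(4.0 * np.pi * f_c * d_3d / c)  # free-space loss
    return l_fs + p_los * eta_los_db + (1.0 - p_los) * eta_nlos_db
```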

Each UD generates a computation task $O_{n,t} = \{C_{n,t}, D_{n,t}, T^{\text{max}}_{n,t}\}$, where $C_{n,t}$ is the number of CPU cycles required per bit of task data (the computation-time expressions below multiply it by the data volume), $D_{n,t}$ is the data size, and $T^{\text{max}}_{n,t}$ is the maximum tolerable delay. Each task is partitioned into $Q_n$ equal segments, allowing partial offloading. Let $\delta_{1,n}$, $\delta_{2,n}$, and $\delta_{3,n}$ denote the fractions of segments processed locally, offloaded to a UAV, and migrated to another UAV, respectively, with $\delta_{1,n} + \delta_{2,n} + \delta_{3,n} = 1$.

For communication, the data rate from UD $n$ to UAV $m$ is:

$$R_{n,m,t} = B \log_2\left(1 + \frac{\rho_n g_{n,m}}{\sigma^2 + \sum_{m' \neq m} \rho_n g_{m,m'}}\right)$$

where $B$ is the bandwidth, $\rho_n$ is the transmission power, $g_{n,m}$ is the channel gain, and $\sigma^2$ is the noise power. For task migration between UAVs $m$ and $m'$, the data rate is:

$$R_{m,m',t} = B \log_2\left(1 + \frac{\rho_m g_{m,m'}}{\sigma^2}\right)$$
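Both rate expressions are instances of the Shannon capacity with different interference terms, so a single helper covers them; this is a minimal sketch with illustrative names.

```python
import numpy as np

def shannon_rate(bandwidth, tx_power, gain, noise_power, interference=0.0):
    """Achievable rate (bit/s) for a link under the stated SINR model.

    gain is the linear channel gain; interference aggregates co-channel
    power from other transmitters (zero for the UAV-to-UAV link).
    """
    sinr = tx_power * gain / (noise_power + interference)
    return bandwidth * np.log2(1.0 + sinr)
```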

The computation model accounts for local processing, offloading to a UAV, and task migration. The local execution time for a task segment is:

$$T^{\text{loc}}_{n,t} = \frac{C_{n,t} D_{n,t}}{f_n Q_n}$$

where $f_n$ is the UD’s computational capability. The corresponding energy consumption is:

$$E^{\text{loc}}_{n,t} = \kappa_n (f_n)^2 C_{n,t} D_{n,t} / Q_n$$

with $\kappa_n$ as the effective switched capacitance. For offloading to a UAV, the transmission time is:

$$T^{\text{tran}}_{n,m,t} = \frac{D_{n,t}}{R_{n,m,t} Q_n}$$

and the transmission energy is:

$$E^{\text{tran}}_{n,m,t} = \rho_n T^{\text{tran}}_{n,m,t}$$

The computation time on UAV $m$ is:

$$T^{\text{com}}_{n,m,t} = \frac{C_{n,t} D_{n,t}}{f_{m,n} Q_n}$$

where $f_{m,n}$ is the resource allocated to UD $n$. The computation energy is:

$$E^{\text{com}}_{n,m,t} = \kappa_m (f_{m,n})^2 C_{n,t} D_{n,t} / Q_n$$

Thus, the total energy for offloading is:

$$E_{n,m,t} = E^{\text{com}}_{n,m,t} + E^{\text{tran}}_{n,m,t}$$

For task migration, the transmission time between UAVs is:

$$T^{\text{tran}}_{m,m',t} = \frac{D_{n,t}}{R_{m,m',t} Q_n}$$

with energy:

$$E^{\text{tran}}_{m,m',t} = \rho_m T^{\text{tran}}_{m,m',t}$$

The computation time on the recipient UAV $m'$ is:

$$T^{\text{com}}_{m,m',t} = \frac{C_{n,t} D_{n,t}}{f_{m',m} Q_n}$$

where $f_{m',m}$ is the computational resource that UAV $m'$ allocates to tasks migrated from UAV $m$. The corresponding energy is:

$$E^{\text{com}}_{m,m',t} = \kappa_{m'} (f_{m',m})^2 C_{n,t} D_{n,t} / Q_n$$

The total energy for migration is:

$$E_{m,m',t} = E^{\text{com}}_{m,m',t} + E^{\text{tran}}_{m,m',t}$$

The overall task completion time is the maximum over the three parallel processing paths (local execution, offloading, and migration), and must satisfy the deadline $T_{n,t} \leq T^{\text{max}}_{n,t}$:

$$T_{n,t} = \max\left(T^{\text{loc}}_{n,t},\; T^{\text{tran}}_{n,m,t} + T^{\text{com}}_{n,m,t},\; T^{\text{tran}}_{n,m,t} + T^{\text{tran}}_{m,m',t} + T^{\text{com}}_{m,m',t}\right)$$
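To make the bookkeeping concrete, the sketch below evaluates the per-segment delay and energy of the three processing paths and their parallel completion time, following the equations above; function and parameter names are illustrative.

```python
def segment_costs(c_bits, d_bits, q, f_loc, f_uav, f_uav2, r_up, r_mig,
                  kappa_ud, kappa_uav, p_ud, p_uav):
    """Per-segment delay and energy for the three processing paths.

    c_bits: CPU cycles per bit C_{n,t};  d_bits: task size D_{n,t} (bits)
    q: number of segments Q_n;  f_*: CPU frequencies (cycles/s)
    r_up / r_mig: UD->UAV and UAV->UAV data rates (bit/s)
    """
    seg = d_bits / q                      # bits per segment
    cyc = c_bits * seg                    # cycles per segment

    t_loc = cyc / f_loc                   # local execution time
    e_loc = kappa_ud * f_loc**2 * cyc     # local energy

    t_off = seg / r_up + cyc / f_uav      # offload to serving UAV m
    e_off = p_ud * (seg / r_up) + kappa_uav * f_uav**2 * cyc

    # Migration path: uplink, UAV-to-UAV forwarding, compute on UAV m'.
    t_mig = seg / r_up + seg / r_mig + cyc / f_uav2
    # E_{m,m',t}: forwarding plus remote computation (uplink energy is
    # booked under the offloading term, as in the model above).
    e_mig = p_uav * (seg / r_mig) + kappa_uav * f_uav2**2 * cyc

    t_total = max(t_loc, t_off, t_mig)    # paths proceed in parallel
    return t_total, e_loc, e_off, e_mig
```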

Resource constraints ensure that the computational resources UAV $m$ allocates to its associated UDs and to tasks migrated from other UAVs do not exceed its capacity $F^{\text{max}}_m$:

$$\sum_{n=1}^{N} f_{m,n} + \sum_{m' \neq m} f_{m,m'} \leq F^{\text{max}}_m$$

User fairness is quantified using the Jain fairness index. The total system throughput over $T$ time slots is:

$$R_{\text{total}} = \sum_{n=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} R_{n,m,t}$$

The fairness factor $\phi$ is defined as:

$$\phi = \frac{\left(\sum_{n=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} \frac{R_{n,m,t}}{D_{n,t}}\right)^2}{MN \sum_{n=1}^{N} \sum_{m=1}^{M} \left(\sum_{t=1}^{T} \frac{R_{n,m,t}}{D_{n,t}}\right)^2}$$

where $\phi$ ranges from 0 to 1, with higher values indicating better fairness.
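Under the reading that the index is taken over the $MN$ per-(UD, UAV) cumulative service ratios, the fairness factor transcribes directly; the $(N, M, T)$ input layout below is an assumption of this sketch.

```python
import numpy as np

def jain_fairness(ratios):
    """Jain fairness index over per-(UD, UAV) cumulative service ratios.

    ratios: (N, M, T) array of R_{n,m,t} / D_{n,t}. The index is computed
    over the N*M per-pair totals and lies in (0, 1], reaching 1 when all
    pairs receive identical cumulative service.
    """
    x = np.asarray(ratios, dtype=float).sum(axis=-1).ravel()  # sum over t
    return x.sum() ** 2 / (x.size * np.square(x).sum())
```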

The optimization problem aims to maximize the objective function $\Psi$:

$$\max \Psi = \mu_1 \phi - \mu_2 \sum_{t=1}^{T} \sum_{n=1}^{N} \left(E^{\text{loc}}_{n,t} + E_{n,m,t} + E_{m,m',t}\right)$$

subject to constraints C1–C10, including trajectory limits, collision avoidance, coverage, resource capacity, task deadlines, and fairness bounds. Here, $\mu_1$ and $\mu_2$ are weight factors balancing fairness and energy efficiency, and the inner sum accumulates the energy of each UD's local, offloading, and migration paths over its serving UAV $m$ and recipient UAV $m'$.

Algorithm Design: Federated Reinforcement Learning with Actor-Critic Networks

The formulated problem is complex and non-convex, making traditional optimization methods inefficient. Thus, we propose the FRLACN algorithm, which combines federated learning with reinforcement learning to optimize UAV trajectories and offloading decisions while preserving data privacy. The approach models the problem as a Markov Decision Process (MDP), where each UAV acts as an independent agent interacting with the environment.

The MDP is defined by the tuple $(s_{m,t}, a_{m,t}, r_{m,t})$, where $s_{m,t}$ is the state, $a_{m,t}$ is the action, and $r_{m,t}$ is the reward at time $t$ for agent $m$. The state space includes the UAV’s position, task data sizes, and inter-UAV distances:

$$s_{m,t} = (X_{m,t}, Y_{m,t}, H, D_{n,t}, d_{m,m',t})$$

The action space comprises the flight angle and distance:

$$a_{m,t} = (\alpha_{m,t}, d_{m,t})$$

The reward function incorporates the objective function and penalties for constraint violations:

$$r_{m,t} = \Psi - P_{\text{crash}} - P_{\text{cross}}$$

where $P_{\text{crash}}$ and $P_{\text{cross}}$ are penalties for collisions and boundary violations, respectively.
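A minimal sketch of the per-slot reward follows; the penalty magnitudes are arbitrary placeholders, since the text does not specify them.

```python
def reward(phi, e_total, crashed, crossed, mu1, mu2,
           p_crash=10.0, p_cross=10.0):
    """Per-agent reward: weighted objective minus constraint penalties.

    phi: Jain fairness index; e_total: energy consumed this slot;
    crashed / crossed: booleans for collision and boundary violations.
    p_crash and p_cross are illustrative values, not taken from the paper.
    """
    psi = mu1 * phi - mu2 * e_total
    return psi - (p_crash if crashed else 0.0) - (p_cross if crossed else 0.0)
```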

The FRLACN algorithm employs Actor-Critic networks for each agent. The Actor network generates actions from the current state, while the Critic network estimates the value of the visited states. During training, agents collect experience tuples $(s_{m,t}, a_{m,t}, r_{m,t}, s_{m,t+1})$ in a replay buffer $\Phi$. Local training occurs every $T_{\text{LT}}$ time slots, with agents updating their networks on sampled experiences: the Actor is optimized via the policy gradient, and the Critic via temporal-difference learning.
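Since the simulations run on TensorFlow, one local training step can be sketched as follows. The fixed-variance Gaussian policy, the network interfaces, and all names are assumptions of this sketch rather than the paper's exact design.

```python
import math
import tensorflow as tf

def local_update(actor, critic, actor_opt, critic_opt, batch,
                 gamma=0.95, sigma=0.1):
    """One actor-critic update on a replay batch (s, a, r, s_next).

    actor maps states to action means (flight angle, distance); critic
    maps states to scalar values V(s). sigma is a fixed exploration std,
    assumed here for simplicity.
    """
    s, a, r, s_next = batch

    # Critic: regress V(s) toward the temporal-difference target.
    with tf.GradientTape() as tape:
        v = tf.squeeze(critic(s), axis=-1)
        v_next = tf.squeeze(critic(s_next), axis=-1)
        td_target = r + gamma * tf.stop_gradient(v_next)
        critic_loss = tf.reduce_mean(tf.square(td_target - v))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor: policy gradient weighted by the TD advantage.
    advantage = tf.stop_gradient(td_target - v)
    with tf.GradientTape() as tape:
        mean = actor(s)
        # Log-density of a diagonal Gaussian policy with fixed std sigma.
        log_prob = -0.5 * tf.reduce_sum(
            tf.square((a - mean) / sigma)
            + tf.math.log(2.0 * math.pi * sigma ** 2), axis=-1)
        actor_loss = -tf.reduce_mean(log_prob * advantage)
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))
    return actor_grads  # later uploaded during federation
```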

Federated aggregation is performed every $T_{\text{fed}}$ rounds. Agents with rewards exceeding a threshold $r_{\text{th}}$ (e.g., the average reward from the previous round) upload their Actor gradients and rewards to a central server. The server aggregates these gradients using a weighted average:

$$G_{\text{server}} = \frac{1}{\sum_{m} \Omega^{\text{batch}}_{m}} \sum_{m} \left( \frac{r_{m,t}}{\sum_{m'} r_{m',t}} \right) \Omega^{\text{batch}}_{m} G_{m}$$

where $\Omega^{\text{batch}}_m$ is the batch size for agent $m$, and $G_m$ is the local gradient. The server then broadcasts the aggregated gradient to all agents. Agents that did not upload gradients undergo a test phase; if their performance improves, they retain their local models; otherwise, they adopt the server’s model.
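The weighted average maps directly to code. In the sketch below, each agent's gradient is represented as a list of per-layer NumPy arrays; this data layout is an assumption.

```python
import numpy as np

def aggregate_gradients(grads, rewards, batch_sizes):
    """Reward- and batch-size-weighted average of uploaded actor gradients.

    grads:       list over agents; each entry is a list of per-layer arrays
    rewards:     per-agent rewards r_{m,t} (assumed positive here)
    batch_sizes: per-agent batch sizes Omega_m
    """
    w = np.asarray(rewards, dtype=float) / np.sum(rewards)  # reward weights
    omega = np.asarray(batch_sizes, dtype=float)
    scale = w * omega / omega.sum()          # (r_m / sum r) * Omega_m / sum Omega
    n_layers = len(grads[0])
    return [
        sum(scale[m] * grads[m][k] for m in range(len(grads)))
        for k in range(n_layers)
    ]
```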

This federated approach reduces communication overhead and enhances privacy by limiting data exchange. The use of Actor-Critic networks allows for continuous action spaces, which is essential for precise trajectory control in drone technology. The algorithm pseudocode is summarized as follows (a runnable skeleton of one round appears after the list):

  • Input: $M$ agents, state space $s_{m,t}$, action space $a_{m,t}$, local training time $T_{\text{LT}}$, federation period $T_{\text{fed}}$, reward threshold $r_{\text{th}}$, initial network parameters $\omega_1$ (Actor) and $\omega_2$ (Critic), replay buffer $\Phi$.
  • Output: Personalized policies $\pi_m(s)$ and global model updates $G_{\text{server}}$.
  • For each federation round $g = 1$ to $T_{\text{global}}$:
    • Local training phase: For $x = 1$ to $T_{\text{LT}}$, each agent interacts with the environment, stores experiences in $\Phi$, and updates local networks.
    • Federation phase: Agents with $r_{m,t} \geq r_{\text{th}}$ upload gradients and rewards; server computes weighted average and broadcasts updates.
    • Validation phase: Agents not uploading gradients test local models; update policies based on performance.
  • End when convergence is achieved.
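Assembled end to end, one federation round might look like the skeleton below; the agent, environment, and server interfaces (act, store, local_update, aggregate, improves_on, and so on) are illustrative placeholders, not the paper's API.

```python
def frlacn_round(agents, env, server, t_lt, r_th):
    """One FRLACN federation round: local training, federation, validation."""
    states = env.reset()
    # Local training phase: interact for T_LT slots and update locally.
    for _ in range(t_lt):
        actions = [agent.act(s) for agent, s in zip(agents, states)]
        next_states, rewards = env.step(actions)
        for agent, s, a, r, s2 in zip(agents, states, actions,
                                      rewards, next_states):
            agent.store(s, a, r, s2)
            agent.local_update()
        states = next_states
    # Federation phase: only agents beating the reward threshold upload.
    uploads = [(ag.actor_grads(), ag.avg_reward(), ag.batch_size)
               for ag in agents if ag.avg_reward() >= r_th]
    g_server = server.aggregate(uploads)
    # Validation phase: non-uploaders keep their local model only if it
    # outperforms the broadcast one on a test rollout.
    for ag in agents:
        if ag.avg_reward() < r_th and not ag.improves_on(g_server):
            ag.load(g_server)
    # Next round's threshold: this round's average reward.
    return sum(ag.avg_reward() for ag in agents) / len(agents)
```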

The computational complexity of FRLACN is dominated by local training and federation. For each agent, the per-iteration complexity is $O(N_{\text{actor}} + N_{\text{critic}})$, where $N_{\text{actor}}$ and $N_{\text{critic}}$ are the network sizes. With $E$ local iterations, the complexity becomes $O(E(N_{\text{actor}} + N_{\text{critic}}))$. The federation step has complexity $O(M \cdot N_{\text{actor}})$. Thus, the overall complexity is $O(T_{\text{global}} [T_{\text{LT}} E (N_{\text{actor}} + N_{\text{critic}}) + M \cdot N_{\text{actor}}])$, where $T_{\text{global}}$ is the number of federation rounds. The space complexity is $O(|\Phi| + N_{\text{actor}} + N_{\text{critic}})$ locally and $O(M \cdot N_{\text{actor}})$ on the server.

Simulation Results and Analysis

To evaluate the proposed FRLACN algorithm, simulations were conducted in a Python environment with TensorFlow. The scenario involved a $1000 \, \text{m} \times 1000 \, \text{m}$ area with 4 UAVs flying at $H = 100 \, \text{m}$ and multiple UDs. Key parameters are summarized in Table 1.

Table 1: Simulation Parameters

| Parameter | Value |
| --- | --- |
| UAV computational resource $f_{m,n}$, $f_{m',m}$ | 10 GHz |
| UD computational resource $f_n$ | 2 GHz |
| Task CPU cycles $C$ | [1, 2.5] Gcycles |
| System bandwidth $B$ | 1 MHz |
| UD transmission power $\rho_n$ | 0.2 W |
| UAV transmission power $\rho_m$ | 1.0 W |
| Task data size $D$ | 100 MB |
| UD-UAV channel gain $g_{n,m}$ | -10 dB |
| UAV-UAV channel gain $g_{m,m'}$ | -40 dB |
| Carrier frequency $f_c$ | 2 GHz |
| UAV maximum speed $v_{\text{max}}$ | 15 m/s |

Table 2 outlines the algorithm-specific parameters used in training.

Table 2: Algorithm Parameters

| Parameter | Value |
| --- | --- |
| Hidden layer activation function | tanh |
| Output layer activation function | ReLU |
| Neurons per layer | 32 |
| Batch size | 64 |
| Discount factor | 0.95 |
| Replay buffer size $\Phi$ | 10,000 |
| Learning rate | 0.0001 |
| Local training time $T_{\text{LT}}$ | 100 rounds |
| Federation period $T_{\text{fed}}$ | $10\,T_{\text{LT}}$ |

The performance of FRLACN was compared against three baseline algorithms: Centralized DDPG (Centralized), Multi-Agent DDPG (MADDPG), and Federated Averaging (FedAvg). Figure 1 illustrates the average reward versus training rounds for 20 UDs. FRLACN achieves faster convergence and higher stability than the baselines, attributed to its dynamic threshold mechanism and Actor-Critic architecture. The federated aggregation reduces ineffective updates, while the Critic network provides accurate value estimates.

Figure 2 displays the optimized UAV trajectories for 20 UDs, showing how UAVs position themselves to maximize coverage and fairness. The trajectories avoid collisions and ensure that most UDs are within communication range, demonstrating the effectiveness of the proposed approach in real-time decision-making for Unmanned Aerial Vehicle systems.

Data transmission costs are analyzed in Figure 3. As the number of UDs increases, FRLACN maintains lower costs compared to other algorithms. For instance, with 25 UDs, FRLACN reduces costs by 17.21%, 48.47%, and 50.97% relative to FedAvg, MADDPG, and Centralized, respectively. This reduction is due to efficient task partitioning and migration, which minimize redundant transmissions.

Energy consumption is evaluated in Figure 4. FRLACN consistently consumes less energy, with savings of up to 70.39% over FedAvg, 68.21% over MADDPG, and 36.11% over Centralized at 20 UDs. The integration of task migration and local processing optimizes resource usage, highlighting the benefits of cooperative drone technology.

User fairness is plotted against time slots in Figure 5. FRLACN achieves higher fairness indices, improving by 4.49%, 24%, and 38.81% over FedAvg, MADDPG, and Centralized, respectively, at 15 time slots. The Jain index-based fairness model ensures equitable resource distribution, even for UDs with poor channel conditions.

Figure 6 compares cooperative and non-cooperative offloading. With task migration, FRLACN shows superior convergence and performance, as load balancing prevents UAV overloads. This underscores the importance of inter-UAV collaboration in large-scale deployments.

Conclusion and Future Directions

This paper presents a comprehensive framework for optimizing UAV trajectories and task offloading in multi-UAV assisted MEC systems. The proposed FRLACN algorithm leverages federated reinforcement learning to minimize energy consumption and enhance user fairness while addressing data privacy concerns. Simulations confirm its superiority over existing methods in terms of convergence, efficiency, and equity. The use of drone technology in edge computing is poised to revolutionize IoT applications, particularly in challenging environments.

Future research will focus on several avenues. First, adaptive federation strategies will be developed to handle dynamic network conditions, such as varying channel states and mobile UDs. This may involve real-time adjustment of aggregation periods and local training frequencies. Second, multi-objective optimization will be explored using game-theoretic approaches to balance energy, latency, and security. Integrating physical layer security techniques could mitigate privacy risks during task migration. Third, blockchain technology may be incorporated with federated learning to ensure traceability and integrity of model updates. Finally, digital twin models for UAV clusters could enable offline pre-training and online fine-tuning, reducing computational overhead in real-world deployments. These advancements will further solidify the role of Unmanned Aerial Vehicles in next-generation edge computing systems.
