The convergence of wireless communication and sensing functionalities into a unified framework, known as Integrated Sensing and Communication (ISAC), is pivotal for realizing the vision of 6G networks. ISAC systems offer superior spectral and hardware efficiency by sharing resources between these two core tasks. The integration of unmanned aerial vehicles (UAVs) into ISAC architectures unlocks significant potential for low-altitude applications. Their high mobility, rapid deployment capability, and propensity for Line-of-Sight (LoS) links make them ideal aerial platforms for dynamic service provisioning.

However, orchestrating a multi-UAV ISAC system poses substantial challenges. The system must manage limited onboard resources (power, service capacity) while adapting to dynamic environments such as user mobility. A key, often under-addressed metric in such systems is energy efficiency (EE), which is crucial for the prolonged operation of battery-constrained UAVs. Maximizing EE requires the intricate joint optimization of multiple continuous and discrete variables, including the flight trajectory of each UAV, the beamforming vectors for communication and sensing, and the UAV-user association. This is inherently a Mixed-Integer Non-Linear Programming (MINLP) problem with a fractional objective, making it NP-hard and intractable for conventional convex optimization methods.
To tackle this, we propose a novel two-step, data-driven optimization framework. First, we design a capacity-aware K-means clustering algorithm for dynamic, real-time UAV-user association. Second, we develop an enhanced Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) reinforcement learning algorithm. A key innovation is the decoupling of the high-dimensional beamforming action space into separate, lower-dimensional power and phase components. This, combined with a composite reward function and a Centralized Training with Decentralized Execution (CTDE) framework, enables stable and efficient learning. Our approach jointly optimizes UAV trajectories and beamforming to maximize the overall system EE under stringent communication and sensing constraints.
System Model and Problem Formulation
System Overview
We consider a multi-UAV ISAC system comprising $N$ UAVs, $K_c$ communication users (CUs), and $K_s$ sensing users (SUs). Each UAV is equipped with an $M$-antenna uniform linear array (ULA), while each ground user has a single omnidirectional antenna. The UAVs fly at a fixed altitude $H$. The system operates in a square service area over discrete time slots $t \in \{1, \dots, T\}$, each of duration $\tau$.
- UAV position: $ \mathbf{u}_n(t) = (x_n(t), y_n(t), H), \forall n \in \mathcal{N}=\{1,\dots,N\}$.
- CU position: $ \mathbf{q}^c_k(t) = (x^c_k(t), y^c_k(t), 0), \forall k \in \mathcal{K}_c=\{1,\dots,K_c\}$.
- SU position: $ \mathbf{q}^s_k(t) = (x^s_k(t), y^s_k(t), 0), \forall k \in \mathcal{K}_s=\{1,\dots,K_s\}$.
Ground users move according to a random walk model, with maximum speed $V_{max}$.
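For concreteness, the following is a minimal numpy sketch of one random-walk step for a ground user; the uniform sampling of heading and speed and the clipping at the area boundary are our assumptions, since the text only fixes the maximum speed $V_{max}$:

```python
import numpy as np

def random_walk_step(pos, v_max, tau, bounds):
    """One random-walk step: random heading, random speed in [0, v_max],
    clipped to the rectangular service area (an assumed boundary rule)."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)            # random heading
    speed = np.random.uniform(0.0, v_max)                  # random speed
    step = speed * tau * np.array([np.cos(theta), np.sin(theta)])
    (x_l, y_l), (x_u, y_u) = bounds
    return np.clip(pos + step, [x_l, y_l], [x_u, y_u])

# Example: one 0.1 s slot in a 500 m x 500 m area with V_max = 0.5 m/s
pos = random_walk_step(np.array([250.0, 250.0]), v_max=0.5, tau=0.1,
                       bounds=((0.0, 0.0), (500.0, 500.0)))
```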
Communication and Sensing Models
The channel between a UAV and a ground user is dominated by the LoS link. The channel vector from UAV $n$ to CU $k$ is:
$$
\mathbf{h}^c_{n,k}(t) = \sqrt{\alpha}\, d^{-1}_{n,k}(t)\, \mathbf{a}(\mathbf{u}_n(t), \mathbf{q}^c_k(t))
$$
where $\alpha$ is the reference channel gain at a distance of 1 m, $d_{n,k}(t)=||\mathbf{u}_n(t) - \mathbf{q}^c_k(t)||$ is the UAV-user distance, and $\mathbf{a}(\cdot) \in \mathbb{C}^{M \times 1}$ is the array response (steering) vector.
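A minimal sketch of this channel construction follows, assuming a half-wavelength ULA whose axis is aligned with the x-axis (the exact array geometry behind $\mathbf{a}(\cdot)$ is not specified in the text):

```python
import numpy as np

def steering_vector(u, q, M):
    """ULA response a(u, q) under the assumed half-wavelength spacing with
    the array axis along x: a_m = exp(j*pi*m*cos(psi)), m = 0..M-1."""
    d = np.linalg.norm(u - q)
    cos_psi = (q[0] - u[0]) / d          # direction cosine along the array axis
    return np.exp(1j * np.pi * np.arange(M) * cos_psi)

def los_channel(u, q, M, gain, path_exp=1):
    """LoS channel sqrt(gain) * d^(-path_exp) * a(u, q); path_exp = 1 for
    the CU link (gain = alpha) and 2 for the sensing link (gain = beta)."""
    d = np.linalg.norm(u - q)
    return np.sqrt(gain) * d ** (-path_exp) * steering_vector(u, q, M)
```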
The signal transmitted by UAV $n$ is $\mathbf{x}_n(t) = \sum_{k=1}^{K_c} \delta_{n,k}(t) \mathbf{w}^c_{n,k}(t) s^c_{n,k}(t)$, where $\delta_{n,k}(t) \in \{0,1\}$ is the association variable, $\mathbf{w}^c_{n,k}(t) \in \mathbb{C}^{M \times 1}$ is the beamforming vector, and $s^c_{n,k}(t)$ is the unit-power data symbol. The received Signal-to-Interference-plus-Noise Ratio (SINR) at CU $k$ is:
$$
\gamma^c_k(t) = \frac{\left|\sum_{n=1}^N \delta_{n,k}(t) (\mathbf{h}^c_{n,k}(t))^H \mathbf{w}^c_{n,k}(t)\right|^2}{\sum_{n=1}^N \sum_{l=1, l \neq k}^{K_c} \delta_{n,l}(t) \left|(\mathbf{h}^c_{n,k}(t))^H \mathbf{w}^c_{n,l}(t)\right|^2 + \sigma^2}
$$
The achievable data rate is $R^c_k(t) = B \log_2(1+\gamma^c_k(t))$, where $B$ is the bandwidth. A minimum rate constraint $R_{min}$ is enforced.
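The SINR and rate can be evaluated directly from these definitions; a small numpy helper follows (the function name and the nested-list layout of channels and beamformers are our conventions):

```python
import numpy as np

def cu_sinr_rate(H, W, delta, k, sigma2, B):
    """SINR and achievable rate of CU k.
    H[n][k]: channel (M,) from UAV n to CU k; W[n][l]: beamformer (M,) of
    UAV n for CU l; delta[n][l] in {0,1}: association; np.vdot = h^H w."""
    N, Kc = len(H), len(H[0])
    signal = sum(delta[n][k] * np.vdot(H[n][k], W[n][k]) for n in range(N))
    interference = sum(delta[n][l] * abs(np.vdot(H[n][k], W[n][l])) ** 2
                       for n in range(N) for l in range(Kc) if l != k)
    gamma = abs(signal) ** 2 / (interference + sigma2)
    return gamma, B * np.log2(1.0 + gamma)
```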
For sensing, the channel vector from UAV $n$ to SU $k$ is modeled as $\mathbf{h}^s_{n,k}(t) = \sqrt{\beta}\, d^{-2}_{n,k}(t)\, \mathbf{a}(\mathbf{u}_n(t), \mathbf{q}^s_k(t))$, where the squared-distance decay reflects the two-way propagation of the echo and $\beta$ incorporates the radar cross-section. The beamforming gain towards SU $k$, which reflects sensing performance, is:
$$
\Gamma^s_k(t) = \left| \sum_{n=1}^N \delta_{n,k}(t) (\mathbf{h}^s_{n,k}(t))^H \mathbf{w}^c_{n}(t) \right|^2
$$
where $\mathbf{w}^c_n(t)=\sum_{k=1}^{K_c} \mathbf{w}^c_{n,k}(t)$ is UAV $n$'s aggregate transmit beamformer. A minimum beamforming gain threshold $\Gamma_{min}$, scaled by the squared distance, is enforced (constraint C3 below).
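The sensing gain follows the same pattern, with each UAV's beamformers summed into $\mathbf{w}^c_n$ first; again a sketch under the data layout assumed above:

```python
import numpy as np

def sensing_gain(Hs, W, delta, k):
    """Gamma^s_k = |sum_n delta_{n,k} (h^s_{n,k})^H w_n|^2, where
    w_n = sum_l w_{n,l} is UAV n's aggregate transmit beamformer."""
    N = len(Hs)
    val = sum(delta[n][k] * np.vdot(Hs[n][k], sum(W[n])) for n in range(N))
    return abs(val) ** 2
```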
UAV Mobility and Energy Consumption Model
Each UAV's movement is constrained by a maximum speed $v_{max}$, a minimum safe inter-UAV distance $d_{min}$, and the service area boundaries $[x_l, x_u] \times [y_l, y_u]$.
The propulsion energy $E^f_n(t)$ consumed by a rotary-wing UAV over a time slot of duration $\tau$ is a function of its velocity $v_n(t)$:
$$
E^f_n(t) = \tau \left[ P_0 \left(1 + \frac{3||v_n(t)||^2}{U_{tip}^2} \right) + P_1 \left( \sqrt{1+\frac{||v_n(t)||^4}{4v_0^4}} - \frac{||v_n(t)||^2}{2v_0^2} \right)^{1/2} + \frac{1}{2} d_0 \rho s A ||v_n(t)||^3 \right]
$$
where $P_0$ and $P_1$ denote the blade profile power and induced power in hover, $U_{tip}$ is the rotor blade tip speed, $v_0$ is the mean rotor induced velocity in hover, $d_0$ is the fuselage drag ratio, $\rho$ is the air density, $s$ is the rotor solidity, and $A$ is the rotor disc area.
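A direct implementation of this propulsion model is sketched below; the default parameter values are the ones commonly cited in the rotary-wing UAV literature and are placeholders, since the text does not fix them:

```python
import numpy as np

def propulsion_energy(v, tau, P0=79.86, P1=88.63, U_tip=120.0, v0=4.03,
                      d0=0.6, rho=1.225, s=0.05, A=0.503):
    """Propulsion energy of a rotary-wing UAV over a slot of length tau.
    Defaults are commonly cited literature values, not the paper's."""
    v = np.linalg.norm(v)
    blade = P0 * (1.0 + 3.0 * v**2 / U_tip**2)                    # blade profile power
    induced = P1 * np.sqrt(np.sqrt(1.0 + v**4 / (4.0 * v0**4))
                           - v**2 / (2.0 * v0**2))                # induced power
    parasite = 0.5 * d0 * rho * s * A * v**3                      # parasite power
    return tau * (blade + induced + parasite)

# Example: energy for one 0.1 s slot at 5 m/s
E = propulsion_energy(np.array([5.0, 0.0]), tau=0.1)
```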
Energy Efficiency and Optimization Problem
The system EE at time slot $t$ is defined as the ratio of the sum rate of all CUs to the total propulsion energy consumed by all UAVs:
$$
\eta_{EE}(t) = \frac{ \sum_{k=1}^{K_c} R^c_k(t) }{ \sum_{n=1}^{N} E^f_n(t) }
$$
Our goal is to maximize the average EE over $T$ slots by jointly optimizing UAV trajectories, beamforming vectors, and user associations. The problem is formulated as:
$$
\max_{\substack{\mathbf{U}(t), \boldsymbol{\Delta}(t), \\ \boldsymbol{\Omega}(t), \mathbf{V}(t)}} \frac{1}{T} \sum_{t=1}^{T} \eta_{EE}(t)
$$
Subject to:
$$
\begin{aligned}
&\text{(C1): } \sum_{n=1}^{N} \delta_{n,k}(t) = 1, \forall k \in \mathcal{K}_c \cup \mathcal{K}_s \\
&\text{(C2): } R^c_k(t) \geq R_{min}, \forall k \in \mathcal{K}_c \\
&\text{(C3): } \Gamma^s_k(t) \cdot d^2_{n,k}(t) \geq \Gamma_{min}, \forall k \in \mathcal{K}_s, \forall n: \delta_{n,k}(t)=1 \\
&\text{(C4): } \sum_{k=1}^{K_c} ||\mathbf{w}^c_{n,k}(t)||^2 \leq P_{max}, \forall n \in \mathcal{N} \\
&\text{(C5): } ||\mathbf{u}_m(t) - \mathbf{u}_n(t)|| \geq d_{min}, \forall m \neq n \\
&\text{(C6): } x_l \leq x_n(t) \leq x_u,\ y_l \leq y_n(t) \leq y_u, \forall n \in \mathcal{N} \\
&\text{(C7): } 0 \leq ||v_n(t)|| \leq v_{max}, \forall n \in \mathcal{N}
\end{aligned}
$$
Here, $\mathbf{U}(t)=\{\mathbf{u}_n(t)\}$, $\boldsymbol{\Delta}(t)=\{\delta_{n,k}(t)\}$, $\boldsymbol{\Omega}(t)=\{\mathbf{w}^c_{n,k}(t)\}$, and $\mathbf{V}(t)=\{v_n(t)\}$.
Proposed Two-Step Solution Methodology
To solve this complex MINLP problem, we decompose it into two sequential steps: dynamic user association followed by joint trajectory and beamforming optimization.
Step 1: Capacity-Aware K-means for Dynamic User Association
Traditional K-means clustering associates each user with the nearest UAV based solely on distance, which can overload some UAVs beyond their service capacity. We therefore propose an enhanced algorithm that incorporates a per-UAV service capacity constraint.
First, we define a unified service demand $d_k$ for each user $k$:
$$
d_k = \begin{cases}
R_{min} / \lambda, & \text{if } k \in \mathcal{K}_c \\
\Gamma_{min}, & \text{if } k \in \mathcal{K}_s
\end{cases}
$$
where $\lambda$ is a normalization factor. Each UAV $n$ has a maximum service capacity $C^{max}_n$; its current load is $C_n = \sum_{k \in \mathcal{K}_n} d_k$, where $\mathcal{K}_n$ is the set of users associated with it.
The algorithm proceeds as follows:
- Initial clustering: Associate each user with the nearest UAV (centroid).
- Capacity check and reassignment: For any UAV $n$ with $C_n > C^{max}_n$:
  - Sort its associated users in descending order of their distance to $n$.
  - Iteratively select the farthest user $k$ and find the nearest alternative UAV $n'$ satisfying $C_{n'} + d_k \leq C^{max}_{n'}$.
  - Reassociate user $k$ to $n'$ and update both $C_n$ and $C_{n'}$.
- Centroid update: Update each UAV's centroid as the mean position of its newly associated users.
- Repeat the capacity check and centroid update until convergence or a maximum number of iterations is reached.
This process is executed periodically to adapt to user mobility, ensuring balanced and feasible task allocation among the UAVs; a compact sketch follows.
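The sketch below implements one association pass in numpy; tie-breaking, the iteration cap, and leaving un-offloadable users in place are our choices:

```python
import numpy as np

def capacity_aware_kmeans(users, demands, centroids, C_max, max_iters=20):
    """Capacity-aware K-means: nearest-centroid assignment, offloading of
    the farthest users from overloaded UAVs, then centroid update."""
    centroids = np.array(centroids, dtype=float)
    N = len(centroids)
    for _ in range(max_iters):
        dist = np.linalg.norm(users[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)                       # initial clustering
        load = np.bincount(assign, weights=demands, minlength=N)
        for n in range(N):
            members = np.where(assign == n)[0]
            for k in members[np.argsort(-dist[members, n])]:   # farthest first
                if load[n] <= C_max[n]:
                    break
                for n2 in np.argsort(dist[k]):             # nearest feasible alternative
                    if n2 != n and load[n2] + demands[k] <= C_max[n2]:
                        assign[k] = n2
                        load[n] -= demands[k]
                        load[n2] += demands[k]
                        break
        new_centroids = np.array([users[assign == n].mean(axis=0)
                                  if np.any(assign == n) else centroids[n]
                                  for n in range(N)])
        if np.allclose(new_centroids, centroids):          # converged
            break
        centroids = new_centroids
    return assign, centroids
```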
Step 2: Enhanced MATD3 for Trajectory and Beamforming Optimization
With fixed user associations from Step 1, we formulate the remaining continuous optimization as a Markov Decision Process (MDP) and solve it with an enhanced MATD3 algorithm, where each UAV acts as an independent agent.
State, Action, and Reward Design
Observation $o_n(t)$ for agent $n$: Its local observation includes its own 2D position and the channel gains to all CUs and SUs within a relevant range. The global state $S(t)$ aggregates all agents' observations.
Action $a_n(t)$ for agent $n$: This is a key innovation. Instead of directly outputting the high-dimensional complex beamforming vector $\mathbf{w}^c_{n,k}$, we decompose the action into manageable components:
$$
a_n(t) = [ \theta_n(t), v_n(t), \boldsymbol{\phi}_n(t), \mathbf{p}_n(t) ]
$$
- $\theta_n(t) \in [0, 2\pi)$: Flying heading direction.
- $v_n(t) \in [0, v_{max}]$: Flying speed.
- $\boldsymbol{\phi}_n(t) \in [0, 2\pi)^{M}$: Phase shifts for the $M$ antenna elements.
- $\mathbf{p}_n(t) \in [0, P_{max}]^{K_c}$: Power allocation factors for its associated CUs.
The beamforming vector for CU $k$ served by UAV $n$ is then reconstructed element-wise as $\mathbf{w}^c_{n,k}(t) = \sqrt{p_{n,k}(t)} \cdot e^{j \boldsymbol{\phi}_n(t)}$. This “phase-power” decoupling drastically reduces the per-agent action space from $\mathbb{C}^{M \times K_c}$ to $\mathbb{R}^{M + K_c + 2}$ real dimensions (phases, powers, heading, and speed).
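The reconstruction step is a one-liner per user; the sketch below also rescales the powers so that the per-UAV budget (C4) holds, a projection the constraint implies but the text does not spell out:

```python
import numpy as np

def reconstruct_beamformers(phi, p, P_max):
    """w_{n,k} = sqrt(p_{n,k}) * exp(j*phi_n); with unit-modulus entries,
    ||w_{n,k}||^2 = M * p_{n,k}, so p is scaled down if (C4) would be
    violated (the projection rule is our assumption)."""
    phi, p = np.asarray(phi), np.asarray(p, dtype=float)
    total = len(phi) * p.sum()
    if total > P_max:
        p *= P_max / total                   # project onto the power budget
    steer = np.exp(1j * phi)                 # shared unit-modulus phase profile
    return [np.sqrt(pk) * steer for pk in p] # list of K_c vectors, each (M,)
```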
Reward $r_n(t)$: We design a composite reward to maximize EE while satisfying constraints.
$$
r_n(t) = \alpha_1 \cdot \tilde{\eta}_{EE}(t) - \alpha_2 \cdot \rho_c(t) - \alpha_3 \cdot \rho_s(t) - \alpha_4 \cdot \rho_f(t)
$$
- $\tilde{\eta}_{EE}(t)$: Normalized instantaneous system EE.
- $\rho_c(t)$: Penalty for violating communication rate constraints (C2).
- $\rho_s(t)$: Penalty for violating sensing gain constraints (C3).
- $\rho_f(t)$: Penalty for UAV collisions or boundary violations (C5, C6).
The coefficients $\alpha_i$ balance these objectives. All agents share the same global reward derived from system EE to encourage cooperation.
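A sketch of the reward computation follows; the exact penalty shaping is not given in the text, so violation counts and the default weights are illustrative stand-ins:

```python
def composite_reward(ee_norm, rates, scaled_gains, pair_dists, in_bounds,
                     R_min, Gamma_min, d_min, alphas=(1.0, 0.1, 0.1, 0.5)):
    """Shared global reward: weighted normalized EE minus constraint
    penalties. Penalties count violations of (C2), (C3), and (C5)/(C6);
    the weights alphas and the counting rule are assumptions."""
    a1, a2, a3, a4 = alphas
    rho_c = sum(r < R_min for r in rates)              # (C2) rate violations
    rho_s = sum(g < Gamma_min for g in scaled_gains)   # (C3) sensing violations
    rho_f = (sum(d < d_min for d in pair_dists)        # (C5) collisions
             + sum(not b for b in in_bounds))          # (C6) out of bounds
    return a1 * ee_norm - a2 * rho_c - a3 * rho_s - a4 * rho_f
```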
Enhanced MATD3 Architecture and Training
We adopt the CTDE paradigm. Each UAV agent has:
- A decentralized Actor network ($\mu_n$) that takes the local state $o_n(t)$ and outputs the deterministic action $a_n(t)$.
- Two centralized Critic networks ($Q^1_n, Q^2_n$) used only during training. Each takes the global state $S(t)$ and the joint action of all agents $\mathbf{A}(t)$ to estimate the Q-value.
Key Enhancements over standard MATD3/DDPG:
- Double Critic & Clipped Target: To mitigate overestimation bias common with fractional EE objectives, we use two critics and take the minimum of their outputs for the target Q-value:
$$
y = r + \gamma \min_{i=1,2} Q^{i}_{n, \text{target}}(S', a'_1, \dots, a'_N)
$$
where $a'_n = \mu_{n, \text{target}}(o'_n) + \epsilon$ and $\epsilon$ is clipped target-policy noise.
- Delayed Policy Updates: The Actor networks are updated less frequently than the Critic networks (e.g., once every two Critic updates) to enhance stability.
- Target Network Soft Updates: All target networks ($\mu_{\text{target}}$, $Q_{\text{target}}$) are updated via soft replacement, $\theta_{\text{target}} \leftarrow \tau_u \theta + (1-\tau_u)\theta_{\text{target}}$, with update rate $\tau_u \ll 1$ (distinct from the slot length $\tau$).
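These enhancements map onto a few lines of code; a minimal numpy sketch with the target actors and critics passed in as callables (network internals abstracted away, default hyperparameters assumed):

```python
import numpy as np

def td3_target(r, obs_next, S_next, target_actors, target_critics,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target: smoothed target actions, then the minimum
    over the two target critics (mitigates Q-value overestimation)."""
    a_next = []
    for mu, o in zip(target_actors, obs_next):
        a = mu(o)
        eps = np.clip(np.random.normal(0.0, noise_std, a.shape),
                      -noise_clip, noise_clip)   # target policy smoothing
        a_next.append(a + eps)
    q1, q2 = (Q(S_next, a_next) for Q in target_critics)
    return r + gamma * min(q1, q2)

def soft_update(params, target_params, tau_u=0.005):
    """theta_target <- tau_u * theta + (1 - tau_u) * theta_target."""
    for key in params:
        target_params[key] = tau_u * params[key] + (1 - tau_u) * target_params[key]
```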
The training procedure is summarized below:
Algorithm: Enhanced MATD3 for Multi-UAV ISAC

```
 1: Initialize actor networks mu_n, critics Q1_n and Q2_n, and their target
    networks with random parameters; initialize replay buffer D
 2: for episode = 1 to E_max do
 3:     Reset environment, get initial state S(1)
 4:     for t = 1 to T do
 5:         Step 1: run capacity-aware K-means to get association Delta(t)
 6:         For each agent n, extract local observation o_n(t) from S(t)
 7:         Select action a_n(t) = mu_n(o_n(t)) + N_t   // exploration noise
 8:         Reconstruct beamformers: w_{n,k}(t) = sqrt(p_{n,k}(t)) * exp(j*phi_n(t))
 9:         Execute joint action A(t); observe reward r(t) and next state S'(t)
10:         Store transition (S(t), A(t), r(t), S'(t)) in D
11:         for agent n = 1 to N do
12:             Sample a random mini-batch B from D
13:             Update critics Q1_n, Q2_n by minimizing L = E_B[(y - Q_n(S, A))^2]
14:             if t mod d == 0 then                    // delayed policy update
15:                 Update actor mu_n using the sampled policy gradient
                    grad J ~ E_B[grad_{a_n} Q1_n(S, A)|_{a_n = mu_n(o_n)} grad mu_n(o_n)]
16:                 Soft-update target networks
17:             end if
18:         end for
19:     end for
20: end for
```
Simulation Results and Performance Analysis
We conduct simulations in a $500\,\text{m} \times 500\,\text{m}$ area to evaluate the proposed framework. Key parameters are listed below.
| Parameter | Value |
|---|---|
| Time slot length $\tau$ | 0.1 s |
| Number of UAVs $N$ | 3 |
| Number of CUs $K_c$ / SUs $K_s$ | 6 / 3 |
| UAV altitude $H$ | 150 m |
| Max UAV speed $v_{max}$ | 5 m/s |
| Max user speed $V_{max}$ | 0.5 m/s |
| Number of antennas $M$ | 4 |
| Max transmit power $P_{max}$ | 20 dBm |
| Bandwidth $B$ | 15 MHz |
| Noise power $\sigma^2$ | -110 dBm |
| Min communication rate $R_{min}$ | 5 Mbps |
| Min sensing gain $\Gamma_{min}$ | -60 dBm |
Convergence and Cumulative Reward
We compare the learning performance of our Enhanced MATD3 against several benchmarks: Standard MATD3 (without action decoupling), Multi-Agent DDPG (MADDPG), and a static Ground ISAC baseline with fixed terrestrial base stations. The cumulative reward over training episodes is the primary metric.
The results clearly demonstrate the superiority of our approach. The Enhanced MATD3 algorithm converges faster and to a significantly higher final reward than standard MATD3 and MADDPG, with smaller initial oscillations that indicate more stable learning. This improvement is directly attributable to the reduced, structured action space produced by phase-power decoupling, which makes the policy easier to learn. The Ground ISAC baseline performs worst: its fixed node positions incur higher path loss and leave no trajectory degrees of freedom with which to optimize EE.
Impact of Dynamic User Association
To highlight the necessity of our capacity-aware association, we plot the instantaneous reward over one episode with and without periodic re-association. As users move, the reward for a fixed association scheme continuously decays after about 150 slots because the initial assignment becomes suboptimal, violating constraints and triggering penalties. In contrast, our proposed method periodically triggers re-clustering, which causes a temporary dip (during reassignment) followed by a recovery to a high reward state. This demonstrates the algorithm’s ability to adapt and maintain high performance in dynamic environments.
Performance Versus System Scale and Dynamics
We analyze the performance under varying system conditions.
Communication Sum Rate vs. Number of UAVs: As the number of UAVs increases from 3 to 6, the sum rate increases for all schemes owing to greater spatial diversity and multiplexing gains. Our Enhanced MATD3 consistently achieves the highest sum rate, benefiting from effective multi-agent coordination and optimized beamforming; the ground baseline performs worst.
Energy Efficiency vs. User Mobility: We examine the normalized EE as the maximum user speed $V_{max}$ increases. EE degrades for all algorithms at higher mobility because channel state information becomes outdated faster, reducing beamforming accuracy. Our algorithm nonetheless exhibits the strongest robustness, maintaining the highest EE across all speeds, owing to the combined effect of dynamic association and the agents' ability to continuously adapt their trajectories.
Energy Efficiency vs. Number of UAVs: Finally, we evaluate how system EE scales with the number of UAVs. EE increases with more UAVs because the total data rate grows faster than the added propulsion energy cost, and the agents can better cover the area and manage interference. Our Enhanced MATD3 consistently outperforms the others, effectively balancing the trade-off between achieving high rates and minimizing the fleet's energy expenditure.
| Metric / Scenario | Enhanced MATD3 (Ours) | Standard MATD3 | MADDPG | Ground ISAC |
|---|---|---|---|---|
| Cumulative Reward | Highest | High | Medium | Low |
| Sum Rate (N=6) | ~38 Mbps | ~33 Mbps | ~30 Mbps | ~22 Mbps |
| EE Robustness (High Mobility) | Best | Good | Fair | Poor |
| Convergence Stability | Most Stable | Stable | Less Stable | N/A |
Conclusion
In this work, we addressed the challenge of maximizing energy efficiency in a multi-UAV ISAC system with mobile users. We formulated a joint optimization problem over UAV trajectories, beamforming, and user association. To solve this complex problem, we proposed a novel two-step framework. First, a capacity-aware adaptive K-means algorithm dynamically associates users with UAVs, ensuring load balance. Second, an enhanced MATD3 reinforcement learning algorithm performs the core trajectory and beamforming optimization. A major contribution is the decoupling of the beamforming action into separate power and phase components, which drastically reduces the learning complexity. The algorithm employs a composite reward function and a CTDE architecture to effectively maximize the fractional EE objective under realistic constraints.
Simulation results confirm that the proposed framework significantly outperforms benchmark approaches in cumulative reward, achievable sum rate, and energy efficiency, especially in dynamic scenarios with user mobility. The UAVs successfully learn cooperative policies that optimize their flight paths and communication resources. Future work may consider more complex channel models, partial observability, and the integration of recharging strategies for truly persistent UAV operation.
