Unmanned Drone Trajectory and Resource Optimization via Multi-Agent Reinforcement Learning

The field of wireless communications is undergoing a paradigm shift towards the integration of traditionally disparate functionalities. Unmanned drone platforms, formally known as Unmanned Aerial Vehicles (UAVs), are at the forefront of this evolution. Their inherent advantages—such as high mobility, swift deployment, and the propensity for establishing dominant Line-of-Sight (LoS) links—make them ideal candidates for serving as aerial base stations. This is particularly valuable for providing expansive coverage in remote areas, disaster-stricken zones, or during large-scale public events where terrestrial infrastructure is lacking, damaged, or overloaded.

Building upon this potential, the concept of Integrated Sensing and Communication (ISAC) has emerged as a cornerstone for next-generation networks. ISAC seeks to amalgamate communication and radar sensing capabilities into a unified hardware and spectral framework. This co-design promises significant gains in spectral efficiency, hardware utilization, and reduced system size, weight, and power (SWaP) compared to operating two separate, dedicated systems. An unmanned drone, acting as a mobile ISAC platform, can dynamically sense its environment to locate users or detect obstacles while simultaneously delivering high-quality communication services to those users. This dual functionality is crucial for applications like search and rescue, precision agriculture, and smart city management, where situational awareness and data transfer are required concurrently.

However, a single unmanned drone faces inherent limitations in terms of onboard energy, communication bandwidth, and processing power. To overcome these constraints and scale the network coverage and capacity, deploying a fleet or swarm of multiple unmanned drones becomes necessary. This multi-drone approach introduces a complex, high-dimensional optimization challenge. The system performance—encompassing both communication throughput and sensing accuracy—becomes a function of the intricate interplay between several dynamically coupled variables: the three-dimensional (3D) flight path of each unmanned drone, the association (or scheduling) between drones and ground users (GUs), and the allocation of transmit power among them. Crucially, optimizing for communication performance alone, such as maximizing data rates, could inadvertently degrade the sensing performance for certain users, violating principles of fairness and service guarantee.

Therefore, our core objective is to design a system where a fleet of unmanned drones collaboratively provides ISAC services to a set of ground users. We aim to maximize the overall communication quality of service, specifically the minimum average communication rate among all users (max-min fairness), while strictly ensuring that each user receives a minimum level of sensing performance. This leads to a complex, non-convex, mixed-integer optimization problem involving continuous variables (3D positions, power), discrete variables (user association), and stringent constraints (e.g., collision avoidance, power limits, sensing thresholds). Traditional optimization methods, such as convex approximation or alternating optimization, often struggle with the curse of dimensionality, the non-convex nature of the objective, and the need for real-time adaptability in dynamic environments.

To address these challenges, we turn to the paradigm of Multi-Agent Deep Reinforcement Learning (MARL). By modeling each unmanned drone as an independent intelligent agent, MARL provides a framework for learning collaborative policies through interaction with a simulated environment. We propose a novel algorithm named ISAC-Aware QMIX (IA-QMIX). This algorithm is built upon the QMIX architecture, which follows a Centralized Training with Decentralized Execution (CTDE) strategy, but is significantly enhanced with mechanisms explicitly designed for the ISAC context. Our contributions include a hierarchical state encoder to better model the drone-user topology, a constraint-aware mixing network that guides agents to satisfy sensing and collision constraints, and an adaptive weighting mechanism. Extensive simulation results demonstrate that our proposed IA-QMIX algorithm effectively coordinates the fleet of unmanned drones, outperforming several state-of-the-art MARL baselines and traditional optimization methods in achieving superior communication rates under strict sensing guarantees.

System Model and Problem Formulation

We consider a scenario with a set of M unmanned drones, denoted by $\mathcal{M} = \{1, 2, \dots, M\}$, providing ISAC services to a set of K ground users (GUs), denoted by $\mathcal{K} = \{1, 2, \dots, K\}$, within a defined geographic area. The system operates over a total mission time T, which is discretized into $\tau$ time slots of equal duration $\delta_t$, so $T = \tau \cdot \delta_t$. The index for time slots is $t \in \{1, 2, \dots, \tau\}$.

1. Drone and User Positions: We employ a 3D Cartesian coordinate system. The position of GU $k$ is fixed and given by $\mathbf{g}_k = [x_k, y_k, 0]^T$. The position of unmanned drone $m$ at time slot $t$ is a variable we aim to optimize, denoted as $\mathbf{u}_m(t) = [x_m(t), y_m(t), z_m(t)]^T$. The drone’s altitude is constrained within a permissible range: $H_{min} \le z_m(t) \le H_{max}, \forall m, t$.

2. Mobility and Safety Constraints: Each unmanned drone has a maximum speed $V_{max}$. Its mobility between consecutive time slots is constrained by:
$$ ||\mathbf{u}_m(t+1) - \mathbf{u}_m(t)|| \le V_{max} \cdot \delta_t, \quad \forall m, t $$
To ensure safe operation, any pair of unmanned drones must maintain a minimum separation distance $d_{min}$ to avoid collisions:
$$ ||\mathbf{u}_m(t) - \mathbf{u}_{m'}(t)|| \ge d_{min}, \quad \forall m \neq m', \forall t $$
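
The mobility and separation constraints above, together with the altitude bound from item 1, can be checked slot by slot. The following is a minimal sketch; the function names and the tuple-based position format are illustrative assumptions rather than part of the original formulation.

```python
import math
from itertools import combinations

def mobility_ok(prev_pos, next_pos, v_max, delta_t):
    """True if the displacement between consecutive slots is at most V_max * delta_t."""
    return math.dist(prev_pos, next_pos) <= v_max * delta_t

def altitude_ok(pos, h_min, h_max):
    """True if the z-coordinate lies within [H_min, H_max]."""
    return h_min <= pos[2] <= h_max

def separation_ok(positions, d_min):
    """True if every pair of drones is at least d_min apart."""
    return all(math.dist(p, q) >= d_min for p, q in combinations(positions, 2))
```

With the default values $V_{max} = 30$ m/s and $\delta_t = 10$ s, a drone may move at most 300 m between consecutive slots.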

3. Channel Model: Given the typical deployment altitude of unmanned drones, the air-to-ground (A2G) channel is often dominated by the LoS link. Therefore, we adopt a simplified yet effective path-loss model. The channel power gain for communication between unmanned drone $m$ and GU $k$ at time $t$ is:
$$ h_{m,k}^{comm}(t) = \frac{\beta_0}{d_{m,k}^2(t)} $$
where $d_{m,k}(t) = ||\mathbf{u}_m(t) - \mathbf{g}_k||$ is the Euclidean distance, and $\beta_0$ is the channel power gain at a reference distance of 1 meter. For sensing, the signal makes a round trip, leading to a different model. The effective channel gain for sensing is modeled as:
$$ h_{m,k}^{sens}(t) = \frac{\eta_0}{d_{m,k}^4(t)} $$
where $\eta_0$ is the sensing reference gain.
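
To illustrate the two path-loss laws, the sketch below evaluates both gains from 3D positions. It assumes that the reference gains are supplied in dB and converted to linear scale; the helper names are hypothetical.

```python
import math

def db_to_linear(x_db):
    """Convert a power ratio from dB to linear scale."""
    return 10 ** (x_db / 10)

def comm_gain(uav_pos, gu_pos, beta0_db):
    """One-way gain: beta_0 / d^2."""
    d = math.dist(uav_pos, gu_pos)
    return db_to_linear(beta0_db) / d ** 2

def sens_gain(uav_pos, gu_pos, eta0_db):
    """Round-trip gain: eta_0 / d^4."""
    d = math.dist(uav_pos, gu_pos)
    return db_to_linear(eta0_db) / d ** 4
```

Doubling the drone-user distance reduces the communication gain by a factor of 4 but the sensing gain by a factor of 16, which is why sensing tends to pull drones closer to users than communication alone would.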

4. Time-Division ISAC Framework: Within each time slot $\delta_t$, we employ a Time-Division (TD) scheme to share resources between sensing and communication. A fraction $\alpha (0 < \alpha < 1)$ of the slot is allocated for sensing (e.g., transmitting and processing radar waveforms), and the remaining fraction $(1-\alpha)$ is allocated for data communication. The parameter $\alpha$ can be tuned based on system priorities.

5. User Association and Power Control: Let $a_{m,k}(t) \in \{0, 1\}$ be a binary association variable, where $a_{m,k}(t)=1$ indicates that GU $k$ is served by unmanned drone $m$ for communication in slot $t$. We assume each GU can be connected to at most one unmanned drone per slot, and each unmanned drone can serve at most one GU per slot to avoid intra-cell interference:
$$ \sum_{m \in \mathcal{M}} a_{m,k}(t) \le 1, \quad \forall k, t $$
$$ \sum_{k \in \mathcal{K}} a_{m,k}(t) \le 1, \quad \forall m, t $$
Each unmanned drone $m$ has a transmit power $p_m(t)$, which is limited by a maximum value $P_{max}$:
$$ 0 \le p_m(t) \le P_{max}, \quad \forall m, t $$
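
The association and power constraints for a single slot can be verified as follows. This is a sketch under the assumption that the association variables are stored as a nested list `a[m][k]` and the powers as a list `p[m]`; both layouts are illustrative choices.

```python
def association_feasible(a):
    """Check a[m][k] in {0, 1}, at most one GU per drone and one drone per GU."""
    num_drones, num_users = len(a), len(a[0])
    binary = all(v in (0, 1) for row in a for v in row)
    one_user_per_drone = all(sum(row) <= 1 for row in a)
    one_drone_per_user = all(
        sum(a[m][k] for m in range(num_drones)) <= 1 for k in range(num_users)
    )
    return binary and one_user_per_drone and one_drone_per_user

def power_feasible(p, p_max):
    """Check 0 <= p_m <= P_max for every drone."""
    return all(0.0 <= p_m <= p_max for p_m in p)
```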

6. Communication and Sensing Metrics:
The Signal-to-Interference-plus-Noise Ratio (SINR) for the communication link from unmanned drone $m$ to its associated GU $k$ is:
$$ \gamma_{m,k}^{comm}(t) = \frac{p_m(t) h_{m,k}^{comm}(t)}{\sum_{m' \neq m} p_{m'}(t) h_{m',k}^{comm}(t) + \sigma^2} $$
where $\sigma^2$ is the noise power. The achievable communication rate (in bps/Hz) is given by the Shannon capacity:
$$ R_{m,k}^{comm}(t) = (1-\alpha) \log_2\left(1 + \gamma_{m,k}^{comm}(t)\right) $$
Similarly, the SINR for sensing GU $k$ by unmanned drone $m$ is:
$$ \gamma_{m,k}^{sens}(t) = \frac{p_m(t) h_{m,k}^{sens}(t)}{\sum_{m' \neq m} p_{m'}(t) h_{m',k}^{sens}(t) + \sigma^2} $$
We use Mutual Information (MI) as a rigorous metric for sensing performance. The sensing MI is:
$$ I_{m,k}^{sens}(t) = \alpha \log_2\left(1 + \gamma_{m,k}^{sens}(t)\right) $$
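
Since the communication and sensing SINRs share the same form and differ only in the channel gains, a single helper can serve both metrics. The sketch below assumes linear-scale inputs, with `p[m]` the transmit powers and `h[m][k]` the relevant gain matrix; these names are illustrative.

```python
import math

def sinr(m, k, p, h, noise):
    """SINR at GU k for drone m, treating all other drones as interferers."""
    signal = p[m] * h[m][k]
    interference = sum(p[j] * h[j][k] for j in range(len(p)) if j != m)
    return signal / (interference + noise)

def comm_rate(m, k, p, h_comm, noise, alpha):
    """Rate in bps/Hz, scaled by the communication share (1 - alpha) of the slot."""
    return (1 - alpha) * math.log2(1 + sinr(m, k, p, h_comm, noise))

def sens_mi(m, k, p, h_sens, noise, alpha):
    """Sensing MI, scaled by the sensing share alpha of the slot."""
    return alpha * math.log2(1 + sinr(m, k, p, h_sens, noise))
```

For example, with two drones at unit power, gains of 6 and 1 toward the same GU, and unit noise, the SINR of the first drone is $6/(1+1) = 3$, so $\log_2(1+3) = 2$ and, with $\alpha = 0.35$, the rate is $0.65 \times 2 = 1.3$ bps/Hz.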

7. Long-Term Performance and Optimization Problem:
The average communication rate for GU $k$ over the entire mission is:
$$ \bar{R}_k^{comm} = \frac{1}{\tau} \sum_{t=1}^{\tau} \sum_{m \in \mathcal{M}} a_{m,k}(t) R_{m,k}^{comm}(t) $$
The average sensing MI for GU $k$ is:
$$ \bar{I}_k^{sens} = \frac{1}{\tau} \sum_{t=1}^{\tau} \sum_{m \in \mathcal{M}} a_{m,k}(t) I_{m,k}^{sens}(t) $$
To ensure fair and reliable sensing service, we impose a constraint that each GU’s average sensing MI must meet a minimum threshold $I_{th}$:
$$ \bar{I}_k^{sens} \ge I_{th}, \quad \forall k \in \mathcal{K} $$
Our goal is to maximize the minimum average communication rate among all GUs (max-min fairness for communication) subject to the sensing constraints. This leads to the following optimization problem $(P1)$:

$$
\begin{aligned}
(P1): \quad & \max_{\{\mathbf{u}_m(t)\}, \{a_{m,k}(t)\}, \{p_m(t)\}} \min_{k \in \mathcal{K}} \bar{R}_k^{comm} \\
\text{s.t.} \quad & \bar{I}_k^{sens} \ge I_{th}, \quad \forall k \in \mathcal{K} \\
& ||\mathbf{u}_m(t+1) - \mathbf{u}_m(t)|| \le V_{max} \cdot \delta_t, \quad \forall m, t \\
& ||\mathbf{u}_m(t) - \mathbf{u}_{m'}(t)|| \ge d_{min}, \quad \forall m \neq m', \forall t \\
& H_{min} \le z_m(t) \le H_{max}, \quad \forall m, t \\
& \sum_{m \in \mathcal{M}} a_{m,k}(t) \le 1, \quad \forall k, t \\
& \sum_{k \in \mathcal{K}} a_{m,k}(t) \le 1, \quad \forall m, t \\
& a_{m,k}(t) \in \{0, 1\}, \quad \forall m, k, t \\
& 0 \le p_m(t) \le P_{max}, \quad \forall m, t
\end{aligned}
$$
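
Given per-slot metrics, the objective of $(P1)$ and its sensing constraint can be evaluated as sketched below. The inputs `rate_per_slot[t][k]` and `mi_per_slot[t][k]` are assumed to already contain the association-weighted sums $\sum_m a_{m,k}(t) R_{m,k}^{comm}(t)$ and $\sum_m a_{m,k}(t) I_{m,k}^{sens}(t)$; the function names are hypothetical.

```python
def average_over_slots(per_slot):
    """Average per_slot[t][k] over t, returning one value per GU."""
    tau = len(per_slot)
    num_users = len(per_slot[0])
    return [sum(per_slot[t][k] for t in range(tau)) / tau for k in range(num_users)]

def evaluate_p1(rate_per_slot, mi_per_slot, i_th):
    """Return (max-min objective value, sensing-constraint feasibility)."""
    avg_rate = average_over_slots(rate_per_slot)
    avg_mi = average_over_slots(mi_per_slot)
    feasible = all(v >= i_th for v in avg_mi)
    return min(avg_rate), feasible
```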

Problem $(P1)$ is a mixed-integer non-convex optimization problem, which is generally NP-hard and intractable for traditional solvers in real-time dynamic scenarios, even with a moderate number of unmanned drones. This complexity motivates our data-driven MARL approach.

Proposed IA-QMIX Algorithm

We formulate the cooperative control of the unmanned drone fleet as a Markov Game, where each unmanned drone is an agent. The goal is to learn a policy that maps local observations to actions (movement, association, power selection) to maximize the global cumulative reward, which is aligned with the objective in $(P1)$.

MARL Framework and Agent Design

We adopt the Centralized Training with Decentralized Execution (CTDE) paradigm, exemplified by the QMIX algorithm. During training, a centralized mixing network has access to global state information and guides the learning of the per-agent Q-functions. During execution, each unmanned drone agent acts based solely on its local observations, ensuring scalability and robustness.

State Space: The global state $s_t$ includes all drones’ positions, velocities, associated user indices, and all GUs’ positions and their historical performance metrics (average rates, average MI).

Observation Space: Each unmanned drone $m$ has a partial observation $o_m^t$, which typically includes its own state, the states of GUs within its communication/sensing range, and the states of nearby drones (to gauge collision risk).

Action Space: The action $a_m^t$ for an unmanned drone $m$ is a composite, discrete action. We discretize the 3D movement into a set of basic motions: move forward/backward/left/right/up/down by a fixed step size $\Delta d$, or hover. The association action is choosing which GU to serve (or none). The power control is discretized into a few levels (e.g., low, medium, high). The combination forms a multi-dimensional discrete action space.
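
The composite action space can be enumerated as a Cartesian product of the three components. The sketch below is illustrative: the mapping of forward/backward/left/right to the x and y axes and the three power-level labels are assumptions.

```python
from itertools import product

# Seven basic motions as unit direction vectors (axis mapping is an assumption).
MOVES = {
    "hover": (0, 0, 0),
    "forward": (1, 0, 0), "backward": (-1, 0, 0),
    "left": (0, 1, 0), "right": (0, -1, 0),
    "up": (0, 0, 1), "down": (0, 0, -1),
}

def build_action_space(num_users, power_levels):
    """All (move, served GU or None, power level) combinations for one drone."""
    associations = [None] + list(range(num_users))
    return list(product(MOVES, associations, power_levels))

def apply_move(pos, move, step):
    """Shift a position by a fixed step size along the chosen direction."""
    return tuple(p + d * step for p, d in zip(pos, MOVES[move]))
```

With $K = 8$ users and three power levels, each drone has $7 \times 9 \times 3 = 189$ discrete actions.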

Reward Design: The reward function is critical for shaping the desired behavior. The global reward $r_t$ at time $t$ is designed to encourage the max-min communication rate objective while penalizing constraint violations:
$$ r_t = \omega_1 \cdot \min_{k \in \mathcal{K}} \left( \sum_m a_{m,k}(t) R_{m,k}^{comm}(t) \right) - \omega_2 \cdot \sum_{k} \max(0, I_{th} - I_k^{sens}(t)) - \omega_3 \cdot \sum_{m \neq m'} \mathbb{I}(||\mathbf{u}_m(t) - \mathbf{u}_{m'}(t)|| < d_{min}) $$
where $\omega_1, \omega_2, \omega_3$ are weighting coefficients, $\mathbb{I}(\cdot)$ is an indicator function, and $I_k^{sens}(t)$ denotes the sensing MI of GU $k$ at slot $t$, averaged over a short window. The first term is the instantaneous minimum user rate. The second term penalizes any shortfall of $I_k^{sens}(t)$ relative to the required threshold $I_{th}$. The third term is a collision penalty.
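
A direct transcription of this reward is sketched below. The inputs are assumed to be precomputed per-user quantities (served rates and windowed sensing MI), and the collision count runs over ordered pairs $m \neq m'$ as in the formula, so each unordered pair is counted twice. Function names are hypothetical.

```python
import math

def count_close_pairs(positions, d_min):
    """Number of ordered pairs (m, m') with m != m' closer than d_min."""
    n = len(positions)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(positions[i], positions[j]) < d_min
    )

def global_reward(served_rates, sens_mi, positions, i_th, d_min, w1, w2, w3):
    """r_t = w1 * min rate - w2 * sensing shortfall - w3 * collision count."""
    rate_term = min(served_rates)
    shortfall = sum(max(0.0, i_th - mi) for mi in sens_mi)
    collisions = count_close_pairs(positions, d_min)
    return w1 * rate_term - w2 * shortfall - w3 * collisions
```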

IA-QMIX Network Architecture

The standard QMIX algorithm uses a mixing network that combines individual agent Q-values into a joint action-value $Q_{tot}$, enforcing a monotonicity constraint to ensure decentralized policy consistency. Our IA-QMIX enhances this structure with ISAC-aware modules.

1. Hierarchical State Encoder: The global state $s_t$ is processed by an encoder to produce a rich feature vector $h_t$. Unlike a simple flattening operation, our encoder has a hierarchical design:

Drone-Level Encoder: An MLP processes the features of all unmanned drones to capture inter-drone spatial relationships and collision risks.
User-Level Encoder: Another MLP processes GU features to understand user distribution and demand.
Interaction Encoder: A cross-attention module models the bipartite graph between the unmanned drone set and the GU set. It computes attention weights that signify the potential value of a drone serving a particular user, dynamically capturing the topology.

The outputs are concatenated to form $h_t$.

2. Constraint-Aware Feasibility Gating: A key innovation is the introduction of constraint prediction heads that take $h_t$ as input. These heads output feasibility scores $c_m^{coll}(t)$ and $c_m^{sens}(t)$ for each unmanned drone $m$, predicting how safe its current state is from collisions and how well its actions contribute to meeting the global sensing constraints, respectively. These scores, in the range [0,1], act as gating mechanisms on the individual drone’s Q-value before mixing:
$$ \tilde{Q}_m(o_m^t, a_m^t) = Q_m(o_m^t, a_m^t) \cdot c_m^{coll}(t) \cdot c_m^{sens}(t) $$
This multiplicative gating softly suppresses the Q-values of actions that are predicted to lead to constraint violations, effectively integrating the hard constraints from $(P1)$ into the learning process in a differentiable manner.
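
The gating step itself reduces to a scalar multiplication per drone. In the sketch below, the feasibility scores are obtained by applying a sigmoid to the raw outputs of the prediction heads; the sigmoid is an assumption made here to keep the scores in [0,1], since the text specifies only the range.

```python
import math

def sigmoid(x):
    """Squash a raw head output (logit) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_q_values(q_values, coll_logit, sens_logit):
    """Scale one drone's Q-values by its collision and sensing feasibility scores."""
    c_coll = sigmoid(coll_logit)
    c_sens = sigmoid(sens_logit)
    gate = c_coll * c_sens
    return [q * gate for q in q_values]
```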

3. Adaptive Monotonic Mixing Network: A hyper-network takes the encoded global state $h_t$ and generates the parameters (weights and biases) for the main mixing network. This allows the mixing function to adapt based on the current global scenario (e.g., user density, interference level). The mixing network then non-linearly combines the gated Q-values $\{\tilde{Q}_m\}$ to produce the final $Q_{tot}$, while strictly enforcing monotonicity via non-negative weights:
$$ Q_{tot}(\mathbf{s}, \mathbf{a}; \theta) = f_{mix}\left( \tilde{Q}_1, \tilde{Q}_2, \dots, \tilde{Q}_M; \mathbf{W}(h_t), \mathbf{b}(h_t) \right) $$
where $f_{mix}$ is a monotonic function parameterized by weights $\mathbf{W}$ and biases $\mathbf{b}$ generated by the hyper-network from $h_t$.
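
The monotonicity mechanism can be illustrated with a small two-layer mixer in which the weights pass through an absolute value before use, as in standard QMIX. The sketch takes the weights and biases as plain arguments, standing in for the hyper-network outputs; the layer sizes and the ELU nonlinearity follow the usual QMIX design and are assumptions with respect to IA-QMIX.

```python
import math

def elu(x):
    """Exponential linear unit, a monotonically increasing nonlinearity."""
    return x if x > 0 else math.exp(x) - 1.0

def monotonic_mix(q_tilde, w1, b1, w2, b2):
    """Two-layer mixer; abs() on the weights makes Q_tot non-decreasing in each input."""
    hidden = [
        elu(sum(abs(w1[j][m]) * q_tilde[m] for m in range(len(q_tilde))) + b1[j])
        for j in range(len(w1))
    ]
    return sum(abs(w2[j]) * hidden[j] for j in range(len(hidden))) + b2
```

Because every weight is non-negative after the absolute value and ELU is increasing, raising any single $\tilde{Q}_m$ can never lower $Q_{tot}$, even if the hyper-network emits negative raw weights. Biases are left unconstrained.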

Training and Execution

The agents explore the environment using an $\epsilon$-greedy policy. Experiences $(s_t, \mathbf{o}_t, \mathbf{a}_t, r_t, s_{t+1}, \mathbf{o}_{t+1})$ are stored in a replay buffer. During training, mini-batches are sampled to update all network parameters by minimizing the Temporal-Difference (TD) loss:
$$ \mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{s}, \mathbf{a}, r, \mathbf{s}') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{\mathbf{a}'} Q_{tot}(\mathbf{s}', \mathbf{a}'; \theta^-) - Q_{tot}(\mathbf{s}, \mathbf{a}; \theta) \right)^2 \right] $$
where $\theta$ are the parameters of the online networks, $\theta^-$ are the parameters of the target networks (updated periodically), $\gamma$ is the discount factor, and $\mathcal{D}$ is the replay buffer.
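
For a sampled mini-batch, the loss reduces to a mean squared error between the online $Q_{tot}$ and a bootstrapped target. The sketch below assumes that the target-network maxima $\max_{\mathbf{a}'} Q_{tot}(\mathbf{s}', \mathbf{a}'; \theta^-)$ have already been computed; names are illustrative.

```python
def td_targets(rewards, next_q_tot_max, gamma):
    """Bootstrapped targets r + gamma * max_a' Q_tot(s', a'; theta^-)."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_tot_max)]

def td_loss(q_tot, targets):
    """Mean squared TD error over the mini-batch."""
    errors = [y - q for y, q in zip(targets, q_tot)]
    return sum(e * e for e in errors) / len(errors)
```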

After training, only the local agent networks are deployed on each unmanned drone. During execution, each drone independently selects its action by greedily maximizing its local $Q_m(o_m^t, a_m^t)$, which, due to the monotonicity and constraint-gating learned during training, leads to near-optimal joint actions that satisfy the ISAC constraints. The computational overhead for decision-making is just a single forward pass through a neural network, enabling real-time control.
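
Action selection at each drone can be sketched as follows, covering both $\epsilon$-greedy exploration during training and purely greedy execution ($\epsilon = 0$). The list-of-Q-values interface is an assumption; in practice the values would come from the local agent network.

```python
import random

def select_action(q_values, epsilon=0.0, rng=random):
    """Pick a random action with probability epsilon, else the argmax of Q_m."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def joint_greedy(per_agent_q):
    """Decentralized execution: each drone maximizes its own Q_m independently."""
    return [select_action(q) for q in per_agent_q]
```

Under the monotonic mixing of QMIX, these per-agent argmax choices coincide with the joint action that maximizes $Q_{tot}$, which is what makes decentralized execution consistent with centralized training.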

Simulation Results and Analysis

We conduct simulations in a Python environment to validate the performance of our proposed IA-QMIX algorithm. The unmanned drones operate in a 2 km × 2 km area at altitudes between 100 m and 150 m. Key simulation parameters are summarized in Table 1.

Table 1: Default Simulation Parameters
Parameter	Value
Number of Unmanned Drones (M)	4
Number of Ground Users (K)	8
Mission Duration (T)	900 s
Time Slot Duration ($\delta_t$)	10 s
Max Drone Speed ($V_{max}$)	30 m/s
Min Safe Distance ($d_{min}$)	15 m
Drone Altitude Range	[100m, 150m]
Max Transmit Power ($P_{max}$)	0.5 W
Noise Power ($\sigma^2$)	-169 dBm
Comm. Ref. Gain ($\beta_0$)	-111 dB
Sensing Ref. Gain ($\eta_0$)	-131 dB
Sensing Threshold ($I_{th}$)	0.05 bps/Hz
ISAC Time Split ($\alpha$)	0.35
Communication Bandwidth	1 MHz
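
A few derived quantities follow directly from Table 1. The sketch below computes the number of time slots and the maximum per-slot displacement, and provides unit conversions for the dB and dBm entries; the helper names are hypothetical.

```python
# Values taken from Table 1.
T = 900.0        # mission duration in seconds
DELTA_T = 10.0   # slot duration in seconds
V_MAX = 30.0     # maximum speed in m/s

tau = int(T / DELTA_T)        # number of time slots
max_step = V_MAX * DELTA_T    # maximum displacement per slot in meters

def db_to_linear(x_db):
    """Convert a power ratio from dB to linear scale."""
    return 10 ** (x_db / 10)

def dbm_to_watt(x_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((x_dbm - 30) / 10)
```

The mission therefore spans $\tau = 90$ slots, and a drone can travel up to 300 m per slot, which is large compared with the 15 m minimum safe distance.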

We compare IA-QMIX against several baselines: 1) Independent Q-Learning (IQL): Each unmanned drone learns its policy independently without any coordination. 2) Standard QMIX: Uses the original mixing network without ISAC-aware modules. 3) Multi-Agent PPO (MAPPO): A state-of-the-art policy gradient method. 4) Greedy Heuristic: A rule-based method where drones always move towards and serve the nearest user with maximum power. 5) SCA-based Optimization (SCA): A traditional iterative optimization method using Successive Convex Approximation.

1. Convergence Performance: The figure below shows the learning curves of the max-min average communication rate during training. IA-QMIX converges to the highest stable performance (approximately 0.816 bps/Hz), significantly outperforming the baselines. Standard QMIX performs second best but plateaus at a lower level, demonstrating the value of our ISAC-aware enhancements. IQL performs poorly due to the lack of coordination in this highly coupled multi-drone environment.

2. Scalability with Number of Users and Drones: We tested the algorithms under varying scales. When increasing the number of GUs (K) with a fixed number of unmanned drones (M=4), the performance of all algorithms initially dips as the scheduling becomes more complex but then generally improves due to multi-user diversity. IA-QMIX consistently maintains the highest rate across all scales. When increasing the number of unmanned drones (M) with fixed GUs (K=8), the max-min rate improves as more resources become available. IA-QMIX effectively manages the increased complexity of multi-agent coordination, translating the added drones into tangible performance gains, whereas other algorithms like IQL show diminishing returns.

3. Impact of Transmit Power and Time Allocation: We also analyze the impact of key system parameters. The performance of all schemes improves with increased maximum transmit power $P_{max}$ but eventually saturates due to rising inter-drone interference. IA-QMIX achieves the best trade-off. Furthermore, varying the ISAC time-split parameter $\alpha$ reveals a clear trade-off: a small $\alpha$ favors communication but harms sensing, potentially violating the $I_{th}$ constraint; a large $\alpha$ does the opposite. IA-QMIX, with its constraint-aware learning, robustly meets the sensing requirement across different $\alpha$ while maximizing the communication rate, finding an effective balance around $\alpha=0.35$ for our default reward weights.

4. Ablation Study on ISAC-Aware Modules: To isolate the contribution of the constraint-aware gating and hierarchical encoder, we create augmented versions of MAPPO and IQL by adding similar modules (IA-MAPPO, IA-IQL). The results, summarized in Table 2, show a consistent performance boost compared to their vanilla counterparts, supporting the general efficacy of explicitly modeling ISAC constraints within the MARL framework.

Table 2: Performance Comparison of Different Algorithms
Algorithm	Max-Min Avg. Comm. Rate (bps/Hz)	Sensing Constraint Satisfied?
Greedy Heuristic	0.521	No
SCA Optimization	0.683	Yes
IQL	0.451	Marginally
IA-IQL	0.535 (+18.6%)	Yes
MAPPO	0.603	Yes
IA-MAPPO	0.648 (+7.5%)	Yes
Standard QMIX	0.742	Yes
Proposed IA-QMIX	0.816	Yes
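
The relative gains quoted in Table 2 can be reproduced from the listed rates with a one-line helper, shown here as a small arithmetic check; the function name is illustrative.

```python
def relative_gain_percent(base, improved):
    """Percentage improvement of `improved` over `base`."""
    return 100.0 * (improved - base) / base

gain_iql = round(relative_gain_percent(0.451, 0.535), 1)    # IQL -> IA-IQL
gain_mappo = round(relative_gain_percent(0.603, 0.648), 1)  # MAPPO -> IA-MAPPO
gain_qmix = round(relative_gain_percent(0.742, 0.816), 1)   # QMIX -> IA-QMIX
```

The same calculation applied to standard QMIX and IA-QMIX gives a gain of about 10.0%.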

Conclusion

In this work, we investigated the critical problem of joint trajectory design and resource allocation for a fleet of unmanned drones providing integrated sensing and communication services. We formulated a comprehensive optimization problem aimed at maximizing the minimum average communication rate among ground users while guaranteeing a minimum level of sensing performance for each user. To tackle this complex, non-convex problem with mixed-integer variables and multiple constraints, we proposed a novel multi-agent deep reinforcement learning algorithm named IA-QMIX.

Our algorithm innovates upon the standard QMIX framework by incorporating a hierarchical state encoder to better understand the drone-user topology and, more importantly, a constraint-aware feasibility gating mechanism that seamlessly integrates sensing and collision-avoidance requirements into the value function learning process. This allows the unmanned drone agents to learn cooperative policies that inherently respect the system’s physical constraints.

Simulation results demonstrate that the proposed IA-QMIX algorithm significantly outperforms several strong baseline methods, including independent learning, vanilla QMIX, policy gradient methods, and traditional optimization heuristics. It effectively coordinates the 3D movement, user association, and power control of multiple unmanned drones, achieving superior communication fairness under strict sensing quality guarantees. The work confirms the great potential of advanced MARL techniques in managing the complexity of future autonomous, multi-functional unmanned drone networks for ISAC applications.