Intelligent Route Planning for China UAV Drones in Urban Patrols

The rapid advancement of China UAV drone technology has driven its integration into a wide range of military and civilian applications. While single-platform operations are common, many scenarios, particularly in complex urban environments, demand the coordinated effort of multi-UAV systems. Applications such as intelligence, surveillance, and reconnaissance (ISR) and large-scale environmental monitoring benefit from collaborative aerial networks in which China UAV drones share data and distribute tasks to achieve comprehensive area coverage and real-time situational awareness.

However, the mission effectiveness of these China UAV drone swarms is critically dependent on intelligent route planning. This task is highly challenging due to the dynamic and complex nature of urban airspace, compounded by stringent requirements for communication link quality and real-time responsiveness. The flight environment for a China UAV drone fleet is constrained by multiple factors: 1) Safety: No-fly zones and tall infrastructure create hard spatial constraints and collision risks; 2) Energy: On-board battery capacity limits operational endurance, necessitating timely return-to-home maneuvers; 3) Communication: The low-altitude urban radio environment is volatile, with frequent Line-of-Sight (LOS) to Non-Line-of-Sight (NLOS) transitions caused by building blockages, severely impacting data transmission stability between the China UAV drone and ground base stations.

Consequently, the core challenge for a China UAV drone system’s route planning lies in jointly balancing communication performance, task efficiency, and flight safety under these network conditions. This research addresses this multi-objective constrained route planning problem. We use Deep Reinforcement Learning (DRL), specifically a Double Deep Q-Network (DDQN) architecture, to enable a fleet of China UAV drones to autonomously learn patrol trajectories. Our approach involves discretizing the operational airspace under a known communication-quality distribution, constructing spatial, energy, and communication models, and designing a multi-dimensional reward function. Within this framework, we show in simulation that a trained China UAV drone swarm can perform urban patrols that maximize data throughput while adhering to the safety and energy constraints.

1. Problem Statement and System Modeling

We consider a canonical urban patrol mission executed by a cooperative fleet of China UAV drones. Each China UAV drone is equipped with multi-modal sensors (e.g., high-resolution optical and thermal cameras) for dynamic monitoring of city streets, functional zones, and points of interest. The mission begins from a common launch/recovery site. The primary objective is to plan flight paths that maximize the total amount of surveillance data successfully transmitted to ground infrastructure while ensuring safe navigation and efficient energy use for every China UAV drone in the fleet. To formally define this optimization problem, we establish the following mathematical models.

1.1 Spatial and Mobility Model

The operational airspace is discretized into an $$M \times M$$ grid, with each cell having a side length of $$w$$. Key spatial elements are defined as sets of coordinates:

Launch/Recovery Area (A): $$A = \{ [x^a_i, y^a_i]^T, i = 1, 2, \ldots, k_1 \}$$
No-Fly Zones & Tall Buildings (B): $$B = \{ [x^b_i, y^b_i]^T, i = 1, 2, \ldots, k_2 \}$$
Standard Buildings (S): $$S = \{ [x^s_i, y^s_i]^T, i = 1, 2, \ldots, k_3 \}$$

The position of the $$i$$-th China UAV drone at time step $$t$$ is given by:
$$P_i(t) = (x_i(t), y_i(t), h_i(t)), \quad t \in \{1, 2, \ldots, T\}$$
where $$T$$ is the total mission duration in movement time steps and $$h_i(t)$$ is the flight altitude of the $$i$$-th drone.
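
As a minimal sketch of the spatial model, the sets $$A$$, $$B$$, and $$S$$ can be held as sets of grid coordinates and a move can be checked against the hard constraints. The 5 × 5 layout and the helper name `is_legal_cell` are hypothetical, and the sketch assumes that only set $$B$$ blocks flight while standard buildings in $$S$$ do not.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]


def is_legal_cell(cell: Cell, grid_size: int, blocked: Set[Cell]) -> bool:
    """Return True if the cell lies inside the M x M grid and is not in B."""
    x, y = cell
    inside = 0 <= x < grid_size and 0 <= y < grid_size
    return inside and cell not in blocked


# Hypothetical 5 x 5 layout, for illustration only.
M = 5
A = {(0, 0)}            # launch/recovery area
B = {(2, 2), (2, 3)}    # no-fly zones and tall buildings (hard constraint)
S = {(4, 1)}            # standard buildings (assumed not to block flight here)

print(is_legal_cell((2, 2), M, B))  # False: inside a no-fly zone
print(is_legal_cell((4, 1), M, B))  # True: in S, not in B
print(is_legal_cell((5, 0), M, B))  # False: outside the grid
```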

1.2 Energy Consumption Model

The remaining battery level of the $$i$$-th China UAV drone is denoted as $$b_i(t)$$. For simplicity and to align with practical operational limits, we model the battery level as equivalent to remaining flight time. The China UAV drone must return to the recovery area before its battery depletes to zero. Energy depletion is modeled as a unit decrement per movement time step while the China UAV drone is in flight, i.e., $$b_i(t+1) = b_i(t) - 1$$. A significant penalty is imposed if a China UAV drone exhausts its battery before landing, which incentivizes energy-aware path planning.
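
The energy model can be sketched in a few lines. The unit decrement follows the description above; the return-feasibility check `can_still_return` is a hypothetical helper, not part of the original model, and it uses the Manhattan distance to the recovery cell as an optimistic lower bound that ignores obstacles.

```python
from typing import Tuple


def step_battery(battery: int, in_flight: bool) -> int:
    """Unit decrement per movement step while airborne; never below zero."""
    if in_flight and battery > 0:
        return battery - 1
    return battery


def can_still_return(battery: int, pos: Tuple[int, int], home: Tuple[int, int]) -> bool:
    """Hypothetical check: Manhattan distance is a lower bound on the number
    of axis-aligned steps needed to reach the recovery cell."""
    steps_home = abs(pos[0] - home[0]) + abs(pos[1] - home[1])
    return battery >= steps_home


b = 200  # initial battery in time steps
for _ in range(5):
    b = step_battery(b, in_flight=True)
print(b)                                    # 195
print(can_still_return(3, (4, 4), (0, 0)))  # False: 8 steps needed, 3 left
```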

1.3 Communication Model

The China UAV drone fleet communicates with $$K$$ ground base stations (BSs) using Orthogonal Frequency-Division Multiplexing (OFDM) with a Time Division Multiple Access (TDMA) scheme. The communication timeline is divided into fine-grained communication slots indexed by $$n$$. The maximum achievable transmission rate between the $$i$$-th China UAV drone and the $$k$$-th BS during slot $$n$$ is modeled as:
$$R^{max}_{i,k}(n) = N_{RE} \cdot BpR \cdot C_r \cdot T_{mn} \cdot M_n \cdot F_n$$
where $$N_{RE} = 13{,}200$$ is the number of available Resource Elements (REs) per subframe for data transmission after control channel overhead, $$BpR$$ is the modulation order (e.g., 2 for QPSK, 4 for 16QAM), $$C_r$$ is the channel coding rate, $$T_{mn}$$ is the number of slots, $$M_n$$ is the number of MIMO data streams, and $$F_n$$ is the number of half-frames per second.
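
To make the product concrete, the following sketch evaluates the rate formula for one illustrative parameter set. Only $$N_{RE} = 13{,}200$$ comes from the text; the remaining values (QPSK, coding rate 0.5, one slot, one MIMO stream, 200 half-frames per second) are assumptions chosen for the arithmetic, and the result is in bits per second only if $$BpR$$ is read as bits per RE.

```python
def max_rate(n_re: int, bpr: int, code_rate: float,
             n_slots: int, n_streams: int, half_frames_per_s: int) -> float:
    """Evaluate R_max = N_RE * BpR * C_r * T_mn * M_n * F_n."""
    return n_re * bpr * code_rate * n_slots * n_streams * half_frames_per_s


# Assumed example values, except N_RE which is given in the text.
rate = max_rate(n_re=13200, bpr=2, code_rate=0.5,
                n_slots=1, n_streams=1, half_frames_per_s=200)
print(rate)  # 2640000.0
```

Doubling the modulation order from 2 (QPSK) to 4 (16QAM) doubles the result, which is why the link quality at a given grid cell directly shapes the achievable throughput there.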

To avoid interference and scheduling conflicts, each China UAV drone is associated with at most one BS per communication slot $$n$$. This is enforced by the binary association indicator $$q_{i,k}(n) \in \{0,1\}$$ and the constraint:
$$\sum_{k=1}^{K} q_{i,k}(n) \leq 1, \quad \forall n, \forall i.$$
The effective data throughput for the $$i$$-th China UAV drone over the movement step $$t$$ (which spans multiple communication slots) is:
$$C_i(t) = \phi_i(t) \sum_{n=\lambda t}^{\lambda(t+1)-1} \sum_{k=1}^{K} q_{i,k}(n) R_{i,k}(n) \delta_n$$
where $$R_{i,k}(n) \leq R^{max}_{i,k}(n)$$ is the rate achieved on the link to the $$k$$-th BS in slot $$n$$, $$\phi_i(t)$$ is a state-dependent scaling factor, $$\lambda$$ is the number of communication slots per movement step, and $$\delta_n$$ is the duration of a communication slot. The association $$q_{i,k}(n)$$ is typically chosen to connect the China UAV drone to the BS offering the highest Signal-to-Noise Ratio (SNR) at that instant.
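
The association rule and the per-step throughput can be sketched as follows. The SNR values, rates, and slot duration are made-up inputs, and the function names are hypothetical; the sketch only mirrors the highest-SNR association and the double sum over slots and base stations described above.

```python
from typing import List, Sequence


def associate(snr_db: Sequence[float]) -> List[int]:
    """One-hot association q_{i,k}(n): connect to the BS with the highest SNR."""
    best = max(range(len(snr_db)), key=lambda k: snr_db[k])
    return [1 if k == best else 0 for k in range(len(snr_db))]


def step_throughput(phi: float, q_slots: Sequence[Sequence[int]],
                    rate_slots: Sequence[Sequence[float]], delta: float) -> float:
    """C_i(t) = phi * sum over slots n and stations k of q * R * delta."""
    total = 0.0
    for q, rates in zip(q_slots, rate_slots):
        if sum(q) > 1:
            raise ValueError("at most one BS per communication slot")
        total += sum(qk * rk for qk, rk in zip(q, rates)) * delta
    return phi * total


# Two communication slots, three base stations (illustrative numbers).
snr = [[3.0, 10.0, 5.0], [12.0, 1.0, 4.0]]
rates = [[1e6, 2e6, 1.5e6], [3e6, 0.5e6, 1e6]]   # bit/s per link and slot
q = [associate(s) for s in snr]
print(q)  # [[0, 1, 0], [1, 0, 0]]
c = step_throughput(phi=1.0, q_slots=q, rate_slots=rates, delta=0.001)
print(round(c))  # 5000 bits delivered in this movement step
```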

The overarching communication optimization goal for the China UAV drone swarm is to maximize the total data throughput over the mission period by optimizing their collective flight trajectories and BS associations.

Table 1: Key Parameters for System Modeling
Parameter Category	Symbol	Description
Spatial	$$M, w$$	Grid size and cell width
	$$A, B, S$$	Sets for recovery area, obstacles, and buildings
	$$P_i(t)$$	3D position of i-th UAV at time t
Energy	$$b_i(t)$$	Remaining battery/operation time of i-th UAV
Communication	$$R^{max}_{i,k}(n)$$	Max transmission rate between UAV i and BS k
	$$q_{i,k}(n)$$	Binary association variable
	$$C_i(t)$$	Data throughput of UAV i over movement step t
	$$K$$	Number of ground base stations

2. From Q-Learning to Double Deep Q-Networks

Route planning is inherently a sequential decision-making problem under uncertainty, making Reinforcement Learning (RL) a suitable paradigm. The core objective for a China UAV drone is to learn a policy—a mapping from states (its perception of the environment) to actions (flight maneuvers)—that maximizes cumulative future reward.

2.1 Foundation: Q-Learning and the Curse of Dimensionality

Classical Q-Learning algorithms learn an action-value function $$Q(s, a)$$, which estimates the expected return of taking action $$a$$ in state $$s$$ and thereafter following the optimal policy. The update rule is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]$$
where $$\alpha$$ is the learning rate and $$\gamma$$ is the discount factor. While effective for small, discrete state spaces, Q-Learning relies on a Q-table with one entry per state-action pair. For a China UAV drone operating in a high-dimensional urban grid with continuous communication metrics, the state space becomes enormous, leading to the “curse of dimensionality”: the Q-table becomes too large to store, and too many entries must be visited for learning to converge.
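
A minimal tabular version of the update rule is sketched below. The state labels, action names, and numeric values are placeholders for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def q_update(Q: Dict[Tuple[Hashable, Hashable], float], s, a, r: float, s_next,
             actions: Sequence, alpha: float, gamma: float) -> float:
    """Apply one Q-learning update and return the new Q(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


Q = defaultdict(float)          # unseen state-action pairs default to 0.0
actions = ["north", "south"]
Q[("cell_1", "north")] = 2.0    # placeholder value for the next state

new_q = q_update(Q, "cell_0", "north", r=1.0, s_next="cell_1",
                 actions=actions, alpha=0.5, gamma=0.5)
print(new_q)  # 1.0 = 0 + 0.5 * (1 + 0.5 * 2 - 0)
```

The table grows with every distinct state visited, which is the storage problem that function approximation removes.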

2.2 Deep Q-Networks (DQN): Function Approximation

Deep Q-Networks overcome this limitation by using a deep neural network as a function approximator for the Q-value, parameterized by weights $$\theta$$: $$Q(s, a; \theta)$$. This allows the China UAV drone to generalize across similar states. DQN introduced two key innovations for stable training:

Experience Replay: Transitions $$(s_t, a_t, r_t, s_{t+1})$$ are stored in a replay buffer and sampled randomly during training. This breaks temporal correlations in the data and improves sample efficiency for the China UAV drone learner.
Target Network: A separate target network $$Q(s, a; \theta^-)$$ is used to generate the Q-targets $$y$$ for the loss function. The parameters $$\theta^-$$ are periodically copied from the online network $$\theta$$. This stabilizes training by preventing a moving target. The loss for a minibatch is:
$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}[(y - Q(s, a; \theta))^2], \quad \text{where } y = r + \gamma \max_{a'} Q(s', a'; \theta^-).$$
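
The two mechanisms can be sketched without a deep learning library by treating the networks as plain callables that return a list of Q-values per state. The `done` flag for terminal transitions and the toy Q-functions are assumptions added for illustration; they are not specified in the text.

```python
import random
from collections import deque
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[int, int, float, int, bool]  # (s, a, r, s_next, done)


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.buf = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.buf.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform random sampling breaks temporal correlation.
        return random.sample(list(self.buf), batch_size)


def dqn_loss(batch: Sequence[Transition],
             q_online: Callable[[int], List[float]],
             q_target: Callable[[int], List[float]],
             gamma: float) -> float:
    """Mean squared error between the targets y and Q(s, a; theta)."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * max(q_target(s_next))
        total += (y - q_online(s)[a]) ** 2
    return total / len(batch)


# Toy stand-ins for the online and target networks.
q_online = lambda s: [0.0, 1.0]
q_target = lambda s: [2.0, 4.0]

buffer = ReplayBuffer(capacity=100)
buffer.push((0, 1, 1.0, 1, False))
loss = dqn_loss(buffer.sample(1), q_online, q_target, gamma=0.5)
print(loss)  # 4.0: y = 1 + 0.5 * 4 = 3, and (3 - 1)^2 = 4
```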

2.3 The Overestimation Bias and Double DQN (DDQN)

A known issue with standard DQN is the overestimation of Q-values. The $$\max$$ operator in the target uses the same values both to select and to evaluate an action. In noisy environments common to China UAV drone operations, this can lead to upward bias, causing suboptimal policy convergence.

Double DQN (DDQN) addresses this by decoupling the selection from the evaluation. The online network $$\theta$$ is used to select the best action for the next state, while the target network $$\theta^-$$ evaluates its Q-value. The modified target becomes:
$$y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-).$$
This simple yet effective modification reduces overestimation bias, leading to more stable and reliable policy learning, which is crucial for the safety-critical task of planning routes for a China UAV drone swarm.
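
The difference between the two targets is easiest to see on a single transition. In the sketch below the Q-value lists for the next state are invented numbers; the functions only contrast the DQN target with the DDQN target.

```python
from typing import Sequence


def dqn_target(r: float, gamma: float, q_target_next: Sequence[float]) -> float:
    """Target network both selects and evaluates the next action."""
    return r + gamma * max(q_target_next)


def ddqn_target(r: float, gamma: float, q_online_next: Sequence[float],
                q_target_next: Sequence[float]) -> float:
    """Online network selects the action; target network evaluates it."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


q_online_next = [1.0, 3.0, 2.0]   # online network prefers action 1
q_target_next = [5.0, 2.0, 4.0]   # target network's estimate for action 0 is high

print(dqn_target(1.0, 0.5, q_target_next))                  # 3.5
print(ddqn_target(1.0, 0.5, q_online_next, q_target_next))  # 2.0
```

When the two networks disagree about which action is best, the DDQN target cannot exceed the DQN target, which is the mechanism behind the reduced upward bias.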

Table 2: Evolution of RL Algorithms for UAV Path Planning
Algorithm	Core Mechanism	Advantage for UAVs	Limitation
Q-Learning	Tabular Q-value updates	Simple; converges to the optimal Q-values in small, finite MDPs under standard step-size and exploration conditions	Curse of dimensionality; unsuitable for complex China UAV drone states.
Deep Q-Network (DQN)	Neural network function approximation with experience replay & target network.	Handles high-dimensional state spaces (e.g., grid + comms data). Stable training.	Prone to overestimation bias, can converge to suboptimal policies.
Double DQN (DDQN)	Decouples action selection and evaluation using two networks.	Reduces overestimation, yields more accurate and stable value estimates for reliable China UAV drone navigation.	Slightly more complex than DQN; remains a value-based method.

3. DDQN-Based Route Planning for Multiple China UAV Drones

We now formulate the multi-objective route planning problem for a China UAV drone swarm within the DDQN framework. Each China UAV drone acts as an independent learning agent. While they operate in a shared environment, their actions are coordinated implicitly through a shared safety constraint module, and their reward functions are not directly coupled. This design avoids the complexities of multi-agent RL while effectively solving the cooperative planning task.

3.1 State Space Design

The state $$s(t)$$ observed by the planning system at time $$t$$ must encapsulate all information needed for decision-making. We define it as a composite tuple of map, fleet, and network quantities:
$$s(t) = (\mathcal{M}, \{P_i(t)\}, \{b_i(t)\}, \{\phi_i(t)\}, \{U_k\}, \{D_k(t)\})$$
where:

$$\mathcal{M}$$: A 3-channel tensor representing the static spatial map (recovery area, obstacles, buildings).
$$\{P_i(t)\}$$: Current 3D positions of all China UAV drones.
$$\{b_i(t)\}$$: Current battery levels of all China UAV drones.
$$\{\phi_i(t)\}$$: Operational status coefficients for all China UAV drones.
$$\{U_k\}$$: Fixed locations of all ground base stations.
$$\{D_k(t)\}$$: Current data backlog awaiting transmission at each base station.

This rich state representation allows the China UAV drone agent to perceive the spatial, energetic, and communicative context of the mission.
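
One possible in-memory layout for this state is sketched below. The class name `PatrolState`, the field names, the 2D base-station coordinates, and the flattening order are all assumptions made for illustration; the source does not specify how the state is encoded before it reaches the network.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PatrolState:
    static_map: List[List[List[int]]]        # 3 channels x M x M (A, B, S masks)
    positions: List[Tuple[int, int, int]]    # P_i(t) for every drone
    batteries: List[int]                     # b_i(t)
    status: List[float]                      # phi_i(t)
    bs_locations: List[Tuple[int, int]]      # U_k (assumed 2D grid coordinates)
    backlog: List[float]                     # D_k(t)

    def flatten(self) -> List[float]:
        """Concatenate all components into one flat feature vector."""
        flat: List[float] = []
        for channel in self.static_map:
            for row in channel:
                flat.extend(float(v) for v in row)
        for p in self.positions:
            flat.extend(float(v) for v in p)
        flat.extend(float(b) for b in self.batteries)
        flat.extend(self.status)
        for u in self.bs_locations:
            flat.extend(float(v) for v in u)
        flat.extend(self.backlog)
        return flat


# Tiny example: 2 x 2 grid, one drone, two base stations.
state = PatrolState(
    static_map=[[[1, 0], [0, 0]], [[0, 0], [0, 1]], [[0, 1], [0, 0]]],
    positions=[(0, 0, 1)],
    batteries=[200],
    status=[1.0],
    bs_locations=[(0, 1), (1, 1)],
    backlog=[300.0, 300.0],
)
print(len(state.flatten()))  # 23 = 12 map + 3 position + 1 + 1 + 4 + 2
```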

3.2 Reward Function Engineering

The reward function is the crucial mechanism guiding the China UAV drone toward optimal behavior. It must encode all our objectives: data collection, safety, energy preservation, and path efficiency. We define a composite reward for the $$i$$-th China UAV drone at step $$t$$ as:
$$r_i(t) = S_i(t) + \omega \sum_{k=1}^{K} \left( D_k(t) - D_k(t+1) \right) + \Theta_i(t) + \beta$$

The components are:

Data Delivery Reward ($$\omega \sum_{k} \Delta D_k$$, with $$\Delta D_k = D_k(t) - D_k(t+1)$$): The core incentive. It rewards the reduction of data backlog $$D_k$$ at all base stations, weighted by a factor $$\omega$$. This encourages the China UAV drone swarm to fly in a way that maximizes the total system throughput.
Safety Penalty ($$S_i(t)$$): A large negative penalty imposed if the China UAV drone attempts an illegal action, such as entering a no-fly zone or colliding with an obstacle/building.
Energy Penalty ($$\Theta_i(t)$$): A severe negative penalty applied if the China UAV drone's battery depletes to zero before it has returned to the recovery area. This strongly incentivizes timely return.
Path Cost ($$\beta$$): A small negative constant (e.g., -0.1) applied per step to encourage shorter, more direct paths and discourage unnecessary loitering.

The careful tuning of the weights ($$\omega$$) and penalty magnitudes is essential for training a successful China UAV drone policy.
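
The composite reward can be written as a single function, sketched here under stated assumptions. The weight $$\omega = 1.0$$ and path cost $$\beta = -0.1$$ follow the values quoted in this article; the safety and energy penalty magnitudes (-10 and -50) are placeholders, since the text does not give them.

```python
from typing import Sequence


def step_reward(backlog_before: Sequence[float], backlog_after: Sequence[float],
                illegal_move: bool, battery_dead_away: bool,
                omega: float = 1.0, beta: float = -0.1,
                safety_penalty: float = -10.0,     # assumed magnitude
                energy_penalty: float = -50.0      # assumed magnitude
                ) -> float:
    """r_i(t) = S_i(t) + omega * sum_k (D_k(t) - D_k(t+1)) + Theta_i(t) + beta."""
    delivered = sum(b - a for b, a in zip(backlog_before, backlog_after))
    s_i = safety_penalty if illegal_move else 0.0
    theta_i = energy_penalty if battery_dead_away else 0.0
    return s_i + omega * delivered + theta_i + beta


# Two units of backlog cleared at the first base station, no violations.
r_ok = step_reward([10.0, 5.0], [8.0, 5.0], illegal_move=False, battery_dead_away=False)
# Same delivery, but the drone tried to enter a no-fly zone.
r_bad = step_reward([10.0, 5.0], [8.0, 5.0], illegal_move=True, battery_dead_away=False)
print(round(r_ok, 2), round(r_bad, 2))  # 1.9 -8.1
```

With these placeholder magnitudes, a single illegal move outweighs several steps of data delivery, which illustrates why the relative scale of the penalties matters during tuning.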

3.3 Training Strategy and Action Space

The action space for each China UAV drone is discrete, corresponding to moving to one of the adjacent cells in the 3D grid (including maintaining altitude, ascending, or descending). The DDQN algorithm, as outlined in Section 2.3, is employed for training. The online network takes the state $$s(t)$$ as input and outputs Q-values for each possible action. The agent follows an $$\epsilon$$-greedy policy during training to balance exploration and exploitation. Experiences are stored in the replay buffer, and the network is updated by sampling minibatches and minimizing the DDQN loss. The target network updates periodically. This process allows each China UAV drone agent to progressively learn a policy that maximizes long-term cumulative reward, effectively solving the joint optimization of communication and navigation.
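
The action set and the exploration rule can be sketched as follows. The six-neighbor move set is one reading of "adjacent cells in the 3D grid" and is an assumption, as are the names `ACTIONS` and `epsilon_greedy`.

```python
import random
from typing import Dict, Sequence, Tuple

# Assumed 6-neighbor move set: (dx, dy, dh) offsets in grid cells.
ACTIONS: Dict[int, Tuple[int, int, int]] = {
    0: (1, 0, 0),    # east
    1: (-1, 0, 0),   # west
    2: (0, 1, 0),    # north
    3: (0, -1, 0),   # south
    4: (0, 0, 1),    # ascend
    5: (0, 0, -1),   # descend
}


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.4, 0.9, 0.2, 0.0, 0.3]   # illustrative network output
greedy = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
print(greedy, ACTIONS[greedy])  # 2 (0, 1, 0)
```

In practice $$\epsilon$$ is usually decayed from a value near 1 toward a small floor over training, so that early episodes explore the grid and later episodes exploit the learned Q-values.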

4. Simulation Analysis and Performance Evaluation

To validate our proposed DDQN-based route planning framework for a China UAV drone swarm, we conducted comprehensive simulations. The environment was configured to emulate a realistic urban patrol scenario with time-varying communication channels and multiple spatial constraints, reflecting typical operational conditions for China UAV drones.

Table 3: Simulation Parameters
Parameter	Value
Number of Base Stations ($$K$$)	10
Number of China UAV Drones	3
Total Mission Data Volume (ROI)	600 MByte
UAV Flight Speed	10 m/s
UAV Initial Battery (Time Steps)	200
Base Station Tx Power	45 dBm
Training Episodes	3 × 10⁵
DDQN Learning Rate ($$\alpha$$)	3 × 10⁻⁵
Discount Factor ($$\gamma$$)	0.95
Reward Weight for Data ($$\omega$$)	1.0

4.1 Learning Performance and Convergence

The primary performance metric is the total uplink data volume transmitted by the entire China UAV drone swarm over a complete mission episode. Figure 1 shows the learning curve, where the total data volume increases and stabilizes as training progresses across episodes. This demonstrates that the DDQN agents successfully learn to coordinate their paths to improve data collection. The learning curve for a separate, unseen test set of environmental configurations closely follows the training trend, indicating that the learned policy generalizes well to new, unforeseen urban layouts and communication patterns. This robustness is critical for the deployment of China UAV drone systems in dynamic real-world settings.

4.2 Algorithm Comparison: DDQN vs. DQN

A key aspect of our analysis was to verify the advantage of using DDQN over standard DQN for this application. We compared the training loss (mean squared error between target and predicted Q-values) of both algorithms over the course of training. The results, illustrated in Figure 2, show that DDQN achieves a lower and more stable loss value. The DDQN loss curve descends more rapidly and remains smoother than the more volatile DQN curve. This behavior is consistent with DDQN mitigating the overestimation bias, which leads to more stable and reliable value function approximation. Consequently, the resulting policy for the China UAV drone swarm is more robust and performs better in maximizing the mission objective while adhering to constraints.

4.3 Trajectory and Strategy Analysis

By examining the trajectories generated by the trained China UAV drone swarm, we observe intelligent emergent behaviors. The drones learn to spread out to cover areas with better communication links to different base stations, implicitly implementing a form of spatial multiplexing. They proactively avoid no-fly zones and tall buildings while scheduling their return to the recovery site well before battery depletion. The swarm dynamically adapts its flight patterns based on the simulated communication quality map, demonstrating that the DDQN framework effectively solves the multi-objective optimization problem without requiring explicit pre-programmed rules for each scenario.

Table 4: Performance Summary
Metric	DDQN-based System	Note
Average Total Data Delivered	High & Stable	Superior to baseline DQN and rule-based approaches.
Safety Constraint Violations	~0%	Agents reliably learn to avoid obstacles and no-fly zones.
Battery Exhaustion Events	~0%	Agents learn efficient energy management and timely return.
Generalization to New Maps	Excellent	Trained model performs robustly on unseen urban configurations.

5. Conclusion

This research has addressed the complex problem of route planning for a system of multiple China UAV drones in urban patrol missions. By framing the challenge as a multi-objective optimization task under communication, safety, and energy constraints, we developed a deep reinforcement learning solution based on the Double Deep Q-Network (DDQN). The core of our approach was a careful formulation of the system, including spatial discretization, energy modeling, and a realistic communication model. The design of a multi-dimensional reward function was pivotal in guiding the China UAV drone agents to balance data throughput maximization with safe and energy-efficient navigation.

Our simulation results demonstrate that the DDQN-based agents successfully learn cooperative strategies without direct inter-agent reward coupling. The trained model generates efficient flight trajectories that maximize data delivery to ground stations while strictly adhering to all spatial and energetic constraints. Furthermore, the comparative analysis confirms that DDQN provides more stable and reliable learning than standard DQN by mitigating value overestimation, a critical factor for safety-sensitive applications. The policy exhibits strong generalization capabilities, performing effectively in previously unseen urban environments. This work validates the significant applicability and potential of advanced deep reinforcement learning techniques, like DDQN, in enabling intelligent and autonomous operation for next-generation China UAV drone swarms in complex civilian and professional domains.