Chaotic-Reinforcement-Learning-Driven Multimodal ALNS for Police UAV Cooperative Patrol Path Planning

The landscape of public security is becoming increasingly complex, making the construction of a three-dimensional治安巡逻体系 critically important. Unmanned Aerial Vehicles (UAVs), with their superior mobility and multi-source information acquisition capabilities, play a vital role in this system. However, for police UAV operations in urban environments, cooperative path planning for multiple UAVs from a single nest faces multiple intertwined challenges. It requires balancing endurance with coverage area, ensuring the quality of巡查 in key regions, and accommodating the different priority levels of various patrol points. These spatio-temporal constraints are deeply coupled, forming a classic NP-hard combinatorial optimization problem that demands highly robust and dynamically adaptive path planning solutions.

Early research often employed meta-heuristic methods like Genetic Algorithms (GA), which possess global search capabilities but are prone to getting trapped in local optima when dealing with complex airspace constraints. In light of this, the Adaptive Large Neighborhood Search (ALNS) framework proposed by researchers has achieved efficient exploration through dynamically selected destroy and repair operators. Subsequent introductions of multi-restart annealing mechanisms have further enhanced algorithmic stability. Recent works have validated the robustness of such algorithms in time-varying task scenarios. For police UAV patrol path planning, hybrid optimization methods combining genetic algorithms with local search strategies have demonstrated effectiveness in practical applications.

The integration of Reinforcement Learning (RL) and chaotic mechanisms has recently propelled the intelligent development of path planning algorithms. Dynamic operator selection mechanisms based on RL have significantly improved solution efficiency. Strategies involving multi-chaotic map switching have enhanced the global search capability of algorithms. However, the performance of ALNS algorithms remains highly dependent on the strategy for choosing destroy and repair operators. Mainstream methods, such as heuristic weight adaptation or RL-based dynamic scheduling, typically operate within a single modality (i.e., homogeneous heuristic rules), struggling to balance exploration and exploitation in complex scenarios like police UAV patrols.

Current research in chaotic optimization and reinforcement learning, while deepening, often remains fragmented. In complex task environments like police UAV patrols, enabling search mechanisms to gain adaptive self-tuning capabilities has become a key challenge. To address this, we propose a novel Chaotic Reinforcement Learning-driven Multimodal Adaptive Large Neighborhood Search (CRL-ALNS) algorithm. This method dynamically selects search operators through intelligent environmental perception, providing smart guidance throughout the search process, thereby effectively enhancing both search efficiency and solution quality.

Problem Description and Mathematical Model

Problem Formulation

Police patrols are crucial for maintaining social order. This study focuses on path planning for multiple police UAVs deployed from a fixed nest at a police station, operating at a constant altitude. Patrol points are categorized based on strategic importance and治安状况: critical locations like government buildings are classified as Level-1 points, requiring full coverage; general public areas are Level-2 points, requiring a high basic coverage rate. Each patrol point has a strictly defined time window for service and a required service duration.

Mathematical Model

We formulate a dual-objective optimization model. The primary objective is to minimize the total flight distance for the entire fleet. Given distance-optimal solutions, the secondary objective is to minimize the number of police UAVs deployed.

Sets and Indices:

$V = \{0, 1, …, n\}$: Set of nodes, where $0$ represents the central nest.
$i, j \in V, i,j \neq 0$: Indices for patrol points.
$V_p \subset V$: Set of Level-1 (priority) patrol points.
$V_s \subset V$: Set of Level-2 (standard) patrol points.
$K = \{1, 2, …, m\}$: Set of available police UAVs.

Parameters:

$d_{ij}$: Euclidean distance from point $i$ to point $j$.
$t_{ij}$: Flight time from point $i$ to point $j$.
$s_i$: Service (inspection) time required at point $i$.
$[e_i, l_i]$: Time window for point $i$, where service must start within this interval.
$T_{max}$: Maximum flight endurance (time) for a single police UAV.
$D_{max}$: Maximum flight range (distance) for a single police UAV.
$M$: A sufficiently large positive constant (Big-M).

Decision Variables:

$x_{ijk} \in \{0,1\}$: Equals 1 if police UAV $k$ travels from point $i$ to point $j$; 0 otherwise.
$y_{ik} \in \{0,1\}$: Equals 1 if point $i$ is served by police UAV $k$; 0 otherwise.
$t_i \ge 0$: Start time of service at point $i$.
$u_{ik} \ge 0$: Auxiliary variable for the Miller-Tucker-Zemlin (MTZ) sub-tour elimination constraints.

Objective Functions:

$$ \text{min } Z_1 = \sum_{k \in K} \sum_{i \in V} \sum_{j \in V} d_{ij} x_{ijk} \tag{1} $$
$$ \text{min } Z_2 = \sum_{k \in K} y_{0k} \tag{2} $$

Subject to:

$$ \sum_{k \in K} y_{ik} = 1, \quad \forall i \in V_p \tag{3} $$
$$ \frac{1}{|V_s|} \sum_{i \in V_s} \sum_{k \in K} y_{ik} \ge \alpha \quad (\text{e.g., } \alpha = 0.9) \tag{4} $$
$$ \sum_{k \in K} y_{ik} \le 1, \quad \forall i \in V, i \neq 0 \tag{5} $$
$$ \sum_{j \in V} x_{ijk} = y_{ik}, \quad \forall i \in V, \forall k \in K \tag{6} $$
$$ \sum_{i \in V} x_{ijk} = y_{jk}, \quad \forall j \in V, \forall k \in K \tag{7} $$
$$ \sum_{j \in V, j \neq 0} x_{0jk} = y_{0k}, \quad \forall k \in K \tag{8} $$
$$ \sum_{i \in V, i \neq 0} x_{i0k} = y_{0k}, \quad \forall k \in K \tag{9} $$
$$ t_j \ge t_i + s_i + t_{ij} – M(1 – x_{ijk}), \quad \forall i,j \in V, i \neq j, \forall k \in K \tag{10} $$
$$ e_i \le t_i \le l_i, \quad \forall i \in V \tag{11} $$
$$ \sum_{i \in V} \sum_{j \in V} t_{ij} x_{ijk} + \sum_{i \in V} s_i y_{ik} \le T_{max}, \quad \forall k \in K \tag{12} $$
$$ \sum_{i \in V} \sum_{j \in V} d_{ij} x_{ijk} \le D_{max}, \quad \forall k \in K \tag{13} $$
$$ u_{ik} – u_{jk} + |V| \cdot x_{ijk} \le |V| – 1, \quad \forall i,j \in V, i \neq j, \forall k \in K \tag{14} $$
$$ x_{ijk}, y_{ik} \in \{0,1\}, u_{ik} \ge 0, t_i \ge 0 \tag{15} $$

Constraint (3) ensures full coverage of all priority (Level-1) points. Constraint (4) enforces a minimum coverage rate $\alpha$ for standard (Level-2) points, providing scheduling flexibility. In our computational tests, we set $\alpha=1$ to evaluate the algorithm’s limit performance. Constraints (6)-(9) enforce flow conservation, ensuring each used police UAV starts and ends at the nest. Constraints (10) and (11) maintain temporal consistency and time window feasibility. Constraints (12) and (13) respect the endurance and range limits of each police UAV. Constraint (14) is the MTZ formulation for sub-tour elimination.

The CRL-ALNS Algorithm

The core of our proposed solution is the Chaotic-Reinforcement-Learning-driven Multimodal ALNS (CRL-ALNS) algorithm. It extends the classic ALNS framework by integrating a hybrid chaotic mechanism for global exploration, a Deep Reinforcement Learning (DRL) module for intelligent operator selection, and a multimodal strategy orchestrator to balance different decision-making paradigms.

Adaptive Hybrid Chaotic Mechanism

Chaotic systems, characterized by extreme sensitivity to initial conditions, ergodicity, and inherent stochasticity, provide a powerful driver for optimization algorithms to escape local optima. We construct an adaptive hybrid chaotic mechanism to replace traditional pseudo-random number generators, injecting more exploratory perturbation sequences into the neighborhood search process for police UAV path planning.

1. Composite Chaotic System: We fuse the Logistic map and the Tent map to create a hybrid chaotic system. This addresses issues like insufficient traversal uniformity and periodic windows in single maps. An adaptive probability switching mechanism dynamically selects the chaos source, leveraging their complementary dynamic characteristics to improve statistical performance.
$$ \text{Logistic: } \zeta_{n+1} = \mu \cdot \zeta_n (1 – \zeta_n), \quad \zeta_n \in (0,1), \mu \in [3.57, 4] $$
$$ \text{Tent: } \zeta_{n+1} = \begin{cases} \frac{\zeta_n}{\beta}, & \text{if } \zeta_n < \beta \\ \frac{1 – \zeta_n}{1 – \beta}, & \text{otherwise} \end{cases}, \quad \zeta_n \in (0,1), \beta \in (0,1) $$

2. Adaptive Chaotic Weight: To dynamically match perturbation intensity with the search progress, we introduce an adaptive chaotic weight based on a non-linear decay strategy. This controls the magnitude of perturbations applied during the ruin (destroy) phase of ALNS.
$$ w_{chaos}(iter) = w_{max} – (w_{max} – w_{min}) \times \left( \frac{iter}{MaxIter} \right)^{\gamma} $$
where $iter$ is the current iteration, $MaxIter$ is the maximum number of iterations, $w_{max}$ and $w_{min}$ are the initial and final weight bounds, and $\gamma$ controls the decay rate.

Deep Reinforcement Learning Based Operator Selection Strategy

The performance of ALNS critically depends on the strategy for selecting destroy and repair operators. Traditional methods use static or reactive weight adjustments, which often fail to adapt effectively to the complex changes during the search for optimal police UAV paths. To overcome this, we integrate a Deep Q-Network (DQN) to learn an intelligent selection policy.

1. State Space ($s_t$): The state vector encodes comprehensive information about the current solution, search progress, and history.

Search Progress: Current iteration divided by maximum iterations.
Solution Quality: Ratio of current objective value to historical best value.
Search Diversity: Metric based on Hamming distance of recent solution set.
Operator Performance History: A sliding-window weighted average score for each operator, quantifying its recent contribution to solution improvement.

2. Action Space ($a_t$) and Reward Function ($r_t$): An action is the selection of a specific destroy-repair operator pair from the Cartesian product $D \times R$. The reward function is designed to guide the RL agent toward long-term efficient search strategies.
$$ r_t = \lambda_1 \cdot \frac{Z(S_{current}) – Z(S_{candidate})}{|Z(S_{current})|} + \lambda_2 \cdot \mathbb{I}(Z(S_{candidate}) < Z(S^*) ) + \lambda_3 \cdot \Phi(S_{candidate}) \tag{17} $$
where $Z(\cdot)$ is the cost function (e.g., total distance), $S^*$ is the historical best solution, $\mathbb{I}(\cdot)$ is an indicator function (1 if condition true, else 0), $\Phi(\cdot)$ is a feasibility reward function for candidate solution $S_{candidate}$, and $\lambda_1, \lambda_2, \lambda_3$ are non-negative weights summing to 1, balancing the contribution of immediate improvement, milestone achievement, and constraint satisfaction.

3. Network Architecture and Training: We employ a DQN architecture with a target network and experience replay. The Q-function approximator is a multi-layer perceptron trained using stochastic gradient descent and a mean-squared error loss function. The exploration-exploitation trade-off is managed by an $\epsilon$-greedy policy with decaying $\epsilon$.

Multimodal Operator Selection Strategy

We propose a multimodal orchestration scheme integrating the DRL decision framework, a probabilistic selection mechanism, and an emergency response module. This addresses the complex multi-objective and dynamic challenges in police UAV path planning by establishing synergistic and competitive relationships between strategies for real-time adaptation.

1. Sliding-Window Probabilistic Strategy (SWPS): As a complementary stabilization mechanism, SWPS monitors operator performance via a sliding window, focusing on objective improvement and feasibility enhancement. Selection probability uses a Boltzmann distribution:
$$ P_i = \frac{\exp(\eta_i / \tau)}{\sum_{j=1}^{K} \exp(\eta_j / \tau)} \tag{18} $$
where $\eta_i$ is the normalized efficiency score of operator $i$, $K$ is the total number of operators, and $\tau$ is a temperature parameter controlling randomness (higher $\tau$ increases exploration).

2. Directed Breakthrough Strategy for Search Stagnation (DBS): Upon detecting search stagnation (e.g., no improvement over many iterations), this strategy activates automatically. It increases the destruction strength of ruin operators and prioritizes the use of high-performing but recently underutilized backup operators, effectively breaking the current impasse and helping the algorithm escape local optima.

The overall algorithm flow for solving the police UAV cooperative patrol problem is outlined below, integrating all three modalities (DRL, SWPS, DBS) with the hybrid chaotic mechanism within the ALNS iterative framework.

Algorithm 1: CRL-ALNS for Police UAV Patrol
1: Input: Problem instance, MaxIter, DRL model params, chaos params. 2: Initialize: Generate initial solution $S_{current}$ using a greedy heuristic. $S_{best} \leftarrow S_{current}$. Initialize operator scores and DRL state. 3: for $iter = 1$ to $MaxIter$ do 4: // Multimodal Strategy Orchestration 5: if Stagnation detected then $mode \leftarrow \text{DBS}$ 6: else if DRL model is sufficiently trained & confident then $mode \leftarrow \text{DRL}$ 7: else $mode \leftarrow \text{SWPS}$ 8: // Select Operator Pair based on current $mode$ 9: Select destroy operator $d$ and repair operator $r$. 10: // Ruin Phase with Chaos 11: Generate chaotic weight $w_c$ using adaptive formula. 12: $S_{temp} \leftarrow$ Apply operator $d$ to $S_{current}$, removing $q$ nodes. $q$ is influenced by $w_c$. 13: // Recreate Phase 14: $S_{candidate} \leftarrow$ Apply operator $r$ to reinsert removed nodes into $S_{temp}$. 15: // Acceptance and Update 16: Evaluate $S_{candidate}$ using objective (1) and constraints. 17: if $S_{candidate}$ is accepted (e.g., simulated annealing criterion) then 18: $S_{current} \leftarrow S_{candidate}$. 19: if $Z(S_{candidate}) < Z(S_{best})$ then $S_{best} \leftarrow S_{candidate}$. 20: Calculate reward $r_t$ based on Eq. (17). 21: Update DRL model state, action, reward ($s_t, a_t, r_t, s_{t+1}$). 22: Update operator scores in SWPS based on performance. 23: end for 24: Output: Best found solution $S_{best}$ (paths for each police UAV).

Algorithm 1: CRL-ALNS for Police UAV Patrol

1: Input: Problem instance, MaxIter, DRL model params, chaos params.
2: Initialize: Generate initial solution $S_{current}$ using a greedy heuristic. $S_{best} \leftarrow S_{current}$. Initialize operator scores and DRL state.
3: for $iter = 1$ to $MaxIter$ do
4:     // Multimodal Strategy Orchestration
5:     if Stagnation detected then $mode \leftarrow \text{DBS}$
6:     else if DRL model is sufficiently trained & confident then $mode \leftarrow \text{DRL}$
7:     else $mode \leftarrow \text{SWPS}$
8:     // Select Operator Pair based on current $mode$
9:     Select destroy operator $d$ and repair operator $r$.
10:     // Ruin Phase with Chaos
11:     Generate chaotic weight $w_c$ using adaptive formula.
12:     $S_{temp} \leftarrow$ Apply operator $d$ to $S_{current}$, removing $q$ nodes. $q$ is influenced by $w_c$.
13:     // Recreate Phase
14:     $S_{candidate} \leftarrow$ Apply operator $r$ to reinsert removed nodes into $S_{temp}$.
15:     // Acceptance and Update
16:     Evaluate $S_{candidate}$ using objective (1) and constraints.
17:     if $S_{candidate}$ is accepted (e.g., simulated annealing criterion) then
18:        $S_{current} \leftarrow S_{candidate}$.
19:     if $Z(S_{candidate}) < Z(S_{best})$ then $S_{best} \leftarrow S_{candidate}$.
20:     Calculate reward $r_t$ based on Eq. (17).
21:     Update DRL model state, action, reward ($s_t, a_t, r_t, s_{t+1}$).
22:     Update operator scores in SWPS based on performance.
23: end for
24: Output: Best found solution $S_{best}$ (paths for each police UAV).

Computational Experiments and Analysis

Test Instance Design and Parameter Settings

We conduct simulations to validate the CRL-ALNS algorithm for the single-nest multi-police UAV cooperative patrol task, evaluating its performance under multiple constraints.

Patrol Area: A 20km x 20km square. The UAV nest is located at the geometric center.
Patrol Points: 50 Level-1 points and 150 Level-2 points are randomly generated following a uniform distribution within the area. Different random seeds create distinct spatial configurations for testing stability.
Coverage Requirement: For testing algorithmic limits, we require 100% coverage of all points ($\alpha=1$ in Eq. (4)).
Police UAV Parameters: A homogeneous fleet is assumed: flight speed = 20 km/h, endurance $T_{max}$ = 6 hours, maximum range $D_{max}$ is set large enough not to be the primary active constraint in these test instances, reflecting the capabilities of modern fixed-wing or hybrid police UAVs.

Core Destroy and Repair Operators

The effectiveness of ALNS hinges on its operators. The following table lists key operators adapted for the police UAV patrol problem with time windows and priorities.

Type	Operator Name	Description (Adapted for Police UAV Context)
Destroy (Ruin)	Random Removal	Randomly selects and removes a number of patrol points from current routes.
	Worst Distance Removal	Removes points that contribute most to the total detour or distance in their current assigned route.
	Time Window Based Removal	Removes points that are most critical or cause time window violations/infeasibility.
Repair (Recreate)	Greedy Insertion	Inserts removed points into the best feasible position in any route, minimizing the increase in total distance.
	Regret-k Insertion	Considers the cost difference between inserting a point into its best route and its k-th best route, prioritizing points with high regret.
	Time Window Feasibility First Insertion	Prioritizes inserting points into positions that maintain or restore time window feasibility, even with a distance penalty.

Results and Performance Comparison

We generate 10 benchmark instances using random seeds (42, 44, …, 60). Each instance is solved independently 20 times by CRL-ALNS, and average performance metrics are recorded. We compare against an Improved Genetic Algorithm (GA) with population size 100, crossover rate 0.85, and mutation rate 0.05, using the same initial greedy solution.

Random Seed	Total Flight Distance (km)		Distance Reduction vs. GA	Number of Police UAVs Used		UAVs Saved vs. GA
	GA	CRL-ALNS		GA	CRL-ALNS
42	974.7	355.4	63.5%	16	14	2
44	1540.3	282.8	81.6%	17	9	8
46	1108.1	303.2	72.6%	19	10	9
48	1562.4	311.5	80.1%	20	10	10
50	1018.5	326.2	68.0%	15	12	3
52	1566.1	280.6	82.1%	20	9	11
54	1524.9	307.5	79.8%	20	11	9
56	761.0	301.1	60.5%	10	9	1
58	914.9	317.5	65.3%	13	11	2
60	1058.6	280.2	73.5%	19	9	10
Average	1203.0	306.6	72.7%	16.9	10.4	6.5
Std. Dev.	332.8	20.9	–	3.5	1.4	–

The results demonstrate the significant superiority of the proposed CRL-ALNS algorithm. On average, it reduces the total flight distance by 72.7% compared to the Improved GA (from 1203.0 km to 306.6 km). This dramatic reduction highlights the effectiveness of the multimodal intelligent search mechanism in discovering highly efficient patrol routes for police UAVs.

Concurrently, the algorithm optimizes resource utilization. The average number of police UAVs required decreases from 16.9 to 10.4, a reduction of 38.5% (saving 6.5 UAVs on average). This indicates that CRL-ALNS effectively consolidates patrol routes and optimizes task allocation, leading to substantial gains in operational efficiency for police departments.

A key observation is the remarkable stability of CRL-ALNS. The standard deviation of the total distance across different spatial configurations is only 20.9 km for CRL-ALNS, compared to 332.8 km for GA. This stability stems from the dual assurance of our core architecture: the DRL module dynamically adjusts the search direction based on environmental features, while the chaotic system maintains global exploration vitality through persistent perturbation. This synergy allows the algorithm to handle diverse patrol scenarios robustly.

The convergence behavior of CRL-ALNS is characterized by rapid improvement in the initial phases (often within 250 iterations), followed by steady refinement facilitated by the RL guidance and chaotic perturbations, converging to a stable solution after around 500 iterations across different instances. This pattern confirms the method’s stable performance and strong adaptive capability in varying police UAV patrol scenarios.

Analysis of the Reinforcement Learning Reward Components

The designed reward function (Eq. 17) is crucial for guiding the DRL agent. The following table illustrates the typical contribution of each component during different search phases for a police UAV patrol instance.

Search Phase	Immediate Improvement ($\lambda_1$ term)	Milestone Reward ($\lambda_2$ term)	Feasibility Reward ($\lambda_3$ term)	Dominant Component
Early Phase	Frequent, small/medium positive rewards as the initial solution is easily improved.	Occasional, high reward when a new global best is found.	Critical, as many random moves may break time windows. Penalties/rewards for restoring feasibility are common.	Feasibility & Immediate Improvement
Mid Phase	Smaller, less frequent positive rewards as the solution matures.	Less frequent, but highly significant for convergence.	Still important, but constraints are generally maintained by well-performing operators.	Immediate Improvement & Milestone
Late Phase (Refinement)	Very small or negative rewards (worsening moves accepted by SA to escape local optima).	Rare, high reward for finding the final optimum.	Most moves are feasible. Reward is near zero unless a previously infeasible region is explored.	Milestone (guides final convergence)

Conclusion

Addressing the mission requirements for cooperative patrol of multiple police UAVs from a single nest, this paper establishes a dual-objective optimization model incorporating prioritized points and time window constraints. We propose a novel Chaotic Reinforcement Learning-driven Multimodal ALNS (CRL-ALNS) algorithm. It features an intelligent operator selection strategy via Deep Reinforcement Learning and enhances global exploration through a hybrid chaotic perturbation mechanism, forming an efficient and stable solution framework.

Experimental results demonstrate that, compared to an Improved Genetic Algorithm, our CRL-ALNS algorithm achieves a 72.7% reduction in total flight distance and a 38.5% reduction in the number of police UAVs required, while maintaining significantly higher stability across diverse spatial scenarios. This indicates substantial improvements in both patrol efficacy and resource utilization efficiency. The algorithm exhibits considerable engineering application value and promising potential for broader deployment in real-world police UAV operations and other complex vehicle routing problems with rich constraints.