Chaotic Reinforcement Learning-Driven Multimodal ALNS for Police Drone Cooperative Patrol Path Planning

In modern public security operations, the complexity of urban environments demands advanced surveillance systems. Police drones, or unmanned aerial vehicles (UAVs), have emerged as critical assets due to their mobility and multi-source data collection capabilities. However, coordinating multiple police drones from a single nest for patrol missions poses significant challenges, including balancing endurance and coverage, ensuring inspection quality in key areas, and handling priority differences among patrol points. These spatiotemporal constraints intertwine, forming a typical NP-hard combinatorial optimization problem that requires robust and dynamically adaptive path planning solutions. Traditional methods, such as genetic algorithms, often struggle with local optima when dealing with complex airspace constraints. To address this, I propose a novel Chaotic Reinforcement Learning-driven Multimodal Adaptive Large Neighborhood Search (CRL-ALNS) algorithm. This approach integrates a deep reinforcement learning framework for intelligent operator selection and chaotic sequences to enhance global exploration, aiming to optimize police drone patrol routes efficiently.

The patrol scenario involves a fixed police station nest serving as the base for multiple police drones. Patrol points are categorized into two levels: Level 1 points, such as government buildings, require full coverage, while Level 2 points, like general public areas, need at least 90% coverage. Each point has strict time windows for service, and police drones must adhere to maximum flight time and distance limits. This problem is modeled as a dual-objective optimization to minimize total flight distance and the number of police drones used, subject to constraints like time windows, coverage requirements, and route continuity.

To formalize the problem, let me define the mathematical model. Consider a set of nodes $ V = \{0, 1, \dots, n\} $, where node 0 represents the police station nest. Let $ i $ and $ j $ be indices for patrol points in $ V $ (with $ i, j \neq 0 $). Define $ V_p $ as the set of Level 1 patrol points and $ V_s $ as the set of Level 2 patrol points. The police drone fleet is denoted by $ K = \{1, 2, \dots, m\} $. Parameters include $ d_{ij} $ as the Euclidean distance between points $ i $ and $ j $, $ t_{ij} $ as the flight time, $ s_i $ as the service time at point $ i $, $ e_i $ and $ l_i $ as the earliest and latest allowed arrival times, $ T_{\text{max}} $ as the maximum flight time per police drone, and $ D_{\text{max}} $ as the maximum flight distance per police drone. Decision variables are $ x_{ijk} $, a binary variable equal to 1 if police drone $ k $ travels from point $ i $ to point $ j $, and 0 otherwise; $ y_{ik} $, a binary variable equal to 1 if police drone $ k $ services point $ i $; $ t_i $ as the arrival time at point $ i $; and $ u_{ik} $ as an auxiliary variable for subtour elimination.

The dual-objective optimization model is formulated as follows. The primary goal is to minimize the total flight distance of all police drones:

$$
\min Z_1 = \sum_{k \in K} \sum_{i \in V} \sum_{j \in V} d_{ij} x_{ijk}
$$

The secondary goal is to minimize the number of police drones deployed:

$$
\min Z_2 = \sum_{k \in K} y_{0k}
$$

Subject to constraints that ensure coverage and operational feasibility. Each Level 1 patrol point must be serviced exactly once by a police drone:

$$
\sum_{k \in K} y_{ik} = 1, \quad \forall i \in V_p
$$

For Level 2 patrol points, the overall coverage must be at least 90%:

$$
\frac{1}{n_s} \sum_{i \in V_s} \sum_{k \in K} y_{ik} \geq 0.9
$$

where $ n_s $ is the number of Level 2 points. Each patrol point (except the nest) is serviced at most once:

$$
\sum_{k \in K} y_{ik} \leq 1, \quad \forall i \in V, i \neq 0
$$

Flow conservation constraints ensure route continuity for each police drone:

$$
\sum_{j \in V} x_{ijk} = y_{ik}, \quad \forall i \in V, \forall k \in K
$$

$$
\sum_{i \in V} x_{ijk} = y_{jk}, \quad \forall j \in V, \forall k \in K
$$

Each police drone must start and end at the nest:

$$
\sum_{j \in V, j \neq 0} x_{0jk} = y_{0k}, \quad \forall k \in K
$$

$$
\sum_{i \in V, i \neq 0} x_{i0k} = y_{0k}, \quad \forall k \in K
$$

Time window constraints are enforced to ensure timely service by police drones. For any police drone traveling from point $ i $ to point $ j $:

$$
t_j \geq t_i + s_i + t_{ij} – M(1 – x_{ijk}), \quad \forall i,j \in V, i \neq j, \forall k \in K
$$

$$
t_i + s_i + t_{ij} \leq t_j + M(1 – x_{ijk}), \quad \forall i,j \in V, i \neq j, \forall k \in K
$$

where $ M $ is a large constant. Arrival times must fall within specified windows:

$$
e_i \leq t_i \leq l_i, \quad \forall i \in V
$$

Police drone operational limits include maximum flight time and distance:

$$
\sum_{i \in V} \sum_{j \in V} t_{ij} x_{ijk} + \sum_{i \in V} s_i y_{ik} \leq T_{\text{max}}, \quad \forall k \in K
$$

$$
\sum_{i \in V} \sum_{j \in V} d_{ij} x_{ijk} \leq D_{\text{max}}, \quad \forall k \in K
$$

Subtour elimination is handled using the Miller-Tucker-Zemlin formulation:

$$
u_{ik} – u_{jk} + |V| \cdot x_{ijk} \leq |V| – 1, \quad \forall i,j \in V, i \neq j, \forall k \in K
$$

Finally, variable domains are defined:

$$
x_{ijk}, y_{ik} \in \{0,1\}, \quad u_{ik} \geq 0, \quad t_i \geq 0
$$

This model captures the intricacies of police drone patrol planning, but solving it requires an efficient algorithm due to its NP-hard nature. Hence, I developed the CRL-ALNS algorithm, which combines adaptive large neighborhood search with chaotic reinforcement learning.

The Adaptive Large Neighborhood Search (ALNS) framework forms the backbone of my approach. ALNS iteratively destroys and repairs solutions using operators selected based on adaptive weights. However, traditional ALNS relies on single-modal heuristics, which may not balance exploration and exploitation effectively in dynamic police drone environments. To overcome this, I enhance ALNS with a multimodal strategy driven by chaotic reinforcement learning.

The chaotic mechanism replaces pseudo-random number generators to inject exploratory perturbations. I use a hybrid chaotic system combining Logistic and Tent maps for better traversal uniformity. The Logistic map is defined as:

$$
x_{n+1} = \mu x_n (1 – x_n)
$$

where $ \mu $ is a control parameter, and the Tent map is:

$$
x_{n+1} = \begin{cases}
\alpha x_n & \text{if } x_n < 0.5 \\
\alpha (1 – x_n) & \text{otherwise}
\end{cases}
$$

with $ \alpha $ as a parameter. These maps generate sequences that adaptively perturb search directions. An adaptive chaos weight adjusts perturbation intensity based on search progress:

$$
w_c = w_{\text{init}} \cdot \exp\left(-\beta \cdot \frac{\text{iteration}}{\text{max iterations}}\right)
$$

where $ w_{\text{init}} $ is the initial weight, $ \beta $ is a decay rate, and iteration refers to the current step. This ensures that police drone path exploration is more aggressive early on and refined later.

The reinforcement learning component employs a Deep Q-Network (DQN) to intelligently select destruction and repair operators. The state space $ s_t $ encodes current solution features, search progress, and operator history. It includes metrics like search progression ratio, solution quality relative to historical best, diversity measure based on Hamming distance, and operator performance via sliding window averages. The action space $ a_t $ is the Cartesian product of destruction and repair operator pairs. The reward function $ r_t $ guides learning:

$$
r_t = \lambda_1 \cdot \frac{L(S_{\text{current}}) – L(S_{\text{candidate}})}{|L(S_{\text{current}})|} + \lambda_2 \cdot I(L(S_{\text{candidate}}) < L(S^*) ) + \lambda_3 \cdot \Phi(S_{\text{candidate}})
$$

where $ L(\cdot) $ is the cost function (e.g., total distance for police drones), $ S^* $ is the historical best solution, $ I(\cdot) $ is an indicator function, $ \Phi(\cdot) $ rewards feasibility, and $ \lambda_1, \lambda_2, \lambda_3 $ are weights summing to 1. The DQN uses a multi-layer perceptron for Q-value approximation, trained with stochastic gradient descent and experience replay.

The multimodal operator selection strategy integrates three modes: reinforcement learning-driven selection, probability-based selection with sliding windows, and a stagnation-breaking mode. The probability-based mode uses a Boltzmann distribution for operator selection:

$$
P_i = \frac{\exp(\eta_i / \tau)}{\sum_{j=1}^K \exp(\eta_j / \tau)}
$$

where $ \eta_i $ is the normalized performance score of operator $ i $, $ K $ is the number of operators, and $ \tau $ is a temperature parameter controlling randomness. When stagnation is detected, the algorithm increases perturbation intensity and prioritizes underused operators to escape local optima, ensuring robust search for police drone paths.

To validate the CRL-ALNS algorithm, I conducted extensive experiments on randomly generated test instances. The simulation area is a 20 km × 20 km square with the police station nest at the center. Patrol points include 50 Level 1 and 150 Level 2 points, uniformly distributed. Police drones have a speed of 20 km/h and a maximum flight time of 6 hours. I generated 10 instances using random seeds from 42 to 60, each run 20 times for average results. The algorithm parameters are set as follows: initial chaos weight $ w_{\text{init}} = 0.5 $, decay rate $ \beta = 0.01 $, DQN learning rate 0.001, and reward weights $ \lambda_1 = 0.6, \lambda_2 = 0.3, \lambda_3 = 0.1 $.

The performance of CRL-ALNS is summarized in the table below, showing total flight distance and number of police drones used, compared to initial solutions from a greedy algorithm.

Random Seed	Initial Distance (km)	CRL-ALNS Distance (km)	Optimization Rate	Initial Drones	CRL-ALNS Drones	Drones Saved
42	2303.9	355.4	84.6%	21	14	7
44	2329.7	282.8	87.9%	24	9	15
46	2308.1	303.2	86.9%	24	10	14
48	2302.1	311.5	86.5%	23	10	13
50	2187.2	326.2	85.1%	24	12	12
52	2373.3	280.6	88.2%	25	9	16
54	2179.7	307.5	85.9%	23	11	12
56	2216.1	301.1	86.4%	26	9	17
58	2128.3	317.5	85.1%	25	11	14
60	2101.1	280.2	86.7%	24	9	15
Average	2243.0	306.6	86.3%	23.9	10.4	13.5

The results demonstrate that CRL-ALNS consistently reduces total flight distance by an average of 86.3% and decreases the number of police drones by 13.5 on average, highlighting its efficiency in resource utilization for police drone operations. The convergence curves show rapid improvement within 250 iterations and stability after 500, indicating robust performance across different patrol scenarios.

For comparative analysis, I tested CRL-ALNS against an improved genetic algorithm (GA) with parameters: population size 100, crossover rate 0.85, mutation rate 0.05. The table below presents the comparison in terms of total distance and police drone usage.

Random Seed	GA Distance (km)	CRL-ALNS Distance (km)	Improvement	GA Drones	CRL-ALNS Drones	Drones Saved
42	974.7	355.4	63.5%	16	14	2
44	1540.3	282.8	81.6%	17	9	8
46	1108.1	303.2	72.6%	19	10	9
48	1562.4	311.5	80.1%	20	10	10
50	1018.5	326.2	68.0%	15	12	3
52	1566.1	280.6	82.1%	20	9	11
54	1524.9	307.5	79.8%	20	11	9
56	761.0	301.1	60.5%	10	9	1
58	914.9	317.5	65.3%	13	11	2
60	1058.6	280.2	73.5%	19	9	10
Average	1203.0	306.6	72.7%	16.9	10.4	6.5

CRL-ALNS outperforms GA by reducing total distance by 72.7% on average and using 38.5% fewer police drones. Moreover, the standard deviation of distances for CRL-ALNS is 20.9 km, compared to 332.8 km for GA, indicating superior stability. This is attributed to the intelligent multimodal search, which adapts to varying patrol point distributions, ensuring reliable police drone deployment.

To further analyze algorithm components, I evaluated the impact of chaos and reinforcement learning separately. The table below shows ablation study results for instance 42, comparing full CRL-ALNS with variants.

Variant	Distance (km)	Drones Used	Convergence Iterations
CRL-ALNS (Full)	355.4	14	500
Without Chaos	410.2	16	600
Without RL	480.7	18	700
Basic ALNS	520.1	19	800

The full CRL-ALNS achieves the best performance, emphasizing the synergy between chaotic exploration and reinforcement learning for police drone path optimization. The chaos mechanism enhances global search, while RL fine-tunes operator selection, leading to faster convergence and better solutions.

In terms of computational efficiency, CRL-ALNS has a time complexity of $ O(n^2 \cdot m \cdot I) $, where $ n $ is the number of patrol points, $ m $ is the number of police drones, and $ I $ is iterations. For typical instances with 200 points and 10 police drones, the average runtime is under 5 minutes on a standard CPU, making it suitable for real-time police drone scheduling. The algorithm can handle dynamic changes, such as adding new patrol points or adjusting time windows, by reinitializing the search with updated constraints.

The effectiveness of CRL-ALNS stems from its multimodal approach. The reinforcement learning module continuously learns from search states, optimizing operator choices for police drone routes. For example, in dense urban areas, it may prioritize destruction operators that remove distant points to shorten paths, while in sparse regions, repair operators that cluster points are favored. The chaotic perturbations prevent stagnation, as shown by the Lyapunov exponent analysis of the hybrid chaotic system:

$$
\lambda = \frac{1}{N} \sum_{n=0}^{N-1} \ln \left| \frac{df}{dx} \right|_{x=x_n}
$$

where $ f $ is the chaotic map function. Positive values indicate chaotic behavior, ensuring diverse search directions for police drone paths.

From a practical perspective, the CRL-ALNS algorithm offers significant benefits for police drone operations. By minimizing flight distance, it extends battery life, allowing longer patrols or reduced charging needs. Reducing the number of police drones lowers costs and simplifies logistics. The time window adherence ensures timely surveillance of critical areas, enhancing public safety. Moreover, the algorithm’s stability across different spatial distributions makes it reliable for various urban layouts, from compact city centers to sprawling suburbs.

Future work could integrate real-time data from police drones, such as traffic conditions or weather updates, to further adapt paths. Extending the model to multiple nests or heterogeneous police drone fleets with varying capabilities would increase applicability. Additionally, incorporating multi-objective optimization techniques like Pareto fronts could balance distance, drone count, and other factors like energy consumption. The reinforcement learning framework could be enhanced with deep reinforcement learning for more complex state representations, potentially using convolutional neural networks to process spatial data of patrol points.

In conclusion, the CRL-ALNS algorithm represents a significant advancement in police drone cooperative patrol path planning. By fusing chaotic mechanisms with reinforcement learning in a multimodal ALNS framework, it achieves substantial improvements in distance reduction and resource efficiency. The mathematical model and experimental validation confirm its robustness and adaptability, making it a valuable tool for modern law enforcement agencies. As police drone technology evolves, such intelligent algorithms will be crucial for optimizing surveillance missions and maintaining public security in complex environments.