Reinforcement Learning-Based Path Planning for Unmanned Drone Maritime Search

The safety of maritime activities is a cornerstone of developing a strong maritime nation, with emergency rescue operations being a critical component of this safety framework. Traditional maritime search and rescue (SAR) primarily relies on manned helicopters and surface vessels. While effective, these platforms have inherent limitations. Rescue vessels typically cruise at speeds around 12 knots (22 km/h), making timely response to mid- and far-sea incidents challenging. Manned helicopters, though faster, are constrained by visual search limitations, restricted coverage areas, high operational costs, and the complexity of multi-aircraft coordination. In contrast, the rapid advancement of unmanned drone technology presents a transformative opportunity. Unmanned drone systems offer significant advantages, including rapid deployment, extensive coverage capability, lower operational costs, and the ability to carry various sensors for more accurate target identification. Their potential for autonomous or semi-autonomous operation and easier multi-agent coordination makes them a powerful supplement to existing maritime SAR forces, filling critical gaps in the current response system.

A primary challenge in maritime SAR is target location uncertainty due to dynamic ocean currents and winds. Even with a precise initial datum, a distress target (e.g., a person in the water or a life raft) will drift over time. While drift prediction models exist, their accuracy diminishes as time progresses. Therefore, maximizing search efficiency within the crucial early hours is paramount. This paper addresses the path planning problem for an unmanned drone tasked with searching for a drifting target in a dynamic maritime environment characterized by probabilistic location information. We propose a method based on Reinforcement Learning (RL), specifically the Proximal Policy Optimization (PPO) algorithm, to generate intelligent search paths that prioritize high-probability areas, thereby enhancing overall mission success rates compared to conventional search patterns.

1. Modeling the Unmanned Drone Search Task

We consider a fixed-wing unmanned drone for its superior endurance and range, which are essential for covering vast maritime areas. To focus the RL model on the strategic search policy, we simplify the drone’s dynamics. We assume the unmanned drone operates at a constant cruise speed and altitude, with a fixed sensor sweep width. The search area is discretized into a two-dimensional grid of $M \times N$ cells. Each cell $(m, n)$ has an associated prior Probability of Containment (POC), $P_{mn}$, derived from drift prediction models (e.g., from national maritime environmental service platforms). This POC matrix represents the initial belief state of the target’s location.

1.1 State Space Representation

The state $s_t$ at time $t$ must encapsulate all necessary information for the agent to make a decision. It consists of three core components, flattened into a vector for neural network input:

  1. Grid Probability Map: The normalized prior POC values $P_{mn}$ for all cells.
  2. Search History Map: A record for each cell indicating its search status: $visited_{mn}(t) = 0$ (unsearched), $-1$ (already searched), or $1$ (currently being searched). This prevents redundant coverage.
  3. Agent State: The drone’s own positional and contextual information: its current grid coordinates $(x, y)$, the action taken in the previous step $a_{t-1}$, the POC $P_{current}$ of its current cell, and the search status $visited_{current}(t)$ of its current cell.

Thus, the state vector can be represented as:
$$ s_t = \text{Flatten}\left( \mathbf{P}, \mathbf{V}(t) \right) \oplus (x, y, a_{t-1}, P_{current}, visited_{current}(t)) $$
where $\oplus$ denotes vector concatenation, $\mathbf{P}$ is the POC matrix, and $\mathbf{V}(t)$ is the search status matrix.

1.2 Action Space

The unmanned drone agent navigates the grid. At each decision step, it selects one of four possible movements: North, East, South, or West. Diagonal moves are excluded to ensure complete and continuous sensor coverage of adjacent cells, a principle aligned with standard search theory. The action at time $t$ is $a_t \in \{0, 1, 2, 3\}$, corresponding to the four directional moves.

1.3 Reward Function Design

The reward function is critical for guiding the RL agent toward optimal search behavior. It must incentivize searching high-probability cells, discourage re-searching, and promote efficient area coverage. Our design incorporates immediate per-step rewards and an episodic bonus.

Immediate Reward: Upon entering a cell $(m, n)$ at time $t$, the agent receives:
$$ r_{mn}(t) =
\begin{cases}
P_{mn} \cdot w \cdot g^{t}, & \text{if } visited_{mn}(t) = 0 \quad \text{(First-time search)} \\
r_{\text{punish}}, & \text{if } visited_{mn}(t) = -1 \quad \text{(Re-search penalty)}
\end{cases}
$$
where:

  • $P_{mn}$ is the cell’s prior probability.
  • $w$ is a reward weight to scale the probability gain.
  • $g \in [0,1]$ is a discount factor that reduces the value of finding a target later in the episode, reflecting the decreasing likelihood of success over time.
  • $r_{\text{punish}}$ is a negative constant penalizing inefficient re-searching.

This immediate reward is normalized to stabilize training:
$$ \tilde{r}_t = \frac{r_t – r_{\text{min}}}{r_{\text{max}} – r_{\text{min}}} $$
(For simplicity, we denote the normalized reward as $r_t$ hereafter).

Episodic Efficiency Bonus: To encourage the unmanned drone to cover a wide area effectively, a bonus $r_e$ is awarded at the end of an episode (of length $L$ steps) if the coverage ratio exceeds a threshold:
$$ r_e =
\begin{cases}
B \cdot \frac{E}{L}, & \text{if } \frac{E}{L} \geq 0.6 \\
0, & \text{otherwise}
\end{cases}
$$
where $E$ is the total number of unique cells searched during the episode, and $B$ is a baseline bonus constant.

Cumulative Return: The total discounted return $G_t$ for a step is calculated using the Generalized Advantage Estimate (GAE), which balances bias and variance:
$$ \delta_t = r_t + \gamma V(s_{t+1}) (1 – d_t) – V(s_t) $$
$$ A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty} (\gamma \lambda)^k \delta_{t+k} $$
$$ G_t = A_t + V(s_t) $$
Here, $\gamma$ is the standard RL discount factor, $\lambda$ is the GAE parameter, $V(s)$ is the state-value function, and $d_t$ is a termination flag.

2. PPO-Based Path Planning Algorithm for Unmanned Drones

We employ the Proximal Policy Optimization (PPO) algorithm due to its stability, sample efficiency, and ease of implementation. PPO is an actor-critic method that optimizes a policy while preventing excessively large updates that could destabilize training.

2.1 Algorithm Architecture and Workflow

The training process follows these key stages, as conceptually outlined in the workflow below, and involves continuous interaction between the unmanned drone agent and the simulated maritime environment.

  1. Initialization: Configure the Actor ($\pi_\theta$) and Critic ($V_\phi$) neural networks with fully connected layers and ReLU activations. Set hyperparameters (see Table 1).
  2. Data Sampling (Interaction): For $N$ episodes, the agent interacts with the environment.
    • The Actor network takes the state $s_t$ and outputs a action probability distribution: $\pi_\theta(a_t | s_t) = \text{Softmax}(f_\theta(s_t))$.
    • An action $a_t$ is sampled from this distribution.
    • The agent executes $a_t$, receives reward $r_t$, and transitions to state $s_{t+1}$.
    • The trajectory $(s_t, a_t, r_t, \log \pi_\theta(a_t|s_t), d_t)$ is stored.
  3. Advantage Estimation: After an episode, compute advantages $A_t$ and returns $G_t$ for all timesteps using GAE (Eq. 5 & 6).
  4. Policy Optimization: This is the core update phase. Data from multiple episodes is batched, and the following loss function is minimized over several epochs:
    $$ L_t^{\text{CLIP+VF+S}}(\theta, \phi) = \hat{\mathbb{E}}_t \left[ L_t^{\text{CLIP}}(\theta) + c_1 L_t^{\text{VF}}(\phi) – c_2 S[\pi_\theta](s_t) \right] $$
    The three components are:

    • Clipped Surrogate Objective: Prevents destructive large policy updates.
      $$ L_t^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) A_t \right) \right] $$
    • Value Function Loss: Trains the Critic to better estimate state values.
      $$ L_t^{\text{VF}}(\phi) = \hat{\mathbb{E}}_t \left[ (V_\phi(s_t) – G_t)^2 \right] $$
    • Entropy Bonus: Encourages exploration by penalizing low entropy (certainty) in the policy.
      $$ S[\pi_\theta](s_t) = – \sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) $$

    The parameters $\theta$ and $\phi$ are then updated using the Adam optimizer.

Table 1: Key Hyperparameters for PPO Training
Hyperparameter Symbol Value Description
Discount Factor $\gamma$ 0.995 Discounts future rewards.
GAE Parameter $\lambda$ 0.95 Controls bias-variance trade-off in advantage estimation.
Clip Range $\epsilon$ 0.2 Limits policy update step size.
Learning Rate $\alpha$ $6 \times 10^{-5}$ Step size for optimizer.
Value Coefficient $c_1$ 0.5 Weight of value function loss.
Entropy Coefficient $c_2$ 0.01 Weight of entropy bonus.
Update Epochs 4 Number of passes over sampled data.
Episode Length $L$ 120 steps Maximum steps per training episode.

3. Simulation Experiments and Analysis

3.1 Case Setup

A simulated maritime incident is constructed: a fishing vessel capsizes at a known location. Drift prediction data is processed to generate a probabilistic target distribution over a search area, which is discretized into a grid. The normalized POC heatmap forms the basis of the environment. The unmanned drone starts from a designated point and must plan a 120-step search path.

3.2 Parameter Sensitivity and Optimization

The selection of hyperparameters significantly impacts performance. Ablation studies were conducted:

  • Discount Factor ($\gamma$): A value of 0.995 proved optimal, balancing immediate reward acquisition with long-term strategy, whereas lower values caused training instability.
  • Learning Rate ($\alpha$): A rate of $6 \times 10^{-5}$ provided a good trade-off between convergence speed and final policy quality. Higher rates led to premature convergence at suboptimal policies.
  • Clip Range ($\epsilon$): A value of 0.2 provided stable and effective policy updates, preventing the collapse observed with overly large ranges.

3.3 Results and Comparative Analysis

The trained PPO policy converges effectively, as shown by the increasing episodic reward and decreasing loss over 20,000 training episodes. To evaluate performance, we define search efficiency $E_f$ for a path of length $L$ as:
$$ E_f = \frac{\sum_{i=1}^{L} p_i}{L} $$
where $p_i$ is the prior POC of the cell visited at step $i$. A higher $E_f$ indicates that the path more consistently traverses high-probability areas. The proposed PPO-based method is compared against two benchmarks:

  1. Genetic Algorithm (GA): A heuristic optimization method.
  2. Parallel Line Search: A standard pattern prescribed in SAR manuals (e.g., IAMSAR).

The comparative search efficiency at different stages of the search (step counts) is summarized below:

Table 2: Search Efficiency ($E_f$) Comparison Across Methods
Search Step PPO (Proposed) Genetic Algorithm Parallel Line Search
20 0.4679 0.3546 0.0492
40 0.4929 0.4493 0.0878
60 0.5075 0.4086 0.1568
80 0.4988 0.4015 0.1756
100 0.4838 0.3919 0.2112
120 0.4512 0.3941 0.2383

The results demonstrate a clear advantage for the RL-based approach. The PPO-generated path achieves significantly higher efficiency in the early and middle stages of the search. This is its critical strength: it directs the unmanned drone to cover the highest probability regions first, maximizing the chance of early success when the drift prediction is most accurate. The parallel line search, while systematic, shows low early efficiency as it does not prioritize probability information. The GA-based path, though adaptive, results in a less coherent and sometimes repetitive trajectory, leading to lower and more erratic efficiency. The cumulative probability discovered over time further confirms the superiority of the PPO policy in covering high-likelihood areas more quickly.

3.4 Robustness Analysis

To validate the generalizability of the proposed framework, the PPO agent was trained and evaluated on several different search scenarios with varying grid sizes and probability distributions (e.g., single concentrated hotspot, multiple dispersed hotspots). The normalized search efficiency $E_{f,k}$ for scenario $k$ is calculated as:
$$ E_{f,k} = \eta \cdot \frac{ \sum_{i=1}^{L} p_i }{ \sum_{i=1}^{M \cdot N} p_i \cdot L } $$
where $\eta$ is a scaling constant. Across multiple distinct scenarios, the algorithm consistently generated efficient paths that prioritized high-probability zones. Statistical analysis of the results yielded a Coefficient of Variation (CV) below 15% and a tight Interquartile Range (IQR), indicating stable and robust performance regardless of the specific probability map configuration. This demonstrates that the learned policy is not overfitted to a single case but captures general principles of efficient probabilistic search for an unmanned drone.

4. Conclusion

This paper presents a reinforcement learning-based framework for path planning in maritime search operations using an unmanned drone. By modeling the search area as a probabilistic grid and formulating the problem within a PPO-based RL paradigm, we develop an intelligent agent that learns to optimize search paths dynamically. The core contribution is a reward function and training methodology that effectively balances the exploitation of high-probability areas with the exploration of the search space, while discouraging inefficient behavior. Simulation results based on realistic drift prediction data demonstrate that the proposed method significantly outperforms traditional parallel line search and a standard genetic algorithm in terms of early and cumulative search efficiency. The learned policy enables the unmanned drone to prioritize key regions when time is most critical, directly addressing the fundamental challenge of uncertainty in maritime SAR. The method shows robust performance across varied scenarios. Future work will focus on extending the framework to multi-unmanned drone cooperative search scenarios and integrating real-time environmental updates into the state space for fully dynamic replanning.

Scroll to Top