In recent years, unmanned surface vessels (USVs) have emerged as pivotal platforms in marine operations due to their autonomous navigation capabilities and cost-effectiveness. However, most USVs rely on light detection and ranging (LiDAR) as the primary sensing system, which exhibits significant limitations in complex environments such as U-shaped obstacles and occluded scenarios. These constraints often lead to failed obstacle-avoidance decisions or entrapment in local optima, hindering autonomous performance in dynamic marine settings. To address these challenges, we propose a cooperative navigation method integrating unmanned aerial vehicles (UAVs) and USVs, leveraging the high-altitude visual perception of a UAV (a JUYE UAV in our experiments) to complement the surface-level execution of the USV. Our approach employs an improved proximal policy optimization (PPO) algorithm, enhanced with long short-term memory (LSTM) networks for temporal dependency modeling, Ornstein-Uhlenbeck (OU) noise for efficient exploration, and generalized advantage estimation (GAE) for stable value estimation. Additionally, we design auxiliary reward functions that optimize path length while constraining rudder angle fluctuations, resulting in smoother trajectories. This study establishes a theoretical foundation for UAV/USV navigation in complex water environments, with extensive experiments validating the effectiveness and superiority of our method.
Integrating UAVs with USVs creates a synergistic system that overcomes the inherent limitations of LiDAR-only sensing. Traditional path planning algorithms, such as A* and the dynamic window approach (DWA), struggle with dynamic obstacles and environmental disturbances, while deep reinforcement learning (DRL) methods such as deep Q-network (DQN) and deep deterministic policy gradient (DDPG) suffer from discrete action spaces or instability in complex settings. Our improved PPO algorithm addresses these shortcomings by incorporating LSTM networks to capture temporal features of USV motion states, OU noise to enhance exploration efficiency, and GAE to improve value estimation stability. The reward function design balances task completion with navigation quality, ensuring efficient and stable paths. Through simulations in various scenarios, including U-shaped obstacles and dynamic occlusions, we demonstrate that our method significantly outperforms existing approaches in path length and rudder angle smoothness.

The cooperative navigation architecture follows a perception-decision-execution hierarchy: the UAV provides high-resolution visual data from an aerial perspective, and the USV executes navigation decisions based on the processed information. Data exchange occurs via the robot operating system (ROS) platform, with adaptive caching mechanisms ensuring robustness against communication delays. The decision layer, deployed on the USV, utilizes a hybrid neural network combining convolutional neural networks (CNNs) for spatial feature extraction and LSTM networks for temporal modeling. This network outputs continuous rudder angle commands, enabling precise control. Mathematical models for the UAV and USV are derived to describe their dynamics and interactions: the UAV equations of motion account for thrust, drag, and rotational inertia, while the USV model incorporates hydrodynamic effects and control inputs. The state space integrates environmental perception and heading extraction layers, represented as image sequences, and the action space is a single continuous rudder angle variable. The improved PPO network comprises feature extraction, temporal modeling, and decision output modules, optimized through reward functions that guide navigation and penalize inefficient behaviors.
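To make the decision layer concrete, the following is a minimal PyTorch sketch of a CNN+LSTM actor of the kind described above. Only the two-layer 64/32 LSTM configuration and the ±20° rudder range come from this paper; the convolutional channel counts, pooling size, and tanh-scaled output head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CnnLstmActor(nn.Module):
    """Illustrative CNN+LSTM actor: aerial image sequence in, rudder angle out."""
    def __init__(self, hidden1=64, hidden2=32, max_rudder_deg=20.0):
        super().__init__()
        # Spatial feature extraction (channel counts are assumptions).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),  # -> 32 * 4 * 4 = 512
        )
        # Two stacked LSTMs with 64/32 units, matching the best setting in Table 2.
        self.lstm1 = nn.LSTM(512, hidden1, batch_first=True)
        self.lstm2 = nn.LSTM(hidden1, hidden2, batch_first=True)
        self.head = nn.Linear(hidden2, 1)
        self.max_rudder = max_rudder_deg

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.lstm1(feats)
        h, _ = self.lstm2(h)
        # Continuous rudder command bounded to [-20, 20] degrees.
        return self.max_rudder * torch.tanh(self.head(h[:, -1]))
```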
To formalize the UAV and USV dynamics, we define the coordinate systems and equations of motion. The world frame $O_e$-$X_eY_eZ_e$, the UAV body frame $O_a$-$X_aY_aZ_a$, and the USV hull frame $O_s$-$X_sY_sZ_s$ are established. The UAV dynamics are given by:
$$ \begin{cases}
m_a \ddot{x}_a = T (\cos \psi_a \sin \theta_a \cos \phi_a + \sin \psi_a \sin \phi_a) - k_{dx} \dot{x}_a \\
m_a \ddot{y}_a = T (\sin \psi_a \sin \theta_a \cos \phi_a - \cos \psi_a \sin \phi_a) - k_{dy} \dot{y}_a \\
m_a \ddot{z}_a = T \cos \theta_a \cos \phi_a - m_a g - k_{dz} \dot{z}_a \\
I_{ax} \ddot{\phi}_a = \tau_\phi - (I_{az} - I_{ay}) q_a r_a - k_{d\phi} \dot{\phi}_a \\
I_{ay} \ddot{\theta}_a = \tau_\theta - (I_{ax} - I_{az}) p_a r_a - k_{d\theta} \dot{\theta}_a \\
I_{az} \ddot{\psi}_a = \tau_\psi - (I_{ay} - I_{ax}) p_a q_a - k_{d\psi} \dot{\psi}_a
\end{cases} $$
where $m_a$ is the UAV mass, $[x_a, y_a, z_a]$ is its position, $\phi_a, \theta_a, \psi_a$ are the roll, pitch, and yaw angles, $T = \sum_{i=1}^4 k_f \omega_i^2$ is the total rotor thrust, $k_f$ is the lift coefficient, $\omega_i$ is the speed of rotor $i$, $k_{dx}, k_{dy}, k_{dz}$ and $k_{d\phi}, k_{d\theta}, k_{d\psi}$ are the translational and rotational drag coefficients, $[I_{ax}, I_{ay}, I_{az}]$ are the moments of inertia, and $[p_a, q_a, r_a]$ are the body-frame angular velocities. For the USV, the model is:
$$ \begin{cases}
\dot{x}_s = u \cos \psi_s - v \sin \psi_s \\
\dot{y}_s = u \sin \psi_s + v \cos \psi_s \\
\dot{\psi}_s = r \\
(m_s - X_u) \dot{u} = \tau_u - (Y_v - m_s) v r - X_u u \\
(I_{sz} - N_r) \dot{r} = \tau_r - N_r r
\end{cases} $$
where $[x_s, y_s]$ is the position, $\psi_s$ is the yaw angle, $u, v, r$ are the surge, sway, and yaw velocities, $m_s$ is the mass, $X_u, Y_v$ are hydrodynamic added-mass terms, $I_{sz}$ is the moment of inertia about the vertical axis, $N_r$ is the yaw damping coefficient, and $\tau_u = X_\delta \delta$, $\tau_r = N_\delta \delta$ are the control inputs generated by the rudder angle $\delta$. Cooperative tracking control uses nonlinear model predictive control (NMPC) to keep the USV centered in the UAV's camera image, ensuring accurate perception.
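As a worked illustration of the USV model, the snippet below advances the five equations above by one explicit-Euler step. The coefficient values in `p` and the time step are placeholders, and sway is left undriven exactly as in the reduced model above; this is a sketch, not the authors' simulator.

```python
import numpy as np

def usv_step(state, delta, p, dt=0.1):
    """One explicit-Euler step of the 3-DOF USV model.
    state = [x, y, psi, u, v, r]; delta = rudder angle (rad);
    p maps coefficient names to values (all placeholders)."""
    x, y, psi, u, v, r = state
    tau_u = p["X_delta"] * delta   # surge control input tau_u = X_delta * delta
    tau_r = p["N_delta"] * delta   # yaw control input   tau_r = N_delta * delta
    du = (tau_u - (p["Y_v"] - p["m_s"]) * v * r - p["X_u"] * u) / (p["m_s"] - p["X_u"])
    dr = (tau_r - p["N_r"] * r) / (p["I_sz"] - p["N_r"])
    return np.array([
        x + dt * (u * np.cos(psi) - v * np.sin(psi)),
        y + dt * (u * np.sin(psi) + v * np.cos(psi)),
        psi + dt * r,
        u + dt * du,
        v,                          # sway has no dynamics equation in this model
        r + dt * dr,
    ])
```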
The improved PPO algorithm incorporates LSTM networks to model temporal dependencies in the USV’s motion state. The LSTM equations are:
$$ \begin{cases}
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t = o_t \cdot \tanh(C_t)
\end{cases} $$
where $i_t, f_t, o_t$ are the input, forget, and output gates at time $t$, $\sigma$ is the sigmoid function, $x_t$ is the input, $h_{t-1}$ is the previous hidden state, $W$ and $b$ are weights and biases, $C_t$ is the cell state, and $h_t$ is the hidden state passed to the next step. OU noise is added to the policy network output for exploration:
$$ a_t = \eta_t + \xi_t $$
where $\eta_t$ is the deterministic policy output and the noise evolves as $\xi_t = \xi_{t-1} + \theta (\mu - \xi_{t-1}) \Delta t + \varsigma \sqrt{\Delta t} \cdot \mathcal{N}(0,1)$, with $\theta$ the mean-reversion speed, $\mu$ the long-term mean, $\varsigma$ the volatility, and $\Delta t$ the time step. GAE optimizes the advantage estimate:
$$ \hat{A}_t^{\text{GAE}} = \sum_{k=0}^{T-t} (\gamma \lambda)^k \delta_{t+k} $$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error, $\gamma$ is the discount factor, and $\lambda$ is the bias-variance trade-off parameter. The PPO objective function becomes:
$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta) \hat{A}_t^{\text{GAE}}, \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t^{\text{GAE}} \right) \right] $$
where $\rho_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\text{old}}(a_t | s_t)$ is the probability ratio, and $\epsilon$ is the clipping range.
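The three ingredients above can be sketched compactly in NumPy: the discretized OU update, the GAE recursion, and the clipped surrogate objective. The default OU parameters below are common placeholder values rather than settings reported here, while $\gamma = 0.97$ and $\lambda = 0.97$ match Table 1.

```python
import numpy as np

def ou_noise(xi_prev, theta=0.15, mu=0.0, sigma=0.2, dt=0.1):
    """Discretized Ornstein-Uhlenbeck update (parameter values are assumptions)."""
    return xi_prev + theta * (mu - xi_prev) * dt + sigma * np.sqrt(dt) * np.random.randn()

def gae(rewards, values, gamma=0.97, lam=0.97):
    """GAE over one episode; values has length len(rewards) + 1 (bootstrap appended)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized; negate for gradient descent)."""
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))
```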
The state space $S$ combines environmental perception $S_{\text{env}} = \{I_{t-2}, I_{t-1}, I_t\}$ and heading extraction $S_{\text{usv}} = \{U_{t-2}, U_{t-1}, U_t\}$, where $I_t$ and $U_t$ are the most recent aerial image and heading-extraction frames available at decision time $t$. The action space is the continuous rudder angle $\delta_t \in [-20^\circ, 20^\circ]$. The reward function includes main and auxiliary components. The main reward $R_m$ guides navigation:
$$ R_{\text{guide}} = \begin{cases}
\lambda_g \cdot (d_{\text{prev}} - d_{\text{cur}}), & \text{if moving toward goal} \\
0, & \text{otherwise}
\end{cases} $$
and
$$ R_{\text{arrival}} = \begin{cases}
+2000, & \text{goal reached} \\
-800, & \text{collision} \\
-20, & \text{below safe distance} \\
0, & \text{otherwise}
\end{cases} $$
so $R_m = R_{\text{guide}} + R_{\text{arrival}}$. The auxiliary reward $R_a$ optimizes smoothness:
$$ R_{\text{rate}} = -\lambda_a \delta^2 – \lambda_f (\Delta \delta)^2 $$
$$ R_{\text{cur}} = -\lambda_c \kappa^2 $$
$$ R_{\text{straight}} = \lambda_s T_{\text{straight}} $$
thus $R_a = R_{\text{rate}} + R_{\text{cur}} + R_{\text{straight}}$, where $\delta$ is the commanded rudder angle, $\Delta \delta$ its per-step change, $\kappa$ the local path curvature, and $T_{\text{straight}}$ the duration of straight-line travel; the total reward is $R_{\text{total}} = R_m + R_a$. The coefficients are set via sensitivity analysis to balance task completion and navigation quality.
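A minimal sketch of the combined reward follows, using the coefficient values from Table 1. The zero default for $R_{\text{arrival}}$ outside its three listed cases and the exact form of the per-step inputs are assumptions of this illustration.

```python
def total_reward(d_prev, d_cur, delta, d_delta, kappa, t_straight,
                 goal_reached, collided, below_safe,
                 lam_g=8, lam_a=5, lam_f=3, lam_c=2, lam_s=2):
    """R_total = R_m + R_a with the coefficients from Table 1 (sketch only)."""
    # Main reward: progress toward the goal plus terminal/safety terms.
    r_guide = lam_g * (d_prev - d_cur) if d_cur < d_prev else 0.0
    if goal_reached:
        r_arrival = 2000.0
    elif collided:
        r_arrival = -800.0
    elif below_safe:
        r_arrival = -20.0
    else:
        r_arrival = 0.0   # assumed default for ordinary steps
    # Auxiliary reward: penalize large/changing rudder angles and path curvature,
    # reward sustained straight-line travel.
    r_rate = -lam_a * delta**2 - lam_f * d_delta**2
    r_cur = -lam_c * kappa**2
    r_straight = lam_s * t_straight
    return (r_guide + r_arrival) + (r_rate + r_cur + r_straight)
```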
Experiments were conducted in a Gazebo simulation environment with ROS, on hardware comprising an Intel Core i7-13700K CPU and an NVIDIA RTX 4070 Ti GPU. Key parameters are summarized in Table 1.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Discount factor $\gamma$ | 0.97 | Significant wave height (m) | 0.35 |
| Policy network learning rate | 0.0002 | Wave period (s) | 3.2 |
| Value network learning rate | 0.0002 | Average wind speed (m/s) | 3.8 |
| Rudder angle threshold $\beta$ | 2 | Batch size | 256 |
| Safe distance $d_{\text{safe}}$ (m) | 50 | Straight reward coefficient $\lambda_s$ | 2 |
| LSTM units | 64/32 | GAE parameter $\lambda$ | 0.97 |
| Rudder regularization coefficient $\lambda_a$ | 5 | Rudder change coefficient $\lambda_f$ | 3 |
| Target guide coefficient $\lambda_g$ | 8 | Curvature constraint coefficient $\lambda_c$ | 2 |
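For reference, these settings can be collected into a single training configuration; the variable names below are our own, with values taken directly from Table 1.

```python
# Training and environment configuration mirroring Table 1
# (variable names are assumptions; values are from the table).
CONFIG = {
    "gamma": 0.97,                  # discount factor
    "gae_lambda": 0.97,             # GAE parameter
    "policy_lr": 2e-4,              # policy network learning rate
    "value_lr": 2e-4,               # value network learning rate
    "batch_size": 256,
    "lstm_units": (64, 32),         # two-layer LSTM
    "rudder_threshold_deg": 2,      # rudder angle threshold beta
    "safe_distance_m": 50,
    "lambda_g": 8, "lambda_a": 5, "lambda_f": 3, "lambda_c": 2, "lambda_s": 2,
    "wave_height_m": 0.35, "wave_period_s": 3.2, "wind_speed_mps": 3.8,
}
```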
Sensitivity analyses were performed on the LSTM structure, the GAE parameter $\lambda$, and the input image resolution. For the LSTM, a two-layer configuration (64/32 units) converged fastest, reaching a value of 6,654 in 1,690 steps, as shown in Table 2.
| LSTM Configuration (LSTM_1/LSTM_2) | Convergence Steps | Convergence Value |
|---|---|---|
| Single-layer LSTM (64) | 1,940 | 6,019 |
| Two-layer LSTM (32/16) | 1,770 | 6,437 |
| Two-layer LSTM (64/32) | 1,690 | 6,654 |
| Two-layer LSTM (128/64) | 1,730 | 6,348 |
For GAE, $\lambda = 0.97$ yielded optimal convergence in 1,760 steps with a value of 6,438 (Table 3).
| GAE Parameter $\lambda$ | Convergence Steps | Convergence Value |
|---|---|---|
| 0.90 | 2,100 | 6,129 |
| 0.95 | 1,880 | 6,246 |
| 0.97 | 1,760 | 6,438 |
| 0.99 | 1,810 | 6,407 |
Image resolution analysis for the environmental perception layer indicated that 128×128 pixels best balanced decision time (18.3 ms) against success rate (98.8%), as shown in Table 4.
| Pixel Configuration | Average Decision Time (ms) | Success Rate (%) |
|---|---|---|
| 64×64 | 12.5 | 82.3 |
| 128×128 | 18.3 | 98.8 |
| 256×256 | 34.6 | 96.2 |
| 512×512 | 72.9 | 84.6 |
Convergence comparisons between PPO and the improved PPO showed that the improved algorithm converged 25.82% faster and achieved a 10.69% higher average reward (6,930). Cooperative tracking experiments demonstrated that the JUYE UAV maintained an average tracking error of 0.31 m on curved path segments and 0.23 m on straight segments, ensuring reliable perception.
Navigation performance was evaluated in four scenarios: U-shaped obstacles, dynamic occlusion, narrow channels, and dense obstacles. In the U-shaped scenario, the improved PPO with UAV perception successfully navigated around the obstacle, whereas LiDAR-based methods failed. Path length and the standard deviation of the rudder angle change rate were used as metrics. For U-shaped obstacles, the improved PPO achieved a path length of 1,738.29 m and a rudder change standard deviation of 1.46°/s, outperforming PPO (1,895.61 m, 1.62°/s) and DDPG (1,970.87 m, 1.68°/s). Under dynamic occlusion, the improved PPO's path length was 1,326.35 m with a rudder change standard deviation of 1.34°/s, compared to PPO (1,447.84 m, 1.48°/s) and DDPG (1,477.29 m, 1.53°/s). In the narrow-channel and dense-obstacle scenarios, the improved PPO consistently produced shorter paths and smoother control, as summarized in Table 5.
| Scenario | Algorithm | Path Length (m) | Rudder Change Std (°/s) |
|---|---|---|---|
| U-shaped Obstacles | DDPG | 1,970.87 | 1.68 |
| | PPO | 1,895.61 | 1.62 |
| | Improved PPO | 1,738.29 | 1.46 |
| Dynamic Occlusion | DDPG | 1,477.29 | 1.53 |
| | PPO | 1,447.84 | 1.48 |
| | Improved PPO | 1,326.35 | 1.34 |
| Narrow Channels | A*+DWA | 1,709.14 | 1.21 |
| | DDPG | 1,660.85 | 1.17 |
| | Improved PPO | 1,463.43 | 1.03 |
| Dense Obstacles | A*+DWA | 1,663.42 | 1.33 |
| | DDPG | 1,613.34 | 1.27 |
| | Improved PPO | 1,418.57 | 1.12 |
The results confirm that UAV-based perception overcomes LiDAR limitations in complex environments, and that the improved PPO algorithm, with LSTM temporal modeling, OU exploration noise, GAE, and the optimized reward design, enhances navigation efficiency and stability. Future work will focus on multi-vessel cooperation and multi-modal perception fusion to further improve robustness in dynamic marine settings.
In conclusion, our cooperative UAV/USV navigation method, based on the improved PPO algorithm, effectively addresses the perception limitations of LiDAR and enhances autonomous navigation in complex waters. The integration of high-altitude UAV vision with temporal modeling and efficient exploration strategies yields superior path planning and control. This approach paves the way for advanced marine robotics applications, ensuring safe and efficient operations in challenging environments.
