Cooperative Navigation for UAV and USV Systems via Enhanced Proximal Policy Optimization

Autonomous navigation for Unmanned Surface Vessels (USVs) represents a pivotal technology for modern maritime operations, offering significant advantages in cost-effectiveness and mission flexibility. However, the practical deployment of USVs in complex, cluttered maritime environments faces substantial challenges. A primary bottleneck lies in the sensory perception system. Most contemporary USVs rely predominantly on Light Detection and Ranging (LiDAR) as their core sensor. While effective in many scenarios, LiDAR has inherent limitations: its perception is line-of-sight and limited in range, making it susceptible to failures in specific situations such as “U-shaped” obstacles or environments with severe occlusions. In these cases, the USV’s navigation algorithm, deprived of critical environmental information, may fail to find a viable path or become trapped in a local optimum.

To overcome this sensory gap, a promising solution is the integration of Unmanned Aerial Vehicles (UAVs). A UAV equipped with high-resolution visual sensors offers a distinct aerial perspective with a broad field of view. This bird’s-eye view is less susceptible to the line-of-sight limitations that plague LiDAR, enabling the detection of obstacles beyond immediate proximity and the mapping of complex topographical features. This synergy forms the basis for a cooperative UAV/USV system. Concurrently, the navigation algorithms themselves require advancement. Traditional path-planning methods such as A* or the Dynamic Window Approach (DWA), while robust in simpler settings, often struggle with dynamic environments, kinematic constraints, and path smoothness. Deep Reinforcement Learning (DRL) has emerged as a powerful alternative, enabling an agent to learn navigation policies through interaction with a simulated environment.

Among DRL algorithms, Proximal Policy Optimization (PPO) is widely used for its training stability and its comparatively good sample efficiency among on-policy methods. However, standard PPO has limitations when applied to USV navigation. It lacks an explicit mechanism for modeling temporal dependencies in the USV’s state (e.g., heading history and relative obstacle positions), which are crucial for smooth and anticipatory control. Furthermore, its exploration strategy can be inefficient in complex, continuous action spaces. This paper addresses these challenges by proposing an integrated cooperative navigation framework. The core contributions are twofold: first, we architect a cooperative system in which a UAV provides aerial visual perception to compensate for the USV’s local sensor shortcomings; second, we develop an enhanced PPO algorithm tailored to this collaborative context, incorporating Long Short-Term Memory (LSTM) networks for temporal modeling together with improved exploration and optimization techniques.

System Architecture and Methodological Framework

1. Cooperative System Architecture

The proposed system follows a hierarchical “Perception-Decision-Execution” architecture that integrates the UAV and the USV. The perception layer is handled by the UAV, which maintains a tracking position above the USV. It captures top-down visual imagery of the surrounding environment and transmits it in real time to the USV via a Robot Operating System (ROS) network. The decision layer resides on the USV. It fuses the incoming aerial image stream with the USV’s own state information (e.g., position, velocity), and the fused information is processed by our enhanced DRL agent to generate navigation commands. Finally, the execution layer on the USV translates these commands (specifically, a rudder angle) into physical actuator movements, completing the control loop.

2. Mathematical Models and Tracking Control

The dynamic models for the UAV and the USV are established to facilitate simulation and control. The UAV is modeled as a 6-DOF quadrotor, and the USV is modeled using a 3-DOF horizontal plane model with surge, sway, and yaw motions, incorporating relevant hydrodynamic derivatives. To ensure the UAV maintains its position above the USV for consistent perception, a Nonlinear Model Predictive Control (NMPC) strategy is employed. The USV’s real-time trajectory serves as the reference input for the NMPC controller onboard the UAV, which solves an optimization problem at each time step to generate thrust and torque commands, enabling precise and stable tracking even under environmental disturbances.

The UAV’s translational dynamics are governed by:
$$
\begin{aligned}
m_a \ddot{x}_a &= T(\cos\psi_a \sin\theta_a \cos\phi_a + \sin\psi_a \sin\phi_a) - k_{dx} \dot{x}_a \\
m_a \ddot{y}_a &= T(\sin\psi_a \sin\theta_a \cos\phi_a - \cos\psi_a \sin\phi_a) - k_{dy} \dot{y}_a \\
m_a \ddot{z}_a &= T \cos\theta_a \cos\phi_a - m_a g - k_{dz} \dot{z}_a
\end{aligned}
$$
where $m_a$ is the UAV mass, $[x_a, y_a, z_a]$ is its inertial position, $[\phi_a, \theta_a, \psi_a]$ are its roll, pitch, and yaw angles, $T$ is the total thrust, $g$ is the gravitational acceleration, and $k_{dx}, k_{dy}, k_{dz}$ are the drag coefficients along each axis.
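
The translational equations above can be evaluated directly. The following is a minimal sketch, not the authors' simulator: the function name and the default mass and drag values are hypothetical placeholders, and angles are assumed to be in radians.

```python
import math


def uav_translational_accel(T, phi, theta, psi, vel,
                            m_a=1.5, k_d=(0.1, 0.1, 0.1), g=9.81):
    """Return (ax, ay, az) in the inertial frame.

    T      : total thrust [N]
    phi, theta, psi : roll, pitch, yaw [rad]
    vel    : (vx, vy, vz) inertial velocity [m/s]
    m_a, k_d are illustrative values, not taken from the paper.
    """
    vx, vy, vz = vel
    k_dx, k_dy, k_dz = k_d
    ax = (T * (math.cos(psi) * math.sin(theta) * math.cos(phi)
               + math.sin(psi) * math.sin(phi)) - k_dx * vx) / m_a
    ay = (T * (math.sin(psi) * math.sin(theta) * math.cos(phi)
               - math.cos(psi) * math.sin(phi)) - k_dy * vy) / m_a
    az = (T * math.cos(theta) * math.cos(phi) - m_a * g - k_dz * vz) / m_a
    return ax, ay, az


# Hover check: level attitude, zero velocity, thrust equal to weight.
hover = uav_translational_accel(T=1.5 * 9.81, phi=0.0, theta=0.0, psi=0.0,
                                vel=(0.0, 0.0, 0.0))
```

With level attitude and thrust equal to weight, all three accelerations vanish, which is a quick sanity check on the signs.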

The USV’s horizontal plane motion is described by:
$$
\begin{aligned}
\dot{x}_s &= u \cos\psi_s - v \sin\psi_s \\
\dot{y}_s &= u \sin\psi_s + v \cos\psi_s \\
\dot{\psi}_s &= r \\
(m_s - X_{\dot{u}})\dot{u} &= \tau_u - (Y_{\dot{v}} - m_s)vr - X_u u \\
(I_{sz} - N_{\dot{r}})\dot{r} &= \tau_r - N_r r
\end{aligned}
$$
where $[x_s, y_s, \psi_s]$ are the USV’s position and heading, $[u, v, r]$ are its body-fixed surge, sway, and yaw velocities, $m_s$ is its mass, $I_{sz}$ is its yaw moment of inertia, $X_{\dot{u}}, Y_{\dot{v}}, N_{\dot{r}}$ are added-mass terms, $X_u$ and $N_r$ are linear damping coefficients, and $\tau_u, \tau_r$ are the surge force and yaw moment controls, typically functions of the propeller thrust and rudder angle $\delta$.
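
To illustrate how the USV model above might be stepped in simulation, here is a forward-Euler sketch. All parameter values in `HYPOTHETICAL_PARAMS` are made up for illustration. Because the equations above give no sway dynamics, the sketch simply holds the sway velocity $v$ constant, which is an assumption of this example rather than part of the model.

```python
import math

# Illustrative values only; none of these come from the paper.
HYPOTHETICAL_PARAMS = {
    "m": 100.0, "I_z": 50.0,                           # mass, yaw inertia
    "X_udot": -10.0, "Y_vdot": -20.0, "N_rdot": -5.0,  # added-mass terms
    "X_u": 20.0, "N_r": 15.0,                          # linear damping
}


def usv_step(state, tau_u, tau_r, p=HYPOTHETICAL_PARAMS, dt=0.1):
    """Advance (x, y, psi, u, v, r) by one Euler step of length dt."""
    x, y, psi, u, v, r = state
    # Kinematics: body-fixed velocities rotated into the inertial frame.
    x_dot = u * math.cos(psi) - v * math.sin(psi)
    y_dot = u * math.sin(psi) + v * math.cos(psi)
    psi_dot = r
    # Surge and yaw dynamics, solved for the accelerations.
    u_dot = (tau_u - (p["Y_vdot"] - p["m"]) * v * r
             - p["X_u"] * u) / (p["m"] - p["X_udot"])
    r_dot = (tau_r - p["N_r"] * r) / (p["I_z"] - p["N_rdot"])
    # Sway velocity v is held constant (assumption, see lead-in).
    return (x + dt * x_dot, y + dt * y_dot, psi + dt * psi_dot,
            u + dt * u_dot, v, r + dt * r_dot)


# Steady surge: thrust balances damping, so speed stays at 1 m/s.
next_state = usv_step((0.0, 0.0, 0.0, 1.0, 0.0, 0.0), tau_u=20.0, tau_r=0.0)
```

A higher-order integrator such as RK4 would normally be preferred for long rollouts; Euler is used here only to keep the mapping to the equations obvious.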

3. Enhanced PPO Algorithm for Navigation

Our DRL agent is built upon an augmented PPO algorithm. The agent’s goal is to learn a policy $\pi_\theta(a_t|s_t)$ that maps states $s_t$ to actions $a_t$ (rudder angle) to maximize cumulative reward.

3.1 State and Action Space: The state $s_t$ is a hierarchical construct derived from the UAV’s perception. It consists of two parts: an environmental context sequence $S_{env} = \{I_{t-2}, I_{t-1}, I_t\}$ of the three most recent top-down images, and a vessel-focused sequence $S_{usv} = \{U_{t-2}, U_{t-1}, U_t\}$ of center crops of those images containing the USV. Only frames up to the current step $t$ are used, since future frames are not available at decision time. The action space is continuous and one-dimensional: the rudder angle $\delta_t \in [-20^\circ, 20^\circ]$.
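
A small sketch of how such a state could be assembled: a rolling buffer keeps the three most recent frames together with their center crops, and the rudder command is clipped to the stated range. Images are represented as nested lists to keep the example dependency-free; the class name, the crop size, and the helper names are hypothetical.

```python
from collections import deque

RUDDER_LIMIT_DEG = 20.0  # action bound stated in the text


def center_crop(image, size):
    """Return the size x size center region of a 2-D image (list of rows)."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


class FrameStack:
    """Rolling buffer of the most recent frames and their USV-centered crops."""

    def __init__(self, n_frames=3, crop_size=2):
        self.env = deque(maxlen=n_frames)
        self.usv = deque(maxlen=n_frames)
        self.crop_size = crop_size

    def push(self, image):
        self.env.append(image)
        self.usv.append(center_crop(image, self.crop_size))

    def state(self):
        """Return (S_env, S_usv) as lists ordered oldest to newest."""
        return list(self.env), list(self.usv)


def clip_rudder(delta_deg):
    """Clamp a rudder command to [-20, 20] degrees."""
    return max(-RUDDER_LIMIT_DEG, min(RUDDER_LIMIT_DEG, delta_deg))


# Usage: push four 4x4 frames; only the last three are retained.
stack = FrameStack()
for k in range(4):
    stack.push([[k * 100 + r * 4 + c for c in range(4)] for r in range(4)])
s_env, s_usv = stack.state()
```

The center crop assumes the USV sits near the image center, which relies on the UAV tracking controller keeping the vessel centered in its field of view.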

3.2 Network Architecture: We employ a CNN-LSTM hybrid network. The CNN backbone extracts spatial features from both the full environmental image and the USV-centered crop. The feature vectors are concatenated and fed into a two-layer LSTM network. The LSTM is crucial for capturing the temporal dependencies in the USV’s motion and the evolving obstacle configuration, which is vital for smooth steering. The final latent state from the LSTM is passed to two output heads: the policy network (Actor) and the value network (Critic).

3.3 Algorithmic Enhancements:

LSTM for Temporal Modeling: The LSTM cells allow the agent to remember past states and trends, which is essential for planning maneuvers like exiting a U-shaped obstacle or avoiding dynamic ships. The LSTM update equations are:
$$
\begin{aligned}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$
where $i, f, o$ are the input, forget, and output gates, $C$ is the cell state, $h$ is the hidden state, and $\odot$ denotes element-wise multiplication.
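
As a minimal sketch of the update equations above, the following pure-Python function performs one LSTM step on plain lists. The `params` dictionary layout and the helper names are illustrative assumptions, not the authors' implementation; a practical agent would use the LSTM layer of a deep learning framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, z):
    """Compute W @ z + b for a list-of-rows matrix W."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi
            for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; returns (h_t, C_t)."""
    z = h_prev + x_t  # list concatenation implements [h_{t-1}, x_t]
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    c_tilde = [math.tanh(a)
               for a in affine(params["W_C"], params["b_C"], z)]
    # Cell state: forget part of the old memory, add gated new candidate.
    c = [ft * cp + it * ct
         for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Usage with one hidden unit and one input; zero weights make every gate 0.5.
zero = {k: [[0.0, 0.0]] for k in ("W_i", "W_f", "W_o", "W_C")}
zero.update({k: [0.0] for k in ("b_i", "b_f", "b_o", "b_C")})
h_t, c_t = lstm_step([1.0], [0.0], [2.0], zero)
```

With all-zero weights the gates equal 0.5 and the candidate is 0, so the cell state is simply halved, which makes the role of the forget gate easy to see.
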
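
OU Noise for Exploration: Instead of simple Gaussian noise, we use Ornstein-Uhlenbeck (OU) noise to perturb the actions during training. OU noise has a mean-reverting property, providing temporally correlated exploration that is more suitable for physical control systems like a USV.
$$
\xi_t = \xi_{t-1} + \theta (\mu - \xi_{t-1}) \Delta t + \sigma \sqrt{\Delta t}\, \mathcal{N}(0,1)
$$
Here $\theta$ is the mean-reversion rate, $\mu$ is the long-run mean of the noise, and $\sigma$ is the noise scale; these OU parameters are distinct from the policy parameters $\theta$ and the sigmoid $\sigma(\cdot)$ used elsewhere. The action is then $a_t = \mu(s_t) + \xi_t$, where $\mu(s_t)$ is the policy network output.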
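
A short sketch of an OU noise generator follows. It uses the standard Euler-Maruyama discretisation, in which each increment is added to the previous noise value so that successive samples are correlated. The class name and the default parameter values are assumptions for illustration.

```python
import math
import random


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process for action exploration."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=0.1,
                 x0=None, seed=None):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.xi = mu if x0 is None else x0
        self.rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift plus scaled Gaussian increment.
        drift = self.theta * (self.mu - self.xi) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.xi = self.xi + drift + diffusion
        return self.xi


# Usage: perturb a (hypothetical) policy output of 5 degrees of rudder.
noise = OUNoise(seed=0)
noisy_actions = [5.0 + noise.sample() for _ in range(3)]
```

Setting `sigma=0` makes the process deterministic and shows the pull toward `mu` in isolation, which is a convenient way to check the implementation.
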

Generalized Advantage Estimation (GAE): We employ GAE to compute the advantage estimate $\hat{A}^{GAE}_t$, which reduces the variance of the estimate at the cost of a small bias controlled by the parameter $\lambda$. This leads to more stable policy updates. The PPO clipped objective function becomes:
$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}^{GAE}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}^{GAE}_t \right) \right]
$$
where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.
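
The two pieces above can be sketched in a few lines. The GAE recursion accumulates discounted temporal-difference residuals backwards through a trajectory, and the clipped objective averages the pessimistic (minimum) surrogate term. The defaults $\gamma = 0.97$ and $\lambda = 0.97$ match the values reported in this paper; the clip range `eps=0.2` is a commonly used value assumed here, and episode-termination handling is omitted for brevity.

```python
def compute_gae(rewards, values, gamma=0.97, lam=0.97):
    """Generalized Advantage Estimation for a single trajectory.

    values must contain one more entry than rewards: the final entry is
    the bootstrap value of the state reached after the last reward.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximised)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


# Usage on a toy two-step trajectory.
adv = compute_gae(rewards=[1.0, 1.0], values=[0.0, 0.0, 0.0],
                  gamma=1.0, lam=1.0)
objective = ppo_clip_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

In the toy call, the ratio 1.5 with a positive advantage is clipped to 1.2, while the ratio 0.5 with a negative advantage is clipped to 0.8, so neither sample can push the policy further than the clip range allows.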

3.4 Hierarchical Reward Function: A well-shaped reward function is critical for guiding the agent. We design a dual reward structure with a main task reward $R_m$ and an auxiliary shaping reward $R_a$.

The main reward is:
$$
R_m = R_{guide} + R_{arrival}
$$
where $R_{guide} = \lambda_g (d_{prev} - d_{curr})$ encourages moving toward the goal, and $R_{arrival}$ gives a large positive reward for reaching the goal, a large negative reward for collision, and a smaller negative reward for violating a safe distance $d_{safe}$.

The auxiliary reward promotes smooth and efficient navigation:
$$
R_a = R_{rate} + R_{cur} + R_{straight}
$$
where:
$$
\begin{aligned}
R_{rate} &= -\lambda_a \delta^2 - \lambda_f (\Delta \delta)^2 \quad \text{(penalizes large rudder angles and changes)} \\
R_{cur} &= -\lambda_c \kappa^2 \quad \text{(penalizes high path curvature)} \\
R_{straight} &= \lambda_s T_{straight} \quad \text{(rewards sustained straight-line travel)}
\end{aligned}
$$
The total reward is $R_{total} = R_m + R_a$. The coefficients $\lambda_g, \lambda_a, \lambda_f, \lambda_c, \lambda_s$ are tuned for optimal performance.
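
The reward structure above translates into code fairly directly. In the sketch below every numeric value (the coefficients and the terminal reward magnitudes) is a hypothetical placeholder, since the tuned values are not given in the text. The priority order of the terminal cases is also an assumption.

```python
def main_reward(d_prev, d_curr, reached=False, collided=False,
                too_close=False, lam_g=1.0,
                r_goal=100.0, r_collision=-100.0, r_unsafe=-5.0):
    """R_m = R_guide + R_arrival (all magnitudes are illustrative)."""
    r_guide = lam_g * (d_prev - d_curr)  # positive when the goal gets closer
    if reached:
        r_arrival = r_goal
    elif collided:
        r_arrival = r_collision
    elif too_close:  # inside the safe distance d_safe
        r_arrival = r_unsafe
    else:
        r_arrival = 0.0
    return r_guide + r_arrival


def auxiliary_reward(delta, delta_prev, kappa, t_straight,
                     lam_a=0.01, lam_f=0.01, lam_c=0.1, lam_s=0.05):
    """R_a = R_rate + R_cur + R_straight (coefficients are illustrative)."""
    r_rate = -lam_a * delta ** 2 - lam_f * (delta - delta_prev) ** 2
    r_cur = -lam_c * kappa ** 2
    r_straight = lam_s * t_straight
    return r_rate + r_cur + r_straight


def total_reward(d_prev, d_curr, delta, delta_prev, kappa, t_straight,
                 **events):
    return (main_reward(d_prev, d_curr, **events)
            + auxiliary_reward(delta, delta_prev, kappa, t_straight))


# Usage: one metre of progress with the rudder held at zero.
r = total_reward(10.0, 9.0, delta=0.0, delta_prev=0.0,
                 kappa=0.0, t_straight=0.0)
```

Keeping the main and auxiliary terms in separate functions makes it easier to log each component during training and to see which term dominates.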

Experimental Setup and Comprehensive Analysis

We conducted extensive simulations in a Gazebo/ROS environment to validate our proposed cooperative system and enhanced PPO algorithm. The training environment featured multiple dynamic obstacle ships on set courses. The UAV, controlled by the NMPC tracker, provided the visual stream to the USV’s DRL agent.

1. Parameter Sensitivity and Ablation Studies

Before full navigation tests, we performed sensitivity analysis to determine optimal hyperparameters for our algorithm.

LSTM Structure: We compared single and double-layer LSTM configurations. A double-layer LSTM with (64, 32) units achieved the best convergence speed and final reward, balancing representation capacity and computational efficiency.

GAE Parameter ($\lambda$): The GAE parameter $\lambda$ controls the bias-variance trade-off in advantage estimation. We tested values between 0.90 and 0.99. $\lambda=0.97$ yielded the fastest convergence and highest final performance.

Image Resolution: The resolution for the environmental perception image was analyzed. A resolution of 128×128 pixels provided an optimal trade-off, offering a 98.8% navigation success rate while keeping the average decision time at a feasible 18.3 ms. Lower resolutions hurt performance, and higher resolutions increased computation time without improving success.

Parameter	Optimal Value
LSTM Structure	2 Layers (64, 32 units)
GAE $\lambda$	0.97
Env. Image Resolution	128×128 pixels
Discount Factor ($\gamma$)	0.97

2. Convergence Performance

We compared the training convergence of our enhanced PPO (with LSTM, OU noise, and GAE) against the standard PPO algorithm. The enhanced algorithm converged approximately 25% faster and achieved a final average reward about 11% higher, demonstrating the effectiveness of our modifications.

3. Cooperative Tracking Validation

The NMPC-based tracking controller for the UAV was tested under simulated wind and wave disturbances. The UAV successfully maintained its position above the moving USV with an average horizontal tracking error of less than 0.31 meters. This tracking precision keeps the USV centered in the UAV’s field of view, providing reliable and consistent visual data for navigation decisions.

4. Navigation Performance in Complex Scenarios

We designed four challenging test scenarios to evaluate different aspects of the system. In all experiments involving the UAV, the perception source was the aerial image stream. For baseline LiDAR-based methods, perception was limited to simulated LiDAR range data.

Scenario A: “U-shaped” Obstacle: This scenario tests the system’s ability to handle non-convex obstacles where LiDAR perception fails due to its line-of-sight nature.

Result: Methods using the UAV’s aerial view (our enhanced PPO, standard PPO, DDPG) successfully navigated around the obstacle. The LiDAR-based versions of A*+DWA and of our own enhanced PPO failed, driving into the U-shaped trap. In this scenario, aerial perception from the UAV therefore shows a clear advantage over LiDAR-only perception in geometrically complex environments.

Scenario B: Dynamic Occlusion: A static obstacle occludes two moving ships (TS1, TS2) from the LiDAR’s perspective.

Result: The LiDAR-based agent, unaware of the occluded dynamic ships, chose a path that led to close encounters. The agent using the UAV’s view anticipated the hidden threats and initially chose a safer, albeit slightly longer, path, resulting in a collision-free and more efficient overall trajectory.

Scenarios C and D: Narrow Passage and Dense Obstacles: These scenarios isolate the path-planning and control optimization capabilities of the algorithms, since all compared methods use the same UAV perception.

The quantitative results for Scenarios C and D are summarized below, comparing our Enhanced PPO against standard PPO, DDPG, and the traditional A*+DWA. Key metrics are Path Length and the Standard Deviation of Rudder Rate (a measure of control smoothness).

Scenario	Algorithm	Path Length (m)	Rudder Rate Std (deg/s)
Narrow Passage	A*+DWA	1709.14	1.21
Narrow Passage	DDPG	1660.85	1.17
Narrow Passage	Standard PPO	1598.50	1.14
Narrow Passage	Enhanced PPO (Ours)	1463.43	1.03
Dense Obstacles	A*+DWA	1663.42	1.33
Dense Obstacles	DDPG	1613.34	1.27
Dense Obstacles	Standard PPO	1551.63	1.23
Dense Obstacles	Enhanced PPO (Ours)	1418.57	1.12

Analysis: In both scenarios where perception was equal (C and D), our Enhanced PPO algorithm generated the shortest paths and the smoothest control actions. The improvement over standard PPO is substantial: paths are about 8-9% shorter and the rudder-rate standard deviation is about 9-10% lower. This suggests that the LSTM’s temporal modeling, the OU noise for better exploration, and the GAE-based policy update collectively lead to more efficient and stable navigation policies. The auxiliary reward function appears to have shaped the behavior to minimize unnecessary turning and promote direct routes.
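
The quoted improvements over standard PPO can be reproduced from the table values with a short calculation. The numbers below are copied from the results table; the helper name is illustrative.

```python
# (path length in m, rudder-rate std in deg/s), taken from the results table
RESULTS = {
    "Narrow Passage": {"Standard PPO": (1598.50, 1.14),
                       "Enhanced PPO": (1463.43, 1.03)},
    "Dense Obstacles": {"Standard PPO": (1551.63, 1.23),
                        "Enhanced PPO": (1418.57, 1.12)},
}


def pct_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


improvements = {}
for scenario, rows in RESULTS.items():
    base_len, base_std = rows["Standard PPO"]
    ours_len, ours_std = rows["Enhanced PPO"]
    improvements[scenario] = (pct_reduction(base_len, ours_len),
                              pct_reduction(base_std, ours_std))
    print(f"{scenario}: path {improvements[scenario][0]:.2f}% shorter, "
          f"rudder-rate std {improvements[scenario][1]:.2f}% lower")
```

The path-length reductions come out at roughly 8.4% and 8.6%, and the rudder-rate reductions at roughly 9.6% and 8.9%, consistent with the rounded ranges quoted in the analysis.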

Conclusion and Future Perspectives

This paper presented a framework for cooperative navigation between a UAV and a USV, supported by an enhanced PPO algorithm. The integration of aerial visual perception addresses a critical limitation of traditional LiDAR-based systems in complex “U-shaped” and occluded environments. The proposed algorithmic enhancements (LSTM for state memory, OU noise for effective exploration, GAE for stable learning, and a hierarchical reward function) collectively yield a DRL agent that produces shorter, smoother, and more reliable navigation paths than baseline methods.

The outlook for UAV and USV collaboration is promising. Subsequent research will focus on scaling this system to multi-vessel scenarios, where several USVs are coordinated by one or more UAVs for complex missions such as area surveying or fleet operations. Furthermore, investigating robust multi-modal sensor fusion, combining the visual data from the UAV with the USV’s own LiDAR, radar, and AIS data, will be crucial for achieving all-weather, all-condition reliability. This line of research paves the way for intelligent and autonomous maritime systems capable of operating safely and efficiently in challenging waterways.