Quadcopters have become ubiquitous in various applications due to their maneuverability and vertical take-off and landing capabilities. However, traditional control methods often rely on unidirectional thrust, limiting the quadcopter’s action space and agility. This paper proposes a novel approach that integrates deep reinforcement learning (DRL) with bidirectional thrust control for quadcopters, enabling rapid stabilization under extreme conditions such as large attitudes, high speeds, and high angular velocities. By expanding the action space to include negative thrust, the quadcopter achieves enhanced maneuverability and robustness. The DRL-based neural network controller directly outputs desired thrusts for the four motors, enabling end-to-end control. Simulations demonstrate that the bidirectional thrust controller outperforms unidirectional counterparts in terms of smoother actions, reduced state fluctuations, shorter stabilization times, and improved robustness. This work lays the foundation for advanced quadcopter applications requiring aggressive maneuvers.
The dynamics of a quadcopter with bidirectional thrust are modeled using Newton-Euler equations. The translational and rotational motions are described as follows:
$$ \dot{p}_{WB} = v_{WB} $$
$$ \dot{q}_{WB} = \frac{1}{2} \Lambda(\omega_B) q_{WB} $$
$$ \dot{v}_{WB} = q_{WB} \odot c + g $$
$$ \dot{\omega}_B = J^{-1} (\eta - \omega_B \times J \omega_B) $$
Here, $p_{WB}$ and $v_{WB}$ represent the position and velocity in the world frame, $q_{WB}$ is the unit quaternion for attitude, $\omega_B$ is the angular velocity in the body frame, $c$ is the thrust acceleration, $g$ is gravity, $\eta$ is the torque, and $J$ is the inertia matrix. The thrust acceleration and torque are derived from individual rotor thrusts $f_i$:
$$ c = \frac{\sum_{i=1}^{4} f_i}{M} $$
$$ \eta = \begin{bmatrix} \frac{L}{\sqrt{2}} (f_1 - f_2 - f_3 + f_4) \\ \frac{L}{\sqrt{2}} (-f_1 - f_2 + f_3 + f_4) \\ K_\tau (f_1 - f_2 + f_3 - f_4) \end{bmatrix} $$
For bidirectional thrust, the rotor thrust model incorporates a signum function to account for the rotation direction:
$$ f_i = K_f \cdot \text{sgn}(\Omega_i) \cdot \Omega_i^2 $$
where $\Omega_i$ is the rotor speed, and $K_f$ is the thrust coefficient. The rotor dynamics are modeled as a first-order system:
$$ \dot{\Omega} = \frac{\Omega_{\text{des}} - \Omega}{K_\alpha} $$
This model allows the quadcopter to generate negative thrust, expanding its control capabilities.
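To make the model concrete, the following Python sketch integrates these equations with a simple explicit-Euler step. The physical constants are taken from Table 2 (with $K_f$ in the units listed there); the quaternion helpers and the step size `dt` are illustrative choices, not values stated in the paper.

```python
import numpy as np

# Physical parameters from Table 2 (K_f units as listed there)
M, L_ARM = 0.78, 0.125                        # mass (kg), arm length (m)
J = np.diag([2.3e-3, 2.3e-3, 3.6e-3])         # inertia matrix
K_TAU, K_F, K_ALPHA = 0.01, 1.5854, 0.033     # torque-thrust ratio, thrust coeff., motor time constant
G = np.array([0.0, 0.0, -9.81])               # gravity in the world frame (m/s^2)

def rotor_thrust(omega_rotor):
    """Bidirectional thrust model: f_i = K_f * sgn(Omega_i) * Omega_i^2."""
    return K_F * np.sign(omega_rotor) * omega_rotor**2

def quat_mul(q1, q2):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_rotate(q, v):
    """Rotate a body-frame vector v into the world frame by the unit quaternion q."""
    w, x, y, z = q
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return R @ v

def dynamics_step(p, v, q, w_b, omega, omega_des, dt=0.002):
    """One explicit-Euler step of the Newton-Euler model with first-order rotor dynamics."""
    f = rotor_thrust(omega)                           # per-rotor thrusts (may be negative)
    c = np.array([0.0, 0.0, f.sum() / M])             # mass-normalized collective thrust along body z
    eta = np.array([                                  # body torques for the X configuration
        L_ARM / np.sqrt(2) * ( f[0] - f[1] - f[2] + f[3]),
        L_ARM / np.sqrt(2) * (-f[0] - f[1] + f[2] + f[3]),
        K_TAU              * ( f[0] - f[1] + f[2] - f[3]),
    ])
    p = p + dt * v                                                       # position kinematics
    v = v + dt * (quat_rotate(q, c) + G)                                 # translational dynamics
    q = q + dt * 0.5 * quat_mul(q, np.concatenate(([0.0], w_b)))         # quaternion kinematics
    q = q / np.linalg.norm(q)                                            # re-normalize
    w_b = w_b + dt * np.linalg.solve(J, eta - np.cross(w_b, J @ w_b))    # rotational dynamics
    omega = omega + dt * (omega_des - omega) / K_ALPHA                   # first-order rotor response
    return p, v, q, w_b, omega
```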

The DRL framework formulates the control problem as a Markov Decision Process (MDP) defined by the tuple $(S, A, P, R)$. The state $s_t \in S$ is a 12-dimensional vector including position, velocity, Euler angles, and angular velocities. The action $a_t \in A$ consists of desired thrusts for four motors. The state transition function $P$ is governed by the quadcopter dynamics, and the reward function $R$ is designed to minimize errors and ensure stability:
$$ r_t = \alpha_p \cdot \|\Delta p\|^2 + \alpha_v \cdot \|\Delta v\|^2 + \alpha_o \cdot \|\Delta o\|^2 + \alpha_\omega \cdot \|\Delta \omega\|^2 + \alpha_c \cdot Q_c + \alpha_a \cdot Q_a $$
where $\Delta$ terms denote errors relative to the hover state, $Q_c$ is a crash penalty, and $Q_a$ is a survival reward. The coefficients are tuned to prioritize position and attitude stabilization.
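A direct transcription of this reward, using the coefficients later listed in Table 4, might look as follows. The interpretation of $Q_c$ as a crash indicator and $Q_a$ as a constant per-step survival bonus is an assumption on our part.

```python
import numpy as np

# Coefficients from Table 4 (negative weights penalize errors; alpha_a rewards survival)
ALPHA_P, ALPHA_O, ALPHA_V, ALPHA_W = -0.02, -0.02, -0.0002, -0.0002
ALPHA_C, ALPHA_A = -10.0, 0.1

def reward(state, hover_state, crashed):
    """Per-step reward r_t; both states are 12-D vectors [p, v, o, w]
    (position, velocity, Euler angles, angular velocity), matching the MDP state."""
    dp, dv, do, dw = np.split(state - hover_state, 4)
    r = (ALPHA_P * dp @ dp + ALPHA_V * dv @ dv
         + ALPHA_O * do @ do + ALPHA_W * dw @ dw)
    r += ALPHA_C * float(crashed)   # crash penalty Q_c (assumed 1 on crash, else 0)
    r += ALPHA_A                    # survival reward Q_a (assumed 1 per step alive)
    return r
```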
The neural network controller uses a fully connected architecture with two hidden layers of 64 neurons each and tanh activation functions. It takes the state error $\Delta s_t$ as input and outputs motor thrusts. The Proximal Policy Optimization (PPO) algorithm trains the controller, with hyperparameters summarized in Table 1.
| Parameter | Value |
|---|---|
| Learning Rate | 0.0003 |
| Discount Factor | 0.99 |
| Advantage Estimation Weight | 0.95 |
| Training Epochs | 10 |
| Policy Network | MLP[64,64] |
| Value Network | MLP[64,64] |
| Clipping Range | 0.2 |
| Entropy Coefficient | 0.0 |
| Batch Size | 1 |
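As a rough sketch, a policy with the architecture and hyperparameters of Table 1 could be instantiated with Stable-Baselines3 as below. The paper does not state which PPO implementation was used, and `QuadHoverEnv` is a hypothetical Gym-style wrapper around the dynamics above; the training budget is likewise illustrative.

```python
import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import PPO

class QuadHoverEnv(gym.Env):
    """Hypothetical Gym wrapper around the bidirectional-thrust dynamics sketched earlier.
    Observations: 12-D state error; actions: four desired motor thrusts (may be negative)."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(12,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.zeros(12, dtype=np.float32)   # replace with randomized initialization
        return self.state, {}
    def step(self, action):
        # Placeholder: plug in dynamics_step() and reward() from the earlier sketches.
        reward, terminated, truncated = 0.0, False, False
        return self.state, reward, terminated, truncated, {}

model = PPO(
    "MlpPolicy",
    QuadHoverEnv(),
    learning_rate=3e-4,      # Table 1: learning rate
    gamma=0.99,              # discount factor
    gae_lambda=0.95,         # advantage estimation weight (GAE lambda)
    n_epochs=10,             # training epochs per update
    clip_range=0.2,          # PPO clipping range
    ent_coef=0.0,            # entropy coefficient
    policy_kwargs=dict(net_arch=[64, 64], activation_fn=torch.nn.Tanh),
)
model.learn(total_timesteps=1_000_000)  # illustrative budget, not from the paper
```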
Training involves random initialization of states within a bounded region to enhance robustness, and parallel simulation of 100 quadcopters accelerates data collection; an illustrative sampling sketch follows Table 2. The quadcopter parameters, based on a real prototype, are listed in Table 2.
| Parameter | Value |
|---|---|
| Mass $M$ | 0.78 kg |
| Arm Length $L$ | 0.125 m |
| Moment of Inertia $I_x$ | 2.3 × 10⁻³ N·m·s² |
| Moment of Inertia $I_y$ | 2.3 × 10⁻³ N·m·s² |
| Moment of Inertia $I_z$ | 3.6 × 10⁻³ N·m·s² |
| Torque-Thrust Ratio $K_\tau$ | 0.01 m |
| Thrust Coefficient $K_f$ | 1.5854 N·(r/min)⁻² |
| Motor Time Constant $K_\alpha$ | 0.033 s |
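The bounded initialization region is not specified numerically in the text; the ranges below are placeholders showing how such randomization might be implemented, with 100 parallel environments as stated.

```python
import numpy as np

rng = np.random.default_rng()

def sample_initial_states(num_envs=100):
    """Sample randomized initial states for a batch of parallel quadcopters.
    The bounds are illustrative placeholders; the paper only states that states
    are drawn from a bounded region around hover."""
    p = rng.uniform(-1.0, 1.0, size=(num_envs, 3))        # position offset (m)
    v = rng.uniform(-3.0, 3.0, size=(num_envs, 3))        # linear velocity (m/s)
    o = rng.uniform(-np.pi, np.pi, size=(num_envs, 3))    # Euler angles (rad)
    w = rng.uniform(-5.0, 5.0, size=(num_envs, 3))        # angular velocity (rad/s)
    return np.concatenate([p, v, o, w], axis=1)           # shape (num_envs, 12)

initial_states = sample_initial_states()   # one 12-D state per parallel simulation
```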
Experiments evaluate the bidirectional thrust controller (BTC) against a unidirectional thrust controller (OPTC) in scenarios with large attitudes, high speeds, high angular velocities, and combined extreme conditions. The quadcopter starts from disturbed states and aims to hover at a target position. Performance metrics include stabilization time, state oscillations, and control smoothness.
In large attitude tests, such as a 180° roll, BTC stabilizes in approximately 2 seconds, while OPTC requires 4 seconds. BTC exhibits smaller position deviations and smoother thrust outputs, as shown in Table 3.
| Metric | BTC | OPTC |
|---|---|---|
| Stabilization Time (s) | 2.0 | 4.0 |
| Max Position Error (m) | 0.2 | 0.5 |
| Thrust Oscillation (N) | ±2 | ±4 |
For high-speed scenarios, both controllers achieve stabilization, but BTC produces smoother motor commands. In high angular velocity cases, OPTC fails to recover, leading to crashes, whereas BTC successfully stabilizes the quadcopter by utilizing negative thrust. The combined extreme condition test further validates BTC’s superiority, with OPTC failing and BTC maintaining control.
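The stabilization-time metric can be extracted from a simulated trajectory as sketched below; the 0.05 m settling threshold is an assumed value, not one given in the paper.

```python
import numpy as np

def stabilization_time(position_error, dt, threshold=0.05):
    """Return the first time after which the position-error norm stays below
    `threshold` for the remainder of the trajectory, or None if it never settles.
    `position_error` has shape (T, 3); `threshold` (m) is an assumed value."""
    norms = np.linalg.norm(position_error, axis=1)
    settled = norms < threshold
    for t in range(len(settled)):
        if settled[t:].all():
            return t * dt
    return None
```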
The reward function coefficients used in training are detailed in Table 4.
| Coefficient | Value |
|---|---|
| $\alpha_p$ | -0.02 |
| $\alpha_o$ | -0.02 |
| $\alpha_v$ | -0.0002 |
| $\alpha_\omega$ | -0.0002 |
| $\alpha_c$ | -10.0 |
| $\alpha_a$ | 0.1 |
The neural network controller’s ability to command bidirectional thrust enables the quadcopter to perform aggressive maneuvers. Expanding the action space to include negative values allows for more diverse control strategies, enhancing robustness. The training process, with randomized initial states and parallel environments, ensures that the controller generalizes well to various disturbances.
Taken together, the integration of deep reinforcement learning with bidirectional thrust control significantly advances quadcopter capabilities. The BTC method demonstrates improved performance in extreme conditions, paving the way for applications requiring high agility. Future work will focus on real-world deployment, addressing sim-to-real gaps, and further optimizing the control policies for complex environments. This approach underscores the potential of DRL in pushing the boundaries of quadcopter performance.
The dynamics model for the quadcopter with bidirectional thrust is crucial for realistic simulation. The equations of motion account for the full six degrees of freedom, and the incorporation of bidirectional thrust adds complexity to the control allocation. The thrust model, as defined, allows each rotor to generate force in either direction, which is essential for recovering from inverted poses or rapid descents. The first-order motor model ensures that the response delays are considered, making the simulation more accurate.
The DRL controller’s design focuses on minimal computational overhead while maintaining expressive power. The use of a relatively shallow network with two hidden layers strikes a balance between performance and efficiency, which is critical for real-time applications on resource-constrained quadcopter platforms. The tanh activation function bounds the outputs, facilitating stable training and control.
Training the quadcopter controller with PPO involves optimizing the policy to maximize cumulative reward. The reward function is carefully crafted to penalize deviations from the desired hover state while encouraging smooth control actions. The coefficients in Table 4 were determined through empirical tuning to ensure that the quadcopter learns to stabilize quickly without excessive oscillations. The inclusion of crash penalties and survival rewards guides the learning process towards safe and effective behaviors.
Experimental results highlight the advantages of bidirectional thrust. In the large attitude scenario, the quadcopter initially in a flipped configuration must generate negative thrust to reorient itself. BTC achieves this efficiently, while OPTC struggles due to the limited action space. The high-speed scenario tests the controller’s ability to decelerate and hover, where BTC’s smoother actions reduce overshoot. The high angular velocity scenario is particularly challenging, as the quadcopter must counteract rapid spins, and BTC’s use of bidirectional thrust provides the necessary torque for recovery.
The combined extreme condition test integrates multiple disturbances, simulating real-world failures or aggressive maneuvers. BTC’s success in this scenario demonstrates its robustness and versatility. The ability to handle such conditions is vital for applications like search and rescue or aerial acrobatics, where quadcopters must operate reliably under uncertainty.
Overall, this work showcases the synergy between advanced control theory and machine learning. By leveraging DRL, the quadcopter can learn complex control policies that are difficult to derive analytically. The bidirectional thrust model expands the quadcopter’s capabilities, enabling behaviors that were previously infeasible. As quadcopters continue to evolve, such approaches will play a key role in unlocking their full potential.
The proposed method not only improves performance but also enhances safety by allowing recovery from extreme states. This is particularly important for autonomous operations where human intervention is not feasible. The simulation environment, built on Flightmare with modifications, provides a realistic platform for testing and validation. The parameters in Table 2 are based on actual measurements, ensuring that the results are transferable to real quadcopters.
In summary, the bidirectional thrust control via deep reinforcement learning represents a significant step forward in quadcopter technology. The detailed dynamics model, efficient neural network controller, and comprehensive training strategy collectively contribute to a robust solution for aggressive flight control. Future directions include adapting the controller to different quadcopter designs and exploring multi-agent scenarios where coordination is required. This research opens new avenues for intelligent and agile aerial systems.
