A Q-Learning-Based Adaptive Stochastic Fuzzy Model Predictive Control for Quadrotor Drone Technology

This work studies attitude control for quadrotor drones, which are widely used in military and civilian applications such as aerial photography, cargo transport, and emergency rescue. Strong nonlinearity, underactuation, and susceptibility to external disturbances make quadrotor attitude control challenging. To address these issues, we propose a control framework named Disturbance-Observer-based Adaptive Stochastic Fuzzy Model Predictive Control (DO-AFSMPC), which integrates Q-learning, an Unscented Kalman Filter (UKF), T-S fuzzy modeling, and a disturbance observer. The approach aims to improve robustness, tracking accuracy, and disturbance rejection in complex environments.

We begin by establishing a nonlinear dynamic model of the quadrotor, capturing the attitude dynamics with roll, pitch, and yaw angles. The state-space representation is given by:

\[
\dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u}) + \mathbf{d}, \quad \mathbf{y} = C\mathbf{x}
\]

where \(\mathbf{x} = [\phi, \theta, \psi, p, q, r]^T\) is the state vector, \(\mathbf{u}\) is the control input vector, and \(\mathbf{d}\) represents external disturbances and model uncertainties. To handle the strong nonlinearity, we employ a Takagi-Sugeno (T-S) fuzzy modeling approach, which decomposes the system into a set of local linear models. The fuzzy rules are defined as:

\[
\text{Rule } i: \text{IF } z_1 \text{ is } \Gamma_i, \text{ THEN }
\mathbf{x}(k+1) = \mathbf{A}_i \mathbf{x}(k) + \mathbf{B}_i \mathbf{u}(k) + \mathbf{R}_i
\]

The overall fuzzy model is obtained through weighted interpolation:

\[
\mathbf{x}(k+1) = \sum_{i=1}^{N} \mu_i(z) \big( \mathbf{A}_i \mathbf{x}(k) + \mathbf{B}_i \mathbf{u}(k) + \mathbf{R}_i \big), \quad \mathbf{y}(k) = C\mathbf{x}(k)
\]

where \(\mu_i(z)\) are Gaussian membership functions covering the operating space. We select nine operating points as shown in Table 1 to cover the typical attitude range.

Table 1: Operating Point Parameters for T-S Fuzzy Model
Index	Roll \(\phi\) (rad)	Pitch \(\theta\) (rad)
1	\(-\pi/4\)	\(-\pi/4\)
2	\(-\pi/4\)	0
3	\(-\pi/4\)	\(\pi/4\)
4	0	\(-\pi/4\)
5	0	0
6	0	\(\pi/4\)
7	\(\pi/4\)	\(-\pi/4\)
8	\(\pi/4\)	0
9	\(\pi/4\)	\(\pi/4\)
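
The weighted interpolation over the nine operating points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the membership width `sigma`, the use of (roll, pitch) as the premise vector, and the function names are assumptions, and the local matrices \(\mathbf{A}_i, \mathbf{B}_i, \mathbf{R}_i\) are supplied by the caller.

```python
import numpy as np

# Operating points from Table 1 as (roll, pitch) pairs in radians,
# ordered as in the table (roll varies slowest).
GRID = np.array([[phi, theta]
                 for phi in (-np.pi / 4, 0.0, np.pi / 4)
                 for theta in (-np.pi / 4, 0.0, np.pi / 4)])

def memberships(z, centers=GRID, sigma=np.pi / 8):
    """Normalized Gaussian weights mu_i(z). sigma is an assumed width."""
    d2 = np.sum((centers - np.asarray(z, dtype=float)) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / sigma**2)
    return w / w.sum()

def ts_step(x, u, z, A, B, R):
    """One step of the blended model: sum_i mu_i(z) (A_i x + B_i u + R_i).

    A has shape (N, n, n), B has shape (N, n, m), R has shape (N, n).
    """
    mu = memberships(z)
    local = A @ x + B @ u + R   # shape (N, n): one prediction per rule
    return mu @ local           # convex combination of the local predictions
```

Because the weights are normalized, the blended prediction is a convex combination of the local models, so it reduces to a single local model when all rules share the same matrices.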

Based on the T-S fuzzy model, we design a Stochastic Model Predictive Controller (SMPC) that minimizes a cost function over a prediction horizon \(N_p\) and control horizon \(N_c\). The predicted output is:

\[
\hat{Y}(k) = P x(k) + H U
\]

with \(P\) and \(H\) matrices constructed from the system matrices. The cost function is:

\[
J = Y^T Q Y + U^T W U
\]

where \(Q\) and \(W\) are weighting matrices. Input constraints are imposed to reflect actuator limits:

\[
\mathbf{u}_{\min} \le \mathbf{u} \le \mathbf{u}_{\max}, \quad \Delta \mathbf{u}_{\min} \le \Delta \mathbf{u} \le \Delta \mathbf{u}_{\max}
\]
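
One common way to build the prediction matrices \(P\) and \(H\) is sketched below for a single local linear model. The sketch rests on assumptions the text does not state: the affine term \(\mathbf{R}_i\) is neglected, the stacked input \(U\) holds \(N_c\) absolute inputs, and inputs beyond the control horizon are held at their last value. The function name is hypothetical.

```python
import numpy as np

def prediction_matrices(A, B, C, Np, Nc):
    """Return P, H such that Y_hat = P x(k) + H U for x+ = A x + B u, y = C x.

    Assumption: for steps past the control horizon the last input is held,
    so its effect is accumulated into the last block column of H.
    """
    n, m = B.shape
    p = C.shape[0]
    P = np.zeros((Np * p, n))
    H = np.zeros((Np * p, Nc * m))
    Apow = np.eye(n)
    CAB = []                       # CAB[j] = C A^j B
    for i in range(Np):
        CAB.append(C @ Apow @ B)
        Apow = A @ Apow            # now A^(i+1)
        P[i * p:(i + 1) * p] = C @ Apow
        for j in range(i + 1):
            col = min(j, Nc - 1)   # inputs past Nc reuse the last column
            H[i * p:(i + 1) * p, col * m:(col + 1) * m] += CAB[i - j]
    return P, H
```

Row block \(i\) of the result predicts \(y(k+i+1)\), so the stacked output covers the steps \(k+1\) through \(k+N_p\).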

To handle unknown disturbances, we incorporate a disturbance observer (DO) that estimates the lumped disturbance \(\hat{\mathbf{d}}\). The observer dynamics are:

\[
\begin{aligned}
\mathbf{z}(k+1) &= \mathbf{z}(k) + \mathbf{F}(\mathbf{A}_z \mathbf{x}(k) + \mathbf{B}_z \mathbf{u}(k) + \mathbf{R}_z + \mathbf{K}_z \hat{\mathbf{d}}) \\
\hat{\mathbf{d}}(k) &= \mathbf{F} \mathbf{x}(k) - \mathbf{z}(k)
\end{aligned}
\]

The estimated disturbance is then compensated in the control law:

\[
\mathbf{u}(k) = \mathbf{u}_c(k) + \mathbf{F}_d \hat{\mathbf{d}}(k)
\]
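
The observer recursion can be sketched as a small class. This is an illustration under assumptions rather than the authors' code: the class and method names are hypothetical, and in the usage check the matrix \(\mathbf{A}_z\) is taken to be \(\mathbf{A} - \mathbf{I}\), an interpretation the text does not confirm but one under which the estimate of a constant disturbance converges.

```python
import numpy as np

class DisturbanceObserver:
    """Discrete observer: d_hat(k) = F x(k) - z(k), with z updated as in the text."""

    def __init__(self, F, A_z, B_z, R_z, K_z):
        self.F, self.A_z, self.B_z = F, A_z, B_z
        self.R_z, self.K_z = R_z, K_z
        self.z = np.zeros(F.shape[0])   # internal observer state z(0) = 0

    def estimate(self, x):
        """Current disturbance estimate d_hat(k)."""
        return self.F @ x - self.z

    def update(self, x, u):
        """Advance z(k) to z(k+1) and return the estimate d_hat(k) used in the step."""
        d_hat = self.estimate(x)
        model = self.A_z @ x + self.B_z @ u + self.R_z + self.K_z @ d_hat
        self.z = self.z + self.F @ model
        return d_hat
```

With \(\mathbf{A}_z = \mathbf{A} - \mathbf{I}\) and a constant disturbance entering through \(\mathbf{K}_z\), the estimation error in this sketch evolves as \((\mathbf{I} - \mathbf{F}\mathbf{K}_z)\) times the previous error, so the gain \(\mathbf{F}\) sets the convergence rate.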

We further improve robustness by introducing an adaptive prediction horizon strategy that adjusts \(N_p\) based on the tracking error \(e(k)\). The error is defined as:

\[
e(k) = \sqrt{ (y_1 - r_1)^2 + (y_2 - r_2)^2 + (y_3 - r_3)^2 }
\]

When the error exceeds a threshold \(\epsilon\), we reduce \(N_p\) to enhance responsiveness; otherwise, we increase it to improve smoothness. The adjustment rule is:

\[
N_p(k+1) = \begin{cases}
\max(N_p(k) - \alpha, 1), & e(k) > \epsilon \\
N_p(k) + \alpha, & e(k) \le \epsilon
\end{cases}
\]
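
The adjustment rule translates directly into code. In the sketch below the function names are hypothetical, and the optional upper bound `n_p_max` is an added assumption: the rule as stated has no cap, so the horizon would otherwise keep growing while the error stays small.

```python
import math

def tracking_error(y, r):
    """Euclidean tracking error e(k) over the three attitude outputs."""
    return math.sqrt(sum((yi - ri) ** 2 for yi, ri in zip(y, r)))

def next_horizon(n_p, e, eps, alpha=1, n_p_max=None):
    """Shrink N_p when e > eps (never below 1), otherwise grow it by alpha.

    n_p_max is an assumed, optional cap that is not part of the stated rule.
    """
    if e > eps:
        return max(n_p - alpha, 1)
    n_p = n_p + alpha
    return n_p if n_p_max is None else min(n_p, n_p_max)
```

For example, with \(N_p = 6\), \(\alpha = 1\), and \(\epsilon = 0.1\), an error of 0.5 gives a new horizon of 5, while an error of 0.05 gives 7.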

To cope with process and measurement noise, we employ an Adaptive Unscented Kalman Filter (AUKF). The sigma points are generated as:

\[
\begin{aligned}
\mathcal{X}^{(0)}_{k-1} &= \hat{x}_{k-1} \\
\mathcal{X}^{(i)}_{k-1} &= \hat{x}_{k-1} + \left(\sqrt{(n+\lambda)P_{k-1}}\right)_i, \quad i=1,\ldots,n \\
\mathcal{X}^{(i+n)}_{k-1} &= \hat{x}_{k-1} - \left(\sqrt{(n+\lambda)P_{k-1}}\right)_i, \quad i=1,\ldots,n
\end{aligned}
\]
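
A minimal sketch of the sigma-point construction follows. It assumes the matrix square root is taken as the lower Cholesky factor and that each sigma point uses one column of that factor, which is the usual convention; the function name is hypothetical.

```python
import numpy as np

def sigma_points(x_hat, P, lam):
    """Return the 2n+1 sigma points as rows of an array of shape (2n+1, n).

    Uses the Cholesky factor S with S @ S.T = (n + lam) * P, so n + lam
    must be positive and P must be positive definite.
    """
    n = x_hat.size
    S = np.linalg.cholesky((n + lam) * P)
    pts = np.empty((2 * n + 1, n))
    pts[0] = x_hat
    for i in range(n):
        pts[1 + i] = x_hat + S[:, i]        # plus the i-th column
        pts[1 + n + i] = x_hat - S[:, i]    # minus the i-th column
    return pts
```

With the standard weights \(\lambda/(n+\lambda)\) for the central point and \(1/(2(n+\lambda))\) for the others, the weighted mean and covariance of these points reproduce \(\hat{x}_{k-1}\) and \(P_{k-1}\).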

The predicted state and covariance are computed, and the measurement update yields the Kalman gain and posterior state. To adapt to time-varying noise, we update the measurement covariance matrix adaptively:

\[
P_{zz} = \sum_{i=0}^{2n} W_c^{(i)} (\mathcal{Z}^{(i)} - \hat{z})(\mathcal{Z}^{(i)} - \hat{z})^T + R
\]

\[
P_{xz} = \sum_{i=0}^{2n} W_c^{(i)} (\mathcal{X}^{(i)} - \hat{x}^-)(\mathcal{Z}^{(i)} - \hat{z})^T
\]

\[
K = P_{xz} P_{zz}^{-1}, \quad \hat{x} = \hat{x}^- + K(z - \hat{z}), \quad P = P^- - K P_{zz} K^T
\]
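
The measurement update can be sketched as below. The function name and argument layout are assumptions, the mean weights `Wm` are not defined in the text and follow the usual UKF convention, and the sketch takes \(R\) as a given input because the adaptation rule for \(R\) is not specified here.

```python
import numpy as np

def ukf_update(X, Z, Wm, Wc, x_prior, P_prior, z, R):
    """UKF measurement update from propagated sigma points.

    X: (2n+1, n) predicted state sigma points
    Z: (2n+1, p) corresponding predicted measurements
    Wm, Wc: (2n+1,) mean and covariance weights (Wm is an assumed input)
    """
    z_hat = Wm @ Z                           # predicted measurement mean
    dZ = Z - z_hat
    dX = X - x_prior
    P_zz = dZ.T @ (Wc[:, None] * dZ) + R     # innovation covariance
    P_xz = dX.T @ (Wc[:, None] * dZ)         # state-measurement cross covariance
    K = P_xz @ np.linalg.inv(P_zz)           # Kalman gain
    x_post = x_prior + K @ (z - z_hat)
    P_post = P_prior - K @ P_zz @ K.T
    return x_post, P_post, K
```

For a linear measurement model this update coincides with the standard Kalman filter update, which gives a convenient sanity check.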

We introduce Q-learning to optimize the control weighting matrix \(W\) online. The Markov Decision Process (MDP) is defined with state \(s = [e(k), \delta(k)]\), action set \(F = \{f_1, f_2, f_3\}\), and reward function \(r(k) = w_1 r_t + w_2 r_s + w_3 r_i + r_v + r_p\). The Q-value update follows:

\[
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
\]

The actions adjust \(W\) as:

\[
W(k+1) = \begin{cases}
\max(W(k) – \beta, W_{\min}), & a = f_1 \\
W(k), & a = f_2 \\
\min(W(k) + \beta, W_{\max}), & a = f_3
\end{cases}
\]
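
The Q-value update and the weight adjustment can be sketched in tabular form. Several choices here are assumptions made for illustration: the continuous state \([e(k), \delta(k)]\) is assumed to have been discretized into an integer index, \(W\) is treated as a scalar weight rather than a matrix, and the function names are hypothetical.

```python
import numpy as np

ACTIONS = ("f1", "f2", "f3")   # decrease W, hold W, increase W

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step on Q[state_index, action_index] (in place)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def adjust_weight(W, a, beta, W_min, W_max):
    """Apply action a to the scalar weight W, clamped to [W_min, W_max]."""
    if ACTIONS[a] == "f1":
        return max(W - beta, W_min)
    if ACTIONS[a] == "f3":
        return min(W + beta, W_max)
    return W
```

The clamping keeps \(W\) inside \([W_{\min}, W_{\max}]\) regardless of how often the same action is selected.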

The stability of the closed-loop system is proven via Lyapunov theory. We define the Lyapunov function:

\[
V(k) = e^T(k) P e(k) + J^*(k)
\]

Under the effect of DO and AUKF, the error dynamics satisfy:

\[
e^T(k+1) P e(k+1) - e^T(k) P e(k) \le - e^T(k) Q_e e(k)
\]

and the optimal cost \(J^*\) decreases monotonically. Hence \(\Delta V(k) < 0\), confirming asymptotic stability.
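
The step from these two properties to \(\Delta V(k) < 0\) can be written out explicitly. The derivation below is a sketch that assumes \(Q_e\) is positive definite and that \(J^*(k+1) \le J^*(k)\); here \(e(k)\) denotes the error vector of the Lyapunov function, not the scalar error norm used for the horizon adaptation.

\[
\begin{aligned}
\Delta V(k) &= V(k+1) - V(k) \\
&= \big[ e^T(k+1) P e(k+1) - e^T(k) P e(k) \big] + \big[ J^*(k+1) - J^*(k) \big] \\
&\le - e^T(k) Q_e e(k) + \big[ J^*(k+1) - J^*(k) \big] \\
&< 0 \quad \text{for } e(k) \ne 0
\end{aligned}
\]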

We evaluate the proposed DO-AFSMPC through simulation experiments using MATLAB. The quadrotor parameters are listed in Table 2.

Table 2: Quadrotor Model Parameters
Parameter	Value
Mass \(m\)	1.545 kg
Arm length \(l\)	0.255 m
Thrust coefficient \(k_T\)	\(5.84 \times 10^{-6}\) N/(rad/s)²
Drag coefficient \(k_D\)	\(1.168 \times 10^{-7}\) N/(rad/s)²
Roll inertia \(J_{xx}\)	0.029 kg·m²
Pitch inertia \(J_{yy}\)	0.029 kg·m²
Yaw inertia \(J_{zz}\)	0.055 kg·m²
Min motor speed \(\omega_{\min}\)	1805 rad/s
Max motor speed \(\omega_{\max}\)	11100 rad/s
Control horizon \(N_c\)	3
Prediction horizon \(N_p\) (initial)	6
Sampling time \(t_s\)	0.2 s

The Q-learning parameters are set as shown in Table 3.

Table 3: Q-Learning Parameters
Parameter	Value
Learning rate \(\alpha\)	0.3
Discount factor \(\gamma\)	0.6
Initial exploration rate	0.35
Minimum exploration rate	0.02
Exploration decay	0.997
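
One way to read the exploration parameters is as an epsilon-greedy policy whose rate decays multiplicatively per step. That reading is an assumption, since the decay schedule is not spelled out, and the function names below are hypothetical.

```python
import random

EPS_0, EPS_MIN, DECAY = 0.35, 0.02, 0.997   # values from Table 3

def exploration_rate(step):
    """Assumed schedule: eps = max(EPS_MIN, EPS_0 * DECAY**step)."""
    return max(EPS_MIN, EPS_0 * DECAY ** step)

def select_action(q_row, step, rng=random):
    """Epsilon-greedy choice over one row of Q-values (one value per action)."""
    if rng.random() < exploration_rate(step):
        return rng.randrange(len(q_row))                    # explore
    return max(range(len(q_row)), key=q_row.__getitem__)    # exploit
```

Under this schedule the rate reaches its floor of 0.02 after roughly 950 steps and stays there.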

The input constraints are:

\[
\mathbf{u}_{\min} = [-1.182, -1.182, -0.131]^T, \quad \mathbf{u}_{\max} = [1.182, 1.182, 0.131]^T
\]
\[
\Delta\mathbf{u}_{\min} = [-0.709, -0.709, -0.709]^T, \quad \Delta\mathbf{u}_{\max} = [0.709, 0.709, 0.709]^T
\]
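
For intuition about how the magnitude and rate limits interact, the following sketch clamps a commanded input against both. It is only an illustration: in the controller itself the constraints are imposed inside the MPC optimization, not as a post-hoc clamp, and the function name is hypothetical.

```python
import numpy as np

U_MIN = np.array([-1.182, -1.182, -0.131])
U_MAX = np.array([1.182, 1.182, 0.131])
DU_MIN = np.array([-0.709, -0.709, -0.709])
DU_MAX = np.array([0.709, 0.709, 0.709])

def saturate(u_cmd, u_prev):
    """Clamp the input increment first, then the resulting input magnitude."""
    du = np.clip(u_cmd - u_prev, DU_MIN, DU_MAX)
    return np.clip(u_prev + du, U_MIN, U_MAX)
```

Starting from a zero input, a command of \([2, 0.5, -1]^T\) is limited to \([0.709, 0.5, -0.131]^T\): the first channel hits the rate limit and the third hits the magnitude limit.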

The reference trajectory is defined as:

\[
\begin{aligned}
\phi_r(k) &= 0.8 \cos(0.2k - 0.1) \\
\theta_r(k) &= 0.6 \sin(0.05\pi k) \\
\psi_r(k) &= 0.5 \cos(0.2k - 0.1)
\end{aligned}
\]
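
The reference signals can be generated as follows. The sketch evaluates the formulas exactly as written, treating \(k\) as the argument; whether \(k\) counts samples or seconds is not stated, so that interpretation is left to the caller. The function name is hypothetical.

```python
import numpy as np

def reference(k):
    """Return [phi_r, theta_r, psi_r] for scalar or array k (last axis has size 3)."""
    k = np.asarray(k, dtype=float)
    phi = 0.8 * np.cos(0.2 * k - 0.1)
    theta = 0.6 * np.sin(0.05 * np.pi * k)
    psi = 0.5 * np.cos(0.2 * k - 0.1)
    return np.stack([phi, theta, psi], axis=-1)
```

Calling `reference(np.arange(200))` yields an array of shape `(200, 3)` with one row per step.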

We first compare the tracking performance under different fixed prediction horizons. Simulation results show that a shorter horizon yields faster convergence but larger steady-state error, while a longer horizon improves smoothness at the cost of slower response. Our adaptive scheme dynamically balances these aspects.

Under external sinusoidal disturbances (amplitude 0.1, frequency 0.0318 Hz) added to all attitude channels, the proposed DO-AFSMPC exhibits superior disturbance rejection. The root mean square error (RMSE) values for roll, pitch, and yaw are summarized in Table 4.

Table 4: RMSE Comparison under External Disturbances
Controller	\(\phi\) RMSE	\(\theta\) RMSE	\(\psi\) RMSE
SMPC	0.0412	0.0518	0.0453
AFSMPC	0.0307	0.0377	0.0326
DO-AFSMPC	0.0269	0.0291	0.0301
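
To make the comparison concrete, the sketch below shows how a per-axis RMSE is computed and derives relative reductions from the Table 4 values. The function names are hypothetical; the percentages are computed from the table and are not separately reported results.

```python
import numpy as np

# RMSE values from Table 4, ordered as (phi, theta, psi).
TABLE4 = {
    "SMPC": (0.0412, 0.0518, 0.0453),
    "AFSMPC": (0.0307, 0.0377, 0.0326),
    "DO-AFSMPC": (0.0269, 0.0291, 0.0301),
}

def rmse(y, r):
    """Per-axis RMSE between outputs y and references r (rows are time steps)."""
    y, r = np.asarray(y, dtype=float), np.asarray(r, dtype=float)
    return np.sqrt(np.mean((y - r) ** 2, axis=0))

def reduction_pct(baseline, improved):
    """Relative RMSE reduction of `improved` over `baseline`, in percent."""
    b, i = np.array(TABLE4[baseline]), np.array(TABLE4[improved])
    return 100.0 * (b - i) / b
```

From these values, DO-AFSMPC reduces the RMSE relative to SMPC by roughly 35% in roll, 44% in pitch, and 34% in yaw.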

In the model mismatch scenario (inertia parameters varied by ±20% randomly), DO-AFSMPC maintains the smallest tracking error and suppresses overshoot effectively. The control inputs are smoother compared to other methods, demonstrating enhanced robustness.

Ablation experiments confirm the necessity of each component. Removing Q-learning degrades tracking accuracy. Removing individual reward components (the tracking, smoothness, or volatility reward) leads to higher error or larger oscillations, which supports the designed reward structure.

Hyperparameter sensitivity analysis shows that optimal performance is achieved when the learning rate \(\alpha \in [0.2, 0.4]\) and discount factor \(\gamma \in [0.55, 0.6]\).

In conclusion, the proposed DO-AFSMPC framework improves quadrotor attitude control in simulation under external disturbances and model mismatch, achieving the lowest RMSE on all three axes in Table 4. The integration of T-S fuzzy modeling, UKF, disturbance observer, adaptive horizon, and Q-learning provides a comprehensive solution for robust and precise control. Future work will focus on hardware-in-the-loop validation to bridge the gap between simulation and real-world deployment.