Navigation Algorithms for Quadrotor Drones Based on Stereo Visual-Inertial SLAM

The autonomous navigation of quadrotor drones in GPS-denied or dynamically complex environments remains a significant challenge. For a quadrotor drone to truly operate independently, it must be able to understand its position within an unknown space and simultaneously build a map of its surroundings. While vision-based Simultaneous Localization and Mapping (SLAM) systems have made considerable progress, they exhibit critical weaknesses, such as low accuracy and susceptibility to tracking failure during rapid motion or in environments lacking distinct visual features. To address these limitations for robust quadrotor drone operation, this work presents a tightly-coupled Stereo Visual-Inertial SLAM (VI-SLAM) algorithm founded on nonlinear optimization.

The core principle is to leverage the complementary strengths of cameras and an Inertial Measurement Unit (IMU). A camera provides rich visual information from which accurate pose estimates can be derived, but it is susceptible to motion blur and to featureless scenes. An IMU, consisting of accelerometers and gyroscopes, offers high-frequency measurements of acceleration and angular velocity, providing excellent short-term motion estimates but suffering from significant drift over time due to sensor bias and noise. By fusing these two data streams in a tightly-coupled manner, where a single state estimator jointly processes raw or pre-processed measurements from both sensors, we achieve a navigation system for a quadrotor drone that is significantly more accurate, robust, and reliable than vision-only solutions.

The proposed algorithm is built upon the foundation of a state-of-the-art visual SLAM system but integrates IMU data at a fundamental level. The process begins with a careful initialization phase to estimate initial IMU biases. Subsequently, IMU measurements between camera frames are pre-integrated to provide relative motion constraints. These constraints are then fused with visual observations within a nonlinear optimization framework, specifically a local bundle adjustment with marginalization. Finally, loop closure detection and pose-graph optimization are employed to eliminate accumulated drift and ensure global consistency of the map built by the quadrotor drone. The performance of this system is rigorously validated using standard datasets and through real-world experiments on a custom-built quadrotor drone platform.

System Overview and Mathematical Foundation

The sensor suite for our quadrotor drone consists of a stereo camera and an IMU. Defining clear coordinate frames is essential. We denote the world frame as $W$, the body frame (aligned with the IMU) as $B$, and the camera frame as $C$. The transformation from the body frame to the world frame is represented by $T_{WB} = (R_{WB}, ^Wp_B) \in SE(3)$, where $R_{WB} \in SO(3)$ is the rotation matrix and $^Wp_B \in \mathbb{R}^3$ is the translation vector. The extrinsic calibration between the IMU and the camera, $T_{CB} = (R_{CB}, ^Cp_B)$, is assumed to be known and fixed.

The state of the quadrotor drone at time $k$ is defined as:
$$x_k = [R_{WB_k}, ^Wp_{B_k}, ^Wv_{B_k}, b_{g_k}, b_{a_k}]$$
where $^Wv_{B_k}$ is the velocity of the body frame in the world, and $b_g$, $b_a$ are the biases of the gyroscope and accelerometer, respectively. These biases are modeled as random walks: $\dot{b}_g(t) = \eta_{bg}, \dot{b}_a(t) = \eta_{ba}$, with $\eta$ representing Gaussian white noise.

The raw IMU measurements for angular velocity $\tilde{\omega}$ and acceleration $\tilde{a}$ are modeled as:
$$
\begin{aligned}
\tilde{\omega}(t) &= \omega(t) + b_g(t) + \eta_g(t) \\
\tilde{a}(t) &= a(t) + b_a(t) + \eta_a(t)
\end{aligned}
$$
where $\omega(t)$ and $a(t)$ are the true kinematic quantities.
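The measurement model above can be simulated directly. The following sketch (a minimal illustration; the function name and all noise parameters are assumptions, not values from this work) generates biased, noisy IMU readings with random-walk biases:

```python
import numpy as np

def simulate_imu(omega_true, a_true, dt, sigma_g=1e-3, sigma_a=1e-2,
                 sigma_bg=1e-5, sigma_ba=1e-4, seed=0):
    """Generate biased, noisy IMU readings from true angular-velocity and
    acceleration sequences (N x 3 arrays), following the model
    w~ = w + b_g + eta_g,  a~ = a + b_a + eta_a,
    with biases evolving as random walks (b_dot = eta_b)."""
    rng = np.random.default_rng(seed)
    b_g, b_a = np.zeros(3), np.zeros(3)
    omega_meas, a_meas = [], []
    for k in range(len(omega_true)):
        # random-walk bias update, discretized over one sample period
        b_g = b_g + rng.normal(0.0, sigma_bg * np.sqrt(dt), 3)
        b_a = b_a + rng.normal(0.0, sigma_ba * np.sqrt(dt), 3)
        omega_meas.append(omega_true[k] + b_g + rng.normal(0.0, sigma_g, 3))
        a_meas.append(a_true[k] + b_a + rng.normal(0.0, sigma_a, 3))
    return np.array(omega_meas), np.array(a_meas)
```

Such synthetic data is useful for unit-testing the pre-integration and residual code before running on real sensor streams.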

| Symbol | Description |
| --- | --- |
| $W$, $B$, $C$ | World, Body (IMU), and Camera frames |
| $T_{WB}$, $R_{WB}$, $^Wp_B$ | Pose of Body in World (transform, rotation, translation) |
| $^Wv_{B}$ | Velocity of Body in World |
| $b_g$, $b_a$ | Gyroscope and accelerometer biases |
| $\eta_g$, $\eta_a$, $\eta_{bg}$, $\eta_{ba}$ | Gyro, accel, and bias random-walk noises |
| $\Delta \tilde{R}_{ij}$, $\Delta \tilde{v}_{ij}$, $\Delta \tilde{p}_{ij}$ | Pre-integrated rotation, velocity, and position increments between keyframes $i$ and $j$ |

IMU Pre-Integration for Efficient State Propagation

The IMU samples at a frequency (e.g., 200-500 Hz) much higher than the camera frame rate (e.g., 20-30 Hz). Re-integrating all IMU measurements between two camera frames every time the state estimate changes would be computationally inefficient. Pre-integration avoids this by expressing the relative motion between two keyframes in the body frame of the earlier keyframe, independent of the initial state and of gravity.

Given IMU measurements between times $k=i$ and $k=j$, the discrete-time motion equations for the quadrotor drone body are:
$$
\begin{aligned}
R_{WB_{k+1}} &= R_{WB_k} \cdot \text{Exp}\left((\tilde{\omega}_k - b_{g_k} - \eta_{g_k})\Delta t\right) \\
^Wv_{B_{k+1}} &= ^Wv_{B_k} + g_W \Delta t + R_{WB_k} (\tilde{a}_k - b_{a_k} - \eta_{a_k}) \Delta t \\
^Wp_{B_{k+1}} &= ^Wp_{B_k} + ^Wv_{B_k} \Delta t + \frac{1}{2}g_W \Delta t^2 + \frac{1}{2} R_{WB_k} (\tilde{a}_k - b_{a_k} - \eta_{a_k}) \Delta t^2
\end{aligned}
$$
where $g_W$ is the gravity vector in the world frame, and $\text{Exp}(\cdot)$ maps a vector in $\mathfrak{so}(3)$ to $SO(3)$.
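As a concrete check of these propagation equations, a minimal Python sketch (noise terms dropped; the function names are illustrative, not from this work) performs one discrete integration step:

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: map a rotation vector in R^3 to SO(3)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def propagate(R, v, p, omega_m, a_m, b_g, b_a, dt,
              g=np.array([0.0, 0.0, -9.81])):
    """One discrete IMU step for (R_WB, v, p), noise terms omitted."""
    a_world = R @ (a_m - b_a)                 # bias-corrected accel in world
    R_next = R @ so3_exp((omega_m - b_g) * dt)
    v_next = v + g * dt + a_world * dt
    p_next = p + v * dt + 0.5 * g * dt**2 + 0.5 * a_world * dt**2
    return R_next, v_next, p_next
```

A sanity check: a stationary, level IMU measures the specific force $[0, 0, 9.81]^T$, and propagation should then leave the state unchanged.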

To separate the incremental motion from the initial state $x_i$, we define the pre-integration terms $\Delta R_{ij}$, $\Delta v_{ij}$, and $\Delta p_{ij}$ in the body frame at time $i$:
$$
\begin{aligned}
\Delta R_{ij} &\triangleq R_{WB_i}^T R_{WB_j} = \prod_{k=i}^{j-1} \text{Exp}\left((\tilde{\omega}_k - b_{g_k} - \eta_{g_k})\Delta t\right) \\
\Delta v_{ij} &\triangleq R_{WB_i}^T (^Wv_{B_j} - ^Wv_{B_i} - g_W \Delta t_{ij}) = \sum_{k=i}^{j-1} \Delta R_{ik} (\tilde{a}_k - b_{a_k} - \eta_{a_k}) \Delta t \\
\Delta p_{ij} &\triangleq R_{WB_i}^T \left(^Wp_{B_j} - ^Wp_{B_i} - ^Wv_{B_i} \Delta t_{ij} - \frac{1}{2} g_W \Delta t_{ij}^2 \right) \\
&= \sum_{k=i}^{j-1} \left[ \Delta v_{ik} \Delta t + \frac{1}{2} \Delta R_{ik} (\tilde{a}_k - b_{a_k} - \eta_{a_k}) \Delta t^2 \right]
\end{aligned}
$$
These quantities depend only on IMU measurements and biases between times $i$ and $j$, not on the states $R_{WB_i}$, $^Wv_{B_i}$, $^Wp_{B_i}$, or gravity. When biases are updated, instead of re-integrating from scratch, a first-order approximation is used to efficiently update the pre-integrated terms and their covariance matrices $\Sigma_{ij}$. The constraint between two keyframe states $x_i$ and $x_j$ for the quadrotor drone becomes:
$$
\begin{aligned}
R_{WB_j} &= R_{WB_i} \Delta R_{ij} \text{Exp}\left(\frac{\partial \Delta \bar{R}_{ij}}{\partial b_g} \delta b_g \right) \\
^Wv_{B_j} &= ^Wv_{B_i} + g_W \Delta t_{ij} + R_{WB_i} \left( \Delta v_{ij} + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_g} \delta b_g + \frac{\partial \Delta \bar{v}_{ij}}{\partial b_a} \delta b_a \right) \\
^Wp_{B_j} &= ^Wp_{B_i} + ^Wv_{B_i} \Delta t_{ij} + \frac{1}{2} g_W \Delta t_{ij}^2 + R_{WB_i} \left( \Delta p_{ij} + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_g} \delta b_g + \frac{\partial \Delta \bar{p}_{ij}}{\partial b_a} \delta b_a \right)
\end{aligned}
$$
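The accumulation of the pre-integrated terms $\Delta R_{ij}$, $\Delta v_{ij}$, and $\Delta p_{ij}$ can be sketched as follows (a simplified nominal version: noise terms dropped, biases held fixed over the interval, and illustrative function names):

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: map a rotation vector in R^3 to SO(3)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(omega_meas, a_meas, b_g, b_a, dt):
    """Accumulate Delta R_ij, Delta v_ij, Delta p_ij from the IMU samples
    between keyframes i and j. Note the update order: dp uses the current
    dv and dR (i.e., Delta R_ik before incorporating sample k)."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(omega_meas, a_meas):
        a_body_i = dR @ (a - b_a)                 # Delta R_ik (a~ - b_a)
        dp = dp + dv * dt + 0.5 * a_body_i * dt**2
        dv = dv + a_body_i * dt
        dR = dR @ so3_exp((w - b_g) * dt)
    return dR, dv, dp
```

With constant acceleration and no rotation, this reproduces the expected kinematics ($\Delta v = a\,t$, $\Delta p = \tfrac{1}{2} a\,t^2$), which is a convenient unit test.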

Robust Initialization of Visual-Inertial States

A critical challenge for tight coupling is obtaining an accurate initial estimate of the quadrotor drone’s state, particularly scale, velocity, gravity direction, and IMU biases. A poor initial guess can lead to divergence. Our initialization procedure uses a short period of vision-only SLAM to bootstrap the inertial parameters.

  1. Gyroscope Bias Calibration: We collect several keyframes from the visual odometry. Assuming constant gyroscope bias during this short window, we estimate it by minimizing the difference between the relative rotation from pre-integration and the relative rotation from vision:
    $$ \arg \min_{\delta b_g} \sum_{i=1}^{N-1} \left\| \text{Log}\left( \Delta R_{i,i+1} \text{Exp}\left( \frac{\partial \Delta \bar{R}_{i,i+1}}{\partial b_g} \delta b_g \right)^T (R_{WB_i}^T R_{WB_{i+1}}) \right) \right\|^2 $$
    where $R_{WB_{(\cdot)}} = R_{WC_{(\cdot)}} R_{CB}$ and $R_{WC}$ is from the visual odometry.
  2. Scale, Gravity, and Accelerometer Bias Initialization: With the gyroscope bias known, the IMU measurements are re-preintegrated. We first ignore the accelerometer bias and solve for gravity (and the scale $s$, which for a stereo quadrotor drone is directly observable and is simply fixed to 1) using a linear equation system derived from the pre-integrated motion constraints and visual poses over a series of keyframes:
    $$ \begin{bmatrix} \lambda(k) & \beta(k) \end{bmatrix} \begin{bmatrix} s \\ g_W \end{bmatrix} = \gamma(k) $$
    After obtaining an initial gravity estimate $\hat{g}_W$, we refine it and solve for the accelerometer bias $b_a$ by modeling gravity as a vector subject to small perturbations: $g_W \approx R_{WI} \, ^I g_G - R_{WI} (^I g_G)^{\wedge} \delta \theta$, where $^Ig_G = [0, 0, -9.81]^T$.
  3. Velocity Initialization: With estimates for $b_g$, $b_a$, $g_W$, and visual poses, the velocities for each keyframe can be solved linearly using the pre-integration constraint involving $\Delta v_{ij}$.

This initialization provides a consistent starting point for the tightly-coupled nonlinear optimization, enabling the quadrotor drone to begin its accurate state estimation.
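After linearization, step 1 above reduces to a small least-squares problem. The sketch below assumes the per-interval bias Jacobians $\partial \Delta \bar{R}_{i,i+1} / \partial b_g$ are available, and uses the first-order approximation $\text{Log}(\Delta R_{i,i+1}^T R^{\text{vis}}_{i,i+1}) \approx J_i \, \delta b_g$; it is a simplified stand-in for the full on-manifold minimization, not the exact implementation:

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector in R^3 -> SO(3)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Inverse of so3_exp: SO(3) -> rotation vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    return theta / (2 * np.sin(theta)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def estimate_gyro_bias(dR_preint, R_rel_vision, J_bg):
    """Stack the linearized residuals e_i = Log(dR_i^T R_i^vis) ~ J_i db_g
    over all keyframe pairs and solve for the bias correction db_g."""
    A = np.vstack(J_bg)                                   # (3N, 3)
    b = np.concatenate([so3_log(dR.T @ Rr)
                        for dR, Rr in zip(dR_preint, R_rel_vision)])
    db_g, *_ = np.linalg.lstsq(A, b, rcond=None)
    return db_g
```

On synthetic data with a known constant bias, the linearized solve recovers the bias to within the second-order approximation error.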

Tightly-Coupled Nonlinear Optimization for State Estimation

The heart of the navigation system for the quadrotor drone is a window-based nonlinear optimizer that jointly minimizes visual reprojection errors and IMU pre-integration errors. For a window of $N$ recent keyframes, the full state vector $\mathcal{X}$ is:
$$ \mathcal{X} = [x_1, x_2, \dots, x_N, \lambda_1, \lambda_2, \dots] $$
where $x_k$ is the IMU state defined earlier and $\lambda_l$ are inverse depths or 3D positions of landmarks observed within the window.

The optimization problem is:
$$ \arg \min_{\mathcal{X}} \left( \sum_{(k,l)} \rho \left( \| r_{\mathcal{C}}(z^{k}_{l}, \mathcal{X}) \|^2_{\Sigma_{\mathcal{C}}} \right) + \sum_{i=1}^{N-1} \rho \left( \| r_{\mathcal{I}}(\Delta \tilde{z}_{i,i+1}, \mathcal{X}) \|^2_{\Sigma_{\mathcal{I}}} \right) \right) $$
where $\rho$ is a robust kernel function (e.g., Huber) to handle outliers.
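For illustration, the Huber kernel $\rho$ applied to a squared residual magnitude can be written as (a standard formulation, sketched here in Python with an illustrative threshold):

```python
import numpy as np

def huber(s, delta=1.0):
    """Huber robust kernel rho applied to a squared residual s = ||r||^2:
    quadratic for small residuals, linear beyond delta, which bounds the
    influence of outlier measurements on the optimization."""
    r = np.sqrt(s)
    return np.where(r <= delta, s, 2.0 * delta * r - delta**2)
```

In practice, $\delta$ is chosen per residual type (e.g., from the expected pixel noise for reprojection errors), and solvers such as Ceres apply the kernel through their loss-function interface.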

Visual Reprojection Error ($r_{\mathcal{C}}$): For a landmark $\lambda_l$ observed in keyframe $k$, the error is the difference between its measured pixel location $z^k_l$ and its predicted projection $\pi(\cdot)$. For a stereo quadrotor drone:
$$ r_{\mathcal{C}} = \begin{bmatrix} u_L \\ v_L \\ u_R \end{bmatrix} - \pi\left( T_{CB} T_{WB_k}^{-1} \cdot ^Wp_l \right) $$
where $\pi(\cdot)$ is the stereo projection model incorporating focal lengths $(f_x, f_y)$, principal points $(c_x, c_y)$, and baseline $b$.
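A minimal sketch of the stereo projection model $\pi(\cdot)$, assuming rectified images with a horizontal baseline (intrinsic values in the test are illustrative):

```python
import numpy as np

def stereo_project(p_c, fx, fy, cx, cy, baseline):
    """Project a 3-D point in the left camera frame to (u_L, v_L, u_R)
    with a pinhole model; the right camera is offset by the baseline
    along the +x axis of the rectified stereo rig."""
    x, y, z = p_c
    u_l = fx * x / z + cx
    v_l = fy * y / z + cy
    u_r = fx * (x - baseline) / z + cx
    return np.array([u_l, v_l, u_r])
```

The disparity $u_L - u_R = f_x b / z$ makes depth, and hence metric scale, directly observable for the stereo rig.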

IMU Pre-integration Error ($r_{\mathcal{I}}$): This term constrains consecutive IMU states $x_i$ and $x_{i+1}$:
$$ r_{\mathcal{I}} = \begin{bmatrix} r_{\Delta R_{ij}} \\ r_{\Delta v_{ij}} \\ r_{\Delta p_{ij}} \\ r_{\Delta b} \end{bmatrix} $$
with individual components defined as:
$$
\begin{aligned}
r_{\Delta R_{ij}} &= \text{Log}\left( \left( \Delta \bar{R}_{ij} \text{Exp}(J^{\Delta R}_{b_g} \delta b_{g_i}) \right)^T \cdot R_{WB_i}^T R_{WB_j} \right) \\
r_{\Delta v_{ij}} &= R_{WB_i}^T (^Wv_{B_j} - ^Wv_{B_i} - g_W \Delta t_{ij}) - (\Delta \bar{v}_{ij} + J^{\Delta v}_{b_g} \delta b_{g_i} + J^{\Delta v}_{b_a} \delta b_{a_i}) \\
r_{\Delta p_{ij}} &= R_{WB_i}^T \left( ^Wp_{B_j} - ^Wp_{B_i} - ^Wv_{B_i} \Delta t_{ij} - \frac{1}{2}g_W \Delta t_{ij}^2 \right) - (\Delta \bar{p}_{ij} + J^{\Delta p}_{b_g} \delta b_{g_i} + J^{\Delta p}_{b_a} \delta b_{a_i}) \\
r_{\Delta b} &= [b_{g_j} – b_{g_i}, \; b_{a_j} – b_{a_i}]^T
\end{aligned}
$$
The Jacobians $J$ account for the effect of bias updates on the pre-integrated terms. This tight coupling allows the visual measurements to correct IMU drift and the IMU data to regularize the visual solution, especially during aggressive maneuvers of the quadrotor drone or temporary visual degradation.
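The velocity, position, and bias components of $r_{\mathcal{I}}$ can be sketched as follows (a simplified version: the rotation residual and the first-order bias-Jacobian corrections are omitted for brevity, and the function name is illustrative):

```python
import numpy as np

def imu_residual(Ri, vi, pi, vj, pj, dv_ij, dp_ij, dt_ij,
                 bg_i, ba_i, bg_j, ba_j,
                 g=np.array([0.0, 0.0, -9.81])):
    """Velocity, position, and bias-walk parts of the IMU residual between
    keyframes i and j, given the pre-integrated increments dv_ij, dp_ij."""
    r_dv = Ri.T @ (vj - vi - g * dt_ij) - dv_ij
    r_dp = Ri.T @ (pj - pi - vi * dt_ij - 0.5 * g * dt_ij**2) - dp_ij
    r_db = np.concatenate([bg_j - bg_i, ba_j - ba_i])
    return r_dv, r_dp, r_db
```

When the states exactly satisfy the pre-integration constraint, all components vanish, which is the condition the optimizer drives toward (weighted by $\Sigma_{\mathcal{I}}$).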

System Architecture for a Quadrotor Drone Platform

To validate the algorithm in practice, a quadrotor drone platform was developed. The hardware and software architecture is designed for real-time performance.

| Component | Specification | Role |
| --- | --- | --- |
| Flight Controller | Pixhawk (STM32F427 Cortex-M4 core) | Low-level motor control, IMU reading, communication with the companion computer |
| Companion Computer | NVIDIA Jetson TX2 module | Runs the VI-SLAM algorithm, processes camera images, and sends high-level velocity/pose commands to the flight controller |
| Vision Sensor | ZED stereo camera | Provides synchronized stereo images at VGA/HD resolution and a wide field of view for robust feature tracking |
| Software Framework | Robot Operating System (ROS) | Middleware for inter-process communication; manages sensor drivers (ZED, IMU), runs the SLAM node, and publishes navigation commands |

The system workflow is as follows: The ZED camera driver publishes left and right images as ROS topics. The VI-SLAM algorithm subscribes to these topics and the IMU data (typically read by the flight controller and forwarded). The algorithm runs the tracking, local mapping, and loop closing threads. The estimated quadrotor drone pose and velocity are then published as a ROS topic, which is consumed by a controller node that generates actuator commands for the flight controller. This modular architecture allows for robust testing and development.

Experimental Validation and Results

The performance of the proposed navigation algorithm was evaluated in two phases: first using a standard benchmark dataset for quantitative analysis, and second through real-world flights on the quadrotor drone platform.

1. Evaluation on the EuRoC MAV Dataset

The EuRoC dataset provides visual-inertial data from a micro aerial vehicle flying in indoor environments, along with ground truth from a motion capture system. It contains sequences of varying difficulty, from slow and smooth to fast and aggressive motion with illumination changes. We compared the proposed VI-SLAM against the baseline stereo ORB-SLAM2 (visual-only). The key metric is the Absolute Trajectory Error (ATE) Root Mean Square Error (RMSE).

| EuRoC Sequence | Description | ORB-SLAM2 RMSE (m) | Proposed VI-SLAM RMSE (m) | Accuracy Improvement |
| --- | --- | --- | --- | --- |
| V1_01_easy | Slow, simple texture | 0.035 | 0.018 | ~2.0x |
| V1_02_medium | Medium speed, good texture | 0.042 | 0.021 | ~2.0x |
| V1_03_difficult | Fast motion, close to walls | 0.085 (or fail) | 0.039 | >2.0x |
| V2_01_easy | Slow, large scene | 0.038 | 0.020 | ~1.9x |
| V2_02_medium | Medium speed, some blur | 0.051 | 0.026 | ~2.0x |

The results consistently show that the proposed VI-SLAM algorithm reduces the trajectory error by approximately half, effectively doubling the navigation accuracy of the quadrotor drone. More importantly, in challenging sequences like V1_03 where pure vision might fail due to motion blur, the inertial integration maintains tracking robustness.
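The ATE RMSE metric used in this comparison is straightforward to compute once the estimated trajectory has been aligned to ground truth (the alignment step itself, typically an SE(3) fit via the Umeyama method, is omitted in this sketch):

```python
import numpy as np

def ate_rmse(est_positions, gt_positions):
    """Absolute Trajectory Error RMSE between estimated and ground-truth
    positions (both N x 3), assuming the trajectories are time-associated
    and already aligned in a common frame."""
    err = est_positions - gt_positions
    return np.sqrt(np.mean(np.sum(err**2, axis=1)))
```

Because it aggregates over the whole trajectory, a single RMSE value summarizes both local tracking quality and accumulated drift.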

The initialization procedure was also tested. Plots of the estimated gyroscope and accelerometer biases over time for sequences V1_01 and V1_02 show convergence to stable values within the first 10-15 seconds, providing a reliable starting point for the nonlinear optimizer.

2. Real-World Flight Test on Quadrotor Drone Platform

The algorithm was deployed on the custom quadrotor drone. The drone was tasked with autonomous navigation in an indoor laboratory space approximately 8m x 8m. Ground truth was provided by an OptiTrack motion capture system. The quadrotor drone followed a predefined 3D trajectory involving take-off, horizontal figure-eight patterns, altitude changes, and landing.

The real-time performance was measured: the state estimation thread (tracking + local optimization) processed each stereo frame in approximately 50 ms on the Jetson TX2, which matches a 20 Hz camera rate and satisfies the real-time requirements for autonomous flight of a quadrotor drone.

The estimated trajectory from the VI-SLAM system was compared to the OptiTrack ground truth. The plots for the X, Y, and Z positions over a 120-second flight show very close agreement. The VI-SLAM estimate (solid line) faithfully follows the actual quadrotor drone motion (dashed line) throughout the flight envelope, including during accelerations and turns where a vision-only system might degrade. The average position error during stable flight was within 0.05m, demonstrating the practical viability of the algorithm for enabling precise autonomous navigation of a quadrotor drone in complex environments.

Conclusion and Future Work

This work presented a complete, tightly-coupled stereo visual-inertial SLAM system based on nonlinear optimization, specifically designed and validated for quadrotor drone navigation. By deeply integrating pre-integrated IMU constraints with visual feature measurements in a joint optimization framework, the system achieves significantly higher accuracy and robustness compared to pure visual SLAM. The proposed method addresses the critical initialization problem and efficiently manages sensor fusion in a sliding window, making it suitable for real-time operation on embedded hardware like the Jetson TX2.

The experimental validation on the EuRoC benchmark dataset confirms an average doubling of trajectory accuracy. Successful real-world autonomous flights of the quadrotor drone platform further demonstrate the system’s practical effectiveness. The integration provides the quadrotor drone with a reliable sense of self-localization and mapping, which is foundational for higher-level tasks like path planning and obstacle avoidance in unknown settings.

Future work will focus on enhancing the system’s robustness in highly dynamic environments. The current formulation assumes a static world. Integrating the ability to detect and discount moving objects (e.g., people, other robots) using semantic information or motion consistency checks would greatly improve the reliability of a quadrotor drone operating in populated spaces. Furthermore, exploring tightly-coupled fusion with additional sensors like barometers for improved altitude estimation or event cameras for handling extreme high-speed motion could push the capabilities of autonomous quadrotor drones even further. Finally, optimizing the computational footprint to enable deployment on even lower-power hardware would broaden the applicability of this technology.
