In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has enabled widespread applications in surveillance, reconnaissance, and disaster response. Among the various platforms, vertical take-off and landing (VTOL) UAVs, particularly quadrotors, have attracted significant attention for their agility, hovering capability, and low cost. However, many traditional VTOL UAV systems rely on sensors such as GPS for position, magnetometers for heading, and optical flow sensors for velocity, which add both cost and payload. To address these limitations, my research focuses on a robust image-based visual servoing (IBVS) control scheme for VTOL UAVs that uses only an inertial measurement unit (IMU) and a monocular camera, eliminating the need for external position, velocity, or heading measurements. The approach relies on visual feedback to track maneuvering targets, making it suitable for resource-constrained environments. The core challenge is to design a control system that handles the underactuated dynamics of a VTOL UAV while compensating for unknown target motion and measurement uncertainty. In this article, I present a complete framework for image-based target tracking, covering system modeling, controller design based on super-twisting sliding mode techniques, and simulation validation. By integrating image moments and a virtual image plane, the proposed method achieves stable and accurate tracking without relying on additional sensors.

The motivation for this research stems from the growing demand for autonomous VTOL UAVs that can operate in GPS-denied environments, such as indoor spaces or areas with magnetic interference. By using vision-based control, VTOL UAVs can perform tasks like following moving targets or inspecting structures without external aids. However, IBVS for VTOL UAVs is non-trivial due to the coupling between translational and rotational motions. When a VTOL UAV changes its attitude to achieve translation, the image dynamics become complex, requiring careful controller design. Previous works have addressed this using spherical image moments or perspective moments, but they often assume availability of heading or velocity measurements. My contribution is a novel control strategy that overcomes these limitations through a combination of image feature selection, virtual image plane transformation, and robust sliding mode control. This approach not only reduces sensor dependency but also enhances robustness against target maneuvers. The following sections detail the mathematical foundations, control architecture, and performance evaluation, with extensive use of formulas and tables to summarize key concepts. As VTOL UAVs continue to evolve, such vision-based methods will play a crucial role in enabling fully autonomous operations.
To begin, I define the mathematical model of a quadrotor VTOL UAV. The dynamics are derived from the Newton–Euler equations, treating the vehicle as a rigid body with mass m and inertia matrix J. Let the inertial frame be denoted as W = {Ow, w1, w2, w3} and the body-fixed frame as B = {Ob, b1, b2, b3}, where b1 points forward. The position of the center of mass in W is r ∈ ℝ³, and its velocity is v ∈ ℝ³. The rotation matrix R ∈ SO(3) maps vectors from B to W, and the angular velocity in B is Ω = [ω1 ω2 ω3]ᵀ. The control inputs are the total thrust f ∈ ℝ and the moment vector M ∈ ℝ³. The translational and rotational dynamics are given by:
$$ \dot{\mathbf{r}} = \mathbf{v}, $$
$$ \dot{\mathbf{v}} = g\mathbf{w}_3 - \frac{f}{m}\mathbf{R}\mathbf{b}_3, $$
$$ \dot{\mathbf{R}} = \mathbf{R} \text{sk}(\mathbf{\Omega}), $$
$$ \mathbf{J}\dot{\mathbf{\Omega}} + \mathbf{\Omega} \times \mathbf{J}\mathbf{\Omega} = \mathbf{M}, $$
where w3 = [0 0 1]ᵀ, b3 = [0 0 1]ᵀ, g is the gravitational acceleration, and sk(·) is the skew-symmetric operator. Because the thrust is fixed along b3, the vehicle is underactuated: translational motion requires attitude changes. This coupling is the key difficulty in controlling VTOL UAVs. To simplify the design, I decompose R into yaw and tilt components, R = RψRt, where Rψ is the yaw rotation about w3 and Rt captures the tilt (roll and pitch). The tilt matrix can be further factored as Rt = RθRϕ, with pitch angle θ and roll angle ϕ. The yaw kinematics are:
$$ \dot{\mathbf{R}}_{\psi} = \mathbf{R}_{\psi} \text{sk}(\dot{\psi}\mathbf{b}_3), $$
$$ \dot{\psi} = (\omega_2 \sin\phi + \omega_3 \cos\phi)/\cos\theta. $$
In practice, VTOL UAVs often lack heading measurements, so my control design avoids relying on ψ. Instead, I use the tilt matrix Rt, which can be estimated from IMU data. This is a critical adaptation for VTOL UAVs operating without magnetometers.
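To make the model concrete, here is a minimal numpy sketch of the rigid-body equations above; the function names and state layout are my own choices, and the default mass and inertia values are those used later in the simulation section.

```python
import numpy as np

def sk(w):
    """Skew-symmetric matrix: sk(w) @ x equals np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def quadrotor_rhs(v, R, Omega, f, M, m=0.455,
                  J=np.diag([0.43e-2, 0.43e-2, 1.02e-2]), g=9.81):
    """Right-hand side of the Newton-Euler model. The inertial third axis
    points downward, so gravity enters as +g*w3, matching the equations."""
    w3 = np.array([0.0, 0.0, 1.0])
    r_dot = v                                      # position kinematics
    v_dot = g * w3 - (f / m) * (R @ w3)            # R @ w3 = R b3, thrust axis in W
    R_dot = R @ sk(Omega)                          # attitude kinematics
    Omega_dot = np.linalg.solve(J, M - np.cross(Omega, J @ Omega))
    return r_dot, v_dot, R_dot, Omega_dot
```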
Next, I describe the image dynamics for the VTOL UAV’s camera. Assume a downward-facing camera mounted at the VTOL UAV’s center, with focal length λ. The camera frame C = {Oc, xc, yc, zc} has its origin at the lens center. To simplify analysis, I introduce a virtual image plane V = {Ov, xv, yv, zv} that is fixed to the VTOL UAV and horizontal (parallel to w1–w2 plane), sharing the same yaw as the camera. This virtual plane decouples image dynamics from attitude changes, a common technique for VTOL UAV visual servoing. Let a point P on the target have coordinates Pv in V. The relationship is:
$$ \mathbf{P}_v = \mathbf{R}_{\psi}^T (\mathbf{P} - \mathbf{O}_v). $$
The time derivative yields:
$$ \dot{\mathbf{P}}_v = -\text{sk}(\dot{\psi}\mathbf{b}_3)\mathbf{P}_v + \mathbf{v}_p - \mathbf{v}_v, $$
where vp is the target’s velocity and vv the camera’s velocity, both expressed in V. The image coordinates (uv, vv) in the virtual plane are obtained from the camera measurements (u, v) via:
$$ \begin{bmatrix} u_v \\ v_v \end{bmatrix} = \frac{1}{\bar{\mathbf{R}}_3 \mathbf{p}} \begin{bmatrix} \bar{\mathbf{R}}_1 \mathbf{p} \\ \bar{\mathbf{R}}_2 \mathbf{p} \end{bmatrix}, $$
with p = [u v 1]ᵀ and R̄1, R̄2, R̄3 the rows of Rt. Differentiating gives the image velocity:
$$ \begin{bmatrix} \dot{u}_v \\ \dot{v}_v \end{bmatrix} = \begin{bmatrix} -\lambda & 0 & u_v \\ 0 & -\lambda & v_v \end{bmatrix} \frac{(\mathbf{v}_v - \mathbf{v}_p)}{z_v} + \begin{bmatrix} v_v \\ -u_v \end{bmatrix} \dot{\psi}. $$
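As an illustrative sketch of this projection (the function name and conventions are assumed, following the equation above):

```python
import numpy as np

def to_virtual_plane(u, v, R_t):
    """Map a measured image point (u, v) onto the horizontal virtual plane
    using the tilt matrix R_t, following the projection equation above."""
    p = np.array([u, v, 1.0])
    return (R_t[:2, :] @ p) / (R_t[2, :] @ p)   # returns (u_v, v_v)
```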
For feature selection, I use perspective image moments to represent the target. Given N feature points pk = [uk vk 1]ᵀ, define the moments $m_{ij} = \sum_k u_k^i v_k^j$ and the central moments $\mu_{ij} = \sum_k (u_k - u_g)^i (v_k - v_g)^j$, where (ug, vg) = (m10/m00, m01/m00) is the centroid. The image area is a = µ20 + µ02. I choose the image feature vector q = [qx qy qz]ᵀ for VTOL UAV control:
$$ q_x = q_z u_g, \quad q_y = q_z v_g, \quad q_z = z_d \sqrt{a_d / a}, $$
where zd is the desired height and ad is the desired area. This formulation encodes relative position information, crucial for VTOL UAV tracking. The dynamics of q are:
$$ \dot{\mathbf{q}} = -\text{sk}(\dot{\psi}\mathbf{b}_3) \begin{bmatrix} q_x \\ q_y \\ q_{Dz} \end{bmatrix} - \mathbf{v}_v + \mathbf{v}_p, $$
where qDz is arbitrary, since the third column of sk(ψ̇b3) is zero and qDz therefore never enters the dynamics. Notably, zv = zd √(ad/a), which allows the height to be estimated without a direct measurement. Define the error δ = q − qd, where qd = [0 0 qdz]ᵀ for centered tracking. The error dynamics become:
$$ \dot{\mathbf{\delta}} = -\text{sk}(\dot{\psi}\mathbf{b}_3)\mathbf{\delta} - \mathbf{v}_v + \mathbf{v}_p. $$
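A compact sketch of this feature construction (unit point weights, per the moment definitions above; the function name and array layout are illustrative):

```python
import numpy as np

def image_features(pts, z_d, a_d):
    """Feature vector q from N virtual-plane points pts (an N x 2 array)."""
    u_g, v_g = pts.mean(axis=0)                   # centroid (m10/m00, m01/m00)
    du, dv = pts[:, 0] - u_g, pts[:, 1] - v_g
    a = np.sum(du**2) + np.sum(dv**2)             # image area a = mu20 + mu02
    q_z = z_d * np.sqrt(a_d / a)                  # doubles as the height estimate z_v
    return np.array([q_z * u_g, q_z * v_g, q_z])
```

The tracking error δ is then obtained by subtracting the reference qd = [0 0 qdz]ᵀ from this vector.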
Combining with VTOL UAV translational dynamics in V:
$$ \dot{\mathbf{v}}_v = -\text{sk}(\dot{\psi}\mathbf{b}_3)\mathbf{v}_v + \mathbf{u}_v, \quad \mathbf{u}_v = \mathbf{R}_t \mathbf{u}_b, $$
where ub = g Rtᵀb3 − (f/m)b3. After differentiation, I obtain the second-order error system:
$$ \ddot{\mathbf{\delta}} = -2\mathbf{M}_1 \dot{\mathbf{\delta}} - \mathbf{M}_2 \mathbf{\delta} - \mathbf{M}_1 \mathbf{v}_p - \dot{\mathbf{v}}_p + \mathbf{u}_v, $$
with M1 = sk(ψ̇b3) and M2 = sk(ψ̈b3) + sk(ψ̇b3)sk(ψ̇b3). Substituting uv yields:
$$ \ddot{\mathbf{\delta}} = -2\mathbf{M}_1 \dot{\mathbf{\delta}} - \mathbf{M}_2 \mathbf{\delta} - \mathbf{M}_1 \mathbf{v}_p - \dot{\mathbf{v}}_p + g\mathbf{b}_3 - \frac{f}{m} \mathbf{r}_3, $$
where r3 is the third column of Rt. This equation forms the basis for controller design for the VTOL UAV.
To summarize the system parameters and variables, I provide Table 1, which lists key symbols used in modeling the VTOL UAV and image dynamics.
| Symbol | Description | Unit |
|---|---|---|
| m | Mass of VTOL UAV | kg |
| J | Inertia matrix | kg·m² |
| r | Position in inertial frame | m |
| v | Velocity in inertial frame | m/s |
| R | Rotation matrix | — |
| Ω | Angular velocity | rad/s |
| f | Total thrust | N |
| M | Control moment | N·m |
| λ | Focal length | pixel |
| q | Image feature vector | — |
| δ | Image error vector | — |
The controller design for the VTOL UAV consists of two main parts: an IBVS position controller and an attitude controller. The position controller generates thrust and attitude commands from image features, while the attitude controller tracks these commands. This decoupled structure is common for underactuated VTOL UAVs. A block diagram of the control system is shown in Figure 1, illustrating the flow from image processing to actuator inputs for the VTOL UAV.
Since the image velocity δ̇ is not measured, I design a high-order sliding mode observer (HOSMO) to estimate it. Let x̂1, x̂2, and x̂3 be estimates of δ, δ̇, and the lumped disturbance, respectively, and define the estimation error e1 = δ − x̂1. The HOSMO is:
$$ \dot{\hat{\mathbf{x}}}_1 = \hat{\mathbf{x}}_2 + k_1 |\mathbf{e}_1|^{2/3} \text{sgn}(\mathbf{e}_1), $$
$$ \dot{\hat{\mathbf{x}}}_2 = \hat{\mathbf{x}}_3 + \mathbf{u} + \mathbf{f}(\mathbf{\delta}, \hat{\mathbf{x}}_2) + k_2 |\mathbf{e}_1|^{1/3} \text{sgn}(\mathbf{e}_1), $$
$$ \dot{\hat{\mathbf{x}}}_3 = k_3 \text{sgn}(\mathbf{e}_1), $$
where f(δ, x̂2) = −2M1x̂2 − M2δ + gb3, and u is the virtual control input. Gains k1, k2, k3 > 0 ensure finite-time convergence of x̂2 to δ̇. This observer is robust to uncertainties, which is vital for VTOL UAV operation in dynamic environments.
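A discrete-time sketch of the observer follows (forward-Euler, with |e1|^p and sgn applied elementwise, an interpretation I assume for the vector case; the gains are those used in the simulation section):

```python
import numpy as np

def hosmo_step(x1, x2, x3, delta, u, M1, M2, dt,
               k1=20.0, k2=10.0, k3=10.0, g=9.81):
    """One forward-Euler step of the HOSMO defined above."""
    e1 = delta - x1                                # measured error minus estimate
    s1 = np.sign(e1)
    f_val = -2.0 * (M1 @ x2) - M2 @ delta + g * np.array([0.0, 0.0, 1.0])
    x1_next = x1 + dt * (x2 + k1 * np.abs(e1) ** (2.0 / 3.0) * s1)
    x2_next = x2 + dt * (x3 + u + f_val + k2 * np.abs(e1) ** (1.0 / 3.0) * s1)
    x3_next = x3 + dt * k3 * s1                    # disturbance estimate dynamics
    return x1_next, x2_next, x3_next
```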
For the position controller, I consider the error dynamics with the disturbance D = −(M1vp + v̇p), which represents the unknown target maneuver. Assume D is bounded, ‖D‖ < σd. The control objective is to drive δ to zero. Define the sliding surface s = c1δ + x̂2, with c1 > 0. I propose a super-twisting sliding mode controller:
$$ \mathbf{u} = -c_1 \hat{\mathbf{x}}_2 - \int_0^t k_3 \text{sgn}(\mathbf{e}_1) d\tau - k_2 |\mathbf{e}_1|^{1/3} \text{sgn}(\mathbf{e}_1) - \lambda_1 |\mathbf{s}|^{1/2} \text{sgn}(\mathbf{s}) - \int_0^t \lambda_2 \text{sgn}(\mathbf{s}) d\tau - \mathbf{f}(\mathbf{\delta}, \hat{\mathbf{x}}_2), $$
where λ1, λ2 > 0 are gains. The control signal is continuous, which mitigates the chattering typical of first-order sliding mode designs. Under the assumption that the observer and attitude dynamics converge faster than the position loop, the closed-loop system reaches s = 0 in finite time, implying exponential convergence of δ → 0; the proof uses Lyapunov analysis, as detailed in my prior work. The controller output u is then converted to thrust and attitude commands. Let u = −(fd/m)rd, where rd is the desired tilt direction and fd the desired thrust. Then:
$$ \mathbf{r}_d = -\frac{\mathbf{u}}{\|\mathbf{u}\|}, \quad f_d = m \|\mathbf{u}\|. $$
From rd = [rd1 rd2 rd3]ᵀ, the desired pitch and roll angles are computed as:
$$ \theta_d = \arctan\left(\frac{r_{d1}}{r_{d3}}\right), \quad \phi_d = \arctan\left(\frac{-\cos(\theta_d) r_{d2}}{r_{d3}}\right). $$
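The following sketch combines the control law with the command extraction (the two integral terms are carried in z1 and z2 via forward-Euler; the gains come from the simulation section, and the arctan form assumes rd3 > 0, i.e. near-hover operation):

```python
import numpy as np

def stsmc_step(delta, x2_hat, e1, z1, z2, M1, M2, dt,
               c1=1.0, k2=10.0, k3=10.0, lam1=3.0, lam2=6.0,
               m=0.455, g=9.81):
    """Super-twisting position control plus thrust/attitude command extraction."""
    s = c1 * delta + x2_hat                        # sliding surface
    z1 = z1 + dt * k3 * np.sign(e1)                # integral of k3*sgn(e1)
    z2 = z2 + dt * lam2 * np.sign(s)               # integral of lam2*sgn(s)
    f_val = -2.0 * (M1 @ x2_hat) - M2 @ delta + g * np.array([0.0, 0.0, 1.0])
    u = (-c1 * x2_hat - z1
         - k2 * np.abs(e1) ** (1.0 / 3.0) * np.sign(e1)
         - lam1 * np.abs(s) ** 0.5 * np.sign(s)
         - z2 - f_val)
    f_d = m * np.linalg.norm(u)                    # desired thrust, assumes u != 0
    r_d = -u / np.linalg.norm(u)                   # desired tilt direction
    theta_d = np.arctan(r_d[0] / r_d[2])           # desired pitch
    phi_d = np.arctan(-np.cos(theta_d) * r_d[1] / r_d[2])  # desired roll
    return f_d, theta_d, phi_d, z1, z2
```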
The desired tilt matrix is Rt,d = RθdRϕd. For yaw control, since heading measurement is absent, I either suppress yaw rotation (i.e., set ω3 = 0) or use visual information to align with the target. The relative yaw angle α can be computed from image moments:
$$ \alpha = \frac{1}{2} \arctan\left(\frac{2\mu_{11}}{\mu_{20} – \mu_{02}}\right). $$
If the target yaw ψt is known, the desired yaw is ψd = ψt − α; otherwise, keeping α constant suffices for many VTOL UAV tasks.
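A small sketch of this moment-based yaw computation (I use arctan2 for quadrant disambiguation, a common variant of the formula above):

```python
import numpy as np

def relative_yaw(pts):
    """Relative yaw alpha from the second-order central image moments."""
    du = pts[:, 0] - pts[:, 0].mean()
    dv = pts[:, 1] - pts[:, 1].mean()
    mu11, mu20, mu02 = np.sum(du * dv), np.sum(du**2), np.sum(dv**2)
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```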
The attitude controller tracks Rt,d. Define the attitude error as Re = Rt Rt,dᵀ. A proportional-derivative (PD) controller on SO(3) is used:
$$ \mathbf{M} = \mathbf{\Omega} \times \mathbf{J}\mathbf{\Omega} - \mathbf{K}_p \text{sk}^{-1}(\log(\mathbf{R}_e)) - \mathbf{K}_d \mathbf{\Omega}, $$
where Kp, Kd are positive definite matrices, and log(·) is the logarithmic map on SO(3); the term Ω × JΩ cancels the gyroscopic term in the rotational dynamics. This controller ensures exponential convergence of Re to the identity, provided the initial attitude error is less than 180°. The attitude loop is fast and accurate, which is essential for VTOL UAV stability.
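A sketch of the attitude law, including a standard closed-form SO(3) logarithm (valid away from the 180° singularity; function names are illustrative):

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector sk^{-1}(log(R)) for R in SO(3), below 180 degrees."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-9:
        return np.zeros(3)                          # identity rotation
    w = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]])               # vee map of R - R^T
    return angle / (2.0 * np.sin(angle)) * w

def attitude_pd(R_t, R_td, Omega, J, Kp, Kd):
    """PD attitude moment on SO(3) tracking the desired tilt matrix R_td."""
    R_e = R_t @ R_td.T
    return np.cross(Omega, J @ Omega) - Kp @ so3_log(R_e) - Kd @ Omega
```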
To evaluate the performance of the proposed control scheme for VTOL UAVs, I conducted numerical simulations in MATLAB/Simulink. The VTOL UAV parameters are based on a typical quadrotor: m = 0.455 kg, J = diag(0.43, 0.43, 1.02) × 10⁻² kg·m². The target is a planar object with four feature points. The desired height is zd = 5 m. The controller gains are tuned as follows: k1 = 20, k2 = 10, k3 = 10, λ1 = 3, λ2 = 6, c1 = 1. The simulation scenarios include target linear motion and S-shaped maneuvers, with a focus on tracking accuracy and control effort for the VTOL UAV.
Table 2 summarizes the simulation parameters for the VTOL UAV and target.
| Parameter | Value | Description |
|---|---|---|
| m | 0.455 kg | VTOL UAV mass |
| J1, J2 | 0.43 × 10⁻² kg·m² | Roll/pitch inertia |
| J3 | 1.02 × 10⁻² kg·m² | Yaw inertia |
| zd | 5 m | Desired height |
| λ | 500 pixels | Focal length |
| Target speed | 3 m/s | Maximum velocity |
| Simulation time | 40 s | Duration |
In the first scenario, the target moves linearly at 3 m/s, with 90° turns at 9 s, 19 s, and 29 s; a sketch of this motion profile is given below. The VTOL UAV starts at [6 5 −10]ᵀ m. Figure 2 shows the trajectories of the VTOL UAV and target. The VTOL UAV successfully tracks the target, with position errors shown in Figure 3. The errors peak during turns (up to 1.5 m) but quickly recover, demonstrating the robustness of the controller. The control inputs (thrust f and moments M) are smooth (Figures 4–5), avoiding excessive actuation. The Euler angles and angular velocities (Figures 6–7) remain within practical limits, confirming that the VTOL UAV adjusts tilt rather than yaw for translation, as intended.
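For reference, the target velocity profile of this scenario can be sketched as follows (the turn directions are my assumption; the text specifies only the turn times and the 90° magnitude):

```python
import numpy as np

def target_velocity(t, speed=3.0):
    """Piecewise-constant target velocity with 90-degree turns at 9, 19, 29 s."""
    n_turns = sum(t >= tc for tc in (9.0, 19.0, 29.0))
    heading = n_turns * np.pi / 2.0        # assumed: each turn adds +90 degrees
    return speed * np.array([np.cos(heading), np.sin(heading), 0.0])
```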
For the S-shaped maneuver, the target follows a sinusoidal path at 3 m/s. The VTOL UAV tracks accurately, with errors below 1 m (Figure 8). The control inputs (Figures 9–10) are again chattering-free, thanks to the super-twisting design. These results validate the effectiveness of the IBVS approach for VTOL UAVs tracking maneuvering targets.
I also tested the impact of velocity estimation delay on VTOL UAV performance. Delays of 50 ms and 100 ms were introduced in the observer. As expected, larger delays increase tracking error and cause slight oscillations (Figures 11–12), but the VTOL UAV remains stable. This highlights the importance of fast computation for real-time VTOL UAV control. To assess computational efficiency, I generated C code from the controller using Simulink Coder and deployed it on an STM32F407 microcontroller (168 MHz). The average execution time per cycle was 1.2 ms, well within typical VTOL UAV control rates (10–100 Hz). This confirms the feasibility of implementing the proposed algorithm on embedded VTOL UAV platforms.
Furthermore, I performed a virtual simulation in V-REP with a realistic camera model to validate the vision system. The VTOL UAV successfully hovered over a target using image moments, as shown in Figure 13. This demonstrates the practicality of the method for VTOL UAV applications in simulated environments.
In conclusion, I have presented a comprehensive image-based target tracking control method for VTOL UAVs that requires only an IMU and camera. By using perspective moments and a virtual image plane, the scheme estimates relative position without direct measurements. The super-twisting sliding mode controller and observer provide robustness against target maneuvers and measurement uncertainties, while ensuring smooth control actions. Simulations confirm stable tracking performance for various target motions, with the VTOL UAV adjusting tilt attitude for translation instead of slow yaw changes. The approach is computationally efficient and suitable for low-cost VTOL UAVs. Future work will involve real-world flight tests and extending the method to multi-VTOL UAV scenarios. As VTOL UAV technology advances, such vision-based controls will enable more autonomous and resilient operations in complex environments.
