Vision-Based Indoor Drone Formation Flight System

In recent years, drone formation flight has become an important direction in unmanned aerial vehicle (UAV) technology, extending what is possible beyond single-drone applications. As a researcher focused on autonomous systems, I have worked on the challenges of indoor drone formation, where GPS signals are unavailable. This article presents the design and implementation of a vision-based indoor drone formation flight system that uses AprilTags as visual landmarks for localization and control. The system performs trajectory tracking and formation maintenance without GPS, which removes a major limitation in confined environments. Throughout this work, the emphasis is on robust algorithms and practical experimentation, with the goal of advancing drone formation capabilities for applications such as surveillance, logistics, and collaborative tasks. Computer vision, multi-agent control, and real-time processing form the core of the approach, and I describe each in turn to explain the system’s design and performance.

The proliferation of drone technology has been driven by its simplicity, cost-effectiveness, and adaptability, but indoor operations pose unique challenges due to the absence of GPS. In my research, I have focused on a solution that uses visual cues for localization, specifically AprilTags (a robust fiducial system), to create a reliable reference map. This system allows drones to compute their position and velocity from camera images, which makes coordinated drone formation flight possible in environments like warehouses, factories, or disaster sites. Formation flight matters because it improves efficiency, redundancy, and task parallelism compared with a single drone. By designing this system, I aim to contribute to the growing body of work on autonomous swarms, with practical insights into hardware setup, software architecture, and control strategies. The following sections cover the hardware and software components, the algorithmic foundations, the experimental validation, and the broader implications for drone formation.

To begin, the hardware configuration is fundamental to the system’s success. I used four Parrot AR.Drone 2.0 quadrotors, each equipped with a downward-facing camera and a Wi-Fi module. These drones were chosen for their affordability and programmability, which make them well suited to research on drone formation. The cameras capture images of the ground, where AprilTags are arranged in a grid pattern to serve as landmarks. A router connects the drones to four graphics workstations, enabling real-time image processing and data exchange. Each workstation handles the visual data from one drone and computes its position and velocity at high frequency. The AprilTags map has a fixed layout: starting from an origin point (Tag ID 0), tags are placed in rows and columns with consistent spacing, providing a coordinate system for localization. This hardware foundation supports the computations required for drone formation flight, as illustrated in the system block diagram. Pairing each drone with its own workstation over Wi-Fi also leaves room to expand to larger drone formations.
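
The grid layout described above can be sketched as a lookup from tag ID to world coordinates. This is a minimal sketch, not the map used in the experiments: the row-major numbering, the axis directions, the number of columns, the spacing, and the tag size are all assumptions, and the function names `tag_center` and `tag_corners` are hypothetical.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def tag_center(tag_id: int, columns: int, spacing_m: float) -> Point3:
    """Center of a tag in world coordinates, with Tag ID 0 at the origin."""
    if tag_id < 0 or columns <= 0 or spacing_m <= 0:
        raise ValueError("tag_id must be >= 0; columns and spacing_m must be > 0")
    # Assumption: IDs increase along a row first, then continue on the next row.
    row, col = divmod(tag_id, columns)
    # Assumption: x runs along a row, y across rows, and the floor is z = 0.
    return (col * spacing_m, row * spacing_m, 0.0)


def tag_corners(tag_id: int, columns: int, spacing_m: float,
                tag_size_m: float) -> List[Point3]:
    """Four corners of a tag, counter-clockwise seen from above, starting at (-x, -y)."""
    cx, cy, cz = tag_center(tag_id, columns, spacing_m)
    h = tag_size_m / 2.0
    return [
        (cx - h, cy - h, cz),
        (cx + h, cy - h, cz),
        (cx + h, cy + h, cz),
        (cx - h, cy + h, cz),
    ]
```

For example, with 5 tags per row and 0.5 m spacing (both hypothetical values), `tag_center(7, 5, 0.5)` places Tag ID 7 in the second row, third column. When these corners are passed to a PnP solver, their order has to match the corner order reported by the tag detector.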

In terms of software architecture, I developed a multi-layered framework to manage the drone formation. The software runs on the ground station and individual workstations, coordinating tasks such as image analysis, control algorithm execution, and data fusion. A key component is the AprilTags library, which detects tags in camera images and estimates their pose relative to the camera. From this, I derive the drone’s position in the world coordinate system through a perspective-n-point (PnP) algorithm. The software also implements sensor fusion, combining visual data with onboard IMU and optical flow measurements to achieve a high update rate of 200 Hz for position information. This is critical for stable drone formation control, as delays can lead to instability. The control logic is structured in a hierarchical manner: a position loop generates velocity commands, which are then fed into a velocity loop to produce attitude commands for the drones. This cascaded approach ensures smooth and responsive control, essential for maintaining precise drone formation patterns. The overall software framework is designed for modularity, allowing easy adjustments to formation shapes or trajectories.

The algorithmic core of this system revolves around three main aspects: visual localization, single-drone control, and multi-drone formation coordination. For visual localization, I employ AprilTags to compute the transformation between camera and world coordinates. Let the world coordinate system be denoted as $O_w$, with origin at the center of Tag ID 0. The camera coordinate system is $O_c$, and the drone body coordinate system is $O_p$. When a camera captures an image, the AprilTags algorithm identifies tag corners in pixel coordinates. Given the known world coordinates of these corners from the map, the PnP algorithm solves for the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ that transform points from $O_w$ to $O_c$. The transformation can be expressed as:

$$ \mathbf{P}_c = \mathbf{R} \mathbf{P}_w + \mathbf{t} $$

where $\mathbf{P}_w$ is a point in world coordinates and $\mathbf{P}_c$ is the corresponding point in camera coordinates. The drone’s position $\mathbf{p}_d$ in $O_w$ follows from the inverse transformation: setting $\mathbf{P}_c = \mathbf{0}$ gives the camera origin at $-\mathbf{R}^{\top} \mathbf{t}$ in world coordinates. To enhance accuracy, I fuse this visual data with inertial measurements. Because the visual estimate arrives with a time delay $\tau$ (empirically set to 150 ms), the fused position $\hat{\mathbf{p}}_d$ at time $t$ is computed by adding the optical flow velocity $\mathbf{v}_{of}$ from the drone’s onboard sensors, integrated over that delay, to the delayed visual estimate:

$$ \hat{\mathbf{p}}_d(t) = \mathbf{p}_v(t - \tau) + \int_{t-\tau}^{t} \mathbf{v}_{of}(s) \, ds $$

where $\mathbf{p}_v$ is the visual position estimate. This fusion achieves a high-frequency output, crucial for real-time drone formation control.
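
The two steps above, inverting the PnP result and compensating the visual delay with optical flow, can be sketched as follows. This is a minimal sketch under stated assumptions: the PnP solver is taken to have already produced $\mathbf{R}$ and $\mathbf{t}$, the camera origin is treated as the drone position (ignoring any camera mounting offset), and the integral is approximated by holding each velocity sample until the next one arrives. The names `camera_position_world` and `DelayCompensatedPosition` are hypothetical.

```python
from collections import deque
from typing import Deque, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def camera_position_world(R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Solve P_c = R P_w + t for the camera origin (P_c = 0): P_w = -R^T t."""
    return tuple(-sum(R[r][c] * t[r] for r in range(3)) for c in range(3))


class DelayCompensatedPosition:
    """p_hat(t) = p_v(t - tau) + integral of v_of over [t - tau, t]."""

    def __init__(self, delay_s: float = 0.150) -> None:
        self.delay_s = delay_s
        # Each entry is (timestamp in seconds, velocity in m/s).
        self._flow: Deque[Tuple[float, Vec3]] = deque()

    def add_flow(self, stamp_s: float, velocity: Vec3) -> None:
        """Store one optical-flow velocity sample; stamps must be increasing."""
        self._flow.append((stamp_s, velocity))

    def fuse(self, now_s: float, delayed_visual_position: Vec3) -> Vec3:
        start = now_s - self.delay_s
        # Keep the newest sample at or before `start`, plus everything after it.
        while len(self._flow) >= 2 and self._flow[1][0] <= start:
            self._flow.popleft()
        samples = list(self._flow)
        moved = [0.0, 0.0, 0.0]
        for i, (stamp, velocity) in enumerate(samples):
            # Zero-order hold: a velocity is valid until the next sample arrives.
            seg_start = max(stamp, start)
            next_stamp = samples[i + 1][0] if i + 1 < len(samples) else now_s
            dt = min(next_stamp, now_s) - seg_start
            if dt > 0.0:
                for k in range(3):
                    moved[k] += velocity[k] * dt
        return tuple(delayed_visual_position[k] + moved[k] for k in range(3))
```

With a constant optical-flow velocity of 1 m/s along x and the 150 ms delay, the fused estimate is 0.15 m ahead of the delayed visual position, which is the displacement the drone covered while the image was being processed.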

For single-drone control, I implement a PID-based cascaded structure. The position error $\mathbf{e}_p = \mathbf{p}_{des} - \hat{\mathbf{p}}_d$ is used to generate a desired velocity $\mathbf{v}_{des}$ through a PID controller:

$$ \mathbf{v}_{des} = K_{p,p} \mathbf{e}_p + K_{i,p} \int \mathbf{e}_p \, dt + K_{d,p} \frac{d\mathbf{e}_p}{dt} $$

Then, the velocity error $\mathbf{e}_v = \mathbf{v}_{des} - \mathbf{v}_{est}$ (where $\mathbf{v}_{est}$ is the estimated velocity from sensor fusion) produces attitude commands $\boldsymbol{\theta}_{cmd}$ via another PID controller. This approach decouples position and attitude control, improving stability in drone formation flights. The attitude commands are sent to the drone’s inner-loop controller, which handles motor thrust and orientation. The parameters for these controllers are tuned experimentally to ensure robust performance across different flight conditions.
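
The cascaded structure can be illustrated for a single axis. This is a minimal sketch, not the tuned controller from the experiments: the gains, the output limits, and the class names `PID` and `CascadedAxisController` are hypothetical, and the sign convention of the attitude command depends on the platform.

```python
class PID:
    """Scalar PID controller with output clamping."""

    def __init__(self, kp: float, ki: float, kd: float, limit: float) -> None:
        self.kp, self.ki, self.kd, self.limit = kp, ki, kd, limit
        self.integral = 0.0
        self.prev_error = None  # No derivative term on the first update.

    def update(self, error: float, dt: float) -> float:
        if dt <= 0.0:
            raise ValueError("dt must be positive")
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(-self.limit, min(self.limit, output))


class CascadedAxisController:
    """Position loop feeds a velocity loop, which outputs a tilt command."""

    def __init__(self) -> None:
        # Hypothetical gains and limits; the real values are tuned experimentally.
        self.position_loop = PID(kp=1.0, ki=0.0, kd=0.0, limit=1.0)   # output: m/s
        self.velocity_loop = PID(kp=0.3, ki=0.0, kd=0.0, limit=0.35)  # output: rad

    def update(self, p_des: float, p_est: float, v_est: float, dt: float) -> float:
        v_des = self.position_loop.update(p_des - p_est, dt)  # e_p -> v_des
        return self.velocity_loop.update(v_des - v_est, dt)   # e_v -> theta_cmd
```

At the 200 Hz update rate the loop would be called with `dt = 0.005`. Clamping the output of the position loop bounds the velocity that the inner loop is asked to track, which keeps a large position error from demanding an aggressive tilt.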

The multi-drone formation control algorithm builds on this by adding coordination layers. In a drone formation, one drone is designated as the leader (e.g., Drone 1), and others follow based on a desired formation pattern. The ideal position $\mathbf{p}_{i,ideal}$ for follower drone $i$ is computed from the leader’s position $\mathbf{p}_L$ and a formation offset $\boldsymbol{\delta}_i$:

$$ \mathbf{p}_{i,ideal} = \mathbf{p}_L + \boldsymbol{\delta}_i $$

The formation offset $\boldsymbol{\delta}_i$ is defined by the formation geometry, such as a square or line pattern. During flight, the actual position $\hat{\mathbf{p}}_i$ is compared to $\mathbf{p}_{i,ideal}$, and a formation compensation vector $\mathbf{c}_i$ is generated:

$$ \mathbf{c}_i = K_f (\mathbf{p}_{i,ideal} - \hat{\mathbf{p}}_i) $$

where $K_f$ is a gain matrix. This compensation is added to the velocity command of drone $i$, effectively adjusting its trajectory to maintain the drone formation. The overall control law for a follower drone combines position tracking and formation maintenance:

$$ \mathbf{v}_{des,i} = \text{PID}(\mathbf{p}_{i,ideal} - \hat{\mathbf{p}}_i) + \mathbf{c}_i $$

This distributed approach allows scalable and flexible drone formation management, as each drone independently computes its commands based on shared data from the ground station.
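
The leader-follower equations above can be sketched for one follower. This is a minimal sketch under stated assumptions: the offsets for the square and line patterns are hypothetical values built on the 2 m spacing, $K_f$ is taken to be diagonal and stored as per-axis gains, and the PID tracking term is replaced by a proportional stand-in (`k_track`) to keep the example short.

```python
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical offsets (meters) of followers 2-4 relative to the leader, Drone 1.
SQUARE_OFFSETS: Dict[int, Vec3] = {2: (2.0, 0.0, 0.0), 3: (2.0, 2.0, 0.0), 4: (0.0, 2.0, 0.0)}
LINE_OFFSETS: Dict[int, Vec3] = {2: (2.0, 0.0, 0.0), 3: (4.0, 0.0, 0.0), 4: (6.0, 0.0, 0.0)}


def ideal_position(leader: Vec3, offset: Vec3) -> Vec3:
    """p_i_ideal = p_L + delta_i."""
    return tuple(leader[k] + offset[k] for k in range(3))


def formation_compensation(ideal: Vec3, actual: Vec3, k_f: Vec3) -> Vec3:
    """c_i = K_f (p_i_ideal - p_hat_i), with K_f assumed diagonal."""
    return tuple(k_f[k] * (ideal[k] - actual[k]) for k in range(3))


def follower_velocity_command(leader: Vec3, offset: Vec3, actual: Vec3,
                              k_track: float, k_f: Vec3) -> Vec3:
    """v_des_i = tracking term + c_i; the tracking term here is P-only."""
    ideal = ideal_position(leader, offset)
    compensation = formation_compensation(ideal, actual, k_f)
    return tuple(k_track * (ideal[k] - actual[k]) + compensation[k] for k in range(3))
```

Switching from the square to the line pattern only changes which offset table is used. Because $\mathbf{c}_i$ is proportional to the same error as the tracking term, it acts as additional proportional gain on the formation error.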

To validate the system, I conducted experiments focusing on localization accuracy and formation flight performance. The visual localization test involved fixing a drone and moving it along a straight path, comparing the estimated position to ground truth measurements. The results, summarized in Table 1, show an average positioning error of about 0.2 meters (0.18 to 0.21 meters per run) and a maximum error of 0.28 meters, well below the minimum inter-drone spacing of 2 meters in a drone formation. This accuracy leaves a wide margin for coordinating the drones without collisions.

Table 1: Visual Localization Accuracy Test Results
Test Run	Distance Traveled (m)	Average Error (m)	Maximum Error (m)
1	5.0	0.18	0.25
2	5.0	0.21	0.28
3	5.0	0.19	0.24
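
As a quick check on the figures in Table 1, the mean of the three per-run averages and a pessimistic spacing margin can be worked out directly. The margin is only a rough bound: it assumes two neighboring drones are each off by the largest error in the table, in opposite directions toward each other, and it covers localization error only, not control error.

$$ \bar{e} = \frac{0.18 + 0.21 + 0.19}{3} \approx 0.19 \ \text{m} $$

$$ d_{min} = 2 - 2 \times 0.28 = 1.44 \ \text{m} $$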

The drone formation flight experiments involved four drones executing predefined trajectories in square and line formations. The drones took off simultaneously, navigated to waypoints, and maintained formation while adapting to shape changes. The formation error, defined as the deviation from ideal positions, was recorded over the three flight phases listed in Table 2 (150 seconds in total). As shown in Table 2, the average formation error stayed below 0.4 meters in every phase, with a maximum of 0.45 meters during the shape transition, demonstrating effective drone formation control. The drones successfully transitioned between formations, highlighting the system’s flexibility. These experiments indicate that the vision-based approach enables stable indoor drone formation flight without GPS, as needed for applications like inventory management or search and rescue.

Table 2: Drone Formation Flight Performance
Formation Pattern	Duration (s)	Average Formation Error (m)	Maximum Error (m)
Square	60	0.35	0.42
Line	60	0.32	0.39
Shape Transition	30	0.38	0.45
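
The average and maximum columns in Table 2 could be computed from logged positions along the following lines. This is a minimal sketch: the text defines the formation error only as the deviation from ideal positions, so the use of the Euclidean distance is an assumption, and `formation_error_stats` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def formation_error_stats(ideal_track: Sequence[Vec3],
                          actual_track: Sequence[Vec3]) -> Tuple[float, float]:
    """Return (average, maximum) distance between ideal and actual positions."""
    if not ideal_track or len(ideal_track) != len(actual_track):
        raise ValueError("tracks must be non-empty and of equal length")
    # Assumption: the error of one sample is the straight-line distance in meters.
    errors = [math.dist(ideal, actual) for ideal, actual in zip(ideal_track, actual_track)]
    return sum(errors) / len(errors), max(errors)
```

Each track holds one drone's positions sampled at the same instants. Averaging the per-drone results would then give a formation-wide figure like those reported in the table.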

In analyzing the system’s performance, several factors contribute to its success. The use of AprilTags provides robust visual landmarks that are easy to detect and interpret, even under varying lighting conditions. The sensor fusion algorithm compensates for the low update rate of visual localization, ensuring smooth control inputs for drone formation. However, challenges remain, such as occlusions or tag misdetections, which can temporarily degrade localization. To mitigate this, I incorporated redundancy by using multiple tags in the camera’s field of view. Future improvements could include integrating additional sensors like ultrasonic rangefinders or developing adaptive control laws that adjust to dynamic environments. The scalability of this system is another advantage; by optimizing the software, it can support larger drone formations with more complex patterns. This aligns with the broader trend toward autonomous swarms, where drone formation technology plays a central role in enabling collaborative behaviors.
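
One possible way to exploit several visible tags is sketched below. The text does not say how the estimates from multiple tags are combined, so this is an illustrative assumption rather than the method used in the system: each visible tag is taken to yield its own position estimate, and a per-axis median discards an occasional misdetection. The name `robust_position` is hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def robust_position(estimates: Sequence[Vec3]) -> Vec3:
    """Per-axis median of the position estimates from all visible tags."""
    if not estimates:
        raise ValueError("need at least one per-tag position estimate")
    # The median ignores a single outlier as long as most tags agree.
    return tuple(median(e[k] for e in estimates) for k in range(3))
```

With three estimates of which one is far off, for instance because a tag ID was misread, the median stays with the two consistent estimates, whereas a plain mean would be pulled toward the outlier.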

From a broader perspective, this work contributes to the field of multi-robot systems by demonstrating a practical indoor drone formation solution. The integration of vision, control, and communication shows how off-the-shelf components can be used for advanced research. The algorithms presented here, such as the PnP-based localization and the cascaded PID control, also apply to robotic platforms other than drones. Moreover, the emphasis on real-time processing and network architecture offers insights into designing robust distributed systems. The techniques described here are relevant as drone formation applications expand into areas like agriculture, where indoor farming requires precise coordination, or entertainment, where aerial displays demand tight synchronization. The experimental results support the system’s feasibility and provide a basis for further work on autonomous drone formation flight.

In conclusion, the vision-based indoor drone formation flight system I designed effectively addresses the GPS-denied challenge through innovative use of AprilTags and sensor fusion. The hardware and software components work in harmony to achieve accurate localization and stable control, enabling drones to fly in coordinated formations with minimal error. The experiments underscore the system’s reliability, with formation errors within acceptable bounds for practical deployment. This research highlights the potential of visual landmarks for enhancing autonomy in confined spaces, and it sets a foundation for future work on adaptive formations or heterogeneous swarms. As drone formation technology continues to evolve, systems like this will be crucial for unlocking new capabilities in automation and robotics, ultimately driving progress toward fully autonomous aerial networks.