The quest for robust and reliable multi-drone coordination has been a central focus in robotics research. While traditional methods rely heavily on communication links, which are vulnerable to denial or disruption, vision-based approaches offer a compelling alternative. By equipping follower drones with cameras to visually identify and track a leader, the formation can maintain its integrity without constant data exchange. This paradigm, known as the Leader-Follower architecture, places a critical burden on the real-time and accurate visual detection of the leader drone. Our research centers on advancing this core capability.
We present an enhancement to the YOLOv8 object detection model, tailored for leader drone identification in dynamic aerial environments. The challenges are manifold: the leader’s appearance can change drastically with orientation, it may occupy only a small portion of the image when flying at a distance, and partial occlusions can occur. To address these, our improved algorithm integrates three changes into the YOLOv8n architecture and its training pipeline: Deformable Convolutions within the Neck network, a Multi-Head Self-Attention mechanism, and Cutout data augmentation during training. This combination yields a detector that is both more accurate and more resilient, providing a solid perceptual foundation for autonomous formation drone light show systems and other precision flocking applications.
The YOLOv8 Framework: A Foundation for Speed and Accuracy
Our work builds upon YOLOv8, a recent iteration in the ‘You Only Look Once’ family of real-time object detectors. Its architecture is designed for efficiency and comprises three main sections: the Backbone, the Neck, and the Head. The Backbone (CSPDarkNet-based) is responsible for extracting hierarchical features from the input image. A key component here is the C2f module, an efficient design that enriches gradient flow by aggregating features from multiple branches, enhancing the model’s learning capacity without compromising speed.
The Neck, adopting a Path Aggregation Network (PANet) structure with Feature Pyramid Networks (FPN), performs the crucial task of multi-scale feature fusion. It combines deep, semantically rich features (good for recognizing large objects) with shallow, high-resolution features (good for pinpointing small objects). This allows a single model to effectively detect the leader drone whether it is flying close and large or far and small in the follower’s camera view.
Finally, the Head has been redesigned in YOLOv8 with a decoupled structure. Instead of a single convolutional layer predicting class, objectness, and bounding box jointly, it uses separate branches for classification (Cls) and regression (Box). Furthermore, YOLOv8 adopts an Anchor-Free approach, predicting the center of an object directly rather than relying on pre-defined anchor boxes. This simplifies the detection process, reduces computational overhead, and mitigates issues like missed detections or duplicate boxes that can arise from poorly tuned anchors—a vital trait for the consistent tracking needed in a formation drone light show.
Enhancing the Model for Aerial Vision
The standard YOLOv8, while powerful, is a general-purpose detector. To specialize it for the unique challenges of aerial leader detection, we introduced targeted architectural improvements.
Deformable Convolutions for Geometric Adaptability
A fundamental limitation of standard convolution is its fixed geometric structure. It samples input features on a rigid grid (e.g., a 3×3 kernel), which struggles to model geometric transformations of objects. A leader drone banking, tilting, or viewed from an angle presents a highly variable shape to the camera.
We integrate Deformable Convolutions (DCN) into the Bottleneck blocks of the C2f modules within the Neck network. DCN augments the standard convolution by learning additional 2D offset fields for each sample point in the kernel. This allows the sampling grid to freely deform and adapt to the actual structure of the leader drone in the feature map. The operation can be formalized as:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
where $p_0$ is a location on the output feature map $y$, $R$ defines the kernel’s sampling grid (e.g., $(-1,-1), (-1,0), \ldots, (1,1)$), $w(p_n)$ is the weight for location $p_n$, $x$ is the input feature map, and $\Delta p_n$ is the learned offset for location $p_n$. Since $\Delta p_n$ is typically fractional, bilinear interpolation is used to compute $x(p_0 + p_n + \Delta p_n)$. This adaptability is crucial for maintaining a tight bounding box around a maneuvering leader, directly improving the precision of the relative pose estimation that drives formation drone light show control.
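The sampling rule above can be illustrated for a single output location with a minimal pure-Python sketch. The function names (`bilinear_sample`, `deformable_conv_at`) and the zero-padding outside the feature map are assumptions made for illustration; in the actual network the offsets are predicted by an extra convolution and the operation runs over all channels and locations at once.

```python
import math

def bilinear_sample(x, row, col):
    """Bilinearly interpolate the 2D grid x at a fractional (row, col).

    Locations outside the grid contribute zero (an assumed padding choice).
    """
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(row), math.floor(col)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < h and 0 <= c < w:
                weight = (1.0 - abs(row - r)) * (1.0 - abs(col - c))
                value += weight * x[r][c]
    return value

# Regular 3x3 sampling grid R: (-1,-1), (-1,0), ..., (1,1)
GRID_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def deformable_conv_at(x, weights, offsets, p0):
    """Compute y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n).

    weights: 9 scalars w(p_n); offsets: 9 (d_row, d_col) pairs delta_p_n.
    With all offsets equal to (0, 0) this reduces to a standard convolution.
    """
    total = 0.0
    for (dr, dc), w_n, (off_r, off_c) in zip(GRID_3X3, weights, offsets):
        total += w_n * bilinear_sample(x, p0[0] + dr + off_r, p0[1] + dc + off_c)
    return total
```

Setting every offset to zero recovers the rigid 3×3 kernel, which makes it easy to see that the learned offsets are the only thing that changes where the kernel looks.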
Multi-Head Self-Attention for Contextual Awareness
While convolutions excel at capturing local patterns, understanding broader contextual relationships in the scene can be pivotal. For instance, distinguishing the leader drone from background clutter or other flying objects requires reasoning about the global scene context.
We incorporate a Multi-Head Self-Attention (MHSA) block into the Neck. Self-attention allows the model to weigh the importance of all other feature map locations when refining the representation at a specific location. The “multi-head” aspect runs several of these attention mechanisms in parallel, each potentially learning to focus on different types of relationships (e.g., shape, color, spatial layout), leading to a more robust and comprehensive feature representation.
The core attention mechanism for a single head is calculated as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q$ (Query), $K$ (Key), and $V$ (Value) are linear projections of the input features. Multi-Head Attention runs $h$ separate attention heads and concatenates their outputs:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
$$\text{where head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Here, $W_i^Q, W_i^K, W_i^V$ are learned projection matrices for head $i$, and $W^O$ is an output projection matrix. This module helps the network suppress irrelevant background features and accentuate the distinctive features of the leader, even in complex visual environments typical of outdoor formation drone light show performances.
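To make the two formulas concrete, the following is a small pure-Python sketch of scaled dot-product attention and its multi-head wrapper. It assumes the feature map has already been flattened into a list of tokens (one per spatial location) and uses plain nested lists for matrices; the helper names (`matmul`, `attention`, `multi_head`) are illustrative and not taken from the YOLOv8 codebase.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """Self-attention over tokens x (n x d_model).

    heads: list of (W_q, W_k, W_v) projection matrices, one triple per head.
    w_o:   output projection applied to the concatenated head outputs.
    """
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    # Concatenate the per-head outputs token by token.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

Because every token attends to every other token, the cost grows quadratically with the number of spatial locations, which is one reason such a block is usually placed on a low-resolution feature map.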
Robust Training with Cutout Augmentation
Real-world operations are messy. A follower drone’s view of the leader can be temporarily occluded by environmental elements like tree branches, light posts, or even other drones in a swarm. Collecting and labeling a comprehensive dataset covering all possible occlusion scenarios is impractical.
To build inherent robustness against such partial occlusions, we employ Cutout data augmentation during training. This technique randomly selects square regions on the training images and fills them with zeros (or other masks). This forces the network to recognize the leader drone from its remaining visible features, making the final model more reliable when faced with real-world obstructions. Such robustness matters for a system that must perform dependably during a public formation drone light show, where reliability is paramount.
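A minimal sketch of Cutout on a single-channel image is shown below. It assumes one square per image, a centre chosen uniformly over the image, and a mask that is clipped at the borders; the mask size and the number of squares used in our training runs are not specified here, so the `size` argument is a hypothetical parameter.

```python
import random

def cutout(image, size, rng=random):
    """Return a copy of image (H x W nested list) with one square zeroed.

    The square is centred on a random pixel and clipped at the image
    borders, so the masked area can be smaller than size x size.
    """
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    half = size // 2
    top, bottom = max(0, cy - half), min(h, cy + half)
    left, right = max(0, cx - half), min(w, cx + half)
    out = [row[:] for row in image]  # leave the original image untouched
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = 0
    return out

# Usage: pass a seeded generator for reproducible augmentation.
augmented = cutout([[255] * 8 for _ in range(8)], size=4, rng=random.Random(0))
```

The bounding-box labels are left unchanged, since the mask hides pixels without moving the drone in the image.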

From Pixels to Formation Control
The bounding box output by our enhanced detector is not an end in itself; it is the primary sensory input for formation control. Let $b(t) = [u(t), v(t), w(t), h(t)]$ denote the bounding box at time $t$, with $(u, v)$ as the pixel coordinates of its center, and $(w, h)$ as its width and height in pixels.
Under the pinhole camera model, the relative depth $Z$ (distance along the camera’s optical axis) between the follower and the leader can be estimated if the leader’s physical size is approximately known. Assuming a known physical cross-sectional area $A$ of the leader (or a key part of it), the bounding box area scales as $\alpha \approx f_x f_y A / Z^2$, so the depth is inversely proportional to the square root of the bounding box area:
$$Z = \sqrt{\frac{A f_x f_y}{\alpha}} \quad \text{where} \quad \alpha = w(t) \times h(t)$$
Here, $f_x$ and $f_y$ are the camera’s focal lengths. The image-plane coordinates $(x_1, x_2)$ of the bounding box center are derived as:
$$x_1(t) = \frac{u(t) - c_u}{f_x}, \quad x_2(t) = \frac{v(t) - c_v}{f_y}$$
where $(c_u, c_v)$ is the principal point (image center). These coordinates are related to the real-world normalized coordinates:
$$\mathbf{s}(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{bmatrix} = \begin{bmatrix} X/Z \\ Y/Z \\ 1/Z \end{bmatrix}$$
where $(X, Y)$ is the leader’s position relative to the follower’s camera frame. The control objective is to drive the current feature vector $\mathbf{s}(t)$ to a desired setpoint $\mathbf{s}^*$, which corresponds to the leader being at the desired relative position $(X_{\text{des}}, Y_{\text{des}}, Z_{\text{des}})$ for the formation. A classical Proportional-Integral (PI) controller can be employed:
$$\mathbf{u}(t) = K_P \mathbf{e}(t) + K_I \int_0^t \mathbf{e}(\tau)\, d\tau, \quad \text{with} \quad \mathbf{e}(t) = \mathbf{s}(t) - \mathbf{s}^*$$
The control signal $\mathbf{u}(t)$, typically representing velocity or acceleration commands, is then sent to the follower drone’s flight controller. Thus, the accuracy and latency of our improved YOLOv8 detector directly and critically influence the stability and precision of the entire formation drone light show.
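The chain from bounding box to control signal can be sketched as follows. The depth estimate assumes that a cross-section of physical area $A$ projects to roughly the bounding box area, i.e. $\alpha \approx f_x f_y A / Z^2$ under the pinhole model. The names `bbox_to_features` and `PIController`, the scalar gains, and the rectangle-rule integral are illustrative assumptions rather than the exact controller used in our experiments.

```python
import math
from dataclasses import dataclass, field

def bbox_to_features(bbox, fx, fy, cu, cv, area_m2):
    """Map a bounding box [u, v, w, h] (pixels) to s = [X/Z, Y/Z, 1/Z].

    fx, fy: focal lengths in pixels; (cu, cv): principal point in pixels;
    area_m2: assumed physical cross-sectional area A of the leader.
    """
    u, v, w, h = bbox
    alpha = w * h                                  # bounding box area in pixels^2
    z = math.sqrt(area_m2 * fx * fy / alpha)       # depth from apparent size
    x1 = (u - cu) / fx                             # normalized horizontal coordinate
    x2 = (v - cv) / fy                             # normalized vertical coordinate
    return [x1, x2, 1.0 / z]

@dataclass
class PIController:
    """Discrete PI law u = K_P e + K_I * integral(e), with e = s - s*."""
    kp: float
    ki: float
    integral: list = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def step(self, s, s_star, dt):
        error = [a - b for a, b in zip(s, s_star)]
        # Rectangle-rule approximation of the integral term.
        self.integral = [i + e * dt for i, e in zip(self.integral, error)]
        return [self.kp * e + self.ki * i for e, i in zip(error, self.integral)]

# Usage with hypothetical camera intrinsics and a 0.3 m x 0.3 m leader.
s = bbox_to_features([380, 240, 90, 90], fx=600, fy=600, cu=320, cv=240, area_m2=0.09)
controller = PIController(kp=2.0, ki=0.5)
u_cmd = controller.step(s, s_star=[0.0, 0.0, 0.5], dt=0.1)
```

How the components of `u_cmd` map to velocity or acceleration axes, and with which sign, depends on the camera mounting and the flight controller interface, so that mapping is deliberately left out of the sketch.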
Experimental Validation and Performance
We rigorously evaluated our improved model against the baseline YOLOv8n and other contenders like YOLOv5n. Testing was conducted on two datasets: the public PASCAL VOC 2012 benchmark to gauge general object detection capability, and a custom, challenging dataset of 1,420 images specifically capturing leader drones in various flight states (straight flight, banking, ascending/descending).
The results on our custom drone dataset are particularly telling. While all models achieved high accuracy due to the focused nature of the task, our enhancements pushed the performance boundaries further.
| Model | Input Size | Inference Speed (ms) | mAP@50 (%) | mAP@50-95 (%) |
|---|---|---|---|---|
| YOLOv5n | 640 | 5.8 | 99.1 | 86.7 |
| YOLOv8n (Baseline) | 640 | 6.0 | 99.1 | 87.8 |
| Our Improved YOLOv8n | 640 | 6.1 | 99.5 | 89.1 |
The table shows that our model achieves the highest mAP@50 (99.5%, up 0.4 points from the 99.1% of both baselines) and the highest mAP@50-95 (89.1%, up 1.3 points from YOLOv8n), the latter being averaged over a range of Intersection-over-Union thresholds. The increase in inference time (from 6.0 ms to 6.1 ms) is a small trade-off for these gains and keeps the system comfortably real-time. On the VOC2012 dataset, our model also improved over the baseline, with mAP@50 rising from 67.4% to 68.1%, which suggests that the architectural modifications generalize beyond the drone dataset.
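For readers less familiar with these metrics, the sketch below shows the Intersection-over-Union computation that underlies them and the threshold sweep used by mAP@50-95 (0.50 to 0.95 in steps of 0.05, following the COCO convention). Boxes are assumed to be in corner format `(x_min, y_min, x_max, y_max)`, and `thresholds_passed` is a hypothetical helper; the full average-precision computation over a dataset is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# mAP@50 uses only the first threshold; mAP@50-95 averages AP over all ten.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, truth):
    """List the IoU thresholds at which pred counts as a correct detection."""
    overlap = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if overlap >= t]
```

A detection that is slightly loose still counts at the 0.50 threshold but fails the stricter ones, which is why mAP@50-95 separates the models more clearly than mAP@50 does.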
Simulation: Closing the Loop on Formation Flight
To validate the complete perception-control pipeline, we deployed our improved detector within a full Gazebo/ROS simulation. A follower drone, equipped with a simulated camera, used our algorithm to detect a leader drone. The bounding box information was converted to the relative feature vector $\mathbf{s}(t)$, and a PI controller generated velocity commands to maintain a pre-defined “line astern” formation (a specific pattern highly relevant to formation drone light show choreography), with a desired following distance of 2 meters horizontally and 0 meters vertically.
The simulation results demonstrated stable and precise formation keeping. The follower successfully maintained the prescribed separation, with the relative position in the horizontal plane oscillating tightly around the 2-meter setpoint. This experiment indicates that, at least in simulation, our high-precision visual detector can serve as the reliable “eyes” for an autonomous formation control system, translating visual observations directly into stable flight commands.
Conclusion and Future Trajectory
This work demonstrates that targeted enhancements to a state-of-the-art object detector like YOLOv8 can yield substantial benefits for the specialized task of aerial leader detection. By incorporating Deformable Convolutions for shape adaptability, Multi-Head Self-Attention for contextual reasoning, and Cutout augmentation for occlusion robustness, we have developed a vision module that is exceptionally well-suited for autonomous drone formations.
The implications extend directly to real-world applications, most spectacularly in synchronized formation drone light show performances. In these displays, hundreds of drones must maintain exact relative positions to create complex, moving images in the sky. A vision-based backup or primary guidance system, powered by an algorithm like ours, could provide redundancy against GPS signal loss or interference, ensuring the show goes on flawlessly. Beyond entertainment, this technology is applicable to search and rescue formations, agricultural surveying fleets, and automated warehouse inventory drones, where reliable relative navigation is key. Our research provides a robust, vision-based perceptual foundation upon which the future of coordinated drone systems can be built.
