In recent years, the field of drone control and navigation has garnered significant attention, with applications increasingly demanding multi-drone collaboration. As single drones often struggle to accomplish complex missions, research into drone swarm formation has become a critical focus. Traditional formation methods relying on communication links are susceptible to denial or interference, prompting a shift towards vision-based approaches that offer enhanced robustness. In this context, I propose an improved real-time object detection algorithm based on YOLOv8n, specifically designed for leader detection in drone formation scenarios. My contributions include integrating deformable convolution modules into the Neck network, incorporating a multi-head self-attention mechanism to bolster feature extraction, and employing advanced data augmentation techniques during training. This article delves into the algorithmic principles, enhancements, experimental validation, and practical application in drone formation control, all from my firsthand research perspective.
The core challenge in vision-based drone formation lies in the follower’s ability to detect the leader accurately and quickly within its camera feed. The YOLO (You Only Look Once) family of algorithms has become a cornerstone of real-time object detection because it balances speed and accuracy. YOLOv8, a recent iteration, builds on its predecessors with an efficient backbone network, a decoupled head, and anchor-free detection. My work builds upon YOLOv8n, the nano version, to create a lightweight yet powerful detector suitable for onboard processing in drones. The standard YOLOv8 architecture comprises three main components: the Backbone for feature extraction, the Neck for multi-scale feature fusion, and the Head for final detection output. The Backbone is a CSPDarknet-style network built from C2f modules, which improve gradient flow, and an SPPF module for spatial pyramid pooling. The Neck employs a PAN-FPN structure to aggregate features from different scales, while the Head uses a decoupled design with separate classification and regression branches and replaces anchor-based priors with an anchor-free formulation for greater flexibility and speed.

To address the dynamic nature of drone formation—where the leader’s appearance can vary due to perspective changes, occlusion, and distance fluctuations—I introduce three key modifications. First, I replace standard convolutions within the C2f modules of the Neck with deformable convolutions (DCN). Standard convolution operates on a fixed grid, which can be limiting for objects with non-rigid transformations. Deformable convolution learns spatial offsets, allowing the sampling grid to adapt to the object’s geometry. Mathematically, for an output position \( p_0 \), the deformable convolution operation is defined as:
$$ y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) $$
where \( \mathcal{R} \) defines the regular grid, \( w(p_n) \) are the convolution weights, and \( \Delta p_n \) are the learned offsets. Since \( p_0 + p_n + \Delta p_n \) may be fractional, bilinear interpolation is applied:
$$ x(p) = \sum_q g(q_x, p_x) \cdot g(q_y, p_y) \cdot x(q) $$
with \( g(a,b) = \max(0, 1 - |a-b|) \). This enables the model to better capture the leader’s shape deformations during flight maneuvers.
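The two formulas above can be made concrete with a small pure-Python sketch. It evaluates a deformable 3×3 convolution at a single output position on a single-channel feature map. The function names (`g`, `bilinear_sample`, `deform_conv_at`) and the dictionary layout for weights and offsets are illustrative assumptions, not part of the detector's code, and in a real network the offsets would be predicted by an auxiliary convolution rather than passed in by hand.

```python
import math

def g(a, b):
    """1-D bilinear kernel g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear_sample(x, py, px):
    """Sample a 2-D feature map x (list of rows) at fractional (py, px).

    Positions outside the map contribute zero (zero padding).
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    # Only the four integer neighbours can have a non-zero kernel weight.
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                value += g(qy, py) * g(qx, px) * x[qy][qx]
    return value

def deform_conv_at(x, weights, offsets, p0):
    """Evaluate y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n).

    weights: dict mapping grid point (dy, dx) -> scalar weight w(p_n)
    offsets: dict mapping grid point (dy, dx) -> offset (ddy, ddx);
             missing entries mean a zero offset (plain convolution).
    """
    y0, x0 = p0
    total = 0.0
    for (dy, dx), w in weights.items():
        ddy, ddx = offsets.get((dy, dx), (0.0, 0.0))
        total += w * bilinear_sample(x, y0 + dy + ddy, x0 + dx + ddx)
    return total

# Toy feature map whose value is linear in position: x[i][j] = 4*i + j.
fmap = [[float(4 * i + j) for j in range(4)] for i in range(4)]
grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
avg_weights = {p: 1.0 / 9.0 for p in grid}

plain = deform_conv_at(fmap, avg_weights, {}, (1, 1))
shifted = deform_conv_at(fmap, avg_weights, {p: (0.5, 0.5) for p in grid}, (1, 1))
```

With zero offsets the result is the ordinary 3×3 average around the centre; a uniform half-pixel offset moves every sampling point, so the same weights read a shifted neighbourhood of the feature map.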
Second, I integrate a multi-head self-attention (MHSA) mechanism into the Neck network. While CNNs excel at capturing local features, attention mechanisms can model long-range dependencies, which is beneficial when the leader might be partially occluded or appear against cluttered backgrounds. The MHSA module projects the input features into multiple subspaces, computes self-attention in each, and concatenates the results. For each head \( i \), the attention output is:
$$ Z_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i $$
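where \( Q_i = Q W_i^Q \), \( K_i = K W_i^K \), \( V_i = V W_i^V \) are linear projections of the query, key, and value matrices (in self-attention all three are derived from the same input features), and \( d_k \) is the per-head key dimension. The final output is: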
$$ \text{MHSA}(Q,K,V) = \text{Concat}(Z_1, Z_2, \ldots, Z_h) W^O $$
with \( h \) typically set to 8. This allows the model to attend to different parts of the leader simultaneously, enhancing feature representation.
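The attention equations above can be sketched in a few lines of dependency-free Python. This is an illustration only, not the module used in the Neck: it uses the row-vector convention (tokens are rows, projections are applied as `x @ W`), treats a flattened feature map as a list of token vectors, and all names and matrix sizes are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)

def mhsa(x, w_q, w_k, w_v, w_o):
    """Multi-head self-attention over tokens x (n_tokens x d_model).

    w_q, w_k, w_v: lists with one projection matrix per head (d_model x d_k)
    w_o: output projection (h * d_k x d_model)
    """
    heads = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Two heads that each look at one half of a 4-dimensional token.
first_half = [[1, 0], [0, 1], [0, 0], [0, 0]]
second_half = [[0, 0], [0, 0], [1, 0], [0, 1]]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
tokens = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
out = mhsa(tokens, [first_half] * 2, [first_half] * 2,
           [first_half, second_half], identity)
```

Each head mixes the value vectors of all tokens according to its own attention weights, which is what lets different heads attend to different parts of the leader.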
Third, to improve robustness to occlusions—a common issue in drone formation—I augment the training data with Cutout. During training, random square regions of the input image are masked with zeros, simulating partial occlusion of the leader. This encourages the model to rely on multiple visual cues rather than a single contiguous region. Combined with the standard Mosaic augmentation, which stitches multiple images together, this significantly improves generalization to real-world scenarios where the leader might be temporarily obscured.
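A minimal sketch of the Cutout step described above, written for a single-channel image stored as a list of rows. The function name, the mask size, and the choice to let the mask be clipped at the image border are assumptions made for illustration; they are not taken from the training pipeline.

```python
import random

def cutout(image, size, rng=None):
    """Return a copy of image with one size x size square set to zero.

    The centre of the square is drawn uniformly over the image, so the
    masked region may be clipped at the border. The input is not modified.
    Also returns the masked box as (y1, x1, y2, x2), end-exclusive.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = [list(row) for row in image]
    for y in range(y1, y2):
        for x in range(x1, x2):
            out[y][x] = 0
    return out, (y1, x1, y2, x2)

# Mask a 4x4 patch of an 8x8 image with a fixed seed for repeatability.
image = [[1] * 8 for _ in range(8)]
masked, box = cutout(image, 4, random.Random(0))
```

In a detection setting the bounding-box labels are left unchanged, so the network still has to localize a leader whose pixels are partly hidden.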
The overall modified architecture, which I refer to as Improved YOLOv8n, retains the original Backbone but enhances the Neck with C2f-D (deformable) modules and MHSA blocks. This design ensures that feature extraction is both adaptive and context-aware, crucial for the varying conditions in drone formation tasks.
In the context of drone formation control, the follower uses my improved detector to obtain the leader’s bounding box \( b(t) = [u(t), v(t), w(t), h(t)] \), where \( u, v \) are the center coordinates in pixel space, and \( w, h \) are the width and height. Assuming a pinhole camera model, the relative distance \( Z \) between the leader and follower can be estimated from the bounding box area. Let \( A \) be the actual physical area of the leader (known a priori or calibrated), and \( \alpha = w(t) \times h(t) \) be the detected area in pixels. Then:
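$$ Z = \sqrt{\frac{A \, f_x f_y}{\alpha}} $$
where \( f_x \) and \( f_y \) are the camera’s focal lengths in pixels. This follows from the pinhole projections \( w = f_x W / Z \) and \( h = f_y H / Z \) of the leader’s physical width \( W \) and height \( H \), with \( A = W H \). The normalized image coordinates \( (x_1, x_2) \) are derived as: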
$$ x_1(t) = \frac{u(t) - c_u}{f_x}, \quad x_2(t) = \frac{v(t) - c_v}{f_y} $$
with \( (c_u, c_v) \) being the principal point. Defining the image feature vector \( s(t) = [x_1(t), x_2(t), 1/Z(t)]^T \), the control objective is to drive \( s(t) \) to a desired setpoint \( s^*(t) = [X_{\text{des}}/Z_{\text{des}}, Y_{\text{des}}/Z_{\text{des}}, 1/Z_{\text{des}}]^T \), which corresponds to the desired relative position in the formation. A PI controller is employed:
$$ u_c(t) = K_P e(t) + K_I \int_0^t e(\tau) d\tau $$
where \( e(t) = s(t) - s^*(t) \) and \( u_c(t) \) is the control command (written \( u_c \) to distinguish it from the pixel coordinate \( u(t) \)). This closed-loop system enables the follower to maintain a specified formation geometry based solely on visual feedback, demonstrating the practicality of my detection algorithm in autonomous drone formation.
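The chain from bounding box to control command can be sketched as follows. The code assumes that \( A \) is a physical area in square metres, so that the pinhole model gives \( Z = \sqrt{A f_x f_y / \alpha} \), and it uses scalar gains applied element-wise with a rectangle-rule integral. The camera intrinsics, gains, and class and function names are hypothetical, and the mapping from this command to actual velocity setpoints (axes, signs, saturation) is not covered.

```python
import math

def bbox_to_features(bbox, fx, fy, cu, cv, area_m2):
    """Map a bounding box [u, v, w, h] in pixels to s = [x1, x2, 1/Z]."""
    u, v, w, h = bbox
    alpha = w * h                               # detected area in pixels^2
    z = math.sqrt(area_m2 * fx * fy / alpha)    # pinhole depth estimate in metres
    return [(u - cu) / fx, (v - cv) / fy, 1.0 / z]

class PIController:
    """Element-wise PI law: command = Kp * e + Ki * integral(e)."""

    def __init__(self, kp, ki):
        self.kp, self.ki = kp, ki
        self.integral = None

    def step(self, s, s_des, dt):
        error = [a - b for a, b in zip(s, s_des)]
        if self.integral is None:
            self.integral = [0.0] * len(error)
        # Rectangle-rule approximation of the integral term.
        self.integral = [i + e * dt for i, e in zip(self.integral, error)]
        return [self.kp * e + self.ki * i for e, i in zip(error, self.integral)]

# Hypothetical camera and a 0.5 m x 0.5 m leader seen 2 m ahead.
fx = fy = 500.0
cu, cv = 320.0, 240.0
s = bbox_to_features([320.0, 240.0, 125.0, 125.0], fx, fy, cu, cv, area_m2=0.25)
s_des = [0.0, 0.0, 1.0 / 2.0]                   # centred in the image, 2 m away
controller = PIController(kp=1.0, ki=0.5)
command = controller.step(s, s_des, dt=1.0 / 30.0)
```

In this example the leader is already at the setpoint, so the feature error and the command are both zero; shifting or resizing the box produces a non-zero command.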
To validate my improvements, I conducted extensive experiments on both public and custom datasets. For the public benchmark, I used VOC2012, comprising 17,125 images across 20 categories. This tests general object detection capability. For drone-specific evaluation, I collected a custom dataset of 1,420 images simulating various formation scenarios: linear flight, leader pose variations, and distance changes. The dataset was split 80/10/10 for training, validation, and testing, respectively, and annotated in YOLO format with a single class “uav”.
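For reference, an 80/10/10 split of 1,420 images yields 1,136 training, 142 validation, and 142 test images. The sketch below shows one way to produce such a split and to write a YOLO-format label line (class index followed by the box centre, width, and height, each normalized by the image size). The file names, the seed, and both helper functions are hypothetical and are not the tooling used to build the dataset.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

def to_yolo_label(class_id, u, v, w, h, img_w, img_h):
    """Format one box (pixel centre u, v and size w, h) as a YOLO label line."""
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        class_id, u / img_w, v / img_h, w / img_w, h / img_h)

# Hypothetical file names for the 1,420 images; class 0 is "uav".
names = ["img_{:04d}.jpg".format(i) for i in range(1420)]
train, val, test = split_dataset(names)
label = to_yolo_label(0, 320, 240, 128, 96, 640, 480)
```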
The training environment consisted of an Ubuntu 20.04 system with an NVIDIA GeForce RTX 3090 GPU, Intel i9-10900X CPU, CUDA 11.1, Python 3.8.16, and PyTorch 1.10.0. The models were trained with standard hyperparameters: input size 640×640, batch size 16, and 300 epochs. Data augmentation included Mosaic, Cutout, and standard color jittering to enhance robustness.
Performance metrics included mean Average Precision at an IoU threshold of 0.5 (mAP50), mAP50-95 (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05), inference time per image in milliseconds (ms), computational cost in GFLOPs (giga floating-point operations per forward pass), and parameter size in MB. The results on the VOC2012 dataset are summarized in Table 1.
| Model | Inference Speed (ms) | mAP50 (%) | GFLOPs | Parameters (MB) |
|---|---|---|---|---|
| YOLOv5n | 4.9 | 64.8 | 4.3 | 1.71 |
| YOLOv8n (baseline) | 5.9 | 67.4 | 8.2 | 2.88 |
| Improved YOLOv8n (my method) | 6.1 | 68.1 | 8.4 | 2.98 |
My improved model achieves a higher mAP50 than both YOLOv5n (+3.3 points) and the baseline YOLOv8n (+0.7 points), with only a marginal increase in inference time (6.1 ms vs. 5.9 ms), GFLOPs (8.4 vs. 8.2), and parameters (2.98 MB vs. 2.88 MB). The enhancements therefore improve accuracy without compromising real-time performance, a critical factor for drone formation where low latency is essential.
Results on the custom drone formation dataset are shown in Table 2. Because the dataset is tailored to the leader detection task and contains a single class, all models exceed 99% mAP50, so the stricter mAP50-95 metric is the more informative comparison.
| Model | Input Size | Inference Speed (ms) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|
| YOLOv5n | 640 | 5.8 | 99.1 | 86.7 |
| YOLOv8n (baseline) | 640 | 6.0 | 99.1 | 87.8 |
| Improved YOLOv8n (my method) | 640 | 6.1 | 99.5 | 89.1 |
My model achieves the highest mAP50 (99.5%) and mAP50-95 (89.1%), gains of 0.4 and 1.3 points over the YOLOv8n baseline. The larger gain on mAP50-95 indicates tighter bounding-box localization at stricter IoU thresholds. The Precision-Recall curve for my method on this dataset, as shown in the results, approaches near-perfect detection, which is vital for stable drone formation control. The added latency (6.1 ms vs. 6.0 ms for the baseline) is negligible relative to these gains, supporting reliable leader tracking even in challenging conditions.
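To make the difference between mAP50 and mAP50-95 concrete, the sketch below computes the IoU of a predicted and a ground-truth box in the \( [u, v, w, h] \) centre format and checks it against the ten thresholds from 0.50 to 0.95. It covers only the matching criterion; a full AP computation also ranks detections by confidence and integrates the precision-recall curve, which is not shown. The function names and the example boxes are illustrative.

```python
def iou_cxcywh(a, b):
    """IoU of two boxes given as [cx, cy, w, h]."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matches_per_threshold(pred, truth):
    """For each threshold, report whether pred counts as a true positive."""
    iou = iou_cxcywh(pred, truth)
    return [iou >= t for t in THRESHOLDS]

# A prediction shifted 8 px sideways from a 40x40 ground-truth box.
truth = [100.0, 100.0, 40.0, 40.0]
pred = [108.0, 100.0, 40.0, 40.0]
hits = matches_per_threshold(pred, truth)
```

Here the IoU is about 0.67, so the detection counts as correct for mAP50 but only at four of the ten thresholds used by mAP50-95, which is why the stricter metric rewards tighter localization.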
To further illustrate the impact of my improvements, I analyze the feature maps before and after adding the deformable convolution and MHSA modules. The deformable convolution allows the network to focus on irregular leader contours, especially during banking turns or when the leader is at an angle. The MHSA module highlights contextual regions around the leader, reducing false negatives in cluttered skies. These enhancements collectively make the detector more resilient to the dynamic environments typical of drone formation flights.
Beyond offline evaluation, I integrated my improved detector into a full drone formation simulation using Gazebo and ROS (Robot Operating System). Gazebo provides a high-fidelity 3D environment with accurate physics and sensor models, ideal for testing vision-based algorithms. I modeled two quadcopters: a leader following a predefined trajectory and a follower equipped with a simulated monocular camera running my detection algorithm. The follower’s control loop, as described earlier, uses the detected bounding box to compute relative pose and generate velocity commands via a PI controller, communicated to the flight controller through MAVROS.
The simulation scenarios included straight-line formation, circular patrolling, and sudden leader maneuvers. In all cases, the follower successfully maintained the desired formation geometry. For instance, in a linear formation with a desired separation of 2 meters, the relative position in the x-axis (along-track) remained around 2 m with minimal deviation, while the y-axis (cross-track) error stayed near zero, as shown in the control performance graphs. This validates that my detection algorithm provides accurate and stable feedback for formation control, enabling autonomous drone formation without external communication.
The success of this simulation underscores the practical viability of my approach. The improved YOLOv8n detector runs in real time on the follower’s onboard computer (simulated with equivalent processing power), processing each frame in about 6 ms. That is well under the roughly 33 ms frame period of a typical 30 Hz control loop, which leaves computational headroom for other tasks, such as obstacle avoidance or path planning, and makes the detector suitable for complex multi-drone missions.
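The headroom claim is simple arithmetic, sketched below with the 6.1 ms inference time from Table 2 and the 30 Hz example rate. The function name is hypothetical, and the figure reflects the desktop GPU used for evaluation, so the utilisation on real onboard hardware would need to be measured separately.

```python
def frame_budget(inference_ms, loop_hz):
    """Return the loop period in ms and the fraction of it used by inference."""
    period_ms = 1000.0 / loop_hz
    return period_ms, inference_ms / period_ms

period_ms, utilisation = frame_budget(6.1, 30.0)
print("period: {:.1f} ms, detector share: {:.0%}".format(period_ms, utilisation))
```

At these numbers the detector uses a little under a fifth of each control period.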
In conclusion, my work presents a robust and efficient leader detection algorithm for vision-based drone formation. By integrating deformable convolutions and multi-head self-attention into YOLOv8n, along with Cutout and Mosaic data augmentation, I have enhanced the model’s ability to handle the variability and occlusions inherent in aerial scenarios. Experimental results on both public and custom datasets show improved detection accuracy over the YOLOv8n baseline (+0.7 points mAP50 on VOC2012 and +1.3 points mAP50-95 on the custom dataset) while maintaining real-time performance. The simulation in Gazebo demonstrates integration into a complete formation control system, enabling drones to autonomously maintain formation using only visual cues. Future work will focus on extending this to multi-leader scenarios, incorporating deep learning-based pose estimation for more precise relative localization, and testing in outdoor environments with real drones. Drone formation technologies stand to benefit from such advances in perception algorithms, which extend what autonomous collaborative systems can do.
Throughout this article, I have emphasized the importance of “drone formation” as a key application domain. The proposed algorithm not only addresses the specific need for leader detection but also contributes to the broader field of autonomous swarm robotics. The use of deformable convolution and attention mechanisms can be adapted to other dynamic object detection tasks, while the control framework provides a blueprint for vision-based formation keeping. As drones become increasingly integral to logistics, surveillance, and disaster response, reliable and communication-independent formation methods like the one described here will be essential for safe and efficient operations.
