Research on Landmark Recognition for Multirotor Drone Material Airdrop Based on YOLOv5s Algorithm

In recent years, multirotor drone technology has emerged as a transformative tool across various domains, particularly in material airdrop operations. The ability of multirotor drones to navigate complex environments and deliver supplies with high precision offers significant advantages over traditional methods, such as ground vehicles or manual deployment, which often suffer from inefficiency, high costs, and safety risks. However, achieving accurate airdrops with multirotor drones poses challenges, including environmental factors like strong winds and electromagnetic interference, which can disrupt flight stability and navigation. To address these issues, advanced target detection techniques are essential for real-time landmark recognition, enabling multirotor drones to identify and locate drop zones reliably. This study focuses on leveraging the YOLOv5s algorithm, a state-of-the-art object detection method, to enhance the precision of material airdrops by multirotor drones. I will detail the algorithm’s architecture, experimental setup on the Jetson Xavier NX platform, data annotation processes, and optimized training procedures, culminating in an analysis of performance metrics that demonstrate its superiority over conventional approaches.

Target detection technology plays a pivotal role in enabling multirotor drones to perform accurate landmark recognition during airdrop missions. Broadly, target detection methods can be categorized into three types: sliding window approaches, region proposal-based techniques, and end-to-end frameworks. Sliding window methods, as early solutions, involve moving a fixed-size window across an image and applying a classifier at each position to detect objects. While straightforward, this approach is computationally intensive due to the large number of windows processed, leading to inefficiencies. Region proposal methods, such as the R-CNN series (including R-CNN, Fast R-CNN, and Faster R-CNN), mitigate this by first generating candidate regions that may contain targets and then classifying and localizing them. This reduces computational load but still involves multiple stages. In contrast, end-to-end methods like YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) integrate feature extraction, region proposal, classification, and localization into a single neural network, allowing for streamlined training and prediction. For multirotor drone applications, where real-time processing and high accuracy are critical, end-to-end methods are preferred due to their efficiency and performance. The evolution of these techniques has been driven by the need to handle varying scales and complexities in aerial imagery, making them ideal for landmark recognition in dynamic environments.
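To make the cost of exhaustive window scanning concrete, the following minimal Python sketch slides a fixed-size window over an image and scores each position with a hypothetical `classify` function (a stand-in for any crop classifier, not a real API); even a modest image generates over a thousand classifier calls before any search over scales.

```python
import numpy as np

def sliding_window_detect(image, classify, win=64, stride=16):
    """Score every window position with a crop classifier.

    `classify` is a hypothetical callable returning a confidence in [0, 1];
    real pipelines would also scan an image pyramid to handle scale.
    """
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classify(image[y:y + win, x:x + win])
            if score > 0.5:
                detections.append((x, y, win, win, score))
    return detections

# A 640x640 image with a 64-px window and 16-px stride already yields
# 37 * 37 = 1369 classifier calls, before any scale search.
demo = np.zeros((640, 640, 3), dtype=np.uint8)
print(len(sliding_window_detect(demo, classify=lambda crop: 0.0)))  # 0 hits
```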

The YOLOv5s algorithm, as a lightweight variant of the YOLOv5 family, is particularly suited for deployment on resource-constrained devices like the Jetson Xavier NX, which is commonly used in multirotor drone systems. YOLOv5s builds upon the advancements of earlier YOLO versions by incorporating a refined network architecture that consists of three main components: the backbone, neck, and head. The backbone, based on the CSPNet (Cross Stage Partial Network) design, enhances feature extraction through partial connections across stages, improving computational efficiency and feature representation. This is crucial for multirotor drones, as it allows the model to process high-resolution images from onboard cameras without excessive latency. The neck combines elements of the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN), facilitating multi-scale feature fusion. This hybrid structure ensures that features from different layers are effectively combined, enabling the detection of landmarks at various sizes and orientations—a common scenario in aerial views captured by multirotor drones. The head component then generates bounding boxes and class predictions directly from the feature maps, embodying the end-to-end philosophy of YOLO by regressing target properties in a single pass.
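The data flow through these three components can be sketched schematically in PyTorch. The miniature network below is purely illustrative: the layer choices, channel widths, and class count are placeholders, and the real YOLOv5s blocks (CSP bottlenecks, SPPF, PAN fusion, three detection scales) are substantially richer.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Schematic backbone -> neck -> head flow only, not YOLOv5s itself."""

    def __init__(self, num_classes=1, num_anchors=3):
        super().__init__()
        # Backbone: progressively downsample while extracting features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Neck: fuse features (a stand-in for FPN/PAN multi-scale fusion).
        self.neck = nn.Sequential(nn.Conv2d(64, 64, 1), nn.SiLU())
        # Head: per anchor, 4 box offsets + 1 objectness + class scores.
        self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.zeros(1, 3, 640, 640))
print(out.shape)  # torch.Size([1, 18, 160, 160]): one grid of raw predictions
```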

In terms of training methodology, YOLOv5s employs a regression-based approach that models target detection as a direct prediction task on feature maps, eliminating the need for separate region proposal steps. This is achieved through predefined anchor boxes of different scales and aspect ratios, which reduce the number of parameters and enhance the model’s generalization capability. The training process involves comprehensive data augmentation techniques, such as random cropping, flipping, and color jittering, to expand the dataset and improve robustness. Additionally, YOLOv5s utilizes a multi-part loss function that includes classification loss, regression loss, and Intersection over Union (IoU) loss, which collectively guide the model toward accurate predictions. The classification loss, often implemented as cross-entropy, ensures correct label assignment, while the regression loss, typically using smooth L1 or mean squared error (MSE), refines bounding box coordinates. The IoU loss, which measures the overlap between predicted and ground-truth boxes, is defined as:

$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} $$

This loss encourages precise localization, which is vital for multirotor drones to accurately identify drop zones. Moreover, strategies like transfer learning from pre-trained models and mixed-precision training are employed to accelerate convergence and optimize performance on edge devices. The overall efficiency of YOLOv5s makes it an ideal choice for real-time applications in multirotor drone systems, where low latency and high accuracy are paramount.
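As a minimal illustration of the IoU definition above, the following Python function computes the ratio for two axis-aligned boxes given as (x1, y1, x2, y2) corners; the example box coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.143
```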

To implement the YOLOv5s-based landmark recognition system for multirotor drones, I utilized the Jetson Xavier NX platform, a compact, high-performance embedded AI computer designed for edge computing. This device provides the computational power needed to run deep neural networks in real time, which is essential for onboard processing in multirotor drones. The experimental setup involved configuring the system environment, including updating the operating system, installing dependencies, and setting up CUDA and cuDNN for GPU acceleration. Python 3.6 was used within an Anaconda environment to manage packages and virtual environments, ensuring compatibility and isolation from other projects. Key software tools included PyTorch for model training and LabelImg for data annotation. The dataset consisted of images captured by multirotor drones during simulated airdrop scenarios, focusing on landmarks such as designated drop zones and visual markers. Using LabelImg, I annotated these images in YOLO format, generating bounding boxes and class labels for each landmark. The annotated data was organized into ‘images’ and ‘labels’ subfolders and transferred to the Jetson Xavier NX over SFTP using FileZilla, ensuring secure and efficient data handling.
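For reference, a YOLO-format dataset pairs each image with a same-named text file under parallel folders; the layout, file names, class index, and coordinates shown below are purely illustrative.

```
datasets/airdrop/          # hypothetical dataset root
  images/
    train/  img0001.jpg ...
    val/
  labels/
    train/  img0001.txt ...
    val/

# labels/train/img0001.txt -- one object per line:
# <class_id> <x_center> <y_center> <width> <height>, all normalized to [0, 1]
0 0.512 0.467 0.180 0.215
```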

The training process on Jetson Xavier NX involved several optimizations to enhance model performance and efficiency. I began by creating a dedicated directory for the training data and configuring the dataset paths in a YAML file, which specified the locations of the training and validation sets. The training command was crafted to exploit the hardware capabilities of Jetson Xavier NX, incorporating parameters such as batch size, epoch count, and data augmentation flags. For instance, the batch size was set to 4 to balance memory usage and training speed, while the number of epochs was fixed at 300 to allow sufficient iteration over the dataset. To improve generalization, I enabled label smoothing, which softens the target label distribution to prevent overconfidence, and multi-scale training, which randomly resizes input images to enhance scale invariance. Additionally, rectangular training was activated to maintain aspect ratios during image resizing, reducing distortion and preserving critical features. Mixed-precision training further accelerated the process by performing FP16 computations, exploiting the GPU’s hardware support for half-precision arithmetic. In summary, the optimized training run combined a memory-conscious batch size, label smoothing, multi-scale augmentation, rectangular inputs, and FP16 acceleration. These adjustments were critical for achieving high accuracy in landmark recognition for multirotor drones, as they addressed common challenges like varying lighting conditions and occlusions in aerial imagery.
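A dataset file and training invocation consistent with this configuration might look roughly as follows. The file name `airdrop.yaml`, the class name, and the exact flag combination are assumptions for illustration; the flags themselves (`--batch`, `--epochs`, `--label-smoothing`, `--multi-scale`, `--rect`) exist in YOLOv5’s train.py, and recent YOLOv5 releases enable FP16 mixed precision automatically on compatible GPUs.

```yaml
# airdrop.yaml -- hypothetical dataset configuration
path: ../datasets/airdrop
train: images/train
val: images/val
names:
  0: drop_zone_marker
```

```bash
python train.py --img 640 --batch 4 --epochs 300 \
    --data airdrop.yaml --weights yolov5s.pt \
    --label-smoothing 0.1 --multi-scale --rect
```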

A key aspect of the training involved the loss function components, which were optimized to minimize errors in classification and localization. The total loss function in YOLOv5s can be expressed as a weighted sum of classification loss ($L_{\text{cls}}$), regression loss ($L_{\text{reg}}$), and IoU loss ($L_{\text{IoU}}$):

$$ L_{\text{total}} = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{reg}} + \lambda_3 L_{\text{IoU}} $$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients that balance the contributions of each term. For classification, the binary cross-entropy loss was used, defined as:

$$ L_{\text{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

where $N$ is the number of samples, $y_i$ is the ground-truth label, and $\hat{y}_i$ is the predicted probability. The regression loss employed smooth L1 loss to handle bounding box coordinates:

$$ L_{\text{reg}} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases}
0.5 (x_i - \hat{x}_i)^2 & \text{if } |x_i - \hat{x}_i| < 1 \\
|x_i - \hat{x}_i| - 0.5 & \text{otherwise}
\end{cases} $$

where $x_i$ and $\hat{x}_i$ represent ground-truth and predicted values, respectively. The IoU loss, specifically the Complete IoU (CIoU) variant, was incorporated to penalize not only insufficient overlap but also center-point distance and aspect-ratio mismatch:

$$ L_{\text{IoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b_{\text{gt}})}{c^2} + \alpha v $$

where $\rho$ is the Euclidean distance between the centroids of predicted and ground-truth boxes, $c$ is the diagonal length of the smallest enclosing box, $\alpha$ is a trade-off parameter, and $v$ measures aspect ratio consistency. This comprehensive loss formulation ensured robust training for the multirotor drone application, leading to improved detection accuracy.
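The CIoU penalty terms translate directly into code. The PyTorch sketch below follows the formula above for boxes in (x1, y1, x2, y2) corner format; it is a compact illustration rather than the exact library implementation, and the test boxes are made up.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss: 1 - IoU + center-distance penalty + aspect-ratio penalty."""
    # Intersection and union.
    inter_w = (torch.min(pred[..., 2], target[..., 2]) -
               torch.max(pred[..., 0], target[..., 0])).clamp(0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) -
               torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = inter_w * inter_h
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared center distance (rho^2) over squared enclosing-box diagonal (c^2).
    rho2 = (((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 +
            ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency (v) and its trade-off weight (alpha).
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

p = torch.tensor([[0., 0., 10., 10.]])
t = torch.tensor([[5., 5., 15., 15.]])
print(ciou_loss(p, t))  # exceeds 1 - IoU because of the distance penalty
```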

The performance of the optimized YOLOv5s model was evaluated on a test set comprising images not seen during training, with the mean Average Precision (mAP) serving as the primary metric. mAP is calculated as the average of the Average Precision (AP) values across all classes, where AP is derived from the precision-recall curve. Precision and recall are defined as:

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. AP is then computed as the area under the precision-recall curve:

$$ \text{AP} = \int_0^1 p(r) \, dr $$

and mAP is the mean over all classes:

$$ \text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c $$

where $C$ is the number of classes. In this study, the model achieved a mAP of 93% on the test set, significantly outperforming traditional methods like sliding window and region-based approaches, which typically achieve mAP values below 80% in similar scenarios. This high mAP underscores the effectiveness of YOLOv5s for landmark recognition in multirotor drone airdrop operations, as it demonstrates strong capabilities in handling diverse environmental conditions and target variations.
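As a worked illustration of these metrics, the sketch below ranks detections by confidence, accumulates precision and recall, and integrates p(r) with a simple rectangular rule; the scores and match flags are invented, and standard evaluators (VOC, COCO) use interpolated precision envelopes instead of this raw integration.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores))        # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Rectangular integration of p(r) over the recall increments.
    area = recall[0] * precision[0]
    area += float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
    return area

# Four detections for one class, three of which match a ground-truth box.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], num_gt=4))  # ~0.69
# mAP is then the mean of the per-class AP values.
```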

To provide a comprehensive comparison, I have summarized the key performance metrics and training parameters in the following tables. Table 1 outlines the training configuration used on Jetson Xavier NX, while Table 2 compares the mAP results of YOLOv5s with other target detection methods in the context of multirotor drone applications.

Table 1: Training Configuration for YOLOv5s on Jetson Xavier NX

| Parameter | Value | Description |
| --- | --- | --- |
| Batch Size | 4 | Number of images processed per iteration |
| Epochs | 300 | Complete passes over the training dataset |
| Learning Rate | 0.01 | Initial rate for gradient descent optimization |
| Optimizer | SGD | Stochastic gradient descent with momentum |
| Data Augmentation | Enabled | Includes flipping, cropping, and multi-scale resizing |
| Label Smoothing | 0.1 | Regularization to reduce overconfidence |
| Mixed Precision | FP16 | Accelerates training using half-precision floats |
Table 2: Comparison of mAP for Different Target Detection Methods in Multirotor Drone Landmark Recognition

| Method | mAP (%) | Inference Time (ms) | Suitability for Multirotor Drones |
| --- | --- | --- | --- |
| Sliding Window | 65 | 120 | Low due to high latency |
| Faster R-CNN | 78 | 80 | Moderate, but slow for real-time |
| SSD | 85 | 40 | Good, but less accurate |
| YOLOv5s (Ours) | 93 | 25 | Excellent for real-time applications |

The results indicate that YOLOv5s not only achieves higher accuracy but also offers faster inference times, making it ideal for integration into multirotor drone systems where real-time processing is essential. The optimization steps, including multi-scale training and loss function adjustments, contributed significantly to this performance, enabling the model to generalize well across unseen data. In practical terms, this means that multirotor drones equipped with this system can reliably identify landmarks in varying conditions, such as different lighting or weather, enhancing the precision and safety of material airdrops.

In conclusion, this research demonstrates the efficacy of the YOLOv5s algorithm for landmark recognition in multirotor drone material airdrop operations. By leveraging the computational power of Jetson Xavier NX and implementing optimized training strategies, the model achieved a mAP of 93%, surpassing traditional methods and providing a robust solution for real-time target detection. The use of advanced data augmentation, loss functions, and hardware acceleration ensured that the system could handle the challenges inherent in aerial imagery, such as scale variations and environmental noise. Future work could explore further model compression techniques or integration with other sensors, like LiDAR, to enhance robustness. Overall, this approach lays a strong foundation for improving the autonomy and reliability of multirotor drones in critical applications, ultimately contributing to more efficient and safe airdrop missions.
