CDMG-YOLO: Advancing Small Object Detection in UAV Drone Imagery for Open-Pit Mine Monitoring

In modern open-pit mining operations, the dynamic and complex environment poses significant challenges for real-time monitoring and safety management. Traditional methods, such as manual inspections or GPS tracking, often fall short in providing timely and accurate data due to high labor costs and latency. With the advent of unmanned aerial vehicles (UAV drones), remote sensing has become a pivotal tool for capturing high-resolution imagery of mining sites. UAV drones offer unparalleled flexibility, enabling frequent and detailed aerial surveys that facilitate the tracking of machinery and vehicles. However, analyzing these images is nontrivial, as mechanical targets—like excavators, trucks, and loaders—often appear as small objects with substantial scale variations, cluttered backgrounds, and frequent occlusions. These factors lead to high rates of missed detections and false alarms, hindering effective automation. To address these issues, we propose CDMG-YOLO, a lightweight yet powerful model tailored for small object detection in UAV drone remote sensing images of open-pit mines. Our approach builds upon YOLOv11n, introducing enhancements across feature perception, modeling, fusion, and training optimization to achieve a balance between accuracy, efficiency, and robustness. This work demonstrates that through strategic architectural modifications and loss function refinements, UAV drone-based systems can reliably identify critical machinery even in challenging conditions, paving the way for smarter mine management.

The proliferation of UAV drones in industrial applications has revolutionized data acquisition, but it also demands advanced computer vision algorithms to interpret the captured imagery. In open-pit mines, UAV drones fly over vast areas, collecting images that contain numerous small mechanical targets. These targets are essential for operational oversight, as they indicate equipment utilization, safety compliance, and workflow efficiency. However, detecting them accurately is complicated by several factors: the small size of objects relative to the image frame, low contrast due to dust and shadows, dense clustering of vehicles, and occlusions from terrain or other structures. Existing deep learning models, particularly single-stage detectors like the YOLO series, have shown promise in remote sensing tasks, but they often struggle with small objects and complex backgrounds. For instance, while YOLOv8 and YOLOv9 variants improve multi-scale detection, they may incur high computational costs or lack specialized mechanisms for mining scenarios. Our model, CDMG-YOLO, aims to overcome these limitations by integrating a fine-grained detection head, deformable convolutional attention, dual-attention fusion modules, and a gradient-balanced loss function. We validate its performance on a custom dataset of open-pit mine imagery captured by UAV drones, as well as on a public dataset, showing superior results in both accuracy and efficiency. The following sections detail our methodology, experimental setup, and findings, emphasizing the role of UAV drones in enabling these advancements.

In the feature perception stage, we augment the baseline YOLOv11n by adding a P2 detection head that operates on high-resolution feature maps. This is crucial for small object detection in UAV drone imagery, as shallow layers retain fine-grained spatial details like edges and textures. The P2 head generates a 160×160 pixel feature map, which is combined with backbone features to create a dedicated pathway for small targets. Formally, let \(F_{\text{in}} \in \mathbb{R}^{C \times H \times W}\) be an input feature map, where \(C\), \(H\), and \(W\) denote channels, height, and width, respectively. The P2 head enhances detail extraction through a series of convolutional blocks, outputting a refined feature map \(F_{\text{P2}}\) that emphasizes small object regions. This addition helps mitigate the information bottleneck in traditional feature pyramid networks, ensuring that minute details from UAV drone images are preserved for subsequent processing.
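The stated 160×160 map follows from simple stride arithmetic: with a 640×640 input, a head at stride \(s\) sees a \(640/s \times 640/s\) feature map. A minimal sketch of this relationship (the stride values follow the standard YOLO pyramid convention and are an assumption; only the 160×160 P2 size is stated above):

```python
def head_resolution(input_size: int, stride: int) -> int:
    """Spatial size of the feature map a detection head operates on."""
    return input_size // stride

# Standard YOLO pyramid strides (assumed): P2=4, P3=8, P4=16, P5=32.
# P2 is the added high-resolution head for small objects.
resolutions = {name: head_resolution(640, s)
               for name, s in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]}
# resolutions["P2"] is 160, matching the 160x160 map described above.
```

The four-fold resolution advantage of P2 over the usual coarsest head (160 vs. 20) is what preserves the edges and textures of small machinery.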

For feature modeling, we redesign the C3k2 module by incorporating Deformable Large Kernel Attention (DLKA), resulting in the D-C3k2 module. This module expands the receptive field and improves scale generalization, which is vital for handling the diverse sizes of machinery in UAV drone footage. The DLKA mechanism uses deformable convolutions to adaptively sample spatial locations, allowing the model to focus on irregular or occluded objects. The output of the D-C3k2 module is computed as follows. First, the Bottleneck operation processes an input feature map \(F\):

$$F_{\text{bt}} = \text{CBS}_{1\times1}(\text{CBS}_{3\times3}(F)) \oplus F$$

where \(\text{CBS}\) denotes a convolution-batch normalization-SiLU block, and \(\oplus\) represents element-wise addition. Then, the DLKA module applies deformable depth-wise convolution and attention to produce \(F_{\text{DLKA}}\):

$$F_{\text{DLKA}} = \text{Conv}_{1\times1}(F_{\text{att}} \otimes F'_{\text{bt}}) + F_{\text{bt}}$$

with \(F_{\text{att}} = \text{Conv}_{1\times1}(\text{DDW}_{\text{D-Conv}}(\text{DDW}(F'_{\text{bt}})))\), where \(\text{DDW}\) is depth-wise deformable convolution, \(\otimes\) is element-wise multiplication, and \(F'_{\text{bt}} = \text{GELU}(\text{Conv}_{1\times1}(F_{\text{bt}}))\). Finally, the D-C3k2 output is:

$$F_{\text{D-C3k2}} = \text{CBS}(\text{Concat}(\text{Conv}_{1\times1}(F), F_{\text{DLKA}}))$$

This structure enables dynamic adjustment to object shapes, enhancing feature representation for targets captured by UAV drones in cluttered environments.
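The core of the deformable operation above is sampling features at learned, fractional offsets rather than on the fixed convolution grid, which requires bilinear interpolation. A minimal NumPy sketch of this sampling step for one 3×3 kernel position (the offsets here are stand-ins for the values a real DLKA layer would predict):

```python
import numpy as np

def bilinear_sample(feat: np.ndarray, y: float, x: float) -> float:
    """Bilinear interpolation of a 2-D feature map at a fractional location."""
    H, W = feat.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) + feat[y0, x1] * (1 - wy) * wx
            + feat[y1, x0] * wy * (1 - wx) + feat[y1, x1] * wy * wx)

def deformable_3x3_sample(feat, cy, cx, offsets):
    """Read the nine 3x3 taps around (cy, cx), each shifted by a learned offset.

    offsets: (9, 2) array of (dy, dx) per tap; zeros recover an ordinary 3x3 grid.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.array([bilinear_sample(feat, cy + gy + oy, cx + gx + ox)
                     for (gy, gx), (oy, ox) in zip(grid, offsets)])

feat = np.arange(25, dtype=float).reshape(5, 5)
taps = deformable_3x3_sample(feat, 2, 2, np.zeros((9, 2)))  # zero offsets = plain 3x3
```

Because the offsets are continuous, the sampling grid can bend around an occluder or stretch along an elongated excavator arm, which is what "dynamic adjustment to object shapes" amounts to in practice.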

In the feature fusion stage, we introduce a dual-attention mechanism comprising the Convolution and Attention Fusion Module (CAFM) and the Multidimensional Collaborative Attention Module (MCAM). These modules facilitate multidimensional interaction and adaptive weighting of features across channel and spatial dimensions, helping to suppress background noise and highlight targets. CAFM combines global and local branches: the global branch uses depth-wise convolution and attention to model long-range dependencies, while the local branch employs channel shuffle and group convolution to capture fine structural details. For an input \(F_c\), the global branch output \(F_g\) is:

$$F_g = \text{Conv}_{1\times1}(F_{\text{att}}(\text{DConv}_{3\times3}(\text{Conv}_{1\times1}(F_c)))) \oplus F_c$$

where \(F_{\text{att}}\) is derived from a modified attention mechanism: \(F_{\text{att}} = V \otimes \text{Softmax}(K \otimes Q / \alpha)\), with \(Q, K, V\) as query, key, and value matrices, and \(\alpha\) a learnable scaling factor. The local branch output \(F_p\) is:

$$F_p = \text{Conv}_{3\times3\times3}(\text{CS}(\text{Conv}_{1\times1}(F_c)))$$

where \(\text{CS}\) denotes channel shuffle. The fused output from CAFM integrates both global context and local details, which is essential for distinguishing machinery from complex backgrounds in UAV drone images.
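The two CAFM branch operations above can be sketched in NumPy. The global-branch attention follows the element-wise formula \(F_{\text{att}} = V \otimes \text{Softmax}(K \otimes Q / \alpha)\) as written (normalizing the softmax over the channel axis is an assumption; the text does not state the axis), and channel shuffle interleaves channels across groups so the subsequent group convolution can mix information between them:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def global_branch_attention(q, k, v, alpha=1.0):
    """F_att = V (x) Softmax((K (x) Q) / alpha), all products element-wise.

    q, k, v: (C, H, W) projections of the input; alpha is the learnable scale.
    """
    return v * softmax((k * q) / alpha, axis=0)

def channel_shuffle(x, groups):
    """Local-branch CS step: reshape channels to (groups, C/groups),
    transpose, and flatten back, interleaving channels across groups."""
    C, H, W = x.shape
    return x.reshape(groups, C // groups, H, W).transpose(1, 0, 2, 3).reshape(C, H, W)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8, 8)) for _ in range(3))
f_att = global_branch_attention(q, k, v)

x = np.arange(6, dtype=float).reshape(6, 1, 1)  # channels 0..5
shuffled = channel_shuffle(x, groups=2)         # order becomes 0,3,1,4,2,5
```

The shuffle is a fixed permutation with no parameters, which is why the local branch stays cheap while still exchanging information between channel groups.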

MCAM further enhances feature discrimination through parallel attention along channel, height, and width dimensions. It computes statistics via global average and standard deviation pooling, followed by lightweight convolutions to generate attention weights. The final output \(F_{\text{out}}\) is an average of the three branch outputs:

$$F_{\text{out}} = \frac{1}{3}(F_W \oplus F_H \oplus F_C)$$

where \(F_W, F_H, F_C\) are features refined along width, height, and channel axes, respectively. This collaborative attention mechanism ensures robust feature representation for small, densely packed objects typical in UAV drone surveys of mining sites.
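A minimal NumPy sketch of this three-branch scheme: each branch pools global average and standard-deviation statistics along one axis, turns them into gating weights, and the three reweighted maps are averaged as in the formula above. The sigmoid gate and the simple sum of the two statistics are assumptions; the text only specifies the pooled statistics and lightweight convolutions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def axis_attention(x: np.ndarray) -> np.ndarray:
    """One MCAM branch: avg + std pooling over the last two axes of (C, H, W),
    then per-leading-axis gating weights in (0, 1)."""
    avg = x.mean(axis=(1, 2))       # global average pooling
    std = x.std(axis=(1, 2))        # global standard-deviation pooling
    w = sigmoid(avg + std)          # gating weights (assumed nonlinearity)
    return x * w[:, None, None]

def mcam(x: np.ndarray) -> np.ndarray:
    """F_out = (F_W + F_H + F_C) / 3: run the same branch along each axis
    by permuting the target axis into the leading slot and back."""
    f_c = axis_attention(x)
    f_h = axis_attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)
    f_w = axis_attention(x.transpose(2, 1, 0)).transpose(2, 1, 0)
    return (f_w + f_h + f_c) / 3.0

x = np.random.default_rng(1).standard_normal((4, 8, 8))
out = mcam(x)
```

Because every gate lies in (0, 1), the module can only attenuate features, never amplify or flip them; discrimination comes from attenuating background axes more strongly than target axes.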

For training optimization, we address class imbalance and gradient issues common in UAV drone datasets by combining the original distribution focal loss from YOLOv11n with the Gradient Harmonizing Mechanism Loss (GHM Loss). The composite loss function is defined as:

$$L = \gamma_1 L_{\text{DFL}} + \gamma_2 L_{\text{GHM}}$$

where \(\gamma_1\) and \(\gamma_2\) are learnable coefficients summing to 1. GHM Loss reweights gradients based on gradient density, giving more emphasis to hard samples (e.g., small or occluded machinery) and alleviating the dominance of easy samples. This balance improves model convergence and detection performance for challenging cases in UAV drone imagery.
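The gradient-harmonizing idea can be sketched for a binary classification term: each sample's gradient norm \(g = |p - t|\) is binned, the gradient density is the count per bin, and each sample is weighted by the inverse of its bin's density, so the flood of easy samples is down-weighted relative to rare hard ones. A NumPy sketch under those standard GHM assumptions:

```python
import numpy as np

def ghm_weights(pred: np.ndarray, target: np.ndarray, bins: int = 10) -> np.ndarray:
    """GHM reweighting: weight_i = N / GD(g_i).

    pred:   sigmoid probabilities in [0, 1]
    target: binary labels {0, 1}
    g = |pred - target| is the gradient norm; GD(g) counts the samples
    whose g falls in the same bin.
    """
    g = np.abs(pred - target)
    edges = np.linspace(0.0, 1.0, bins + 1)
    edges[-1] += 1e-6                           # include g == 1.0 in the last bin
    idx = np.digitize(g, edges) - 1             # bin index per sample
    density = np.bincount(idx, minlength=bins)  # GD per bin
    return len(g) / density[idx].astype(float)  # rare (hard) samples get large weights

# Eight easy samples (g ~ 0.05) and two hard ones (g ~ 0.95):
pred = np.array([0.05] * 8 + [0.95, 0.96])
target = np.zeros(10)
w = ghm_weights(pred, target)  # easy samples -> 10/8, hard samples -> 10/2
```

These weights then multiply the per-sample loss terms before averaging, which is how small or occluded machinery, whose gradients sit in sparse bins, gains influence during training.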

To validate CDMG-YOLO, we conducted experiments on a self-built dataset of open-pit mine images acquired by UAV drones. The dataset includes 1,905 orthophotos from 39 mines, captured using an M350 RTK multi-rotor UAV drone equipped with a 102S V3 five-lens oblique camera. After preprocessing—such as cropping to 640×640 pixels, data augmentation with rotation, flipping, and noise addition—the dataset expanded to 4,554 images. These were annotated with three classes: mining equipment, transport vehicles, and other equipment. We split the data into training (3,187 images), validation (910 images), and test (457 images) sets. Experiments were run on a server with an NVIDIA RTX 3090 GPU, using PyTorch 2.2.1 and Python 3.9. Training parameters included 300 epochs, a batch size of 10, an initial learning rate of 0.01, and specific gains for box, class, and distribution losses.

| Parameter | Value |
| --- | --- |
| Training epochs | 300 |
| Batch size | 10 |
| Input image size | 640×640 |
| Initial learning rate | 0.01 |
| Box loss gain | 7.5 |
| Class loss gain | 0.5 |
| Distribution loss gain | 1.2 |
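For reproduction, these hyperparameters map onto a YOLO-style training configuration; a sketch as a Python dict (the key names follow common Ultralytics conventions and are an assumption, not quoted from the paper):

```python
# Training configuration from the table above; key names follow common
# Ultralytics-style conventions (an assumption, not quoted from the paper).
train_cfg = {
    "epochs": 300,   # training epochs
    "batch": 10,     # batch size
    "imgsz": 640,    # input image size (640x640)
    "lr0": 0.01,     # initial learning rate
    "box": 7.5,      # box loss gain
    "cls": 0.5,      # class loss gain
    "dfl": 1.2,      # distribution (DFL) loss gain
}
```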

Ablation studies were performed to assess the contribution of each component in CDMG-YOLO. As the table below summarizes, individual modules such as the P2 head, D-C3k2, and the dual-attention mechanisms each improve specific metrics, and their combination yields the best precision, recall, and mAP@0.5:0.95. CDMG-YOLO achieves a precision of 0.859, a recall of 0.764, an mAP@0.5 of 0.697, and an mAP@0.5:0.95 of 0.547, with only 3.1×10^6 parameters and 12.7 GFLOPs. This underscores its lightweight design suited for deployment on UAV drone platforms or edge devices.

| Model | Parameters (10^6) | GFLOPs | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv11n | 2.6 | 6.3 | 0.759 | 0.612 | 0.690 | 0.383 |
| YOLOv11n + P2 | 2.9 | 10.1 | 0.786 | 0.604 | 0.652 | 0.417 |
| YOLOv11n + D-C3k2 | 2.5 | 5.7 | 0.813 | 0.656 | 0.777 | 0.477 |
| YOLOv11n + CAFM + MCAM | 3.0 | 8.7 | 0.791 | 0.650 | 0.695 | 0.492 |
| YOLOv11n + GHM Loss | 2.6 | 6.4 | 0.712 | 0.701 | 0.617 | 0.447 |
| CDMG-YOLO (Full) | 3.1 | 12.7 | 0.859 | 0.764 | 0.697 | 0.547 |

Comparative experiments with state-of-the-art models further demonstrate the superiority of CDMG-YOLO. We tested against YOLOv5n, YOLOv8n, YOLOv8x, BGF-YOLOv10, FFCA-YOLO, ATBHC-YOLO, and DETR on our UAV drone dataset. As shown in the table below, CDMG-YOLO outperforms all lightweight models in precision, recall, and mAP metrics, while maintaining low computational costs. Notably, it surpasses YOLOv8x in recall and mAP@0.5 with only about 1/20 of the GFLOPs, highlighting its efficiency for real-time applications on UAV drones.

| Model | Parameters (10^6) | GFLOPs | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv5n | 1.8 | 4.5 | 0.475 | 0.561 | 0.311 | 0.196 |
| YOLOv8n | 3.1 | 8.3 | 0.687 | 0.506 | 0.612 | 0.346 |
| YOLOv8x | 68.3 | 251.3 | 0.813 | 0.595 | 0.632 | 0.493 |
| BGF-YOLOv10 | 2.0 | 8.5 | 0.685 | 0.547 | 0.351 | 0.353 |
| FFCA-YOLO | 4.5 | 17.4 | 0.769 | 0.689 | 0.559 | 0.397 |
| ATBHC-YOLO | 36.8 | 167.2 | 0.835 | 0.618 | 0.587 | 0.401 |
| DETR | 41.2 | 86.3 | 0.785 | 0.727 | 0.503 | 0.394 |
| CDMG-YOLO | 3.1 | 12.7 | 0.859 | 0.764 | 0.697 | 0.547 |

To evaluate robustness in complex scenarios typical of open-pit mines, we tested CDMG-YOLO on images with low contrast, severe occlusions, and dense target distributions. The model consistently achieved accurate localization and identification, thanks to its enhanced feature perception and fusion mechanisms. For instance, in low-light conditions common during UAV drone flights at dusk or under dust clouds, the D-C3k2 module’s large receptive field helped capture faint object boundaries. In occluded cases, the dual-attention modules prioritized visible parts, enabling reliable inference. These capabilities are crucial for UAV drone-based monitoring systems that must operate in unpredictable environments.

Generalization ability was assessed on the public LEVIR dataset, which includes small objects like aircraft, ships, and oil tanks. CDMG-YOLO achieved an average precision of 0.918, recall of 0.868, and mAP@0.5 of 0.936, outperforming YOLOv11n across all categories. This indicates that the model’s design principles—such as fine-grained detection heads and multidimensional attention—transfer well to other remote sensing contexts, reinforcing its utility for diverse UAV drone applications. The results, summarized below, show consistent improvements, particularly for aircraft detection where mAP@0.5 reached 0.956.

| Category | CDMG-YOLO Precision | CDMG-YOLO Recall | CDMG-YOLO mAP@0.5 | YOLOv11n mAP@0.5 |
| --- | --- | --- | --- | --- |
| Aircraft | 0.948 | 0.907 | 0.956 | 0.927 |
| Ship | 0.884 | 0.823 | 0.912 | 0.851 |
| Oil tank | 0.915 | 0.874 | 0.939 | 0.903 |
| Average | 0.918 | 0.868 | 0.936 | 0.911 |

In conclusion, CDMG-YOLO represents a significant step forward in small object detection for UAV drone imagery of open-pit mines. By integrating a P2 detection head for detail enhancement, a D-C3k2 module for adaptive context modeling, CAFM and MCAM for multidimensional feature fusion, and a GHM-based loss for balanced training, the model achieves high accuracy while remaining lightweight. Experimental results on both custom and public datasets confirm its effectiveness in challenging conditions, such as low contrast, occlusion, and dense clusters. The use of UAV drones for data acquisition is central to this work, as they provide the high-resolution images necessary for detecting small machinery. With only 3.1 million parameters and 12.7 GFLOPs, CDMG-YOLO is suitable for deployment on resource-constrained platforms, including onboard UAV drone systems or edge computing devices. Future work could explore real-time implementation on UAV drones for live monitoring, or extension to other industrial applications. Overall, this research demonstrates that advanced deep learning models can harness the power of UAV drone remote sensing to improve safety and efficiency in mining operations, contributing to the broader goal of autonomous industrial surveillance.

The success of CDMG-YOLO hinges on its synergistic components, each addressing specific challenges in UAV drone imagery. For example, the P2 head directly tackles the small object problem by leveraging high-resolution features, which is essential when UAV drones capture vast areas with tiny targets. The D-C3k2 module’s deformable convolutions mimic the flexibility of human vision, adjusting to irregular shapes often found in mining equipment. Meanwhile, the dual-attention mechanisms filter out irrelevant background information, a common issue in UAV drone photos due to terrain clutter. The GHM Loss ensures that the model learns effectively from hard samples, such as partially hidden vehicles, which are prevalent in dynamic mine sites. These innovations collectively enable robust detection, making UAV drone-based systems more reliable for practical use.

From a broader perspective, the integration of UAV drones with AI models like CDMG-YOLO opens new avenues for smart mining. UAV drones can autonomously patrol sites, capturing real-time footage that is analyzed on-site or transmitted to cloud servers. This reduces human intervention, lowers costs, and enhances response times to incidents. Moreover, the lightweight nature of our model allows it to run on embedded systems mounted on UAV drones, facilitating immediate decision-making. As UAV drone technology evolves with better sensors and longer flight times, the demand for efficient detection algorithms will only grow. CDMG-YOLO’s design principles—emphasizing accuracy, efficiency, and adaptability—set a benchmark for future developments in this field.

In summary, we have presented CDMG-YOLO, a novel model tailored for small mechanical target detection in UAV drone remote sensing images of open-pit mines. Its architectural enhancements and optimized loss function address key limitations in existing methods, yielding superior performance on multiple metrics. The model’s lightweight profile ensures compatibility with UAV drone platforms, promoting real-world deployment. Through extensive experiments, we validate its efficacy in complex scenarios and its generalization across datasets. This work underscores the transformative potential of combining UAV drones with advanced computer vision, paving the way for more intelligent and automated mining operations. As industries increasingly adopt UAV drones for monitoring, models like CDMG-YOLO will be instrumental in unlocking their full potential, driving progress toward safer and more efficient resource management.
