The intensive exploitation of coal resources, particularly in the ecologically fragile western regions of China, has induced severe and widespread ground fissures through subsidence over shallow-buried coal seams. These fissures pose significant threats, including facilitating the spontaneous combustion of residual coal in goaf areas and increasing the risk of water inrushes during rainy seasons. Timely and accurate detection and mapping of these fissures are therefore critical for ensuring mining safety and implementing ecological remediation. Traditional manual field surveys are inefficient, labor-intensive, and hazardous over large, complex terrains. While satellite remote sensing and other geomatics technologies offer broad coverage, they often lack the spatial resolution and temporal frequency required for monitoring dynamic, fine-scale ground deformation. In this context, unmanned aerial vehicles (UAVs) equipped with high-resolution cameras have emerged as a powerful, flexible, and cost-effective tool for rapid data acquisition over mining areas.

Computer vision techniques, especially deep learning, have revolutionized the automatic interpretation of UAV imagery. While semantic segmentation networks can classify fissures at the pixel level, they fail to distinguish between individual fissure instances, which is essential for quantifying parameters such as length, width, and spatial distribution. Object detection models can localize individual fissures with bounding boxes but do not provide precise pixel-wise contours. Instance segmentation, which combines object detection with pixel-level mask prediction, is ideally suited to this task. Among various architectures, the YOLO (You Only Look Once) series, particularly YOLOv8, is renowned for its excellent balance between speed and accuracy. However, its default convolutional backbone may struggle to capture the long-range dependencies and multi-scale features of fissures, which are often fragmented, thin, and embedded in complex backgrounds with noise from vegetation and shadows.
This work proposes an enhanced YOLOv8-based instance segmentation model specifically designed for detecting and segmenting mining-induced ground fissures from high-resolution UAV imagery. The core innovation lies in replacing the original backbone of YOLOv8 with a Pyramid Vision Transformer (PVT) structure. The PVT backbone excels at learning multi-scale and high-resolution feature representations through a hierarchical transformer encoder, thereby significantly improving the model’s ability to recognize the intricate geometric patterns and long, discontinuous nature of ground fissures. To validate the proposed method, a dedicated dataset named HYCdata was constructed from UAV surveys conducted over a typical mining area. Comparative experiments demonstrate that the modified model outperforms the original YOLOv8 and other contemporary lightweight YOLO variants, offering a robust and automated solution for UAV-based fissure monitoring.
Dataset Construction: HYCdata
The performance of deep learning models is heavily dependent on the quality and relevance of the training data. For this study, a custom dataset, HYCdata, was created to address the specific characteristics of mining-induced fissures. Data were collected using a DJI Matrice 300 RTK UAV equipped with a Zenmuse P1 high-precision photogrammetry camera. The flight missions were conducted over active mining panels at a relative altitude of 50 meters, ensuring both safety and the acquisition of imagery with a ground sampling distance (GSD) of 1-2 cm. This high resolution is crucial for identifying narrow fissures.
The collected imagery was processed to generate Digital Orthophoto Maps (DOMs). The original DOMs, with dimensions of 10,000 x 10,000 pixels, were too large for direct model input. A data preparation pipeline was implemented:
- Annotation: Using the LabelMe tool, expert annotators meticulously outlined the boundaries of individual ground fissures in the DOMs to create instance segmentation masks.
- Tiling with Overlap: To augment the dataset and ensure that fissures were not truncated at image boundaries, the large DOMs were cropped into smaller patches. A sliding window approach with a 50% overlap was used.
- Standardization: All resulting image patches were resized to a uniform dimension of 640 x 640 pixels, which is the standard input size for YOLOv8 models.
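The tiling step above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `tile_image` and the use of NumPy are assumptions, and the pre-resize patch size is illustrative.

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 1280, overlap: float = 0.5):
    """Crop a large DOM array into overlapping square patches.

    A sliding window with `overlap` (0.5 = 50% overlap) ensures a fissure
    truncated at one patch border appears whole in a neighboring patch.
    Patches would subsequently be resized to 640 x 640 for model input.
    """
    stride = int(tile * (1 - overlap))
    h, w = img.shape[:2]
    patches = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            patches.append(img[y:y + tile, x:x + tile])
    return patches
```

With 50% overlap the stride is half the tile size, so each interior fissure pixel is seen in up to four patches.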
This process resulted in the HYCdata dataset comprising 4,382 annotated image samples. The dataset was subsequently split into training, validation, and test sets of 2,752, 1,187, and 443 images, respectively (approximately 63%, 27%, and 10% of the total). Key characteristics of the HYCdata dataset are summarized in the table below.
| Parameter | Specification |
|---|---|
| Source | UAV Drone (DJI M300 + P1) |
| Original Resolution | ~1-2 cm/pixel |
| Original Format | Digital Orthophoto Map (DOM) |
| Annotation Type | Instance Segmentation Masks |
| Final Image Size | 640 x 640 pixels |
| Total Samples | 4,382 |
| Training Set | 2,752 samples |
| Validation Set | 1,187 samples |
| Test Set | 443 samples |
Methodology: The Enhanced YOLOv8-PVT Model
The proposed methodology centers on modifying the YOLOv8 instance segmentation architecture to better handle the challenges posed by UAV drone imagery of ground fissures. YOLOv8 follows an anchor-free design and integrates the tasks of bounding box regression, classification, and mask prediction into a single, efficient network. Its neck employs a Path Aggregation Network (PANet) and Feature Pyramid Network (FPN) for multi-scale feature fusion, while the head produces the final outputs.
The standard YOLOv8 backbone, based on the CSPDarknet architecture, relies heavily on stacked convolutional layers. While effective for many objects, convolutional neural networks (CNNs) have a fundamental limitation: their receptive fields are local. Capturing long-range contextual information requires deepening the network, which can lead to optimization difficulties and loss of fine-grained details. For the thin, sinuous, and discontinuous fissures observed in UAV surveys, understanding the global context (e.g., connecting broken segments) is as important as recognizing local edges.
To overcome this, we integrate a Pyramid Vision Transformer (PVT) as the new backbone. The PVT retains the powerful hierarchical design of CNNs (producing feature maps at multiple scales) but replaces the core convolutional operations with Transformer encoders. The Transformer’s self-attention mechanism allows each pixel (or patch) to interact with all other pixels in the feature map, enabling the model to build global dependencies. This is formulated by the Multi-Head Self-Attention (MHSA) operation:
$$ \text{MHSA}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O $$
$$ \text{where head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
$$ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Here, \(Q\), \(K\), and \(V\) are the query, key, and value matrices derived from the input features. The scaling factor \(\sqrt{d_k}\) stabilizes gradients.
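The core Attention term can be written directly from the formula above. A minimal NumPy sketch (the helper names are illustrative):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ v
```

Multi-head attention runs \(h\) such operations on learned projections of \(Q\), \(K\), and \(V\) and concatenates the results, as in the MHSA equation.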
A direct application of standard Transformers to dense prediction tasks is computationally prohibitive for high-resolution UAV images. The PVT introduces a Spatial-Reduction Attention (SRA) layer to address this. Before computing attention, SRA reduces the spatial dimensions of the Key (K) and Value (V) matrices, drastically cutting computational cost while preserving the global modeling capability. The PVT backbone constructs a four-stage feature pyramid \(\{F_1, F_2, F_3, F_4\}\) with progressively increasing channel dimensions and decreasing spatial resolutions (e.g., \(F_1\) at \(H/4 \times W/4\)). This multi-scale feature extraction is vital for fissures of the varying widths and lengths captured by UAVs.
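A minimal PyTorch sketch of the spatial-reduction idea follows. The class name, the reduction ratio, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the exact PVT implementation:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Self-attention whose keys/values come from a spatially downsampled
    copy of the feature map, cutting attention cost by roughly sr_ratio**2."""

    def __init__(self, dim: int, num_heads: int = 1, sr_ratio: int = 4):
        super().__init__()
        # Strided convolution shrinks the H x W grid before the K/V projections.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C) token sequence for an H x W feature map.
        b, n, c = x.shape
        x2d = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.norm(self.sr(x2d).flatten(2).transpose(1, 2))  # (B, N/sr^2, C)
        out, _ = self.attn(x, kv, kv, need_weights=False)        # queries keep full resolution
        return out
```

Queries retain the full token count, so the output stays at the input resolution; only the attended key/value sequence is shortened.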
By replacing the first eight layers of the original CSPDarknet (including Conv and C2f modules) with corresponding PVT stages (Patch Embedding + Transformer Encoder blocks), we create the enhanced YOLOv8-PVT model. This hybrid architecture leverages the global contextual modeling of Transformers and the efficient, multi-scale fusion design of YOLOv8, making it particularly adept at segmenting complex ground fissure networks from UAV imagery.
Experimental Setup and Evaluation Metrics
All experiments were conducted on a system with an Intel Core i9-14900K CPU and an NVIDIA RTX 4080 GPU, using the PyTorch framework. To ensure a fair and reproducible comparison, all models were trained from scratch with random initialization (no pre-trained weights) on the HYCdata dataset. The training configuration was unified: input size of 640×640, batch size of 4, initial learning rate of 0.01, and training for up to 1000 epochs with an early stopping patience of 100 epochs based on validation performance.
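For reference, the unified settings map onto the following configuration fragment (the key names follow the Ultralytics training-argument convention, which is an assumption since the paper does not name its training wrapper):

```yaml
imgsz: 640        # input size 640 x 640
batch: 4
lr0: 0.01         # initial learning rate
epochs: 1000      # upper bound on training length
patience: 100     # early stopping on validation performance
pretrained: false # all models trained from scratch
```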
We compared the proposed YOLOv8-PVT model against several state-of-the-art, lightweight YOLO variants to benchmark its performance. The compared models included the original YOLOv8n (nano version), YOLOv9t (tiny), YOLOv10n, and YOLOv11n. For YOLOv10, which does not have an official instance segmentation variant, its core components were integrated into the YOLOv8 segmentation head to maintain functional parity.
The performance was evaluated using standard metrics for object detection and instance segmentation:
- Precision (P) and Recall (R): Measure the correctness and completeness of detected fissures.
$$ P = \frac{TP}{TP + FP} \times 100\%, \quad R = \frac{TP}{TP + FN} \times 100\% $$
where \(TP\), \(FP\), and \(FN\) are true positives, false positives, and false negatives, respectively.
- Average Precision (AP): The area under the Precision-Recall curve. We report \(AP_{0.5}\) (AP at an Intersection over Union (IoU) threshold of 0.5) and \(AP_{0.95}\) (AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05).
- Mean Average Precision (mAP): The mean of AP over all classes (in our case, one class: fissure). We report \(mAP^{B}_{0.5}\) and \(mAP^{B}_{0.95}\) for bounding box performance, and \(mAP^{M}_{0.5}\) and \(mAP^{M}_{0.95}\) for mask segmentation performance.
- Model Complexity: Measured by Giga Floating Point Operations (GFLOPs) and the number of parameters (Params/M).
The IoU, central to these metrics, is calculated as:
$$ IoU = \frac{Area\ of\ Overlap}{Area\ of\ Union} = \frac{TP}{TP + FP + FN} $$
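As a sanity check, the pixel-level counterparts of these formulas can be computed directly from binary masks (a small sketch; the function name is illustrative):

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-level precision, recall, and IoU for binary fissure masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # fissure pixels correctly predicted
    fp = np.sum(pred & ~gt)   # background predicted as fissure
    fn = np.sum(~pred & gt)   # fissure pixels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou
```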
Results and Analysis
The quantitative results on the HYCdata test set are presented in the table below. The proposed YOLOv8-PVT model demonstrates superior performance across almost all metrics compared to the baseline YOLOv8n and other contemporary models.
| Model | P^B (%) | R^B (%) | mAP^B_0.5 (%) | mAP^B_0.95 (%) | mAP^M_0.5 (%) | mAP^M_0.95 (%) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|
| YOLOv9t | 65.7 | 62.2 | 65.5 | 40.0 | 58.7 | 21.1 | 7.9 | 2.01 |
| YOLOv10n | 70.5 | 63.1 | 68.3 | 41.6 | 62.8 | 21.9 | 11.7 | 2.84 |
| YOLOv11n | 70.2 | 64.2 | 69.3 | 42.8 | 63.5 | 23.1 | 10.2 | 2.84 |
| YOLOv8n (Baseline) | 73.3 | 64.7 | 70.9 | 45.0 | 66.3 | 24.5 | 12.0 | 3.26 |
| YOLOv8-PVT (Ours) | 75.8 | 68.1 | 74.3 | 48.9 | 69.3 | 25.8 | 17.7 | 5.57 |
The key findings are as follows:
- Performance Gain: Our YOLOv8-PVT model achieves an mAP^B_0.5 of 74.3% and an mAP^M_0.5 of 69.3%, representing significant improvements of +3.4 and +3.0 percentage points over the baseline YOLOv8n, respectively. The gains in the more stringent mAP^B_0.95 and mAP^M_0.95 metrics are also notable (+3.9 and +1.3 pp), indicating better performance at higher IoU thresholds, which correlates with more precise localization and segmentation.
- Precision and Recall: The model exhibits the highest Precision (75.8%) and Recall (68.1%) for bounding box detection, meaning it identifies more true fissures in UAV imagery while generating fewer false alarms than the other models.
- Complexity Trade-off: The performance improvement comes with an expected increase in computational cost. The GFLOPs increase from 12.0 to 17.7, and parameters rise from 3.26M to 5.57M. This trade-off is justified for the task, as the enhanced capability to process complex UAV-captured scenes is critical, and the model size remains within a manageable range for potential deployment.
- Qualitative Superiority: Visual inspection of the segmentation results reveals the model’s strengths. In areas with mixed rock and soil, the PVT-enhanced backbone allows the model to better discern subtle fissures from natural rock textures and striations. More importantly, in scenes with partial vegetation cover—a common challenge in UAV surveys—the model’s global attention mechanism helps associate disconnected fissure segments obscured by grass or shadows, reducing the instances of under-segmentation (missed fissures) that are evident in the results of the other models.
The success of the YOLOv8-PVT model can be attributed to the PVT backbone’s ability to construct feature representations that are both locally attentive and globally coherent. While the standard CNN backbone of the baseline model might break a long fissure into separate, unconnected detections due to its local processing, the Transformer’s self-attention in the PVT backbone helps maintain the perceptual continuity of such linear features across the image. This is paramount for accurately mapping the extensive networks of ground fissures typically captured by wide-area UAV surveys.
Conclusion
This work addresses the critical need for automated, precise monitoring of mining-induced ground fissures using UAV technology. We presented an enhanced instance segmentation model by integrating a Pyramid Vision Transformer (PVT) backbone into the YOLOv8 architecture. This modification equips the model with superior multi-scale feature learning and global contextual understanding, which are essential for segmenting the thin, elongated, and often fragmented fissures found in complex mining landscapes. The construction of the HYCdata dataset from real UAV surveys provided a relevant and challenging benchmark for training and evaluation.
Experimental results demonstrate that the proposed YOLOv8-PVT model outperforms the original YOLOv8 and other recent lightweight YOLO variants, achieving an mAP^B_0.5 of 74.3%. It shows particular robustness in scenarios with intricate backgrounds and partial occlusions, common in UAV imagery. The model provides not just bounding boxes but precise pixel-wise masks for each fissure instance, enabling the direct extraction of morphological parameters such as length, average width, and orientation distribution. This output is invaluable for quantitative risk assessment, guiding remediation efforts such as landfilling, and studying the spatiotemporal evolution of ground deformation.
Future work will focus on further optimizing the model’s efficiency for real-time onboard processing and extending the methodology to multi-temporal UAV image stacks for dynamic fissure monitoring and growth analysis. The integration of this instance segmentation pipeline with UAV platforms represents a significant step towards intelligent and automated geohazard monitoring in mining areas.
