The sustained demand for coal resources in China, particularly with the intensified development of coal basins in the western regions, has brought to the forefront significant geotechnical and environmental challenges. Intensive extraction of shallow-buried coal seams in these arid and semi-arid areas frequently induces widespread ground fissures. These mining-induced discontinuities severely damage the surface ecological fabric and pose substantial secondary hazards, such as spontaneous combustion of residual coal in goafs and water inrushes during precipitation events. Timely and accurate detection, along with quantitative characterization of these fissures, is therefore a critical prerequisite for effective land reclamation and mining hazard mitigation.
Traditional ground-based inspection methods are not only labor-intensive and inefficient but also pose safety risks in unstable post-mining landscapes. While remote sensing technologies such as Synthetic Aperture Radar (SAR) and high-resolution satellite imagery have been applied, they often suffer from limitations in temporal resolution, spatial detail, or cost-effectiveness for frequent, site-specific monitoring. In contrast, unmanned aerial vehicles (UAVs) have emerged as a transformative tool for geohazard monitoring. Their flexibility, ability to capture very high-resolution imagery at low cost, and short redeployment cycles make them ideally suited to rapidly evolving environments such as active mining areas. UAV imagery provides an unprecedented level of detail, allowing for the visual identification of fissure patterns that are otherwise indiscernible.

The advent of deep learning has further unlocked the potential of UAV imagery for automated analysis. Semantic segmentation networks like U-Net and its variants have shown promise in pixel-wise crack identification. However, they treat all pixels of the same class as one entity, failing to distinguish between individual fissure instances—a crucial step for calculating per-fissure morphological parameters like length, width, and orientation. Object detection models, such as the YOLO series, can locate individual fissures but provide only bounding boxes, lacking the precise contour information needed for detailed geometric analysis. Instance segmentation, which combines the strengths of both by detecting objects and segmenting their exact pixel-wise masks, presents the most suitable framework. Yet, its application has been predominantly focused on civil infrastructure like road cracks, with a notable gap in research dedicated to the unique complexities of mining-induced ground fissures in natural, cluttered terrains.
This article addresses this gap by proposing an advanced instance segmentation methodology specifically designed for mining-induced ground fissures in UAV imagery. The core innovation lies in the architectural enhancement of the state-of-the-art YOLOv8 instance segmentation model. We integrate a Pyramid Vision Transformer (PVT) module into the backbone network, empowering the model with superior multi-scale feature learning and global contextual understanding capabilities. This is essential for handling the dense, discontinuous, and scale-varying nature of fissures against complex backgrounds of soil, rock, and sparse vegetation. To train and validate this model, a dedicated dataset named HYCdata was curated from high-resolution UAV surveys conducted over a coal mine in the arid and semi-arid region of Western China. Comprehensive experiments demonstrate that our proposed model significantly outperforms the baseline YOLOv8 and other contemporary lightweight YOLO variants in accurately segmenting individual ground fissure instances.
1. Introduction and Background
The exploitation of shallow coal seams, especially in ecologically fragile Western China, induces rapid and extensive ground movement. The overlying strata, often comprising weakly cemented sandstones and soils, readily fracture under the influence of underground mining activities, leading to the propagation of cracks to the surface. These surface manifestations, known as mining-induced ground fissures, are not merely cosmetic defects. They act as direct conduits connecting the surface atmosphere with the subsurface goaf, facilitating oxygen ingress that can trigger and sustain coal seam fires—a persistent and polluting hazard. Furthermore, during rain events, these fissures channel surface water downward, potentially leading to sudden and dangerous water accumulation in underground workings. Beyond safety, fissures disrupt soil integrity, accelerate water and soil loss, and severely hinder the natural process of ecological restoration in mining areas.
The conventional paradigm for fissure mapping relies heavily on field surveys conducted by geological engineers. This approach is inherently slow, subjective, and struggles to achieve complete spatial coverage, particularly in remote or topographically challenging terrain. The integration of UAVs into geomatics has revolutionized geospatial data acquisition. Equipped with high-resolution optical sensors, UAVs can systematically capture centimeter-level imagery, generating detailed Digital Orthophoto Maps (DOMs) and Digital Surface Models (DSMs) of vast mining-affected areas in a single flight mission. The agility of UAVs allows for rapid response monitoring after significant mining advance or seismic events, providing time-series data critical for understanding the dynamic evolution of surface damage.
The bottleneck, however, has shifted from data acquisition to data interpretation. Manually delineating thousands of fissures from square kilometers of high-resolution imagery is prohibitively time-consuming. Early automated methods utilized image processing techniques like edge detection and thresholding, but their performance was highly sensitive to lighting variations, soil color, and vegetative cover. The rise of deep learning, particularly Convolutional Neural Networks (CNNs), offered a more robust solution. Models like U-Net, with an encoder-decoder structure and skip connections, became the de facto standard for semantic segmentation of cracks, learning to associate low-level edge features with high-level contextual information.
However, for the specific task of monitoring mining subsidence damage, instance-level information is paramount. Land managers need to know not just *where* fissured areas are, but *how many* distinct fissures exist, their individual sizes, and their growth over time. This granular data is essential for prioritizing remediation efforts, calculating backfill volumes, and assessing the effectiveness of control measures. While two-stage instance segmentation models like Mask R-CNN exist, they are often computationally heavier and slower. The YOLO (You Only Look Once) family, known for its speed and efficiency in object detection, was recently extended to perform instance segmentation in a single, streamlined pass with YOLOv8. Yet, its foundational CNN backbone, optimized for general object detection, may lack the specialized capacity to model the long-range dependencies and intricate geometries characteristic of narrow, winding ground fissures. This work posits that integrating the global self-attention mechanism of Vision Transformers into this framework can bridge this capability gap, leading to a more intelligent and precise tool for fissure analysis from UAV data.
2. Proposed Methodology: An Improved YOLOv8 with PVT Backbone
2.1. Architectural Overview of YOLOv8 Instance Segmentation
YOLOv8 represents a culmination of advancements in the YOLO series, offering a unified architecture that supports object detection, classification, and instance segmentation. Its instance segmentation head builds upon the anchor-free detection core. The model first predicts a set of bounding boxes and associated class probabilities. In parallel, a dedicated prototype branch generates a set of prototype masks shared across the whole image, while the detection head predicts a vector of mask coefficients for each positive detection (e.g., a fissure). The final instance mask is produced by linearly combining the prototypes with that detection's coefficients. This design enables real-time or near-real-time segmentation performance.
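This YOLACT-style prototype-and-coefficient assembly can be sketched in a few lines of NumPy. The shapes and the toy inputs below are illustrative assumptions, not the exact tensors used by YOLOv8:

```python
import numpy as np

def assemble_instance_masks(prototypes, coefficients, threshold=0.5):
    """Combine image-wide prototype masks with per-instance coefficients.

    prototypes:   (k, H, W) array of prototype masks shared across the image
    coefficients: (n, k) array, one coefficient vector per positive detection
    Returns an (n, H, W) boolean array of binary instance masks.
    """
    k, h, w = prototypes.shape
    # Each instance mask is a learned linear combination of the prototypes.
    logits = coefficients @ prototypes.reshape(k, h * w)   # (n, H*W)
    probs = 1.0 / (1.0 + np.exp(-logits))                  # sigmoid
    return probs.reshape(-1, h, w) > threshold

# Toy example: 2 prototypes, 1 detection that selects only the first one.
protos = np.stack([np.full((4, 4), 3.0), np.full((4, 4), -3.0)])
coeffs = np.array([[1.0, 0.0]])
mask = assemble_instance_masks(protos, coeffs)
print(mask.shape)  # (1, 4, 4)
```

Because the prototypes are computed once per image, adding more detections only adds cheap vector-matrix products, which is part of why this design stays near real time.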
The standard YOLOv8 backbone, often based on a modified CSPDarknet, employs a series of convolutional layers and cross-stage partial (CSP) modules to hierarchically extract features. While effective for many objects, standard convolutions have a fundamental limitation: their receptive field is local and fixed by the kernel size. To model relationships between distant pixels (crucial for following a long, thin fissure across an image), the network must rely on the successive downsampling and deepening of layers, which can lead to loss of fine-grained detail and inefficient learning of long-range patterns.
2.2. Integrating the Pyramid Vision Transformer (PVT)
To overcome the limitations of a purely convolutional backbone, we propose to replace a significant portion of the early and middle stages of the YOLOv8 backbone with a Pyramid Vision Transformer (PVT) structure. The PVT is designed to serve as a general-purpose backbone for dense prediction tasks, combining the progressive pyramid structure of CNNs with the global receptive field of Transformers.
The core operation of a Transformer is the Self-Attention mechanism. For an input feature map, it is first projected into Query (Q), Key (K), and Value (V) matrices. The attention weights, which dictate how much each pixel (or patch) attends to every other pixel, are computed as a scaled dot-product between Q and K. These weights are then used to aggregate information from V. The Multi-Head Self-Attention (MHSA) module runs this process in parallel across multiple “heads,” allowing the model to focus on different types of relationships simultaneously.
$$ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_{head}}}\right)V $$
$$ \text{MHSA}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, …, \text{head}_h)W^O $$
where $d_{head}$ is the dimension per head and $W^O$ is a linear projection matrix.
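The two equations above can be verified with a minimal NumPy sketch; the sequence length, model width, and random weight matrices are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_head)) V."""
    d_head = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_head))  # (seq, seq) attention map
    return weights @ V

def mhsa(x, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention over a (seq_len, d_model) token sequence."""
    seq, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Run attention independently per head, then concatenate and project.
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # 16 image patches, d_model = 8
W = [rng.standard_normal((8, 8)) for _ in range(4)]
out = mhsa(x, *W, num_heads=2)
print(out.shape)  # (16, 8): same shape as the input token sequence
```

Note that the (seq, seq) attention map is what gives every patch a global receptive field, and also what makes the cost quadratic in the number of patches, motivating the spatial reduction described next.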
The PVT ingeniously adapts this for high-resolution images by introducing a Spatial-Reduction Attention (SRA) layer. Before computing attention, SRA aggressively reduces the spatial dimensions of the K and V matrices through a reshaping and linear projection operation. This dramatically cuts the computational and memory cost (which scales quadratically with the number of pixels) while preserving the essential global context. The PVT backbone is structured in four stages, producing a classic feature pyramid $\{F_1, F_2, F_3, F_4\}$ with progressively increasing channel depth and decreasing spatial resolution.
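A minimal sketch of the spatial-reduction step, assuming tokens in row-major order and a learned linear projection on $R \times R$ patches (an equivalent formulation of PVT's strided-convolution reduction, used here for illustration):

```python
import numpy as np

def spatial_reduction(x, H, W, R, Wsr):
    """PVT-style spatial reduction: fold each R x R neighbourhood of the
    token map into a single token before computing K and V.

    x:   (H*W, C) flattened feature map, row-major order
    Wsr: (R*R*C, C) learned projection matrix
    Returns (H*W / R^2, C) reduced tokens.
    """
    C = x.shape[-1]
    # Split height and width into (blocks, within-block) axes.
    x = x.reshape(H // R, R, W // R, R, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(-1, R * R * C)   # one row per R x R spatial block
    return x @ Wsr                 # project back to C channels

rng = np.random.default_rng(1)
H = W_ = 8; C = 4; R = 2
tokens = rng.standard_normal((H * W_, C))
Wsr = rng.standard_normal((R * R * C, C))
kv = spatial_reduction(tokens, H, W_, R, Wsr)
print(tokens.shape, "->", kv.shape)  # (64, 4) -> (16, 4)
```

With K and V shortened by a factor of $R^2$, the attention map shrinks from $(HW)^2$ entries to $HW \cdot HW/R^2$, which is what makes global attention affordable at the high resolutions of the early pyramid stages.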
| Network Component | Standard YOLOv8 (CSPDarknet-based) | Proposed Improved YOLOv8 (PVT-enhanced) |
|---|---|---|
| Core Building Block | C2f Module (Cross-Stage Partial network with fusion) | Transformer Encoder with Spatial-Reduction Attention (SRA) |
| Receptive Field | Local, gradually expanded through stacking | Global from the first stage onwards via Self-Attention |
| Key Strength | Strong local feature extraction, spatial inductive bias | Superior modeling of long-range dependencies and contextual relationships |
| Multi-scale Handling | Achieved through stride and pooling | Inherent in the pyramid stage design; features at each stage have a natural global context. |
| Suitability for Fissures | May struggle with discontinuous, long, thin objects | Excels at connecting disjoint segments and understanding overall fissure geometry. |
By integrating the PVT into YOLOv8, we create a hybrid model. The PVT stages capture rich, multi-scale contextual features where each pixel’s representation is informed by the entire image context. These features are then fed into the remaining YOLOv8 neck (Path Aggregation Network – PAN) and head for the final task of bounding box regression, classification, and mask generation. This synergy allows the model to not only detect a fissure locally but also understand its full extent and relationship to surrounding terrain, significantly improving segmentation accuracy for complex fissure networks imaged by UAVs.
3. UAV Data Acquisition and HYCdata Dataset Curation
The performance of any deep learning model is intrinsically linked to the quality and relevance of its training data. To address the lack of a publicly available dataset for mining-induced ground fissures, we created a bespoke dataset, termed HYCdata, from a real-world mining operation.
Data Acquisition: The aerial survey was conducted over active longwall panels in a large surface mine located in a typical arid region. A commercial-grade multi-rotor UAV, equipped with a high-precision RGB photogrammetric camera, was deployed as the primary data acquisition platform. Flight parameters were meticulously planned to ensure both safety and data quality. The UAV was flown at a low relative altitude (approximately 50-80 meters above ground level) along pre-programmed flight lines with high forward and side overlaps (80% and 70%, respectively). This configuration ensured the collection of imagery with a ground sampling distance (GSD) of 1-2 centimeters, revealing fissures as narrow as a few centimeters in width.
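The reported GSD follows from the standard photogrammetric relation between flight height, sensor pixel pitch, and focal length. The camera values below are illustrative assumptions, not the actual survey specification:

```python
def ground_sampling_distance(altitude_m, pixel_pitch_um, focal_length_mm):
    """GSD (m/pixel) = flight height * sensor pixel pitch / focal length."""
    return altitude_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)

# Hypothetical camera: 2.4 um pixel pitch, 8.8 mm focal length, 60 m AGL.
gsd = ground_sampling_distance(60, 2.4, 8.8)
print(f"{gsd * 100:.2f} cm/pixel")  # ~1.64 cm, within the reported 1-2 cm range
```

The same relation explains the flight design: halving the altitude halves the GSD, so the 50-80 m band trades off fissure-width resolution against per-flight coverage.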
Data Processing: The thousands of overlapping images collected were processed using structure-from-motion (SfM) and multi-view stereo (MVS) photogrammetric software. This pipeline generated high-resolution, georeferenced outputs: a Digital Orthophoto Map (DOM) providing a seamless, top-down view of the terrain, and a Digital Surface Model (DSM) encoding elevation information. The DOM, with its uniform scale and corrected geometry, serves as the primary input for our fissure segmentation task.
Dataset Annotation & Preparation: The study area DOM was carefully examined, and regions exhibiting clear mining-induced ground fissures were selected. Using specialized image annotation software, trained annotators meticulously traced the precise boundaries of every distinct fissure instance, creating polygon masks. This pixel-level labeling is the ground truth for instance segmentation. To format the data for model training, the large DOM was systematically tiled into smaller patches of 640×640 pixels, a suitable size for the model’s input. A 50% overlap between tiles was used during the cropping process to prevent fissures from being cut off at tile boundaries and to augment the effective dataset size. The final HYCdata dataset comprises 4,382 annotated image patches. It was randomly split into training (2,752 images, roughly 63%), validation (1,187 images, roughly 27%), and test (443 images, roughly 10%) sets to ensure fair evaluation.
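The tiling step can be sketched as follows; the helper below is a hypothetical implementation that clamps the final tile to the image border so edge pixels are never dropped:

```python
def tile_origins(dim, tile, overlap=0.5):
    """Top-left offsets along one axis for tiles of size `tile` with the
    given fractional overlap, clamping the last tile to the image border."""
    stride = int(tile * (1 - overlap))
    origins = list(range(0, max(dim - tile, 0) + 1, stride))
    if origins[-1] + tile < dim:          # leftover strip at the border
        origins.append(dim - tile)
    return origins

def tile_image(h, w, tile=640, overlap=0.5):
    """All (row, col) crop origins covering an h x w orthophoto."""
    return [(y, x)
            for y in tile_origins(h, tile, overlap)
            for x in tile_origins(w, tile, overlap)]

# A 1600 x 1600 DOM region with 640 px tiles and 50% overlap -> 4 x 4 crops.
crops = tile_image(1600, 1600)
print(len(crops))  # 16
```

The 50% overlap guarantees that any fissure segment cut by one tile boundary appears whole in a neighbouring tile, at the cost of roughly quadrupling the number of patches relative to non-overlapping tiling.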
4. Experimental Analysis and Results
4.1. Evaluation Metrics
The model’s performance was quantitatively assessed using standard metrics for both object detection and instance segmentation. Let $TP$, $FP$, and $FN$ denote True Positives, False Positives, and False Negatives, respectively. For detection, Precision ($P$) and Recall ($R$) are fundamental:
$$ P = \frac{TP}{TP + FP} \times 100\% $$
$$ R = \frac{TP}{TP + FN} \times 100\% $$
The primary metric for object detection is the mean Average Precision (mAP). Average Precision (AP) is the area under the Precision-Recall curve. mAP is computed by averaging AP over all classes (in our case, a single ‘fissure’ class). It is typically evaluated at different Intersection over Union (IoU) thresholds. IoU measures the overlap between a predicted bounding box/mask ($B_p$) and the ground truth ($B_{gt}$):
$$ IoU = \frac{Area(B_p \cap B_{gt})}{Area(B_p \cup B_{gt})} $$
We report:
– $mAP_{0.5}$: mAP at an IoU threshold of 0.5, a common standard for a “good” match.
– $mAP_{0.5:0.95}$: The average mAP for IoU thresholds from 0.5 to 0.95 in steps of 0.05. This is a stricter, more comprehensive metric.
Superscripts $B$ and $M$ denote metrics for Bounding Box and Mask, respectively. Additionally, we report model complexity metrics: Giga Floating Point Operations (GFLOPs) and the number of parameters (Params/M).
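For concreteness, the precision, recall, and IoU definitions above translate directly into code. This is a minimal sketch with toy counts and boxes, not the full COCO-style mAP evaluator:

```python
def precision_recall(tp, fp, fn):
    """Detection precision and recall, in percent."""
    p = 100.0 * tp / (tp + fp)
    r = 100.0 * tp / (tp + fn)
    return p, r

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Toy evaluation: 80 true positives, 20 false positives, 40 false negatives.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, round(r, 1))                            # 80.0 66.7
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 50 / 150 = 0.333...

# mAP_{0.5:0.95} averages AP over these ten IoU thresholds:
thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
```

Mask IoU is computed the same way, with pixel counts of the mask intersection and union replacing the box areas.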
4.2. Implementation Details and Baseline Models
All experiments were conducted on a high-performance computing workstation. To ensure a fair and rigorous comparison, we benchmarked our improved YOLOv8 model against several state-of-the-art, lightweight versions of the YOLO family that are suitable for potential deployment on edge devices carried onboard UAVs:
– YOLOv8n: The original nano version of YOLOv8.
– YOLOv9t: The tiny version of YOLOv9, known for its programmable gradient information.
– YOLOv10n: The nano version of YOLOv10, optimized for end-to-end efficiency.
– YOLOv11n: A subsequent lightweight variant.
All models were trained from scratch on the HYCdata training set using identical hyperparameters (input size 640, initial learning rate 0.01, batch size 4) for 1000 epochs with early stopping to prevent overfitting.
4.3. Quantitative Results and Analysis
The quantitative results on the HYCdata test set are summarized in the table below. The superiority of the proposed PVT-enhanced YOLOv8n model is evident across nearly all metrics.
| Model | $P^B$ (%) | $R^B$ (%) | $mAP^B_{0.5}$ (%) | $mAP^B_{0.5:0.95}$ (%) | $mAP^M_{0.5}$ (%) | $mAP^M_{0.5:0.95}$ (%) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|
| YOLOv9t | 65.7 | 62.2 | 65.5 | 40.0 | 58.7 | 21.1 | 7.9 | 2.01 |
| YOLOv10n | 70.5 | 63.1 | 68.3 | 41.6 | 62.8 | 21.9 | 11.7 | 2.84 |
| YOLOv11n | 70.2 | 64.2 | 69.3 | 42.8 | 63.5 | 23.1 | 10.2 | 2.84 |
| YOLOv8n (Baseline) | 73.3 | 64.7 | 70.9 | 45.0 | 66.3 | 24.5 | 12.0 | 3.26 |
| Improved YOLOv8n (Ours) | 75.8 | 68.1 | 74.3 | 48.9 | 69.3 | 25.8 | 17.7 | 5.57 |
Key Observations:
1. Performance Gains: Our model achieves the highest scores in all primary accuracy metrics. Compared to the baseline YOLOv8n, it improves $mAP^B_{0.5}$ by 3.4 percentage points (from 70.9% to 74.3%) and $mAP^M_{0.5}$ by 3.0 points (from 66.3% to 69.3%). More importantly, the gains in the stricter $mAP_{0.5:0.95}$ metrics are even more pronounced, particularly for bounding box detection (+3.9 points, from 45.0% to 48.9%), indicating that our model predicts localization that aligns more precisely with the ground truth geometry.
2. Balance vs. Newer Models: The proposed model outperforms the newer YOLOv10n and YOLOv11n by a significant margin (e.g., +6.0 and +5.0 percentage points in $mAP^B_{0.5}$, respectively), demonstrating that the architectural enhancement via PVT is highly effective for this specific domain, outweighing the general improvements in newer YOLO generations.
3. Complexity Trade-off: The performance improvement comes with a predictable increase in computational cost, as seen in the rise of GFLOPs from 12.0 to 17.7 and parameters from 3.26M to 5.57M. This is a direct result of the more powerful Transformer-based backbone. However, this level of complexity remains manageable and is a worthwhile trade-off for the substantial gain in segmentation accuracy, which is critical for reliable hazard assessment based on UAV imagery.
4.4. Qualitative Analysis and Visual Comparison
Qualitative results on sample test patches vividly illustrate the advantages of the improved model. In scenarios with complex backgrounds—such as areas intermixed with rocks, soil clumps, and sparse vegetation—the baseline YOLOv8n and other models often exhibit two failure modes: (1) False Negatives (Missed Detections): They fail to detect very fine or short fissure segments, especially when their contrast with the background is low. (2) Incomplete or Fragmented Segmentation: They may detect a fissure but segment it as several disconnected instances, or produce a mask that does not cover its full, sinuous length.
Our PVT-enhanced model demonstrates markedly superior performance in these challenging conditions:
– Superior Connectivity Reasoning: Leveraging its global self-attention mechanism, the model excels at “connecting the dots.” It can infer the continuity of a fissure even when parts of it are occluded by shadows or shallow overburden, producing more complete and cohesive instance masks.
– Robustness to Background Clutter: The model shows a stronger ability to discriminate between true linear fissures and other linear features like tire tracks, rock strata edges, or plant shadows. This leads to fewer false positives.
– Fine Detail Preservation: The multi-scale feature learning of the PVT pyramid helps preserve the fine edges of narrow fissures, resulting in masks that more accurately reflect the true fissure width and branching patterns. This precision is paramount for subsequent quantitative morphological analysis.
These visual outcomes confirm that the integration of Transformer-based global context modeling directly addresses the core challenges of segmenting mining-induced ground fissures from UAV imagery, leading to more reliable and geometrically faithful results.
5. Conclusion and Future Work
This study presented a novel, effective solution for the automated instance segmentation of mining-induced ground fissures using very high-resolution imagery acquired by UAVs. The central contribution is the architectural enhancement of the YOLOv8 instance segmentation model by replacing its standard CNN-based backbone components with a Pyramid Vision Transformer (PVT) network. This hybrid design equips the model with a powerful capacity for learning multi-scale features and, critically, for understanding long-range spatial dependencies through the self-attention mechanism. This capability is essential for accurately delineating the extended, often discontinuous, and geometrically complex patterns of ground fissures.
To support this research, we constructed and publicly introduced the HYCdata dataset, a dedicated collection of annotated UAV imagery focusing on mining-induced fissures in an arid environment, which can serve as a valuable benchmark for future studies. Comprehensive experiments validated that our proposed model achieves a state-of-the-art $mAP_{0.5}$ of 74.3% for detection and 69.3% for mask segmentation on this dataset, outperforming the original YOLOv8 and other contemporary lightweight YOLO models. The model demonstrates robust performance in complex field conditions, accurately segmenting fissures amidst significant background clutter.
The practical implications are significant. This technology enables rapid, large-scale, and quantitative assessment of surface damage following mining activities. By automatically extracting precise per-fissure contours, it provides direct inputs for calculating key parameters like fissure length, average width, and total affected area. This data is indispensable for engineers to plan targeted backfilling operations, estimate material volumes, monitor the evolution of damage over time, and assess the stability of reclaimed land. It represents a major step towards intelligent, data-driven mining subsidence management and ecological restoration.
Future work will focus on several promising directions. First, exploring knowledge distillation or more efficient Transformer variants (e.g., MobileViT, EfficientFormer) could help reduce the model’s computational footprint for real-time processing onboard UAVs. Second, extending the framework to multi-temporal analysis by integrating change detection algorithms would allow for dynamic monitoring of fissure propagation. Third, fusing UAV optical imagery with other synchronous data streams, such as thermal infrared (for detecting heat from potential coal fires) or LiDAR point clouds (for precise 3D fissure modeling), could create a multi-modal hazard assessment system of even greater utility for the mining industry.
