We propose an innovative framework for monitoring geological hazards along oil and gas pipelines using unmanned aerial vehicle (UAV) imagery. Long-distance pipelines traverse complex terrain, making them vulnerable to geohazards such as landslides, subsidence, and floods, which pose severe environmental and economic risks. Traditional UAV inspection methods struggle with environmental noise (e.g., vegetation, weather), which limits the accuracy of AI-driven image recognition. Our solution integrates a multi-modal large model (CLIP), a change detection network (BIT), and a classification network (EfficientNet) to achieve robust hazard identification.

1. Methodology
1.1. Technical Workflow
The framework processes temporally separated UAV images of the same pipeline location:
- Image Alignment: Corrects spatial discrepancies using keypoint matching.
- Feature Extraction: Uses CLIP to encode aligned images into high-level features.
- Change Detection: Employs BIT to locate hazard regions.
- Hazard Classification: Leverages EfficientNet to identify hazard types.
Table 1: Pipeline Monitoring Workflow
Step | Technique | Function |
---|---|---|
Image Alignment | ORB + Brute-Force + Affine Transform | Corrects UAV positional drift |
Feature Extraction | CLIP (ViT-L/14@336px) | Encodes images into rotation-invariant features |
Change Detection | BIT (Bitemporal Image Transformer) | Outputs pixel-level change masks |
Hazard Classification | EfficientNet-B7 | Classifies hazard types in detected regions |
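The four-stage workflow in Table 1 can be sketched end to end. The stage bodies below are hypothetical placeholders standing in for ORB alignment, the CLIP encoder, BIT, and EfficientNet-B7; only the data flow between stages reflects the actual pipeline:

```python
# Hypothetical placeholder stages; in the real system these are ORB alignment,
# the CLIP encoder, BIT change detection, and EfficientNet-B7 classification.

def align(img_t0, img_t1):
    # Step 1: keypoint matching + affine warp (identity here)
    return img_t0, img_t1

def extract_features(img):
    # Step 2: CLIP ViT-L/14@336px encoding in the real pipeline
    return img

def detect_changes(feat_t0, feat_t1):
    # Step 3: BIT produces a pixel-level change mask; here we just
    # collect positions whose "features" differ between the two epochs
    return [i for i, (a, b) in enumerate(zip(feat_t0, feat_t1)) if a != b]

def classify_hazards(regions):
    # Step 4: EfficientNet-B7 assigns a hazard type per changed region
    return ["hazard"] * len(regions)

def monitor(img_t0, img_t1):
    a0, a1 = align(img_t0, img_t1)
    f0, f1 = extract_features(a0), extract_features(a1)
    return classify_hazards(detect_changes(f0, f1))
```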
1.2. Image Alignment
UAV positioning errors (wind, GPS drift) cause misalignment. The ORB algorithm detects and matches keypoints between image pairs:
- FAST Corner Detection: Identifies candidate pixels $P$ whose neighboring pixels $p_1, \dots, p_{16}$ on a Bresenham circle (radius 3) satisfy
$$I_{p_i} > I_p + t \ \text{(brighter)} \quad \text{or} \quad I_{p_i} < I_p - t \ \text{(darker)}$$
with threshold $t = 40$. If at least 12 contiguous pixels meet either condition, $P$ is a corner.
- BRIEF Descriptor: Generates 256-bit binary fingerprints via intensity comparisons.
- Brute-Force Matching: Computes Hamming distances between descriptors.
- Affine Transformation: Warps images into alignment using matched keypoints.
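The brute-force matching step can be illustrated in isolation. A minimal NumPy sketch that matches 256-bit BRIEF-style descriptors by Hamming distance (the `hamming_match` helper is ours for illustration, not a library API):

```python
import numpy as np

def hamming_match(desc_a, desc_b):
    # desc_a, desc_b: (N, 32) uint8 arrays, i.e. 256-bit BRIEF descriptors.
    # XOR each pair of descriptors, then count set bits = Hamming distance.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dists = np.unpackbits(xor, axis=-1).sum(axis=-1)
    # For every descriptor in A, return its nearest neighbour in B.
    return dists.argmin(axis=1), dists.min(axis=1)
```

In practice the matched keypoint coordinates would then feed an affine (or RANSAC-filtered) transform estimate to warp one image onto the other.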
1.3. Feature Extraction with CLIP
CLIP’s vision transformer (ViT-L/14@336px) extracts features from aligned UAV images. Trained via contrastive learning, it minimizes the distance between paired image-text embeddings while maximizing it for mismatched pairs:
$$\mathcal{L}_{\text{CLIP}} = -\log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(I_i, T_k)/\tau)}$$
where $\mathrm{sim}(I, T)$ is cosine similarity, $\tau$ is the temperature, and $N$ is the batch size. CLIP’s zero-shot capability adapts to diverse UAV scenes.
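The contrastive objective above can be reproduced numerically. A minimal NumPy sketch of the loss over a batch, assuming pre-computed image and text embeddings (not CLIP's actual training code):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    # Normalize rows so dot products equal cosine similarity sim(I, T).
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = I @ T.T / tau                         # sim(I_i, T_k) / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # -log softmax of the matched pair (I_i, T_i), averaged over the batch
    log_prob = np.diag(logits) - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()
```

With perfectly aligned embeddings the matched pair dominates the softmax and the loss approaches zero; shuffling the text embeddings drives it up.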
1.4. Change Detection via BIT
BIT processes CLIP features to identify geohazard regions:
- Tokenization: Semantic tokenizer compresses features into tokens.
- Transformer Encoding: Models global dependencies:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
- Feature Refinement: Decoder upsamples tokens to pixel space, generating change masks.
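The scaled dot-product attention used by the transformer encoder can be sketched directly from the formula (single-head, no masking, NumPy only):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, single head, no masking
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```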
Transfer Fusion Module harmonizes CLIP and BIT features:
- Feature Pyramid: Fuses multi-scale outputs from CLIP layers.
- Windowed Attention: Enhances local context within $M \times M$ windows.
1.5. Hazard Classification with EfficientNet
Detected regions are classified using EfficientNet, optimized via compound scaling:
$$\text{Depth: } d = \alpha^{\phi}, \quad \text{Width: } w = \beta^{\phi}, \quad \text{Resolution: } r = \gamma^{\phi}$$
where $\phi$ is a scaling coefficient and $\alpha, \beta, \gamma$ are constants. This balances accuracy and computational efficiency for UAV-based applications.
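The compound-scaling rule can be evaluated numerically. The base coefficients below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the values reported in the original EfficientNet paper, chosen by grid search so that FLOPs roughly double per unit increase of $\phi$:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    # Depth d, width w, and input resolution r all grow with one knob phi;
    # alpha * beta**2 * gamma**2 ~= 2, so FLOPs roughly double per phi step.
    return alpha ** phi, beta ** phi, gamma ** phi
```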
2. Innovations
2.1. Multi-Modal Model Fusion
Our CLIP+BIT+EfficientNet cascade leverages:
- CLIP’s generalizability from web-scale pretraining.
- BIT’s spatial-temporal modeling for pixel-wise changes.
- EfficientNet’s parameter efficiency.
2.2. Fine-Tuning Strategy
We freeze CLIP during training and use a hybrid dataset:
- Public Data: LEVIR-CD, S2Looking, WHU-CD.
- Proprietary UAV Data: 10-km pipeline segments in Sichuan Basin.
Table 2: Dataset Composition
Task | Dataset | Size | Resolution | Source |
---|---|---|---|---|
Change Detection | LEVIR-CD | 637 image pairs | 0.5 m | UAV |
Change Detection | S2Looking | 5,000 pairs | 0.5–0.8 m | Satellite |
Change Detection | Proprietary Data | 5 time-series | 0.3 m | UAV |
Hazard Classification | Public Geohazards | 209,154 images | Variable | Web |
Hazard Classification | UAV-Captured Hazards | 2,100 images | 0.3 m | UAV |
3. Experiments
3.1. Setup
- Metrics: Intersection-over-Union (IoU), F1-score for change detection; accuracy for classification.
- Baselines: FC-EF, STANet, SNUNet, ChangeFormer, TinyCD.
- Hardware: NVIDIA A100 GPU, batch size 1,000.
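The change-detection metrics can be computed from binary masks. A minimal sketch of IoU and F1 (our helper for illustration, not the evaluation code used in the experiments):

```python
import numpy as np

def change_metrics(pred, gt):
    # pred, gt: boolean change masks of the same shape
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn)                      # Intersection-over-Union
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, f1
```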
3.2. Results
Change Detection:
Table 3: Change Detection Performance (IoU/F1)
Model | IoU | F1 |
---|---|---|
FC-EF | 45% | 0.62 |
ChangeFormer | 66% | 0.80 |
BIT (Ours) | 65% | 0.79 |
CLIP+BIT (Ours) | 75% | 0.86 |
Our method improves IoU by 10 percentage points over BIT alone (65% → 75%, a roughly 15% relative gain) and outperforms all baselines.
Hazard Classification:
- Overall Accuracy: 86%
- Precision/Recall: 83%/79%
- Key hazard accuracies:
  - Landslides: 86.17%
  - Oil spills: 87.32%
  - Floods: 84%
Inference Speed: 0.2 s/image on a single A100 GPU.
4. Field Applications
The system enables:
- Dynamic monitoring of pipeline corridors using UAV time-series images.
- Automated quantification of changes (e.g., “52 changed structures, 6.44% variation”).
- Early warning for landslides, subsidence, and third-party intrusions.
Table 4: System Performance in Pipeline Monitoring
Metric | Context | Value |
---|---|---|
Change Detection IoU | Pipeline corridors | 75% |
Hazard Classification Accuracy | Landslides/Oil spills | >86% |
Processing Speed | Per image | 0.2 s |
False Positive Suppression | Post-processing (morphology) | >90% |
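The morphological post-processing used for false-positive suppression can be sketched as a binary opening (3×3 erosion followed by dilation). A pure-NumPy illustration, not the production implementation:

```python
import numpy as np

def _shift_window(m, op):
    # Apply op (AND for erosion, OR for dilation) over each 3x3 neighbourhood.
    h, w = m.shape
    p = np.pad(m, 1, constant_values=False)
    out = m.copy()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out = op(out, p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w])
    return out

def suppress_false_positives(mask):
    # Morphological opening: erosion deletes speckle-sized detections,
    # dilation then restores the extent of genuine change regions.
    eroded = _shift_window(mask.astype(bool), np.logical_and)
    return _shift_window(eroded, np.logical_or)
```

Isolated single-pixel detections vanish under erosion and never return, while compact regions survive the round trip.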
5. Conclusion
We present an integrated AI framework for UAV-based geohazard monitoring of oil and gas pipelines. By fusing CLIP’s feature robustness, BIT’s change sensitivity, and EfficientNet’s classification efficiency, our system achieves:
- 75% IoU in change detection, 10 percentage points above BIT alone.
- 86% accuracy in hazard classification.
- Real-time processing (0.2 s/image).
This approach significantly enhances the safety and operational reliability of pipeline infrastructure. Future work will expand to multi-sensor UAV platforms (e.g., LiDAR) for all-weather monitoring.