Accurate perception of transmission lines by Unmanned Aerial Vehicles (UAVs) is a fundamental requirement for numerous critical applications. In the domain of power system operation and maintenance, this capability enables rapid inspection for faults such as corrosion or damage, significantly reducing labor costs and mitigating the safety risks associated with manual fieldwork. Equally important is the role of precise transmission line detection in enabling autonomous UAV navigation and real-time obstacle avoidance, which is essential for the safe operation of drones in complex, cluttered airspace. However, achieving reliable detection is notoriously challenging due to various environmental factors—including drastic illumination changes, foggy conditions, and cluttered backgrounds—coupled with the inherent characteristics of transmission lines themselves, such as their slender structure, vast scale variations, and random spatial distribution. These challenges severely limit the performance of conventional detection methods, creating a pressing need for more robust and intelligent solutions.

Existing approaches can be broadly categorized into single-modal and multi-modal methods. Single-modal techniques, relying solely on either visible or infrared data, exhibit significant limitations. Visible-light-based methods, whether using traditional computer vision or modern deep learning, struggle under poor lighting or adverse weather. Conversely, infrared-based methods, while robust to illumination changes, suffer from low resolution and contrast, making precise pixel-level localization difficult. Multi-modal fusion methods, which combine visible and infrared data, offer a promising direction by leveraging complementary information. Common fusion architectures include early fusion (simple channel concatenation), late fusion (combining features at the decoder), and stage-wise fusion (interactive fusion during encoding). While stage-wise fusion is currently dominant and has shown success in related tasks like salient object detection and crowd counting, existing frameworks often lack depth in cross-modal interaction and are not specifically optimized for the unique attributes of transmission line targets, such as their elongated shape and need for both long-range continuity and fine-grained detail.
To overcome these limitations, we propose a novel multi-modal detection framework specifically designed for UAV drone-based transmission line inspection. Our core contribution is a comprehensive technical pipeline that emphasizes deep cross-modal interaction and global feature optimization. The framework is built upon three synergistic components: a Cross-modal Interaction Guided Fusion (CIGF) module for deep, bidirectional feature exchange between modalities; a Global Feature Significance Modulator (GFSM) that acts as a feature hub to calibrate and enhance pivotal multi-scale and long-range contextual information; and a Multi-Receptive Enhanced Decoder (MRED) that progressively reconstructs the fine spatial structure of wires. Extensive experiments on the authoritative VITLD dataset demonstrate that our method achieves a superior balance between high accuracy and real-time processing capability, maintaining robust performance even in challenging scenarios like night, fog, and snow, thereby breaking through the application bottlenecks of traditional methods.
1. Related Work and Motivation
The pursuit of reliable environmental perception for UAV drones has driven extensive research in computer vision. For transmission line detection, the evolution has moved from single-modal to multi-modal strategies.
Single-Modal Detection. Early works relied heavily on handcrafted features from visible images, utilizing techniques like edge detection, Hough transform, or specialized grayscale transformations. These methods are highly sensitive to complex and unseen backgrounds. The advent of deep learning brought significant improvements. Convolutional Neural Networks (CNNs), including fully convolutional networks and adaptations of object detection frameworks like YOLO, have been employed for end-to-end wire identification or segmentation. While these methods improve upon traditional ones in normal conditions, their dependence on visible light makes them vulnerable to low-light and adverse weather. Infrared-only methods offer illumination invariance but are hampered by inherently lower spatial detail, limiting precise pixel-level localization required for thin wires.
Multi-Modal Fusion Architectures. Fusion strategies are typically classified by the stage at which information from different sensors is combined. Early fusion concatenates raw or lightly processed images from multiple modalities into a multi-channel input, which is then processed by a single network. This approach is simple but fails to account for deep modal disparities. Late fusion employs separate network branches for each modality, merging their high-level features or decisions at the final stages. This can lead to feature misalignment due to a lack of intermediate interaction. Stage-wise fusion has emerged as a more effective paradigm, allowing for progressive interaction between modalities throughout the network, often facilitated by attention mechanisms. This architecture better captures the intrinsic relationships between modalities, such as the texture details from visible light and the thermal consistency from infrared data.
Multi-Modal Fusion in Related UAV/Robotic Tasks. Research in adjacent fields provides valuable insights. For UAV-based crowd counting, networks like DEFNet employ dense enhancement modules. In RGB-D salient object detection, frameworks like JL-DCF use joint learning and dense cooperative fusion. For all-weather traffic detection from UAV drones, methods like FRCPNet incorporate deformable registration and adaptive fusion. In robotic grasping, HFNet utilizes hierarchical feature fusion for precise pose alignment. While these methods excel in their respective domains—focusing on dense statistics, object saliency, vehicle detection, or rigid pose estimation—they are not tailored for the specific challenges of transmission line detection: capturing extremely elongated structures, maintaining pixel-accurate continuity over long distances, and being invariant to complex natural backgrounds. This gap motivates our task-specific design of deep cross-modal interaction and global feature calibration dedicated to UAV drone wire detection.
2. Proposed Methodology
Our proposed framework is designed to address the core challenges in UAV drone wire detection: insufficient modal interaction, weak global feature guidance, and inadequate reconstruction of fine structures. The overall architecture comprises three key stages: the CIGF-based encoding stage, the GFSM-based global calibration stage, and the MRED-based decoding stage.
2.1 Overall Framework
The complete pipeline begins by processing aligned visible (V) and infrared (I) image pairs. In the encoding stage, each modality first passes through a Residual Feature Refinement (RFR) block for initial feature extraction. These features are then fed into the core Cross-modal Interaction Guided Fusion (CIGF) module. The CIGF module outputs two types of features: 1) refined modality-specific features ($V_{out}$, $I_{out}$) that are passed down the encoding pyramid via max-pooling, and 2) a fused cross-modal feature ($F_{fusion}$) that is preserved via skip connections for the decoder. The encoding process continues, extracting multi-scale semantic features.
At the bottleneck connecting the encoder and decoder, the Global Feature Significance Modulator (GFSM) receives the deepest, most abstract features. It functions as a global feature hub, performing multi-scale expansion and long-range context modeling to calibrate and enhance the most discriminative features for the wire detection task.
Finally, the Multi-Receptive Enhanced Decoder (MRED) stage progressively upsamples the calibrated features. At each upsampling step, it concatenates the corresponding skip-connected fused feature from the CIGF stage. The MRED module then processes this concatenated feature with multi-branch dilated convolutions to capture context at various receptive fields, meticulously reconstructing the spatial details of the wires. A final convolutional layer outputs the binary detection map.
We employ a composite loss function to jointly optimize the network:
$$
\mathcal{L}_{total} = \mathcal{L}_{BCE} + \lambda \cdot \mathcal{L}_{Dice}
$$
$$
\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]
$$
$$
\mathcal{L}_{Dice} = 1 - \frac{2 \sum y_i \hat{y}_i + \epsilon}{\sum y_i + \sum \hat{y}_i + \epsilon}
$$
where $y_i$ is the ground-truth label and $\hat{y}_i$ the predicted probability for pixel $i$, $N$ is the total number of pixels, $\epsilon$ is a smoothing constant (set to 1), and $\lambda$ is a balancing weight (set to 0.5). The Binary Cross-Entropy (BCE) loss enforces pixel-wise classification accuracy, while the Dice loss specifically optimizes for the overlap of the thin, elongated wire structures, improving segmentation completeness.
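To make the composite objective concrete, the following minimal NumPy sketch evaluates $\mathcal{L}_{total}$ on flattened prediction and label arrays. The function name `bce_dice_loss` and the numerical clipping to avoid $\log(0)$ are our additions for illustration, not part of the original implementation:

```python
import numpy as np

def bce_dice_loss(y_true, y_pred, lam=0.5, eps=1.0):
    """Composite loss L_total = L_BCE + lambda * L_Dice on flattened pixel arrays."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # numerical safety: avoid log(0)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    dice = 1 - (2 * np.sum(y_true * y_pred) + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
    return bce + lam * dice
```

As expected, a near-perfect prediction yields a loss close to zero, and degrading the prediction increases both terms monotonically.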
2.2 Cross-modal Interaction Guided Fusion (CIGF) Module
The CIGF module is engineered to move beyond simple feature addition or concatenation. It implements a bidirectional feature reconstruction mechanism through three cohesive phases to achieve deep, complementary fusion.
1. Modal Channel Interaction Enhancement. This phase establishes a lightweight channel-wise dialogue between modalities. It uses a squeeze-and-excitation style unit to generate guidance weights from one modality to enhance the other. For instance, to enhance infrared features $I$ with visible guidance, we first concatenate $V$ and $I$, then produce a channel attention vector applied to $V$, which is then added as a residual to $I$.
$$
\begin{aligned}
W_I &= \sigma(\mathbf{W}_2 \delta(\mathbf{W}_1 \text{GAP}(\text{concat}(V, I)))) \\
I' &= I + W_I \odot V
\end{aligned}
$$
A symmetric operation yields $V'$ enhanced by $I$. Here, $\text{GAP}$ is Global Average Pooling, $\delta$ is ReLU, $\sigma$ is Sigmoid, $\mathbf{W}$ are fully-connected layers, and $\odot$ is channel-wise multiplication.
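The channel interaction step can be sketched in NumPy as follows. The channel-first `(C, H, W)` tensor layout, the hidden width of the two fully-connected layers, and the name `channel_interaction` are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_interaction(V, I, W1, W2):
    """Enhance infrared features I with visible guidance (the I' equation).
    V, I: (C, H, W); W1: (hidden, 2C); W2: (C, hidden)."""
    z = np.concatenate([V, I], axis=0).mean(axis=(1, 2))  # GAP over concat(V, I) -> (2C,)
    w = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))             # sigma(W2 ReLU(W1 z)) -> (C,)
    return I + w[:, None, None] * V                       # I' = I + W_I (odot) V
```

The symmetric call with the roles of `V` and `I` swapped (and its own weight matrices) would produce $V'$.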
2. Global Modal Weight Assessment. Based on the interactively enhanced features $V'$ and $I'$, this phase evaluates the global significance of each modality. The concatenated features $[V', I']$ are fed into a channel attention unit, generating a unified attention vector $A$. This vector is then split into two modality-specific weight vectors, $A_V$ and $A_I$.
$$
\begin{aligned}
A &= \sigma(\mathbf{W}_4 \delta(\mathbf{W}_3 \text{GAP}(\text{concat}(V', I')))) \\
[A_V, A_I] &= \text{split}(A)
\end{aligned}
$$
3. Multi-granularity Feature Fusion. This final phase strategically integrates information through two paths. The Alignment Path uses the assessed weights to align the original features: $V \odot A_V$ and $I \odot A_I$. The Interaction Path goes further by computing a cross-modal interaction term via element-wise multiplication: $(V \odot A_V) \odot (I \odot A_I)$. The final fused feature $F_{fusion}$ and the refined modality-specific outputs are computed as:
$$
\begin{aligned}
F_{fusion} &= \underbrace{V \odot A_V + I \odot A_I}_{\text{Alignment Path}} + \underbrace{(V \odot A_V) \odot (I \odot A_I)}_{\text{Interaction Path}} \\
V_{out} &= \frac{1}{2}(V + V \odot A_V + I \odot A_I) \\
I_{out} &= \frac{1}{2}(I + V \odot A_V + I \odot A_I)
\end{aligned}
$$
This design ensures that the fused feature captures deep synergies, while the modality-specific outputs retain their unique characteristics infused with complementary information from the other modality, providing rich features for both the encoder pyramid and the decoder skip connections.
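Assuming the per-channel weight vectors $A_V$ and $A_I$ have already been produced by phase 2, the fusion equations above reduce to a few lines. The shapes and the name `cigf_fusion` are our assumptions for this sketch:

```python
import numpy as np

def cigf_fusion(V, I, A_V, A_I):
    """Multi-granularity fusion (phase 3): alignment + interaction paths.
    V, I: (C, H, W); A_V, A_I: per-channel weights in (0, 1), shape (C,)."""
    Va = A_V[:, None, None] * V          # alignment path, visible
    Ia = A_I[:, None, None] * I          # alignment path, infrared
    F_fusion = Va + Ia + Va * Ia         # alignment + cross-modal interaction term
    V_out = 0.5 * (V + Va + Ia)          # refined modality-specific outputs
    I_out = 0.5 * (I + Va + Ia)
    return F_fusion, V_out, I_out
```

Note that when both weight vectors are zero the fused feature vanishes and each modality output degenerates to half of its own input, showing how the attention weights gate the complementary information flow.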
2.3 Global Feature Significance Modulator (GFSM)
Positioned at the network’s bottleneck, the GFSM acts as a sophisticated feature refinery. It processes the high-level feature map $X$ by splitting it along the channel dimension into $G$ groups $X_g$ (e.g., $G=8$), enabling efficient and focused processing.
1. Multi-branch Feature Extraction. For each feature group $X_g$, GFSM employs a dual-branch strategy to capture diverse contextual information. The Multi-scale Branch uses standard convolution, global average pooling, and downsampling (2x, 4x) followed by convolution to capture context at different scales. The Long-range Context Branch employs a chain of dilated convolutions (e.g., rates 3 and 5) to significantly expand the receptive field and model the long-distance dependencies crucial for tracing wires. The outputs of the two branches are summed: $T_g = f_{\text{multi-scale}}(X_g) + f_{\text{long-range}}(X_g)$.
2. Adaptive Feature Significance Evaluation. For each group’s extracted feature $T_g$, GFSM computes its statistical mean $\mu_g$ and standard deviation $\varphi_g$. It then generates a group-specific significance weight map through a learnable affine transformation followed by a Sigmoid function:
$$
W_g = \sigma\left(\gamma_g \cdot \frac{T_g - \mu_g}{\varphi_g} + \beta_g\right)
$$
where $\gamma_g$ and $\beta_g$ are learnable parameters for scaling and shifting, allowing the module to adaptively calibrate feature significance based on the intrinsic distribution of each group.
3. Selective Feature Calibration. The computed weight $W_g$ is used to modulate the original input feature group $X_g$. All modulated groups are then concatenated to form the final, calibrated output feature $Y$.
$$
Y = \text{GFSM}(X) = \mathop{\text{Concat}}\limits_{g=1}^{G} (X_g \odot W_g)
$$
Through this grouped, multi-scale, and context-aware processing, GFSM effectively suppresses less relevant feature responses and amplifies those critical for identifying and connecting wire segments across the image, providing a powerful feature basis for the subsequent decoder.
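Phases 2 and 3 amount to a grouped, statistics-driven gating. A minimal NumPy sketch, assuming channel-first `(C, H, W)` features whose channel count divides evenly by $G$, and treating the branch output $T$ as already computed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gfsm_calibrate(X, T, gamma, beta, G=4, eps=1e-5):
    """Phases 2-3: per-group significance weights W_g and selective calibration.
    X, T: (C, H, W) input and branch-extracted features; gamma, beta: (G,) learnable."""
    Xg = np.split(X, G, axis=0)
    Tg = np.split(T, G, axis=0)
    out = []
    for g in range(G):
        mu, std = Tg[g].mean(), Tg[g].std() + eps       # group statistics
        Wg = sigmoid(gamma[g] * (Tg[g] - mu) / std + beta[g])  # significance map in (0, 1)
        out.append(Xg[g] * Wg)                          # selective calibration
    return np.concatenate(out, axis=0)                  # Y
```

With $\gamma_g = \beta_g = 0$ every weight collapses to $\sigma(0) = 0.5$, so the learnable affine parameters are what let each group deviate from a uniform gate.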
2.4 Multi-Receptive Enhanced Decoder (MRED)
The MRED module is responsible for the precise, pixel-level reconstruction of wire masks. It operates on feature maps at each decoder stage, which are the result of upsampling and concatenation with skip-connected $F_{fusion}$ features.
Given an input feature $X$ at a decoder stage, the MRED first applies a $1\times1$ convolution for channel reduction and initial transformation. It then deploys three parallel dilated convolutional branches with different dilation rates (e.g., 1, 2, 3). This multi-branch design captures contextual information at multiple receptive fields, essential for understanding both the local fine structure of a wire and its broader, long-range connectivity. The outputs of these branches are element-wise summed, followed by Batch Normalization (BN) and a LeakyReLU activation ($\delta_{leaky}$). A final $1\times1$ convolution adjusts the channel dimensions, and the result is added to the original input $X$ via a residual connection.
$$
\begin{aligned}
\text{MRED}(X) = X + \text{Conv}_{1\times1}\Bigg( \delta_{leaky}\bigg( \text{BN}\Big( \sum_{i=1}^{3} \text{Conv}_{d_i}( \text{Conv}_{1\times1}(X) ) \Big) \bigg) \Bigg)
\end{aligned}
$$
where $\text{Conv}_{d_i}$ denotes a $3\times3$ convolution with dilation rate $d_i$. This residual multi-receptive field design allows the decoder to progressively refine features, effectively combining high-level semantic guidance from the encoder with detailed spatial information from the skip connections, leading to accurate and well-defined wire segmentation maps.
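The block's structure can be sketched with a toy single-channel dilated convolution. The BN layer and the channel-adjusting $1\times1$ convolutions are omitted for brevity, and all names here are illustrative rather than taken from the actual implementation:

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """Single-channel 3x3 dilated convolution with zero padding
    (toy stand-in for Conv_{d_i}); x: (H, W), k: (3, 3), d: dilation rate."""
    H, W = x.shape
    xp = np.pad(x, d)  # pad by the dilation rate so output size matches input
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i * d : i * d + H, j * d : j * d + W]
    return out

def mred_block(x, kernels, leak=0.01):
    """Residual multi-receptive block: sum of dilated convs (rates 1, 2, 3),
    LeakyReLU, then residual addition back onto the input."""
    s = sum(dilated_conv2d(x, k, d) for k, d in zip(kernels, (1, 2, 3)))
    a = np.where(s > 0, s, leak * s)  # LeakyReLU
    return x + a
```

Feeding identity kernels (center tap 1, all other taps 0) through all three branches triples the input before the residual addition, which makes the data flow easy to verify by hand.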
3. Experiments and Analysis
3.1 Experimental Setup
Dataset. We conduct our experiments on the public VITLD dataset, a benchmark for visible-infrared transmission line detection. It contains 420 aligned visible-infrared image pairs with pixel-level annotations. We follow the standard split: 280 pairs for training, 60 for validation, and 80 for testing. To rigorously evaluate robustness, we augment the test set with challenging conditions simulated on the visible images—including heavy fog, low-light night, and snow occlusion—resulting in an expanded test set of 400 samples. During training, we apply standard data augmentations like rotation, flipping, and color jittering to improve generalization.
Implementation Details. The model is implemented in PyTorch. We use the RAdam optimizer with an initial learning rate of 0.001 and a weight decay of 0.0005. The learning rate is halved every 20 epochs. The model is trained for 200 epochs with a batch size of 4 on an NVIDIA RTX 3080 GPU.
Evaluation Metrics. We employ a comprehensive set of metrics: Intersection over Union (IoU) measures localization accuracy; Object Recognition Rate (Robj) evaluates the completeness of detecting individual wire instances; Recall and Precision assess detection coverage and correctness; F1-Score provides their harmonic mean; and Frames Per Second (FPS) measures inference speed, critical for real-time UAV drone applications.
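The pixel-level metrics reduce to confusion-matrix counts; the instance-level Robj is dataset-specific and not reproduced here. A minimal sketch, assuming binary 0/1 masks:

```python
import numpy as np

def seg_metrics(gt, pred):
    """Pixel-level IoU, Recall, Precision, and F1 for binary masks (0/1 arrays)."""
    tp = np.sum((gt == 1) & (pred == 1))  # true positives
    fp = np.sum((gt == 0) & (pred == 1))  # false positives
    fn = np.sum((gt == 1) & (pred == 0))  # false negatives
    iou = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, recall, precision, f1
```

For example, a prediction that recovers one of two wire pixels and adds one false pixel scores IoU = 1/3 with recall, precision, and F1 all equal to 0.5.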
3.2 Comparison with State-of-the-Art Methods
We compare our method against several relevant baselines and state-of-the-art multi-modal fusion methods from related vision tasks. For a focused comparison on fusion strategy, when evaluating methods designed for other tasks (e.g., DEFNet for crowd counting, JL-DCF for saliency detection), we integrate their core fusion module into our framework’s fusion stage. This ensures a fair assessment of their cross-modal fusion efficacy on the wire detection task.
Quantitative Results. The overall performance comparison is summarized in Table 1. Our proposed method achieves the best balance between accuracy and efficiency. It attains the highest scores in IoU (67.79%), Robj (80.62%), Recall (76.57%), and F1-Score (76.26%), demonstrating superior detection completeness and precision-recall balance. In terms of FPS (47.62), our method maintains a high, practical inference speed suitable for real-time UAV drone operation, significantly faster than other high-accuracy multi-modal approaches.
| Comparative Method | IoU (%) | Robj (%) | Recall (%) | Precision (%) | F1-Score (%) | FPS |
|---|---|---|---|---|---|---|
| UNet (Visible only) | 52.92 | 74.31 | 73.22 | 70.96 | 72.07 | 68.03 |
| UMFNet (Early Fusion) | 57.07 | 73.98 | 72.12 | 73.01 | 72.56 | 52.63 |
| DEFNet Fusion Module | 61.22 | 74.68 | 74.31 | 75.89 | 75.09 | 46.08 |
| JL-DCF Fusion Module | 61.38 | 79.51 | 76.03 | 75.83 | 75.93 | 51.02 |
| HFNet Fusion Module | 65.14 | 78.27 | 76.01 | 75.16 | 75.58 | 38.61 |
| FRCPNet Fusion Module | 65.45 | 74.97 | 72.24 | 75.72 | 73.94 | 36.63 |
| TCAINet Fusion Module | 65.74 | 79.01 | 75.07 | 76.29 | 75.58 | 52.91 |
| Our Proposed Method | 67.79 | 80.62 | 76.57 | 75.95 | 76.26 | 47.62 |
Table 1: Overall performance comparison on the VITLD dataset.
Robustness in Challenging Conditions. To analyze performance under adversity, we plot bubble charts for night, snowy, and foggy conditions: the x-axis lists the algorithms, the y-axis shows IoU, and the bubble size encodes the Robj score; a dashed line marks the best IoU among the comparison methods. Across all three harsh environments, our method achieves both the highest IoU (topmost position) and the largest Robj (biggest bubble). This demonstrates the robustness conferred by our deep cross-modal fusion and feature calibration strategy, enabling reliable UAV drone operation in low-light, occlusion, and low-visibility scenarios where other methods falter.
3.3 Ablation Studies
We conduct thorough ablation experiments to validate the contribution of each proposed component.
Module-level Ablation. Table 2 shows the incremental performance gain when adding each module. The baseline model without our specialized modules performs poorly. Adding the CIGF module brings a substantial boost (IoU from 53.74% to 62.51%), proving the effectiveness of deep bidirectional fusion. Incorporating the GFSM further improves performance (IoU to 66.61%), highlighting the importance of global multi-scale feature calibration. Employing the full framework with the MRED decoder yields the best results (IoU 67.79%), confirming that all three components work synergistically to enhance UAV drone detection capability.
| CIGF | GFSM | MRED | IoU (%) | Robj (%) | F1-Score (%) |
|---|---|---|---|---|---|
| — | — | — | 53.74 | 75.46 | 71.63 |
| ✓ | — | — | 62.51 | 77.69 | 74.20 |
| ✓ | ✓ | — | 66.61 | 78.56 | 75.07 |
| ✓ | ✓ | ✓ | 67.79 | 80.62 | 76.26 |
Table 2: Ablation study on the proposed modules.
Fine-grained Ablation on GFSM. We further dissect the GFSM by disabling its key branches. Results in Table 3 show that removing the multi-scale branch causes a notable drop in IoU, as capturing wires at different scales is crucial for accurate localization. Removing the long-range context branch leads to a more significant drop in Robj, as modeling long-distance dependencies is essential for completely tracing the wire instance. Using both branches together yields the optimal performance, confirming their complementary roles in serving the UAV drone detection task.
| GFSM Configuration | IoU (%) | Robj (%) | F1-Score (%) |
|---|---|---|---|
| W/o Multi-scale Branch | 64.90 | 76.70 | 74.26 |
| W/o Long-range Branch | 66.94 | 77.73 | 75.12 |
| Full GFSM (Ours) | 67.79 | 80.62 | 76.26 |
Table 3: Ablation on the components of the GFSM module.
4. Conclusion
In this work, we have addressed the critical and challenging problem of transmission line detection for UAV drones by proposing a novel multi-modal framework based on visible and infrared data fusion. Our key innovation lies in the design of three dedicated components: the Cross-modal Interaction Guided Fusion (CIGF) module for deep, bidirectional feature exchange; the Global Feature Significance Modulator (GFSM) for multi-scale and long-context feature calibration; and the Multi-Receptive Enhanced Decoder (MRED) for fine-grained spatial reconstruction. Together, they form a cohesive pipeline that effectively leverages complementary sensor information. Comprehensive experiments on the VITLD dataset demonstrate that our method achieves state-of-the-art performance, offering an optimal balance between high detection accuracy (leading in IoU, Robj, Recall, and F1-Score) and practical inference speed (47.62 FPS). More importantly, it exhibits remarkable robustness in extreme conditions such as night, fog, and snow, which are common yet challenging scenarios for UAV drone operations. This work provides a robust and effective solution for autonomous power line inspection and UAV navigation safety, with significant theoretical and practical value for the field of aerial robotics and computer vision.
