Enhancing Lighting UAV Tracking with Retinex-Inspired Image Enhancement

Visual tracking is a fundamental task in numerous lighting UAV applications, such as target tracking, autonomous landing, and self-localization. Given the initial position of an object, the tracker is expected to estimate its location in subsequent frames. In recent years, visual tracking has made significant progress, with emerging methods continuously refreshing the state-of-the-art (SOTA) on large-scale benchmarks. However, these advancements have primarily been achieved on daytime sequences captured under favorable lighting conditions. In practical applications of lighting drones, visual systems must deliver robust performance throughout the day, yet recent studies indicate that SOTA trackers struggle to maintain their superiority under low-light conditions. Consequently, unstable tracking performance at night severely hinders the expansion of related lighting UAV applications. Typically, images captured at night exhibit low brightness and contrast. Under such circumstances, feature extractors trained on daytime data lose effectiveness, leading to poor target tracking performance. Besides insufficient illumination, high levels of noise can corrupt structural details in images, further degrading tracking performance. Compounding the issue, there are few publicly available nighttime tracking benchmarks with comprehensive annotations for tracker training.

Under low exposure, captured images suffer from high noise levels, low contrast, and low brightness, posing severe challenges for object tracking. In extremely low-light conditions, where even the human eye struggles to distinguish details, convolutional neural networks (CNNs) lose their efficacy in feature extraction. Without discriminative target features, maintaining robust tracking becomes difficult. However, many tracking-related tasks require lighting drones to operate at night. Thus, the widespread application of UAV tracking has so far been limited by illumination conditions.

To address the impact of low-light illumination on lighting UAV tracking, we propose a novel Retinex-inspired image enhancer called SCA-Lighter, which mitigates these effects iteratively. We further construct a lightweight network, SCA-Net, built on dual spatial and channel attention mechanisms, to jointly estimate illumination and noise maps. Unlike traditional methods, SCA-Net does not require paired images and can be trained efficiently on conventionally captured images and applied directly at inference. Experiments on numerous nighttime UAV tracking scenarios demonstrate that SCA-Lighter significantly enhances baseline tracking performance. This research aims to improve the robustness of lighting drone tracking in low-illumination environments, offering broad application prospects.

Background and Related Work

Object tracking methods can generally be categorized into correlation filter-based approaches and CNN-based methods. Due to their outstanding tracking performance, CNN-based trackers have become the current trend in visual tracking. Among them, Siamese-based methods stand out due to their balance of accuracy and efficiency, making them reliable choices for lighting UAV tracking. Typically, Siamese-based methods can be divided into anchor-based and anchor-free trackers. Anchor-based trackers borrow ideas from object detection, introducing region proposal networks (RPNs) to generate precise proposals, achieving remarkable tracking accuracy. To avoid the hyperparameters associated with RPNs, anchor-free trackers directly regress bounding box offsets. Additionally, SiamAPN proposes an anchor proposal network for adaptive anchors. Recently, online learning strategies have been introduced to address the poor generalization of Siamese trackers and fully leverage background appearance information. However, despite these trackers excelling under normal lighting conditions, their robustness drastically decreases in low-light environments. The question then arises: how to illuminate low-light images for trackers?

Low-light enhancement has garnered widespread attention to adjust illumination and improve the visual quality of dark images. Most previous work on low-light enhancement primarily aimed to enhance aesthetic quality. However, to date, low-light enhancement for high-level tasks such as object tracking has received little attention. According to Retinex theory, an observed image can be decomposed into reflectance and illumination maps. Reflectance, representing the wavelength of light reflected by objects, should remain constant as it is determined by the object’s properties. Thus, reflectance can be considered the “true color” of the object. Naturally, tracking objects using their “true color” regardless of lighting conditions would yield more robust performance. Recent studies have shown the effectiveness of attention modules in image restoration tasks. Since low-light enhancement in this context aims to facilitate high-level perception tasks rather than merely pixel-level illumination adjustment, modeling global information is relatively more critical. We draw inspiration from a dual attention module based on spatial and channel mechanisms (CBAM). Building on Retinex theory and attention mechanisms, we propose a Retinex-inspired image enhancer based on spatial and channel attention mechanisms, named SCA-Lighter, to promote lighting UAV tracking.

Methodology

Our main idea is based on the classical Retinex theory, which separates an observed image $S \in \mathbb{R}^{w \times h \times 3}$ into a reflectance map $R \in \mathbb{R}^{w \times h \times 3}$ and an illumination map $L \in \mathbb{R}^{w \times h}$:

$$S = R \odot L \tag{1}$$

Here, the symbol $\odot$ denotes element-wise multiplication, and $w$ and $h$ represent the width and height of the image, respectively. The illumination map $L$ reflects the brightness of the environment, while the reflectance map $R$ depends on the physical properties of the object and is independent of illumination. Considering the presence of strong noise in low-light images, a noise term $N \in \mathbb{R}^{w \times h}$ is added to the classical Retinex model:

$$S = R \odot L + N \tag{2}$$

Once the illumination map $L$ and noise term $N$ are estimated, the reflectance map $R$ can be decomposed from the input image as:

$$R = (S - N) / L \tag{3}$$

Here, the symbol $/$ denotes element-wise division. Assuming that the illumination and noise maps for a scene observed under normal lighting conditions are an all-ones matrix and an all-zeros matrix, respectively, the resulting reflectance map $R$ can be regarded as the output of Retinex-based methods. Since directly decomposing the input image using Equation (3) may lead to unrealistic results, we propose a Retinex-based low-light enhancer, SCA-Lighter, which iteratively strips away illumination and noise maps:

$$S_i = (S_{i-1} - N_i) / L_i \tag{4}$$

where $i$ denotes the $i$-th iteration, $S_0$ represents the initial low-light image $S$ (with $S > 0$), and $S_i$ denotes the intermediate result of the $i$-th iteration. To avoid division by zero, Equation (4) is reformulated as:

$$S_i = (S_{i-1} - N_i) \odot E_i \tag{5}$$

where $E_i = 1 / L_i$. Given a low-light image $S_0$, the SCA-Lighter algorithm progressively eliminates illumination and noise. After $I$ iterations, the illumination and noise maps become relatively clean, and the final output $S_I$ is considered the reflectance map $R$ (in this work, we adopt $I=8$).
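
To make the iteration concrete, below is a minimal PyTorch sketch of the decomposition in Equation (5), assuming the per-iteration maps $E_i$ and $N_i$ have already been predicted; the function and variable names are illustrative, not from the paper's code:

```python
import torch

def iterative_enhance(s0: torch.Tensor, e_maps: torch.Tensor,
                      n_maps: torch.Tensor) -> torch.Tensor:
    """Apply Equation (5) for I iterations.

    s0:      low-light input image, shape (B, 3, H, W), values in [0, 1]
    e_maps:  predicted inverse-illumination maps E_i, shape (B, I, H, W)
    n_maps:  predicted noise maps N_i, shape (B, I, H, W)
    """
    s = s0
    num_iters = e_maps.shape[1]
    for i in range(num_iters):
        e_i = e_maps[:, i:i + 1]  # (B, 1, H, W), broadcast over RGB channels
        n_i = n_maps[:, i:i + 1]
        s = (s - n_i) * e_i       # S_i = (S_{i-1} - N_i) ⊙ E_i
    return s.clamp(0.0, 1.0)      # final S_I is taken as the reflectance map R
```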

SCA-Net Architecture

Each iteration in Equation (5) requires estimating $N_i$ and $E_i$. Therefore, we train SCA-Net to estimate image-specific illumination and noise maps. Specifically, SCA-Net comprises seven convolutional layers and a dual spatial and channel attention mechanism layer. The network structure is summarized in Table 1.

Table 1: SCA-Net Network Structure

| Layer | Output Size |
| --- | --- |
| Input | 256×256×3 |
| Convolutional Layer 1 | 256×256×32 |
| Convolutional Layer 2 | 256×256×64 |
| Convolutional Layer 3 | 256×256×64 |
| Convolutional Layer 4 | 256×256×64 |
| Convolutional Layer 5 | 256×256×64 |
| Spatial and Channel Dual Attention Layer | 256×256×64 |
| $E$ Output | 256×256×8 |
| $N$ Output | 256×256×8 |

The specific workflow is illustrated in Figure 1. The initial low-light image is fed into the network, and after passing through multiple convolutional layers, a set of illumination and noise maps $[E_1, E_2, \ldots, E_I]$ and $[N_1, N_2, \ldots, N_I]$ are obtained for iterative decomposition. The reflectance map $R$ is computed iteratively according to Equation (5) to produce the enhanced image.
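
A lightweight PyTorch sketch of this architecture, following Table 1, might look as follows. Kernel sizes and activations (3×3 convolutions with ReLU) are assumptions, since the text does not specify them; the dual attention layer is passed in as a module and is itself sketched in the next subsection:

```python
import torch
import torch.nn as nn

class SCANet(nn.Module):
    """Sketch of SCA-Net following Table 1; kernel sizes and activations
    are assumptions, as the paper does not specify them."""

    def __init__(self, attention: nn.Module, num_iters: int = 8):
        super().__init__()

        def conv(cin: int, cout: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True))

        self.body = nn.Sequential(
            conv(3, 32),   # Convolutional Layer 1: 256x256x32
            conv(32, 64),  # Convolutional Layer 2: 256x256x64
            conv(64, 64),  # Convolutional Layer 3
            conv(64, 64),  # Convolutional Layer 4
            conv(64, 64),  # Convolutional Layer 5
        )
        self.attention = attention  # spatial and channel dual attention layer
        # Output heads: one single-channel map per iteration (I = 8 by default).
        # No output activation is applied here, as none is specified in the text.
        self.e_head = nn.Conv2d(64, num_iters, 3, padding=1)  # E output: 256x256x8
        self.n_head = nn.Conv2d(64, num_iters, 3, padding=1)  # N output: 256x256x8

    def forward(self, x: torch.Tensor):
        feats = self.attention(self.body(x))
        return self.e_head(feats), self.n_head(feats)
```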

Spatial and Channel Dual Attention Mechanism

The spatial and channel attention mechanism is depicted in Figure 2. After feature extraction via convolution, the features are processed through channel attention and spatial attention modules to produce output features of the same size as the input features.

The channel attention module is shown in Figure 3. The input features undergo max pooling and average pooling over the spatial dimensions $(w \times h)$, where $w$ and $h$ represent the width and height of the input features, respectively. After pooling, two feature vectors of size $1 \times 1 \times C$ are obtained, where $C$ denotes the channel dimension. Both vectors are passed through a shared MLP, summed, and then processed through a sigmoid function to obtain the final channel attention.
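
A possible implementation of this channel attention module, following the CBAM formulation, is sketched below; the reduction ratio of the shared MLP is an assumption:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: spatial max/avg pooling -> shared MLP -> sum
    -> sigmoid. The reduction ratio (16) is an assumed hyperparameter."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP applied to both pooled vectors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # (B, C, 1, 1)
        mxv = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))  # (B, C, 1, 1)
        attn = torch.sigmoid(avg + mxv)                          # channel weights in (0, 1)
        return x * attn                                          # reweight input channels
```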

The spatial attention module is illustrated in Figure 4. The input features have a size of $w \times h \times C$. Max pooling and average pooling are performed along the channel dimension $C$, resulting in features of size $w \times h \times 1$. The two pooled output features are concatenated along the channel dimension to form features of size $w \times h \times 2$. These are then processed through a convolution operation and a sigmoid activation function to obtain the final spatial attention.
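
The spatial attention module, and the composed dual attention layer used in SCA-Net, could be implemented as follows; the 7×7 convolution kernel is an assumption borrowed from CBAM, and `ChannelAttention` refers to the previous sketch:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise max/avg pooling -> concat (2 maps)
    -> conv -> sigmoid. The 7x7 kernel size is an assumption."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2,
                              bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)   # (B, 1, H, W)
        mxv = torch.amax(x, dim=1, keepdim=True)   # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mxv], dim=1)))
        return x * attn

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention, as in Figure 2."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)  # from the previous sketch
        self.spatial = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(x))
```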

Loss Functions

SCA-Lighter performs illumination enhancement on template and search images. Since the tracking object is located at the center of the template image and typically appears in the central region of the search image, it is worthwhile to focus enhancement on the central area. We therefore design a spatial weight map $W \in \mathbb{R}^{p \times q}$ that draws the enhancer’s attention to the center region. Specifically, the enhanced image is divided into $p \times q$ non-overlapping patches of size $16 \times 16$, and the average intensity of each patch is driven toward a well-lit value $l$. The entries of $W$ are computed as $W_{j,k} = \exp\left(-\frac{(j - c_j)^2 + (k - c_k)^2}{2\sigma^2}\right)$, where $(j, k)$ indexes the spatial position of the corresponding patch, $(c_j, c_k)$ denotes the center patch, and $\sigma$ controls the spread of the weighting. Based on this, the center-focused brightness loss is formulated as:

$$L_{\text{cen}} = \frac{1}{P} \sum_{j=1}^{p} \sum_{k=1}^{q} W_{j,k} \left( Y_{j,k} - l \right)^2 \tag{6}$$

where $P = p \times q$ is the number of patches and matrix $Y \in \mathbb{R}^{p \times q}$ holds the average intensity of each patch at its corresponding position. In our implementation, the well-lit value $l$ is set to 0.6.
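
A sketch of this loss under the stated patch size; $\sigma$ is an assumed spread parameter, as its value is not reported:

```python
import torch
import torch.nn.functional as F

def center_brightness_loss(img: torch.Tensor, patch: int = 16,
                           well_lit: float = 0.6,
                           sigma: float = 4.0) -> torch.Tensor:
    """Center-focused brightness loss (Equation 6); sigma is an assumption."""
    gray = img.mean(dim=1, keepdim=True)        # (B, 1, H, W) intensity
    y = F.avg_pool2d(gray, patch)               # (B, 1, p, q) per-patch means
    _, _, p, q = y.shape
    j = torch.arange(p, device=img.device, dtype=img.dtype)
    k = torch.arange(q, device=img.device, dtype=img.dtype)
    jj, kk = torch.meshgrid(j, k, indexing="ij")
    # Gaussian weight map W centered on the middle patch (c_j, c_k).
    w = torch.exp(-(((jj - (p - 1) / 2) ** 2 + (kk - (q - 1) / 2) ** 2)
                    / (2 * sigma ** 2)))
    return (w * (y - well_lit) ** 2).mean()     # average over the P = p*q patches
```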

To preserve the monotonicity relations between neighboring pixels, we add an illumination regularization loss that constrains the smoothness of each illumination component $E_i$, defined as:

$$L_{\text{ill}} = \sum_{i=1}^{I} \left( \| \Delta_x E_i \|_2^2 + \| \Delta_y E_i \|_2^2 \right) \tag{7}$$

where $\Delta_x$ and $\Delta_y$ denote the first-order difference operators along the horizontal and vertical directions, respectively.

High levels of noise inevitably appear in sequences captured under low light. Given that CNNs already struggle to extract discriminative features in low-light conditions, such noise can easily mislead trackers. Therefore, correctly guiding the mapping between the image and the noise map is crucial. We add a noise estimation loss to SCA-Lighter to limit the overall intensity of the noise maps $N_i$, defined as:

$$L_{\text{noi}} = \sum_{i=1}^{I} \| N_i \|_1 \tag{8}$$

The total loss function is as follows:

$$L_{\text{tol}} = \lambda_1 L_{\text{cen}} + \lambda_2 L_{\text{ill}} + \lambda_3 L_{\text{noi}} \tag{9}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters.
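
The remaining losses and the total objective can be sketched as follows; whether the per-iteration sums are additionally averaged over the batch is an implementation assumption, and `center_brightness_loss` refers to the earlier sketch:

```python
import torch

def illumination_smoothness_loss(e_maps: torch.Tensor) -> torch.Tensor:
    """L_ill (Equation 7): squared horizontal/vertical first-order
    differences of every E_i. e_maps has shape (B, I, H, W)."""
    dx = e_maps[..., :, 1:] - e_maps[..., :, :-1]   # horizontal difference
    dy = e_maps[..., 1:, :] - e_maps[..., :-1, :]   # vertical difference
    return ((dx ** 2).sum(dim=(1, 2, 3)).mean()
            + (dy ** 2).sum(dim=(1, 2, 3)).mean())

def noise_loss(n_maps: torch.Tensor) -> torch.Tensor:
    """L_noi (Equation 8): L1 norm of every noise map N_i."""
    return n_maps.abs().sum(dim=(1, 2, 3)).mean()

def total_loss(enhanced, e_maps, n_maps,
               lam1=10.0, lam2=0.001, lam3=50.0):
    """L_tol (Equation 9) with the weights reported in the experiments."""
    return (lam1 * center_brightness_loss(enhanced)
            + lam2 * illumination_smoothness_loss(e_maps)
            + lam3 * noise_loss(n_maps))
```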

Figure 5 presents a qualitative comparison of features extracted from original images and those enhanced by SCA-Lighter. Features extracted from original images appear cluttered and difficult to distinguish, whereas features extracted after SCA-Lighter enhancement are prominent and discriminative, which is crucial for lighting UAV tracking.

Experiments and Results

Experimental Setup and Evaluation Metrics

Our experimental environment is as follows: SCA-Lighter is implemented in PyTorch on a Windows machine equipped with an NVIDIA GeForce RTX 3090 GPU. We use images from the SICE dataset to train SCA-Net. The loss weights are set to $\lambda_1 = 10$, $\lambda_2 = 0.001$, and $\lambda_3 = 50$. During training, images are resized to $256 \times 256$. We employ the Adam optimizer with a fixed learning rate of 0.0001, a batch size of 16, and a total of 200 epochs.
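
Stitching the earlier sketches together, the training configuration described above might look like this; `sice_dataset` is a placeholder for a loader of SICE training crops, not part of any released code:

```python
import torch
from torch.utils.data import DataLoader

# Placeholder dataset yielding 256x256 low-light crops from SICE.
loader = DataLoader(sice_dataset, batch_size=16, shuffle=True)

net = SCANet(DualAttention(64), num_iters=8).cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)  # fixed learning rate

for epoch in range(200):
    for low_light in loader:
        low_light = low_light.cuda()
        e_maps, n_maps = net(low_light)                      # predict E_i, N_i stacks
        enhanced = iterative_enhance(low_light, e_maps, n_maps)
        loss = total_loss(enhanced, e_maps, n_maps)          # Equation (9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```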

Since our goal is to promote nighttime tracking performance for lighting drones through low-light enhancement, we evaluate with metrics from the downstream high-level vision task itself, i.e., object tracking. Following the one-pass evaluation (OPE) protocol, we report precision and success rate. Precision is based on the center location error (CLE) between tracking results and ground truth: the percentage of frames whose CLE falls below a given threshold is taken as the precision (a 20-pixel threshold is typically used for ranking trackers). Success rate is based on the intersection over union (IoU) between estimated and ground-truth bounding boxes: the percentage of frames whose IoU exceeds a given threshold forms the success curve.
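
For reference, these two standard OPE metrics can be computed as follows (a straightforward NumPy sketch, not code from the paper):

```python
import numpy as np

def precision_at(centers_pred, centers_gt, threshold=20.0):
    """Fraction of frames whose center location error (CLE) is below the threshold."""
    cle = np.linalg.norm(centers_pred - centers_gt, axis=1)
    return float((cle < threshold).mean())

def iou(boxes_a, boxes_b):
    """Per-frame IoU of boxes given as (x, y, w, h) rows."""
    x1 = np.maximum(boxes_a[:, 0], boxes_b[:, 0])
    y1 = np.maximum(boxes_a[:, 1], boxes_b[:, 1])
    x2 = np.minimum(boxes_a[:, 0] + boxes_a[:, 2], boxes_b[:, 0] + boxes_b[:, 2])
    y2 = np.minimum(boxes_a[:, 1] + boxes_a[:, 3], boxes_b[:, 1] + boxes_b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = boxes_a[:, 2] * boxes_a[:, 3] + boxes_b[:, 2] * boxes_b[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def success_auc(boxes_pred, boxes_gt, thresholds=np.linspace(0, 1, 21)):
    """Area under the success curve: fraction of frames with IoU above each threshold."""
    overlaps = iou(boxes_pred, boxes_gt)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```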

Experimental Comparisons and Results

To validate the effectiveness and robustness of our method, we implement SCA-Lighter on three advanced trackers: PrDiMP50, SiamAPN, and SiamRPN++. Comprehensive experiments are conducted on UAVDark135, currently the only UAV low-light tracking benchmark. This dataset contains 135 challenging sequences featuring various objects (e.g., people, cars, buildings) captured in poorly lit environments, totaling over 125,000 frames and covering most low-light tracking scenarios.

The experimental results of SCA-Lighter on different trackers are shown in Figure 6. Integrating SCA-Lighter with various trackers significantly improves precision and success rates in nighttime scenarios. For instance, SiamAPN, when augmented with SCA-Lighter, shows increases of 9.2% in precision and 8.9% in success rate. This validates the generalization performance of our proposed SCA-Lighter for lighting drone applications.

Figure 7 displays some tracking screenshots with and without SCA-Lighter enabled. In these low-light scenes, SCA-Lighter enhances tracking reliability. In the figure, green arrows indicate the target object, solid lines represent tracking results with SCA-Lighter, and dashed lines represent the original tracker performance. Different colors denote different trackers: yellow for PrDiMP50, red for SiamRPN++, and blue for SiamAPN. The visual comparisons clearly show that various trackers improve their tracking performance when augmented with SCA-Lighter.

We also compare SCA-Lighter with other low-light enhancers, EnlightenGAN and Zero-DCE, to demonstrate its advantages in lighting UAV low-light tracking. The comparative results are shown in Figure 8. On the basis of original trackers, SCA-Lighter outperforms the comparison methods in both precision and success rate metrics.

Conclusion

Addressing the poor tracking performance of lighting drones in low-light scenarios, we propose a universal low-light enhancer based on spatial and channel attention mechanisms. By training SCA-Net to predict noise and illumination maps and iteratively removing the noise and illumination components from low-light scene images, we obtain enhanced images. Experiments on different trackers demonstrate that the designed SCA-Lighter mitigates the impact of poor illumination on lighting UAV tracking, verifying its compatibility and effectiveness across various tracking methods. Comparisons with current SOTA low-light enhancers show that SCA-Lighter is superior at promoting the robustness of lighting drone tracking under low-light conditions, contributing positively to the development of UAV tracking technology.

Future work will focus on extending SCA-Lighter to other lighting UAV applications, such as obstacle avoidance and surveillance, and exploring real-time implementation on embedded systems for practical deployment. Additionally, we plan to investigate adaptive mechanisms for dynamically adjusting enhancement parameters based on environmental conditions to further improve tracking robustness in diverse lighting scenarios for lighting drones.