An Improved ByteTrack Algorithm for Multi-Object Tracking in UAV Aerial Imagery

Abstract

Multi-object tracking (MOT) in UAV aerial imagery faces significant challenges, including small target scale, irregular camera motion, and frequent occlusions. To address these issues, I propose an improved ByteTrack algorithm that adds three modules to the original framework: an OSNet-based appearance feature extraction module, a camera motion compensation module, and a Gaussian process trajectory interpolation module. Experiments on the VisDrone2019-MOT dataset show that my method raises MOTA by 1.5 percentage points (from 23.9% to 25.4%) and reduces ID switches by 578 (from 1685 to 1107) relative to the baseline ByteTrack, and that it achieves the best MOTA, MOTP, and IDF1 among the compared trackers (SORT, DeepSORT, DeepMOT, and ByteTrack). Ablation studies quantify the contribution of each module, and the results indicate that the method is a robust solution for UAV-based multi-object tracking in complex environments.

1. Introduction

With the rapid development of UAV technology and the falling cost of manufacturing, vision-based multi-object tracking in UAV aerial imagery has become a critical research area. UAVs are widely deployed in logistics, precision agriculture, disaster rescue, and urban surveillance, where real-time and accurate target tracking is essential. Unlike ground-level cameras, UAVs operate at varying altitudes and speeds, resulting in small and fast-moving targets, frequent occlusions by buildings or trees, and complex background motion caused by the drone’s own movement. These factors degrade the performance of conventional tracking algorithms.

In recent years, the tracking-by-detection paradigm has dominated the field. ByteTrack is a representative algorithm that associates detection boxes of both high and low confidence scores, demonstrating strong performance on standard benchmarks. However, ByteTrack was primarily designed for pedestrian tracking from static cameras and does not exploit appearance features. Furthermore, its motion model assumes simple linear motion, which is insufficient for the erratic motion patterns observed in UAV footage. To overcome these limitations, I propose an improved ByteTrack algorithm tailored for UAV scenarios. My contributions are threefold:

I integrate an OSNet-based appearance feature extraction module that learns discriminative embeddings for targets, enabling robust re-identification across frames.
I enhance the motion model by replacing the aspect ratio r and its velocity in the original state vector with the bounding box width w and its velocity, and by introducing a camera motion compensation step that aligns track predictions with the current frame using an affine transformation estimated between consecutive frames.
I employ a Gaussian process trajectory interpolation method to recover missing tracks caused by temporary detection failures, which outperforms conventional linear interpolation in handling non-linear motion.

Extensive experiments on the VisDrone2019-MOT dataset confirm that my method significantly improves tracking accuracy and consistency. The ablation study further quantifies the contribution of each module.

2. Related Work

Multi-object tracking can be categorized into two main paradigms: tracking-by-detection (TbD) and joint detection and tracking (JDT). TbD methods first apply an object detector to each frame and then perform data association (e.g., using the Hungarian algorithm) to link detections across time. ByteTrack belongs to this category and stands out by also using low-confidence detections, which often correspond to occluded or blurred objects. Nevertheless, ByteTrack does not employ deep appearance features, and its motion model is based on a standard Kalman filter with a fixed state vector (x, y, r, h, vx, vy, vr, vh), where r is the aspect ratio. In UAV footage, the aspect ratio is not stable due to perspective changes; hence I replace r and vr with the bounding box width w and its velocity vw.

Camera motion is another challenge for UAVs. Several works have attempted to compensate for ego-motion using optical flow or feature matching, but these are often applied separately from the tracking pipeline. I integrate a lightweight camera motion compensation directly into the Kalman prediction step, which aligns predictions to the current frame coordinate system before data association.

3. Proposed Method

My algorithm builds upon the ByteTrack framework and adapts it for UAV aerial imagery. The overview is described as follows.

3.1 Overall Framework

At each time step t, an object detector (YOLOv5) generates detection boxes with associated confidence scores. Detections are divided into a high-confidence set (score > 0.5) and a low-confidence set (0.1 < score ≤ 0.5). For each tracklet from the previous frame t-1, a Kalman filter first predicts its state at frame t, and camera motion compensation then transforms the predicted state into the coordinate system of the current frame (Section 3.3). In the first association round, I compute a combined cost matrix between the compensated track predictions and the high-confidence detections using both motion (Mahalanobis distance) and appearance (cosine distance between OSNet embeddings), and the Hungarian algorithm yields preliminary matches. In the second round, the unmatched high-confidence detections and all low-confidence detections are associated with the remaining unmatched track predictions using only an IoU-based cost. Low-confidence detections that remain unmatched are discarded. Finally, matched detections update the corresponding tracks, new tracks are initialized from unmatched high-confidence detections, and Gaussian process interpolation (Section 3.4) fills gaps in tracks that have missing detections.
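The confidence split and the IoU-based cost used in the second round can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the `Detection` type, the corner-format box layout, and the function names are hypothetical, and only the two thresholds are taken from the text above.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners; assumed layout

class Detection(NamedTuple):
    box: Box
    score: float

HIGH_THRESH = 0.5  # score > 0.5 goes to the high-confidence set
LOW_THRESH = 0.1   # 0.1 < score <= 0.5 goes to the low-confidence set

def split_by_confidence(dets: List[Detection]) -> Tuple[List[Detection], List[Detection]]:
    """Partition detections into the high- and low-confidence sets."""
    high = [d for d in dets if d.score > HIGH_THRESH]
    low = [d for d in dets if LOW_THRESH < d.score <= HIGH_THRESH]
    return high, low

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_cost(track_box: Box, det_box: Box) -> float:
    """Second-round association cost: lower means better overlap."""
    return 1.0 - iou(track_box, det_box)
```

A cost matrix built from `iou_cost` over all unmatched tracks and remaining detections would then be passed to an assignment solver such as the Hungarian algorithm.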

3.2 Appearance Feature Extraction Module

I adopt OSNet, a lightweight convolutional neural network designed for person re-identification, as the appearance feature extractor. OSNet is built from omni-scale residual blocks (OSBlocks), which combine several streams of depthwise separable convolutions through a channel-wise aggregation gate. The input to OSNet is a cropped target region resized to 128×64 pixels. After global average pooling, a fully-connected layer outputs a 512-dimensional feature vector F. The cosine similarity between two feature vectors F_i and F_j, each with n = 512 components, is defined as:

$$ \cos(\theta) = \frac{F_i \cdot F_j}{\|F_i\|\|F_j\|} = \frac{\sum_{k=1}^{n} F_{i,k} F_{j,k}}{\sqrt{\sum_{k=1}^{n} F_{i,k}^2} \sqrt{\sum_{k=1}^{n} F_{j,k}^2}} $$
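The similarity above can be computed directly from two embedding vectors. The sketch below is illustrative only: the function names are hypothetical, and taking the cosine distance as one minus the cosine similarity is an assumption about how the distance used during matching is derived.

```python
import math
from typing import Sequence

def cosine_similarity(f_i: Sequence[float], f_j: Sequence[float]) -> float:
    """cos(theta) = (F_i . F_j) / (||F_i|| ||F_j||)."""
    if len(f_i) != len(f_j):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0.0 or norm_j == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_i * norm_j)

def cosine_distance(f_i: Sequence[float], f_j: Sequence[float]) -> float:
    """Assumed definition: 1 - cos(theta), so identical directions give 0."""
    return 1.0 - cosine_similarity(f_i, f_j)
```

Because the similarity depends only on direction, scaling an embedding by a positive constant leaves the distance unchanged.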

3.3 Motion Model with Camera Motion Compensation

The original ByteTrack state vector is (x, y, r, h, vx, vy, vr, vh). For UAVs, the aspect ratio r is highly variable because the drone’s altitude and viewing angle change rapidly. I replace r and its velocity vr with the width w and its velocity vw. The modified state vector becomes:

$$ x_t = (x, y, w, h, v_x, v_y, v_w, v_h)^T $$
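Both parameterizations describe the same box, so measurements can be converted between them directly. The sketch below assumes r = w / h, the convention used by the ByteTrack and DeepSORT Kalman filters; the function names are hypothetical.

```python
from typing import Tuple

def xyrh_to_xywh(x: float, y: float, r: float, h: float) -> Tuple[float, float, float, float]:
    """Convert (center x, center y, aspect ratio, height) to (x, y, width, height)."""
    return (x, y, r * h, h)

def xywh_to_xyrh(x: float, y: float, w: float, h: float) -> Tuple[float, float, float, float]:
    """Inverse conversion; the aspect ratio is undefined for non-positive height."""
    if h <= 0:
        raise ValueError("height must be positive")
    return (x, y, w / h, h)
```

With the width form, a change in viewing angle that alters w but not h affects only one state component, whereas in the ratio form r couples the two.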

To compensate for camera motion, I estimate an affine transformation between frames t-1 and t using sparse optical flow on background keypoints. Keypoints are extracted with ORB (or FAST) and tracked with the Lucas-Kanade method, and the affine matrix A = [M | T], with M ∈ R^{2×2} and T ∈ R^2, is estimated robustly with RANSAC. Let the Kalman prediction at time t (without compensation) be:

$$ \hat{x}_{t|t-1} = (x, y, w, h, v_x, v_y, v_w, v_h)^T $$

Define the block-diagonal matrix M_t|t-1 and the translation vector T'_t|t-1:

$$ M_{t|t-1} = \begin{bmatrix} M & 0 & 0 & 0 \\ 0 & M & 0 & 0 \\ 0 & 0 & M & 0 \\ 0 & 0 & 0 & M \end{bmatrix} \in \mathbb{R}^{8 \times 8} $$

$$ T'_{t|t-1} = (T^T, 0, 0, 0, 0, 0, 0)^T \in \mathbb{R}^8 $$

where each 0 in M_t|t-1 denotes a 2×2 zero block, so the translation T shifts only the box center (x, y).

The compensated prediction is:

$$ \hat{x}'_{t|t-1} = M_{t|t-1} \hat{x}_{t|t-1} + T'_{t|t-1} $$

The covariance matrix is similarly transformed:

$$ P'_{t|t-1} = M_{t|t-1} P_{t|t-1} M_{t|t-1}^T $$

Then the standard Kalman update is performed using $\hat{x}'_{t|t-1}$ and $P'_{t|t-1}$.
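The compensation step amounts to one matrix-vector product, one translation of the box center, and one congruence transform of the covariance. The following sketch uses plain Python lists to stay self-contained; the helper names are hypothetical, and the 2×2 matrix M and translation T are assumed to have been estimated already.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def block_diag4(M: Matrix) -> Matrix:
    """Build the 8x8 block-diagonal matrix diag(M, M, M, M) from a 2x2 M."""
    out = [[0.0] * 8 for _ in range(8)]
    for block in range(4):
        for r in range(2):
            for c in range(2):
                out[2 * block + r][2 * block + c] = M[r][c]
    return out

def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]

def compensate(x: Sequence[float], P: Matrix, M: Matrix,
               T: Sequence[float]) -> Tuple[List[float], Matrix]:
    """Apply camera motion compensation to a predicted state and covariance.

    x is the 8-dimensional prediction (x, y, w, h, vx, vy, vw, vh) and
    P its 8x8 covariance. Returns (M_blk x + T', M_blk P M_blk^T).
    """
    M_blk = block_diag4(M)
    x_new = [sum(M_blk[i][k] * x[k] for k in range(8)) for i in range(8)]
    x_new[0] += T[0]  # the translation affects only the box center
    x_new[1] += T[1]
    P_new = matmul(matmul(M_blk, P), transpose(M_blk))
    return x_new, P_new
```

For a pure camera translation, M is the identity, so only the predicted center moves and the covariance is unchanged.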

3.4 Gaussian Process Trajectory Interpolation

Detection failures due to occlusion or small target size lead to broken trajectories. Linear interpolation between endpoints is simple but ignores motion dynamics. I instead apply Gaussian process (GP) regression to model the target’s trajectory as a smooth non-linear function. Given the i-th track of length l with observed positions p(t) at frames t, I assume:

$$ p(t) = g^{(i)}(t) + \theta, \quad \theta \sim \mathcal{N}(0, \sigma^2) $$

where g⁽ⁱ⁾ is drawn from a GP with zero mean and radial basis function (RBF) kernel:

$$ k(t, t') = \exp\left(-\frac{\|t - t'\|^2}{2\lambda^2}\right) $$

Given observed frames F and their positions P, the predicted positions P_* at new frames F_* are:

$$ P_* = K(F_*, F) \big( K(F, F) + \sigma^2 I \big)^{-1} P $$

The length-scale parameter λ is adapted to the track length l as:

$$ \lambda = \tau \cdot \log\left( \frac{\tau^3}{l} \right) $$

where τ is a scale hyperparameter.

GP interpolation is applied to gaps of up to 30 frames; a track whose gap exceeds 30 frames is treated as permanently terminated and is not interpolated.
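The posterior mean formula of this section can be evaluated for one coordinate at a time with a small linear solve. The sketch below is a minimal standard-library illustration, not the implementation used in the experiments: the function names are hypothetical, and the length scale and noise variance are passed in explicitly with illustrative values.

```python
import math
from typing import List, Sequence

def rbf(t1: float, t2: float, length_scale: float) -> float:
    """RBF kernel k(t, t') = exp(-(t - t')^2 / (2 * lambda^2))."""
    return math.exp(-((t1 - t2) ** 2) / (2.0 * length_scale ** 2))

def solve(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or badly conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def gp_interpolate(frames: Sequence[float], values: Sequence[float],
                   query_frames: Sequence[float],
                   length_scale: float, noise_var: float) -> List[float]:
    """Posterior mean K(F*, F) (K(F, F) + sigma^2 I)^-1 P for one coordinate."""
    n = len(frames)
    K = [[rbf(frames[i], frames[j], length_scale) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, list(values))
    return [sum(rbf(q, frames[i], length_scale) * alpha[i] for i in range(n))
            for q in query_frames]

# Example: x-coordinates observed at frames 0-3 and 7-9; frames 4-6 are missing.
frames = [0, 1, 2, 3, 7, 8, 9]
xs = [10.0, 12.1, 13.9, 16.2, 24.1, 25.8, 28.0]
mean_x = sum(xs) / len(xs)  # the prior has zero mean, so center the data first
centered = [x - mean_x for x in xs]
filled = [mean_x + v for v in
          gp_interpolate(frames, centered, [4, 5, 6], length_scale=3.0, noise_var=0.01)]
```

Centering the observations before regression is a choice made for this example, because a zero-mean prior otherwise pulls predictions toward the image origin.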

3.5 Combined Matching Metric

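For the first association round, I compute a weighted cost C between track i and detection j as:

$$ C(i, j) = \lambda \, d_{\cos}(i, j) + (1 - \lambda) \, d_{\text{Mah}}(i, j) $$

where d_cos is the cosine distance between OSNet features (one minus the cosine similarity of Section 3.2), and d_Mah is the squared Mahalanobis distance between the track’s Kalman prediction and the detection:

$$ d_{\text{Mah}}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i) $$

Here d_j is the measurement vector of detection j, y_i is the predicted measurement of track i (its predicted state mean projected into measurement space), and S_i is the corresponding innovation covariance. The weight λ, which is distinct from the GP length scale of Section 3.4, is set to 0.98 to emphasize appearance similarity.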
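The combined cost for a single track-detection pair can be sketched as below. This is an illustration under assumptions rather than the evaluated implementation: the function names are hypothetical, the cosine distance is supplied as a precomputed number, and the innovation covariance is inverted implicitly through a small linear solve.

```python
from typing import List, Sequence

def solve(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("innovation covariance is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def mahalanobis_sq(d: Sequence[float], y: Sequence[float],
                   S: Sequence[Sequence[float]]) -> float:
    """(d - y)^T S^-1 (d - y), computed without forming S^-1 explicitly."""
    diff = [a - b for a, b in zip(d, y)]
    z = solve(S, diff)  # z = S^-1 (d - y)
    return sum(a * b for a, b in zip(diff, z))

def combined_cost(d_cos: float, d_mah: float, lam: float = 0.98) -> float:
    """C = lam * d_cos + (1 - lam) * d_Mah, with lam = 0.98 as in the text."""
    return lam * d_cos + (1.0 - lam) * d_mah
```

Since the Mahalanobis term is unbounded while the cosine distance is not, the small weight of 0.02 on the motion term still lets a gross position mismatch dominate the cost.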

4. Experiments

4.1 Dataset and Configurations

I evaluate my method on the VisDrone2019-MOT dataset, which contains 79 video sequences (56 training, 7 validation, 16 test) with 33,366 frames covering five object categories: pedestrian, car, van, bus, and bicycle. All experiments use YOLOv5 as the object detector, trained on the training set with default hyperparameters. OSNet is pre-trained on VisDrone cropped objects. The tracker runs on an Intel i5-13600KF CPU, 32GB RAM, and RTX 4060 Ti GPU with PyTorch 1.11 and CUDA 11.7.

4.2 Comparison with State-of-the-Art

I compare my improved ByteTrack with SORT, DeepSORT, DeepMOT, and the original ByteTrack. All trackers use the same YOLOv5 detector. Results on the test set are shown in Table 1.

Table 1: Performance Comparison on VisDrone2019-MOT Test Set

Algorithm	MOTA (%)	MOTP (%)	IDS ↓	IDF1 (%)	FPS
SORT	23.3	69.1	3377	25.8	24.53
DeepSORT	18.9	71.2	3071	29.0	18.13
ByteTrack (original)	23.9	69.8	1685	29.7	16.51
DeepMOT	15.7	67.5	3266	26.7	19.11
Improved ByteTrack	25.4	72.0	1107	32.6	14.97

My method achieves the highest MOTA (25.4%), MOTP (72.0%), and IDF1 (32.6%), and the fewest ID switches (1107). This comes at a cost in speed: FPS drops from 16.51 for the baseline to 14.97 (about 9%), the lowest among the compared trackers. The reduction of 578 ID switches (from 1685 to 1107) indicates better identity consistency across frames; the ablation study in Section 4.3 attributes most of this reduction to the appearance module.

4.3 Ablation Study

I conduct a stepwise ablation to isolate the contribution of each component: (a) baseline ByteTrack, (b) + OSNet, (c) + camera motion compensation, (d) + GP interpolation. All experiments use the same detector. Table 2 presents the results.

Table 2: Ablation Study on VisDrone2019-MOT Test Set

Configuration	MOTA (%)	IDS ↓
Baseline ByteTrack	23.9	1685
+ OSNet	24.8	1134
+ Camera compensation	25.1	1231
+ GP interpolation	25.4	1107

Adding the OSNet appearance module improves MOTA by 0.9 percentage points and reduces IDS by 551 (from 1685 to 1134), confirming that appearance cues are important for re-identifying targets after occlusion. Camera motion compensation raises MOTA by a further 0.3 points but increases IDS by 97 (from 1134 to 1231); a likely cause is that small alignment errors in the estimated transformation disturb short-term associations. Finally, GP interpolation improves MOTA by another 0.3 points and cuts IDS by 124 (from 1231 to 1107), demonstrating its advantage in recovering smooth trajectories during detection gaps.
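The stepwise differences quoted above follow directly from Table 2 and can be recomputed with a few lines. The snippet only restates the table values; the variable and function names are hypothetical.

```python
from typing import List, Tuple

# (configuration, MOTA in %, ID switches), copied from Table 2
ablation: List[Tuple[str, float, int]] = [
    ("Baseline ByteTrack", 23.9, 1685),
    ("+ OSNet", 24.8, 1134),
    ("+ Camera compensation", 25.1, 1231),
    ("+ GP interpolation", 25.4, 1107),
]

def stepwise_deltas(rows: List[Tuple[str, float, int]]) -> List[Tuple[str, float, int]]:
    """Change in MOTA (percentage points) and IDS relative to the previous row."""
    deltas = []
    for (_, mota_prev, ids_prev), (name, mota, ids) in zip(rows, rows[1:]):
        deltas.append((name, round(mota - mota_prev, 1), ids - ids_prev))
    return deltas

deltas = stepwise_deltas(ablation)
total_mota_gain = round(ablation[-1][1] - ablation[0][1], 1)  # full method vs baseline
total_ids_change = ablation[-1][2] - ablation[0][2]
```

A negative IDS delta is an improvement, so the only step that worsens identity consistency is the camera compensation step.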

5. Conclusion

In this work, I have presented an improved ByteTrack algorithm specifically designed for multi-object tracking in UAV aerial imagery. By integrating an OSNet-based appearance feature extractor, a camera motion compensation module, and a Gaussian process trajectory interpolation mechanism, my method addresses the core challenges of small targets, irregular camera movement, and occlusion. Experiments on the VisDrone2019-MOT dataset show that the proposed method outperforms the original ByteTrack and the other compared trackers in terms of MOTA, MOTP, IDF1, and ID switches, at the cost of a lower frame rate. The ablation study quantifies the individual and combined effects of the modules. My approach provides a reliable solution for UAV applications in complex real-world scenarios, and future work may explore end-to-end joint detection and tracking with further speed optimization for deployment on embedded platforms.

References

[1] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-object tracking by associating every detection box. ECCV, 2022.

[2] Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric. ICIP, 2017.

[3] Zhou K, Yang Y, Cavallaro A, et al. Omni-scale feature learning for person re-identification. ICCV, 2019.

[4] VisDrone: A benchmark for vision-based drone tracking. IEEE T-PAMI, 2021.