Improved ByteTrack for UAV Drone Multi-object Tracking

In recent years, the rapid advancement of UAV drone technology has significantly lowered manufacturing costs, making aerial surveillance and monitoring increasingly accessible. One of the core applications of UAV drones is visual multi-object tracking (MOT), which plays a vital role in autonomous navigation, traffic monitoring, agricultural inspection, disaster rescue, and urban surveillance. However, tracking multiple objects from a UAV drone perspective presents unique challenges: objects appear small and distant, the relative motion between the camera and targets is highly irregular, and occlusions caused by buildings, trees, or other obstacles are frequent. These factors collectively degrade tracking accuracy and robustness.

To address these issues, I propose an improved multi-object tracking algorithm based on ByteTrack, a state-of-the-art tracker that follows the tracking-by-detection paradigm. My approach enhances both the motion feature module and the appearance feature module to adapt to the specific characteristics of UAV drone aerial imagery. The original ByteTrack algorithm does not incorporate deep appearance features; it relies solely on motion cues, which limits its performance under occlusion and fast motion. I introduce an OSNet-based appearance feature extraction module to capture robust visual embeddings of targets. Additionally, I modify the Kalman filter state vector to better represent bounding box dimensions for UAV drone perspectives, integrate a camera motion compensation step to handle ego-motion, and employ Gaussian process regression for trajectory interpolation to recover broken tracks caused by missed detections. Experimental results on the VisDrone2019-MOT dataset show that my algorithm outperforms the original ByteTrack by 1.5 percentage points in MOTA and reduces ID switches by 578 (from 1685 to 1107), the best results among the compared trackers.

The rest of this paper is organized as follows. Section 1 reviews related work on multi-object tracking and UAV drone-specific challenges. Section 2 details the proposed method, including the overall framework, the modified motion state vector, appearance feature extraction, camera motion compensation, trajectory interpolation, and the combined matching metric. Section 3 presents the experimental setup, results, and ablation studies. Finally, Section 4 concludes the paper.

1. Related Work

Multi-object tracking algorithms can be broadly categorized into two paradigms: tracking-by-detection (TBD) and joint detection and tracking (JDT). TBD methods first detect objects in each frame using an object detector, then associate detections across frames to form tracks. This decoupled design allows independent optimization of the detection and association modules. Recent advances in object detection, such as the YOLO series, have enabled TBD approaches to achieve state-of-the-art results on public benchmarks. ByteTrack, proposed by Zhang et al., is a representative TBD tracker that associates every detection box, including low-confidence ones, through a two-stage association strategy. Its simplicity and efficiency make it a strong baseline.

However, ByteTrack was primarily designed for fixed-camera or pedestrian tracking scenarios. Directly applying it to UAV drone aerial videos leads to performance degradation due to several factors: (1) the Kalman filter state vector uses aspect ratio and height, which are less stable under perspective changes; (2) no camera motion compensation is applied, so ego-motion introduces large prediction errors; (3) missed detections caused by small targets or occlusions break trajectories, and simple linear interpolation fails to capture nonlinear motion; (4) the lack of an appearance metric makes the tracker vulnerable to identity switches when targets move close to or occlude each other.

Several works have attempted to improve UAV drone multi-object tracking. Su et al. proposed a motion-model-based method for UAV ground target tracking, but it could not handle fast relative motion. Niu et al. designed a two-stage matching tracker. Wu et al. introduced an end-to-end attention-based tracker using SwinTransformer. Wang et al. combined ResNet and Kalman filter for tracking. However, none of these methods fully address all the core difficulties of aerial scenes. My work builds upon ByteTrack and systematically improves both motion and appearance modules, achieving superior performance.

In the following, I describe the proposed method in detail.

2. Proposed Method

My improved ByteTrack algorithm retains the overall tracking-by-detection pipeline but introduces four key enhancements: (1) a modified Kalman filter state vector suitable for UAV drone perspectives; (2) an OSNet-based appearance feature extraction module; (3) camera motion compensation via a global affine transformation estimated between consecutive frames; (4) Gaussian process regression for trajectory interpolation. The overall framework is illustrated conceptually in the inserted figure below.

The algorithm proceeds as follows. First, a YOLOv5 detector localizes objects in each frame, producing bounding boxes with detection confidence scores. Detections are split into high-confidence and low-confidence groups. For each tracklet from the previous frame, the Kalman filter predicts its state in the current frame, and the camera motion compensation module warps this prediction into the current frame's coordinate system. In the first association stage, I compute a combined matching cost matrix based on both motion (Mahalanobis distance) and appearance (cosine distance between OSNet features) between high-confidence detections and predicted tracklets. The Hungarian algorithm is used for assignment. Tracks left unmatched are then passed to a second association stage, where they are matched against the low-confidence detections using only an IoU-based cost. Finally, low-confidence detections that remain unmatched are discarded, and track management is updated: unmatched high-confidence detections initialize new tracks, and tracks that stay lost are terminated. When a track is re-detected after a gap of missed detections, the Gaussian process regression module fills in the missing positions to recover the trajectory.
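The confidence split at the start of the pipeline can be sketched as below. This is a minimal illustration only: the function name `split_detections` and the thresholds 0.6 and 0.1 are assumptions made for the example, since the text does not state the values used.

```python
import numpy as np


def split_detections(scores, high_thresh=0.6, low_thresh=0.1):
    """Return index arrays of high- and low-confidence detections.

    The two thresholds are hypothetical values chosen for illustration.
    Detections scoring below low_thresh are dropped entirely.
    """
    scores = np.asarray(scores, dtype=float)
    high = np.flatnonzero(scores >= high_thresh)
    low = np.flatnonzero((scores >= low_thresh) & (scores < high_thresh))
    return high, low


# Five detections: two go to stage one, two to stage two, one is dropped.
scores = [0.92, 0.35, 0.05, 0.61, 0.10]
high_idx, low_idx = split_detections(scores)
```

Returning indices rather than copies keeps the boxes, scores, and appearance features aligned when they are stored in parallel arrays.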

2.1 Modified Motion State Vector

The original ByteTrack uses a Kalman filter with state vector:
$$
\mathbf{x}_k = (x, y, r, h, v_x, v_y, v_r, v_h)
$$
where $x, y$ are the center coordinates, $r$ is the aspect ratio (width/height), $h$ is the height, and the $v$ terms are their respective velocities. In UAV drone aerial images, the aspect ratio $r$ is unstable due to perspective distortion and camera roll. Moreover, the width $w$ itself is a more direct measure. Therefore, I replace $r$ and $v_r$ with $w$ and $v_w$. The new state vector becomes:
$$
\mathbf{x}_k = (x, y, w, h, v_x, v_y, v_w, v_h)
$$
This modification allows the Kalman filter to directly track the width and its change, which is more natural for objects viewed from above where the bounding box can scale anisotropically.
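A minimal sketch of a Kalman filter over the new state vector is given below. It assumes a constant-velocity motion model with a unit time step, which matches the velocity terms in the state but is not spelled out in the text; the noise matrices `Q` and `R` and all function names are placeholders for illustration.

```python
import numpy as np


def make_cv_model(dt=1.0):
    """Constant-velocity model for the state (x, y, w, h, vx, vy, vw, vh)."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)  # each of x, y, w, h advances by its velocity
    H = np.hstack([np.eye(4), np.zeros((4, 4))])  # detector observes (x, y, w, h)
    return F, H


def kf_predict(x, P, F, Q):
    """Propagate mean and covariance one step forward."""
    return F @ x, F @ P @ F.T + Q


def kf_update(x, P, z, H, R):
    """Correct the prediction with a measurement z = (x, y, w, h)."""
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new


F, H = make_cv_model()
x = np.array([100.0, 50.0, 20.0, 10.0, 2.0, -1.0, 0.5, 0.0])
P = np.eye(8)
x_pred, P_pred = kf_predict(x, P, F, Q=0.01 * np.eye(8))  # Q is a placeholder
```

In this example the width grows by its own velocity (20.0 to 20.5) while the height stays fixed, which is the anisotropic scaling that the aspect-ratio parameterization cannot represent with a single linear term.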

2.2 OSNet-based Appearance Feature Extraction

ByteTrack does not use any appearance features, making it rely solely on motion association. This is problematic when objects cross paths or are temporarily occluded. To address this, I integrate an OSNet architecture to extract discriminative appearance embeddings for each detected target. OSNet is a lightweight deep network based on depthwise separable convolutions and channel-wise attention, designed for person re-identification but generalizable to other objects. The extraction pipeline is as follows:

For each detection box, I crop the corresponding image region and resize it to a fixed size (e.g., 128×64 pixels).
The resized patch is fed into OSNet, which outputs a feature map. After global average pooling and a fully connected layer, a 512-dimensional feature vector $\mathbf{F}$ is obtained.
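The appearance similarity between two feature vectors $\mathbf{F}_i$ and $\mathbf{F}_j$ is computed using cosine similarity:
$$
\cos(\mathbf{F}_i, \mathbf{F}_j) = \frac{\mathbf{F}_i \cdot \mathbf{F}_j}{\|\mathbf{F}_i\| \, \|\mathbf{F}_j\|}
$$
where $\mathbf{F}_i$ and $\mathbf{F}_j$ are the 512-dimensional OSNet embeddings of the two image patches being compared.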

To build a gallery of appearance features for existing tracks, I maintain a feature pool for each tracklet. Specifically, for each track, I store the most recent 100 feature vectors. When matching a new detection to a track, I compare the detection's feature with all stored features of that track and take the maximum cosine similarity; the appearance cost is one minus this value (Section 2.5). This design is robust to appearance variations over short periods.
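The feature pool described above might be organized as in the following sketch. The class name `FeaturePool` and its methods are hypothetical; only the pool size of 100 and the max-similarity rule come from the text. Features are normalized on insertion so that a dot product equals the cosine similarity.

```python
from collections import deque

import numpy as np


def _unit(v):
    """Return v scaled to unit length (small epsilon avoids division by zero)."""
    v = np.asarray(v, dtype=float)
    return v / max(np.linalg.norm(v), 1e-12)


class FeaturePool:
    """Fixed-size gallery of the most recent appearance features of one track."""

    def __init__(self, max_size=100):
        self.features = deque(maxlen=max_size)  # oldest entry is evicted first

    def add(self, feature):
        self.features.append(_unit(feature))

    def max_similarity(self, feature):
        """Largest cosine similarity between `feature` and any stored feature."""
        gallery = np.stack(self.features)        # shape (n, d); pool must be non-empty
        return float(np.max(gallery @ _unit(feature)))

    def appearance_cost(self, feature):
        """Cosine distance to the best-matching stored feature."""
        return 1.0 - self.max_similarity(feature)


pool = FeaturePool()
pool.add([1.0, 0.0])
pool.add([0.0, 1.0])
cost = pool.appearance_cost([2.0, 0.0])  # same direction as the first feature
```

A `deque` with `maxlen` gives the "most recent N" behaviour without explicit bookkeeping.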

2.3 Camera Motion Compensation

UAV drone cameras often undergo complex ego-motion (translation, rotation, scaling) due to flight dynamics. This motion shifts the image coordinates of all objects, causing large prediction errors if not compensated. I adopt a two-frame image registration approach to estimate a global affine transformation between consecutive frames. Specifically:

In the previous frame and current frame, I detect keypoints (e.g., using ORB or Shi-Tomasi) and compute sparse optical flow using the Lucas-Kanade method.
A RANSAC-based algorithm fits an affine matrix $\mathbf{A} \in \mathbb{R}^{2 \times 3}$ that transforms points from the previous frame to the current frame. (A $2 \times 3$ matrix describes an affine transformation; a full homography would be $3 \times 3$.) The matrix consists of a rotation-scale part $\mathbf{M} \in \mathbb{R}^{2 \times 2}$ and a translation part $\mathbf{T} \in \mathbb{R}^{2}$:
$$
\mathbf{A} = [\mathbf{M} \;|\; \mathbf{T}]
$$
For each Kalman predicted state $\hat{\mathbf{x}}_{k|k-1} = (x, y, w, h, v_x, v_y, v_w, v_h)^\top$, I apply the transformation so that $\mathbf{M}$ acts on each of the pairs $(x, y)$, $(w, h)$, $(v_x, v_y)$, and $(v_w, v_h)$, while $\mathbf{T}$ shifts only the center. I construct a block-diagonal transformation matrix $\mathbf{M}_{k|k-1} \in \mathbb{R}^{8 \times 8}$, in which each $0$ denotes a $2 \times 2$ zero block:
$$
\mathbf{M}_{k|k-1} = \begin{bmatrix}
\mathbf{M} & 0 & 0 & 0 \\
0 & \mathbf{M} & 0 & 0 \\
0 & 0 & \mathbf{M} & 0 \\
0 & 0 & 0 & \mathbf{M}
\end{bmatrix}
$$
and a translation vector $\mathbf{T}'_{k|k-1} = (\mathbf{T}^\top, 0, 0, 0, 0, 0, 0)^\top \in \mathbb{R}^8$. The warped state is:
$$
\hat{\mathbf{x}}'_{k|k-1} = \mathbf{M}_{k|k-1} \hat{\mathbf{x}}_{k|k-1} + \mathbf{T}'_{k|k-1}
$$
The covariance matrix is also transformed:
$$
\mathbf{P}'_{k|k-1} = \mathbf{M}_{k|k-1} \mathbf{P}_{k|k-1} \mathbf{M}_{k|k-1}^\top
$$
The warped state and covariance are then used as the prior for the Kalman update step.

This compensation effectively removes the global background motion, making the Kalman filter’s predictions more accurate.
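The warping of the predicted state and covariance can be sketched in a few lines of NumPy. The function name `warp_state` is hypothetical, and the $2 \times 3$ matrix is taken as given here; estimating it from optical flow is outside this sketch.

```python
import numpy as np


def warp_state(x, P, A):
    """Apply a 2x3 global motion matrix A = [M | T] to an 8-D Kalman state.

    x is ordered (x, y, w, h, vx, vy, vw, vh); M acts on each consecutive
    pair and T shifts only the box center, as in the equations above.
    """
    A = np.asarray(A, dtype=float)
    M, T = A[:, :2], A[:, 2]
    M8 = np.kron(np.eye(4), M)  # 8x8 block-diagonal with four copies of M
    T8 = np.zeros(8)
    T8[:2] = T
    return M8 @ x + T8, M8 @ P @ M8.T


# Pure translation of the background by (+5, -3) pixels.
A = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0]])
x = np.array([100.0, 50.0, 20.0, 10.0, 2.0, -1.0, 0.5, 0.0])
P = np.eye(8)
x_warped, P_warped = warp_state(x, P, A)
```

With a pure translation only the center moves and the covariance is unchanged; a scale change in `M` rescales the box size and all velocities as well.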

2.4 Gaussian Process Regression Trajectory Interpolation

Missed detections are common in UAV drone tracking due to small object sizes, occlusions, or abrupt camera motion. When a target is undetected for a few frames, its trajectory is broken. Simple linear interpolation between the last known and first re-detected position ignores the object’s motion pattern and can cause large errors. I employ Gaussian process regression (GPR) to model the underlying nonlinear motion and interpolate missing positions smoothly.

Let the available trajectory points from track $i$ be represented as a set of frame indices $F = \{t_1, t_2, \dots, t_L\}$ and corresponding positions $P = \{p_{t_1}, p_{t_2}, \dots, p_{t_L}\}$ where $p_t = (x_t, y_t)$. I assume the position at any frame $f$ follows a Gaussian process:
$$
p_f = g^{(i)}(f) + \theta, \quad \theta \sim \mathcal{N}(0, \sigma^2)
$$
where $g^{(i)}$ is a latent function and $\sigma^2$ is noise variance. I use a zero-mean GP prior: $g^{(i)} \sim \mathcal{GP}(0, k(\cdot, \cdot))$ with a radial basis function (RBF) kernel:
$$
k(t, t') = \exp\left(-\frac{\|t - t'\|^2}{2 \lambda^2}\right)
$$
The hyperparameter $\lambda$ controls the smoothness of the interpolated trajectory. To adapt to varying trajectory lengths, I set $\lambda$ adaptively:
$$
\lambda = t \cdot \log\left(\frac{3}{l}\right)
$$
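where $t$ is the current frame index and $l$ is the length of the known trajectory segment. Only $\lambda^2$ enters the kernel, so the sign of $\lambda$ does not matter; for $l > 3$ its magnitude $t \log(l/3)$ grows with $l$, which gives smoother interpolation for longer trajectories and more flexible interpolation for shorter ones. The expression requires $l \neq 3$, since $\lambda = 0$ would leave the kernel undefined.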

Given observed frames $F$ and positions $P$, the predicted positions $\mathbf{P}_*$ at a new set of frames $F_*$ (the missing frames) are:
$$
\mathbf{P}_* = K(F_*, F) \left[ K(F, F) + \sigma^2 I \right]^{-1} P
$$
where $K(\cdot, \cdot)$ is the covariance matrix computed using the RBF kernel. This interpolation not only fills gaps but also provides a probabilistic uncertainty estimate, which can be useful for downstream tasks. In practice, I apply GPR only when the gap is less than 30 frames, because the posterior becomes unreliable when interpolating across long gaps.
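The posterior-mean formula can be illustrated with the NumPy sketch below. Two choices here are assumptions rather than details from the text: the length scale is passed in as a plain parameter instead of being computed adaptively, and the mean of the observed positions is subtracted before applying the zero-mean prior (and added back afterwards) so that predictions are not pulled toward the image origin. The function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """RBF kernel matrix between two 1-D arrays of frame indices."""
    diff = a[:, None] - b[None, :]
    return np.exp(-(diff ** 2) / (2.0 * length_scale ** 2))


def gpr_interpolate(frames, positions, query_frames, length_scale, noise_var=1e-2):
    """Posterior mean K(F*, F) [K(F, F) + sigma^2 I]^{-1} P at the query frames."""
    f = np.asarray(frames, dtype=float)
    p = np.asarray(positions, dtype=float)       # shape (L, 2): one (x, y) per frame
    fq = np.asarray(query_frames, dtype=float)
    offset = p.mean(axis=0)                      # assumption: center before zero-mean GP
    K = rbf_kernel(f, f, length_scale) + noise_var * np.eye(len(f))
    K_star = rbf_kernel(fq, f, length_scale)
    # Solve the linear system instead of forming the explicit inverse.
    return K_star @ np.linalg.solve(K, p - offset) + offset


# A stationary target observed at frames 0-2 and 5-7; frames 3 and 4 are missing.
frames = [0, 1, 2, 5, 6, 7]
positions = [[10.0, 20.0]] * 6
filled = gpr_interpolate(frames, positions, [3, 4], length_scale=3.0)
```

For the stationary target in the example the centered residuals are all zero, so the filled positions equal the observed position. The 30-frame gap limit from the text would be checked by the caller before invoking the interpolation.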

2.5 Combined Matching Metric

In the first association stage, I combine motion and appearance costs into a single cost matrix. The motion cost between detection $j$ and track prediction $i$ is given by the Mahalanobis distance:
$$
d^{(1)}(i, j) = (\mathbf{d}_j - \mathbf{y}_i)^\top \mathbf{S}_i^{-1} (\mathbf{d}_j - \mathbf{y}_i)
$$
where $\mathbf{d}_j = (x, y, w, h)$ is the measurement vector of detection $j$, $\mathbf{y}_i$ is the predicted state of track $i$ projected into the same measurement space, and $\mathbf{S}_i$ is the corresponding innovation covariance from the Kalman filter. The appearance cost is the cosine distance:
$$
d_{\text{cos}}(i, j) = 1 - \cos(\mathbf{F}_i^{\text{gallery}}, \mathbf{F}_j^{\text{det}})
$$
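Here $\mathbf{F}_i^{\text{gallery}}$ denotes the feature in the pool of track $i$ that yields the maximum cosine similarity with the detection feature. The combined cost is:
$$
C(i, j) = \lambda \, d_{\text{cos}}(i, j) + (1 - \lambda) \, d^{(1)}(i, j)
$$
with weight $\lambda = 0.98$ (unrelated to the kernel length scale $\lambda$ of Section 2.4) to heavily favor appearance similarity, as motion can be noisy under UAV drone ego-motion. A gate threshold is applied to the Mahalanobis distance: if $d^{(1)}(i,j) > 9.21$, the cost is set to infinity. The value 9.21 is the 0.99 quantile of the chi-square distribution with 2 degrees of freedom; for the 4-dimensional measurement $(x, y, w, h)$ the 0.99 quantile is 13.28 and the 0.95 quantile is 9.49, so the gate used here is slightly stricter than a 95% gate in four dimensions.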
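The gated combination can be sketched as follows, using the weight 0.98 and the gate 9.21 from the text. The function names are hypothetical, and the cost matrices are assumed to be laid out with tracks as rows and detections as columns.

```python
import numpy as np

GATE = 9.21    # Mahalanobis gate from the text
WEIGHT = 0.98  # appearance weight (lambda) from the text


def mahalanobis_sq(det, pred, S):
    """Squared Mahalanobis distance between a detection and a predicted measurement."""
    diff = np.asarray(det, dtype=float) - np.asarray(pred, dtype=float)
    return float(diff @ np.linalg.solve(S, diff))


def combined_cost(d_cos, d_maha, weight=WEIGHT, gate=GATE):
    """Weighted sum of appearance and motion costs with Mahalanobis gating.

    d_cos and d_maha are (num_tracks, num_detections) matrices. Pairs whose
    motion distance exceeds the gate receive an infinite cost.
    """
    d_cos = np.asarray(d_cos, dtype=float)
    d_maha = np.asarray(d_maha, dtype=float)
    cost = weight * d_cos + (1.0 - weight) * d_maha
    cost[d_maha > gate] = np.inf
    return cost


# One track, two detections: the second is far outside the motion gate.
cost = combined_cost(d_cos=[[0.1, 0.9]], d_maha=[[1.0, 20.0]])
```

If the matrix is handed to an assignment solver, infinite entries may need to be replaced by a large finite value first, since some solvers reject a matrix in which no finite-cost complete assignment exists.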

In the second association stage, only an IoU-based cost is used:
$$
d_{\text{IoU}}(i,j) = 1 - \text{IoU}(b_i, b_j)
$$
where $b_i$ is the predicted box of track $i$ and $b_j$ is the box of detection $j$. Low-confidence detections take part only in this stage; those that remain unmatched are discarded.
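A vectorized sketch of the IoU cost is shown below. It assumes boxes in corner format (x1, y1, x2, y2), so a small helper converts from the center format (x, y, w, h) used by the Kalman state; both function names are hypothetical.

```python
import numpy as np


def cxcywh_to_xyxy(boxes):
    """Convert (center x, center y, width, height) boxes to corner format."""
    b = np.asarray(boxes, dtype=float)
    half = b[:, 2:] / 2.0
    return np.hstack([b[:, :2] - half, b[:, :2] + half])


def iou_distance(track_boxes, det_boxes):
    """Return the (num_tracks, num_detections) matrix of 1 - IoU values."""
    a = np.asarray(track_boxes, dtype=float)[:, None, :]
    b = np.asarray(det_boxes, dtype=float)[None, :, :]
    # Overlap extent along each axis, clipped at zero for disjoint boxes.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0.0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0.0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter
    iou = inter / np.maximum(union, 1e-12)  # guard against degenerate boxes
    return 1.0 - iou


tracks = [[0.0, 0.0, 2.0, 2.0]]
dets = [[0.0, 0.0, 2.0, 2.0], [1.0, 0.0, 3.0, 2.0], [5.0, 5.0, 6.0, 6.0]]
dist = iou_distance(tracks, dets)
```

In the example, the identical box has distance 0, the half-overlapping box has IoU 1/3 and distance 2/3, and the disjoint box has distance 1.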

3. Experimental Results

3.1 Experimental Setup

I evaluate my method on the VisDrone2019-MOT dataset, which contains 79 video sequences (33,366 frames) captured from UAV drones. The training set has 56 sequences (24,198 frames), the validation set 7 sequences (2,846 frames), and the test set 16 sequences (6,322 frames). Object categories include pedestrian, car, van, bus, and bicycle. I use YOLOv5x as the object detector, trained on the VisDrone training set with 640×640 input size. The OSNet model is pre-trained on ImageNet and fine-tuned on VisDrone crops extracted from training frames, using a triplet loss with margin 0.3. All experiments run on an Intel Core i5-13600KF with 32 GB RAM and an NVIDIA RTX 4060 Ti GPU. PyTorch framework is used.

Evaluation metrics follow the standard MOT Challenge: MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), IDS (ID Switches), IDF1 (ID F1 score), and FPS (frames per second).
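For reference, the two headline metrics are commonly defined as follows. These are the standard CLEAR-MOT and identity-metric definitions, not formulas restated from the experiments; the symbols $\text{FN}_t$, $\text{FP}_t$, $\text{IDSW}_t$, and $\text{GT}_t$ (false negatives, false positives, identity switches, and ground-truth objects in frame $t$) and $\text{IDTP}$, $\text{IDFP}$, $\text{IDFN}$ (identity-level true positives, false positives, and false negatives) are introduced here for illustration.

$$
\text{MOTA} = 1 - \frac{\sum_t \left( \text{FN}_t + \text{FP}_t + \text{IDSW}_t \right)}{\sum_t \text{GT}_t}, \qquad
\text{IDF1} = \frac{2 \, \text{IDTP}}{2 \, \text{IDTP} + \text{IDFP} + \text{IDFN}}
$$

Because MOTA aggregates three error types over all ground-truth objects, a large drop in ID switches can translate into only a small MOTA change when missed detections dominate the error count.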

3.2 Comparison with State-of-the-Art Methods

I compare my improved ByteTrack against four baseline trackers: SORT, DeepSORT, ByteTrack (original), and DeepMOT. All methods use the same YOLOv5x detector with identical weights for fair comparison. Results on the VisDrone2019-MOT test set are summarized in Table I.

**Table I: Comparison of MOT algorithms on the VisDrone2019-MOT test set**

| Algorithm | MOTA (%) | MOTP (%) | IDS ↓ | IDF1 (%) | FPS |
|---|---|---|---|---|---|
| SORT | 23.3 | 69.1 | 3377 | 25.8 | 24.53 |
| DeepSORT | 18.9 | 71.2 | 3071 | 29.0 | 18.13 |
| ByteTrack (original) | 23.9 | 69.8 | 1685 | 29.7 | 16.51 |
| DeepMOT | 15.7 | 67.5 | 3266 | 26.7 | 19.11 |
| Improved ByteTrack (Ours) | 25.4 | 72.0 | 1107 | 32.6 | 14.97 |

From Table I, my improved ByteTrack achieves the highest MOTA (25.4%), MOTP (72.0%), and IDF1 (32.6%), and the lowest IDS (1107). Compared to the original ByteTrack, MOTA improves by 1.5 percentage points and IDS falls by 578. The FPS is the lowest among the compared trackers (14.97 vs. 16.51 for the original ByteTrack) due to the extra computation of OSNet features and camera motion compensation, although roughly 15 FPS remains usable for near-real-time operation. The results support the effectiveness of the proposed enhancements for UAV drone multi-object tracking.

3.3 Ablation Studies

To isolate the contribution of each component, I conduct ablation experiments by incrementally adding the three main modules: OSNet appearance feature extraction, camera motion compensation, and Gaussian process regression interpolation. The modified state vector of Section 2.1 is not ablated separately. The results are shown in Table II.

**Table II: Ablation study on the VisDrone2019-MOT test set**

| Config | MOTA (%) | IDS ↓ |
|---|---|---|
| Baseline (ByteTrack) | 23.9 | 1685 |
| + OSNet | 24.8 | 1134 |
| + OSNet + Camera Motion Comp. | 25.1 | 1231 |
| + OSNet + Camera Motion Comp. + GP Interp. (Full) | 25.4 | 1107 |

Adding the OSNet appearance module alone improves MOTA by 0.9 percentage points and sharply reduces IDS from 1685 to 1134 (a reduction of 551). This shows that appearance cues are crucial for maintaining identity consistency in UAV drone scenes where objects frequently overlap. Adding camera motion compensation raises MOTA further to 25.1%, though IDS increases by 97 to 1231 compared to the previous configuration (1134). A possible explanation is that the compensation introduces slight misalignments for fast-moving objects, leading to occasional mismatches. The overall tracking accuracy nevertheless improves. Finally, incorporating Gaussian process regression interpolation gives an additional 0.3 percentage points of MOTA and reduces IDS to 1107, indicating that smooth trajectory interpolation helps recover tracks after missed detections and reduces identity switches caused by broken trajectories. Overall, the full model achieves the best performance.
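The step-by-step differences quoted above can be reproduced directly from Table II with a short script. The values are copied from the table; the abbreviated configuration names and the helper `stepwise_deltas` are introduced only for this check.

```python
# (configuration, MOTA in %, ID switches), copied from Table II
ablation = [
    ("Baseline (ByteTrack)", 23.9, 1685),
    ("+ OSNet", 24.8, 1134),
    ("+ OSNet + CMC", 25.1, 1231),
    ("+ OSNet + CMC + GP (Full)", 25.4, 1107),
]


def stepwise_deltas(rows):
    """Change in MOTA (percentage points) and IDS relative to the previous row."""
    deltas = []
    for (_, mota_prev, ids_prev), (name, mota, ids) in zip(rows, rows[1:]):
        deltas.append((name, round(mota - mota_prev, 1), ids - ids_prev))
    return deltas


deltas = stepwise_deltas(ablation)
total_mota_gain = round(ablation[-1][1] - ablation[0][1], 1)
total_ids_drop = ablation[0][2] - ablation[-1][2]
```

The totals match the headline figures: a gain of 1.5 percentage points in MOTA and 578 fewer ID switches for the full model relative to the baseline.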

3.4 Qualitative Analysis

Figure 1 shows several example tracking frames from the VisDrone2019-MOT dataset. As seen, my algorithm can reliably track both moving vehicles (cars, buses) and small pedestrians even under challenging conditions such as partial occlusions and cluttered backgrounds. The improved motion state vector with width instead of aspect ratio allows better handling of bounding box scale changes when the UAV drone changes altitude. The camera motion compensation stabilizes the predictions despite the drone's erratic movement. The OSNet appearance features help objects with similar motion but different visual appearances to be correctly re-identified after occlusion.

Compared to the original ByteTrack, my method exhibits fewer ID switches. For example, in a sequence where multiple pedestrians cross paths, the original ByteTrack tends to swap identities, while my tracker maintains consistent IDs due to the appearance metric. This is particularly important for applications like crowd counting or tracking specific vehicles in traffic surveillance.

4. Conclusion

In this work, I presented an improved ByteTrack algorithm tailored for UAV drone multi-object tracking. The key contributions are: (1) a modified Kalman filter state vector that directly uses width and height, better suited for aerial perspectives; (2) an OSNet-based appearance feature extraction module that provides robust visual embeddings for reliable re-identification; (3) a camera motion compensation module based on optical flow and a global affine transformation to mitigate ego-motion effects; (4) a Gaussian process regression interpolation method to smoothly fill missing trajectory segments caused by missed detections. Experiments on the VisDrone2019-MOT dataset show that the proposed method outperforms the original ByteTrack and the other compared trackers, achieving a MOTA of 25.4% and reducing ID switches by 578 relative to the original ByteTrack. The ablation study shows that each of the three ablated modules increases MOTA.

Future work will explore integrating more advanced detection backbones and employing end-to-end learning for feature extraction and association. I also plan to evaluate the algorithm on other UAV drone datasets, such as UAVDT and Drone-vs-Bird, to further validate its generalization. The proposed improvements provide a solid baseline for practical UAV drone tracking systems in real-world applications.