Improved ByteTrack for Multi-Object Tracking in UAV Drone Aerial Imagery

I propose an enhanced multi-object tracking (MOT) algorithm tailored for UAV drone aerial imagery by improving both the motion and appearance modules of the original ByteTrack framework. The baseline ByteTrack relies solely on detection confidence and motion cues, which degrades accuracy and robustness when targets are small, the camera moves irregularly, or occlusions occur, all of which are common in drone-captured sequences. I introduce an appearance feature extraction module based on OSNet to provide reliable identity information, and I modify the motion model to better suit the UAV drone perspective. Specifically, I replace the aspect-ratio state with an explicit width state, incorporate camera motion compensation through affine transformations estimated from background keypoints, and employ Gaussian process regression for trajectory interpolation when detections are missing. Experiments on the VisDrone2019-MOT dataset show that the proposed approach outperforms the original ByteTrack by 1.5 percentage points in MOTA and reduces identity switches by 578 (from 1685 to 1107, about 34%), giving the best MOTA, IDF1, and identity-switch count among the compared trackers.

1. Introduction

With the rapid development of UAV drone technology and the decreasing cost of manufacturing, vision-based multi-object tracking (MOT) from aerial platforms has become a critical enabling technology for applications such as autonomous delivery, precision agriculture, disaster response, and urban surveillance. Accurate, real-time tracking of multiple objects from a moving drone must address several distinctive challenges: objects often appear very small (only tens of pixels), the relative motion between the camera and the targets is highly irregular because of drone maneuvers, and occlusions by buildings, trees, or other vehicles are frequent. Existing MOT algorithms designed for stationary or ground-level cameras often fail under these conditions. For instance, the classic tracking-by-detection paradigm relies on accurate detections and smooth motion, and both assumptions are violated in UAV drone scenarios.

ByteTrack is a recent tracking‑by‑detection algorithm that achieves competitive performance by associating every detection box, including low‑confidence ones, using motion‑only IoU matching. However, it does not leverage appearance features, making it vulnerable to identity switches when objects have similar motion patterns or when the detector momentarily misses a target. Additionally, its motion model uses a fixed aspect‑ratio assumption that is unsuitable for objects with varying width caused by perspective changes in drone imagery. To address these limitations, I propose an improved ByteTrack algorithm that integrates three key components: (1) an appearance feature extraction module built on OSNet, providing discriminative embeddings for re‑identification; (2) a camera motion compensation module that aligns predicted trajectories across frames using estimated affine transforms; and (3) a Gaussian‑process‑based trajectory interpolation module that fills in missing detections caused by occlusion or detector failures. Combined with a comprehensive matching metric that fuses motion (Mahalanobis distance) and appearance (cosine similarity), the algorithm significantly boosts tracking accuracy and robustness for UAV drone applications.

2. Improved ByteTrack Algorithm

2.1 Overall Framework

The proposed framework follows the tracking-by-detection paradigm. For each video frame, a YOLOv5 detector produces bounding boxes with confidence scores. The detections are split into high-confidence and low-confidence groups. Meanwhile, tracklets from the previous frame are propagated to the current frame using a Kalman filter enhanced by camera motion compensation and Gaussian process interpolation. The first data association stage matches high-confidence detections with predicted tracklets using a combined metric of motion distance and appearance cosine distance (Section 2.5). The tracklets that remain unmatched then go through a second stage that uses only IoU (motion) to associate them with the unmatched high-confidence detections and all low-confidence detections. Finally, new tracklets are initiated from high-confidence detections that are still unmatched, and tracklets that remain unmatched for a certain number of frames are terminated.
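
The confidence split and the IoU cost used in the second stage can be sketched as follows. This is a minimal illustration, not the implementation evaluated here: the thresholds `high_thr` and `low_thr`, the `(x1, y1, x2, y2)` box layout, and both function names are assumptions made for the example.

```python
import numpy as np

def split_by_confidence(scores, high_thr=0.5, low_thr=0.1):
    """Return indices of high- and low-confidence detections.

    The threshold values are illustrative assumptions, not values from the text.
    Detections scoring below low_thr are discarded.
    """
    scores = np.asarray(scores, dtype=float)
    high = np.flatnonzero(scores >= high_thr)
    low = np.flatnonzero((scores >= low_thr) & (scores < high_thr))
    return high, low

def iou_matrix(tracks, dets):
    """Pairwise IoU between boxes given as rows of (x1, y1, x2, y2)."""
    tracks = np.asarray(tracks, dtype=float).reshape(-1, 4)
    dets = np.asarray(dets, dtype=float).reshape(-1, 4)
    # Intersection rectangle for every (track, detection) pair via broadcasting.
    x1 = np.maximum(tracks[:, None, 0], dets[None, :, 0])
    y1 = np.maximum(tracks[:, None, 1], dets[None, :, 1])
    x2 = np.minimum(tracks[:, None, 2], dets[None, :, 2])
    y2 = np.minimum(tracks[:, None, 3], dets[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (tracks[:, 2] - tracks[:, 0]) * (tracks[:, 3] - tracks[:, 1])
    area_d = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    union = area_t[:, None] + area_d[None, :] - inter
    return inter / np.maximum(union, 1e-12)
```

A second-stage cost matrix would then be `1 - iou_matrix(track_boxes, det_boxes)`, restricted to the tracklets and detections still unmatched after the first stage.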

2.2 Camera Motion Compensation and State Vector Modification

The original ByteTrack uses a state vector:

$$ (x, y, r, h, v_x, v_y, v_r, v_h) $$

where $x, y$ are the center coordinates, $r = w/h$ is the aspect ratio, $h$ is the height, and the $v$ terms are their velocities. For UAV drone imagery, the aspect ratio is not constant because both width and height change with perspective. I therefore replace $r$ and $v_r$ with explicit width $w$ and its velocity $v_w$, resulting in the modified state vector:

$$ (x, y, w, h, v_x, v_y, v_w, v_h) $$
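
To make the change of variables concrete, the sketch below converts an aspect-ratio state into the width-based state and applies one prediction step. It assumes a constant-velocity transition with a unit time step, which the text does not state explicitly; the function names are hypothetical.

```python
import numpy as np

def xyrh_to_xywh(state):
    """Convert (x, y, r, h, vx, vy, vr, vh) to (x, y, w, h, vx, vy, vw, vh)."""
    x, y, r, h, vx, vy, vr, vh = state
    w = r * h
    vw = vr * h + r * vh  # product rule applied to w = r * h
    return np.array([x, y, w, h, vx, vy, vw, vh], dtype=float)

def constant_velocity_transition(dt=1.0):
    """8x8 transition matrix: each of (x, y, w, h) advances by its velocity."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)
    return F

def predict(mean, cov, process_noise, dt=1.0):
    """Kalman prediction step under the assumed constant-velocity model."""
    F = constant_velocity_transition(dt)
    return F @ mean, F @ cov @ F.T + process_noise
```

With the width-based state, width and height evolve independently, so a perspective change that stretches only one of them no longer has to be absorbed by a single aspect-ratio term.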

To handle camera motion, I compute an affine transformation between consecutive frames. I extract ORB keypoints from the background (non‑target regions) using OpenCV, perform sparse optical flow tracking, and robustly estimate the affine matrix $A = [M \;|\; T] \in \mathbb{R}^{2\times3}$ using RANSAC. Here $M \in \mathbb{R}^{2\times2}$ encodes scale and rotation, and $T \in \mathbb{R}^{2}$ encodes translation. Let the Kalman predicted state be $\hat{x}_{k|k-1}$. I define the transformation matrix and translation vector for the full state as:

$$ M_{k|k-1} =
\begin{bmatrix}
M & 0 & 0 & 0 \\
0 & M & 0 & 0 \\
0 & 0 & M & 0 \\
0 & 0 & 0 & M
\end{bmatrix}
\in \mathbb{R}^{8\times8} $$

$$ T'_{k|k-1} = [T^T, 0, 0, 0, 0, 0, 0]^T \in \mathbb{R}^{8} $$

The corrected prediction is:

$$ \hat{x}'_{k|k-1} = M_{k|k-1} \hat{x}_{k|k-1} + T'_{k|k-1} $$

and the corresponding covariance is updated as:

$$ P'_{k|k-1} = M_{k|k-1} P_{k|k-1} M_{k|k-1}^T $$

These compensated estimates are then used in the standard Kalman update equations:

$$ K_k = P'_{k|k-1} H_k^T (H_k P'_{k|k-1} H_k^T + R_k)^{-1} $$
$$ \hat{x}_k = \hat{x}'_{k|k-1} + K_k (z_k - H_k \hat{x}'_{k|k-1}) $$
$$ P_k = (I - K_k H_k) P'_{k|k-1} $$
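
The compensation and update steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the evaluated code: the $2\times3$ affine matrix is taken as given (in practice it could come from an estimator such as OpenCV's `cv2.estimateAffinePartial2D` with RANSAC), the measurement is assumed to be $(x, y, w, h)$, and the function names are hypothetical.

```python
import numpy as np

def compensate(mean, cov, A):
    """Apply a 2x3 affine camera-motion matrix A = [M | T] to an 8-D state.

    The state is (x, y, w, h, vx, vy, vw, vh); M acts on each consecutive
    pair of components and T shifts only the center (x, y).
    """
    A = np.asarray(A, dtype=float)
    M, T = A[:, :2], A[:, 2]
    M8 = np.kron(np.eye(4), M)  # block-diagonal matrix with four copies of M
    T8 = np.zeros(8)
    T8[:2] = T
    return M8 @ mean + T8, M8 @ cov @ M8.T

def kalman_update(mean, cov, z, R):
    """Standard Kalman update, assuming the measurement is (x, y, w, h)."""
    H = np.eye(4, 8)  # selects the first four state components
    S = H @ cov @ H.T + R  # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)  # Kalman gain
    new_mean = mean + K @ (z - H @ mean)
    new_cov = (np.eye(8) - K @ H) @ cov
    return new_mean, new_cov
```

For a pure translation (`M` equal to the identity) the covariance is unchanged and only the center moves, which is a quick sanity check on an estimated transform.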

2.3 Appearance Feature Extraction with OSNet

ByteTrack lacks an appearance model; I therefore add an OSNet-based feature extractor. For each high-confidence detection, the image patch is resized to a fixed size (e.g., 128×64) and forwarded through an OSNet backbone. OSNet uses Lite 3×3 blocks and aggregation gates (AG) to fuse features across scales, and produces a compact feature vector $F \in \mathbb{R}^{512}$ after global pooling and a fully connected layer. The cosine similarity between two feature vectors $A$ and $B$ is computed as:

$$ \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \cdot \sqrt{\sum_{i=1}^n B_i^2}} $$

During data association, this similarity is used together with the motion cost.
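
A minimal sketch of the pairwise similarity computation, assuming the track and detection embeddings are stacked as rows of two arrays (the function name and the `eps` guard against zero-norm vectors are additions for the example):

```python
import numpy as np

def cosine_similarity_matrix(track_feats, det_feats, eps=1e-12):
    """Cosine similarity between every track embedding and detection embedding.

    Inputs have shapes (n_tracks, d) and (n_dets, d); the result has shape
    (n_tracks, n_dets).
    """
    a = np.asarray(track_feats, dtype=float)
    b = np.asarray(det_feats, dtype=float)
    # Normalize each row to unit length, then a dot product gives cos(theta).
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T
```

The cosine distance used as an appearance cost is then `1 - cosine_similarity_matrix(track_feats, det_feats)`.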

2.4 Gaussian Process Trajectory Interpolation

In UAV drone scenes, targets frequently go undetected because of occlusion or small size. Linear interpolation between the last position before a gap and the first position after it can introduce large errors because it ignores motion dynamics. I therefore apply Gaussian process regression to model the underlying continuous trajectory. For the $i$-th track of length $l$ with observed positions $p_f$ at frames $f$, I assume the positions are generated by a latent function $g^{(i)}$ with additive Gaussian noise $\theta \sim \mathcal{N}(0, \sigma^2)$:

$$ p_f = g^{(i)}(f) + \theta $$

The function $g$ is assigned a Gaussian process prior with zero mean and a radial basis function kernel:

$$ k(t, t') = \exp\left(-\frac{\|t - t'\|^2}{2\lambda^2}\right) $$

The smoothing lengthscale $\lambda$ adapts to the trajectory length $l$ through a fixed scale hyperparameter $\tau$:

$$ \lambda = \tau \cdot \log\left(\frac{\tau^3}{l}\right) $$

Given observed frames $F$ and corresponding positions $P$, the predicted positions $P_*$ for a new set of frames $F_*$ are obtained by the standard GP prediction formula:

$$ P_* = K(F_*, F) \bigl( K(F, F) + \sigma^2 I \bigr)^{-1} P $$

where $K(\cdot,\cdot)$ is the kernel matrix. This interpolation provides smooth and realistic trajectories for gaps in detections, improving the continuity of tracks.
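
The prediction formula can be sketched in NumPy as below. The lengthscale and noise variance are passed in as plain parameters, and the per-track mean is subtracted before regression and added back afterwards; that centering is an assumption added for the example so that the zero-mean prior does not pull pixel coordinates toward the origin inside a gap. Function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    """RBF kernel matrix between two 1-D arrays of frame indices."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))

def gp_interpolate(frames, positions, query_frames, lengthscale, noise_var=1e-2):
    """GP posterior mean of a track at query_frames.

    frames: observed frame indices, shape (n,)
    positions: observed positions, shape (n, d), e.g. d = 2 for (x, y)
    """
    positions = np.asarray(positions, dtype=float)
    # Assumption for this sketch: center the track so the zero-mean prior fits.
    offset = positions.mean(axis=0)
    K = rbf_kernel(frames, frames, lengthscale)
    K_star = rbf_kernel(query_frames, frames, lengthscale)
    # Solve (K + sigma^2 I) alpha = P instead of forming the inverse explicitly.
    alpha = np.linalg.solve(K + noise_var * np.eye(len(K)), positions - offset)
    return K_star @ alpha + offset
```

For a track observed at frames 0, 1, 2, 5, and 6, calling `gp_interpolate(frames, positions, [3, 4], lengthscale=...)` would fill the two missing frames; using `np.linalg.solve` avoids the explicit matrix inverse in the formula.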

2.5 Combined Matching Metric

In the first data association stage, I combine motion and appearance costs. The motion cost is the squared Mahalanobis distance between the predicted measurement $y_i$ of track $i$ (the Kalman-predicted state projected into measurement space) and the detection $d_j$:

$$ d_{\text{mah}}(i,j) = (d_j - y_i)^T S^{-1} (d_j - y_i) $$

where $S$ is the innovation covariance. The appearance cost $d_{\cos}(i,j)$ is the cosine distance, i.e., one minus the cosine similarity of the OSNet features. The combined cost is:

$$ C = \lambda \, d_{\cos}(i,j) + (1-\lambda) \, d_{\text{mah}}(i,j) $$

Here $\lambda$ is a weighting factor, not the GP lengthscale of Section 2.4. I set $\lambda = 0.98$ to give more weight to appearance, which is crucial for distinguishing objects with similar motions in cluttered UAV drone scenes. The Hungarian algorithm is then used to minimize the total cost.
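
A minimal sketch of how the two costs could be computed and fused, assuming an 8-D state whose first four components are measured directly; the function names and the measurement model are assumptions made for the example.

```python
import numpy as np

def mahalanobis_sq(pred_mean, pred_cov, dets, R):
    """Squared Mahalanobis distance from one predicted track to each detection.

    pred_mean: (8,) predicted state, pred_cov: (8, 8) predicted covariance,
    dets: (n, 4) detections as (x, y, w, h), R: (4, 4) measurement noise.
    """
    H = np.eye(4, 8)  # assumed measurement model: observe (x, y, w, h)
    y = H @ pred_mean  # predicted measurement
    S = H @ pred_cov @ H.T + R  # innovation covariance
    diff = np.asarray(dets, dtype=float) - y
    return np.einsum("ni,ij,nj->n", diff, np.linalg.inv(S), diff)

def combined_cost(d_cos, d_mah, lam=0.98):
    """Weighted sum of appearance and motion costs (scalars or arrays)."""
    return lam * np.asarray(d_cos) + (1.0 - lam) * np.asarray(d_mah)
```

Stacking one row per track gives a cost matrix that an assignment solver can minimize, for example `scipy.optimize.linear_sum_assignment`. Because the two terms live on different scales (the cosine distance is bounded, the Mahalanobis term is not), the effective balance depends on both $\lambda$ and any gating applied to the motion term.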

3. Experiments

3.1 Experimental Setup

All experiments are conducted on the VisDrone2019-MOT dataset, which consists of 79 video sequences (33,366 frames) captured by UAV drones at various altitudes and in various environments. The training set contains 56 sequences (24,198 frames), the validation set 7 sequences (2,846 frames), and the test set 16 sequences (6,322 frames). I use YOLOv5 as the detector, trained on the five object categories (pedestrian, car, van, bus, bicycle) from the dataset. The OSNet feature extractor is trained on the cropped target patches from the training set. The hardware platform is an Intel Core i5-13600KF CPU, 32 GB RAM, and an NVIDIA RTX 4060 Ti GPU with CUDA 11.7. The framework is implemented in PyTorch 1.13 and Python 3.8.

3.2 Comparison with State‑of‑the‑Art Methods

I compare my improved ByteTrack against four mainstream MOT algorithms: SORT, DeepSORT, DeepMOT, and the original ByteTrack. All methods use the same YOLOv5 detector with identical weights. The evaluation metrics are MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), IDS (Identity Switches), IDF1 (Identification F1 score), and FPS (frames per second). Results are summarized in Table 1.
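
For reference, MOTA and IDF1 are usually defined as below. These are the standard definitions from the MOT evaluation literature, and the evaluation here is assumed to follow them. $\mathrm{FN}_t$, $\mathrm{FP}_t$, $\mathrm{IDSW}_t$, and $\mathrm{GT}_t$ denote the false negatives, false positives, identity switches, and ground-truth objects in frame $t$; IDTP, IDFP, and IDFN denote identity-level true positives, false positives, and false negatives.

$$ \text{MOTA} = 1 - \frac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t)}{\sum_t \mathrm{GT}_t} $$

$$ \text{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} $$

Under these definitions, identity switches lower MOTA only through one term of the numerator, whereas IDF1 reflects how consistently identities are preserved over whole trajectories.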

**Table 1: Comparison on VisDrone2019‑MOT Test Set**
| Algorithm | MOTA (%) ↑ | MOTP (%) ↑ | IDS ↓ | IDF1 (%) ↑ | FPS ↑ |
|---|---|---|---|---|---|
| SORT | 23.3 | 69.1 | 3377 | 25.8 | 24.53 |
| DeepSORT | 18.9 | 71.2 | 3071 | 29.0 | 18.13 |
| ByteTrack (original) | 23.9 | 69.8 | 1685 | 29.7 | 16.51 |
| DeepMOT | 15.7 | 67.5 | 3266 | 26.7 | 19.11 |
| Improved ByteTrack (Ours) | 25.4 | 72.0 | 1107 | 32.6 | 14.97 |

As shown in Table 1, the proposed algorithm achieves the highest MOTA (25.4%) and MOTP (72.0%), the fewest identity switches (1107), and the best IDF1 (32.6%). The FPS drops from 16.51 for the original ByteTrack to 14.97 because of the added appearance extraction and camera compensation, but it remains sufficient for near-real-time UAV drone applications. Relative to the original ByteTrack, MOTA improves by 1.5 percentage points and IDS falls by 578 (a 34% reduction), confirming the effectiveness of the proposed modules.

3.3 Ablation Study

To isolate the contribution of each added component, I conduct an ablation study with four configurations:

Experiment 1: Original ByteTrack (baseline).
Experiment 2: Baseline + OSNet appearance module.
Experiment 3: Baseline + OSNet + camera motion compensation.
Experiment 4: Baseline + OSNet + camera motion compensation + Gaussian process interpolation (full proposed method).

Results are presented in Table 2.

**Table 2: Ablation Study on VisDrone2019‑MOT Test Set**
| Configuration | MOTA (%) ↑ | IDS ↓ |
|---|---|---|
| Exp. 1: Original ByteTrack | 23.9 | 1685 |
| Exp. 2: + OSNet appearance | 24.8 | 1134 |
| Exp. 3: + OSNet + motion compensation | 25.1 | 1231 |
| Exp. 4: + OSNet + motion comp. + GP interpolation (full) | 25.4 | 1107 |

Adding the OSNet appearance module alone (Exp. 2) improves MOTA by 0.9 percentage points and reduces IDS by 551, demonstrating the strong discriminative power of learned appearance features for UAV drone targets. Further adding camera motion compensation (Exp. 3) yields another 0.3 points of MOTA, but IDS rises from 1134 to 1231, possibly because the affine transform occasionally misaligns small objects. Adding Gaussian process trajectory interpolation (Exp. 4) raises MOTA to 25.4% and brings IDS down to 1107, the lowest of all configurations; it recovers fragmented tracks that would otherwise cause identity switches, which supports its role in maintaining track consistency. Overall, the full model outperforms every partial configuration.

4. Conclusion

In this work, I have presented an improved ByteTrack algorithm that addresses the challenges of multi-object tracking in UAV drone aerial imagery. By incorporating a lightweight OSNet-based appearance feature extractor, a camera motion compensation module that aligns predictions with the drone's egomotion, and a Gaussian process interpolation scheme for handling missing detections, the algorithm improves both accuracy and robustness. The modified state vector, which uses explicit width instead of aspect ratio, better captures the geometry of objects in drone views. Experiments on the challenging VisDrone2019-MOT dataset show that the proposed method outperforms the original ByteTrack and the other compared trackers, with an improvement of 1.5 percentage points in MOTA and a 34% reduction in identity switches. The method runs at about 15 FPS, which is sufficient for near-real-time UAV drone applications. Future work will explore lightweight transformer-based feature extractors and end-to-end joint detection-tracking frameworks to further improve efficiency under severe resource constraints.