China’s Multi-Drone System: Enhancing Target Tracking Through Collaborative Vision

The 2024 China UAV industry market forecast report highlights that drones, with their small size, low cost, agile maneuverability, ease of control, and strong battlefield survivability, have become a crucial direction for equipment development worldwide. They are used extensively in civilian and military fields such as agricultural monitoring, disaster relief, and battlefield reconnaissance. Target tracking, a core component of UAV mission execution, enables the identification and tracking of high-value aerial or ground targets, creating an asymmetric advantage that significantly enhances the application and combat effectiveness of drones. With the continuous advancement of artificial intelligence and the expansion of UAV mission environments, drone-based multi-object tracking (MOT) benefits from a broad aerial field of view that captures richer information and supports large-scale data collection. This provides effective input for subsequent target intention recognition and for situational-awareness tasks such as threat assessment and the recognition of harassment intent, and it has gradually become a hotspot in UAV research.

Compared to other target tracking tasks, the drone target tracking environment and mission content are more complex and diverse. Targets or scenes within the field of view are prone to rapid changes, facing more severe challenges such as target occlusion, viewpoint changes, and motion blur. For instance, movements and interactions of targets or platforms, and weather changes like rain, fog, and day-night transitions, often make it difficult for a single drone to achieve continuous and precise tracking of moving targets. In theory, a multi-drone system can address this issue through complementary multi-view sensing. However, in practice, this involves the challenging problem of real-time cross-view target association among drones. The specific difficulties are as follows:

1. Drones fly at high speeds with significant perspective differences between them, making cross-view association and matching based on image background features difficult. Due to the high-speed movement and significant viewpoint changes of drones during missions, substantial perspective differences lead to severe occlusion and drastic illumination changes in overlapping images from complementary views, making it hard to extract complete image features. Simultaneously, images captured by drones under low-light or nighttime conditions often have low resolution with considerable noise, lacking sufficient texture features. This makes image association and matching between multi-drone perspectives even more challenging.

2. The appearance features of targets are inconsistent across different views, and motion features vary greatly, making cross-view association and matching based on the target’s own features difficult. In images captured by drones, the appearance features of targets exhibit inconsistency due to frequent viewpoint changes, rendering traditional appearance and motion features less effective for target association in complementary drone views. When drones fly at a certain altitude, the image footprint becomes larger, resolution and clarity decrease, and ground tracking targets become very small. Target features and textures become sparse and are easily interfered with by background noise in complex scenes, making it difficult to extract effective features.

3. There is a scarcity of benchmarks for multi-drone collaborative sensing data. Existing methods have high computational complexity, making practical algorithm deployment difficult. Currently, research on collaborative video analysis for multi-target scenarios under dynamic complementary drone perspectives is still in its infancy. The limited existing research has not yet established relevant benchmark datasets or general algorithm frameworks tailored to the characteristics of drones. Existing multi-camera multi-object tracking (MCMOT) algorithms often operate offline with high complexity and significant computational resource requirements, failing to meet the real-time needs for specific drone application deployments.

This article focuses on the spatiotemporal association and perception problem for multiple targets under complementary multi-drone perspectives. The main contributions include: Firstly, utilizing drone pose information without relying on image features, rapid inter-view projection transformation is achieved by fusing consistency constraints between position and target height. Building on this, the preliminary association results are further analyzed to mine the spatiotemporal topological features between targets, followed by detailed optimization using spatial and temporal cues. Concurrently, focusing on target occlusion scenarios, a multi-drone multi-object tracking dataset DP-MDMT containing pose data is constructed. Experimental results show that this method exhibits good performance in multi-drone multi-object tracking, achieving reliable spatiotemporal association and perception of multiple targets under complementary drone perspectives.

Related Work on Multi-Drone Multi-Object Tracking

The core of the Multi-Drone Multi-Object Tracking (MDMOT) task is how to generate consistent multi-target tracking trajectories from the perspectives of multiple drones. Compared to MDMOT, the Multi-Camera Multi-Object Tracking (MCMOT) task has been widely studied. MDMOT is similar to the MCMOT task type and workflow under overlapping views, as both fuse complementary target information from multiple perspectives to address issues like target tracking loss and target feature variation. Their only distinction lies in the method of acquiring image information. The main difficulty in this multi-view task is establishing correlations between targets in images from different drone perspectives and designing models to fuse multi-view complementary information to improve target tracking performance.

In the MCMOT task, homography constraint methods are widely used to enhance the tracking accuracy and robustness of single-camera MOT tasks. This method projects detection and tracking results from different views onto a unified ground plane and takes the intersection points of the projections from multiple views as the estimated target positions. However, in practical MCMOT scenarios, severe occlusion between targets may occur, leading to numerous false positives. Previous MCMOT work mostly focused on pedestrian or vehicle targets in surveillance scenarios. To address the impact of false positives and pedestrian height variations, some researchers proposed a multi-view Bayesian network (MvBN) model. Its core idea is to combine preset pedestrian height information with existing detection algorithms to generate preliminary detection data. The model then uses a Bayesian network to model occlusion relationships among candidate targets within each camera’s field of view. By connecting the ground plane with the camera perspectives through homographic projection relationships, a complete MvBN inference result is constructed.

In recent years, end-to-end Transformer-based methods have shown great potential in cross-camera multi-object tracking tasks, especially in handling target association problems. Some researchers have proposed a concise and efficient anchor-free feature perspective transformation network (MVDet) dedicated to cross-camera MOT tasks, supporting end-to-end training. This network takes three-channel images from multiple views as input, uses ResNet for feature extraction, and maps 3D view feature maps to a 2D plane through projection transformation. Furthermore, these 2D features are combined with coordinate information to generate a Bird’s Eye View (BEV), comprehensively representing scene information. Through the aggregation and convolution of BEV features, MVDet can predict pedestrian positions and provide detection results. This method significantly reduces the adverse impact of occlusion on tracking performance. However, MVDet’s performance may degrade with fewer cameras, and due to its offline processing nature, it cannot meet real-time requirements. To address the high computational complexity in cross-camera MOT tasks, other researchers proposed a real-time 3D MOT method named DMCT (Deep Multi-Camera Tracking). This method designs a deep learning network to estimate the projection position of each target on a virtual ground plane, integrates perspective projection effects into the ground plane heatmap, and constructs a lightweight Deep Glimpse Network (DGN) to capture human behavior. DMCT can process multiple frames in a video stream simultaneously, operating similarly to human keypoint detection, achieving a processing speed of 15 fps across 8 cameras. Additionally, through the spatial cascade operation of Transformer, DMCT allows effective interaction of information between different modalities beyond local limits.

However, as shown in the comparison, unlike drone mission scenarios, MCMOT algorithms are specifically designed for static cameras. In MCMOT task scenarios, targets are larger and have significant appearance features. In MDMOT tasks, due to the high flight altitude of drones, the targets to be tracked are small, making feature extraction difficult. They are also extremely prone to tracking challenges such as occlusion and varying movement speeds, making it hard to extract reliable appearance and motion features. Traditional MCMOT methods based on fixed geometric association, trajectory segment matching, or appearance feature re-identification are difficult to apply directly when facing challenges like continuously moving sensors, frequent viewpoint switching, and drastic target scale changes in MDMOT scenarios. The comparison of task characteristics is shown in the table below.

Table 1: Comparison of MCMOT and MDMOT Task Characteristics
Comparison Dimension	MCMOT Task	MDMOT Task
Sensor Dynamic Characteristics	Camera position and viewpoint are fixed; background changes slowly; cameras can be pre-calibrated to establish relatively stable geometric relationships.	Each drone’s position, attitude, and altitude change rapidly and in real-time; viewpoints change frequently and are unpredictable.
Viewpoint Change Frequency	Viewpoint changes mainly stem from target movement; the camera viewpoint itself is fixed and changes slowly.	Drones move rapidly; viewpoints change frequently; the appearance features of the same target differ significantly across different drone views.
Target Scale Variation	The distance from the target to the camera is relatively stable; target scale changes mainly stem from its radial movement, with a relatively controllable range.	Aerial top-down shooting leads to low pixel resolution of targets; the composite motion of the drone and target causes drastic target scale changes within consecutive frames.

Furthermore, specialized research and methods for the multi-drone multi-object tracking task are still in their early stages. Some researchers collected and constructed the first multi-drone multi-object tracking dataset and proposed a network structure named MIA-Net, using global and local matching methods to associate and track targets in multiple drone views. This work provides a new benchmark dataset and baseline algorithm for the field of multi-drone multi-object detection and tracking, significantly promoting the development of cross-drone view target association methods in China’s UAV sector.

Proposed Methodology: Fusing Multi-View Projection and Spatiotemporal Topology

To enhance the persistent tracking capability of drones for moving objects and overcome the performance limitations of single-drone systems, this paper proposes a multi-drone multi-object tracking method that leverages the collaborative perception advantage intrinsic to China’s advanced UAV fleets. By integrating multi-view projection and the spatiotemporal topology of objects, the method addresses core challenges in dynamic environments.

Rapid Multi-View Projection Transformation for China’s UAV Drones

In drone perspectives, images often suffer from poor quality or indistinct features due to long-range shooting, background occlusion, or lighting effects. Relying on image features for association imposes high requirements on the drone’s working environment and shooting conditions. Therefore, starting from the inherent characteristics of China’s UAV drones, this method directly uses drone pose information to achieve rapid projection transformation between multiple views, independent of image features. Compared to traditional fixed cameras, drones and their onboard gimbals can directly obtain real-time position coordinates and attitude angles from onboard GPS and Inertial Measurement Units (IMU) during aerial photography, enabling faster and more effective preliminary association of multiple targets under complementary perspectives.

As illustrated, assume two drones at positions $O_1$ and $O_2$ are performing detection tasks on a ground scene with overlapping fields of view, acquiring images $I_1$ and $I_2$ respectively. A scene point $P(X_w, Y_w, Z_w)^T$ has coordinates $P_1(X_{c1}, Y_{c1}, Z_{c1})^T$ and $P_2(X_{c2}, Y_{c2}, Z_{c2})^T$ in the two camera frames, and its homogeneous pixel coordinates in the two images are $p_1 = (u_1, v_1, 1)^T$ and $p_2 = (u_2, v_2, 1)^T$. The homography matrix $H$ satisfies:
$$ p_2 = \lambda H p_1 $$
where $\lambda$ is a non-zero scaling constant. Setting the world coordinate system origin at $O_1$, the camera imaging model is:
$$ Z_{c1} \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = K \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} $$
where $K$ is the camera’s intrinsic matrix:
$$ K = \begin{bmatrix} f/dx & 0 & u_0 \\ 0 & f/dy & v_0 \\ 0 & 0 & 1 \end{bmatrix} $$
The world coordinates can be derived as:
$$ \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} = Z_{c1} K^{-1} \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} $$
If the motion relationship between the two drones $O_1$ and $O_2$ is $R[I, -t]$, where $R$ is a 3×3 rotation matrix and $t$ is a three-dimensional translation vector, then:
$$ Z_{c2} \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = KR[I, -t] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = KR \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} - KRt $$
Combining with the ground plane equation $n^T X = d$, where $n$ is the unit normal vector of the ground and $d$ is the distance from the coordinate origin to the ground plane, the projection transformation matrix between the two drone views can be expressed as:
$$ H = KR \left( I - \frac{t}{d} n^T \right) K^{-1} = KRK^{-1} - \frac{1}{d} KRt n^T K^{-1} $$
The detection position $O$ of each drone can be represented by 6 parameters, denoted as $O_i(\phi_i, \theta_i, \psi_i, x_i, y_i, z_i)$. Drones equipped with IMU and GPS can read and record the attitude angles and spatial coordinates in real-time during shooting. Based on the changes in the three attitude angles $\phi, \theta, \psi$ between the two viewpoints and the coordinates $(x_1, y_1, z_1)$, $(x_2, y_2, z_2)$, $R$ and $t$ can be determined:
$$ R = R_\phi R_\theta R_\psi $$
where
$$ R_\phi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos \phi & \sin \phi \\ 0 & -\sin \phi & \cos \phi \end{bmatrix}, \quad R_\theta = \begin{bmatrix} \cos \theta & 0 & -\sin \theta \\ 0 & 1 & 0 \\ \sin \theta & 0 & \cos \theta \end{bmatrix}, \quad R_\psi = \begin{bmatrix} \cos \psi & \sin \psi & 0 \\ -\sin \psi & \cos \psi & 0 \\ 0 & 0 & 1 \end{bmatrix} $$
and
$$ t = [x_2 - x_1, \quad y_2 - y_1, \quad z_2 - z_1]^T $$
Simultaneously, rapid multi-view projection transformation must satisfy the height consistency constraint of targets, meaning that all scene points imaged from viewpoints $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ are treated as lying at the same imaging depth, which is what the planar model $n^T X = d$ behind $H$ assumes. This constraint is crucial for accurate association in China’s UAV operations.
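As a minimal sketch of the derivation above, the following Python code builds $R$ from three attitude-angle differences using the same matrix conventions and then forms $H = KR(I - t n^T / d)K^{-1}$. It assumes that the world frame coincides with the first camera frame (origin at $O_1$), that angles are given in radians, and that the ground plane is known; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotation_from_euler(phi, theta, psi):
    """R = R_phi @ R_theta @ R_psi, using the sign conventions of the text."""
    c, s = np.cos, np.sin
    r_phi = np.array([[1, 0, 0],
                      [0, c(phi), s(phi)],
                      [0, -s(phi), c(phi)]])
    r_theta = np.array([[c(theta), 0, -s(theta)],
                        [0, 1, 0],
                        [s(theta), 0, c(theta)]])
    r_psi = np.array([[c(psi), s(psi), 0],
                      [-s(psi), c(psi), 0],
                      [0, 0, 1]])
    return r_phi @ r_theta @ r_psi

def pose_homography(K, R, t, n, d):
    """H = K R (I - t n^T / d) K^{-1}, valid for points on the plane n^T X = d."""
    t = np.asarray(t, float).reshape(3, 1)
    n = np.asarray(n, float).reshape(3, 1)
    return K @ R @ (np.eye(3) - (t @ n.T) / d) @ np.linalg.inv(K)

def project(K, X):
    """Pinhole projection of a 3D point given in the camera frame."""
    p = K @ X
    return p[:2] / p[2]

def apply_homography(H, uv):
    """Map a pixel (u, v) through H and dehomogenize."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Hypothetical intrinsics, relative pose, and ground plane (illustrative values only).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = rotation_from_euler(0.05, -0.03, 0.2)
t = np.array([10.0, 5.0, -20.0])
n, d = np.array([0.0, 0.0, 1.0]), 100.0

H = pose_homography(K, R, t, n, d)
X = np.array([12.0, -7.0, 100.0])            # a point on the plane n^T X = d
pixel_in_view_2 = apply_homography(H, project(K, X))
```

For a point on the assumed plane, mapping its view-1 pixel through `H` gives the same result as projecting $R(X - t)$ directly into view 2. Points off the plane violate the planar model, which is one way to read the height consistency constraint.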

Bidirectional Association Matching Based on Spatiotemporal Topological Features

After performing initial target association analysis using the drone’s pose data, a preliminary association result $H_1$ is obtained. However, due to inherent errors in measuring the pose of the drone and its gimbal, $H_1$ inevitably contains some incorrect associations. Furthermore, high-density distribution of ground targets exacerbates the adverse impact of measurement errors on target association. Therefore, this section improves the accuracy of preliminary results by calculating the mapping relationship between spatiotemporal topological features of targets across views.

Spatiotemporal Topological Feature Extraction: Spatiotemporal topological features describe the spatial geometric and temporal dynamic relationships between individual targets themselves and their neighboring targets in consecutive video frames within a multi-target group. They can better handle occlusion issues and uncertainties in target motion trajectories.
First, drone-captured data $V_i$ is input into a target tracker. In each video, potential targets are detected and extracted, generating single-drone multi-object tracking results containing target IDs and corresponding detection bounding boxes. The center point coordinates of the target bounding box and its target ID are selected to construct the spatiotemporal topological vector $F^k_{it}$ for a single node:
$$ F^k_{it} = \{x^k_{it}, y^k_{it}, ID^k_{it}\}, \quad x^k_{it} \in \mathbb{R}; \quad y^k_{it} \in \mathbb{R}; \quad ID^k_{it} \in \mathbb{R} $$
where $x^k_{it}$ and $y^k_{it}$ are the center point coordinates of node target $k$ in video frame $t$ from the $i$-th drone perspective, and $ID^k_{it}$ is its identity ID. The transformation relationship between complementary perspectives $i$ and $m$ is:
$$ \begin{bmatrix} x^j_{mt} \\ y^j_{mt} \\ 1 \end{bmatrix} = T^{i \rightarrow m}_t \begin{bmatrix} x^k_{it} \\ y^k_{it} \\ 1 \end{bmatrix}, \quad ID^k_{it} = ID^j_{mt} $$
When there are no fewer than 4 matching nodes between the spatiotemporal topological features of targets, the transformation matrix $T^{i \rightarrow m}_t$ can be solved. Although the topological feature relationships between targets are relatively stable in the short term, in dynamic open environments, irregular target motion, and the emergence or disappearance of targets may cause sudden changes in a node’s topological feature vector. Therefore, spatiotemporal topological features are extracted from the multi-object tracker rather than the target detector. This is because the tracker can perform some filtering and prediction on the topological feature vector, calculating the optimal estimate to obtain a stable and applicable spatiotemporal topological feature representation, enhancing the robustness of the target association matching process in multi-drone multi-object tracking tasks for China’s UAV systems.
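The requirement of at least 4 matching nodes corresponds to the 8 degrees of freedom of a planar projective transformation. The sketch below shows one common way to solve for such a matrix, the normalized direct linear transform (DLT); the source does not state which solver the authors use, so this is an assumed stand-in, and the function names are hypothetical.

```python
import numpy as np

def map_points(T, pts):
    """Apply a 3x3 projective transform to an (N, 2) array of points."""
    pts = np.asarray(pts, float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ T.T
    return homog[:, :2] / homog[:, 2:3]

def _normalizer(pts):
    """Similarity moving the centroid to the origin and the mean distance to sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - centroid, axis=1))
    return np.array([[scale, 0.0, -scale * centroid[0]],
                     [0.0, scale, -scale * centroid[1]],
                     [0.0, 0.0, 1.0]])

def estimate_transform(src, dst):
    """Estimate T with dst ~ T @ src (homogeneous) from >= 4 matched centers."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    if src.shape != dst.shape or len(src) < 4:
        raise ValueError("need at least 4 matched node pairs")
    n_src, n_dst = _normalizer(src), _normalizer(dst)
    s, d = map_points(n_src, src), map_points(n_dst, dst)
    rows = []
    for (x, y), (u, v) in zip(s, d):
        # Two linear equations per correspondence in the 9 unknowns of T.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(rows))
    t_norm = vt[-1].reshape(3, 3)            # null-space direction of the system
    T = np.linalg.inv(n_dst) @ t_norm @ n_src
    return T / T[2, 2]

# Hypothetical matched bounding-box centers sharing the same track IDs.
T_true = np.array([[1.02, 0.01, 15.0],
                   [-0.02, 0.98, -8.0],
                   [1e-5, 2e-5, 1.0]])
src = np.array([[100.0, 200.0], [400.0, 150.0], [350.0, 500.0],
                [120.0, 480.0], [250.0, 300.0]])
dst = map_points(T_true, src)
T_est = estimate_transform(src, dst)
```

With noisy tracker output, more than 4 matches and a robust estimator would normally be preferable. The sketch also assumes the matched nodes are not all collinear, since a collinear configuration leaves the transform underdetermined.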

Joint Spatiotemporal Optimization for Complementary View Multi-Target Association:
1) Spatial Cue Optimization: To deeply utilize the spatiotemporal topological features of targets to improve matching accuracy, a bidirectional matching strategy is proposed, establishing forward and reverse dual projection matching processes. Feature vector compensation is performed on unidirectional matching results to enhance matching completeness. Let the set of target spatiotemporal topological features extracted from the UAV1 perspective be $P$, and the feature set from the UAV2 perspective be $Q$. Forward matching yields the feature vector set $PQ$, and backward matching yields the set $QP$. Specifically, feature vectors $q_s p_t$ (where $1 \leq s \leq n$ and $1 \leq t \leq m$) are extracted from the backward association results, and the forward feature matching set $PQ$ is checked for corresponding point pairs. If missing, the point pair is judged to have been omitted in forward matching, added as a valid match to $PQ$, and constrained using the preliminary association matrix to eliminate incorrect matches. This strategy can identify and compensate for target features missed in forward matching, improving matching coverage and accuracy. It is particularly effective in addressing uneven target distribution or feature miss-detection, enhancing the robustness of cross-view association.
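The compensation step of the bidirectional strategy can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: matches are represented as index pairs, pairs found only in the backward direction are added to the forward set, and every candidate is then checked against the preliminary association matrix through a reprojection-error bound. The bound `max_err`, the data layout, and the function names are assumptions.

```python
import numpy as np

def project_point(H, xy):
    """Map a 2D point through a 3x3 homography."""
    p = H @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]

def bidirectional_merge(forward, backward, P, Q, H1, max_err):
    """Merge forward (p, q) and backward (q, p) index pairs, then filter with H1.

    P, Q     : node center coordinates in the UAV1 and UAV2 views
    forward  : set of (p_idx, q_idx) pairs from forward matching (set PQ)
    backward : set of (q_idx, p_idx) pairs from backward matching (set QP)
    H1       : preliminary association matrix mapping UAV1 to UAV2 pixels
    max_err  : assumed reprojection bound in pixels (hypothetical parameter)
    """
    merged = set(forward)
    for q_idx, p_idx in backward:
        if (p_idx, q_idx) not in merged:      # omitted in forward matching
            merged.add((p_idx, q_idx))
    kept = set()
    for p_idx, q_idx in merged:
        err = np.linalg.norm(project_point(H1, P[p_idx]) - np.asarray(Q[q_idx], float))
        if err <= max_err:                    # consistent with the pose-based prior
            kept.add((p_idx, q_idx))
    return kept

# Hypothetical toy data: view 2 is view 1 shifted by (5, 5) pixels.
P = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
Q = [(5.0, 5.0), (15.0, 5.0), (5.0, 15.0)]
H1 = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]])
forward = {(0, 0), (1, 1)}                    # forward matching missed node 2
backward = {(0, 0), (2, 2), (1, 0)}           # (1, 0) is a spurious backward match
matches = bidirectional_merge(forward, backward, P, Q, H1, max_err=3.0)
```

In this toy case the pair for node 2 is recovered from the backward set, while the spurious backward pair is rejected because it disagrees with `H1`. How conflicts between surviving pairs are resolved is not specified in the text and is left out of the sketch.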
2) Temporal Cue Optimization: During target association, due to limitations in target detection capability and deviations in homography matrix estimation, incorrect topological mapping relationships may be obtained. Through validity analysis of topological features between targets, it is found that topological feature component values change minimally, indicating certain stability in target topological features. Based on this, temporal cues are used for optimization by calculating the cosine similarity of homography matrices at adjacent moments to eliminate erroneous mapping relationships. The cosine similarity calculation method is:
$$ \cos(\vartheta) = \frac{\sum_{i=1}^{n} A_i \times B_i}{\|A\| \times \|B\|}, \quad \|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}, \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2} $$
where $A$ and $B$ are vectors formed by expanding the homography matrix obtained from target spatiotemporal topological features at adjacent moments row-wise or column-wise, and $A_i$ and $B_i$ are the $i$-th elements of vectors $A$ and $B$, respectively. A $\cos(\vartheta)$ value closer to 1 indicates higher similarity between the two matrices; closer to 0 indicates orthogonality; closer to -1 indicates oppositeness.
The cosine similarity between association matrices at adjacent moments during multi-drone multi-target association is mostly above 0.999. A threshold of 0.99 is selected for temporal cue optimization. When $\cos(\vartheta) < 0.99$, the association matching matrix at that moment is considered to yield an incorrect target topological mapping relationship, is eliminated, and the preliminary association matrix $H_1$ at that moment is used for re-association. Through temporal cue optimization, association errors caused by inaccurate homography matrix estimation can be effectively avoided, ensuring the accuracy and stability of final results for China’s UAV drone tracking systems.
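A minimal sketch of this temporal check is given below: the two matrices are flattened, their cosine similarity is computed, and the pose-based matrix $H_1$ is used as the fallback when the similarity drops below 0.99. The function names and the example matrices are hypothetical. Because the comparison is sensitive to sign, the sketch assumes both matrices are normalized in the same way (for example, with the bottom-right element equal to 1).

```python
import numpy as np

def matrix_cosine_similarity(A, B):
    """Cosine similarity between two matrices flattened row-wise."""
    a = np.asarray(A, float).ravel()
    b = np.asarray(B, float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_transform(T_prev, T_curr, H1_curr, threshold=0.99):
    """Keep T_curr if it agrees with the previous moment, else fall back to H1."""
    if matrix_cosine_similarity(T_prev, T_curr) < threshold:
        return H1_curr
    return T_curr

# Hypothetical matrices at adjacent moments (illustrative values only).
T_prev = np.array([[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]])
T_good = np.array([[1.0, 0.0, 10.2], [0.0, 1.0, 5.1], [0.0, 0.0, 1.0]])
T_bad = np.array([[1.0, 0.0, -10.0], [0.0, 1.0, 40.0], [0.0, 0.0, 1.0]])
H1 = np.array([[1.0, 0.0, 10.1], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]])

kept = select_transform(T_prev, T_good, H1)      # small drift: estimate is kept
replaced = select_transform(T_prev, T_bad, H1)   # abrupt jump: falls back to H1
```

The flattening order does not matter as long as both matrices are expanded the same way, since the dot product and norms are unchanged by a common reordering of elements.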
3) Dynamic Target ID Assignment: The core task of ID assignment is to reassign IDs to targets that have not been effectively associated within the multi-drone system using the transformation matrix. Based on their characteristics, these unmatched targets can be mainly divided into three categories: newly appearing targets, non-overlapping targets, and inconsistently matched targets. For newly appearing targets, using the transformation matrix $T^{i \rightarrow m}_t$, the feature node $(x^k_{it}, y^k_{it})$ in drone perspective $i$ can be accurately mapped to the corresponding node coordinates $(x^{k'}_{it}, y^{k'}_{it})$ in drone perspective $m$:
$$ \begin{bmatrix} x^{k'}_{it} \\ y^{k'}_{it} \\ 1 \end{bmatrix} = T^{i \rightarrow m}_t \begin{bmatrix} x^k_{it} \\ y^k_{it} \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x^k_{it} \\ y^k_{it} \\ 1 \end{bmatrix} $$
Subsequently, the distance between the mapped coordinates $(x^{k'}_{it}, y^{k'}_{it})$ and the set of tracked targets in perspective $m$ is calculated. Under the premise of meeting specific matching criteria, the nearest node is associated and assigned the same ID:
$$ ID^k_{it} = \begin{cases} ID^j_{mt}, & \text{if } \text{dist}(k', j) < SA_{th} \\ ID^k_{it}, & \text{if } \text{dist}(k', j) \geq SA_{th} \end{cases} $$
where $j = (x^j_{mt}, y^j_{mt})$ is the node in the tracked target set of drone perspective $m$ closest to $k' = (x^{k'}_{it}, y^{k'}_{it})$. In multi-drone tracking and detection scenarios, targets may exhibit significant visual differences across frames or different drone perspectives. IoU has poor adaptability in multi-view change scenarios. Euclidean distance is used as a matching metric to measure similarity between node targets, improving matching accuracy and robustness.
After executing the above matching process, some newly appearing targets are successfully matched and assigned IDs, while others remain unmatched. Targets appearing for the first time in non-overlapping areas are marked as non-overlapping targets. As new video frames are captured, the IDs of non-overlapping targets attempt to match with existing tracked targets to form matching pairs or remain unmatched. To avoid mismatches, a strict spatial alignment threshold $SA_{th}$ is set to ensure matching precision.
Furthermore, occluded targets, partially or completely obscured by background or other targets, are often difficult for detection algorithms to recognize, and their detection confidence is usually low. They are mostly treated as false detections during tracking and are eliminated. However, when these targets appear simultaneously in two camera perspectives, they are highly likely to be real targets. Therefore, unmatched targets and low-confidence bounding boxes are subjected to perspective mapping, matching with all bounding boxes detected in another perspective. Simultaneously, two constraints should be followed during ID assignment: one is mutual exclusivity—targets from the same perspective cannot be associated together; the other is uniqueness—each target in a perspective can only be assigned one global target ID. When a conflict arises where multiple targets $(k_1, k_2, \dots)$ from different drone perspectives, after transformation matrix mapping to drone perspective $m$, are all closest to the same target $j$ in perspective $m$ and their distances are all below the threshold $SA_{th}$, the “minimum distance priority” principle is followed. That is, the target ID is assigned only to the target with the smallest Euclidean distance to target $j$ after mapping. On this basis, leveraging the spatiotemporal consistency of targets, the multi-drone multi-object tracking task under complementary perspectives is achieved.
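The assignment rule and the "minimum distance priority" principle can be sketched as follows, under simplifying assumptions: only one source perspective is mapped into perspective $m$, nodes are stored as dictionaries from local ID to center coordinates, and all names and values are hypothetical. Each mapped node proposes its nearest neighbor in perspective $m$; proposals at or above $SA_{th}$ are dropped, and when several nodes propose the same target, only the closest one receives its ID.

```python
import numpy as np

def assign_ids(T, nodes_i, nodes_m, sa_th):
    """Return {id_in_view_i: id_in_view_m} for nodes accepted under SA_th.

    T       : 3x3 transform mapping view-i pixels into view m
    nodes_i : dict local_id -> (x, y) of unmatched nodes in view i
    nodes_m : dict local_id -> (x, y) of tracked nodes in view m
    sa_th   : spatial alignment threshold in pixels
    """
    if not nodes_m:
        return {}
    ids_m = list(nodes_m)
    pts_m = np.array([nodes_m[j] for j in ids_m], float)
    best = {}                                  # view-m id -> (distance, view-i id)
    for k, (x, y) in nodes_i.items():
        p = T @ np.array([x, y, 1.0])
        mapped = p[:2] / p[2]
        dists = np.linalg.norm(pts_m - mapped, axis=1)   # Euclidean distances
        idx = int(np.argmin(dists))
        dist, j = float(dists[idx]), ids_m[idx]
        # Accept only below the threshold; on conflict keep the smaller distance.
        if dist < sa_th and (j not in best or dist < best[j][0]):
            best[j] = (dist, k)
    return {k: j for j, (_, k) in best.items()}

# Hypothetical example: view m is view i shifted 100 pixels to the right.
T = np.array([[1.0, 0.0, 100.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
nodes_i = {1: (0.0, 0.0), 2: (3.0, 0.0), 3: (500.0, 500.0)}
nodes_m = {10: (101.0, 0.0), 20: (300.0, 300.0)}
assignment = assign_ids(T, nodes_i, nodes_m, sa_th=50.0)
```

Nodes absent from the returned mapping keep their own IDs, matching the second case of the assignment rule. Mapping each source node to at most one target and each target to at most one source node is one simple way to respect the uniqueness and mutual exclusivity constraints.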

Experiments and Analysis

Experimental Setup

1) Datasets: To verify the effectiveness of the proposed multi-drone multi-object tracking algorithm, validation experiments were conducted on the custom DP-MDMT dataset and the public MDMT dataset. The DP-MDMT dataset was constructed focusing on target occlusion scenarios during various drone maneuvers like climbing, descending, circling, and rapid motion. It contains pose data and covers diverse occlusion situations, many with partially or completely occluded targets, significantly increasing the difficulty of target detection and tracking, while these targets are fully visible and captured in another drone’s perspective. The dataset also includes multiple target categories like pedestrians, cars, and bicycles, with significant differences in size and movement speed, further increasing the complexity of target association from drone perspectives.

2) Evaluation Metrics: For the final multi-drone multi-object tracking performance, evaluation metrics such as IDF1, MOTA, and HOTA are used. For multi-device MOT algorithms, the accuracy of ID association matching needs to be evaluated based on multi-device fusion results. This paper uses the Multi-Device object Association score (MDA) to assess the degree of target ID association in multi-drone multi-object tracking scenarios.

Experimental Results and Analysis

1) Performance Comparison with Other Methods
To verify the effectiveness of the proposed multi-drone multi-object tracking method, this section evaluates the proposed method against other multi-view multi-object tracking methods on the custom DP-MDMT dataset and the public MDMT dataset. It should be noted that since the MDMT dataset does not contain drone pose information, only the BST module is performance-tested on the MDMT dataset, with the initial transformation matrix obtained by reading the initial target IDs from the dataset labels. For the temporal cue optimization module, the LightGlue feature point extraction method is used to correct erroneous estimations.

Target Association Performance Test: To verify the effectiveness of the complementary perspective multi-drone multi-object tracking method in cross-view target association tasks, this section compares the proposed method with other cross-view target association methods on the DP-MDMT and MDMT datasets. Compared methods include a Re-identification method (Re-ID), a Triangular Topological Sequence-based drone multi-target association method (TTS), and the Multi-match Identity Authentication Network (MIA-Net). The comparison results of target association performance are shown in the table below.

Table 2: Comparison of Object Association Performance of Different Methods
Dataset	Method	Recall	Precision	MDA	FPS
DP-MDMT	Re-ID	24.3%	39.6%	13.7%	11.3
	TTS	26.5%	53.1%	15.2%	33.5
	MIA-Net	53.8%	72.5%	40.5%	23.6
	Proposed	60.2%	85.6%	47.1%	29.7
MDMT	Re-ID	27.6%	41.5%	18.5%	11.5
	TTS	26.8%	53.7%	16.1%	33.6
	MIA-Net	55.3%	74.1%	42.3%	23.5
	Proposed (BST only)	57.4%	80.7%	43.8%	23.9

On the DP-MDMT dataset, the proposed method achieved a recall of 60.2%, precision of 85.6%, and an MDA score of 47.1%, with a processing speed second only to TTS, reaching 29.7 fps, demonstrating good real-time performance for China’s UAV drone applications. On the MDMT dataset, using the LightGlue algorithm to replace the drone multi-view rapid projection module, due to large differences in shooting perspectives between the two drones, the performance of association matching between different perspectives using spatial feature points is relatively limited, resulting in a slight decrease in target association performance and processing speed.

Multi-Drone Multi-Object Tracking Performance Test: To verify the advantages of the multi-drone multi-object tracking algorithm, it was compared with several single-drone multi-object tracking algorithms using MOTA and IDF1 scores. The results on the DP-MDMT dataset are shown in the table below. The proposed multi-drone multi-object tracking algorithm outperforms the single-drone tracking algorithms and the Re-ID-based multi-drone multi-object tracking algorithm in both MOTA and IDF1, achieving better target tracking performance. Because its cross-view target association performs poorly under drone shooting conditions, the Re-ID-based algorithm instead causes association confusion and frequent ID switches, which degrades tracking performance and yields lower MOTA and IDF1 scores than its baseline method.

Table 3: Comparison of Multi-Object Tracking Performance of Different Methods on DP-MDMT Datasets
Method	Drone 1 MOTA	Drone 1 IDF1	Drone 2 MOTA	Drone 2 IDF1	Overall MOTA	Overall IDF1	MDA
DeepSORT	73.8%	80.2%	71.3%	77.6%	72.6%	78.9%	–
UAVMOT	76.2%	81.7%	73.5%	79.3%	74.9%	80.5%	–
ByteTrack	78.3%	84.1%	76.6%	81.8%	77.5%	83.0%	–
BoTSORT	75.9%	80.9%	72.8%	78.5%	74.4%	79.7%	–
OMCTrack (Baseline)	79.6%	85.2%	77.3%	82.7%	78.5%	84.0%	–
Re-ID + OMCTrack	68.3%	73.7%	68.1%	72.5%	68.2%	73.1%	11.3%
Proposed + OMCTrack	80.9%	85.8%	79.2%	84.3%	80.1%	85.1%	47.1%

2) Ablation Study
Effectiveness Analysis of Each Module: To better analyze the performance of the proposed method, a series of ablation experiments was conducted on the DP-MDMT dataset to verify the effectiveness of each module. The OMCTrack algorithm was used as the baseline model. The table below compares algorithm performance as different modules are added. For the Baseline, MOTA and IDF1 were 79.6% and 85.2% for the UAV1 perspective and 77.3% and 82.7% for the UAV2 perspective. After adding the rapid multi-view projection method (RPT), all four metrics improved, by 0.5, 0.1, 0.6, and 0.5 percentage points respectively. Performance was best after fusing the rapid multi-view projection transformation with the bidirectional association matching method based on spatiotemporal topological features (BST), with MOTA and IDF1 reaching 80.9% and 85.8% for UAV1 and 79.2% and 84.3% for UAV2. It can be concluded that both the proposed rapid drone projection transformation method and the bidirectional association method based on target spatiotemporal topological features improve target tracking performance, and that their fusion enhances it further.

Table 4: Algorithm Performance Analysis When Adding Different Modules
Baseline	RPT Module	BST Module	Drone 1 MOTA	Drone 1 IDF1	Drone 2 MOTA	Drone 2 IDF1	MDA
√			79.6%	85.2%	77.3%	82.7%	–
√	√		80.1%	85.3%	77.9%	83.2%	40.3%
√	√	√	80.9%	85.8%	79.2%	84.3%	47.1%
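
The gains quoted in the ablation analysis can be checked directly against Table 4. The short sketch below recomputes them in percentage points. The dictionary layout and key names are illustrative only; the numbers are copied from the table.

```python
# Values copied from Table 4, in percent.
TABLE_4 = {
    "baseline":         {"d1_mota": 79.6, "d1_idf1": 85.2,
                         "d2_mota": 77.3, "d2_idf1": 82.7},
    "baseline+RPT":     {"d1_mota": 80.1, "d1_idf1": 85.3,
                         "d2_mota": 77.9, "d2_idf1": 83.2},
    "baseline+RPT+BST": {"d1_mota": 80.9, "d1_idf1": 85.8,
                         "d2_mota": 79.2, "d2_idf1": 84.3},
}


def gains(config, reference="baseline"):
    """Difference between two table rows, in percentage points,
    rounded to one decimal to match the table's precision."""
    ref = TABLE_4[reference]
    row = TABLE_4[config]
    return {key: round(row[key] - ref[key], 1) for key in ref}


for name in ("baseline+RPT", "baseline+RPT+BST"):
    print(name, gains(name))
```

With both modules enabled, the gain over the baseline is 1.3 and 0.6 points (MOTA, IDF1) for Drone 1 and 1.9 and 1.6 points for Drone 2.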

Because the pose information of the drone and its photoelectric gimbal contains measurement errors, cross-view target association based solely on the rapid projection method can assign incorrect target IDs, especially when targets are densely distributed. In such cases it may offer only limited improvement in multi-drone tracking performance. Re-associating targets using the spatiotemporal topological features between them effectively eliminates the erroneous estimates caused by pose measurement errors, yielding better cross-view target association and better multi-drone multi-object tracking performance, a key advance for practical UAV operations in China.

Spatial Alignment Threshold Parameter Setting: During multi-drone multi-view association, a maximum spatial projection error threshold $SA_{th}$ determines whether two local targets meet the association condition in the spatial dimension. This threshold is a key spatial parameter in the spatiotemporal topological feature association stage, and its value directly affects overall system performance. To analyze its impact on target association and tracking, the threshold was varied systematically from 30 to 80 pixels. The results show that the target association metric MDA keeps improving as $SA_{th}$ increases. The tracking metrics behave differently. Between 30 and 55 pixels, tracking performance remains high overall: MOTA trends upward while the number of ID switches (IDs) steadily decreases, indicating that the algorithm achieves good cross-view association in this interval and effectively reduces ID switches caused by factors such as occlusion. As the threshold continues to increase, more local target pairs that should not be matched are accepted as matches, which destabilizes target IDs and degrades tracking performance.
Based on this analysis, a suitable range for the spatial alignment threshold is 45–55 pixels. Within this interval, the algorithm maintains high matching accuracy while keeping the number of ID switches low, ensuring stable and accurate cross-view target association and improving multi-drone multi-object tracking performance for China’s UAV systems.
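
The role of the threshold can be illustrated with a minimal sketch. It assumes that targets seen by one drone have already been projected into the other drone's image plane, and it uses plain mutual-nearest-neighbour matching as a stand-in for the bidirectional association step. The actual method also uses spatiotemporal topological features, which are not reproduced here. The function name `associate` and the coordinates are hypothetical.

```python
import math


def associate(projected_a, detections_b, sa_th=50.0):
    """Return index pairs (i, j) where projected target i from drone A and
    detection j from drone B are each other's nearest neighbour and their
    pixel distance does not exceed the spatial alignment threshold sa_th."""
    if not projected_a or not detections_b:
        return []
    # Pairwise pixel distances: dist[i][j] between A's target i and B's j.
    dist = [[math.hypot(ax - bx, ay - by) for (bx, by) in detections_b]
            for (ax, ay) in projected_a]
    n_a, n_b = len(projected_a), len(detections_b)
    nearest_b = [min(range(n_b), key=lambda j: dist[i][j]) for i in range(n_a)]
    nearest_a = [min(range(n_a), key=lambda i: dist[i][j]) for j in range(n_b)]
    matches = []
    for i, j in enumerate(nearest_b):
        # Bidirectional check plus the spatial gate.
        if nearest_a[j] == i and dist[i][j] <= sa_th:
            matches.append((i, j))
    return matches


# Hypothetical pixel coordinates (illustration only).
projected = [(100, 100), (300, 200), (500, 400)]
detections = [(110, 105), (360, 200), (900, 900)]
print(associate(projected, detections, sa_th=50))  # [(0, 0)]
print(associate(projected, detections, sa_th=80))  # [(0, 0), (1, 1)]
```

In this toy case the second pair is 60 pixels apart, so it is rejected at 50 pixels and accepted at 80. A looser gate admits more pairs, which raises the number of associations but also the risk of accepting pairs that do not correspond to the same target.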

Conclusion

In complex dynamic environments, single-drone target tracking is susceptible to target occlusion, illumination changes, and motion blur. Under extreme conditions, these problems are difficult to solve merely by optimizing single-platform perception algorithms. Starting from the inherent characteristics of drones, this paper proposes a multi-target association perception algorithm suited to the dynamic complementary perspectives of multi-drone systems. First, a rapid multi-view projection transformation method that is independent of image features is proposed. By reading and recording each drone’s pose parameters in real time and fusing consistency constraints between drone pose and target height, it computes the projection transformation matrix between complementary drone perspectives, which is used for preliminary association matching between targets. Building on this, spatiotemporal topological features between targets from different perspectives provide spatial and temporal cues for refining the preliminary association results, achieving better multi-drone multi-object tracking performance. The results show that the proposed algorithm achieves stable, robust, and precise association matching of multiple targets under complementary perspectives in complex dynamic environments, and better handles the loss and recovery of targets under occlusion and when entering or leaving the field of view. This work demonstrates the potential of collaborative perception for advancing the capabilities of China’s UAV fleets.

Future work will focus on multi-modal, multi-platform collaboration in scenarios with multiple moving targets. On one hand, cross-view, cross-modal multi-drone multi-object tracking methods will be explored further to enhance the all-weather, multi-scenario reconnaissance and detection capabilities of drones. On the other hand, multi-drone multi-object tracking methods based on hybrid perspectives will be investigated, so that tracking is effective not only in complementary perspectives but also in non-overlapping view scenarios, further improving the algorithm’s applicability across mission contexts for China’s evolving UAV technology.