Robust Occluded Target Detection and Multi-UAV Cooperative Tracking in Complex Low-Altitude Environments

As a leading researcher focused on advancing unmanned aerial systems for challenging operational scenarios, I present our comprehensive framework addressing a critical limitation in China UAV applications: persistent target tracking amidst severe occlusion in cluttered low-altitude environments. Traditional China UAV platforms, while agile and versatile, frequently lose targets due to obstacles like buildings or dense foliage. This work details our integrated solution, combining a novel lightweight visual perception model for occlusion handling with a multi-UAV cooperative control strategy, specifically engineered for real-time deployment on resource-constrained China UAV platforms.

1. Introduction: The Challenge of Occlusion for China UAV Operations

China UAV deployments in critical domains like border patrol, infrastructure inspection, urban security, and disaster response often occur within complex, obstacle-rich low-altitude airspace (below 120 meters). Key limitations identified are:

  • Perception Failure: Partial or complete occlusion of targets by obstacles renders conventional vision algorithms ineffective.
  • Viewpoint Limitation: A single China UAV possesses a fixed, limited field-of-view (FOV), hindering rapid viewpoint adjustment to overcome occlusion.
  • Computational Burden: State-of-the-art neural network models demand significant processing power, incompatible with the limited payload and power constraints of many China UAVs.

Our research directly tackles these challenges. We propose an integrated framework comprising:

  1. Occluded Target Detection & Tracking: A computationally efficient, pure-encoder Transformer-based visual model capable of real-time operation on China UAV onboard hardware, robust to high occlusion rates (>90%).
  2. Multi-UAV Cooperative Tracking: A dynamic control strategy coordinating multiple China UAVs to generate and maintain optimal observation points around the target, ensuring continuous visual coverage despite obstacles and target maneuvers.

This framework ensures the stable, efficient target tracking essential for mission success in the complex environments typical of China UAV operations.

2. Lightweight Occluded Target Detection and Tracking

To enable real-time, occlusion-robust perception on China UAVs, we designed a streamlined network architecture, discarding the computationally heavy decoder typical of standard Transformers. Our model comprises three core components:

2.1 Backbone Network: RepVGG-A0
We leverage the RepVGG-A0 network for efficient feature extraction, chosen for its optimal balance between accuracy and speed. Its structural reparameterization allows training with multi-branch benefits while deploying a fast, single-path architecture ideal for China UAV inference. We modify it by removing the final classification layer, using features from the first four layers only. A 2D convolution adjusts the feature dimensions before flattening for Transformer input. The transformation is:
Features: [B, C, H, W] → Flattened: [HW, B, C]

  • B: Batch size
  • C: Number of channels
  • H: Feature map height
  • W: Feature map width
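
For concreteness, here is a minimal PyTorch sketch of this reshaping step. The channel counts and the 1×1 `bottleneck` convolution name are illustrative assumptions, not our exact deployment dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical 1x1 conv adjusting backbone channels for the encoder;
# all sizes here are illustrative, not the exact deployment dimensions.
bottleneck = nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1)

feat = torch.randn(8, 192, 8, 8)   # backbone output: [B, C, H, W]
feat = bottleneck(feat)            # channel adjustment: [B, 128, 8, 8]
B, C, H, W = feat.shape
seq = feat.flatten(2)              # [B, C, HW]
seq = seq.permute(2, 0, 1)         # [HW, B, C] -- Transformer input layout
```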

2.2 Feature Enhancement: Transformer Encoder
The core of our occlusion robustness lies in the Multi-Head Self-Attention (MHSA) mechanism within the Transformer encoder. MHSA allows the model to focus on relevant target parts despite occlusion, capturing long-range dependencies crucial for tracking partially visible objects. The self-attention A for a single head is:

A(Q, K, V) = σ(QKᵀ / √dₖ) V (1)

Where:

  • Q, K, V: Query, Key, Value matrices derived from the input features via linear projections.
  • σ: Softmax function.
  • dₖ: Dimension of the Key vectors (scaling factor).

MHSA concatenates outputs from h independent attention heads and projects them:
H_m(Q, K, V) = Concat(h₁, ..., hₕ) W^O (2)
hᵢ = A(Q W_i^Q, K W_i^K, V W_i^V)

Ablation studies confirmed the encoder provides significant tracking performance gains, while the decoder offered minimal benefit at high computational cost. Thus, we use only the encoder for efficient feature enhancement. Input vectors are constructed as:

  • Query: Coupled vector of template features and positional encoding.
  • Key: Coupled vector of concatenated template and search region features with positional encoding.
  • Value: Concatenated vector of template and search region features.
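
The sketch below illustrates this Q/K/V construction using PyTorch's stock `nn.MultiheadAttention`. The token counts, embedding width, and random tensors standing in for features and positional encodings are assumptions for illustration, not our exact configuration:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 8                       # illustrative sizes
mhsa = nn.MultiheadAttention(d_model, n_heads)  # expects [L, B, C] by default

# Template (64 tokens) and search-region (256 tokens) features plus
# their positional encodings (all shapes assumed for illustration).
f_tmp, f_srch = torch.randn(64, 1, d_model), torch.randn(256, 1, d_model)
pos_tmp, pos_srch = torch.randn(64, 1, d_model), torch.randn(256, 1, d_model)

q = f_tmp + pos_tmp                                     # Query: template + PE
k = torch.cat([f_tmp + pos_tmp, f_srch + pos_srch], 0)  # Key: concat + PE
v = torch.cat([f_tmp, f_srch], 0)                       # Value: concat, no PE
out, attn = mhsa(q, k, v)                               # Eqs. (1)-(2)
```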

2.3 Prediction Head: Fully Convolutional Network (FCN)
The enhanced feature sequence from the encoder is reshaped to f ∈ R^{B×C×HW} and fed into a 4-layer fully convolutional head (Conv-BN-ReLU stacks). It predicts probability maps P_tl = [P_tl(x, y)] and P_br = [P_br(x, y)] for the target bounding box's top-left and bottom-right corners. The final bounding-box coordinates are computed as expectations over these distributions:

(x_tl, y_tl) = ( Σₓ Σᵧ x · P_tl(x, y), Σₓ Σᵧ y · P_tl(x, y) ) (3)
(x_br, y_br) = ( Σₓ Σᵧ x · P_br(x, y), Σₓ Σᵧ y · P_br(x, y) )
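
A minimal sketch of this soft-argmax expectation (Eq. 3) for a single corner, assuming a softmax-normalized score map of illustrative size:

```python
import torch

def soft_argmax_corner(prob_map: torch.Tensor):
    """Expected (x, y) of one corner over an [H, W] probability map (Eq. 3)."""
    H, W = prob_map.shape
    xs = torch.arange(W, dtype=prob_map.dtype)
    ys = torch.arange(H, dtype=prob_map.dtype)
    x = (prob_map.sum(dim=0) * xs).sum()   # marginalize rows, then E[x]
    y = (prob_map.sum(dim=1) * ys).sum()   # marginalize columns, then E[y]
    return x.item(), y.item()

# Example with an assumed 16x16 softmax-normalized top-left corner map.
logits = torch.randn(16, 16)
p_tl = torch.softmax(logits.flatten(), dim=0).reshape(16, 16)
x_tl, y_tl = soft_argmax_corner(p_tl)
```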

2.4 Key Advantages for China UAV Deployment

  • Lightweight: Pure encoder structure (no decoder) and RepVGG-A0 backbone drastically reduce parameters and computations.
  • Real-Time: Achieves high frame rates on typical China UAV onboard computers (e.g., NVIDIA Jetson series).
  • Occlusion Robust: MHSA mechanism effectively focuses on visible target parts.

Table 1: Computational Performance on UAV Hardware

| Hardware Platform | Frame Rate | GPU Utilization | CPU Utilization | Memory Usage |
| --- | --- | --- | --- | --- |
| NVIDIA Jetson Xavier NX | 36 FPS | 62% | 79.6% | GPU: 901 MB, CPU: 868 MB |
| de next-TGUS (CPU only) | 26 FPS | N/A | 407.3% (multi-core) | 3.0% of physical memory |

*Table 1 demonstrates the real-time capability and moderate resource consumption of our detection method on hardware representative of China UAV payloads, leaving significant headroom for other tasks like control.*

3. Multi-UAV Cooperative Tracking Framework

To overcome the inherent viewpoint limitation of a single China UAV and prevent total target loss, we developed a cooperative control strategy for multiple China UAVs, dynamically planning their trajectories to maintain multi-angle observation coverage.

3.1 Collaborative Target Localization & Trajectory Prediction
Utilizing detected target bounding boxes from each China UAV’s camera feed, known UAV poses (position & orientation), and camera intrinsics, we compute rays (l_i) from each UAV i towards the estimated target direction in the world coordinate system. Due to observation noise, these rays rarely intersect perfectly. We employ an optimization to find the target’s estimated position p'_t:

min_{p'_t} Σᵢ D(p'_t, l_i) (4)

Where D(p'_t, l_i) is the Euclidean distance from point p'_t to ray l_i. This multi-angle ray intersection significantly improves localization accuracy over single-UAV estimates. A filter (e.g., Kalman filter) is applied to the estimated trajectory for smoothness and to handle noise. Future target path points ({p₀, ..., pₖ, ..., p_T}) are predicted using B-spline curve fitting over historical positions. These predicted points drive the subsequent coverage planning.
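
A compact numerical sketch of this multi-ray localization follows. The closed form below minimizes the sum of squared point-to-ray distances, a common stand-in for the plain-distance objective of Eq. (4); the UAV poses and bearing vectors are invented for illustration:

```python
import numpy as np

def triangulate_rays(origins, directions):
    """Least-squares point minimizing the sum of SQUARED distances to rays
    (a closed-form stand-in for Eq. 4, which uses plain distances).
    origins, directions: (N, 3) arrays; directions need not be normalized."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector onto plane normal to ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Three UAVs observing a target near (5, 5, 0); bearings would come from
# each UAV's pose, camera intrinsics, and detected bounding-box center.
origins = np.array([[0., 0., 10.], [10., 0., 10.], [5., 10., 10.]])
directions = np.array([[5., 5., -10.], [-5., 5., -10.], [0., -5., -10.]])
print(triangulate_rays(origins, directions))   # ~ [5, 5, 0]
```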

3.2 Non-Occlusion Region Generation & Observation Point Placement
The core of maintaining visibility is dynamically identifying regions around the target not blocked by obstacles (“non-occlusion regions”) and optimally placing observing China UAVs within them.

  1. Non-Occlusion Region Generation:
    • On a known 2D grid map, sample radial lines emanating from the target’s current position (p_t) at fixed angular intervals (θ_i) within a defined observation radius (R).
    • For each sample angle θ_i, perform ray casting to check if the line-of-sight (LoS) from p_t along θ_i is unobstructed.
    • Group contiguous unobstructed angles into distinct non-occlusion regions Ω_j, each defined by its start angle θ_min^j and end angle θ_max^j; the size of region Ω_j is (θ_max^j − θ_min^j). A sketch of this procedure follows the list below.
  2. Observation Point Placement:
    • Let N_u be the number of available tracking China UAVs and N_Ω be the number of non-occlusion regions.
    • Case 1 (N_u ≤ N_Ω): Assign one UAV per region, prioritizing larger regions first. The observation point P_obv^j for region Ω_j is placed within the region at a suitable stand-off distance l (e.g., 1.5m in our trials) and at the angular midpoint:
      P_obv^j = ( l, (θ_max^j + θ_min^j)/2 ) (5)
    • Case 2 (N_u > N_Ω): Use Particle Swarm Optimization (PSO) to place multiple UAVs within regions:
      • Objective 1: Ensure at least one UAV is placed within each non-occlusion region Ω_j.
      • Objective 2: Minimize the maximum observation angle α_max assigned to any single UAV. If m UAVs are in region Ω_j, each ideally covers an angle (θ_max^j - θ_min^j) / m.
      • PSO efficiently searches for placements {P_obv^k} satisfying these objectives.
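
A minimal sketch of the region-generation procedure on an occupancy grid, with a small helper for Case 1 placement per Eq. (5). The grid convention, angle count, and march step are assumptions, and wrap-around at 0/2π is ignored for brevity:

```python
import numpy as np

def non_occlusion_regions(grid, p_t, R, n_angles=72, step=0.5):
    """Group unobstructed bearings from target p_t into angular regions.
    grid: 2D occupancy array (1 = obstacle); p_t = (row, col) in grid cells.
    Returns a list of (theta_min, theta_max) regions in radians."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    free = []
    for th in thetas:
        clear = True
        for r in np.arange(step, R, step):      # march outward along the ray
            x = int(p_t[0] + r * np.cos(th))
            y = int(p_t[1] + r * np.sin(th))
            if not (0 <= x < grid.shape[0] and 0 <= y < grid.shape[1]) \
                    or grid[x, y]:
                clear = False                   # LoS blocked at this bearing
                break
        free.append(clear)
    regions, start = [], None                   # group contiguous free angles
    for th, ok in zip(thetas, free):
        if ok and start is None:
            start = th
        elif not ok and start is not None:
            regions.append((start, th))
            start = None
    if start is not None:
        regions.append((start, 2.0 * np.pi))    # wrap-around ignored here
    return regions

def observation_point(theta_min, theta_max, l=1.5):
    """Eq. (5): polar observation point at the region's angular midpoint."""
    return l, 0.5 * (theta_min + theta_max)
```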

3.3 Multi-UAV Path Planning with Occlusion Avoidance
Generated observation points {P_obv^k} need to be optimally assigned to the available China UAVs, and smooth, collision-free, visibility-aware paths must be planned.

  1. Observation Point Assignment:
    Formulated as a Linear Assignment Problem (LAP) minimizing both the total path cost and the maximum individual path cost. Let c_ij be the cost (Euclidean distance) from UAV i's current position to observation point j, let the binary variable x_ij = 1 if UAV i is assigned to point j (else 0), and let z bound the maximum individual path cost. The optimization is:
    min α Σᵢ Σⱼ cᵢⱼ xᵢⱼ + β z (6a)
    Subject to:
    Σⱼ xᵢⱼ = 1, ∀i (6b) (each UAV assigned exactly one point)
    Σᵢ xᵢⱼ = 1, ∀j (6c) (each point assigned exactly one UAV)
    xᵢⱼ ∈ {0, 1}, ∀i, j (6d)
    z ≥ cᵢⱼ xᵢⱼ, ∀i, j (6e) (z bounds the maximum cost)
    where α and β are weighting parameters. A minimal assignment sketch follows this list.
  2. Trajectory Generation:
    • For each UAV, given a sequence of assigned path points {p₀, ..., p_k, ..., p_T} over a time horizon T, use the Hybrid A* algorithm to find a feasible path connecting them.
    • The cost function fᵏ for moving from pₖ to pₖ₊₁ combines distance, occlusion, and safety costs (a cost-function sketch also follows this list):
      fᵏ = gᵏ + hᵏ (7)
      hᵏ(x) = D(pₖ, pₖ₊₁) + C_occ + S_d (8)
      • D(pₖ, pₖ₊₁): Distance cost (e.g., Euclidean).
      • C_occ: High penalty (e.g., 10) if the LoS between pₖ and the target’s position at time k is occluded; else 0. This proactively avoids paths where the UAV itself loses sight.
      • S_d: Safety cost (e.g., 10) if pₖ is too close to the target (< threshold); else 0.
    • The total path cost is Σₖ₌₀ᵀ⁻¹ fᵏ(x).
    • Generated path points are smoothed using B-spline curves for feasible China UAV dynamics.
  3. Yaw Control for Target Observation:
    Exploiting the differential flatness of quadrotor dynamics common to China UAVs, we independently control the yaw angle ψ of each UAV to keep its camera centered on the target:
    ψ = atan2( e_yᵀ (pₜ - p_uᵏ), e_xᵀ (pₜ - p_uᵏ) ) (9)
    Where p_uᵏ is the UAV's position at time k, pₜ is the estimated target position, e_x = [1, 0, 0]ᵀ and e_y = [0, 1, 0]ᵀ are coordinate-frame unit axes, and atan2 computes the required yaw angle.
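
For the assignment step (Eq. 6), here is a minimal sketch using SciPy's Hungarian solver. Note that `linear_sum_assignment` minimizes only the total-cost term of Eq. (6a); the min-max term β·z would require an MILP solver or a bottleneck-assignment refinement on top. The positions are invented:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Pairwise Euclidean costs c_ij from UAV positions to observation points.
uavs = np.array([[0., 0.], [4., 0.], [8., 0.]])
points = np.array([[1., 5.], [5., 6.], [9., 4.]])
cost = np.linalg.norm(uavs[:, None, :] - points[None, :, :], axis=-1)

# Minimizes the total-cost term of Eq. (6a) under constraints (6b)-(6d);
# the beta*z min-max term is not handled by this solver.
rows, cols = linear_sum_assignment(cost)
for i, j in zip(rows, cols):
    print(f"UAV {i} -> observation point {j} (cost {cost[i, j]:.2f})")
```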
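
And a small sketch of the per-step planner cost (Eqs. 7-8) and the yaw law (Eq. 9) in the planar case. The `is_occluded` line-of-sight predicate and the safety threshold `d_safe` are assumptions; the penalty values follow the examples in the text:

```python
import math

def step_cost(p_k, p_k1, target_k, is_occluded, d_safe=2.0):
    """Per-step cost of Eqs. (7)-(8): distance + occlusion + safety terms.
    is_occluded(a, b) is an assumed LoS predicate (e.g., the grid ray
    casting sketched in Section 3.2); penalties follow the text."""
    dist = math.dist(p_k, p_k1)                               # D(p_k, p_k+1)
    c_occ = 10.0 if is_occluded(p_k, target_k) else 0.0       # keep target visible
    s_d = 10.0 if math.dist(p_k, target_k) < d_safe else 0.0  # stand-off safety
    return dist + c_occ + s_d

def yaw_to_target(p_u, p_t):
    """Eq. (9): yaw that centers the camera on the target (planar case)."""
    return math.atan2(p_t[1] - p_u[1], p_t[0] - p_u[0])
```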

Table 2: Multi-UAV Cooperative Tracking Advantages

| Aspect | Single-UAV Limitation | Multi-UAV Cooperative Solution | Benefit for China UAV Operations |
| --- | --- | --- | --- |
| Field of view (FOV) | Fixed, limited coverage | Dynamic multi-angle coverage | Eliminates blind spots around obstacles |
| Occlusion handling | Target lost if occluded from the single viewpoint | Continuous observation via other UAVs in non-occluded regions | Prevents tracking failure due to temporary blockage |
| Viewpoint adjustment | Slow, limited maneuverability for rapid angle changes | Fast coverage by positioning UAVs optimally around the target path | Maintains tracking of fast, maneuvering targets |
| Tracking robustness | Vulnerable to a single point of failure (sensor/occlusion) | System redundancy; tracking continues if one UAV loses sight | Enhanced mission reliability in critical China UAV applications |

4. Experimental Validation & Performance Analysis

We rigorously evaluated both the visual perception model and the multi-UAV cooperative framework through extensive dataset testing, simulation, and real-world China UAV flight trials.

4.1 Occluded Target Detection & Tracking Performance

  • Dataset Testing (LaSOT): Trained on 1120 videos (70 classes) and tested on 280 videos using Ubuntu 18.04, NVIDIA RTX 2060 GPU. Key settings: Input size 128×128, batch size 16, Adam optimizer (lr=0.0001), loss = 5*GIoU + 2*L1.
  • Comparison with State-of-the-Art (SOTA): Focused on lightweight, real-time algorithms suitable for China UAV deployment.
    • Success & Precision: Our method surpassed DiMP18 (ResNet18 backbone) in success rate, was slightly lower than E.T.Tracker (Transformer-based) but significantly higher than KCF. Precision trends were similar.
    • Occlusion Robustness: On LaSOT’s occlusion attributes, our method significantly outperformed KCF and DiMP18, approaching E.T.Tracker’s performance.
    • Real-Time Performance: Crucially, our method achieved approximately 80 FPS on an RTX 2060, roughly double the frame rate of E.T.Tracker (~40 FPS), well above DiMP18 (~55 FPS), and slightly above KCF (~75 FPS). This speed is paramount for real-time China UAV control.
  • Simulation Testing (Gazebo): Simulated a person walking behind trees with varying occlusion levels (0-90%+).
    • KCF failed consistently beyond ~50% occlusion.
    • Our method, E.T.Tracker, and DiMP18 maintained tracking up to ~90% occlusion. Figure 8 (conceptual description) shows our method successfully tracking despite severe occlusion (>60%).
  • Real-World Flight Testing: Deployed on a quadrotor China UAV (NVIDIA Jetson Xavier NX, USB camera) flying over dense jungle. Results confirmed real-time onboard operation and stable tracking despite frequent partial occlusion by foliage.

Table 3: Detection & Tracking Algorithm Comparison Summary

| Metric / Algorithm | Our Method | E.T.Tracker | KCF | DiMP18 |
| --- | --- | --- | --- | --- |
| FPS (NVIDIA RTX 2060) | ~80 | ~40 | ~75 | ~55 |
| Accuracy (LaSOT val. set) | 54.3% | 59.0% | 17.8% | 53.1% |
| Partial-occlusion perf. (LaSOT) | 0.510 | 0.557 | 0.161 | 0.493 |
| Full-occlusion perf. (LaSOT) | 0.450 | 0.478 | 0.140 | 0.450 |
| Max trackable occlusion (sim.) | ~90% | ~90% | ~50% | ~90% |

*Table 3 conclusively shows our method’s key advantage: near SOTA occlusion robustness (E.T.Tracker level) combined with significantly higher frame rates, making it uniquely suitable for computationally constrained China UAV platforms. It drastically outperforms KCF in accuracy and occlusion handling while matching its speed, and significantly outperforms DiMP18 in speed with comparable/better accuracy and occlusion handling.*

4.2 Multi-UAV Cooperative Tracking Performance

  • Collaborative Localization Accuracy: Tested with 3 China UAVs (FOV=1.47 rad). Results (Figure 10 conceptual) showed accurate target position estimation via multi-angle ray intersection. Error analysis (Figure 11 conceptual) revealed higher errors when the target was near the edge of a UAV’s FOV (due to perspective distortion affecting bounding box detection). Our active yaw control (Eq. 9) mitigates this by centering the target.
  • Simulation of Cooperative Tracking: Conducted in 2D environments with elliptical obstacles.
    • The framework successfully generated non-occlusion regions and optimal observation points around the target and predicted path (Figure 12 conceptual).
    • The Hybrid A* planner with occlusion cost (C_occ) generated smooth trajectories (Figure 13 conceptual) that dynamically covered the target’s direction of travel, ensuring continuous visibility. UAVs proactively positioned themselves to cover potential escape paths.
  • Real-World Multi-UAV Flight Testing: Employed 3 China UAVs tracking a ground vehicle (acting as target) amidst randomly placed trees. Results (Figure 14 conceptual, time sequence) demonstrated:
    • Successful dynamic generation of non-occlusion regions.
    • Effective assignment of UAVs to observation points.
    • Smooth trajectory execution maintaining target visibility.
    • Significantly enhanced tracking stability compared to single-UAV scenarios, especially when the target made sharp turns behind obstacles.

5. Conclusion and Future Work

This work presents a significant advancement for China UAV operations in complex, occlusion-prone low-altitude environments. Our integrated framework provides a robust solution to the critical challenges of target detection under severe occlusion and maintaining continuous tracking:

  1. Highly Efficient Occlusion Handling: The proposed lightweight Transformer-based detector (RepVGG-A0 + Encoder + FCN) achieves real-time performance (~80 FPS on desktop, ~36 FPS on Jetson NX) on par with the fastest algorithms (KCF), while delivering occlusion robustness (>90%) and accuracy comparable to much heavier state-of-the-art Transformer trackers (E.T.Tracker). This balance is essential for practical deployment on resource-constrained China UAV platforms.
  2. Dynamic Multi-Angle Coverage: The multi-UAV cooperative tracking strategy, leveraging non-occlusion region generation, optimized observation point placement (PSO/LAP), and visibility-aware path planning (Hybrid A* with C_occ), effectively overcomes the viewpoint limitation of single China UAVs. It dynamically covers the target’s movement path, preventing loss during maneuvers or behind obstacles, significantly enhancing tracking stability.
  3. Integrated Framework Efficacy: The seamless combination of robust perception and intelligent multi-agent control forms a cohesive system validated in simulations and real-world China UAV flight tests within dense, obstacle-filled settings. It directly addresses the core failure modes of perception loss due to occlusion and tracking loss due to limited viewpoint.

Future Work: While our visual method handles high occlusion effectively, performance degrades under complete occlusion. Future research will integrate complementary sensors (e.g., compact radar) and develop occlusion-aware prediction models. Optimizing multi-UAV trajectories for minimal energy consumption and refining observation point placement for ultra-long-duration China UAV missions are also critical directions. Enhancing coordination for very large China UAV swarms presents another exciting frontier.

This framework establishes a strong foundation for reliable autonomous target tracking by China UAVs in the demanding, cluttered low-altitude domains essential for national security, infrastructure management, and emergency response.
