ZY-DETR: A Lightweight Transformer-Based Detector for Small Objects in Unmanned Aerial Vehicle Imagery

The rapid advancement of unmanned aerial vehicle (UAV) technology has transformed remote sensing, enabling wide-area surveillance, dynamic scene capture, and high-resolution data acquisition. UAVs have become a cornerstone of applications such as traffic monitoring, aerial search and rescue, smart city management, and precision agriculture. Central to these applications is robust and efficient object detection in UAV imagery. This task poses challenges distinct from conventional ground-level vision: targets are often extremely small, occupying a minuscule portion of the high-resolution frame; scenes exhibit extreme scale variation, with objects ranging from distant vehicles to large infrastructure within the same image; and backgrounds are frequently cluttered with complex terrain, shadows, and occlusions. As a result, standard object detectors designed for natural images perform poorly: their accuracy on small objects is low, and their computational cost is too high for edge deployment on UAV platforms.

Over the past three years, research on UAV object detection has converged on three primary avenues: lightweight Convolutional Neural Network (CNN) adaptations, Transformer-based detector optimizations, and innovations in feature fusion networks. CNN-based approaches, epitomized by adaptations of the YOLO series such as Drone-YOLO and TPH-YOLOv5, add specialized detection layers for small scales and incorporate attention mechanisms. While they achieve high inference speeds, their homogeneous backbone design offers weak synergy between shallow and deep features, and their reliance on Non-Maximum Suppression (NMS) often leads to missed detections for densely packed small objects. Detection Transformer (DETR) models introduced an end-to-end, NMS-free paradigm. Among them, Real-Time DETR (RT-DETR) overcame the slow convergence of the original DETR and reaches 76.28 FPS, establishing a strong baseline for UAV sensing. Yet its multi-scale fusion mechanism is not optimized for the redundancy and scale extremes of aerial imagery, and it underutilizes high-resolution shallow features, resulting in subpar small object accuracy. Feature fusion networks such as FPN, PANet, and BiFPN enhance detection by combining features across layers, but they operate with quadratic $$O(n^2)$$ complexity and rely on “downsample-upsample” logic that irreversibly loses fine-grained details crucial for small objects in UAV views.

In summary, the core deficiencies of the current state of the art can be distilled into three intertwined problems: 1) The inherent trade-off between preserving fine-grained details of small objects and maintaining a lightweight model footprint, as homogeneous backbones fail to specialize for both detail retention and semantic refinement; 2) Inefficient multi-scale fusion, where standard attention mechanisms are computationally prohibitive for the vast number of tokens in high-resolution UAV images; 3) A lack of scenario-specific innovation in leveraging high-resolution features, where traditional fusion logic is not tailored to the sparse nature of small object features in aerial contexts. Furthermore, existing works often lack a rigorous quantitative analysis of the compute-accuracy trade-off and comparable cross-dataset validation.

To address these pivotal challenges, we propose ZY-DETR (Zoomed-Yield Detection Transformer), an enhanced RT-DETR-based algorithm specifically designed for small object detection in unmanned aerial vehicle remote sensing imagery. Our core contributions are threefold:

We design GradNet, a heterogeneous backbone network with specialized shallow and deep layers. The shallow layers employ a Cross-Stage Partial network (C2f) module to preserve edge and texture details of small objects. The deep layers feature a novel C2f-CGSA module that fuses a Convolutional Gated Linear Unit (CGLU) with Single-Head Self-Attention (SHSA) to achieve lightweight, local-to-global feature refinement.
We introduce a Token Statistical Self-Attention (TSSA) mechanism and construct the IST-Fusion (Inter-Scale Fusion based on Token Statistical Self-Attention) module. Building upon the Adaptive Information Fusion (AIFI) module for channel redundancy compression, we replace the standard Multi-Head Self-Attention (MHSA) with our TSSA. This reduces the complexity of cross-scale feature alignment from quadratic $$O(n^2)$$ to linear $$O(n)$$ in the number of tokens $$n$$, enabling efficient processing of UAV image features.
We devise a FineGrained-Detect head, a remote-sensing specialized component that directly connects to high-resolution shallow features, bypassing destructive downsampling. Paired with a lightweight fusion unit and scale-specific detection branches, it balances detail preservation and computational overhead explicitly for UAV scenarios.

Under unified experimental conditions, ZY-DETR improves on its baseline in both accuracy and model size. On the challenging VisDrone2019 dataset, it achieves an inference speed of 59.31 FPS and raises the average precision AP@[0.5:0.95] from the RT-DETR baseline’s 20.3% to 23.5%. Notably, precision for small (AP_s), medium (AP_m), and large (AP_l) objects improves by 3.0, 3.6, and 7.7 percentage points, respectively. Cross-domain validation on the DOTA test set shows an AP of 60.0% at 85.61 FPS, with gains across all scales. These results indicate that ZY-DETR mitigates small object miss-detection and inefficient multi-scale fusion in UAV remote sensing.

Methodology

The overall architecture of the proposed ZY-DETR algorithm is illustrated below. It consists of three core components working in synergy: the GradNet heterogeneous backbone, the IST-Fusion linear-complexity cross-scale fusion module, and the FineGrained-Detect high-resolution detection head, followed by an end-to-end decoder. The system takes a 640×640 UAV image as input and outputs target bounding box coordinates and class probabilities. The entire model contains 15.46M parameters and runs at 59.31 FPS, which is intended to meet the deployment requirements of resource-constrained UAV edge devices. The components cooperate as follows: GradNet performs hierarchical heterogeneous feature extraction, providing dedicated feature support for objects at different scales; IST-Fusion achieves linear-complexity cross-scale feature alignment, compressing both channel and spatial redundancy; FineGrained-Detect directly connects to the highest-resolution shallow features to maximize small object detail retention; and the decoder performs set prediction via query selection, eliminating NMS-induced latency.

GradNet: A Heterogeneous Backbone Network

GradNet adopts a four-level hierarchical residual structure (C2 to C5), achieving 16x downsampling overall (from the 640×640 input to 40×40 at C5) through dedicated modules at each level. Its core innovation is a lightweight dual-path feature optimization strategy. The design goal is to balance small object detail preservation with medium/large object semantic refinement under a lightweight constraint, moving away from traditional homogeneous backbones. Shallow layers (C2/C3) focus on retaining fine-grained details like edges and textures, while deep layers (C4/C5) specialize in lightweight local-to-global feature refinement. The initial embedding stage is also optimized to minimize early feature loss.

Initial Embedding Stage: An input 640×640 UAV image is processed by two consecutive 3×3 convolutional layers (with strides of 1 and 2, and no pooling), which downsample the image to C2 features at 320×320 resolution. Using strided convolution instead of pooling avoids the small-object detail loss typical of early pooling operations, preserving the edge and texture information critical for subsequent small object detection in UAV imagery.
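
The resolutions quoted for the stem and the later levels of GradNet can be checked with the standard convolution output-size formula. The following is a minimal sketch, not the authors' implementation: the padding of 1 and the use of a single stride-2 3×3 convolution per later level are assumptions, and `conv_out` and `gradnet_resolutions` are hypothetical helper names.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def gradnet_resolutions(input_size: int = 640) -> dict:
    # Stem: two 3x3 convolutions with strides 1 and 2 (padding=1 is assumed).
    size = conv_out(input_size, 3, 1, 1)
    size = conv_out(size, 3, 2, 1)
    sizes = {"C2": size}
    # Assumption: each later level halves the resolution with one
    # stride-2 3x3 convolution.
    for level in ("C3", "C4", "C5"):
        size = conv_out(size, 3, 2, 1)
        sizes[level] = size
    return sizes


resolutions = gradnet_resolutions(640)
print(resolutions)  # {'C2': 320, 'C3': 160, 'C4': 80, 'C5': 40}
```

Under these assumptions the stride-1 convolution leaves the 640×640 input unchanged and every stride-2 convolution halves it, which reproduces the 320, 160, 80, and 40 pixel feature maps stated in the text.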

Shallow Feature Processing (C2/C3): Both C2 and C3 layers use a lightweight C2f (Cross-Stage Partial network) module. The processing flow is: 1×1 convolution for channel reduction -> feature split -> dual-branch bottleneck structure (with residual shortcut) -> feature concatenation -> 1×1 convolution for channel restoration. The parallel multi-branch design of the C2f module retains key fine-grained features of small objects with low computational overhead, supporting the detection of objects in crowded UAV scenes. The C2 layer outputs 320×320 features, while the C3 layer downsamples to 160×160 via a stride-2 convolution, enhancing semantic capacity while preserving detail.

Deep Feature Refinement (C4/C5): The C4 and C5 layers employ a composite C2f-CGSA module. Building upon feature extraction by the C2f module, it sequentially integrates CGLU and SHSA components in a “local-to-global” order for efficient refinement, controlling parameters through redundancy compression. The flow is: C2f feature extraction -> CGLU local feature refinement -> SHSA global context modeling -> output refined features.

CGLU (Convolutional Gated Linear Unit): Employs 3×3 depthwise convolution to extract local contour features of targets. A dynamic channel gating mechanism then generates channel weights, selectively retaining effective local features and suppressing redundant background interference for the subsequent global modeling stage.
SHSA (Single-Head Self-Attention): After layer normalization of the CGLU output, it models long-range spatial dependencies via Q-K dot-product attention, capturing the global distribution of objects within complex UAV scenes. The C5 layer outputs 40×40 features, where the SHSA’s receptive field is further expanded to strengthen the capture of medium and large aerial targets.
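
The local-to-global order of CGLU followed by SHSA can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the gate is assumed to be a sigmoid over globally pooled channel statistics, the attention output is assumed to be added back as a residual, and all weight shapes and the function names `depthwise_conv3x3`, `cglu`, and `shsa` are hypothetical.

```python
import numpy as np


def depthwise_conv3x3(x, w):
    """3x3 depthwise convolution with zero padding. x: [C, H, W], w: [C, 3, 3]."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out


def cglu(x, w_dw, w_gate):
    """Local refinement: depthwise features scaled by per-channel gates."""
    local = depthwise_conv3x3(x, w_dw)
    pooled = x.mean(axis=(1, 2))                      # [C]
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ pooled)))   # assumed sigmoid gate, [C]
    return local * gate[:, None, None]


def shsa(x, w_q, w_k, w_v, eps=1e-6):
    """Single-head self-attention over spatial tokens. x: [C, H, W]."""
    C, H, W = x.shape
    tokens = x.reshape(C, H * W).T                    # [n, C]
    mean = tokens.mean(axis=1, keepdims=True)
    std = tokens.std(axis=1, keepdims=True)
    normed = (tokens - mean) / (std + eps)            # layer norm over channels
    q, k, v = normed @ w_q, normed @ w_k, normed @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])            # Q-K dot products, [n, n]
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    out = tokens + attn @ v                           # assumed residual connection
    return out.T.reshape(C, H, W), attn


rng = np.random.default_rng(0)
C, H, W, d = 8, 6, 6, 4
x = rng.standard_normal((C, H, W))
local = cglu(x, 0.1 * rng.standard_normal((C, 3, 3)), 0.1 * rng.standard_normal((C, C)))
refined, attn = shsa(
    local,
    0.1 * rng.standard_normal((C, d)),
    0.1 * rng.standard_normal((C, d)),
    0.1 * rng.standard_normal((C, C)),
)
print(refined.shape, attn.shape)
```

Note that the attention matrix here has one row per spatial token, so this component is quadratic in the number of tokens; it is applied only at the low-resolution deep levels.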

Experimental results show that the GradNet backbone has 13.55M parameters, 31.8% fewer than the RT-DETR backbone (19.88M), and a computational cost of 50 GFLOPs, a reduction of 12.2%. This balances lightweight design against feature representation power and suits the resource constraints of UAV platforms.

Neck Network: IST-Fusion Module

The IST-Fusion module targets two problems: the $$O(n^2)$$ complexity of traditional fusion networks and the cross-scale redundancy of UAV imagery. Building upon the original AIFI module, it replaces the standard MHSA with our proposed TSSA, achieving efficient multi-scale feature alignment and fusion with linear $$O(n)$$ computational complexity. The input consists of multi-scale features C2-C5 from GradNet (C2:320×320, C3:160×160, C4:80×80, C5:40×40), where low-level features (C2/C3) contain fine-grained small object details and high-level features (C4/C5) contain global semantics for larger objects. IST-Fusion operates in three stages: channel redundancy compression, token statistical modeling, and multi-scale cross-layer fusion.

Adaptive Information Fusion (AIFI): To address semantic redundancy in the top-level C5 features, the AIFI module performs channel-wise redundancy compression by dynamically modeling inter-channel dependencies. The logic for generating channel weights and dimension mapping is defined as:
$$Z_{\text{AIFI}} = \sigma(\text{MLP}(\text{GlobalAvgPool}(X_{C5}))) \odot X_{C5}$$
where $$X_{C5}$$ is the original C5 feature with dimensions $$[B, C, H, W]$$ (B: batch size, C=512 channels, H=W=40 resolution). The $$\text{GlobalAvgPool}(\cdot)$$ operation pools over spatial dimensions $$(H, W)$$, outputting $$[B, C, 1, 1]$$ for global spatial aggregation. The $$\text{MLP}(\cdot)$$ employs a lightweight “1×1 Conv + ReLU + 1×1 Conv” structure with intermediate dimensions $$C \rightarrow C/4 \rightarrow C$$ (512→128→512), modeling non-linear channel dependencies with low cost. $$\sigma(\cdot)$$ is the Softmax function, normalizing the MLP output across the channel dimension to generate channel weights $$W_{\text{AIFI}} \in [B, C, 1, 1]$$. The $$\odot$$ denotes element-wise multiplication, where weights are broadcast to match $$X_{C5}$$’s dimensions, amplifying important channels and suppressing redundant ones. $$Z_{\text{AIFI}}$$ is the compact, compressed feature for C5, retaining its original dimensions $$[B, C, H, W]$$. The compression ratio in the MLP’s intermediate layer $$C/r$$ uses the baseline model’s setting, balancing redundancy reduction and feature expressiveness without introducing new hyperparameters.
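
The channel gating defined above can be written out as a short NumPy sketch. It is an illustration under assumptions, not the authors' implementation: biases are omitted, a 1×1 convolution on a $$[B, C, 1, 1]$$ tensor is expressed as a matrix product, and `aifi_channel_gate`, `w1`, and `w2` are hypothetical names.

```python
import numpy as np


def aifi_channel_gate(x, w1, w2):
    """x: [B, C, H, W]; w1: [C, C//4]; w2: [C//4, C]. Returns (Z_AIFI, weights)."""
    pooled = x.mean(axis=(2, 3))                    # GlobalAvgPool -> [B, C]
    hidden = np.maximum(pooled @ w1, 0.0)           # 1x1 conv + ReLU -> [B, C/4]
    logits = hidden @ w2                            # 1x1 conv -> [B, C]
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # Softmax over channels
    # Broadcast the [B, C] weights over the spatial dimensions.
    return x * weights[:, :, None, None], weights


rng = np.random.default_rng(0)
B, C, H, W = 2, 16, 5, 5
x = rng.standard_normal((B, C, H, W))
z_aifi, weights = aifi_channel_gate(
    x,
    0.1 * rng.standard_normal((C, C // 4)),
    0.1 * rng.standard_normal((C // 4, C)),
)
print(z_aifi.shape, weights.sum(axis=1))
```

Because the Softmax weights sum to 1 over all channels, each weight is on the order of $$1/C$$; whether the original model rescales the gated features afterwards is not stated in the text.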

Token Statistical Self-Attention (TSSA): To tackle the $$O(n^2)$$ computational complexity of the original MHSA, we propose TSSA as a replacement. It flattens features into tokens and models global context via second-order statistics, explicitly detailing dimension mapping and weight generation to achieve $$O(n)$$ complexity.

Tokenization & Projection: The output $$Z_{\text{AIFI}} \in [B, C, H, W]$$ is flattened and transposed into a token sequence $$T \in [B, n, C]$$, where $$n = H \times W = 1600$$ is the number of tokens. Each token corresponds to a spatial location. A projection matrix $$W_P \in \mathbb{R}^{C \times p}$$ (C=512, p=64) maps tokens to a lower-dimensional space: $$T_P = T \cdot W_P$$, where $$T_P \in [B, n, p]$$.
Second-Order Statistic Computation: Compute the mean ($$\mu$$) and variance ($$\sigma^2$$) of the projected tokens $$T_P$$ to capture global distribution characteristics:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} T_P[:, i, :], \quad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (T_P[:, i, :] - \mu)^2$$
Here, $$\mu \in [B, 1, p]$$ represents the global average feature intensity per dimension, and $$\sigma^2 \in [B, 1, p]$$ represents the global feature dispersion. The variance is normalized via a Z-score transformation and then clamped from below by a small constant $$\epsilon=10^{-6}$$ for numerical stability before taking the square root: $$\sigma' = \sqrt{\max(\sigma^2_{\text{normalized}}, \epsilon)}$$.
Attention Weight Generation & Weighted Transformation: Concatenate $$\mu$$ and $$\sigma'$$ along the channel dimension to form $$S = [\mu, \sigma'] \in [B, 1, 2p]$$ (2p=128), aggregating both “average intensity” and “distribution spread” information. A linear mapping matrix $$W_A \in \mathbb{R}^{2p \times n}$$ transforms S, followed by GELU activation and Softmax normalization over the token dimension to generate attention weights $$W_{\text{TSSA}} \in [B, 1, n]$$:
$$W_{\text{TSSA}} = \text{Softmax}(\text{GELU}(S \cdot W_A))$$
These weights are broadcast and applied to the projected tokens (with $$W_{\text{TSSA}}$$ transposed to $$[B, n, 1]$$ so that each token receives one scalar weight): $$T_R = T_P \odot W_{\text{TSSA}}$$, yielding refined tokens $$T_R \in [B, n, p]$$ in which important tokens are emphasized. Finally, a restoration matrix $$W_R \in \mathbb{R}^{p \times C}$$ maps $$T_R$$ back to the original channel dimension, and the result is transposed and reshaped into a 40×40 feature map $$X_{\text{TSSA}} \in [B, C, H, W]$$.
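
The three TSSA steps above can be traced shape by shape in the following NumPy sketch. It is a sketch under stated assumptions, not the authors' code: the Z-score normalization is assumed to standardize the variance across the $$p$$ projected dimensions (the text does not specify the axis), GELU uses the common tanh approximation, and the demo uses small sizes instead of C=512, p=64, n=1600.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tssa(z, w_p, w_a, w_r, eps=1e-6):
    """z: [B, C, H, W]; w_p: [C, p]; w_a: [2p, n]; w_r: [p, C]."""
    B, C, H, W = z.shape
    n = H * W
    tokens = z.reshape(B, C, n).transpose(0, 2, 1)            # T: [B, n, C]
    t_p = tokens @ w_p                                        # T_P: [B, n, p]
    mu = t_p.mean(axis=1, keepdims=True)                      # [B, 1, p]
    var = ((t_p - mu) ** 2).mean(axis=1, keepdims=True)       # [B, 1, p]
    # Assumption: Z-score taken across the p projected dimensions.
    var_norm = (var - var.mean(axis=2, keepdims=True)) / (
        var.std(axis=2, keepdims=True) + eps
    )
    sigma_prime = np.sqrt(np.maximum(var_norm, eps))          # [B, 1, p]
    s = np.concatenate([mu, sigma_prime], axis=2)             # S: [B, 1, 2p]
    w_tssa = softmax(gelu(s @ w_a), axis=2)                   # [B, 1, n]
    t_r = t_p * w_tssa.transpose(0, 2, 1)                     # [B, n, p] * [B, n, 1]
    out = (t_r @ w_r).transpose(0, 2, 1).reshape(B, C, H, W)  # X_TSSA
    return out, w_tssa


rng = np.random.default_rng(0)
B, C, H, W, p = 2, 16, 5, 5, 4
z = rng.standard_normal((B, C, H, W))
x_tssa, w_tssa = tssa(
    z,
    0.1 * rng.standard_normal((C, p)),
    0.1 * rng.standard_normal((2 * p, H * W)),
    0.1 * rng.standard_normal((p, C)),
)
print(x_tssa.shape, w_tssa.shape)
```

No $$n \times n$$ matrix is ever formed: every intermediate has at most one axis of length $$n$$, which is where the linear cost in the token count comes from. One consequence of $$W_A \in \mathbb{R}^{2p \times n}$$ is that its size is tied to a fixed number of tokens.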

The computational complexity of TSSA is $$O(2Cpn + Bnp)$$, which, for fixed $$C$$, $$p=64$$, and batch size, is linear in the token count, $$O(n)$$. This is a significant reduction from the $$O(n^2 C)$$ of standard self-attention and is the key advantage of replacing MHSA with TSSA. The ablation study (Table 2) supports this empirically: overall GFLOPs remain similar (57.1 vs. 57.0), yet inference is faster (93.52 vs. 76.28 FPS), which suggests that TSSA performs channel compression and feature alignment more efficiently within a comparable computational budget.
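
The scaling argument can be made concrete by counting multiplications per image as the token count grows. The tallies below are a rough sketch: the per-step counts are our own bookkeeping of the operations described above (not figures reported by the authors), and the self-attention count assumes four $$C \times C$$ projections plus the two $$n \times n$$ products.

```python
def tssa_mults(n, C=512, p=64):
    projection = n * C * p      # T_P = T . W_P
    statistics = 2 * n * p      # mean and variance over n tokens
    weight_map = 2 * p * n      # S . W_A with W_A in R^{2p x n}
    reweight = n * p            # T_P weighted by W_TSSA
    restoration = n * p * C     # T_R . W_R
    return projection + statistics + weight_map + reweight + restoration


def self_attention_mults(n, C=512):
    projections = 4 * n * C * C  # Q, K, V, and output projections
    scores = n * n * C           # Q K^T
    mixing = n * n * C           # attention weights times V
    return projections + scores + mixing


# Token counts of 40x40, 80x80, and 160x160 feature maps.
for n in (1600, 6400, 25600):
    print(n, tssa_mults(n), self_attention_mults(n))
```

Doubling $$n$$ exactly doubles the TSSA count, while the score and mixing terms of standard self-attention quadruple, so the gap widens quickly at the higher-resolution levels.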

Multi-Scale Cross-Layer Fusion: The refined feature $$X_{\text{TSSA}}$$ is fused with C2-C4 features to complement fine-grained details with global semantics.

Upsampling Alignment: $$X_{\text{TSSA}}$$ is upsampled by factors of 2, 4, and 8 to match the resolutions of C4 (80×80), C3 (160×160), and C2 (320×320), yielding $$X^{C4}_{\text{TSSA}}$$, $$X^{C3}_{\text{TSSA}}$$, $$X^{C2}_{\text{TSSA}}$$.
Channel Alignment: A 1×1 convolution unifies the channel count of the upsampled features and the corresponding C2-C4 features to a common $$C_{out}$$:
$$X^{\text{align}}_i = \text{Conv}_{1\times1}(X^{Ci}_{\text{TSSA}}) + \text{Conv}_{1\times1}(X_{Ci}), \quad i \in \{2,3,4\}$$
where $$X_{Ci}$$ are the original features, and $$X^{\text{align}}_i \in [B, C_{out}, H_i, W_i]$$.
Feature Fusion & Enhancement: The aligned features are passed through a lightweight RepBlock residual module, composed of two 3×3 Conv-ReLU residual units, to enhance feature interaction and produce the final fused features $$X^{\text{fusion}}_i$$:
$$X^{\text{fusion}}_i = \text{RepBlock}(X^{\text{align}}_i)$$
These final fused features are fed directly into the FineGrained-Detect head.
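
The upsampling and channel alignment steps can be sketched in NumPy as follows. This is a simplified illustration with assumptions: nearest-neighbor upsampling is assumed (the interpolation mode is not stated), the RepBlock is omitted, channel counts and spatial sizes are scaled down, and `upsample_nearest`, `conv1x1`, and `align` are hypothetical helper names.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of x: [B, C, H, W] by an integer factor."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)


def conv1x1(x, w):
    """1x1 convolution without bias. x: [B, C_in, H, W], w: [C_out, C_in]."""
    return np.einsum("oc,bchw->bohw", w, x)


def align(x_tssa, x_ci, factor, w_top, w_lat):
    """X_align_i = Conv1x1(upsampled X_TSSA) + Conv1x1(X_Ci)."""
    return conv1x1(upsample_nearest(x_tssa, factor), w_top) + conv1x1(x_ci, w_lat)


rng = np.random.default_rng(0)
x_tssa = rng.standard_normal((1, 8, 5, 5))          # stands in for the 40x40 map
laterals = {
    4: rng.standard_normal((1, 6, 10, 10)),         # stands in for C4 (2x)
    3: rng.standard_normal((1, 4, 20, 20)),         # stands in for C3 (4x)
    2: rng.standard_normal((1, 2, 40, 40)),         # stands in for C2 (8x)
}
c_out = 3
aligned = {}
for level, x_ci in laterals.items():
    factor = x_ci.shape[2] // x_tssa.shape[2]
    w_top = 0.1 * rng.standard_normal((c_out, x_tssa.shape[1]))
    w_lat = 0.1 * rng.standard_normal((c_out, x_ci.shape[1]))
    aligned[level] = align(x_tssa, x_ci, factor, w_top, w_lat)
    print(level, aligned[level].shape)
```

The two 1×1 convolutions are what allow the element-wise sum: they bring the upsampled top-level feature and the lateral feature to the same channel count $$C_{out}$$ at each level.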

FineGrained-Detect: A Remote-Sensing Specialized High-Resolution Head

The FineGrained-Detect head is designed to preserve small object details while balancing computational cost and accuracy. Unlike traditional approaches (FPN, PANet, YOLO series), which simply reuse “shallow feature fusion” ideas, it introduces three layers of customization for the UAV remote sensing scenario: fusion logic, fusion mechanism, and detection branch design. The overall architecture is: direct connection of shallow C2 features -> remote-sensing specialized lightweight fusion unit -> scale-specific detection branches -> bounding box and class prediction. The core differences from traditional methods are summarized in Table 1.

Table 1: Core Differences between FineGrained-Detect and Traditional Shallow Feature Fusion Methods
Dimension	FPN/PANet/BiFPN	YOLO Series	FineGrained-Detect (Ours)
Fusion Logic	Downsample + Upsample, indirect fusion	Semi-direct fusion after downsampling shallow features	No downsampling, direct connection of original high-res features
Fusion Mechanism	Generic homogeneous convolution, no redundancy suppression	Simple concatenation, not scenario-adapted	Specialized channel attention gating + depthwise separable convolution
Detection Branch	Shared branches across scales, feature mismatch	Multi-scale but homogeneous branches, no small-object customization	Scale-specific branches, customized for small objects

Innovation in Fusion Logic: Direct Connection of Original High-Resolution Features
Breaking away from the traditional “downsample-upsample” indirect logic, FineGrained-Detect directly connects to the 320×320 resolution C2 features from GradNet, skipping any further destructive downsampling (pooling, strided convolution). Compared to FPN’s indirect path or YOLO’s downsampled shallow layers, this approach avoids detail loss at its source, benefiting the detection of ultra-small objects (<16×16 pixels) prevalent in UAV imagery.

Innovation in Fusion Mechanism: Remote-Sensing Specialized Lightweight Unit
Tailored for the sparse small object features and high background redundancy of UAV images, we design a lightweight fusion unit comprising “Channel Attention Gating + Depthwise Separable Convolution” to replace generic convolutions or simple concatenation. The key optimizations are: 1) Channel Attention Gating: Selectively filters shallow C2 features at the channel level, suppressing background noise and enhancing edge/texture channels of small objects to improve feature discriminability. 2) Depthwise Separable Convolution: Replaces standard 3×3 convolutions for fusing shallow C2 features with the IST-Fusion output, achieving a lightweight design that controls computational overhead.
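
The saving from the depthwise separable replacement follows from a simple parameter count. The sketch below compares a standard 3×3 convolution with a depthwise 3×3 plus pointwise 1×1 pair; biases are ignored, and the channel width of 128 is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x k depthwise convolution plus a 1x1 pointwise convolution."""
    return c_in * k * k + c_in * c_out


c = 128  # hypothetical channel width
standard = standard_conv_params(c, c)
separable = depthwise_separable_params(c, c)
print(standard, separable, round(standard / separable, 2))  # 147456 17536 8.41
```

The multiply count per output location scales the same way, so on a 320×320 feature map the separable form reduces the cost of the fusion convolution by roughly the same factor.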

Innovation in Detection Branch: Scale-Specific Specialized Branches
We construct “scale-specific detection branches” to ensure precise matching between features and detection tasks, solving the “feature mismatch” and “structural homogeneity” of traditional methods.

320×320 Small-Object Branch: Built directly upon the high-res C2 features. The number of anchor boxes is increased from the typical 3 to 5, with sizes (relative to the 640×640 input) of 8×8, 12×12, 16×16, 20×20, and 24×24. These sizes are based on statistics from VisDrone2019 and DOTA, where most small objects are <0.01 of image area (<16×16). The 8×8 and 12×12 anchors cater to ultra-small objects, while the 16×16 to 24×24 anchors bridge the scale gap. Bounding box regression uses the CIoU loss for better localization precision.
160×160 / 80×80 Medium/Large Object Branches: Built upon IST-Fusion’s output features. They employ a lightweight decoupled head that separates feature extraction for classification and regression, removing redundant fully connected layers to reduce computation.
Cross-Scale Feature Interaction Gate: A lightweight gating mechanism is introduced between branches. It dynamically transmits only core semantic features of medium/large objects to the small-object branch, preventing cross-scale interference from background redundancy and improving multi-scale detection consistency.
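
For reference, the CIoU loss used by the small-object branch can be written in a few lines. The following is a plain-Python sketch of the commonly used CIoU formulation (IoU, normalized center distance, and an aspect-ratio term) for a single pair of boxes; it is not taken from the authors' code, and the box coordinates in the example are illustrative.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss (1 - CIoU) for two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    # Intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)
    # Squared distance between box centers.
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + rho2 / c2 + alpha * v


# An 8x8 prediction shifted by 2 pixels from an 8x8 ground-truth box.
print(round(ciou_loss((0, 0, 8, 8), (2, 0, 10, 8)), 4))
```

For boxes this small, a 2-pixel shift already lowers the IoU to 0.6, which illustrates why the center-distance term is useful: it still provides a gradient signal when predictions for tiny objects barely overlap the target.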

Furthermore, the FineGrained-Detect head incorporates multiple lightweight optimizations: using depthwise separable convolutions, retaining only channel attention (avoiding $$O(n^2)$$ spatial attention), and employing efficient branch design.

Experiments and Analysis

Experimental Datasets and Configuration

Datasets:

VisDrone2019 Dataset: A mainstream benchmark for UAV detection, containing 10 categories such as pedestrians, cars, and buses. Images feature complex backgrounds (day/night, sparse/dense scenes) and extreme scale variation, with most small objects <0.01 of image area. It includes 6,471 training, 548 validation, and 1,610 test images.
DOTA Dataset: A large-scale aerial image benchmark with 15 categories (vehicles, buildings, etc.). It features large scale variation, long-tailed class distribution, and background clutter (vegetation, water, structures), complementing VisDrone’s dense small-object scenes. It contains 2,806 training, 1,418 validation, and 1,723 test images for cross-domain generalization validation.

Unified Experimental Configuration: To ensure comparability, all tests use identical hardware and parameters for both datasets.

Hardware: NVIDIA RTX 3090 GPU, Intel i9-12900K CPU, 64GB RAM.
Software: PyTorch 2.2.2, CUDA 11.8, OpenCV 4.8.0.
Training: Input size 640×640, batch size 4, epochs 200. Optimizer: AdamW (lr=1e-4, weight decay=1e-4). Scheduler: Cosine annealing (min lr=1e-5). Object queries are initialized from RT-DETR.
Testing: Batch size 1, no augmentation, single-GPU inference. Metrics include FPS, GFLOPs, Params, AP@[0.5:0.95] (overall), AP@0.5, and AP_s/AP_m/AP_l for small (<32×32), medium (32×32–96×96), and large (>96×96) objects (COCO standard).
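
The size buckets behind AP_s, AP_m, and AP_l can be expressed as a small helper. This is a minimal sketch of the COCO area thresholds quoted above; assigning the exact boundary areas (32×32 and 96×96) to the medium bucket is a simplification made here, and `coco_size_bucket` is a hypothetical name.

```python
def coco_size_bucket(width: float, height: float) -> str:
    """Classify a box by pixel area using the COCO small/medium/large thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


# The five anchor sizes of the small-object branch, plus two larger boxes.
for side in (8, 12, 16, 20, 24, 50, 100):
    print(side, coco_size_bucket(side, side))
```

All five anchor sizes of the 320×320 branch (8×8 through 24×24) fall into the small bucket, so improvements from that branch are expected to show up primarily in AP_s.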

Ablation Study

To validate the effectiveness of the three core modules (GradNet, IST-Fusion, FineGrained-Detect) and the compute-accuracy trade-off of the detection head, we conduct an ablation study on the VisDrone2019 test set. Using RT-DETR as baseline (A), we incrementally add GradNet (B), IST-Fusion (C), and FineGrained-Detect (D). Results are in Table 2.

Table 2: Ablation Results of ZY-DETR Modules on VisDrone2019 Test Set
Baseline	GradNet	IST-Fusion	FineGrained-Detect	GFLOPs	Params	AP	AP50	AP_s	AP_m	AP_l	FPS	Model Size
✓				57.0	19.88M	0.203	0.355	0.112	0.300	0.352	76.28	77.0MB
✓	✓			50.0	13.55M	0.208	0.362	0.117	0.304	0.387	75.76	52.9MB
✓		✓		57.1	19.75M	0.206	0.358	0.114	0.300	0.374	93.52	76.5MB
✓	✓	✓		50.1	13.40M	0.213	0.368	0.119	0.311	0.372	92.65	52.4MB
✓	✓	✓	✓	116.0	15.46M	0.235	0.402	0.142	0.336	0.429	59.31	60.6MB

Adding GradNet alone (A+B): Compared to the baseline, parameters drop by 31.8% and GFLOPs by 12.2%, while speed remains stable. AP increases by 0.5 percentage points, with AP_l notably gaining 3.5 percentage points. This indicates that GradNet’s heterogeneous design achieves lightweighting while enhancing semantic refinement for larger UAV targets and preserving basic small-object details.

Adding IST-Fusion alone (A+C): This replaces MHSA with TSSA. While Params and GFLOPs remain similar, FPS jumps to 93.52, and AP improves by 0.3 percentage points. This supports TSSA’s efficiency: the linear attention mechanism yields faster inference and slightly better accuracy without increasing raw compute.

Adding both GradNet and IST-Fusion (A+B+C): Parameters stay low (13.40M), and AP rises to 0.213 (+1.0 percentage points over baseline), with gains across all scales (AP_s +0.7, AP_m +1.1, AP_l +2.0 percentage points). Speed remains high (92.65 FPS). This suggests that the two modules are complementary: IST-Fusion efficiently fuses the hierarchical features produced by GradNet.

Full ZY-DETR (A+B+C+D): With all modules, Params are 15.46M, GFLOPs increase to 116, and FPS drops to 59.31. AP rises substantially to 0.235 (+3.2 percentage points over baseline), with AP_s, AP_m, and AP_l improving by 3.0, 3.6, and 7.7 percentage points, respectively. Model size is 60.6MB, larger than the 52.4MB of A+B+C but still below the baseline’s 77.0MB. These results support the effectiveness of the FineGrained-Detect head: its high-resolution direct connection and scale-specific branches markedly boost small object accuracy, while the lightweight fusion unit limits the added overhead (parameters grow only from 13.40M to 15.46M, although GFLOPs roughly double relative to the baseline), balancing accuracy, speed, and model size for UAV applications.
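
The deltas quoted in this analysis can be reproduced directly from Table 2. The short script below is a convenience sketch that copies three rows of the table and recomputes the percentage-point gains and relative changes; `pp_gain` and `relative_change` are hypothetical helper names.

```python
ablation = {
    "A": {"params_m": 19.88, "gflops": 57.0, "ap": 0.203,
          "ap_s": 0.112, "ap_m": 0.300, "ap_l": 0.352, "fps": 76.28},
    "A+B": {"params_m": 13.55, "gflops": 50.0, "ap": 0.208,
            "ap_s": 0.117, "ap_m": 0.304, "ap_l": 0.387, "fps": 75.76},
    "A+B+C+D": {"params_m": 15.46, "gflops": 116.0, "ap": 0.235,
                "ap_s": 0.142, "ap_m": 0.336, "ap_l": 0.429, "fps": 59.31},
}


def pp_gain(config: str, metric: str) -> float:
    """Gain over baseline A in percentage points (AP values are fractions)."""
    return round((ablation[config][metric] - ablation["A"][metric]) * 100, 1)


def relative_change(config: str, key: str) -> float:
    """Relative change over baseline A in percent."""
    base = ablation["A"][key]
    return round((ablation[config][key] - base) / base * 100, 1)


full = "A+B+C+D"
print([pp_gain(full, m) for m in ("ap", "ap_s", "ap_m", "ap_l")])  # [3.2, 3.0, 3.6, 7.7]
print(relative_change("A+B", "params_m"), relative_change(full, "params_m"))  # -31.8 -22.2
print(relative_change(full, "gflops"), relative_change(full, "fps"))  # 103.5 -22.2
```

Read together, the last two lines summarize the trade-off of the full model: about 22% fewer parameters than the baseline, but roughly twice the GFLOPs and about 22% lower FPS in exchange for the accuracy gains.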

Comparison with State-of-the-Art Detectors

To comprehensively evaluate ZY-DETR’s competitiveness and the innovation of its core modules, we compare it against current advanced detectors on the VisDrone2019 test set. The compared models span the three main detection paradigms: one-stage (YOLO series), two-stage (Faster R-CNN, Cascade R-CNN), and Transformer-based (DINO, RT-DETR). Results are in Table 3.

Table 3: Performance Comparison of Different Models on VisDrone2019
Model	Input Shape	GFLOPs	Params	AP	AP50	AP_s	AP_m	AP_l
YOLO8m	(640,640)	78.7	25.85M	0.190	0.332	0.060	0.294	0.417
YOLO8s	(640,640)	28.5	11.13M	0.173	0.307	0.078	0.269	0.372
YOLO10m	(640,640)	58.9	15.32M	0.195	0.345	0.097	0.300	0.414
YOLO10s	(640,640)	21.4	7.22M	0.179	0.323	0.086	0.278	0.361
YOLO11m	(640,640)	67.7	20.04M	0.203	0.350	0.098	0.312	0.413
YOLO11s	(640,640)	21.3	9.42M	0.176	0.313	0.080	0.272	0.364
YOLO12m	(640,640)	67.2	20.11M	0.192	0.336	0.094	0.298	0.386
YOLO12s	(640,640)	21.2	9.23M	0.176	0.312	0.081	0.274	0.356
RT-DETR-R18	(640,640)	57.0	19.88M	0.203	0.355	0.112	0.300	0.352
Faster R-CNN-R50-FPN	(768,1344)	208	41.39M	0.194	0.329	0.095	0.309	0.429
Cascade R-CNN-R50-FPN	(768,1344)	236	69.29M	0.197	0.326	0.099	0.309	0.406
DINO	(750,1333)	274	47.56M	0.253	0.445	0.150	0.371	0.503
ZY-DETR (Ours)	(640,640)	116.0	15.46M	0.235	0.402	0.142	0.336	0.429

ZY-DETR achieves an AP of 23.5%, second only to DINO (25.3%). However, DINO’s compute cost (274 GFLOPs) and parameter count (47.56M) are roughly 2.4x and 3.1x those of ZY-DETR, making it impractical for resource-limited UAV platforms. Compared to its baseline RT-DETR-R18, ZY-DETR improves AP by 3.2 percentage points while reducing parameters by 22.2%, demonstrating a favorable efficiency-accuracy balance. The specific advantages are threefold:

Leading Small Object Performance: ZY-DETR’s AP_s (14.2%) surpasses all one-stage YOLO models (the best, YOLO11m, reaches 9.8%) and is only 0.8 percentage points behind DINO’s 15.0%, while requiring 57.6% fewer GFLOPs. This directly addresses the “small object detail loss” problem in UAV imagery and is attributable to GradNet’s detail retention and the FineGrained-Detect head’s design.
Significantly Lower Complexity than Traditional Detectors: Two-stage models (Faster/Cascade R-CNN) are excessively heavy (208-236 GFLOPs) and perform poorly on small objects (AP_s <10%). ZY-DETR reduces GFLOPs by 44.2% compared to Faster R-CNN while improving AP by 4.1 and AP_s by 4.7 percentage points, showing a clear advantage over traditional paradigms, thanks in part to IST-Fusion’s linear-complexity TSSA.
Balanced Multi-Scale Performance: Compared to RT-DETR-R18, ZY-DETR improves AP_s, AP_m, and AP_l by 3.0, 3.6, and 7.7 percentage points, respectively. Unlike some models that excel only on small objects, ZY-DETR maintains competitive large-object accuracy (42.9%), indicating that it can handle the extreme scale variations typical of UAV scenes, a result of the synergistic GradNet and IST-Fusion design.

Generalization Experiment

We evaluate the generalization capability of ZY-DETR by comparing it with the baseline RT-DETR on the DOTA dataset test and validation sets under the same unified configuration. Results focus on detection accuracy, computational efficiency, and deployment feasibility, as shown in Table 4.

Table 4: Analysis of Generalization Experiments on DOTA Dataset
Model	Split	GFLOPs	Params	AP	AP50	AP_s	AP_m	AP_l	FPS	Model Size
RT-DETR	Test	57.0	19.89M	0.574	0.797	0.356	0.597	0.724	81.13	77.0MB
ZY-DETR	Test	116.1	15.47M	0.600	0.808	0.403	0.624	0.731	85.61	60.6MB
RT-DETR	Val	57.0	19.89M	0.588	0.818	0.362	0.622	0.662	78.36	77.0MB
ZY-DETR	Val	116.1	15.47M	0.615	0.830	0.401	0.658	0.678	82.03	60.6MB

The key results confirm ZY-DETR’s strong generalization and robustness on DOTA:

Comprehensive Accuracy Gain: On the test set, AP improves by 2.6 percentage points (0.574→0.600); on the validation set, by 2.7 percentage points (0.588→0.615). Small object AP_s on the test set rises by 4.7 percentage points, while large object AP_l remains high (0.724→0.731), demonstrating improvement across all scales.
Favorable Efficiency: With 15.47M parameters (22.2% fewer than RT-DETR) and an inference speed of 85.61 FPS on the test set, ZY-DETR shows its adaptability to different UAV remote sensing scenarios and its potential for deployment on resource-constrained aerial platforms.

Conclusion

This work addresses the core problems of small object feature sparsity, high multi-scale fusion complexity, and the difficult balance between resource consumption and detection accuracy in unmanned aerial vehicle remote sensing imagery. We propose ZY-DETR, a small object detection algorithm that integrates heterogeneous feature extraction, linear-complexity fusion, and a remote-sensing specialized detection head to achieve more comprehensive and accurate target recognition. Experimental results on VisDrone2019 and DOTA show consistent gains over the RT-DETR baseline on UAV small object detection, supporting the effectiveness of the approach. While presented in the context of UAV detection, the proposed multi-scale feature fusion methodology is applicable to other remote sensing detection tasks such as traffic surveillance and aerial search and rescue.

This research has certain limitations. First, the model’s parameter count and computational overhead, though reduced in parameters, still have room for optimization to fit extremely low-power UAV platforms. Second, the robustness of the TSSA mechanism in capturing features of ultra-small objects needs further enhancement to adapt to more complex remote sensing environments. Future work will focus on: 1) Further reducing model size and computation via quantization, pruning, and knowledge distillation for edge deployment on UAVs; 2) Optimizing the TSSA mechanism and multi-scale fusion strategy to improve robustness for ultra-small object detection; 3) Refining algorithms across all pipeline stages to achieve a better balance among accuracy, real-time performance, and lightweight design for the demanding domain of unmanned aerial vehicle perception.