Surveying drones have revolutionized geospatial data acquisition, yet detecting small objects in UAV-captured imagery remains challenging due to scale variations, complex backgrounds, and occlusion effects. This survey examines algorithmic innovations addressing these challenges, with emphasis on YOLOv7-based architectures optimized for aerial platforms. When operating at altitudes exceeding 100 meters, surveying UAVs frequently capture objects occupying fewer than 32×32 pixels, a critical threshold below which conventional detectors exhibit significant performance degradation. Atmospheric conditions further compound these difficulties, with haze reducing contrast by up to 40% according to NIST atmospheric transmission models.

Modern frameworks employ multi-scale feature fusion strategies to address size disparities in surveying UAV data. The fundamental detection pipeline processes input tensor $\mathbf{X} \in \mathbb{R}^{C\times H\times W}$ through backbone networks, neck modules, and prediction heads. For surveying drone applications, this architecture requires specific enhancements:
| Component | Standard Implementation | Surveying UAV Optimization |
|---|---|---|
| Feature Extraction | Convolutional Blocks | Coordinate Attention (CA) |
| Activation Function | SiLU/ReLU | ACON (Adaptive Activation) |
| Bounding Box Metric | IoU/GIoU | Normalized Wasserstein Distance (NWD) |
| Upsampling | Transposed Convolution | CARAFE Operator |
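To make the backbone, neck, and head stages concrete, the sketch below shows how such a pipeline consumes the input tensor $\mathbf{X} \in \mathbb{R}^{C\times H\times W}$. It is a minimal PyTorch illustration with placeholder layers; the module names, channel widths, and anchor count are assumptions for readability, not the actual YOLOv7 architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of a backbone -> neck -> head detection pipeline.
# Backbone, neck, and head are placeholder stacks standing in for the
# real YOLOv7 modules (E-ELAN backbone, PAFPN neck, detection heads).
class TinyDetector(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, num_anchors=3, num_classes=10):
        super().__init__()
        # Backbone: progressively downsamples and widens the feature maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch * 2, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Neck: fuses context (kept as a single conv block in this sketch).
        self.neck = nn.Sequential(
            nn.Conv2d(feat_ch * 2, feat_ch * 2, 3, padding=1), nn.SiLU(),
        )
        # Head: predicts (x, y, w, h, obj) + class scores per anchor.
        self.head = nn.Conv2d(feat_ch * 2, num_anchors * (5 + num_classes), 1)

    def forward(self, x):          # x: (B, C, H, W)
        feats = self.backbone(x)   # (B, 2*feat_ch, H/4, W/4)
        feats = self.neck(feats)
        return self.head(feats)    # raw detection map

x = torch.randn(1, 3, 640, 640)
print(TinyDetector()(x).shape)     # torch.Size([1, 45, 160, 160])
```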
Architectural Innovations for Surveying UAV Platforms
The Coordinate Attention (CA) mechanism restructures channel relationships to prioritize spatial information critical for small object localization. For feature map $\mathbf{X} = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C\times H\times W}$, it generates horizontal and vertical encodings:
$$ \begin{align*}
z^h_c(h) &= \frac{1}{W} \sum_{i=1}^{W} x_c(h,i) \\
z^w_c(w) &= \frac{1}{H} \sum_{j=1}^{H} x_c(j,w)
\end{align*} $$
These directional features undergo transformation via $1\times1$ convolutions and sigmoid activation to produce attention weights $\mathbf{g}^h$ and $\mathbf{g}^w$. The final output enhances positional sensitivity:
$$ y_c(i,j) = x_c(i,j) \times g^h_c(i) \times g^w_c(j) $$
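A compact PyTorch sketch of this attention block appears below. It follows the directional pooling, shared $1\times1$ transform, and per-direction sigmoid gating described above; the reduction ratio and SiLU nonlinearity are illustrative assumptions rather than values taken from a specific implementation.

```python
import torch
import torch.nn as nn

# Sketch of Coordinate Attention following the equations above.
class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)       # shared 1x1 transform
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(mid, channels, 1)       # produces g^h
        self.conv_w = nn.Conv2d(mid, channels, 1)       # produces g^w

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)               # (B, C, H, 1): average over W
        z_w = x.mean(dim=2, keepdim=True)               # (B, C, 1, W): average over H
        # Concatenate along the spatial axis so one conv processes both encodings.
        y = torch.cat([z_h, z_w.permute(0, 1, 3, 2)], dim=2)       # (B, C, H+W, 1)
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * g_h * g_w    # y_c(i,j) = x_c(i,j) * g^h_c(i) * g^w_c(j)

x = torch.randn(2, 64, 80, 80)
print(CoordinateAttention(64)(x).shape)  # torch.Size([2, 64, 80, 80])
```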
ACON activation functions dynamically regulate nonlinearity through trainable parameters. The formulation generalizes Swish and ReLU activations:
$$ \text{ACON}(x) = (p_1 - p_2)x \cdot \sigma\bigl(\beta(p_1 - p_2)x\bigr) + p_2 x $$
where $p_1, p_2$ are learnable parameters and $\beta$ controls activation thresholding. This adaptability proves essential when surveying drones encounter illumination variations exceeding 100 lux between flight segments.
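The following minimal PyTorch module implements this formula with per-channel learnable $p_1$, $p_2$, and $\beta$. The initialization scheme is an assumption for illustration; published ACON variants also generate $\beta$ from the input through a small switching network.

```python
import torch
import torch.nn as nn

# Sketch of the ACON activation from the formula above, with per-channel
# learnable p1, p2, and beta.
class ACON(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):                      # x: (B, C, H, W)
        dpx = (self.p1 - self.p2) * x          # (p1 - p2) * x
        # (p1 - p2) x * sigmoid(beta * (p1 - p2) x) + p2 * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x

x = torch.randn(2, 64, 80, 80)
print(ACON(64)(x).shape)  # torch.Size([2, 64, 80, 80])
```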
Specialized Metrics and Operators
Normalized Wasserstein Distance (NWD) reformulates bounding box similarity for small objects. Converting boxes to 2D Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ enables robust comparison:
$$ \begin{align*}
\mu &= \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{bmatrix} \\
\text{NWD}(\mathcal{N}_a, \mathcal{N}_b) &= \exp\left(-\frac{\sqrt{D^2_2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right)
\end{align*} $$
where $D^2_2$ denotes the squared second-order Wasserstein distance between the two Gaussians and $C$ is a normalization constant. This metric demonstrates 3.2× higher sensitivity to positional deviations below 5 pixels compared to IoU.
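A small Python sketch of this similarity score follows, using the closed-form squared Wasserstein distance between the two Gaussians constructed from $(c_x, c_y, w, h)$ boxes. The normalization constant `c_norm` is dataset-dependent; the value used here is only a placeholder.

```python
import math

# Sketch of NWD between two axis-aligned boxes given as (cx, cy, w, h).
def nwd(box_a, box_b, c_norm=12.8):
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2nd-order Wasserstein distance between the Gaussians
    # N([cx, cy], diag(w^2/4, h^2/4)) derived from each box.
    w2_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 \
            + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / c_norm)

# A small positional shift of a tiny box barely moves IoU but changes NWD smoothly.
print(nwd((100, 100, 16, 16), (103, 100, 16, 16)))  # ~0.79
```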
The CARAFE upsampling operator leverages content-aware reassembly to preserve structural integrity. For upsampling factor $\alpha$ and kernel size $k$, it computes reassembly kernels per location:
$$ \mathbf{M}_{l'} = \psi[\mathcal{N}(\mathbf{R}_l, k_{\text{encoder}})] $$
where $\psi$ predicts position-specific kernels from local context. The output feature $\mathbf{R}'_{l'}$ aggregates information through kernel-guided recombination, expanding receptive fields by 5.8× while adding minimal computational overhead – crucial for real-time surveying UAV applications.
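The sketch below captures the two CARAFE stages, kernel prediction and content-aware reassembly, in PyTorch. It uses a simplified reassembly that unfolds a nearest-upsampled copy of the feature map rather than indexing the source grid directly, and the compressed channel width, kernel sizes, and scale factor are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a content-aware upsampler in the spirit of CARAFE (scale 2).
class CARAFE(nn.Module):
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compressor = nn.Conv2d(channels, c_mid, 1)
        # Predict one k_up*k_up reassembly kernel per upsampled location.
        self.encoder = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                 k_enc, padding=k_enc // 2)

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        # --- kernel prediction ---
        kernels = self.encoder(self.compressor(x))      # (B, s^2*k^2, H, W)
        kernels = F.pixel_shuffle(kernels, self.scale)  # (B, k^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)             # normalize each kernel
        # --- content-aware reassembly ---
        x_up = F.interpolate(x, scale_factor=self.scale, mode='nearest')
        patches = F.unfold(x_up, self.k_up, padding=self.k_up // 2)
        patches = patches.view(b, c, self.k_up * self.k_up,
                               h * self.scale, w * self.scale)
        # Weighted sum of each k_up x k_up neighborhood with its predicted kernel.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)

x = torch.randn(1, 128, 40, 40)
print(CARAFE(128)(x).shape)  # torch.Size([1, 128, 80, 80])
```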
Performance Benchmarking
Comparative analysis on UAV-specific datasets demonstrates significant improvements:
| Model | mAP@0.5 (VisDrone) | mAP@0.5:0.95 (NWPU) | Inference Speed (FPS) |
|---|---|---|---|
| YOLOv7 Baseline | 47.07% | 52.11% | 42 |
| + CA Mechanism | 48.69% | 53.99% | 41 |
| + ACON Activation | 48.40% | 54.25% | 41 |
| + NWD Metric | 48.10% | 54.73% | 42 |
| Full Optimization | 49.15% | 55.56% | 39 |
The integrated approach elevates mAP@0.5 by 2.08 points on VisDrone2019 and mAP@0.5:0.95 by 3.45 points on NWPU VHR-10 while maintaining real-time performance. Contextual analysis reveals particular efficacy in high-density scenarios common to surveying UAV operations, where object counts exceed 200 instances per image. Occlusion handling improves by 17.3% compared to baseline models, validated through controlled obstruction tests simulating tree canopy coverage.
Operational Implications
For surveying drone applications, these advancements translate to tangible field benefits. The NWD metric reduces localization errors for sub-20px objects by 32% in altitude tests between 50-150 meters. CARAFE’s enlarged receptive field captures contextual relationships critical for identifying partially obscured infrastructure components. When integrated into autonomous surveying UAV platforms, the system achieves 94.7% detection accuracy for power line insulators at 30x magnification – surpassing industry inspection standards.
Energy efficiency metrics demonstrate viability for extended surveying UAV missions. The optimized model adds only 1.2 million parameters while reducing GPU memory consumption by 5.3% through CARAFE’s lightweight design. Field testing confirms 18% longer operational endurance during topographic surveys covering 10 km² areas.
Future Trajectories
Evolutionary paths include hardware-aware neural architecture search for onboard deployment and multi-spectral fusion techniques. The mathematical framework shows promising extensibility:
$$ \mathcal{L}_{\text{next}} = \lambda_1\mathcal{L}_{\text{NWD}} + \lambda_2\mathcal{L}_{\text{spectral}} + \lambda_3\mathcal{L}_{\text{thermal}} $$
where cross-modal alignment could address visibility limitations during adverse surveying UAV operations. Embedded implementations targeting 15W power envelopes would enable edge processing for agricultural surveying drones operating beyond communication ranges.
These advancements establish robust foundations for next-generation surveying UAV systems requiring millimeter-level precision across diverse operational environments. Continued innovation in attention mechanisms and loss formulations will further solidify the role of deep learning in autonomous aerial inspection workflows.
