Maritime search and rescue operations are challenging because sea backgrounds are complex, targets are small, and mobile platforms have limited computational resources. We propose an enhanced YOLOv11 architecture optimized for small-target detection from surveying UAVs. Our approach addresses three limitations of existing models through structural improvements validated on the SeaDronesSee dataset.

Architectural Innovations
Wavelet Transform Effect Convolution (WTEConv)
The computational cost of a conventional convolution grows quadratically with kernel size. WTEConv combines Shift-Wise operations with wavelet decomposition to keep a large receptive field while minimizing parameters. For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we first apply channel-wise spatial shifting:
$$y(p) = \sum_{m=0}^{k_w-1} \sum_{n=0}^{k_h-1} w(\Delta p(m,n)) \cdot x(p + \Delta p(m,n))$$
$$\Delta p(m,n) = \left( m - \left\lfloor \frac{k_w}{2} \right\rfloor, n - \left\lfloor \frac{k_h}{2} \right\rfloor \right)$$
followed by Haar wavelet decomposition:
$$[X_{LL}, X_{LH}, X_{HL}, X_{HH}] = f_{WT}(X)$$
$$f_{LL} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, f_{LH} = \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, f_{HL} = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}, f_{HH} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$$
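As an illustration of the decomposition $f_{WT}$, the following is a minimal NumPy sketch of a one-level 2D Haar split using the unnormalised filters exactly as written above. The function names `haar_decompose` and `haar_reconstruct` are hypothetical, the input is assumed to be a single-channel array with even height and width, and the 1/4 factor in the inverse is only there to compensate for the missing filter normalisation; it is not taken from the paper.

```python
import numpy as np

def haar_decompose(x):
    """One-level 2D Haar split of an (H, W) array with even H and W.

    Each band is a stride-2 correlation of x with one 2x2 filter.
    """
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = a + b + c + d  # f_LL = [[1, 1], [1, 1]]
    lh = a + b - c - d  # f_LH = [[1, 1], [-1, -1]]
    hl = a - b + c - d  # f_HL = [[1, -1], [1, -1]]
    hh = a - b - c + d  # f_HH = [[1, -1], [-1, 1]]
    return ll, lh, hl, hh

def haar_reconstruct(ll, lh, hl, hh):
    """Invert haar_decompose (assumed inverse for unnormalised filters)."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 4
    x[0::2, 1::2] = (ll + lh - hl - hh) / 4
    x[1::2, 0::2] = (ll - lh + hl - hh) / 4
    x[1::2, 1::2] = (ll - lh - hl + hh) / 4
    return x
```

Each sub-band has half the spatial resolution of the input, so a small kernel applied to a sub-band covers a larger area of the original feature map. A constant input produces zero `lh`, `hl`, and `hh` bands, since only `ll` carries the low-frequency content.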
Combining shifting with wavelet decomposition expands the receptive field by 41% and uses 30.8% fewer parameters than a conventional 5×5 convolution.
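The sum over kernel offsets $\Delta p(m,n)$ can also be read as a sum of shifted, weighted copies of the input, which is the view that shift-based operators build on. The sketch below shows only that equivalence for a single channel; it is not the authors' Shift-Wise operator. The function name `shifted_sum_conv`, the zero padding, and the restriction to odd kernel sizes are assumptions made for illustration.

```python
import numpy as np

def shifted_sum_conv(x, kernel):
    """'Same'-size 2D correlation of x (H, W) with an odd-sized kernel (k_h, k_w),
    computed as a sum of shifted copies of the zero-padded input."""
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    h, w = x.shape
    xp = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))  # zero padding (assumed)
    y = np.zeros((h, w), dtype=float)
    for n in range(k_h):
        for m in range(k_w):
            # This slice is x shifted by the offset
            # (m - floor(k_w / 2), n - floor(k_h / 2)) relative to p.
            y += kernel[n, m] * xp[n:n + h, m:m + w]
    return y
```

A dense $k \times k$ kernel needs $k^2$ weights per input-output channel pair, which is the quadratic growth that the shifted formulation is meant to avoid.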
Multi-Branch Upsampling (MUpsample)
Standard upsampling degrades feature quality. MUpsample preserves spatial detail by processing three branches in parallel:
$$X_{up} = \text{Up}(X_{l-1})$$
$$X_{cur} = \text{Conv}_{1 \times 1}(X_l)$$
$$X_{next} = \text{WTConv}(X_{l+1})$$
$$X_{\text{MUpsample}} = \text{Concat}(X_{up}, X_{cur}, X_{next})$$
The structure preserves the original feature dimensions while capturing more of the high-frequency information that surveying UAV operations depend on.
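The three-branch concatenation can be sketched in NumPy as follows. Several choices here are assumptions rather than details from the paper: $X_{l-1}$ is taken to have half the resolution of $X_l$ and $X_{l+1}$ twice the resolution, nearest-neighbour interpolation stands in for $\text{Up}$, and the Haar LL band stands in for the full WTConv branch. All function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: mix channels with a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, x)

def haar_ll(x):
    """Stand-in for WTConv (assumption): keep the Haar LL band, halving H and W."""
    return (x[:, 0::2, 0::2] + x[:, 0::2, 1::2]
            + x[:, 1::2, 0::2] + x[:, 1::2, 1::2])

def mupsample(x_prev, x_cur, x_next, weight):
    """Concatenate the three branches along the channel axis."""
    x_up = upsample_nearest(x_prev)   # X_up   = Up(X_{l-1})
    x_c = conv1x1(x_cur, weight)      # X_cur  = Conv_1x1(X_l)
    x_n = haar_ll(x_next)             # X_next = WTConv(X_{l+1}), simplified
    return np.concatenate([x_up, x_c, x_n], axis=0)
```

In this sketch the concatenated output has three times the channel count at the spatial size of $X_l$; a following projection back to the original channel count would be needed and is not shown.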
Lightweight Dynamic Head (LDy Head)
We extend detection to 160×160 feature maps and apply attention along three dimensions, level ($\pi_L$), spatial ($\pi_S$), and channel ($\pi_C$):
$$W = \pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F$$
$$\pi_L(F) = \sigma(f(\text{AvgPool}(F)))$$
$$\pi_S(F) = \frac{1}{1 + e^{-\mathcal{G}(F)}}$$
$$\pi_C(F) = \text{Norm}(\mathcal{F}(\delta(\mathcal{F}(\text{AvgPool}(F)))))$$
This configuration improves small-target sensitivity by 18.7% while adding only 0.2M parameters.
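The nested attention $\pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F$ can be read as three gates applied in sequence. The sketch below assumes a feature tensor of shape (levels, positions, channels) and fills the unspecified operators with simple stand-ins: a scalar weight for $f$, a hard sigmoid for $\sigma$, a linear map for $\mathcal{G}$, two fully connected layers with ReLU for $\mathcal{F}$ and $\delta$, and a sigmoid for $\text{Norm}$. These choices and all names are assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_sigmoid(z):
    return np.clip((z + 1.0) / 2.0, 0.0, 1.0)

def level_attention(feat, w_l):
    """pi_L: one gate per pyramid level from globally pooled features."""
    pooled = feat.mean(axis=(1, 2))               # shape (L,)
    return hard_sigmoid(w_l * pooled)[:, None, None]

def spatial_attention(feat, w_s):
    """pi_S: one sigmoid gate per (level, position); G is linear here."""
    logits = feat @ w_s                           # shape (L, S)
    return sigmoid(logits)[:, :, None]

def channel_attention(feat, w1, w2):
    """pi_C: squeeze-and-excitation style gate per channel."""
    pooled = feat.mean(axis=(0, 1))               # shape (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)         # delta taken as ReLU
    return sigmoid(w2 @ hidden)[None, None, :]    # sigmoid as Norm stand-in

def dynamic_head(feat, w_l, w_s, w1, w2):
    """Apply the level, spatial, and channel gates in sequence."""
    feat = level_attention(feat, w_l) * feat
    feat = spatial_attention(feat, w_s) * feat
    feat = channel_attention(feat, w1, w2) * feat
    return feat
```

Because every gate lies in [0, 1], the output keeps the input shape and never exceeds the input in magnitude; the gates only reweight levels, positions, and channels.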
Experimental Validation
Performance Metrics
Evaluation on the SeaDronesSee dataset shows the contribution of each module and of the full model relative to the YOLOv11n baseline:
| Model | P(%) | R(%) | mAP50(%) | mAP50-90(%) | Params(M) | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|
| YOLOv11n | 77.8 | 57.8 | 62.8 | 37.3 | 2.58 | 70.3 | 6.3 |
| +WTEConv | 81.1 | 60.8 | 67.6 | 39.9 | 2.7 | 71.7 | 6.6 |
| +MUpsample | 78.9 | 66.6 | 71.0 | 38.6 | 2.5 | 79.7 | 7.8 |
| +LDy Head | 79.1 | 62.4 | 69.9 | 40.1 | 2.8 | 70.4 | 10.9 |
| Full Model | 85.5 | 69.5 | 75.2 | 42.7 | 2.2 | 89.8 | 9.7 |
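The gains of the full model over the baseline can be recomputed directly from the first and last rows of the table. The short script below uses only those table values; the helper names are illustrative.

```python
# Values copied from the YOLOv11n and Full Model rows of the table above.
baseline = {"mAP50": 62.8, "params_M": 2.58, "fps": 70.3, "gflops": 6.3}
full = {"mAP50": 75.2, "params_M": 2.2, "fps": 89.8, "gflops": 9.7}

def absolute_gain(new, old):
    """Difference in percentage points for metrics already given in percent."""
    return new - old

def relative_change_pct(new, old):
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0

map_gain = absolute_gain(full["mAP50"], baseline["mAP50"])
param_change = relative_change_pct(full["params_M"], baseline["params_M"])
fps_change = relative_change_pct(full["fps"], baseline["fps"])
gflops_change = relative_change_pct(full["gflops"], baseline["gflops"])

print(f"mAP50 gain: {map_gain:.1f} points")      # 12.4 points
print(f"Parameter change: {param_change:.1f}%")  # -14.7%
print(f"FPS change: {fps_change:.1f}%")          # 27.7%
print(f"GFLOPs change: {gflops_change:.1f}%")    # 54.0%
```

The mAP50 improvement is an absolute difference in percentage points, whereas the parameter reduction is a relative change. The same rows also show GFLOPs rising from 6.3 to 9.7 even though the parameter count falls.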
Comparative Analysis
Compared with state-of-the-art detectors, our model achieves the highest mAP50 and mAP50-90 with the fewest parameters, and only YOLOv8n runs at a higher frame rate:
| Method | mAP50(%) | mAP50-90(%) | Params(M) | FPS |
|---|---|---|---|---|
| YOLOv8n | 63.3 | 37.6 | 3.0 | 95.8 |
| YOLOv10n | 64.1 | 38.2 | 2.7 | 63.7 |
| RT-DETR | 71.3 | 38.4 | 19.8 | 48.8 |
| Drone-YOLO | 70.7 | 39.7 | 3.0 | 79.4 |
| Ours | 75.2 | 42.7 | 2.2 | 89.8 |
The precision-recall curve demonstrates consistent outperformance across confidence thresholds, with precision and recall defined as
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
The model achieves 85.5% precision at 69.5% recall, which is critical for life-saving surveying UAV operations.
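For reference, the two definitions translate directly into code. The sketch below also derives an F1 score from the reported operating point; F1 is not reported in the source and is included only as a derived illustration. The function names are hypothetical, and returning zero for empty denominators is a convention chosen here.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported operating point of the full model: P = 85.5%, R = 69.5%.
f1_reported = f1_score(0.855, 0.695)
print(f"Derived F1: {f1_reported:.3f}")  # 0.767
```

With hypothetical counts of 90 true positives, 10 false positives, and 30 false negatives, `precision_recall(90, 10, 30)` gives a precision of 0.9 and a recall of 0.75.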
Mathematical Formulation
The complete enhancement workflow integrates all components:
$$\mathcal{F}_{\text{enhanced}} = \Phi_{\text{LDy}} \left( \Gamma_{\text{MUpsample}} \left( \Psi_{\text{WTEConv}} (I_{\text{input}}) \right) \right)$$
$$\Psi_{\text{WTEConv}} = \text{IWT} \left( \sum_{k=1}^{K} \text{WTConv} \left( \text{Shift-Wise}_k (X) \right) \right)$$
$$\Gamma_{\text{MUpsample}} = \mathcal{U} \left( \left[ \mathcal{C}_1(X_{l-1}), \mathcal{C}_2(X_l), \text{WTConv}(X_{l+1}) \right] \right)$$
where $\mathcal{U}$ denotes upsampling, $\mathcal{C}_1$ and $\mathcal{C}_2$ denote convolutional transformations, and $\text{IWT}$ is the inverse wavelet transform.
Conclusion
Our enhanced YOLOv11 architecture advances small-target detection for surveying drones operating in maritime environments. Together, the WTEConv, MUpsample, and LDy Head modules improve mAP50 by 12.4 percentage points (62.8% to 75.2%) while reducing parameters by 14.7% (2.58M to 2.2M) compared to the baseline YOLOv11n. This balance enables real-time performance (89.8 FPS) on resource-constrained surveying UAV platforms, addressing critical challenges in maritime search and rescue operations. Future work will explore thermal imaging integration and multi-sensor fusion for all-weather deployment.
