Maritime accidents present critical challenges for search and rescue (SAR) operations. The vastness of the ocean, coupled with adverse weather conditions and the inherent urgency of saving lives, demands highly efficient and reliable detection technologies. While traditional rescue vessels and aircraft are indispensable, they often suffer from slow response times, high operational costs, and limited coverage areas when dispatched from distant bases. In this context, deploying an unmanned aerial vehicle (UAV) offers a transformative solution. A UAV can be launched rapidly, cover large expanses of ocean quickly, and operate at a fraction of the cost of manned aircraft. However, the complex maritime environment imposes stringent requirements on the onboard detection system. The model must remain lightweight enough for real-time inference on limited onboard computing resources, be robust against environmental disturbances such as sea clutter, glare, and fog, and, most critically, be able to detect small objects, such as a human head or a piece of life-saving equipment bobbing in the waves. These challenges make standard object detection models inadequate for this life-critical task.
Current mainstream object detection paradigms can be broadly categorized into single-stage and two-stage models. Two-stage models like Sparse R-CNN can achieve high detection accuracy but are computationally expensive and slow, making them unsuitable for real-time deployment on a UAV with limited onboard computing power. Single-stage models from the YOLO family (YOLOv8, YOLOv10, YOLO11), on the other hand, offer a strong balance of speed and accuracy that suits real-time applications. Nevertheless, they often struggle to detect extremely small objects in complex backgrounds. A person overboard, seen from a UAV flying at a safe altitude, often occupies only a tiny fraction of the image pixels and easily blends into the dynamic texture of the ocean surface. This frequently leads to missed detections, a critical failure in a search and rescue scenario. To bridge this gap, we have designed a novel model specifically tailored for maritime SAR tasks, which we call the Dynamic-Convolution, Dynamic-Upsampling, and Wasserstein-loss YOLO network, or DDW-YOLO.

This paper details the architecture and performance of our proposed DDW-YOLO model, which is built upon the YOLO11 framework. Several targeted modifications have been introduced to address the specific challenges of UAV-based maritime search and rescue. First, unlike standard architectures that often discard fine-grained spatial information too early, we incorporate a specialized small object detection layer. This layer operates on a higher-resolution feature map (160×160 for a 640×640 input), allowing the network to preserve crucial low-level details about small targets like a swimmer’s head. Second, to tackle the extreme variability in maritime conditions (from bright sunlight and sharp reflections to overcast skies and choppy waves), we replace standard static convolutions with a dynamic convolution module. This allows the model to adjust its convolutional kernels adaptively based on the input image content, enhancing its feature extraction capabilities without a large increase in computation. Third, standard upsampling techniques like nearest-neighbor or bilinear interpolation often blur important high-frequency information, which is detrimental for small objects. We have integrated a dynamic upsampling module that learns to generate sample point offsets based on the input features, thereby preserving fine details during feature fusion. Finally, we introduce a novel loss function, WWbox_loss. This loss combines a Wasserstein distance-based loss, which remains informative for non-overlapping bounding boxes, with the dynamic focusing mechanism of a Wise IoU (WIoU) loss, which down-weights the influence of low-quality training samples. Together, these improvements make DDW-YOLO a robust and effective tool for a UAV engaged in maritime search and rescue operations, raising both precision and recall to improve the chances of finding and saving lives at sea.
Model Architecture and Improvement
The complete structure of our proposed DDW-YOLO network is detailed below. To provide a clear and structured overview of our specific improvements to the YOLO11 baseline, we summarize our key contributions in the following table:
| Improvement Module | Target Problem | Solution Strategy |
|---|---|---|
| Small Object Detection Layer | Insufficient recall for tiny targets (e.g., a person’s head) in high-altitude UAV drone imagery. | Adds a 160×160 detection head by upsampling an 80×80 feature map and fusing it with a shallow, high-resolution feature map from the backbone. |
| Dynamic Convolution (DyConv) | Poor adaptability of fixed-parameter kernels to the highly variable appearance of sea surfaces (waves, reflections, etc.). | Replaces standard convolutions with a dynamic module that learns to aggregate multiple expert kernels based on the input features, increasing expressiveness with minimal FLOPs increase. |
| Dynamic Upsampling (DySample) | Loss of high-frequency detail during upsampling, which is critical for delineating small objects from a complex background. | Replaces interpolation-based methods with a content-aware, point-based sampling method that learns dynamic offsets to avoid blurring and preserve detail. |
| WWbox_loss | Gradient vanishing for non-overlapping boxes and over-sensitivity to low-quality training samples. | A weighted combination of a Wasserstein distance-based loss (to handle non-overlapping boxes) and a Wise IoU (WIoU) loss (to dynamically focus on high-quality samples). |
Small Object Detection Layer
Standard single-stage models like YOLO11, while fast, are not optimized for the very small target scales found in UAV imagery. In maritime SAR, a person in the water is often represented by just a few pixels. To address this directly, we augment the YOLO11 neck with a new detection head for small objects. The process begins by taking a feature map of size 80×80 from the neck, which contains rich semantic information. This feature map is upsampled by a factor of 2 to produce a new feature map of size 160×160. It is then concatenated with a 160×160 feature map from the shallower layers of the backbone, which retains higher spatial resolution. The fused map, combining fine-grained detail with semantic context, is processed by a C3k2 module to extract complex features before being passed to the detection head. This new head operates at twice the resolution of the largest default detection head in YOLO11 (160×160 versus 80×80), enabling it to capture the fine details needed to identify a survivor’s head and distinguish it from visual noise such as a wave crest or floating debris.
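As a rough illustration of why the extra head helps, the sketch below computes the feature-map side length at each detection scale for the 640×640 input used in our experiments, and how many grid cells a small target spans at each scale. The stride values (4, 8, 16, 32), the P2 to P5 level names, and the 12-pixel target width are assumptions chosen for illustration; they follow common YOLO conventions and are not measurements from our dataset.

```python
def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given stride."""
    return input_size // stride


def cells_covered(target_px: float, stride: int) -> float:
    """Approximate width of a target measured in feature-map cells."""
    return target_px / stride


INPUT_SIZE = 640  # crop size used in the experiments
# Assumed strides per pyramid level; "P2 (added)" is the new small object head.
STRIDES = {"P2 (added)": 4, "P3": 8, "P4": 16, "P5": 32}

sizes = {name: grid_size(INPUT_SIZE, s) for name, s in STRIDES.items()}
# A hypothetical 12-pixel-wide head, chosen only for illustration.
coverage = {name: cells_covered(12, s) for name, s in STRIDES.items()}

for name in STRIDES:
    print(f"{name}: {sizes[name]}x{sizes[name]} grid, "
          f"target spans {coverage[name]:.2f} cells")
```

Under these assumptions the target spans three cells on the 160×160 map but only one and a half on the 80×80 map, which is the intuition behind adding the higher-resolution head.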
Dynamic Convolution Module
A significant challenge in maritime computer vision is the sheer variability of the sea surface. A fixed convolutional kernel might work well on a calm, sunny day but fail in choppy, overcast conditions or under strong glare. To give our model the adaptability a UAV needs, we introduce a dynamic convolution module. This module replaces the static standard convolution with a structure that dynamically composes its kernel based on the input. The core of this module is a lightweight “routing network.” This network processes the input feature map through global average pooling to condense spatial information, followed by a fully connected layer and a Sigmoid activation function. The output of this routing network is a set of “expert” weights. Instead of using Softmax (which forces the weights to sum to one, so that experts compete with each other), we use Sigmoid, allowing the model to activate multiple parallel experts independently. In the conditional convolution part, these learned weights are used to form a weighted sum of multiple convolutional kernels. This summed kernel is then applied to the input feature map using grouped convolution. The routing network’s computation is very cheap, but it allows the overall convolution operation to adapt its behavior for each input image, making the model more resilient to the diverse visual conditions a UAV encounters over the sea.
The core mathematical operation of the routing network can be summarized as:
$$routing\_weights = \sigma(Pooled\_inputs \cdot W_r) \in \mathbb{R}^{B \times E}$$
where \( \sigma \) is the Sigmoid function, \(W_r\) is a learnable projection matrix, \(B\) is the batch size, and \(E\) is the number of expert kernels.
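The routing step and the kernel mixing can be sketched in plain Python for a single sample (B = 1). This is a minimal illustration under stated assumptions, not our implementation: the helper names, the toy feature map, the projection matrix `w_r`, and the three flattened expert kernels are all made up for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def global_avg_pool(feature):
    """feature: list of C channels, each an H x W nested list -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def routing_weights(pooled, w_r):
    """pooled: length-C vector; w_r: C x E matrix -> E sigmoid-activated weights."""
    n_experts = len(w_r[0])
    return [sigmoid(sum(pooled[c] * w_r[c][e] for c in range(len(pooled))))
            for e in range(n_experts)]


def mix_kernels(weights, experts):
    """Weighted sum of E expert kernels (each a flat list of the same length)."""
    return [sum(w * k[i] for w, k in zip(weights, experts))
            for i in range(len(experts[0]))]


# Toy input: C = 2 channels of 2 x 2, E = 3 experts (all values are made up).
feature = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
w_r = [[0.0, 0.5, -0.5], [1.0, 1.0, 1.0]]
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pooled = global_avg_pool(feature)        # per-channel means
weights = routing_weights(pooled, w_r)   # one weight per expert, each in (0, 1)
kernel = mix_kernels(weights, experts)   # the kernel used for this sample
print(pooled, weights, kernel)
```

Because each weight comes from an independent Sigmoid, the weights need not sum to one; in this toy case they sum to 1.5, which a Softmax router could not produce.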
Dynamic Upsampling Module
Feature upsampling is a crucial step in the neck of a detection network, enabling the fusion of multi-scale features. Traditional interpolation methods (nearest-neighbor, bilinear) are simple and fast but treat all pixels uniformly, often losing high-frequency information such as edges and fine textures. For a UAV trying to identify the silhouette of a person floating in the water, this lost detail can be the difference between a successful detection and a missed one. To mitigate this, we implement a dynamic upsampling module that learns to resample the feature map in a content-aware manner. First, the input feature map passes through two parallel 1×1 convolutional layers. One predicts the primary offsets for the sampling points. The other generates a modulation coefficient via a Sigmoid activation. The final offset is the product of the primary offset and the modulation coefficient, which adjusts the range of the offset. This learned offset is used to generate a dynamic sampling grid, and the final output is obtained by bilinear sampling on this grid. Adapting the sampling grid to the input content helps preserve the details that are essential for detecting small objects from a UAV’s perspective.
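The sampling step can be sketched for a single-channel feature map as follows. This is a simplified illustration, not our module: the raw offsets and modulation logits are passed in directly rather than predicted by 1×1 convolutions, and the half-pixel base grid and border clamping are assumptions of this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D list at continuous (y, x), clamped to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def dynamic_upsample(feat, raw_offsets, scope_logits, scale=2):
    """Upsample `feat` by `scale`, shifting each sampling point by a modulated offset.

    raw_offsets[i][j] is a (dy, dx) pair and scope_logits[i][j] a scalar for
    every output position; here they are supplied by the caller.
    """
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            # Base grid: centre of the output pixel mapped into input coordinates.
            base_y = (i + 0.5) / scale - 0.5
            base_x = (j + 0.5) / scale - 0.5
            gate = sigmoid(scope_logits[i][j])  # modulation coefficient in (0, 1)
            dy, dx = raw_offsets[i][j]
            row.append(bilinear_sample(feat, base_y + gate * dy, base_x + gate * dx))
        out.append(row)
    return out


feat = [[0.0, 1.0], [2.0, 3.0]]
zero_off = [[(0.0, 0.0)] * 4 for _ in range(4)]
zero_logit = [[0.0] * 4 for _ in range(4)]
plain = dynamic_upsample(feat, zero_off, zero_logit)  # reduces to bilinear upsampling

shift_off = [row[:] for row in zero_off]
shift_off[0][0] = (0.0, 2.0)  # push the first sampling point to the right
shifted = dynamic_upsample(feat, shift_off, zero_logit)
print(plain[0][0], shifted[0][0])
```

With all offsets at zero the sketch falls back to ordinary bilinear upsampling; a non-zero offset moves an individual sampling point, so the output value at that position is drawn from a different part of the input.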
WWbox_loss (Wasserstein Wise Box Loss)
The standard IoU-based bounding box regression loss functions face two major issues in the context of small targets and difficult samples found in maritime SAR. First, when a small target’s predicted bounding box has no overlap with the ground truth, the IoU is 0, leading to a zero gradient and a halt in the learning process for that sample. Second, these loss functions can treat all samples equally, sometimes giving excessive weight to low-quality or ambiguous detections, which degrades the overall model performance. We propose a new loss function, WWbox_loss, to overcome these limitations. This loss is a weighted average of two complementary losses:
$$WWbox\_loss = \alpha \cdot L_{Wasserstein} + (1 - \alpha) \cdot L_{WIoU}$$
where \( \alpha \) is a trade-off coefficient. The Wasserstein distance-based loss tackles the non-overlapping problem. By modeling the predicted and ground truth boxes as 2D Gaussian distributions with means \(\mu_1, \mu_2\) and covariance matrices \(\Sigma_1, \Sigma_2\), it calculates a distance between these distributions. This distance provides a meaningful gradient even when the boxes do not overlap, guiding the prediction toward the target smoothly. The Wasserstein distance component is defined as:
$$W_{2}^{2} = \lVert \mu_1 - \mu_2 \rVert_2^2 + \operatorname{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_{1}^{1/2}\Sigma_2\Sigma_{1}^{1/2}\right)^{1/2}\right)$$
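For axis-aligned boxes the distance above has a simple closed form, sketched below. The sketch assumes each box (cx, cy, w, h) is modeled as a Gaussian with mean (cx, cy) and diagonal covariance diag(w²/4, h²/4), as in the normalized Gaussian Wasserstein distance formulation; this modeling choice, the exponential mapping to a loss, and the normalizing constant `c` are assumptions of the example rather than details stated in this paper.

```python
import math


def box_to_gaussian(box):
    """(cx, cy, w, h) -> mean (cx, cy) and per-axis standard deviations (w/2, h/2)."""
    cx, cy, w, h = box
    return (cx, cy), (w / 2.0, h / 2.0)


def wasserstein2_squared(box_a, box_b):
    """Squared 2-Wasserstein distance between two axis-aligned box Gaussians.

    With diagonal covariances the trace term reduces to the squared difference
    of the standard deviations, so no matrix square root is needed.
    """
    (mu_a, sd_a), (mu_b, sd_b) = box_to_gaussian(box_a), box_to_gaussian(box_b)
    mean_term = sum((p - q) ** 2 for p, q in zip(mu_a, mu_b))
    cov_term = sum((p - q) ** 2 for p, q in zip(sd_a, sd_b))
    return mean_term + cov_term


def wasserstein_loss(box_a, box_b, c=12.8):
    """1 - exp(-W2 / c); c is a hypothetical, dataset-dependent normaliser."""
    return 1.0 - math.exp(-math.sqrt(wasserstein2_squared(box_a, box_b)) / c)


def iou(box_a, box_b):
    """Plain IoU for (cx, cy, w, h) boxes, included for comparison."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


gt = (10.0, 10.0, 4.0, 4.0)
near = (16.0, 10.0, 4.0, 4.0)  # close to the target but not overlapping
far = (40.0, 10.0, 4.0, 4.0)   # far from the target

for name, box in (("near", near), ("far", far)):
    print(name, iou(gt, box), round(wasserstein_loss(gt, box), 3))
```

Both predictions have an IoU of zero with the ground truth, so an IoU-based loss cannot tell them apart, whereas the Wasserstein-based loss is smaller for the nearer box and therefore still indicates which direction improves the prediction.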
The second component is the Wise IoU (WIoU) loss, a dynamic IoU loss that addresses the issue of low-quality samples. It introduces a non-monotonic focusing mechanism that assigns a high gain to anchor boxes of moderate quality, preventing harmful gradients from extremely low-quality samples. Its scaling factor is given by:
$$scaled\_loss = \frac{\beta}{\alpha}$$
where \(\beta = \frac{IoU}{iou\_mean}\) and \(\alpha = \delta \cdot \gamma^{\beta - \delta}\), with hyperparameters \(\gamma\) and \(\delta\). Here \(\alpha\) denotes the focusing term of WIoU and is distinct from the trade-off coefficient \(\alpha\) in WWbox_loss. For \(\gamma > 1\), this ratio peaks at a moderate value of \(\beta\) and decays on either side, which allows the model to focus on the most beneficial training signals. By combining the geometric awareness of the Wasserstein loss with the quality-aware learning of the WIoU loss, WWbox_loss enhances the stability and accuracy of our UAV maritime detection model.
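The non-monotonic behavior of the scaling factor and the final weighted combination can be checked numerically. The sketch below assumes the gain takes the form β/α (the form that peaks at moderate β, consistent with the focusing behavior described above), uses γ = 1.9 and δ = 3 as illustrative hyperparameters, and uses a hypothetical trade-off value of 0.5; none of these values are reported settings from our experiments.

```python
def wiou_gain(iou_value, iou_mean, gamma=1.9, delta=3.0):
    """Scaling factor beta / alpha with beta = IoU / iou_mean."""
    beta = iou_value / iou_mean
    alpha = delta * gamma ** (beta - delta)  # focusing term, not the trade-off
    return beta / alpha


def wwbox_loss(l_wasserstein, l_wiou, trade_off=0.5):
    """Weighted average of the two loss components."""
    return trade_off * l_wasserstein + (1.0 - trade_off) * l_wiou


IOU_MEAN = 0.2  # hypothetical mean IoU used for normalisation
low = wiou_gain(0.02, IOU_MEAN)   # beta = 0.1
mid = wiou_gain(0.30, IOU_MEAN)   # beta = 1.5
high = wiou_gain(1.00, IOU_MEAN)  # beta = 5.0
print(round(low, 3), round(mid, 3), round(high, 3))

combined = wwbox_loss(0.4, 0.2, trade_off=0.5)
print(round(combined, 3))
```

Under these assumptions the gain is largest for the sample with a moderate β and smaller at both extremes, which is the non-monotonic focusing the text describes.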
Experiments and Analysis
To validate the effectiveness of our proposed improvements, we conducted a comprehensive set of experiments on the challenging SeaDronesSee v2 dataset, which is designed for UAV-based maritime object detection. We used a subset of 5000 images cropped to 640×640 pixels. All experiments used identical hyperparameters for a fair comparison: an SGD optimizer with a learning rate of 0.01 and momentum of 0.937, a batch size of 64, and 400 training epochs. We evaluated models on Precision, Recall, mAP50, mAP50-95, FPS, and parameter count.
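For reproducibility, the settings above can be collected in a single configuration. The sketch below writes them as a Python dictionary whose keys follow the Ultralytics training argument names; the dataset file name is hypothetical, and the commented call is only an assumed way of passing the settings to a trainer, not a record of the exact command we ran.

```python
# Training settings from the text, keyed by Ultralytics-style argument names.
train_config = {
    "data": "seadronessee_v2.yaml",  # hypothetical dataset description file
    "imgsz": 640,        # images cropped to 640x640
    "epochs": 400,
    "batch": 64,
    "optimizer": "SGD",
    "lr0": 0.01,         # initial learning rate
    "momentum": 0.937,
}

# Assumed usage with the Ultralytics package (not executed here):
#   from ultralytics import YOLO
#   YOLO("yolo11n.yaml").train(**train_config)

for key, value in train_config.items():
    print(f"{key}: {value}")
```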
Ablation Study Results
We first conducted an ablation study to measure the individual and collective impact of each proposed module. The results are summarized in the table below.
| Small Target Layer | DyConv | DySample | WWbox_loss | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | FPS | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| No | No | No | No | 92.3 | 87.4 | 91.8 | 61.1 | 285 | 2.58 |
| Yes | No | No | No | 94.4 | 91.1 | 94.6 | 64.3 | 204 | 2.63 |
| Yes | Yes | No | No | 92.3 | 91.7 | 93.9 | 64.5 | 250 | 4.64 |
| Yes | Yes | Yes | No | 93.3 | 92.5 | 94.1 | 65.1 | 238 | 4.67 |
| Yes | Yes | Yes | Yes | 94.7 | 92.7 | 95.3 | 66.2 | 232 | 4.67 |
The results show that the full model improves on the baseline in every accuracy metric, although the gains are not strictly monotonic from one step to the next. Adding the small object detection layer alone raised Recall from 87.4% to 91.1%, directly confirming its effectiveness in detecting previously missed small targets, at the cost of inference speed (285 to 204 FPS). Adding the dynamic convolution module lowered Precision (94.4% to 92.3%) and mAP50 (94.6% to 93.9%) and increased the parameter count from 2.63M to 4.64M, but it recovered inference speed (250 FPS) and slightly improved Recall and the stricter mAP50-95 metric (64.3% to 64.5%), suggesting better bounding box quality. The dynamic upsampling module further improved Recall and both mAP scores by preserving crucial high-frequency information. Finally, integrating WWbox_loss yielded the best overall performance, with Precision of 94.7%, Recall of 92.7%, and mAP50 of 95.3%. Relative to the baseline YOLO11 model, this is a gain of 5.3 percentage points in Recall and 3.5 percentage points in mAP50, while maintaining a practical inference speed of 232 FPS.
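The gains quoted above are differences in percentage points; the short calculation below makes the distinction from relative improvement explicit, using only the first and last rows of the ablation table. The miss-rate figure is derived as 100 minus Recall.

```python
baseline = {"precision": 92.3, "recall": 87.4, "mAP50": 91.8, "mAP50-95": 61.1}
full = {"precision": 94.7, "recall": 92.7, "mAP50": 95.3, "mAP50-95": 66.2}


def point_gain(metric):
    """Absolute gain in percentage points."""
    return round(full[metric] - baseline[metric], 1)


def relative_gain(metric):
    """Gain relative to the baseline value, in percent."""
    return round(100.0 * (full[metric] - baseline[metric]) / baseline[metric], 1)


def miss_rate_reduction():
    """Relative reduction of the miss rate (100 - Recall), in percent."""
    before = 100.0 - baseline["recall"]
    after = 100.0 - full["recall"]
    return round(100.0 * (before - after) / before, 1)


for metric in baseline:
    print(f"{metric}: +{point_gain(metric)} points, +{relative_gain(metric)}% relative")
print(f"miss rate reduced by {miss_rate_reduction()}%")
```

Read this way, the 5.3-point Recall gain corresponds to the miss rate falling from 12.6% to 7.3%, which is the quantity that matters most in a search and rescue setting.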
Comparative Analysis with State-of-the-Art Models
To benchmark DDW-YOLO against other leading detection models suitable for UAV deployment, we compared its performance with recent YOLO models (YOLO11n and YOLOv10n) and the Real-Time Detection Transformer (RTDETR) family. The results are shown in the following table.
| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | FPS | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| DDW-YOLO (Ours) | 94.7 | 92.7 | 95.3 | 66.2 | 232 | 4.67 | 8.4 |
| RTDETR_l | 73.2 | 73.0 | 75.8 | 40.0 | 147 | 32.0 | 103.5 |
| RTDETR_resnet50 | 79.9 | 82.9 | 83.2 | 47.9 | 127 | 41.9 | 125.6 |
| YOLO11n | 92.3 | 87.4 | 91.8 | 61.1 | 285 | 2.58 | 6.3 |
| YOLOv10n | 91.0 | 89.1 | 92.0 | 62.6 | 350 | 2.71 | 8.4 |
The comparative results show a clear accuracy advantage for our proposed model. The YOLO-family models significantly outperform the RTDETR models on this challenging task. This is likely due to design elements such as multi-scale feature fusion, SPPF pooling, and efficient attention mechanisms (C2PSA), which are effective for dense, small objects. While YOLO11n is fast, its lower recall leads to more missed detections. YOLOv10n offers a good balance but is still outperformed in accuracy by DDW-YOLO. Our model achieves the highest Precision, Recall, mAP50, and mAP50-95 among all compared models, although YOLO11n and YOLOv10n remain smaller and faster. With only 4.67 million parameters, it surpasses models that are many times larger, such as the RTDETR variants. This accuracy, combined with a high frame rate (232 FPS), makes DDW-YOLO a strong candidate for real-time deployment on an edge computing platform aboard a UAV, offering the highest recall among the compared models for detecting a survivor in the vast and demanding maritime environment.
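To put "many times larger" in concrete terms, the sketch below derives size and compute ratios and the per-frame latency from the values in the comparison table. The helper names are illustrative, and the latency is simply the reciprocal of the reported FPS, so it reflects the benchmark hardware rather than any particular onboard computer.

```python
models = {
    "DDW-YOLO": {"fps": 232, "params_m": 4.67, "gflops": 8.4},
    "RTDETR_l": {"fps": 147, "params_m": 32.0, "gflops": 103.5},
    "RTDETR_resnet50": {"fps": 127, "params_m": 41.9, "gflops": 125.6},
}


def frame_time_ms(fps):
    """Average time per frame implied by a frames-per-second figure."""
    return 1000.0 / fps


def ratio(name, key, ref="DDW-YOLO"):
    """How many times larger `name` is than the reference model for `key`."""
    return models[name][key] / models[ref][key]


for name in ("RTDETR_l", "RTDETR_resnet50"):
    print(f"{name}: {ratio(name, 'params_m'):.1f}x parameters, "
          f"{ratio(name, 'gflops'):.1f}x GFLOPs")
print(f"DDW-YOLO frame time: {frame_time_ms(models['DDW-YOLO']['fps']):.2f} ms")
```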
Conclusion
In this work, we have presented a novel object detection model, DDW-YOLO, specifically designed for the demanding task of maritime search and rescue using a UAV. The model directly tackles the core challenges of this application: extremely small target sizes, high background variability, and the need for robust, real-time performance. Our key contributions include a dedicated small object detection layer, adaptive dynamic convolution and upsampling modules, and a novel WWbox_loss function that improves training stability and accuracy. Extensive experiments on the SeaDronesSee v2 dataset confirm the effectiveness of our approach. DDW-YOLO achieves 94.7% precision and 92.7% recall, improvements of 2.4 and 5.3 percentage points over the YOLO11 baseline, respectively. Furthermore, it outperforms the other compared models, YOLOv10n and the RTDETR variants, across all major accuracy metrics. Critically, this level of performance is achieved with an efficient model structure, requiring only 4.67M parameters and 8.4 GFLOPs while maintaining an inference speed of 232 FPS. This combination of accuracy and efficiency makes our model well suited for deployment on a UAV, where computational resources are limited but accuracy is paramount. By reducing missed detections, DDW-YOLO provides a more reliable and effective visual detection solution, directly contributing to the success of critical maritime search and rescue missions. For future work, we plan to explore further optimization for even more constrained hardware and to investigate domain adaptation techniques that enhance model generalization across different sea states and geographical locations.
