DDW-YOLO: An Improved YOLO11 Model for China UAV Maritime Search and Rescue Target Detection

Compared with traditional rescue vessels and aircraft, China UAV (unmanned aerial vehicle) systems offer rapid response, wide coverage, and low cost when maritime accidents occur. However, the complex marine environment demands that the detection model onboard a China UAV be lightweight and low-overhead while offering strong anti-interference capability and high accuracy on small targets. Existing single-stage object detection models face significant challenges in maritime search and rescue (SAR) scenarios: fixed-parameter models struggle to adapt to variable sea conditions, and single-stage models detect small targets poorly, leading to frequent missed and false detections.

To address these challenges, we propose a novel detection model named DDW-YOLO (Dynamic-convolution, Dynamic-upsampling, and Wasserstein-loss YOLO) based on the YOLO11 architecture. Our model introduces several targeted improvements: (1) a small target detection layer; (2) dynamic convolution modules that adaptively adjust convolutional kernels based on input features; (3) a dynamic upsampling module based on point sampling and dynamic offsets; (4) a hybrid loss function combining Wasserstein distance loss and Wise IoU (WIoU) loss. These enhancements significantly improve the adaptability of China UAV detection systems in complex marine environments and boost small-target detection performance.

1. Methodology

1.1 Small Target Detection Layer

When a China UAV performs maritime search and rescue, the target often appears as only the head of a person in distress, which is an extremely small object. The original YOLO11 lacks sufficient detection capability for such tiny targets. We added a new detection head operating on a 160×160 feature map, which is generated by upsampling the 80×80 feature map from the neck network and concatenating it with a shallow 160×160 feature map from the backbone. This high-resolution layer retains more pixel-level details and fuses both high-resolution spatial information and deep semantic features, thereby significantly improving the detection accuracy for small objects.

The architecture of the small target detection layer is described as follows: The 80×80 feature map (channel dimension C₁) is upsampled by a factor of 2 to produce a 160×160 feature map, which is then concatenated with the backbone’s 160×160 feature map (channel dimension C₂). The concatenated feature map of size (C₁+C₂)×160×160 is processed by a C3k2 module before being fed into the detection head. This design enables the network to simultaneously capture fine-grained details and high-level semantics, reducing false positives in complex maritime backgrounds.
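The shape bookkeeping of this fusion step can be checked with a minimal sketch. It assumes nearest-neighbour 2× upsampling (the text does not name the interpolation mode) and uses toy sizes in place of the 80×80 and 160×160 maps; the helper names are hypothetical, and the C3k2 module and detection head are omitted.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] nested list (assumed mode)."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out


def concat_channels(a, b):
    """Channel-wise concatenation; spatial sizes must match."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# Toy stand-ins: neck map with C1 = 3 at 4x4, backbone map with C2 = 2 at 8x8.
neck = [[[float(c)] * 4 for _ in range(4)] for c in range(3)]
backbone = [[[0.5] * 8 for _ in range(8)] for _ in range(2)]

fused = concat_channels(upsample_nearest_2x(neck), backbone)
print(shape(fused))  # (5, 8, 8), i.e. (C1 + C2) x 2H x 2W
```

With the real sizes the same arithmetic gives a (C₁+C₂)×160×160 tensor, which is what the C3k2 module receives.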

1.2 Dynamic Convolution Module

Standard convolution kernels are fixed once trained, making them poorly adaptable to varying sea conditions such as different lighting, wave patterns, and fog. We replace the standard convolutions in YOLO11 with dynamic convolutions. Instead of using a single static kernel, we employ a set of expert kernels and a routing network that computes per-sample mixing weights. The entire process consists of two parts: a routing network and a conditional convolution.

The routing network first applies global average pooling to the input feature map x ∈ ℝ^B×C×H×W, compressing spatial dimensions into a global descriptor:

$$ \text{pooled\_inputs} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{:,:,i,j} \in \mathbb{R}^{B \times C} $$

Then, a linear layer with weights W_r ∈ ℝ^C×E projects the descriptor to E expert weights, activated by a Sigmoid function:

$$ \text{routing\_weights} = \sigma(\text{pooled\_inputs} \cdot \mathbf{W}_r) \in \mathbb{R}^{B \times E} $$

Note that we use Sigmoid instead of Softmax to avoid the mutual exclusivity of experts, allowing multiple experts to be activated simultaneously for better feature representation.

In the conditional convolution step, the routing weights are used to blend the expert kernels. The mixed kernel W_mixed is obtained by matrix multiplication:

$$ \mathbf{W}_{\text{mixed}} = \alpha \cdot \mathbf{W} \in \mathbb{R}^{B \times P} $$

where α ∈ ℝ^B×E denotes the routing weights and W ∈ ℝ^E×P stacks the E expert kernels, each flattened to P = out_channels · (C/groups) · K_h · K_w parameters. Because α comes from a Sigmoid, each entry α_i,j lies in (0, 1) and the weights of a sample are not constrained to sum to 1. The mixed kernel is then reshaped into (B·out_channels, C/groups, K_h, K_w) and applied via grouped convolution, enabling efficient dynamic computation without significantly increasing FLOPs. This mechanism allows the China UAV detection model to adapt its receptive field and feature extraction pattern based on the input content, greatly enhancing robustness in maritime environments.
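The routing and kernel-mixing steps can be illustrated for a single sample with a small pure-Python sketch. The function names, the toy routing matrix `w_r`, and the toy expert kernels are hypothetical values chosen for illustration; the grouped convolution that applies the mixed kernel is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def routing_weights(x, w_r):
    """x: one sample as [C][H][W]; w_r: [C][E]. Returns E sigmoid gates."""
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    num_experts = len(w_r[0])
    # Linear projection to E logits, each gated independently by a sigmoid.
    return [sigmoid(sum(pooled[c] * w_r[c][e] for c in range(len(pooled))))
            for e in range(num_experts)]


def mix_kernels(alpha, experts):
    """alpha: E gates; experts: [E][P] flattened kernels. Returns P mixed weights."""
    p = len(experts[0])
    return [sum(alpha[e] * experts[e][k] for e in range(len(alpha)))
            for k in range(p)]


x = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, spatial mean 4.0
     [[0.0, 0.0], [0.0, 0.0]]]   # channel 1, spatial mean 0.0
w_r = [[0.0, 1.0], [1.0, 0.0]]                 # hypothetical routing matrix, C = 2, E = 2
experts = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # hypothetical experts, E = 2, P = 3

alpha = routing_weights(x, w_r)
mixed = mix_kernels(alpha, experts)
print(alpha, mixed)
```

In a batched implementation the same mixing is one matrix product of shape (B×E)·(E×P), followed by the reshape and grouped convolution described above.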

1.3 Dynamic Upsampling

Traditional interpolation-based upsampling methods (e.g., bilinear, nearest neighbor) suffer from loss of high-frequency details, especially under challenging sea conditions with wave interference, lighting changes, and haze. We propose a dynamic upsampling module based on point sampling with learnable offsets. Given an input feature map, the module predicts per-pixel dynamic sampling offsets and applies bilinear sampling on a deformed grid.

The process is as follows: Two parallel 1×1 convolutions are applied to the input feature map x ∈ ℝ^C×H×W. The first convolution predicts the primary offset Δ_main with output channels 2·G·s² (where G is the number of groups and s is the upsampling factor). The second convolution outputs a scale factor that is passed through a Sigmoid activation to produce a modulation coefficient. The final offset Δ is computed as:

$$ \Delta = f_{\text{offset}}(x) \cdot \sigma(f_{\text{scope}}(x)) \cdot 0.5 + p_{\text{init}} $$

Here, f_offset and f_scope are 1×1 convolutions, σ is the Sigmoid function, and p_init is the initial sampling position. The coordinates for sampling are generated by:

$$ \text{coords} = 2 \cdot C + \Delta - N $$

where C is the original coordinate grid and N is a normalization factor. Finally, bilinear sampling is performed according to these coordinates to obtain the upsampled feature map:

$$ y = \text{grid\_sample}(x, \text{coords}) $$

This dynamic upsampling preserves high-frequency content and adapts to local image structures, which is crucial for detecting small objects like human heads in rough seas. The module introduces minimal additional parameters while significantly improving feature representation quality for China UAV applications.
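A simplified, single-channel sketch of point sampling with modulated offsets is given below. It is an illustration under stated assumptions rather than the module itself: the raw offsets and scope values are passed in directly (standing in for the outputs of the two 1×1 convolutions), coordinates are kept in input-pixel units instead of the normalized range used by `grid_sample`, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bilinear_sample(fmap, y, x):
    """Sample a [H][W] map at fractional (y, x), clamping to the border."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def dynamic_upsample(fmap, raw_offsets, scopes, scale=2):
    """raw_offsets[i][j] = (dy, dx) and scopes[i][j] are given per output pixel."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            # p_init: centre of the output pixel expressed in input coordinates.
            init_y = (i + 0.5) / scale - 0.5
            init_x = (j + 0.5) / scale - 0.5
            # Offset modulation: raw offset * sigmoid(scope) * 0.5.
            gate = 0.5 * sigmoid(scopes[i][j])
            off_y, off_x = raw_offsets[i][j]
            row.append(bilinear_sample(fmap, init_y + off_y * gate,
                                       init_x + off_x * gate))
        out.append(row)
    return out


fmap = [[0.0, 1.0], [2.0, 3.0]]
scopes = [[0.0] * 4 for _ in range(4)]
no_shift = [[(0.0, 0.0)] * 4 for _ in range(4)]
shift = [[(1.5, 1.5)] * 4 for _ in range(4)]

plain = dynamic_upsample(fmap, no_shift, scopes)    # reduces to bilinear upsampling
shifted = dynamic_upsample(fmap, shift, scopes)     # samples from displaced points
print(plain[1][1], shifted[1][1])
```

With all offsets at zero the sketch reduces to ordinary bilinear upsampling; non-zero offsets move each sampling point, which is the degree of freedom the learned module exploits.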

1.4 Loss Function Improvement

The original YOLO11 uses an IoU-based loss. However, in maritime SAR, many targets (e.g., a swimmer’s head) are tiny, and a predicted box often does not overlap the ground truth at all; the IoU is then zero and the IoU loss saturates, providing no gradient. Furthermore, low-quality samples (e.g., partial occlusions, reflections) can dominate training. To address these issues, we propose a novel loss function called WWBoxLoss, which is a weighted combination of Wasserstein distance loss and Wise IoU (WIoU) loss.

Wasserstein Distance Loss: We model both the predicted box and ground truth box as 2D Gaussian distributions. For a box B = (x₁, y₁, x₂, y₂), the mean μ and covariance Σ are defined as:

$$ \mu = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right), \quad \Sigma = \begin{pmatrix} \frac{x_2 - x_1}{4} & 0 \\ 0 & \frac{y_2 - y_1}{4} \end{pmatrix} $$

The squared Wasserstein distance between two Gaussian distributions N₁(μ₁, Σ₁) and N₂(μ₂, Σ₂) is:

$$ W_2^2 = \| \mu_1 - \mu_2 \|_2^2 + \text{Tr}\left( \Sigma_1 + \Sigma_2 - 2\left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right) $$

When the boxes do not overlap, the Wasserstein distance still provides a meaningful gradient, thus solving the gradient vanishing problem.
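For axis-aligned boxes both covariances are diagonal, so the trace term reduces to a sum of squared differences of the square roots of the diagonal entries. The sketch below computes the squared distance under exactly the mean and covariance definitions given above; the function names are hypothetical, and the mapping from distance to a loss value is not shown because the text does not specify it.

```python
import math


def box_to_gaussian(box):
    """Map (x1, y1, x2, y2) to the mean and the diagonal of Sigma as defined above."""
    x1, y1, x2, y2 = box
    mu = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    sigma_diag = ((x2 - x1) / 4.0, (y2 - y1) / 4.0)
    return mu, sigma_diag


def wasserstein2_sq(box_a, box_b):
    """Squared 2-Wasserstein distance between the two box Gaussians."""
    (ax, ay), (sax, say) = box_to_gaussian(box_a)
    (bx, by), (sbx, sby) = box_to_gaussian(box_b)
    mean_term = (ax - bx) ** 2 + (ay - by) ** 2
    # Diagonal case: Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2) = sum_i (sqrt(a_i) - sqrt(b_i))^2
    trace_term = ((math.sqrt(sax) - math.sqrt(sbx)) ** 2
                  + (math.sqrt(say) - math.sqrt(sby)) ** 2)
    return mean_term + trace_term


small = (0.0, 0.0, 4.0, 4.0)
far = (10.0, 0.0, 14.0, 4.0)      # same size, no overlap with `small`
farther = (20.0, 0.0, 24.0, 4.0)  # same size, even further away

print(wasserstein2_sq(small, small))    # 0.0
print(wasserstein2_sq(small, far))      # 100.0
print(wasserstein2_sq(small, farther))  # 400.0
```

Both non-overlapping pairs have an IoU of zero, yet the distance still grows with the separation, which is the property that keeps the gradient informative.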

Wise IoU (WIoU) Loss: WIoU introduces a non-monotonic focusing mechanism to adaptively adjust the influence of each sample. The scaling factor is computed as:

$$ \text{scaled\_loss} = \frac{\alpha}{\beta}, \quad \beta = \frac{\text{IoU}}{\text{iou\_mean}}, \quad \alpha = \delta \cdot \gamma^{\beta - \delta} $$

where δ and γ are hyperparameters. By dynamically updating iou_mean during training via momentum, high-quality samples get larger gradients while low-quality samples are suppressed. This reduces the negative impact of difficult examples common in maritime images.

Our final WWBox loss is a convex combination:

$$ \mathcal{L}_{\text{WWBox}} = \alpha_{\text{weight}} \cdot \mathcal{L}_{\text{Wasserstein}} + (1 - \alpha_{\text{weight}}) \cdot \mathcal{L}_{\text{WIoU}} $$

Through experiments, we set α_weight = 0.5 as the optimal balance. This combined loss function significantly improves the localization accuracy and robustness of the China UAV detection model under challenging sea conditions.

2. Experiments and Analysis

2.1 Dataset

We conducted experiments on the SeaDronesSee v2 dataset, a large-scale maritime aerial image dataset containing over 54,000 4K-resolution images with six object categories (e.g., swimmer, boat, buoy, life-saving equipment). The dataset covers challenging scenarios including illumination variations, wave interference, and small targets. To match our computational resources, we randomly sampled 5,000 images and used an area-based clustering algorithm to crop each image into 640×640 patches, resulting in a final dataset of 5,000 images.

2.2 Implementation Details

All experiments were performed on a Tesla V100-SXM2-32GB GPU under Windows 10, using PyTorch with conda 24.9.2 and Python 3.10. We fixed the following hyperparameters to ensure fair comparison:

Table 1: Training Parameters
Parameter	Value
Learning Rate	0.01
Momentum	0.937
Epochs	400
Optimizer	SGD
Batch Size	64
Random Seed	1

2.3 Evaluation Metrics

We used seven metrics: Precision (P), Recall (R), mAP₅₀, mAP_50:95, FPS, Parameters (M), and GFLOPs. The definitions are as follows:

$$ \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} $$

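$$ \text{AP} = \int_0^1 P(r)\, dr, \quad \text{mAP}_{50} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i \quad (\text{IoU threshold } 0.5) $$

where N is the number of object categories and AP_i is the average precision of category i.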

$$ \text{mAP}_{50:95} = \frac{1}{10} \sum_{k=0}^{9} \text{mAP}_{\text{IoU}=0.5+0.05k} $$

$$ \text{FPS} = \frac{1}{\text{PrepTime} + \text{InfTime} + \text{PostTime}} $$
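These definitions translate directly into code. The sketch below is a simplified illustration: the area under the precision-recall curve is approximated with the trapezoidal rule on a few sample points, whereas common evaluation toolkits use an interpolated precision envelope, and all counts and timings are made-up example values.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Area under P(r) by the trapezoidal rule; recalls must be ascending."""
    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2.0
    return area


def map_50_95(map_per_threshold):
    """Mean over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(map_per_threshold) == 10
    return sum(map_per_threshold) / 10.0


def fps(prep_s, inf_s, post_s):
    """Frames per second from per-image stage times given in seconds."""
    return 1.0 / (prep_s + inf_s + post_s)


# Made-up example values, not results from the experiments.
print(precision(tp=90, fp=10))   # 0.9
print(recall(tp=90, fn=30))      # 0.75
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))  # 0.875
print(fps(0.001, 0.003, 0.001))  # roughly 200
```

Note that FPS defined this way includes pre- and post-processing, so it is lower than a figure based on inference time alone.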

2.4 Ablation Study

To evaluate the contribution of each component, we started from the YOLO11 baseline and progressively added the small target detection layer, dynamic convolution, dynamic upsampling, and WWBox loss. The results are shown in Table 2.

Table 2: Ablation Experiment Results on SeaDronesSee v2
Small Target Layer	DyConv	DySample	WWBox	P (%)	R (%)	mAP₅₀ (%)	mAP_50:95 (%)	FPS	Params (M)
–	–	–	–	92.3	87.4	91.8	61.1	285	2.58
✓	–	–	–	94.4	91.1	94.6	64.3	204	2.63
✓	✓	–	–	92.3	91.7	93.9	64.5	250	4.64
✓	✓	✓	–	93.3	92.5	94.1	65.1	238	4.67
✓	✓	✓	✓	94.7	92.7	95.3	66.2	232	4.67

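As shown, the full model improves on the baseline in every accuracy metric, although individual steps involve trade-offs. The small target detection layer boosts recall significantly (from 87.4% to 91.1%) but lowers FPS from 285 to 204. Dynamic convolution recovers some speed (FPS from 204 to 250) and slightly raises recall and mAP_50:95, while precision and mAP₅₀ dip (from 94.4% to 92.3% and from 94.6% to 93.9%) and the parameter count grows from 2.63M to 4.64M. Dynamic upsampling further enhances mAP_50:95 to 65.1%. Finally, the WWBox loss pushes all four accuracy metrics to their peak: Precision 94.7%, Recall 92.7%, mAP₅₀ 95.3%, mAP_50:95 66.2%.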
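The step-by-step changes in Table 2 can be tabulated with a few lines of Python. The numbers are copied from the table; the row labels and the helper name are hypothetical.

```python
METRICS = ("P", "R", "mAP50", "mAP50:95", "FPS", "Params")

# (configuration, P, R, mAP50, mAP50:95, FPS, Params in M), copied from Table 2.
ROWS = [
    ("baseline",             92.3, 87.4, 91.8, 61.1, 285, 2.58),
    ("+ small target layer", 94.4, 91.1, 94.6, 64.3, 204, 2.63),
    ("+ DyConv",             92.3, 91.7, 93.9, 64.5, 250, 4.64),
    ("+ DySample",           93.3, 92.5, 94.1, 65.1, 238, 4.67),
    ("+ WWBox",              94.7, 92.7, 95.3, 66.2, 232, 4.67),
]


def step_deltas(rows):
    """Change of every metric relative to the previous row."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        deltas = {name: round(cur[i + 1] - prev[i + 1], 2)
                  for i, name in enumerate(METRICS)}
        out.append((cur[0], deltas))
    return out


for label, deltas in step_deltas(ROWS):
    print(label, deltas)
```

Reading the deltas row by row makes the cost of each accuracy gain explicit, in particular the FPS drop from the extra detection head and the parameter growth from the expert kernels.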

We also visualize the Precision-Recall (P-R) curves of the baseline YOLO11 and our DDW-YOLO for each category. The areas under the curves for small objects (swimmer, life-saving equipment, buoy) are notably larger for DDW-YOLO, confirming its superiority in detecting small targets critical for China UAV rescue missions.

2.5 Comparison with State-of-the-Art Methods

We compared DDW-YOLO with other leading detection models, including YOLO11n, YOLOv10n, RT-DETR-l, and RT-DETR-resnet50. The results are summarized in Table 3.

Table 3: Comparison with State-of-the-Art Methods
Model	P (%)	R (%)	mAP₅₀ (%)	mAP_50:95 (%)	FPS	Params (M)	GFLOPs
DDW-YOLO	94.7	92.7	95.3	66.2	232	4.67	8.4
RT-DETR-l	73.2	73.0	75.8	40.0	147	32.0	103.5
RT-DETR-resnet50	79.9	82.9	83.2	47.9	127	41.9	125.6
YOLO11n	92.3	87.4	91.8	61.1	285	2.58	6.3
YOLOv10n	91.0	89.1	92.0	62.6	350	2.71	8.4

DDW-YOLO achieves the highest Precision, Recall, mAP₅₀, and mAP_50:95 among all models. YOLO11n and YOLOv10n run faster (285 and 350 FPS) with fewer parameters, but their detection accuracy is lower, especially for small targets: DDW-YOLO leads YOLO11n by 5.3 points in recall and 5.1 points in mAP_50:95. DDW-YOLO’s FPS of 232 is still well above real-time requirements (≥30 FPS), making it highly suitable for China UAV edge deployment. The RT-DETR models lag significantly in accuracy (mAP₅₀ of 75.8% and 83.2%) and, at 32.0M and 41.9M parameters, are too heavy for real-time airborne applications. Our model’s GFLOPs (8.4) equals that of YOLOv10n and is moderately higher than that of YOLO11n (6.3), demonstrating excellent efficiency.

2.6 Qualitative Analysis

We present representative prediction examples where the baseline YOLO11 missed targets while DDW-YOLO correctly detected them. In the first example, sea surface reflections created stripe patterns similar to a swimmer’s arm; the dynamic upsampling in DDW-YOLO preserved high-frequency details and avoided false negatives. In the second example, only the orange life jacket was visible above water; dynamic convolution allowed the model to associate context and infer the presence of a person. In the third example, the target was partially occluded; the improved loss function enhanced robustness against incomplete target boundaries. These qualitative results confirm that DDW-YOLO effectively addresses the unique challenges of maritime search and rescue for China UAV systems.

3. Conclusion

We have proposed DDW-YOLO, an improved YOLO11-based model specifically designed for China UAV maritime search and rescue target detection. By introducing a small target detection layer, dynamic convolution, dynamic upsampling, and a hybrid Wasserstein-WIoU loss function, our model significantly enhances small object detection capability and environmental adaptability. Extensive experiments on the SeaDronesSee v2 dataset demonstrate that DDW-YOLO achieves a Precision of 94.7% and a Recall of 92.7%, outperforming state-of-the-art methods while maintaining a compact size of only 4.67M parameters and 8.4 GFLOPs. With a real-time inference speed of 232 FPS, our model is suitable for deployment on lightweight China UAV platforms, providing an efficient and reliable visual detection solution for maritime rescue operations.