Real-Time Onboard Flooded Building Detection for UAVs Using Dual-Scale Weak Semantic Prompt Guidance

In this paper, we present FloodSAM‑Duo, a lightweight framework for real‑time, object‑level detection of flooded buildings directly onboard UAVs during flood emergency monitoring. The method integrates dual‑scale weak semantic reasoning from a Pyramid Scene Parsing Network (PSPNet) with prompt‑guided instance segmentation from FastSAM, generating structured disaster information from single UAV images without relying on high‑precision segmentation or off‑board processing. Experiments on real flood UAV datasets from two regions show that our approach outperforms three lightweight baselines in both accuracy and inference latency on an embedded platform, supporting its suitability for operational deployment onboard UAVs.

1. Introduction

Low‑altitude remote sensing by UAVs has become an indispensable tool for rapid disaster response, particularly in flood scenarios. The ability to acquire high‑resolution imagery in near real time allows emergency responders to assess building inundation status, which is critical for evacuation planning and resource allocation. However, existing methods often either require high‑precision pixel‑level segmentation followed by spatial overlay analysis or rely on multi‑frame mosaicking and off‑board processing, both of which introduce latency and depend on stable communication links. In practice, UAVs operating in post‑disaster environments frequently face degraded channels, making continuous transmission of high‑resolution images impractical. There is thus a pressing need for onboard intelligence that can directly infer flooded building instances from single frames and output compact, structured information.

To address this challenge, we propose FloodSAM‑Duo (Dual‑scale Prompt‑guided SAM for Flooded Building Extraction). Our method uses a lightweight PSPNet to generate weak semantic responses for buildings and water at two complementary scales. These responses are then used to automatically generate semantic‑stable prompt points that guide FastSAM to perform fine‑grained instance segmentation only within candidate regions. The final output is a set of object‑level flood‑affected building instances, which can be further aggregated across frames using simple geometric fusion to produce concise geospatial summaries. This design reduces both the computational load and the dependence on image transmission, which suits the real‑time constraints of onboard UAV processing.

2. Related Work

Recent advances in deep learning have enabled various real‑time processing techniques onboard UAVs. Lightweight semantic segmentation networks such as ENet have been deployed on edge devices for flood area extraction. Instance segmentation models like YOLOv8n‑seg and detection‑only frameworks like RT‑DETR Tiny provide object‑level outputs with moderate latency. However, when applied to flood‑affected building detection, these methods face difficulties: ENet often produces fragmented boundaries, YOLOv8n‑seg struggles with small or partially submerged buildings, and RT‑DETR Tiny lacks precise contour information. Moreover, domain shifts between training data and real flood imagery (e.g., water reflections, shadows, and texture degradation) further degrade performance. Our approach addresses these limitations by combining weak semantic priors with prompt‑driven instance segmentation, producing robust and structurally complete building masks directly onboard UAVs.

A distinct advantage of FloodSAM‑Duo is its ability to transform segmentation outputs into structured disaster information (coordinates and counts) without requiring pixel‑level ground truth for the instance segmentation stage. Only weak semantic training for the PSPNet is needed, and the labels for it can be coarse or even derived from freely available building‑footprint and water‑body datasets. This greatly reduces annotation costs and facilitates rapid adaptation to new areas, which is essential for emergency UAV deployments.

3. Methodology

Our overall framework comprises three main modules: (1) dual‑scale weak semantic generation and flooded candidate region inference, (2) semantic‑stable prompt point generation, and (3) prompt‑guided instance segmentation with FastSAM, followed by a simple multi‑frame aggregation step that produces the final output. The architecture is illustrated schematically, and its key components are detailed below.

3.1 Dual‑Scale Weak Semantic Generation

Given an input UAV image \( I \in \mathbb{R}^{H \times W \times 3} \), we first extract feature maps using a lightweight PSPNet encoder. The encoder output is a high‑level semantic feature tensor \( F \in \mathbb{R}^{h \times w \times c} \) where \( h \ll H, w \ll W \). To capture both global context and local details, we construct two weak semantic representations from the same feature tensor \( F \):

Large‑scale features \( F^{(L)} \) obtained through global pyramid pooling (e.g., 1×1 or 2×2 bins), emphasizing the overall layout of buildings and surrounding water.
Small‑scale features \( F^{(S)} \) obtained through local finer pooling (e.g., 4×4 or 8×8 bins), preserving edge and corner information.

The two feature branches are concatenated and fused via a 1×1 convolution:

\[ F^{(D)} = \phi\!\left(\text{Concat}(F^{(L)}, F^{(S)})\right) \]

where \( \phi(\cdot) \) denotes a lightweight convolutional layer for channel reduction and cross‑scale interaction. Subsequently, two independent prediction heads generate weak semantic response maps for buildings and water:

\[ S_b = \sigma(W_b \cdot F^{(D)}), \quad S_w = \sigma(W_w \cdot F^{(D)}) \]

Here \( S_b, S_w \in [0,1]^{h \times w} \) represent the per‑location confidence of belonging to building or water classes. These maps are coarse probabilistic fields, not fine segmentation masks, and thus impose minimal computational overhead.
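The pooling, fusion, and prediction steps above can be illustrated with a minimal pure‑Python sketch. It is not the trained PSPNet: it assumes a single‑channel feature map, replaces the learned \( \phi \), \( W_b \), and \( W_w \) with hand‑picked scalar weights, and uses the hypothetical helper names `pool_to_bins`, `fuse`, and `head`.

```python
import math

def pool_to_bins(fmap, bins):
    """Average-pool a 2D map into bins x bins cells and broadcast each cell
    mean back to the original resolution (nearest-neighbour upsampling)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(bins):
        y0, y1 = by * h // bins, (by + 1) * h // bins
        for bx in range(bins):
            x0, x1 = bx * w // bins, (bx + 1) * w // bins
            cell = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(cell) / len(cell)
            for y in range(y0, y1):
                for x in range(x0, x1):
                    out[y][x] = mean
    return out

def fuse(f_large, f_small, w_large, w_small):
    """Stand-in for phi: a 1x1 convolution over two concatenated
    single-channel maps is a per-pixel weighted sum."""
    return [[w_large * a + w_small * b for a, b in zip(row_l, row_s)]
            for row_l, row_s in zip(f_large, f_small)]

def head(f_fused, weight):
    """Stand-in for a prediction head: sigmoid of a scaled fused feature."""
    return [[1.0 / (1.0 + math.exp(-weight * v)) for v in row] for row in f_fused]

# Toy 4x4 feature map F (illustrative values only).
F = [[0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0],
     [2.0, 2.0, 3.0, 3.0],
     [2.0, 2.0, 3.0, 3.0]]

f_large = pool_to_bins(F, bins=1)   # global context: one bin
f_small = pool_to_bins(F, bins=4)   # local detail: 4x4 bins
f_dual = fuse(f_large, f_small, w_large=0.5, w_small=0.5)
s_b = head(f_dual, weight=1.0)      # assumed building head weight
s_w = head(f_dual, weight=-1.0)     # assumed water head weight
```

With one bin the large‑scale branch collapses to the global mean, while the 4×4 branch on a 4×4 map leaves every location unchanged. The fused map therefore mixes global layout with local detail before the two heads squash it into \( [0,1] \).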

3.2 Flooded Candidate Region Inference

Rather than relying on exact mask overlay, we infer potential flooded building candidates based on structural consistency between the dual‑scale responses. For a building response region \( \Omega_b \) (a region of high response in \( S_b \)), we mark it as a candidate if any of the following conditions holds, where \( S_w^{(L)} \) and \( S_w^{(S)} \) denote the large‑scale and small‑scale water response maps:

It is surrounded by high water response in the large‑scale map \( S_w^{(L)} \).
Its boundary is adjacent to significant water response in the small‑scale map \( S_w^{(S)} \).
It contains weak semantic holes dominated by water response.
The local water response exceeds an adaptive threshold \( \tau_w \).

This multi‑condition check effectively identifies building areas likely affected by flood without requiring pre‑disaster imagery or accurate segmentation.
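A simplified sketch of this check is given below. It uses a single water map rather than separate large‑scale and small‑scale maps, a fixed threshold in place of the adaptive \( \tau_w \), and omits the hole test. The function name `is_flood_candidate` and the parameter `ring_frac` are illustrative assumptions.

```python
def is_flood_candidate(region, s_w, tau_w=0.5, ring_frac=0.5):
    """Return True if a building region looks flood-affected.

    region    : set of (row, col) cells forming one building response region
    s_w       : 2D list with the water response in [0, 1]
    tau_w     : water threshold (fixed here; adaptive in the paper)
    ring_frac : assumed fraction of bordering cells that must be wet
    """
    h, w = len(s_w), len(s_w[0])

    # Cells just outside the region (4-neighbourhood ring).
    ring = set()
    for r, c in region:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and (rr, cc) not in region:
                ring.add((rr, cc))

    # Boundary condition: enough of the surrounding ring responds as water.
    wet = [p for p in ring if s_w[p[0]][p[1]] > tau_w]
    water_at_boundary = bool(ring) and len(wet) / len(ring) >= ring_frac

    # Local condition: mean water response inside the region exceeds tau_w.
    inside_mean = sum(s_w[r][c] for r, c in region) / len(region)
    water_inside = inside_mean > tau_w

    return water_at_boundary or water_inside

# Toy example: one building cell in the middle of a 5x5 scene.
wet_scene = [[0.9] * 5 for _ in range(5)]
wet_scene[2][2] = 0.1                      # the building itself is not water
dry_scene = [[0.1] * 5 for _ in range(5)]
building = {(2, 2)}

flooded = is_flood_candidate(building, wet_scene)      # True
not_flooded = is_flood_candidate(building, dry_scene)  # False
```

Because the conditions are combined with a logical OR, a region needs only one kind of water evidence to become a candidate. This favours recall at this stage and leaves precise delineation to the later segmentation step.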

3.3 Semantic‑Stable Prompt Point Generation

From the building weak semantic map \( S_b \), we extract initial candidate prompt points via local peak detection:

\[ P_0 = \{ p_i \mid S_b(p_i) > S_b(q),\ \forall q \in \mathcal{N}(p_i) \} \]

where \( \mathcal{N}(p_i) \) is a local neighborhood of \( p_i \) that excludes \( p_i \) itself. To avoid clusters of redundant prompts, we impose a minimum distance constraint: of any two points closer than \( d_{\text{min}} \), only the one with the higher response is kept. Furthermore, spatial uniformity is enforced by dividing candidate regions into subregions and retaining only the most salient peak in each. Finally, isolated noise regions are filtered out using connected‑component analysis and response normalization, yielding the final prompt point set \( \mathcal{P} = \{ p_k \} \).
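The first two steps (local peak detection and the minimum distance constraint) can be sketched as follows. The sketch assumes a 3×3 neighborhood, adds a hypothetical `min_response` floor to skip flat low‑response areas, and applies the distance constraint greedily from the strongest peak downward. The subregion and connected‑component steps are not shown.

```python
import math

def local_peaks(s_b, min_response=0.5):
    """Return (row, col, response) for cells that strictly exceed all of
    their 3x3 neighbours. min_response is an assumed floor."""
    h, w = len(s_b), len(s_b[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = s_b[r][c]
            if v < min_response:
                continue
            neighbours = [s_b[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v > n for n in neighbours):
                peaks.append((r, c, v))
    return peaks

def enforce_min_distance(peaks, d_min):
    """Greedy suppression: visit peaks from strongest to weakest and keep a
    peak only if it is at least d_min away from every peak already kept."""
    kept = []
    for r, c, v in sorted(peaks, key=lambda p: p[2], reverse=True):
        if all(math.hypot(r - kr, c - kc) >= d_min for kr, kc, _ in kept):
            kept.append((r, c, v))
    return kept

# Toy building response map with three local maxima.
s_b = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.2, 0.7, 0.1],
    [0.1, 0.2, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
]

peaks = local_peaks(s_b)
prompts = enforce_min_distance(peaks, d_min=2.5)
# peaks   -> [(1, 1, 0.9), (1, 3, 0.7), (4, 4, 0.6)]
# prompts -> [(1, 1, 0.9), (4, 4, 0.6)]
```

The peak at (1, 3) lies 2.0 cells from the stronger peak at (1, 1), so it is suppressed, while the peak at (4, 4) is far enough away to survive.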

3.4 Prompt‑Guided Instance Segmentation

The generated prompt points are fed as positive point prompts into FastSAM, which performs local fine segmentation around each prompt. Since the points are located inside high‑confidence building regions, FastSAM’s search space is effectively constrained, leading to more coherent and complete building masks even under strong reflections or shadow occlusions. The segmentation outputs are individual building instance masks, which are then post‑processed into structured information for transmission.

3.5 Multi‑Frame Aggregation and Output

To reduce redundancy across consecutive frames, we adopt a simple geometric fusion strategy. Given a detected building instance in frame \( t \) with coordinate \( (x_i^t, y_i^t) \) and another in frame \( t+1 \) with coordinate \( (x_j^{t+1}, y_j^{t+1}) \), if the Euclidean distance \( d_{ij} = \sqrt{(x_i^t - x_j^{t+1})^2 + (y_i^t - y_j^{t+1})^2} \) is below a threshold \( \tau \) (set to 3–5 meters depending on ground resolution), the two are considered the same building and fused by averaging the coordinates of all \( M \) matched detections:

\[ (\bar{x}_k, \bar{y}_k) = \frac{1}{M} \sum_{m=1}^{M} (x_m, y_m) \]

Moreover, when detections are highly dense in a local area, we apply clustering to output only the cluster center and count, resulting in a compact text‑based representation. This significantly reduces communication bandwidth while preserving essential disaster information.
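A minimal sketch of the distance‑based fusion is shown below. It assumes detections are already expressed in a metric ground coordinate system, matches each new detection against the running mean of a fused building rather than only against the previous frame, and uses the hypothetical names `fuse_detections` and `summarize`. The density‑based clustering step is not shown.

```python
import math

def fuse_detections(tracks, detections, tau=4.0):
    """Assign each detection (x, y) in metres to the nearest fused building
    whose mean position is closer than tau; otherwise start a new one.
    tracks is a list of lists of (x, y) observations and is updated in place."""
    for x, y in detections:
        best, best_d = None, tau
        for track in tracks:
            cx = sum(p[0] for p in track) / len(track)
            cy = sum(p[1] for p in track) / len(track)
            d = math.hypot(x - cx, y - cy)
            if d < best_d:
                best, best_d = track, d
        if best is None:
            tracks.append([(x, y)])
        else:
            best.append((x, y))
    return tracks

def summarize(tracks):
    """Compact output: (mean x, mean y, number of fused detections M)."""
    return [(round(sum(p[0] for p in t) / len(t), 2),
             round(sum(p[1] for p in t) / len(t), 2),
             len(t)) for t in tracks]

# Two consecutive frames with illustrative coordinates (metres).
frame_t = [(100.0, 200.0), (150.0, 220.0)]
frame_t1 = [(102.0, 201.0), (300.0, 50.0)]

tracks = fuse_detections([], frame_t)
tracks = fuse_detections(tracks, frame_t1)
summary = summarize(tracks)
# summary -> [(101.0, 200.5, 2), (150.0, 220.0, 1), (300.0, 50.0, 1)]
```

The detection at (102, 201) lies about 2.2 m from the first building and is merged with it, so four detections reduce to three buildings. Only the resulting tuples would need to be transmitted, not the masks.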

4. Experiments

We conducted comprehensive experiments to evaluate FloodSAM‑Duo on real flood UAV imagery from the Suizhou and Xinxiang regions in China. The datasets consist of images at 0.2–0.5 m ground resolution, covering various building types and flood conditions. All models were tested on the NVIDIA Jetson Xavier NX platform, a typical embedded platform for UAVs.

4.1 Onboard Inference Performance

Table 1 compares the model size, memory footprint, and single‑frame inference latency of our method against three representative lightweight approaches: ENet (semantic segmentation), YOLOv8n‑seg (instance segmentation), and RT‑DETR Tiny (object detection).

**Table 1: Model scale and inference performance on Jetson Xavier NX.**
| Method | Parameters (M) | Model Size (MB) | GPU Memory (GB) | Latency (ms) |
|---|---|---|---|---|
| ENet | 0.36 | 1.6 | 1.2 | 18–25 |
| YOLOv8n‑seg | 3.2 | 12 | 1.8 | 25–40 |
| RT‑DETR Tiny | 4.7 | 19 | 2.3 | 40–60 |
| FloodSAM‑Duo (Ours) | 2.9 | 10 | 1.6 | 15–20 |

Our method achieves the lowest latency (15–20 ms per frame, roughly 50–67 frames per second), with a parameter count, model size, and memory footprint second only to ENet. This efficiency stems from the decoupled design: the PSPNet runs only on low‑resolution feature maps, and FastSAM only processes small local areas. This makes the method well suited to real‑time deployment onboard UAVs.

4.2 Flooded Building Detection Performance

We evaluated object‑level detection using Precision, Recall, F1‑score for flooded buildings, Average Precision (AP_Flooded), mean Intersection over Union for building instances (mIoU_building_inst), Boundary F1 (BF1_boundary), and Miss Rate. Table 2 presents the quantitative results.

**Table 2: Quantitative comparison of flooded building detection on the Suizhou & Xinxiang datasets.**
| Method | Precision | Recall_Flooded | F1_Flooded | AP_Flooded | mIoU_building_inst | BF1_boundary | Miss Rate |
|---|---|---|---|---|---|---|---|
| ENet | 0.63 | 0.55 | 0.59 | 0.48 | 0.52 | 0.58 | 0.45 |
| YOLOv8n‑seg | 0.74 | 0.68 | 0.71 | 0.62 | 0.65 | 0.72 | 0.32 |
| RT‑DETR Tiny | 0.70 | 0.61 | 0.65 | 0.56 | n/a | n/a | 0.38 |
| FloodSAM‑Duo | 0.86 | 0.82 | 0.84 | 0.78 | 0.76 | 0.85 | 0.18 |

Our method outperforms all baselines on every metric. In particular, relative to YOLOv8n‑seg, the strongest baseline, F1_Flooded rises from 0.71 to 0.84 (a relative improvement of about 18%) and mIoU_building_inst rises from 0.65 to 0.76 (about 17%). The low miss rate (0.18) demonstrates the effectiveness of dual‑scale weak semantics in reducing omissions of small or partially submerged buildings.
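The headline figures can be re‑derived from Table 2 with a few lines of arithmetic. The snippet below is only a consistency check on the reported values, and the helper names `f1_score` and `relative_gain` are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0

# Values taken from Table 2.
f1_ours = round(f1_score(0.86, 0.82), 2)        # FloodSAM-Duo
f1_yolo = round(f1_score(0.74, 0.68), 2)        # YOLOv8n-seg
f1_gain = round(relative_gain(0.84, 0.71), 1)   # relative F1 gain in percent
miou_gain = round(relative_gain(0.76, 0.65), 1) # relative mIoU gain in percent
# f1_ours -> 0.84, f1_yolo -> 0.71, f1_gain -> 18.3, miou_gain -> 16.9
```

Both percentages are relative improvements. The corresponding absolute differences are 0.13 for F1_Flooded and 0.11 for mIoU_building_inst.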

4.3 Ablation Study

We conducted ablations to isolate the contributions of the dual‑scale design and the candidate region inference module. Four variants were tested:

Single‑scale (Large): Only large‑scale weak semantic response.
Single‑scale (Small): Only small‑scale weak semantic response.
Dual‑scale w/o Candidate Region: Dual‑scale responses used to generate prompts but without candidate region inference.
FloodSAM‑Duo (Full): The complete method.

Results are summarized in Table 3.

**Table 3: Ablation study results.**
| Variant | Precision | Recall_Flooded | F1_Flooded | AP_Flooded | mIoU_building_inst | BF1_boundary | Miss Rate |
|---|---|---|---|---|---|---|---|
| Single‑scale (Large) | 0.75 | 0.64 | 0.69 | 0.60 | 0.62 | 0.68 | 0.36 |
| Single‑scale (Small) | 0.72 | 0.67 | 0.69 | 0.61 | 0.64 | 0.70 | 0.33 |
| Dual‑scale w/o Candidate Region | 0.80 | 0.74 | 0.77 | 0.69 | 0.70 | 0.78 | 0.26 |
| FloodSAM‑Duo (Full) | 0.86 | 0.82 | 0.84 | 0.78 | 0.76 | 0.85 | 0.18 |

The full model achieves the best results on every metric. Dual‑scale fusion without candidate region inference already improves over both single‑scale variants (F1_Flooded of 0.77 versus 0.69), and adding the candidate region constraint further raises recall (0.74 to 0.82) and boundary quality (BF1_boundary 0.78 to 0.85). This supports our hypothesis that a coarse‑to‑fine reasoning pipeline mitigates error propagation from weak semantic outputs to final instance masks.

4.4 Negative Sample Robustness

To test false positive suppression, we evaluated all methods on two negative‑sample sets: (A) non‑flood urban areas (buildings present but no water) and (B) water‑only regions (no buildings). Table 4 reports the number of false positives (FP), FP rate (%), and mean FP per image.

**Table 4: False positive analysis on negative samples.**
| Method | Non‑flood urban FP (A) | FP rate A (%) | Water‑only FP (B) | FP rate B (%) | Mean FP per image |
|---|---|---|---|---|---|
| ENet | 42 | 21.0 | 57 | 28.5 | 2.48 |
| YOLOv8n‑seg | 27 | 13.5 | 34 | 17.0 | 1.52 |
| RT‑DETR Tiny | 31 | 15.5 | 29 | 14.5 | 1.50 |
| FloodSAM‑Duo | 6 | 3.0 | 4 | 2.0 | 0.25 |

FloodSAM‑Duo produces far fewer false positives than all baselines, especially in water‑only scenes where other methods often hallucinate building instances due to wave patterns or sun glint. This robustness is attributed to the weak semantic screening: the dual‑scale responses require both building and water signals in a structural relationship before generating prompts, effectively filtering out non‑building regions.

5. Conclusion

We have introduced FloodSAM‑Duo, a framework that enables real‑time, object‑level flooded building detection directly onboard UAVs. By combining dual‑scale weak semantic reasoning with prompt‑guided instance segmentation, our method achieves high detection accuracy, structurally complete building masks, and low false positive rates, while maintaining an inference latency of 15–20 ms per frame on embedded hardware. The ability to output structured disaster information reduces reliance on high‑bandwidth image transmission, making the method particularly suitable for emergency scenarios where UAVs operate under degraded communication conditions. Experiments on real flood datasets show that our approach outperforms the lightweight segmentation and detection baselines considered. Future work will focus on temporal modeling to track flooding progression and on adaptive prompt generation strategies to further enhance robustness across diverse environments, with the aim of advancing the autonomy of UAVs in disaster response.