In recent years, the rapid proliferation of civil drones has transformed various sectors, including surveillance, agriculture, and disaster response, offering unprecedented convenience and efficiency. However, the accessibility and versatility of these devices also pose significant threats to public safety, such as unauthorized surveillance, smuggling, and espionage. Consequently, developing robust and reliable detection systems for civil drones has become a critical research focus. Traditional radar-based methods often struggle with low-altitude, slow-moving drones due to their small radar cross-sections and interference from ground clutter. In contrast, electro-optical sensors, including infrared and visible-light cameras, provide a viable alternative by capturing high-resolution image and video data, enabling visual target detection. This paper explores the advancements in civil drone detection leveraging deep convolutional neural networks (DCNNs), which have revolutionized object detection by automating feature extraction and enhancing semantic representation.
Object detection in computer vision involves identifying and localizing specific targets within images or videos. Classical approaches relied on handcrafted features like Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Local Binary Patterns (LBP), combined with classifiers such as Support Vector Machines (SVM) or AdaBoost. However, these methods were computationally intensive due to sliding window strategies and suffered from limited adaptability to varying target appearances and backgrounds. The advent of DCNNs addressed these limitations by integrating feature learning and pattern discrimination into a unified framework, achieving remarkable accuracy on large-scale datasets. This survey delves into DCNN-based techniques for civil drone detection, categorizing them into static image and video-based approaches, and discusses challenges and future directions.

DCNN-Based Visual Object Detection
DCNN-based object detection algorithms can be broadly classified into two-stage and one-stage methods. Two-stage approaches, such as R-CNN and its variants, first generate region proposals and then classify them, while one-stage methods, like YOLO and SSD, perform detection in a single pass. The evolution of these algorithms has significantly influenced civil drone detection research.
Two-Stage Methods
Two-stage detectors excel in accuracy but often at the cost of speed. The R-CNN framework introduced DCNNs to object detection by extracting features from region proposals generated via selective search. However, it suffered from computational redundancy and geometric distortions due to fixed-size resizing. SPPNet mitigated this with spatial pyramid pooling, enabling multi-scale feature extraction, while Fast R-CNN integrated region of interest (RoI) pooling and joint training for improved efficiency. Faster R-CNN further enhanced performance by incorporating a Region Proposal Network (RPN), which shares convolutional features with the detection network, enabling end-to-end training. For instance, the loss function in Faster R-CNN combines classification and regression terms:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where $p_i$ and $p_i^*$ are the predicted and ground-truth object labels for anchor $i$, $t_i$ and $t_i^*$ are the predicted and ground-truth box parameters, $L_{cls}$ is the log loss for object classification, $L_{reg}$ is the smooth L1 loss for bounding box regression (applied only to positive anchors), and $\lambda$ balances the two normalized terms. Extensions like Mask R-CNN added instance segmentation capabilities by predicting pixel-level masks, which can be beneficial for precise civil drone localization in cluttered environments.
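A minimal PyTorch-style sketch of this composite loss, assuming per-proposal class logits, box offsets, and their targets (tensor names such as `cls_logits` and `box_deltas` are illustrative, and the normalization is simplified to per-batch means), might look as follows:

```python
# Simplified sketch of a two-stage detection loss: cross-entropy for
# classification plus smooth L1 regression on positive proposals only.
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_deltas, box_targets, lam=1.0):
    """cls_logits: (N, K); labels: (N,); box_deltas, box_targets: (N, 4)."""
    l_cls = F.cross_entropy(cls_logits, labels)        # log loss over classes
    pos = labels > 0                                   # positive (object) proposals
    if pos.any():
        l_reg = F.smooth_l1_loss(box_deltas[pos], box_targets[pos])
    else:
        l_reg = box_deltas.sum() * 0.0                 # keep the graph connected
    return l_cls + lam * l_reg

# Dummy usage: 8 proposals, 2 classes (background vs. drone), 4 box offsets.
loss = detection_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                      torch.randn(8, 4), torch.randn(8, 4))
```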
One-Stage Methods
One-stage detectors prioritize speed and real-time performance. YOLO (You Only Look Once) divides the image into grids and predicts bounding boxes and class probabilities directly, but it struggles with small objects like distant civil drones. SSD (Single Shot MultiBox Detector) addresses this by leveraging multi-scale feature maps and default boxes, improving detection across varying sizes. YOLOv2 and YOLOv3 introduced anchor boxes, feature pyramid networks, and residual connections to enhance small object detection. For example, YOLOv3’s loss function incorporates binary cross-entropy for classification and mean squared error for box coordinates:
$$L = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] - \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ C_i \log(\hat{C}_i) + (1 - C_i) \log(1 - \hat{C}_i) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left[ C_i \log(\hat{C}_i) + (1 - C_i) \log(1 - \hat{C}_i) \right] - \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left[ p_i(c) \log(\hat{p}_i(c)) + (1 - p_i(c)) \log(1 - \hat{p}_i(c)) \right]$$
where $S^2$ is the number of grid cells, $B$ is the number of boxes predicted per cell, $C_i$ and $\hat{C}_i$ are the ground-truth and predicted objectness confidences, $p_i(c)$ and $\hat{p}_i(c)$ are the ground-truth and predicted class probabilities, and $\mathbb{1}$ denotes indicator functions selecting cells (and boxes) with or without assigned objects. Recent models like YOLOv4 and DETR (Detection Transformer) have further pushed the boundaries with advanced training techniques and attention mechanisms, though small civil drone detection remains challenging.
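As a hedged illustration of how these terms are combined in practice, the following PyTorch-style sketch composes mean squared error on masked box coordinates with binary cross-entropy for objectness and class probabilities (tensor names, shapes, and weighting constants are assumptions rather than YOLOv3's reference implementation):

```python
# Sketch of a YOLO-style loss: MSE on box coordinates for cells with an
# assigned object, BCE for objectness (down-weighted for no-object cells)
# and for per-class probabilities. Predictions are assumed post-sigmoid.
import torch
import torch.nn.functional as F

def yolo_like_loss(pred_box, true_box, pred_obj, true_obj,
                   pred_cls, true_cls, obj_mask,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """pred_box/true_box: (S, S, B, 4); pred_obj/true_obj: (S, S, B);
    pred_cls/true_cls: (S, S, B, num_classes); obj_mask: (S, S, B) bool."""
    coord = F.mse_loss(pred_box[obj_mask], true_box[obj_mask], reduction="sum")
    obj = F.binary_cross_entropy(pred_obj[obj_mask], true_obj[obj_mask], reduction="sum")
    noobj = F.binary_cross_entropy(pred_obj[~obj_mask], true_obj[~obj_mask], reduction="sum")
    cls = F.binary_cross_entropy(pred_cls[obj_mask], true_cls[obj_mask], reduction="sum")
    return lambda_coord * coord + obj + lambda_noobj * noobj + cls
```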
| Model | Year | Backbone | Characteristics |
|---|---|---|---|
| R-CNN | 2014 | AlexNet | Integrates CNN classification and proposal generation; multi-stage training; computationally expensive. |
| SPPNet | 2015 | ZFNet | Introduces spatial pyramid pooling for multi-scale features; improves speed but reduces accuracy. |
| Fast R-CNN | 2015 | VGG16 | Uses RoI pooling and joint training; faster than R-CNN but not real-time. |
| Faster R-CNN | 2015 | VGG | Incorporates RPN for proposal generation; high accuracy but complex training. |
| YOLOv1 | 2016 | GoogLeNet | End-to-end detection; real-time but poor with small objects. |
| SSD | 2016 | VGG16 | Multi-scale detection with default boxes; balances speed and accuracy. |
| YOLOv3 | 2018 | DarkNet53 | Feature pyramid networks for small objects; improved performance. |
| DETR | 2020 | ResNet101 | Transformer-based; end-to-end but struggles with small civil drones. |
Datasets for Civil Drone Detection
The development of DCNN-based civil drone detection models relies heavily on annotated datasets. However, publicly available large-scale datasets are scarce, necessitating the use of specialized collections. Key datasets include Anti-UAV2020, which provides dual-modal (visible and infrared) videos annotated for various civil drone types and scenarios, and the Drone-vs-Bird Detection Challenge dataset, which focuses on distinguishing small civil drones from birds in diverse environments. In addition, researchers have created custom collections, such as the Anti-Drone Dataset and other UAV data, covering multiple civil drone models, backgrounds, and acquisition conditions. These datasets are crucial for training and evaluating models, though they often lack the scale needed for comprehensive benchmarking. Data augmentation techniques, including rotation, scaling, and noise injection, are commonly employed to expand dataset size and improve model generalization for civil drone detection.
Static Image-Based Civil Drone Detection
Static image detection forms the foundation for identifying civil drones in individual frames. Approaches can be categorized into generic object detection adaptations, transfer learning, and infrared-based methods.
Generic DCNN Adaptations
Many studies adapt existing DCNN models to civil drone detection. For instance, Faster R-CNN with a VGG16 backbone has been used where accuracy is the priority, while YOLO variants are preferred for real-time applications. To address the small size of civil drones, modifications such as multi-scale feature fusion and anchor box optimization are incorporated. For example, UAVDet extends YOLOv3 to four prediction scales, enhancing detection across varying civil drone sizes, and k-means clustering is used to generate drone-specific anchor boxes that improve localization precision. The objective function for anchor box generation can be expressed as:
$$\min \sum_{i=1}^{k} \sum_{x \in S_i} \|x – \mu_i\|^2$$
where $k$ is the number of clusters, $x$ represents the bounding box dimensions assigned to cluster $S_i$, and $\mu_i$ is the cluster centroid. Additionally, super-resolution networks have been integrated with Faster R-CNN to enhance small civil drone details, boosting recall rates.
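As an illustrative sketch of this clustering step (using scikit-learn's Euclidean k-means; YOLO-style implementations often substitute an IoU-based distance, and the data here are synthetic):

```python
# Sketch: derive drone-specific anchor boxes by clustering ground-truth
# (width, height) pairs and taking the centroids as anchors.
import numpy as np
from sklearn.cluster import KMeans

def generate_anchors(box_dims, k=6):
    """box_dims: (N, 2) array of (width, height) in pixels from the training set."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_dims)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sort by area

# Synthetic example: small-drone boxes ranging from 8x6 to 120x90 pixels.
rng = np.random.default_rng(0)
dims = rng.uniform(low=[8, 6], high=[120, 90], size=(500, 2))
print(generate_anchors(dims, k=6))
```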
Transfer Learning and Data Augmentation
Given the limited civil drone datasets, transfer learning is widely adopted. Models pre-trained on large-scale datasets like ImageNet are fine-tuned on drone-specific data, improving performance with minimal samples. For example, Faster R-CNN with ResNet101 achieves high accuracy on the Drone-vs-Bird dataset after fine-tuning. Data augmentation further alleviates overfitting by generating synthetic samples through geometric transformations, color adjustments, and background mixing. Techniques like Gaussian blur and motion blur simulate real-world conditions, enhancing model robustness for civil drone detection in dynamic environments.
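A possible augmentation pipeline along these lines, sketched with torchvision (transform choices and parameters are illustrative; torchvision has no built-in motion-blur transform, so it is approximated here with a simple directional averaging kernel):

```python
# Sketch of drone-image augmentation: geometric flips, color jitter,
# Gaussian blur, and an approximate horizontal motion blur.
import torch
import torch.nn.functional as F
import torchvision.transforms as T

class MotionBlur:
    """Approximate horizontal motion blur with a 1 x k averaging kernel."""
    def __init__(self, k=7):
        self.k = k
        self.kernel = torch.ones(1, 1, 1, k) / k

    def __call__(self, img):  # img: (C, H, W) float tensor in [0, 1]
        c = img.shape[0]
        kernel = self.kernel.repeat(c, 1, 1, 1)
        pad = (self.k // 2, self.k // 2, 0, 0)        # pad width only
        x = F.pad(img.unsqueeze(0), pad, mode="replicate")
        return F.conv2d(x, kernel, groups=c).squeeze(0)

augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    MotionBlur(k=7),
])

# Usage: augmented = augment(image)  # image: (3, H, W) tensor in [0, 1]
```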
Infrared-Based Detection
Infrared sensors are advantageous for civil drone detection in low-light or adverse weather conditions. However, infrared images often have low resolution and poor contrast, complicating detection. DCNN-based approaches, such as modified YOLOv3 with SPP modules and a Generalized IoU loss, improve detection of small civil drones. Fully convolutional networks combined with visual saliency mechanisms highlight potential civil drone regions while suppressing background noise. The signal-to-noise ratio (SNR) in infrared images can be modeled as:
$$\text{SNR} = \frac{\mu_{target} – \mu_{background}}{\sigma_{background}}$$
where $\mu$ and $\sigma$ denote the mean and standard deviation of pixel intensities. For very small civil drones (e.g., below 9×9 pixels), methods like motion compensation and temporal filtering are employed to enhance target visibility.
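A minimal sketch of this local SNR computation for an infrared patch, assuming a NumPy intensity array and a binary mask marking candidate target pixels (names and the synthetic example are illustrative):

```python
# Sketch: local SNR of a candidate target region against its background.
import numpy as np

def local_snr(patch, target_mask):
    """patch: 2-D array of pixel intensities; target_mask: boolean array."""
    mu_t = patch[target_mask].mean()             # mean target intensity
    mu_b = patch[~target_mask].mean()            # mean background intensity
    sigma_b = patch[~target_mask].std() + 1e-6   # background std (avoid /0)
    return (mu_t - mu_b) / sigma_b

# Synthetic example: a dim 3x3 target embedded in a noisy background.
rng = np.random.default_rng(0)
patch = rng.normal(100.0, 5.0, size=(32, 32))
patch[14:17, 14:17] += 20.0
mask = np.zeros((32, 32), dtype=bool)
mask[14:17, 14:17] = True
print(f"SNR = {local_snr(patch, mask):.2f}")
```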
Video-Based Civil Drone Detection
Video data provides temporal context, which is crucial for detecting civil drones in motion. Techniques leverage optical flow and multi-frame correlations to exploit spatiotemporal information.
Optical Flow Methods
Optical flow captures the motion of civil drones between consecutive frames. Traditional methods, like Horn-Schunck and Lucas-Kanade, assume brightness constancy and smooth flow, but they are computationally expensive and sensitive to noise. Deep learning-based approaches, such as FlowNet and RAFT, learn optical flow from data, offering improved accuracy and efficiency. In video object detection, optical flow is used to propagate features from keyframes to non-keyframes, reducing computational load. For example, flow-guided feature propagation can be described as:
$$F_t = \mathcal{W}(F_{t-1}, \Delta_{t-1 \to t})$$
where $F_t$ is the feature at time $t$, $\mathcal{W}$ is the warping function, and $\Delta$ is the optical flow field. This enables real-time civil drone detection by minimizing redundant computations.
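The warping operation $\mathcal{W}$ can be sketched with bilinear sampling, as in the following PyTorch example (tensor names and the use of `grid_sample` are illustrative assumptions rather than a specific paper's implementation):

```python
# Sketch: warp keyframe features along an optical flow field to the
# current frame using bilinear sampling.
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """feat_prev: (N, C, H, W) keyframe features; flow: (N, 2, H, W) in pixels."""
    n, _, h, w = feat_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # displaced pixel coords
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(feat_prev, grid, align_corners=True)

# Sanity check: zero flow should return the features unchanged.
feat = torch.randn(1, 256, 32, 32)
flow = torch.zeros(1, 2, 32, 32)
assert torch.allclose(warp_features(feat, flow), feat, atol=1e-5)
```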
Multi-Frame Correlation Features
For civil drones with weak appearance cues, multi-frame correlation enhances detection by accumulating target energy over time. Spatiotemporal image cubes are generated and stabilized via motion compensation, increasing local SNR. Recurrent Neural Networks (RNNs) and Siamese networks model temporal dependencies, improving consistency in civil drone tracking. The objective function for RNN-based detection includes temporal smoothing terms:
$$L_{temp} = \sum_{t=1}^{T} \| \mathbf{y}_t – \hat{\mathbf{y}}_t \|^2 + \lambda \sum_{t=2}^{T} \| \mathbf{y}_t – \mathbf{y}_{t-1} \|^2$$
where $\mathbf{y}_t$ is the detection output at frame $t$, $\hat{\mathbf{y}}_t$ is the corresponding reference (ground-truth) output, and $\lambda$ controls temporal coherence. Additionally, drone-specific kinematic patterns are exploited to distinguish civil drones from birds or other distractors, reducing false alarms.
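A minimal PyTorch sketch of this loss, assuming a sequence of per-frame detection outputs and reference targets (names and the value of $\lambda$ are illustrative):

```python
# Sketch: per-frame regression error plus a penalty on abrupt
# frame-to-frame changes in the predictions.
import torch

def temporal_loss(preds, targets, lam=0.1):
    """preds, targets: (T, D) tensors of per-frame detection outputs."""
    data_term = ((preds - targets) ** 2).sum()
    smooth_term = ((preds[1:] - preds[:-1]) ** 2).sum()
    return data_term + lam * smooth_term

# Example: 10 frames, 4-dimensional box outputs.
preds = torch.randn(10, 4, requires_grad=True)
loss = temporal_loss(preds, torch.randn(10, 4))
loss.backward()
```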
Challenges and Future Directions
Despite progress, civil drone detection faces several challenges. The variability in civil drone size, shape, and motion, combined with complex backgrounds like clouds, buildings, and birds, leads to high false negatives and positives. Image noise, motion blur, and occlusion further degrade performance. Moreover, the lack of large, annotated datasets hampers model training and evaluation.
Future research should focus on several areas. First, leveraging spatial context through attention mechanisms and task-driven reasoning can enhance civil drone search efficiency. Second, advanced motion modeling, such as kinematic priors and trajectory analysis, can improve temporal context utilization. Third, fusing appearance and motion features in a biologically inspired framework, mimicking primate visual pathways, may boost robustness. Finally, creating large-scale public datasets and exploring self-supervised learning will address data scarcity issues. The integration of transformer architectures and neuromorphic computing could also revolutionize civil drone detection by enabling efficient, explainable, and low-power systems.
Conclusion
Civil drone detection using DCNNs has made significant strides, with two-stage and one-stage algorithms offering a balance between accuracy and speed. Static image methods benefit from transfer learning and infrared adaptations, while video-based approaches exploit optical flow and multi-frame correlations. However, challenges related to small size, complex backgrounds, and data limitations persist. By advancing spatial and temporal context integration, developing comprehensive datasets, and drawing inspiration from biological vision, future systems can achieve the low false alarm rates, high recall, and robustness required for real-world civil drone detection applications. The continuous evolution of DCNNs and emerging technologies will undoubtedly play a pivotal role in securing airspace against unauthorized civil drone activities.
