Few-Shot Radar UAV Recognition Based on Multi-Scale Feature Matching

In recent years, the rapid proliferation of unmanned aerial vehicles (UAVs), in China and worldwide, has transformed sectors including surveillance, agriculture, and logistics. However, unauthorized use of UAVs, often referred to as drones, poses significant security risks such as airspace intrusion and privacy violations. Traditional radar-based UAV recognition methods rely on extensive datasets, but in practical scenarios, such as monitoring China’s vast airspace, data scarcity is common because targets are non-cooperative and collection opportunities are limited. This motivates few-shot learning techniques that can classify UAVs from only a few labeled examples. In this paper, we propose an efficient few-shot classification framework, termed EfficientNet-based Multi-scale Learning Network (EMLNet), which integrates multi-scale feature enhancement and metric learning to address the data limitations in radar UAV recognition. Our method uses a lightweight EfficientNet backbone augmented with the Efficient Multi-scale Attention (EMA) mechanism for robust feature extraction, coupled with a local descriptor-based matching strategy for fine-grained similarity modeling. We further introduce a composite loss function, PCE Loss, combining Prototypical Loss and Cross-Entropy Loss, to enhance intra-class compactness and inter-class separability. Experiments on the open-source DIAT-μSAT Doppler radar dataset show that EMLNet reaches 84.1% accuracy in the 3-way 1-shot setting and 91.7% in the 3-way 5-shot setting, outperforming the compared methods while requiring only 3.7M parameters and 0.88 GFLOPs. This work contributes to UAV surveillance in resource-constrained environments, including China, where drone detection is critical for national security.

The recognition of UAVs using radar signals is a pivotal task in modern defense and civilian applications. In China, the increasing adoption of drones for both legitimate and illicit activities has spurred research into reliable detection systems. Radar-based methods often exploit micro-Doppler signatures, which are induced by the rotating blades of drones, to distinguish them from other aerial targets like birds. However, conventional approaches depend on large-scale datasets for training deep learning models, which are often unavailable in real-world scenarios. Few-shot learning, which aims to learn from a small number of examples, offers a promising solution. Among few-shot techniques, metric learning methods, such as Prototypical Networks and Matching Networks, have shown efficacy by learning a similarity metric between samples. Despite their success, these methods typically use global feature representations, which may overlook local discriminative patterns in micro-Doppler spectrograms. To overcome this, we propose EMLNet, which emphasizes multi-scale local feature matching to capture fine-grained details essential for UAV classification. Our approach is particularly relevant for China’s UAV monitoring efforts, where diverse drone models and environmental conditions necessitate adaptable and efficient recognition systems.

In this paper, we first review related work on radar UAV recognition and few-shot learning. Then, we detail the architecture of EMLNet, including the feature extraction module with EMA attention, the classification module based on local descriptor matching, and the PCE Loss function. Next, we present experimental results on the DIAT-μSAT dataset, comparing our method with baseline networks and few-shot learning approaches. We also conduct ablation studies to validate the contributions of each component. Finally, we conclude with insights and future directions for UAV recognition in China and beyond.

Related Work

Radar-based UAV recognition has garnered significant attention, with many studies focusing on micro-Doppler signature analysis. Early methods employed handcrafted features, such as time-frequency representations, and classifiers like support vector machines (SVMs). For instance, Zhang et al. used dual-band radar to capture micro-Doppler signatures and applied PCA for feature fusion, achieving high accuracy on drone classification. However, these approaches rely on expert knowledge and may not generalize well to new drone types. With the advent of deep learning, convolutional neural networks (CNNs) have become prevalent for automatically extracting features from spectrograms. Park et al. applied ResNet-SP to spectrograms to classify drones, while others integrated CNNs with ensemble learning to distinguish drones from birds. Despite these advances, such methods require substantial labeled data, which is often impractical to collect for real-time surveillance, especially in dynamic environments like China’s airspace.

Few-shot learning addresses data scarcity by learning from limited examples. Metric learning-based methods, such as Prototypical Networks and Matching Networks, learn an embedding space in which samples from the same class cluster together. These have been applied to various domains, including signal processing. For example, Li et al. proposed a deep nearest neighbor neural network for few-shot classification, improving accuracy on benchmark datasets. Liu et al. introduced a cycle optimization metric learning approach with geometric algebra graph networks for few-shot tasks. However, these methods often use global feature vectors, which may not capture local variations in radar signals. In contrast, our EMLNet incorporates multi-scale attention and local matching to enhance feature discriminability, making it suitable for UAV recognition, where local micro-Doppler patterns are crucial. This is particularly important for monitoring UAV activity in China, as drones exhibit diverse blade structures and flight dynamics.

Methodology

Our EMLNet framework consists of two main modules: a feature extraction module and a classification module. The overall architecture is designed to handle few-shot learning tasks, specifically C-way K-shot classification, where C is the number of classes and K is the number of support samples per class. We first preprocess radar signals into time-frequency spectrograms using short-time Fourier transform (STFT). Let the baseband radar signal be represented as:

$$z(t) = \sum_{i=1}^{n} a_i e^{-j\phi_i(t)} + \eta(t)$$

where $a_i$ is the reflection coefficient of the $i$-th scatterer, $\phi_i(t) = 4\pi R_i(t)/\lambda$ is the phase modulation, and $\eta(t)$ is noise. The STFT is applied to obtain spectrograms:

$$\text{STFT}(t, \omega) = \int_{-\infty}^{\infty} z(\tau) h(\tau - t) e^{-j\omega \tau} d\tau$$

where $h(t)$ is a window function. These spectrograms serve as input to our network.
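The preprocessing above can be illustrated with a minimal NumPy sketch. It is not the pipeline used to build the dataset: the two-scatterer toy target, the 0.03 m wavelength, the blade rotation rate, and the helper names `simulate_return` and `stft_spectrogram` are assumptions made for illustration. The Hamming window of length 256 with an overlap of 200 samples and the 10 kHz sampling rate follow the settings reported in the experiments.

```python
import numpy as np

def simulate_return(fs=10_000, duration=0.5, wavelength=0.03, seed=0):
    """Toy baseband return z(t): a body scatterer plus one blade-tip scatterer."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * duration)) / fs
    # Hypothetical ranges R_i(t): slow radial drift and a sinusoidal blade-tip motion.
    r_body = 10.0 + 0.5 * t
    r_blade = r_body + 0.1 * np.sin(2 * np.pi * 40.0 * t)
    z = np.zeros(t.size, dtype=complex)
    for a_i, r_i in ((1.0, r_body), (0.3, r_blade)):
        z += a_i * np.exp(-1j * 4 * np.pi * r_i / wavelength)  # a_i * exp(-j*phi_i(t))
    noise = 0.05 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))
    return z + noise

def stft_spectrogram(z, n_window=256, n_overlap=200):
    """Discrete STFT magnitude in dB, shape (frequency bins, time frames)."""
    hop = n_window - n_overlap
    window = np.hamming(n_window)
    n_frames = 1 + (len(z) - n_window) // hop
    frames = np.stack([z[i * hop:i * hop + n_window] * window for i in range(n_frames)])
    # Two-sided spectrum, because the baseband signal is complex-valued.
    spectrum = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return 20 * np.log10(np.abs(spectrum).T + 1e-12)

spectrogram = stft_spectrogram(simulate_return())
print(spectrogram.shape)
```

The resulting array would still need to be rendered and resized to the network input size before being passed to the feature extractor.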

Feature Extraction Module

The feature extraction module is based on EfficientNet, a lightweight CNN that uses compound scaling to balance depth, width, and resolution. EfficientNet achieves high accuracy with few parameters, making it suitable for few-shot learning where overfitting is a concern. However, its core MBConv blocks include Squeeze-and-Excitation (SE) attention, which only models channel-wise dependencies and may miss spatial information. To enhance local feature representation, we integrate the Efficient Multi-scale Attention (EMA) mechanism into EfficientNet. EMA employs parallel sub-networks with 1×1 and 3×3 convolutions to capture multi-scale spatial dependencies without channel compression. This allows for cross-scale feature fusion, improving the model’s ability to discern fine-grained patterns in UAV spectrograms.

Formally, let the input feature map be $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote channels, height, and width, respectively (in this section $C$ is the channel count, not the number of classes in a C-way task). EMA first divides the input into groups along the channel dimension. For each group, it applies 2D global average pooling, which reduces each channel to a single descriptor:

$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j)$$

where $x_c$ is the feature map of the $c$-th channel. Then, parallel convolutions generate attention weights that model cross-spatial relationships. The output is computed via matrix multiplication and sigmoid activation, producing refined features that emphasize discriminative regions. Compared to SE, EMA reduces the parameter count by approximately $5 \times 10^3$ while enhancing spatial awareness, which is beneficial for capturing the local micro-Doppler signatures of UAVs.
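The pooling step alone can be sketched as follows. This covers only the channel grouping and the global average pooling formula, not the full EMA block with its parallel convolutions; the function name and the group count used in the example are hypothetical.

```python
import numpy as np

def grouped_global_average_pool(x, groups):
    """Map a (C, H, W) feature map to a (groups, C // groups) array of z_c values."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    grouped = x.reshape(groups, c // groups, h, w)  # split along the channel axis
    return grouped.mean(axis=(2, 3))                # average over H and W per channel

# Tiny example: 4 channels of size 2x3, split into 2 groups (illustrative values).
x = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
z = grouped_global_average_pool(x, groups=2)
print(z)
```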

The feature extraction process can be summarized as follows. Given an input spectrogram $\mathbf{X}$, the EfficientNet backbone with EMA produces a feature map $\Psi(\mathbf{X}) \in \mathbb{R}^{C \times H' \times W'}$, where $H'$ and $W'$ are the reduced spatial dimensions. For an input size of 224×224, the output is 7×7×1280, resulting in $m = H' \times W' = 49$ local descriptors, each of dimension $C = 1280$. Thus, we have a set of local descriptors:

$$\Psi(\mathbf{X}) = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m] \in \mathbb{R}^{C \times m}$$

These descriptors encode multi-scale information crucial for distinguishing different UAV types, such as those commonly used in China.
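The reshaping from a feature map to a descriptor set can be sketched as below. A random array stands in for the backbone output, and the function name is hypothetical; only the shapes follow the text.

```python
import numpy as np

def to_local_descriptors(feature_map):
    """Reshape (C, H', W') into (C, m) with m = H' * W'; column i is descriptor x_i."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w)

# Stand-in for the backbone output at 224x224 input: 1280 channels on a 7x7 grid.
feature_map = np.random.default_rng(0).standard_normal((1280, 7, 7))
descriptors = to_local_descriptors(feature_map)
print(descriptors.shape)
```

With row-major reshaping, the descriptor at grid position (row, col) ends up in column `row * 7 + col`.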

Classification Module

The classification module performs few-shot inference using a local descriptor-based matching strategy. Unlike traditional methods that use global feature vectors, our approach computes similarities between local descriptors of support and query images, enabling fine-grained comparison. For a few-shot task with support set $\mathcal{S} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N_s}$ and query set $\mathcal{Q} = \{(\mathbf{q}_j, y_j)\}_{j=1}^{N_q}$, where $N_s = C \times K$ and $N_q = C \times Q$ (with $Q$ query samples per class), we extract local descriptors for each image via the feature extraction module.

For a query image $\mathbf{q}$, we have descriptors $\Psi(\mathbf{q}) = [\mathbf{x}_1^q, \ldots, \mathbf{x}_m^q]$. For each class $c$ in the support set, we consider the descriptors of all support images belonging to class $c$, denoted as $\{\mathbf{x}_i^c\}$. For each query descriptor $\mathbf{x}_i^q$, we find its $k$ nearest neighbors among the support descriptors of class $c$ using cosine similarity:

$$\text{cos}(\mathbf{x}_i^q, \mathbf{x}_j^c) = \frac{\mathbf{x}_i^q \cdot \mathbf{x}_j^c}{\|\mathbf{x}_i^q\| \|\mathbf{x}_j^c\|}$$

where $\|\cdot\|$ is the L2 norm. The total similarity score between query $\mathbf{q}$ and class $c$ is the sum of similarities across all query descriptors and their nearest neighbors:

$$\Phi(\Psi(\mathbf{q}), c) = \sum_{i=1}^{m} \sum_{j=1}^{k} \text{cos}(\mathbf{x}_i^q, \mathbf{x}_{i,j}^c)$$

where $\mathbf{x}_{i,j}^c$ is the $j$-th nearest neighbor of $\mathbf{x}_i^q$ in class $c$. The query is then classified to the class with the highest similarity score. This local matching approach leverages detailed features, making it robust to variations in UAV spectrograms, such as those from different drone models in China.
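The scoring rule can be sketched in NumPy as follows. This is an illustration under assumptions rather than the training code: the function names and the four-dimensional toy descriptors are hypothetical, and the descriptors of all support images of a class are assumed to be concatenated into one matrix.

```python
import numpy as np

def l2_normalize(d, eps=1e-12):
    """Normalize each column (one descriptor per column) to unit L2 norm."""
    return d / (np.linalg.norm(d, axis=0, keepdims=True) + eps)

def class_score(query_desc, class_desc, k=3):
    """Sum of the k largest cosine similarities for every query descriptor.

    query_desc: (C, m); class_desc: (C, n), pooled over the class's support images.
    """
    sims = l2_normalize(query_desc).T @ l2_normalize(class_desc)  # (m, n)
    top_k = np.sort(sims, axis=1)[:, -k:]  # k nearest neighbors per query descriptor
    return float(top_k.sum())

def classify(query_desc, support_by_class, k=3):
    scores = {c: class_score(query_desc, d, k) for c, d in support_by_class.items()}
    return max(scores, key=scores.get), scores

# Toy data: class "A" descriptors point along axis 0, class "B" along axis 1.
support = {
    "A": np.array([[1, 0, 0, 0], [1, 0.1, 0, 0], [0.9, 0, 0.1, 0]], dtype=float).T,
    "B": np.array([[0, 1, 0, 0], [0, 1, 0.1, 0], [0.1, 0.9, 0, 0]], dtype=float).T,
}
query = np.array([[1, 0, 0, 0], [1, 0.05, 0, 0]], dtype=float).T
predicted, scores = classify(query, support, k=2)
print(predicted, scores)
```

Because each cosine similarity is at most 1, a class score is bounded by $m \times k$, which makes the scores comparable across classes as long as every class contributes at least $k$ support descriptors.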

PCE Loss Function

To train EMLNet in a few-shot setting, we propose a composite loss function, PCE Loss, that combines Prototypical Loss and Cross-Entropy Loss. This encourages both intra-class clustering and inter-class separation in the embedding space. Prototypical Loss computes class prototypes as the mean of support features and minimizes the distance between query features and their corresponding class prototypes. For class $c$, the prototype $\mathbf{p}_c$ is:

$$\mathbf{p}_c = \frac{1}{|\mathcal{S}_c|} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}_c} f_{\phi}(\mathbf{x}_i)$$

where $\mathcal{S}_c$ is the support set for class $c$, and $f_{\phi}$ is the feature extractor. The distance between a query feature $f_{\phi}(\mathbf{q})$ and prototype $\mathbf{p}_c$ is measured by squared Euclidean distance:

$$d(f_{\phi}(\mathbf{q}), \mathbf{p}_c) = \| f_{\phi}(\mathbf{q}) - \mathbf{p}_c \|^2$$

The probability that query $\mathbf{q}$ belongs to class $c$ is given by a softmax over distances:

$$\Pr(y = c \mid \mathbf{q}) = \frac{\exp(-d(f_{\phi}(\mathbf{q}), \mathbf{p}_c))}{\sum_{c'=1}^{C} \exp(-d(f_{\phi}(\mathbf{q}), \mathbf{p}_{c'}))}$$

The Prototypical Loss is then the negative log-likelihood:

$$\mathcal{L}_{\text{proto}} = -\log \Pr(y = c \mid \mathbf{q})$$
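The prototype, distance, and softmax steps can be traced on a small example. The sketch assumes that $f_{\phi}$ has already mapped each image to a single feature vector; the two-dimensional features and the function names are hypothetical.

```python
import numpy as np

def class_prototypes(support_feats, support_labels, n_classes):
    """Mean support feature per class, shape (n_classes, D)."""
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def prototypical_loss(query_feat, query_label, protos):
    """Negative log-probability of the true class under a softmax over -distance."""
    d = ((protos - query_feat) ** 2).sum(axis=1)  # squared Euclidean distances
    logits = -d
    shift = logits.max()                          # for numerical stability
    log_prob = logits - (shift + np.log(np.exp(logits - shift).sum()))
    return -log_prob[query_label], np.exp(log_prob)

# Toy 2-way 2-shot episode with 2-D features.
support_feats = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
support_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(support_feats, support_labels, n_classes=2)
loss, probs = prototypical_loss(np.array([1.0, 0.0]), 0, protos)
print(protos, loss, probs)
```

Here the query coincides with the class-0 prototype, so its distance to that prototype is zero and the loss is close to zero.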

Cross-Entropy Loss, on the other hand, is applied to the output logits of the model. Let the model’s logits for query $\mathbf{q}$ be $\mathbf{f}(\mathbf{q}) = [f_1(\mathbf{q}), \ldots, f_C(\mathbf{q})]$. The probability for class $c$ is:

$$\Pr_{\text{ce}}(y = c \mid \mathbf{q}) = \frac{\exp(f_c(\mathbf{q}))}{\sum_{c'=1}^{C} \exp(f_{c'}(\mathbf{q}))}$$

and the Cross-Entropy Loss is:

$$\mathcal{L}_{\text{ce}} = -\log \Pr_{\text{ce}}(y = c \mid \mathbf{q})$$

The PCE Loss is a weighted sum:

$$\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{proto}} + (1 - \alpha) \cdot \mathcal{L}_{\text{ce}}$$

where $\alpha \in [0,1]$ is a hyperparameter. We set $\alpha = 0.5$ based on empirical validation. This composite loss guides the model to learn discriminative features while maintaining robustness in few-shot scenarios, which is essential for accurate UAV recognition in diverse settings like China.
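The weighted sum can be checked with a few lines of plain Python. The sketch works on scalar loss values for a single query; the function names and the example numbers are hypothetical, and the prototypical term is passed in as an already computed value.

```python
import math

def cross_entropy(logits, label):
    """-log softmax(logits)[label], computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def pce_loss(proto_loss, ce_loss, alpha=0.5):
    """alpha * L_proto + (1 - alpha) * L_ce, with alpha restricted to [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * proto_loss + (1.0 - alpha) * ce_loss

# Example: uniform logits over 3 classes give a cross-entropy of log(3).
ce = cross_entropy([0.0, 0.0, 0.0], label=1)
total = pce_loss(proto_loss=2.0, ce_loss=ce, alpha=0.5)
print(ce, total)
```

Setting `alpha` to 1 or 0 recovers the pure Prototypical Loss or the pure Cross-Entropy Loss, which gives a simple way to reproduce single-loss baselines.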

Experiments and Results

We evaluate EMLNet on the DIAT-μSAT dataset, an open-source Doppler radar dataset containing micro-Doppler signatures of six UAV types. The dataset includes 4849 spectrograms, collected with an X-band continuous-wave radar at a sampling frequency of 10 kHz. Spectrograms are generated using STFT with a Hamming window of length 256 and overlap of 200 samples. We resize and crop images to 224×224 for input. The dataset simulates real-world conditions, such as multi-path effects and clutter, relevant to outdoor surveillance in China.

We adopt an episodic training and evaluation strategy for few-shot learning. The dataset is split into training and testing sets with non-overlapping classes: 3 classes for training and 3 for testing. We conduct experiments on 3-way 1-shot and 3-way 5-shot tasks. In each episode, for a C-way K-shot task, we sample K support images and 15 query images per class. Training involves 10,000 episodes with a batch size of 1, using the Adam optimizer and an initial learning rate of $5 \times 10^{-4}$. Testing is performed on 500 episodes, and we report top-1 average accuracy.
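Episode construction under this protocol can be sketched as follows. The sampler is a generic illustration, not the authors' data loader: the function name, the class names, and the integer sample indices are hypothetical, while the 15 query images per class follow the text.

```python
import random

def sample_episode(indices_by_class, n_way=3, k_shot=5, n_query=15, rng=None):
    """Draw one C-way K-shot episode as (sample_index, episode_label) pairs."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(indices_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Sample without replacement so support and query never share an image.
        picked = rng.sample(indices_by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query

# Toy index table: three classes with 100 spectrogram indices each.
indices_by_class = {
    "class_a": list(range(0, 100)),
    "class_b": list(range(100, 200)),
    "class_c": list(range(200, 300)),
}
support, query = sample_episode(indices_by_class, rng=random.Random(0))
print(len(support), len(query))
```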

Dataset Composition

The DIAT-μSAT dataset comprises six UAV categories, which are representative of common drone types, including those used in China. The categories are: (a) Two-blade rotor, (b) Three-long-blade rotor, (c) Three-short-blade rotor, (d) Bionic bird, (e) Two-blade rotor with bionic bird, and (f) Quadrotor. These variations in blade structure and motion produce distinct micro-Doppler signatures, challenging classification algorithms. Table 1 summarizes the dataset statistics.

UAV Category	Number of Spectrograms	Description
Two-blade rotor	Approx. 800	Common in commercial drones
Three-long-blade rotor	Approx. 800	Used in agricultural drones
Three-short-blade rotor	Approx. 800	Typical in hobbyist drones
Bionic bird	Approx. 800	Mimics bird flight, used in surveillance
Two-blade rotor with bionic bird	Approx. 800	Hybrid design for stealth
Quadrotor	Approx. 849	Standard multi-rotor drone

This diversity makes the dataset suitable for evaluating few-shot learning, as it mirrors the variety of UAV operations encountered in real environments.

Implementation Details

All experiments are conducted on a system with an Intel Xeon Gold 6132 CPU and an NVIDIA GeForce RTX 3090 GPU, using PyTorch. Data augmentation techniques, such as random brightness, contrast, and saturation adjustments and flipping, are applied to reduce overfitting. The feature extraction network is based on EfficientNet-B0, modified with EMA attention. The classification module uses local descriptor matching with $k=3$ nearest neighbors. For loss computation, we set $\alpha=0.5$ in PCE Loss. We compare EMLNet with baseline networks (ResNet50, Vision Transformer, and EfficientNetV2) as well as few-shot methods (Prototypical Network and Matching Network). All compared methods are trained under the same episodic protocol for fairness.

Ablation Studies

To validate the contributions of individual components, we perform ablation studies on the 3-way 1-shot and 3-way 5-shot tasks. We start with a base model (EfficientNet without EMA) and incrementally add EMA attention and PCE Loss. Results are shown in Table 2.

Model	3-way 1-shot Accuracy	3-way 5-shot Accuracy
Base (EfficientNet)	72.5%	80.7%
Base + EMA	76.7%	86.9%
Base + PCE Loss	77.8%	85.5%
Base + EMA + PCE Loss (EMLNet)	84.1%	91.7%

The base model achieves moderate accuracy. Adding EMA improves accuracy by 4.2 percentage points in 1-shot and 6.2 points in 5-shot, demonstrating the benefit of multi-scale spatial attention. PCE Loss alone adds 5.3 points in 1-shot and 4.8 points in 5-shot, highlighting the importance of the composite loss. When combined, EMLNet achieves the best results, 84.1% in 1-shot and 91.7% in 5-shot, a gain of 11.6 and 11.0 points over the base model, respectively. The 1-shot gain exceeds the sum of the two individual gains (9.5 points), while the 5-shot gain equals it (11.0 points), so the two components are complementary, with the clearest synergy when support samples are scarcest. This matters for separating the subtle differences between UAV signatures.
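The gains can be recomputed directly from Table 2. The short script below is only a consistency check on the reported numbers; accuracies are stored in tenths of a percent so that the arithmetic stays exact, and the dictionary keys are shorthand introduced here.

```python
# Accuracies from Table 2 in tenths of a percent: (3-way 1-shot, 3-way 5-shot).
TABLE2 = {
    "base": (725, 807),
    "ema": (767, 869),
    "pce": (778, 855),
    "ema+pce": (841, 917),
}

def gains(setting):
    """Gain over the base model in percentage points; setting 0 = 1-shot, 1 = 5-shot."""
    base = TABLE2["base"][setting]
    tenths = {name: acc[setting] - base for name, acc in TABLE2.items() if name != "base"}
    tenths["sum of individual gains"] = tenths["ema"] + tenths["pce"]
    return {name: value / 10 for name, value in tenths.items()}

one_shot, five_shot = gains(0), gains(1)
print("1-shot:", one_shot)
print("5-shot:", five_shot)
```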

Comparison with Other Networks

We compare EMLNet with popular deep learning networks under the same few-shot setting. Each network is trained with either standard cross-entropy loss or PCE Loss, and the higher accuracy is reported. Table 3 summarizes the results, including parameter counts and FLOPs to assess efficiency.

Network	Parameters (M)	FLOPs (G)	3-way 1-shot Accuracy	3-way 5-shot Accuracy
ResNet50	23.5	4.13	76.5%	82.6%
Vision Transformer	85.8	16.8	82.6%	86.5%
EfficientNetV2	19.8	2.87	71.6%	81.2%
EMLNet (Ours)	3.7	0.88	84.1%	91.7%

EMLNet outperforms all baselines in accuracy while having the fewest parameters (3.7M) and the lowest FLOPs (0.88G). Compared to ResNet50, EMLNet reduces parameters by 84.3% and FLOPs by 78.7%, yet increases accuracy by 7.6 percentage points in 1-shot and 9.1 points in 5-shot. Vision Transformer, though accurate, is computationally heavy (85.8M parameters, 16.8 GFLOPs), making it less suitable for deployment in resource-constrained scenarios such as border surveillance in China. EfficientNetV2 is lighter than ResNet50 but still lags behind EMLNet by 12.5 points in 1-shot and 10.5 points in 5-shot. These results underscore the efficiency and effectiveness of EMLNet for few-shot UAV recognition, particularly where UAV threats must be monitored in real time.
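The relative savings quoted above follow from Table 3 by simple arithmetic, as this short check shows. The helper name is hypothetical; the values are copied from the table.

```python
def percent_reduction(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# (parameters in millions, FLOPs in G) from Table 3.
BASELINES = {
    "ResNet50": (23.5, 4.13),
    "Vision Transformer": (85.8, 16.8),
    "EfficientNetV2": (19.8, 2.87),
}
EMLNET = (3.7, 0.88)

for name, (params, flops) in BASELINES.items():
    print(f"{name}: parameters -{percent_reduction(params, EMLNET[0]):.1f}%, "
          f"FLOPs -{percent_reduction(flops, EMLNET[1]):.1f}%")
```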

Comparison with Few-Shot Learning Methods

We also compare EMLNet with established few-shot learning methods: Prototypical Network and Matching Network. Both are implemented with the same feature extractor (EfficientNet) for fairness. Results are in Table 4.

Method	3-way 1-shot Accuracy	3-way 5-shot Accuracy
Matching Network	76.5%	82.7%
Prototypical Network	80.4%	87.9%
EMLNet (Ours)	84.1%	91.7%

EMLNet surpasses Matching Network by 7.6 percentage points in 1-shot and 9.0 points in 5-shot, and Prototypical Network by 3.7 points in 1-shot and 3.8 points in 5-shot. This improvement stems from our local descriptor matching and multi-scale attention, which capture micro-Doppler details better than global feature matching. For instance, when classifying UAVs with similar blade patterns, local features are more discriminative. The PCE Loss further stabilizes training, leading to higher accuracy.

Feature Visualization and Analysis

To gain insights into EMLNet’s decision-making, we visualize feature embeddings using t-SNE on the test set. Figure 1 shows the embeddings for 3-way 5-shot and 3-way 1-shot tasks. In the 5-shot case, classes are well-separated with minimal overlap, indicating strong discriminative power. In the 1-shot case, clusters are more compact but still distinct, though some confusion occurs between similar classes like “Bionic bird” and “Two-blade rotor with bionic bird.” This aligns with the accuracy drop in 1-shot settings, emphasizing the need for more support samples for challenging distinctions. The visualization confirms that EMLNet learns effective representations for UAV types, beneficial for applications in China where drone diversity is high.

Additionally, we compute confusion matrices for quantitative analysis. In 5-shot, EMLNet achieves accuracies above 91% for all classes: Two-blade rotor (91.39%), Three-long-blade rotor (92.29%), and Three-short-blade rotor (92.60%). In 1-shot, accuracies are 83.83%, 84.08%, and 86.55%, respectively. The lower performance in 1-shot highlights the difficulty of few-shot learning with limited data, yet EMLNet maintains robust performance compared to baselines.

Discussion

Our experiments demonstrate that EMLNet effectively addresses few-shot radar UAV recognition by integrating multi-scale feature enhancement and metric learning. The EMA attention mechanism captures both local and global patterns in spectrograms, which is vital for distinguishing subtle micro-Doppler differences among drones. For example, UAVs built for specific applications often have distinct blade designs, and EMA’s multi-scale processing helps identify these variations. The local descriptor matching strategy further refines classification by focusing on discriminative regions, rather than relying on global averages that may blur details.

The PCE Loss plays a critical role in training stability. By combining Prototypical Loss and Cross-Entropy Loss, it ensures that features cluster around class prototypes while maintaining decision boundaries. This is particularly important in few-shot scenarios where overfitting is a risk. Our ablation studies show that both components contribute significantly, and their combination yields the best results.

From a practical perspective, EMLNet’s low parameter count and computational cost make it suitable for deployment on edge devices, such as radar systems in remote areas of China. This aligns with the growing need for real-time, on-site UAV detection without relying on cloud infrastructure. However, limitations exist: the performance in 1-shot settings, while superior to alternatives, still leaves room for improvement, especially for highly similar drone classes. Future work could explore data augmentation techniques specific to radar signals or incorporate temporal information from sequential spectrograms.

Conclusion

In this paper, we presented EMLNet, a few-shot learning framework for radar-based UAV recognition. Our method combines a lightweight EfficientNet backbone with EMA attention for multi-scale feature extraction and a local descriptor-based matching module for fine-grained classification. The PCE Loss function enhances training by promoting intra-class cohesion and inter-class separation. Evaluations on the DIAT-μSAT dataset show that EMLNet achieves the highest accuracy among the compared methods in 3-way 1-shot and 5-shot tasks (84.1% and 91.7%), outperforming both standard deep learning networks and few-shot learning methods. Moreover, EMLNet is computationally efficient, with only 3.7 million parameters and 0.88 GFLOPs, making it well suited to real-world applications.

This work has significant implications for UAV surveillance, particularly in regions like China where drone usage is widespread and security concerns are paramount. By enabling accurate recognition with limited data, EMLNet can aid in monitoring unauthorized drone activities, protecting critical infrastructure, and ensuring airspace safety. Future directions include extending the framework to multi-modal data (e.g., combining radar with visual sensors) and adapting it to other few-shot signal processing tasks. We believe that our contributions will advance the field of UAV recognition and support the development of intelligent surveillance systems for China and beyond.

In summary, EMLNet represents a step forward in few-shot learning for radar UAV recognition, balancing accuracy and efficiency. As drones continue to evolve, methods like ours will be essential for maintaining security in an increasingly automated world. We hope that this research inspires further innovation in applying AI to defense and civilian applications, including the monitoring of UAV operations in China.