Unmanned Aerial Vehicles (UAVs) in China: A Comprehensive Review of Individual Drone Identification Technologies Based on Acoustic Fingerprinting

The rapid advancement of artificial intelligence and communication technologies is catalyzing a new industrial revolution at an unprecedented pace. Within this transformative landscape, Unmanned Aerial Vehicles (UAVs), or drones, have emerged as pivotal tools, offering unique aerial perspectives, operational flexibility, and a wide array of application scenarios. Their value and potential are being realized across numerous sectors, including parcel delivery, environmental monitoring, and agricultural management. In China, e-commerce giants such as Alibaba, JD.com, and Meituan are actively exploring and deploying UAV-based solutions to establish efficient pathways for delivering packages directly from warehouses to customers. The potential of China’s UAV logistics sector is immense: by 2035, the output value of China’s drone logistics industry is projected to surpass one trillion yuan, and the logistics application market is expected to grow at roughly 20% annually, pointing to broad prospects for low-altitude logistics and distribution.

However, this promising application is accompanied by significant security vulnerabilities, particularly during the pickup phase of a delivery mission. When a logistics company dispatches a legitimate UAV to a shipper’s location, a malicious third party can orchestrate a spoofing attack by deploying an unauthorized drone to impersonate the legitimate one, aiming to intercept the package. The shipper, relying solely on visual inspection, faces a considerable challenge in accurately authenticating the UAV’s identity, especially when the attacker uses a drone of the identical model. This critical security gap necessitates robust and reliable UAV identity recognition technology. Current research explores various modalities for drone identification, including radar-based, vision-based, radio frequency (RF)-based, and audio-based methods.

Radar-based methods utilize reflected radar signals for detection and classification. By analyzing micro-Doppler signatures, different UAV types (e.g., fixed-wing, multi-rotor) can be distinguished. While these methods offer a relatively long detection range, they involve high deployment costs for radar equipment and lack the granularity to differentiate between individual drones of the same model. Vision-based techniques employ cameras and computer vision algorithms to detect and classify drones from images or video streams. Although effective for differentiating drone models, these methods share the same fundamental limitation as visual human inspection: they cannot reliably distinguish between identical-looking drones. RF-based identification works by intercepting and analyzing the wireless communication signals between the drone and its controller. These methods can achieve high accuracy in model classification but require dedicated RF systems, are susceptible to environmental interference, and their effectiveness against identity spoofing within the same model is not well-established.

In contrast, acoustic or audio-based methods leverage the sound generated by a drone’s motors and propellers during flight. Every UAV, due to inherent manufacturing imperfections in its mechanical components (e.g., motor imbalances, propeller blade variances), possesses a unique acoustic signature or “voiceprint.” This acoustic fingerprint remains distinctive even among drones of the same make and model. Compared to radar, vision, and RF techniques, audio-based identification offers distinct advantages: it requires no additional specialized hardware deployment (a ubiquitous smartphone microphone suffices), operates on a simple principle, and is low-cost. Most importantly, it has the inherent potential to perform individual identification of drones from the same model lineup, which is precisely the scenario exploited in the spoofing attack described earlier. However, existing research on acoustic drone identification has notable limitations. Many studies focus only on inter-model classification rather than individual identification. Data collection is often conducted in controlled, indoor environments, limiting practical applicability. Furthermore, the critical aspect of system security—specifically, its resilience against replay attacks where a malicious party replays a recorded legitimate drone sound—is frequently unaddressed. To bridge these gaps, this article delves into the design and validation of a robust, acoustic fingerprint-based individual UAV identity recognition system, tailored to secure logistics and delivery operations in real-world outdoor settings.

Threat Model and System Objective

The considered application scenario is a UAV-based parcel pickup service. A shipper places an order via a logistics company’s platform (e.g., a mobile app). The company then dispatches a registered, legitimate UAV to the specified pickup location. The core threat involves a malicious adversary who, having intercepted the logistics information, deploys an unauthorized UAV of the same model to the same location with the intent of stealing the package. Upon the drone’s arrival, the shipper must decide whether to hand over the package. Relying on visual identification is futile when the drones are visually identical. The objective of the proposed system is to provide the shipper with a reliable authentication tool. Using a standard mobile device, the shipper records a short audio clip of the hovering drone. The system analyzes this audio clip and determines whether it originates from a legitimate, registered UAV or an imposter, thereby mitigating the spoofing attack.

System Design: UVBRS (UAV Voiceprint-Based Recognition System)

The proposed UVBRS operates through a sequential pipeline comprising three main stages: Signal Preprocessing, Feature Extraction, and Classification & Recognition. The overall architecture is designed to handle the challenges of outdoor audio capture and open-set recognition (where unknown, unregistered drones may be encountered).

Stage 1: Signal Preprocessing

Raw audio captured outdoors contains the target drone’s sound mixed with environmental noise (wind, traffic, etc.) and device noise. Preprocessing aims to enhance the signal-to-noise ratio (SNR). First, peak normalization is applied to the time-domain signal $x(t)$ to balance amplitude levels across different recordings:

$$x_{\text{norm}}(t) = \frac{x(t)}{\max(|x(t)|)}$$

The normalized signal is then segmented into overlapping frames (e.g., 0.1 s length with 50% overlap) for short-time analysis. For denoising, we employ the Empirical Wavelet Transform (EWT), an adaptive decomposition method better suited to this task than fixed-basis wavelet transforms because it constructs its filter bank directly from the signal’s Fourier spectrum. Let $X(f)$ be the Fourier transform of a signal frame:

$$X(f) = \int_{-\infty}^{\infty} x(t) e^{-j2\pi ft} dt$$

The EWT algorithm detects boundaries $\omega_n$ in the magnitude spectrum $|X(f)|$ to partition the frequency domain into $N$ segments. For each segment $[\omega_n, \omega_{n+1}]$, an empirical wavelet $\phi_n(t)$ is constructed. The signal is then decomposed into coefficients:

$$W_n(t) = \int_{-\infty}^{\infty} x(\tau) \phi_n^*(\tau - t) d\tau$$

Noise components typically reside in high-frequency bands and correspond to coefficients with small magnitudes. A soft-thresholding function is applied to these coefficients before reconstructing the denoised signal $\hat{x}(t)$. This adaptive process effectively suppresses high-frequency noise while preserving the crucial harmonic structures of the drone’s audio.
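For concreteness, the following is a minimal Python sketch of the Stage 1 pipeline. It is illustrative rather than a faithful EWT implementation: boundary detection is reduced to picking the strongest spectral peaks and splitting at their midpoints, and the Meyer-type empirical wavelets are replaced by ideal band segments with per-band soft thresholding. All function names and parameter values here are assumptions, not the authors’ code.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_normalize(x):
    """Peak normalization: x_norm(t) = x(t) / max|x(t)|."""
    return x / np.max(np.abs(x))

def frame_signal(x, fs, frame_len=0.1, overlap=0.5):
    """Segment the signal into overlapping frames (0.1 s, 50% overlap)."""
    n = int(frame_len * fs)
    hop = int(n * (1 - overlap))
    return np.stack([x[i:i + n] for i in range(0, len(x) - n + 1, hop)])

def ewt_style_denoise(frame, n_bands=6):
    """Simplified EWT-flavored denoising: partition the spectrum at midpoints
    between its strongest local maxima, soft-threshold each band's
    coefficients, and reconstruct with the inverse FFT."""
    X = np.fft.rfft(frame)
    mag = np.abs(X)
    peaks, _ = find_peaks(mag)
    if len(peaks) >= n_bands:
        top = np.sort(peaks[np.argsort(mag[peaks])[-n_bands:]])
        edges = np.r_[0, (top[:-1] + top[1:]) // 2, len(mag)]
    else:  # fall back to a uniform partition if too few peaks are found
        edges = np.linspace(0, len(mag), n_bands + 1).astype(int)
    Y = X.copy()
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = X[lo:hi]
        if len(band) == 0:
            continue
        sigma = np.median(np.abs(band)) / 0.6745          # robust noise estimate
        t = sigma * np.sqrt(2 * np.log(max(hi - lo, 2)))  # universal threshold
        shrunk = np.maximum(np.abs(band) - t, 0.0)        # soft thresholding
        Y[lo:hi] = shrunk * np.exp(1j * np.angle(band))   # keep original phase
    return np.fft.irfft(Y, n=len(frame))
```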

Stage 2: Feature Extraction

This stage transforms the preprocessed audio into a discriminative feature vector. The core insight is that the unique identity of a UAV is encoded in the specific amplitudes and fine frequency variations of its harmonic components, not just its fundamental frequency.

Step 1: Frequency Transformation. The denoised frame $\hat{x}(t)$ is converted to the frequency domain via the Fast Fourier Transform (FFT) to obtain its power spectrum $P(f)$.

Step 2: Specialized Audio Filter Bank (SAFB). Traditional features like Mel-Frequency Cepstral Coefficients (MFCCs) are modeled on human auditory perception, which emphasizes low frequencies. However, for drone identification, discriminative information is spread across harmonics. Therefore, a custom filter bank is designed. The SAFB consists of $N$ triangular filters. The first $N_{\text{low}}$ filters are spaced linearly over the critical lower frequency band (e.g., 0-2000 Hz) where harmonic differences are most pronounced:

$$f_k = f_{\text{low\_start}} + (k-1)\Delta, \quad \text{for } k=1,2,\dots,N_{\text{low}}$$
$$\Delta = \frac{f_{\text{low\_end}} - f_{\text{low\_start}}}{N_{\text{low}} - 1}$$

The remaining $N_{\text{high}}$ filters cover higher frequencies (e.g., 2000-8000 Hz) with a logarithmic spacing to capture broader spectral trends:

$$f_{N_{\text{low}}+m} = f_{\text{low\_end}} + \beta m^p, \quad \text{for } m=1,2,\dots,N_{\text{high}}$$
$$\beta = \frac{F_{\text{max}} - f_{\text{low\_end}}}{(N_{\text{high}})^p}$$

Each triangular filter $H_m(f)$ is defined by its lower boundary $l_m$, center frequency $f_m$, and upper boundary $u_m$, with its amplitude scaled by a factor $A_m = (f_m / 2000)^{-\delta}$ to gently emphasize lower harmonic bands. The filterbank output $S(m)$ for the $m$-th filter is the weighted sum of the power spectrum within its bandwidth.
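A sketch of the SAFB construction, directly following the spacing and amplitude formulas above, is given below (Python/NumPy). The non-zero `f_low_start` and the epsilon guards are assumptions added to avoid a degenerate first filter and a division by zero at the band edges; they are not specified in the original design.

```python
import numpy as np

def build_safb(fs=44100, n_fft=1024, n_low=10, n_high=16,
               f_low_start=50.0, f_low_end=2000.0, f_max=8000.0,
               p=1.6, delta=0.5):
    """Specialized Audio Filter Bank: n_low linearly spaced triangular filters
    up to f_low_end, n_high power-law spaced filters up to f_max, each scaled
    by A_m = (f_m / 2000)^(-delta)."""
    # Center frequencies: linear low band, power-law high band.
    low = f_low_start + np.arange(n_low) * (f_low_end - f_low_start) / (n_low - 1)
    m = np.arange(1, n_high + 1)
    beta = (f_max - f_low_end) / (n_high ** p)
    high = f_low_end + beta * m ** p
    centers = np.concatenate([low, high])            # N = n_low + n_high centers
    pts = np.concatenate([[0.0], centers, [f_max]])  # supplies l_m, f_m, u_m
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_low + n_high, len(freqs)))
    eps = 1e-9
    for i in range(n_low + n_high):
        l, c, u = pts[i], pts[i + 1], pts[i + 2]
        rising = (freqs - l) / max(c - l, eps)
        falling = (u - freqs) / max(u - c, eps)
        tri = np.maximum(0.0, np.minimum(rising, falling))
        fb[i] = (c / 2000.0) ** (-delta) * tri       # amplitude scaling A_m
    return fb, freqs
```

The filter bank output $S(m)$ is then simply `S = fb @ power_spectrum`, i.e., the weighted sum of the power spectrum within each filter’s bandwidth.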

Step 3: Cepstral Coefficient and Energy Feature Extraction. To de-correlate the filter bank outputs and compress the information, the Discrete Cosine Transform (DCT) is applied to the log-scaled filter energies:

$$C(w) = \sqrt{\frac{2}{N}} \sum_{m=1}^{N} \log_{10}(S(m)) \cdot \cos\left(\frac{\pi w (2m-1)}{2N}\right)$$

where $C(w)$ represents the $w$-th cepstral coefficient. Typically, the first 13-26 coefficients are retained. To augment the feature set, the frame’s log energy and its standard deviation are computed and appended, forming the final feature vector $\mathbf{v}$ for the audio frame. The parameters for the SAFB are summarized below:

| Description | Parameter | Value | Rationale |
| --- | --- | --- | --- |
| Sampling frequency | $f_s$ | 44.1 kHz | Standard audio quality; captures relevant harmonics. |
| FFT points | $N_{FFT}$ | 1024 | Balances frequency resolution and computational load. |
| Maximum frequency | $F_{max}$ | 8 kHz | Covers significant harmonic content of most UAVs. |
| Low-band filters | $N_{low}$ | 10 | Fine resolution in the most discriminative 0-2 kHz band. |
| High-band filters | $N_{high}$ | 16 | Captures broader spectral shape at higher frequencies. |
| Amplitude scaling factor | $\delta$ | 0.5 | Mild emphasis on lower-frequency components. |
| Log spacing parameter | $p$ | 1.6 | Controls the non-linear spread of high-frequency filters. |
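Putting Steps 1-3 together, a minimal sketch of the per-frame feature computation follows, assuming the `build_safb` helper above. Dropping the 0th DCT coefficient and retaining 13 is an assumption consistent with common cepstral practice, not a stated design choice.

```python
import numpy as np

def safb_features(frame, fb, n_coeffs=13):
    """SAFB cepstral coefficients via the DCT formula above, augmented with
    the frame's log energy and standard deviation."""
    n_fft = (fb.shape[1] - 1) * 2
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2   # power spectrum P(f)
    S = np.maximum(fb @ power, 1e-12)                  # filter energies S(m), floored
    N = len(S)
    m = np.arange(1, N + 1)
    C = np.array([np.sqrt(2.0 / N) *
                  np.sum(np.log10(S) * np.cos(np.pi * w * (2 * m - 1) / (2 * N)))
                  for w in range(1, n_coeffs + 1)])    # C(w), w = 1..n_coeffs
    log_energy = np.log10(np.sum(frame ** 2) + 1e-12)  # appended energy feature
    return np.concatenate([C, [log_energy, np.std(frame)]])
```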

Stage 3: Classification with Open-Set Recognition

The sequence of feature vectors from an audio clip needs to be classified. We employ a Bidirectional Long Short-Term Memory (Bi-LSTM) network as the core classifier. Unlike standard LSTMs that process sequences in one direction, Bi-LSTMs process data in both forward and backward directions, capturing contextual information from past and future frames within a sequence, which is beneficial for analyzing stable acoustic events like a drone hover.
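For concreteness, a minimal Bi-LSTM classifier sketch in PyTorch is shown below. The feature dimension (13 cepstral coefficients plus the two appended energy features) and the layer sizes are assumptions for illustration, not the reported architecture.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bi-LSTM over a sequence of per-frame feature vectors; the final states
    of the forward and backward directions are concatenated and classified."""
    def __init__(self, feat_dim=15, hidden=128, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)                   # out: (batch, time, 2*hidden)
        fwd = out[:, -1, :out.size(2) // 2]     # forward direction, last step
        bwd = out[:, 0, out.size(2) // 2:]      # backward direction, first step
        return self.fc(torch.cat([fwd, bwd], dim=1))  # activation vector (logits)

# Example: a 3-second clip framed at 0.1 s with 50% overlap gives ~59 frames.
model = BiLSTMClassifier()
logits = model(torch.randn(1, 59, 15))
```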

To address the real-world requirement of rejecting unregistered drones (open-set recognition), the Bi-LSTM is combined with the OpenMax algorithm. During training, the Bi-LSTM is trained on known, registered drone identities in a closed-set manner. The OpenMax layer is then calibrated. For a test sample, OpenMax works by:

  1. Calculating the Bi-LSTM’s activation vector (AV).
  2. Estimating the probability that the sample belongs to an “unknown” class by modeling the distribution of distances between the test AV and the mean AVs of known classes.
  3. Re-calibrating the output softmax scores for known classes, reducing their probabilities, and assigning the remaining probability mass to the unknown class.

A threshold is applied to the OpenMax probability for the “unknown” class. If the unknown probability is highest and exceeds the threshold, the system identifies the drone as an imposter; otherwise, it is assigned the identity of the known class with the highest probability.
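The listing below sketches these three steps plus the thresholded decision, assuming integer class ids 0..K-1, Euclidean distances in activation space, and the rank-weighted damping used in common OpenMax implementations (after Bendale and Boult, 2016); the tail size, alpha, and threshold values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_openmax(avs_per_class, tail=20):
    """Calibration: per class, keep the mean activation vector (MAV) and a
    Weibull model fit to the largest distances of correct training AVs."""
    models = {}
    for c, avs in avs_per_class.items():       # avs: (n_samples, K) logits
        mav = avs.mean(axis=0)
        d = np.sort(np.linalg.norm(avs - mav, axis=1))[-tail:]
        shape, _, scale = weibull_min.fit(d, floc=0)
        models[c] = (mav, shape, scale)
    return models

def openmax(av, models, alpha=3, threshold=0.5):
    """Recalibrate an activation vector and decide known-vs-imposter."""
    K = len(models)
    w = np.array([weibull_min.cdf(np.linalg.norm(av - models[c][0]),
                                  models[c][1], loc=0, scale=models[c][2])
                  for c in range(K)])          # per-class outlier probability
    damp = np.ones(K)
    for r, c in enumerate(np.argsort(av)[::-1][:alpha]):
        damp[c] = 1.0 - ((alpha - r) / alpha) * w[c]   # dampen top-alpha classes
    av_hat = av * damp
    unknown = np.sum(av * (1.0 - damp))        # mass routed to the unknown class
    probs = np.exp(np.append(av_hat, unknown))
    probs /= probs.sum()                       # softmax over K + 1 classes
    if probs[-1] == probs.max() and probs[-1] > threshold:
        return "imposter"
    return int(np.argmax(probs[:K]))
```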

Experimental Validation and Performance Analysis

To validate the proposed UVBRS, comprehensive experiments were conducted using real UAVs in outdoor environments. The dataset comprised 10 drones from 3 models, including 6 units of the same popular commercial model to test individual identification. Audio was recorded using a mobile device held at a realistic pickup distance (e.g., 2 m horizontal, 3 m vertical) during hover. Data were collected over multiple days to incorporate environmental variability.

Overall Identification Performance

The system’s performance was evaluated on two critical tasks: differentiating between drone models and identifying individual drones within the same model. The results are summarized below:

| Identification Task | Target | Accuracy | Key Metric |
| --- | --- | --- | --- |
| Inter-model classification | 3 different drone models | 100% | Classification accuracy |
| Intra-model individual identification | 6 drones of the same model | 99.8% | Verification accuracy |

The near-perfect accuracy for individual identification demonstrates the efficacy of the SAFB feature extraction in capturing subtle, identity-specific harmonic patterns. A comparative analysis was performed against a baseline system using standard MFCC features and a unidirectional LSTM classifier. A statistical t-test confirmed that the proposed system (SAFB + Bi-LSTM) achieved significantly higher performance across all metrics (Accuracy, Precision, Recall, F1-Score) with p-values < 0.001.
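For reference, such a significance test can be run with SciPy along the following lines; the per-fold accuracies below are placeholder numbers for illustration only, not the study’s measurements.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies (illustrative placeholders only).
proposed = [0.998, 0.997, 0.999, 0.998, 0.996]   # SAFB + Bi-LSTM
baseline = [0.981, 0.978, 0.984, 0.980, 0.979]   # MFCC + unidirectional LSTM
t_stat, p_value = ttest_rel(proposed, baseline)  # paired test over shared folds
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")    # p < 0.001 indicates a significant gap
```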

Security Performance Against Replay Attacks

A critical security test simulates the spoofing attack: an adversary replays a recording of a legitimate drone’s sound using a speaker near an imposter drone. The system was tested against such replay attacks. The UVBRS successfully identified the UAV as unauthorized with a 99.5% success rate. The very low False Acceptance Rate (0.05%) underscores its robustness. This is because the replay process through a secondary speaker, coupled with the underlying acoustic noise from the imposter drone’s own motors, introduces distortions and additive components that alter the acoustic fingerprint, making it discernibly different from the original legitimate signature.

Robustness Analysis

The practicality of the system for shippers using various mobile devices in diverse environments was rigorously tested.

1. Impact of Different Mobile Devices: The system was evaluated using audio recorded from four different smartphone models. Performance remained consistently high, as shown below:

| Smartphone Model | Average Accuracy | Average F1-Score |
| --- | --- | --- |
| Phone A (high-end) | > 99.5% | > 99.5% |
| Phone B (high-end) | > 99.5% | > 99.5% |
| Phone C (mid-range) | ~98.5% | ~98.5% |
| Phone D (mid-range) | ~98.4% | ~98.4% |

2. Impact of Different Acoustic Environments: Noise samples from urban and suburban settings were artificially added to clean drone audio at varying Signal-to-Noise Ratios (SNR); a sketch of this SNR-controlled mixing follows the table. The system’s performance demonstrates strong robustness, particularly at moderate SNR levels typical of relatively quiet pickup locations.

| Environment | Accuracy (SNR = 1 dB) | Accuracy (SNR = 5 dB) | Accuracy (SNR = 10 dB) |
| --- | --- | --- | --- |
| Urban indoor | > 88% | > 94% | > 98% |
| Urban outdoor | > 88% | > 92% | > 96% |
| Suburban outdoor | > 83% | > 91% | > 94% |
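The SNR-controlled mixing referenced above can be reproduced with a short helper; this is a standard construction, with the noise gain derived from the clean-to-noise power ratio.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db,
    then add it to the clean drone recording."""
    noise = noise[:len(clean)]                 # align lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise
```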

3. Impact of Recording Distance: The authentication performance was tested at varying distances between the recorder and the drone. The system maintains high accuracy even at increased distances, supporting its suitability for real-world pickup scenarios.

| Recording Distance (horizontal, vertical) | Average Accuracy |
| --- | --- |
| 2 m, 3 m | > 98.5% |
| 3 m, 2 m | > 98.5% |
| 3 m, 3 m | > 98.0% |

Conclusion and Future Directions

This article presented a comprehensive study on individual UAV identification based on acoustic fingerprinting, specifically addressing the security needs of the rapidly growing low-altitude logistics sector in China. The proposed UVBRS system integrates adaptive denoising via the Empirical Wavelet Transform, a specialized audio filter bank (SAFB) for precise harmonic feature extraction, and an OpenMax-equipped Bi-LSTM network for robust open-set classification. Experimental results from real-world outdoor data demonstrate that the system can authenticate individual drones of the same model with 99.8% accuracy and detect replay-based spoofing attacks with a 99.5% success rate. This provides a practical, low-cost, and secure solution for shippers to verify a drone’s identity during parcel pickup, thereby safeguarding the integrity of drone delivery operations.

The current system is designed for single-drone authentication. A significant future direction involves extending this technology to swarm scenarios, where multiple drones operate simultaneously and their audio signals overlap. This presents a challenging blind source separation problem. Research is underway to integrate techniques such as Independent Component Analysis (ICA) with the existing pipeline to disentangle mixed audio streams, enabling the identification of individual drones within a group, which will be crucial for managing complex future logistics operations involving multiple drones.
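As a pointer to that direction, the fragment below applies FastICA (scikit-learn) under the simplifying assumption of an instantaneous mixture captured by at least as many microphones as drones; real outdoor acoustics are convolutive, so a frequency-domain or learning-based separation method would likely be needed in practice.

```python
from sklearn.decomposition import FastICA

def separate_drone_sources(X, n_drones=2):
    """X: multi-microphone recording of shape (n_samples, n_mics).
    Returns one estimated source signal per column (order is arbitrary)."""
    ica = FastICA(n_components=n_drones, random_state=0)
    return ica.fit_transform(X)
```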
