With the rapid evolution of communication technologies, conventional Radio Frequency (RF) communication is increasingly challenged in electromagnetically sensitive or interference-heavy environments such as hospitals, industrial facilities, and post-disaster zones. Visible Light Communication (VLC) is a promising alternative that uses the visible light spectrum for data transmission. It offers immunity to electromagnetic interference, abundant license-free spectrum, and high data rates. Mounting VLC transceivers on Unmanned Aerial Vehicles (UAVs, also referred to below as unmanned drones) exploits their mobility and flexible deployment to overcome the coverage limits of fixed VLC infrastructure. The resulting UAV-assisted VLC system is well suited to reliable data acquisition in complex scenarios, particularly for gathering information from a large number of Internet of Things (IoT) nodes. However, the limited onboard energy of UAVs makes the planning of their three-dimensional flight trajectories a critical factor in overall system efficiency.

This article investigates a cooperative multi-UAV VLC system for IoT data collection. It accounts for the impact of flight jitter on the VLC channel, proposes a task allocation method based on an improved clustering algorithm, and develops a three-dimensional trajectory planning strategy based on deep reinforcement learning. The framework is validated through simulation, with the aim of providing theoretical support and practical solutions for such communication systems.
1. System and Channel Model
The considered system comprises multiple IoT nodes on the ground and multiple UAVs in the air. The i-th IoT node and the k-th unmanned drone are located at coordinates $\mathbf{w}_i = [x_i, y_i, 0]^T$ and $\mathbf{q}_k[n] = [x_{k,n}, y_{k,n}, h_k]^T$ at time slot $n$, respectively, where $h_k$ is the fixed flying altitude of the k-th drone. Each IoT node is equipped with a Light Emitting Diode (LED) for uplink transmission, and each unmanned drone is equipped with a photodiode (PD) receiver.
1.1 Jitter-Affected VLC Channel Model
Assuming intensity modulation and direct detection with on-off keying, the line-of-sight channel gain based on the Lambertian model is considered. Crucially, the flight jitter of the unmanned drone causes random fluctuations in its orientation. The tilt angle $\theta_{i,k}^{(n)}$ due to jitter is modeled as a random variable. Its probability density function (PDF) is given by:
$$f_{\Theta}(\theta_{i,k}^{(n)}) = \frac{\cos(\theta_{i,k}^{(n)})}{\sigma^2 \sin(\theta_{i,k}^{(n)})} \exp\left(-\frac{\cos^2(\theta_{i,k}^{(n)})}{2\sigma^2}\right), \quad 0 < \theta_{i,k}^{(n)} < \frac{\pi}{2}$$
where $\sigma^2$ is the jitter variance. The incidence angle at the PD becomes $\psi_{i,k}^{(n)} = \phi_{i,k}^{(n)} - \theta_{i,k}^{(n)}$, where $\phi_{i,k}^{(n)}$ is the irradiance angle. The channel gain is therefore:
$$
H_{i,k}^{(n)} =\begin{cases}
\frac{(m+1)A}{2\pi (d_{i,k}^{(n)})^2} \cos^m(\phi_{i,k}^{(n)}) \cos(\psi_{i,k}^{(n)}), & 0 \le \psi_{i,k}^{(n)} \le \Psi_c\\
0, & \psi_{i,k}^{(n)} > \Psi_c
\end{cases}
$$
Here, $m$ is the Lambertian order, $A$ is the PD area, $d_{i,k}^{(n)}$ is the distance, and $\Psi_c$ is the receiver field-of-view semi-angle. The channel capacity for the link between IoT node $i$ and unmanned drone $k$ at time slot $n$ is:
$$
C_{i,k}^{(n)} = \frac{1}{2} \log_2\left(1 + \frac{(\xi_i P_i H_{i,k}^{(n)})^2}{2\pi e \sigma_k^2}\right)
$$
where $\xi_i$ is the optical-to-electrical conversion efficiency, $P_i$ is the transmit power of node $i$, and $\sigma_k^2$ is the noise power at drone $k$.
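
The gain and capacity expressions above can be evaluated numerically as in the following minimal sketch. It assumes the LED points straight up, so the irradiance angle follows from the geometry alone, and it treats a negative incidence angle (a case the piecewise definition does not cover) as zero gain. The function names, the field-of-view value, and the value of $\xi_i$ in the usage lines are illustrative assumptions, not values from the study.

```python
import math

def channel_gain(node_xy, drone_xyz, theta, m, area, fov):
    """Lambertian LoS gain H for a ground LED and a drone-mounted PD.

    theta: jitter tilt angle in radians; fov: receiver FOV semi-angle Psi_c.
    Assumption: the LED faces straight up, so cos(phi) = h / d.
    """
    dx = drone_xyz[0] - node_xy[0]
    dy = drone_xyz[1] - node_xy[1]
    h = drone_xyz[2]
    d = math.sqrt(dx * dx + dy * dy + h * h)
    phi = math.acos(h / d)   # irradiance angle
    psi = phi - theta        # incidence angle after jitter
    # Assumption: psi < 0 is treated like psi > Psi_c (no usable link).
    if psi < 0.0 or psi > fov:
        return 0.0
    return ((m + 1) * area / (2 * math.pi * d ** 2)
            * math.cos(phi) ** m * math.cos(psi))

def capacity(gain, xi, power, noise_var):
    """Capacity expression C from the text, in bits per channel use."""
    snr_term = (xi * power * gain) ** 2 / (2 * math.pi * math.e * noise_var)
    return 0.5 * math.log2(1 + snr_term)

# Usage: drone hovering directly above the node at 13 m, no jitter.
H = channel_gain((0.0, 0.0), (0.0, 0.0, 13.0), theta=0.0,
                 m=1.0, area=1e-4, fov=math.radians(60))
C = capacity(H, xi=0.5, power=10.0, noise_var=1e-13)  # xi is hypothetical
```

Because the gain falls off with $d^2$ and is cut to zero outside the field of view, each node has a finite horizontal radius within which a drone at altitude $h_k$ can serve it.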
1.2 Data Acquisition and Problem Formulation
Let $b_{i,k}^{(n)} \in \{0,1\}$ indicate whether the channel gain exceeds a threshold $H_{\text{th}}$, and $a_{i,k}^{(n)} \in \{0,1\}$ indicate whether a successful data transmission occurs from node $i$ to drone $k$ at slot $n$. A node is considered served if $\sum_{k} \sum_{n} a_{i,k}^{(n)} \ge 1$. The objective is to minimize the total flight distance of all unmanned drones while ensuring all IoT nodes are served. The binary variable $\pi_{i,k} \in \{0,1\}$ indicates the task assignment (node $i$ assigned to drone $k$). The optimization problem (P1) is formulated as:
$$
\begin{aligned}
\text{(P1):} \quad & \min_{\{x_{k,n}, y_{k,n}, N_k, \pi_{i,k}\}} \sum_{k=1}^{K} \sum_{n=1}^{N_k-1} \sqrt{(x_{k,n+1}-x_{k,n})^2 + (y_{k,n+1}-y_{k,n})^2} \\
\text{s.t.} \quad & \text{C1: } \sum_{i=1}^{I} a_{i,k}^{(n)} \le K_{\text{up}}, \quad \forall k, n \\
& \text{C2: } \sum_{i=1}^{I} \sum_{k=1}^{K} \sum_{n=1}^{N_k} a_{i,k}^{(n)} = I \\
& \text{C3: } \sqrt{(x_{k,n+1}-x_{k,n})^2 + (y_{k,n+1}-y_{k,n})^2} \le d_{\max}, \quad \forall k, n \\
& \text{C4: } 0 \le x_{k,n}, y_{k,n} \le D \\
& \text{C5: } \pi_{i,k} \in \{0,1\}, \quad \forall i,k \\
& \text{C6: } \sum_{k=1}^{K} \pi_{i,k} = 1, \quad \forall i
\end{aligned}
$$
C1 limits the number of simultaneous connections per unmanned drone. C2 ensures all nodes are served. C3 constrains the maximum travel distance per slot. C4 defines the operational area. C5 and C6 enforce that each node is assigned to exactly one drone. Problem P1 is a mixed-integer non-convex optimization problem that is challenging to solve directly due to the coupling between task assignment $\pi_{i,k}$, the unknown number of time slots $N_k$, and the continuous trajectory variables.
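
To make the objective and the simpler constraints concrete, the sketch below evaluates the per-drone path length from (P1) and checks C3, C4, C5, and C6 for a candidate solution. The function names and the list-based data layout are assumptions chosen for illustration; the service constraints C1 and C2 depend on the channel model and are not checked here.

```python
import math

def path_length(traj):
    """P1 objective for one drone: sum of per-slot horizontal displacements."""
    return sum(math.dist(traj[n], traj[n + 1]) for n in range(len(traj) - 1))

def satisfies_c3_c4(traj, d_max, side):
    """C3: each step is at most d_max. C4: all waypoints lie in [0, side]^2."""
    steps_ok = all(math.dist(traj[n], traj[n + 1]) <= d_max
                   for n in range(len(traj) - 1))
    in_area = all(0 <= x <= side and 0 <= y <= side for x, y in traj)
    return steps_ok and in_area

def satisfies_c5_c6(pi):
    """C5: entries are binary. C6: each node (row) is assigned to one drone."""
    return all(all(v in (0, 1) for v in row) and sum(row) == 1 for row in pi)

# Usage with a hypothetical three-waypoint trajectory.
traj = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
total = path_length(traj)                              # 5 + 6 = 11
ok = satisfies_c3_c4(traj, d_max=6.0, side=100.0)      # True
```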
2. Proposed Two-Phase Framework: Task Allocation and Path Planning
To tackle the complex problem, we propose a two-phase framework: first, assign IoT nodes to unmanned drones using an improved clustering algorithm; second, plan the optimal flight path for each drone to visit its assigned nodes using a reinforcement learning algorithm.
2.1 Improved K-Means Clustering for Task Allocation
Traditional K-means clusters nodes based solely on Euclidean distance to drone initial positions. For VLC-based data collection, the communication link quality, determined by factors like transmit power and the resulting maximum communicable radius at the drone’s altitude $h_k$, is equally important. We propose an improved clustering factor $D(i,k)$ for assigning node $i$ to drone $k$:
$$
D(i,k) = (1-w) \cdot \tilde{d}_{i,k} + w \cdot \tilde{r}_{i,k}
$$
where $\tilde{d}_{i,k}$ is the normalized Euclidean distance between node $i$ and the initial position of drone $k$, and $\tilde{r}_{i,k}$ is the normalized maximum communicable radius of node $i$ at altitude $h_k$. The weight $w \in [0,1]$ balances the importance of distance versus communication quality. The modified K-means algorithm iteratively assigns nodes to the drone/cluster with the smallest $D(i,k)$ and updates cluster centers until convergence.
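
The clustering loop described above can be sketched as follows. The text does not specify how $\tilde{r}_{i,k}$ is computed or how either term is normalized, so this sketch takes the radius matrix as an input and assumes normalization by the largest entry. It follows the text literally in assigning each node to the cluster with the smallest $D(i,k)$. All function names are illustrative.

```python
import math

def normalize(matrix):
    """Scale entries by the largest one (assumed normalization scheme)."""
    peak = max(max(row) for row in matrix) or 1.0
    return [[v / peak for v in row] for row in matrix]

def assign_nodes(nodes, centers, radius, w):
    """Assign each node i to the cluster k with the smallest D(i, k)."""
    dist = normalize([[math.dist(p, c) for c in centers] for p in nodes])
    rad = normalize(radius)
    labels = []
    for i in range(len(nodes)):
        cost = [(1 - w) * dist[i][k] + w * rad[i][k]
                for k in range(len(centers))]
        labels.append(cost.index(min(cost)))
    return labels

def update_centers(nodes, labels, centers):
    """Move each center to the mean of its members; keep empty clusters fixed."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(nodes, labels) if lab == k]
        if members:
            new_centers.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
        else:
            new_centers.append(old)
    return new_centers

def improved_kmeans(nodes, centers, radius, w, max_iter=100):
    """Iterate assignment and center update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_nodes(nodes, centers, radius, w)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(nodes, labels, centers)
    return labels, centers

# Usage: with w = 0 the result reduces to purely geometric clustering.
nodes = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
radius = [[1.0, 1.0]] * 4   # hypothetical communicable radii
labels, centers = improved_kmeans(nodes, [(0.0, 0.0), (10.0, 0.0)], radius, w=0.0)
```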
2.2 TD3-based 3D Path Planning for Individual Drones
After task allocation, each unmanned drone needs to plan a path to sequentially visit its assigned nodes. We model this as a Markov Decision Process (MDP) and employ the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, known for its stability in continuous action spaces. The key MDP elements are:
- State $s_n$: Includes the drone’s current 2D position $(x_n, y_n)$, and the service status (served/not served) of all nodes in its assigned cluster.
- Action $a_n$: The drone’s movement in the 2D plane, defined by a displacement vector $(\Delta x, \Delta y)$ within a maximum step size.
- Reward $r_n$: Designed to incentivize efficient task completion.
  - A positive reward $\kappa_{\text{cov}}$ is given for serving a new node.
  - A penalty $-\kappa_{\text{dis}}$ is applied per step to encourage shorter paths.
  - A large positive terminal reward $R_{\text{dis}}$ is given upon serving all assigned nodes, inversely proportional to the total flight distance.
  - A negative penalty $-P_{\text{ob}}$ is applied for flying out of bounds.
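
The four reward components listed above can be combined into a single per-step reward, as in this minimal sketch. The coefficient values and the scaling constant of the terminal reward are hypothetical placeholders; the text only states the sign of each term and that $R_{\text{dis}}$ is inversely proportional to the total flight distance.

```python
def step_reward(newly_served, all_served, out_of_bounds, total_distance,
                k_cov=1.0, k_dis=0.1, r_scale=100.0, p_ob=5.0):
    """Per-step reward built from the four components in the text.

    All coefficient defaults are hypothetical, chosen only for illustration.
    """
    reward = -k_dis                   # per-step penalty favors short paths
    reward += k_cov * newly_served    # bonus for each newly served node
    if out_of_bounds:
        reward -= p_ob                # penalty for leaving the area
    if all_served:
        # Terminal bonus, inversely proportional to the distance flown.
        reward += r_scale / max(total_distance, 1e-9)
    return reward

# Usage: one node served on the final step after 50 m of flight.
r = step_reward(newly_served=1, all_served=True,
                out_of_bounds=False, total_distance=50.0)
```

The relative size of these coefficients is a tuning choice: a per-step penalty that is too large compared with the coverage bonus can discourage exploration before any node has been reached.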
The TD3 algorithm uses an Actor-Critic architecture with two Critic networks (Q-functions) to mitigate overestimation bias. The Actor network (policy $\pi_\phi$) outputs the deterministic action. Key features include target policy smoothing and delayed policy updates. The update process for parameters $\phi$ (Actor) and $\theta_i$ (Critics) is summarized below, where $\mathcal{B}$ is a sampled batch from the replay buffer, $y$ is the target value, and $\tau$ is the soft update coefficient:
$$
\begin{aligned}
y &= r + \gamma (1-d) \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}'), \quad \tilde{a}' = \pi_{\phi'}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c) \\
\theta_i &\leftarrow \arg \min_{\theta_i} \mathbb{E}_{(s,a,r,s',d)\sim\mathcal{B}}[(Q_{\theta_i}(s,a) - y)^2] \\
\nabla_\phi J(\phi) &\approx \mathbb{E}_{s\sim\mathcal{B}}[\nabla_a Q_{\theta_1}(s,a)|_{a=\pi_\phi(s)} \nabla_\phi \pi_\phi(s)] \\
\theta_i' &\leftarrow \tau \theta_i + (1-\tau)\theta_i', \quad \phi' \leftarrow \tau \phi + (1-\tau)\phi'
\end{aligned}
$$
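
The target computation and soft update in the equations above are sketched below in plain Python, with the networks abstracted as callables and parameter lists. This is an illustration of the two formulas only, not a full TD3 training loop. The default hyperparameters are placeholders, and the final clipping of the smoothed action to the action range is an added assumption that does not appear in the equations.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def td3_target(r, done, s_next, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, a_max=1.0,
               rng=random):
    """y = r + gamma * (1 - d) * min_i Q_i'(s', a~'), with smoothed a~'."""
    a_next = []
    for a in actor_target(s_next):
        eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        # Assumption: keep the smoothed action inside [-a_max, a_max].
        a_next.append(clip(a + eps, -a_max, a_max))
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min

def soft_update(target, online, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied elementwise."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

# Usage with stand-in target networks (constant critics, fixed policy).
actor = lambda s: [0.5, -0.5]
q1 = lambda s, a: 2.0
q2 = lambda s, a: 3.0
y = td3_target(1.0, 0.0, None, actor, q1, q2, gamma=0.9, noise_std=0.0)
```

Taking the minimum of the two critics is what counters overestimation: the target never uses the more optimistic of the two value estimates.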
3. Simulation Analysis and Performance Evaluation
We evaluate the proposed “Improved K-Means + TD3” framework against several baseline algorithms: a simple SCAN strategy, “K-means + Greedy + RRT”, “K-means + TD3” (with standard clustering), and a Multi-Agent TD3 (MATD3) algorithm where drones learn to cooperate without explicit prior task allocation. The complexity of the main algorithms is compared in Table 1.
| Algorithm | Time Complexity | Remarks |
|---|---|---|
| Proposed (Improved K-Means+TD3) | $O(TIK + ENM)$ | $T$: Clustering iterations, $I$: Nodes, $K$: Drones, $E$: Training episodes, $N$: Steps, $M$: NN params. Higher than simple planners but more efficient than multi-agent RL. |
| SCAN (Baseline) | $O(I D)$ | Lowest complexity, poorest performance. |
| K-Means + Greedy + RRT | $O(TIK + I \log I)$ | Lower complexity than proposed, but prone to local optima in path planning. |
| Multi-Agent TD3 (MATD3) | $O(ENKM)$ | Highest complexity due to multi-agent policy interaction and $K$ times larger network updates. |
3.1 Impact of UAV Jitter
Figure 3 (simulated results) shows the impact of the jitter coefficient $\sigma$ on task success rate and total flight distance. For low jitter ($\sigma < 0.10$), the success rate is 100% with relatively short paths. As jitter increases to a medium level ($0.15 \le \sigma \le 0.20$), the success rate drops and the flight distance becomes volatile, indicating disrupted VLC links. Interestingly, at $\sigma = 0.30$, the success rate recovers to 100% with a shorter path, possibly because stronger jitter helps the planner escape local optima. For high jitter ($\sigma \ge 0.35$), the success rate falls and stabilizes around 88%, and paths shorten because missions fail earlier. Severe jitter therefore significantly degrades system reliability and performance.
3.2 Effect of Clustering Weight Factor $w$
The clustering weight $w$ balances distance against communication feasibility. Figure 4 illustrates the clustering outcomes for different $w$ values. When $w=0$, clustering is purely geometric. When $w=0.5$, some nodes near cluster boundaries are reassigned based on better link quality, leading to more blended clusters. When $w=1$, assignment is based solely on link quality, often resulting in scattered clusters that increase travel distance. Figure 5 shows the system performance versus $w$. The total flight distance and energy consumption first decrease as $w$ increases from 0 to around 0.3 to 0.4, where the best trade-off is reached. Performance then degrades sharply for $w > 0.5$ due to highly inefficient task distribution. This indicates that incorporating communication quality into clustering with a moderate weight ($w \approx 0.3$) is beneficial.
3.3 Comparative Performance of Full Algorithms
Figure 6 compares the total flight distance required by different algorithms to complete the data collection mission with multiple unmanned drones. The proposed “Improved K-Means (w=0.3) + TD3” algorithm achieves the shortest flight distance, reducing the total path length by approximately 56% compared to the baseline SCAN algorithm. It also outperforms “K-means + Greedy + RRT” and standard “K-means + TD3”. The performance is comparable to the MATD3 algorithm, which learns cooperative behavior directly but at a higher computational cost and without explicit task assignment logic. Figure 7 depicts the training progression of the MATD3 algorithm, showing the converging flight paths of multiple drones over training episodes.
Key simulation parameters are listed in Table 2.
| Parameter | Value |
|---|---|
| Number of UAVs (K) | 1 or 3 |
| Number of IoT Nodes (I) | 20 or 50 |
| UAV Flying Altitude $(h_k)$ | 13 m |
| Maximum Simultaneous Connections $(K_{\text{up}})$ | 1 |
| LED Half-Power Angle | 60° |
| LED Transmit Power $(P_i)$ | 10 W |
| Noise Power $(\sigma_k^2)$ | -128.82 dBW |
| PD Area $(A)$ | 1 cm² |
| Operation Area $(D \times D)$ | 100 m × 100 m |
| Reinforcement Learning Episodes | 8000 |
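
Several entries in Table 2 need a unit conversion before they can be used in the channel model of Section 1.1. The sketch below shows these conversions, using the standard relation between the Lambertian order and the LED half-power angle, $m = -\ln 2 / \ln(\cos \Phi_{1/2})$. The function names are illustrative.

```python
import math

def lambertian_order(half_power_deg):
    """Lambertian order m from the LED half-power semi-angle in degrees."""
    return -math.log(2) / math.log(math.cos(math.radians(half_power_deg)))

def dbw_to_watts(dbw):
    """Convert a power level in dBW to watts."""
    return 10 ** (dbw / 10)

def cm2_to_m2(area_cm2):
    """Convert an area from square centimetres to square metres."""
    return area_cm2 * 1e-4

m = lambertian_order(60.0)            # a 60 degree half-power angle gives m = 1
noise_var = dbw_to_watts(-128.82)     # roughly 1.31e-13 W
pd_area = cm2_to_m2(1.0)              # 1 cm^2 = 1e-4 m^2
```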
4. Conclusion
This work addresses the challenge of efficient cooperative multi-UAV data acquisition in VLC-based IoT networks. By jointly considering the practical impact of UAV flight jitter on the VLC channel and the need for efficient trajectory planning, we proposed a two-phase framework. An improved K-Means clustering algorithm incorporating a tunable communication-quality factor allocates tasks to drones. Subsequently, a TD3-based deep reinforcement learning algorithm plans efficient 3D flight paths for each drone. Simulation results show that the proposed framework outperforms the baseline methods, reducing the total flight distance by approximately 56% relative to SCAN when the clustering weight is well chosen (e.g., $w=0.3$). This supports the importance of balancing geometric distance with communication link quality in task allocation for UAV-assisted VLC systems. The study provides a foundation for designing intelligent, energy-efficient, and reliable multi-drone communication systems for operation in complex electromagnetic environments. Future work may explore integration with Reconfigurable Intelligent Surfaces (RIS) to further improve VLC link robustness against jitter, as well as dynamic trajectory planning in environments with moving obstacles.
