The rapid evolution of wireless networks demands innovative solutions for pervasive, high-capacity connectivity. UAV drones have emerged as a pivotal technology in this landscape, offering unparalleled mobility and rapid deployment for applications ranging from emergency communications to traffic management in smart cities. However, the practical deployment of UAV drone-assisted networks faces significant environmental and physical constraints. Chief among these are No-Fly Zones (NFZs), which restrict UAV drone flight paths around sensitive areas like airports, government facilities, or dense urban infrastructure. These restrictions can force UAV drones into longer, suboptimal trajectories, exacerbating path loss and creating coverage gaps for users located within or behind these zones.

To restore and enhance coverage in such constrained environments, the integration of Reconfigurable Intelligent Surfaces (RIS) with UAV drone platforms presents a transformative opportunity. An RIS is a planar array composed of numerous low-cost, passive elements capable of dynamically manipulating the phase of impinging electromagnetic waves. By mounting an RIS on a UAV drone, we create an “aerial intelligent reflector” that can establish programmable reflection links between a Base Station (BS) and ground users (GUs), bypassing obstacles and focusing signal energy. This synergy combines the strategic placement freedom of UAV drones with the fine-grained beamforming capability of RIS.
Nevertheless, realizing the full potential of a UAV drone-borne RIS is non-trivial. The directional gain of an RIS is highly sensitive to its spatial orientation. The flight dynamics of a UAV drone—specifically its roll, pitch, and yaw movements—continuously alter the RIS’s pointing direction. This misalignment can drastically degrade the reflective channel’s gain if not actively compensated. Therefore, optimizing the communication performance of such a system requires a holistic approach that jointly considers the UAV drone’s trajectory for NFZ avoidance, its attitude for RIS alignment, the phase shifts of the RIS elements, and the beamforming strategy at the BS.
This paper proposes a novel communication framework employing a quadrotor UAV drone equipped with a bottom-mounted RIS. We tackle the complex problem of joint UAV drone trajectory planning, RIS phase shift optimization, UAV drone attitude control, and BS beamforming under the strict constraint of multiple polygonal NFZs. The primary objective is to maximize the total communication rate for a set of GUs, some of whom may be located inside NFZs, while guaranteeing that the UAV drone’s continuous flight path never enters any restricted zone. Given the high complexity and coupling of this optimization problem, we model it as a Markov Decision Process (MDP) and solve it using a Deep Reinforcement Learning (DRL) algorithm based on the Soft Actor-Critic (SAC) framework, renowned for its stability and efficiency in continuous control tasks. The proposed method demonstrates superior performance in terms of achievable rate, complete NFZ avoidance, and system scalability compared to baseline schemes.
1. System Model and Problem Formulation
We consider a downlink communication system where a multi-antenna Base Station (BS) serves K single-antenna Ground Users (GUs). A quadrotor UAV drone, carrying an RIS with N reflective elements on its underside, is deployed to assist the communication. The set of GUs is denoted by $\mathcal{K} = \{1, \dots, K\}$, and the RIS elements are grouped into S subsurfaces for reduced control complexity, with the set $\mathcal{S} = \{1, \dots, S\}$. The total flight period T is discretized into L time slots of duration $\delta = T/L$, indexed by $l \in \mathcal{L} = \{1, \dots, L\}$.
1.1 UAV Drone Trajectory and NFZ Modeling
The UAV drone’s horizontal position at time slot l is denoted by $\mathbf{q}[l] = (x[l], y[l])$, and it flies at a fixed altitude H. Its velocity and acceleration are $\mathbf{v}[l] = (v_x[l], v_y[l])$ and $\mathbf{a}[l] = (a_x[l], a_y[l])$, respectively, bounded by $v_{max}$ and $a_{max}$.
NFZs are modeled as vertical prisms with regular n-sided polygonal cross-sections. Merely constraining the UAV drone’s discrete sampled positions $\mathbf{q}[l]$ to be outside an NFZ is insufficient, as the path between two consecutive points might cut through it. To ensure continuous avoidance, we employ a path integral method. Let the UAV drone’s path from $\mathbf{q}[l]$ to $\mathbf{q}[l+1]$ be parameterized by $s \in [0,1]$: $\mathbf{q}[l](s) = (1-s)\mathbf{q}[l] + s\mathbf{q}[l+1]$. For an n-sided NFZ with vertices $\mathbf{g}_i$ ($i = 1, \dots, n$) listed in a consistent winding order, we define the edge vector $\mathbf{e}_i = \mathbf{g}_{i+1} - \mathbf{g}_i$ and the relative vector $\mathbf{p}_i[l](s) = \mathbf{q}[l](s) - \mathbf{g}_i$. The point $\mathbf{q}[l](s)$ lies inside the convex polygon if and only if the 2D cross products $\mathbf{e}_i \times \mathbf{p}_i[l](s)$ all share the same sign, i.e., they are all positive or all negative. We define an indicator function $\chi(\cdot)$ that equals 1 when this inside condition holds and 0 otherwise. The path integral for time slot l is:
$$
I[l] = \int_{0}^{1} \chi \left( \bigwedge_{i=1}^{n} \left( \mathbf{e}_i \times \mathbf{p}_i[l](s) < 0 \right) \lor \bigwedge_{i=1}^{n} \left( \mathbf{e}_i \times \mathbf{p}_i[l](s) > 0 \right) \right) ds
$$
The constraint $I[l] = 0, \forall l$ guarantees the entire path segment during slot l lies outside the NFZ. A violation occurs if $I[l] > 0$.
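As a concrete illustration, the following Python sketch evaluates the path integral numerically for a single convex NFZ; the sampling resolution and the `inside_convex_polygon` helper are illustrative choices, not part of the paper’s formulation.

```python
# A minimal sketch of the path-integral NFZ check, assuming a convex polygon
# with vertices listed in a consistent (e.g., counterclockwise) winding order.
import numpy as np

def inside_convex_polygon(point, vertices):
    """Return True if `point` lies strictly inside the convex polygon.

    Inside iff all 2D cross products e_i x p_i share the same sign.
    """
    n = len(vertices)
    signs = []
    for i in range(n):
        e = vertices[(i + 1) % n] - vertices[i]           # edge vector e_i
        p = point - vertices[i]                           # relative vector p_i
        signs.append(np.sign(e[0] * p[1] - e[1] * p[0]))  # 2D cross product
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)

def path_integral_violation(q_l, q_next, vertices, samples=100):
    """Approximate I[l] by sampling s in [0, 1] along the segment."""
    s_vals = np.linspace(0.0, 1.0, samples)
    inside = [inside_convex_polygon((1 - s) * q_l + s * q_next, vertices)
              for s in s_vals]
    return np.mean(inside)  # I[l] > 0 indicates an NFZ violation

# Example: a square NFZ and a segment that cuts through it (I[l] ~ 1/3).
nfz = np.array([[40.0, 40.0], [80.0, 40.0], [80.0, 80.0], [40.0, 80.0]])
print(path_integral_violation(np.array([0.0, 60.0]), np.array([120.0, 60.0]), nfz))
```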
1.2 UAV Drone Dynamics and Attitude Model
The UAV drone is modeled as a rigid body. Its attitude is described by Euler angles: roll ($\phi$), pitch ($\theta$), and yaw ($\varphi$), collectively denoted as $\mathbf{U}[l] = \{\phi[l], \theta[l], \varphi[l]\}$. The dynamics relate the control inputs (rotor angular speeds) to the UAV drone’s acceleration. For a quadrotor flying at constant altitude ($a_z=0$), the horizontal accelerations can be expressed as functions of the Euler angles and velocities:
$$
\begin{aligned}
a_x[l] &= g\frac{\tan\phi[l] \sin\varphi[l]}{\cos\theta[l]} - g \tan\theta[l] \cos\varphi[l] - \frac{K_{dx} v_x[l] |v_x[l]|}{m} \\
a_y[l] &= g \tan\theta[l] \sin\varphi[l] - g\frac{\tan\phi[l] \cos\varphi[l]}{\cos\theta[l]} - \frac{K_{dy} v_y[l] |v_y[l]|}{m}
\end{aligned}
$$
where g is the gravitational acceleration, m is the UAV drone’s mass, and $K_{dx}, K_{dy}$ are the horizontal drag coefficients. The attitude angles are constrained: $\theta[l] \in [-\theta_{max}, \theta_{max}]$, $\phi[l] \in [-\phi_{max}, \phi_{max}]$, and their change between slots is bounded by $\Delta U_{max}$.
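For readers implementing the model, a direct transcription of these two expressions might look as follows; the mass and drag values are placeholders, not parameters used in this paper.

```python
# A hedged sketch of the horizontal acceleration model in Section 1.2,
# transcribing the two closed-form expressions above.
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def horizontal_accel(phi, theta, varphi, vx, vy, m=2.0, kdx=0.1, kdy=0.1):
    """Compute (a_x, a_y) from roll phi, pitch theta, yaw varphi and velocity.

    m, kdx, kdy are illustrative values, not parameters from the paper.
    """
    ax = (G * np.tan(phi) * np.sin(varphi) / np.cos(theta)
          - G * np.tan(theta) * np.cos(varphi)
          - kdx * vx * abs(vx) / m)
    ay = (G * np.tan(theta) * np.sin(varphi)
          - G * np.tan(phi) * np.cos(varphi) / np.cos(theta)
          - kdy * vy * abs(vy) / m)
    return ax, ay
```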
1.3 RIS Attitude Transformation and Channel Model
The orientation of the RIS significantly impacts its effective gain. The local RIS normal vector in the body frame is $\mathbf{e}^{loc}_{\perp} = [0, 0, -1]^T$. Using the rotation matrices for roll ($\mathbf{R}_{\phi}$), pitch ($\mathbf{R}_{\theta}$), and yaw ($\mathbf{R}_{\varphi}$), the composite rotation matrix $\mathbf{R}_{BE}[l] = \mathbf{R}_{\phi}[l]\mathbf{R}_{\theta}[l]\mathbf{R}_{\varphi}[l]$ transforms the normal to the global coordinate system: $\mathbf{e}_{\perp}[l] = \mathbf{R}_{BE}[l] \mathbf{e}^{loc}_{\perp}$.
Let $\mathbf{e}^{RIS}_{BS}[l]$ and $\mathbf{e}^{RIS}_{k}[l]$ be the unit direction vectors from the RIS to the BS and to GU k, respectively. The angles between these vectors and the RIS normal are:
$$
\gamma^{RIS}_{BS/k}[l] = \arccos\left( \frac{-\mathbf{e}^{RIS}_{BS/k}[l]^T \mathbf{e}_{\perp}[l]}{\|\mathbf{e}^{RIS}_{BS/k}[l]\| \|\mathbf{e}_{\perp}[l]\|} \right)
$$
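A minimal sketch of this attitude-to-angle chain follows, assuming the standard roll-pitch-yaw rotation matrices (the paper’s exact axis conventions are not restated here):

```python
# Rotate the body-frame RIS normal [0, 0, -1] into the global frame, then
# compute the angle gamma to a given unit direction vector (to the BS or a GU).
import numpy as np

def rotation_matrix(phi, theta, varphi):
    """Composite rotation R_BE = R_phi @ R_theta @ R_varphi (roll-pitch-yaw)."""
    c, s = np.cos, np.sin
    R_phi = np.array([[1, 0, 0], [0, c(phi), -s(phi)], [0, s(phi), c(phi)]])
    R_theta = np.array([[c(theta), 0, s(theta)], [0, 1, 0], [-s(theta), 0, c(theta)]])
    R_varphi = np.array([[c(varphi), -s(varphi), 0], [s(varphi), c(varphi), 0], [0, 0, 1]])
    return R_phi @ R_theta @ R_varphi

def ris_incidence_angle(phi, theta, varphi, e_dir):
    """Angle gamma between the global RIS normal and unit direction e_dir."""
    e_perp = rotation_matrix(phi, theta, varphi) @ np.array([0.0, 0.0, -1.0])
    cos_gamma = -e_dir @ e_perp / (np.linalg.norm(e_dir) * np.linalg.norm(e_perp))
    return np.arccos(np.clip(cos_gamma, -1.0, 1.0))
```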
The channels are modeled with Rician fading. The BS-to-RIS channel $\mathbf{H}_{B,R}[l] \in \mathbb{C}^{N \times M}$ and the RIS-to-GU-k channel $\mathbf{h}_{R,k}[l] \in \mathbb{C}^{N \times 1}$ are:
$$
\begin{aligned}
\mathbf{H}_{B,R}[l] &= \sqrt{\frac{\rho_0}{d_{B,R}[l]^{\alpha_1}}} \left( \sqrt{\frac{K_1}{1+K_1}} \bar{\mathbf{H}}_{B,R}[l] + \sqrt{\frac{1}{1+K_1}} \tilde{\mathbf{H}}_{B,R}[l] \right) \\
\mathbf{h}_{R,k}[l] &= \sqrt{\frac{\rho_0}{d_{R,k}[l]^{\alpha_2}}} \left( \sqrt{\frac{K_{2,k}}{1+K_{2,k}}} \bar{\mathbf{h}}_{R,k}[l] + \sqrt{\frac{1}{1+K_{2,k}}} \tilde{\mathbf{h}}_{R,k}[l] \right)
\end{aligned}
$$
where $d_{B,R}[l]$ and $d_{R,k}[l]$ are the respective distances, $\rho_0$ is the reference channel gain, $\alpha_1, \alpha_2$ are path-loss exponents, $K_1, K_{2,k}$ are Rician factors, and $\bar{\mathbf{H}}, \tilde{\mathbf{H}}$ denote the LoS and NLoS components.
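As a hedged sketch, the following generator reproduces the Rician structure of either channel; the unit-modulus LoS placeholder is an assumption, since the steering-vector construction is not detailed in this section:

```python
# Generic Rician channel generator matching the structure of H_{B,R} and h_{R,k}.
import numpy as np

def rician_channel(n_rx, n_tx, dist, rho0_db=-30.0, alpha=2.0, k_db=10.0,
                   h_los=None, rng=None):
    """Generate an n_rx x n_tx Rician-faded channel at distance `dist`."""
    rng = rng or np.random.default_rng()
    rho0 = 10 ** (rho0_db / 10)
    k = 10 ** (k_db / 10)
    if h_los is None:  # placeholder LoS: unit-modulus entries (assumption)
        h_los = np.exp(1j * 2 * np.pi * rng.random((n_rx, n_tx)))
    h_nlos = (rng.standard_normal((n_rx, n_tx))
              + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)  # CN(0, 1)
    gain = np.sqrt(rho0 / dist ** alpha)  # distance-dependent path loss
    return gain * (np.sqrt(k / (1 + k)) * h_los + np.sqrt(1 / (1 + k)) * h_nlos)
```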
The RIS phase shift matrix is $\boldsymbol{\Psi}[l] = \text{diag}(\boldsymbol{\psi}[l] \otimes \mathbf{1}_{\tilde{N}}) \in \mathbb{C}^{N \times N}$, where $\tilde{N} = N/S$ is the number of elements per subsurface and $\boldsymbol{\psi}[l] = [e^{j\psi_1[l]}, \dots, e^{j\psi_S[l]}]^T$ collects the common phase of each subsurface. The practical gain of the RIS accounts for its directivity. Using a normalized power gain pattern $F(\zeta, \eta) = \cos^z(\eta)$ for angles within the half-space in front of the surface, the overall RIS gain factor for the BS-RIS-GU-k link is:
$$
\Upsilon_k[l] = D^2_{max} \cdot |\cos(\gamma^{RIS}_{BS}[l]) \cos(\gamma^{RIS}_{k}[l])|^z \cdot \boldsymbol{\Psi}[l]
$$
where $D_{max}$ is the maximum directivity. The composite channel from the BS to GU k is:
$$
\boldsymbol{\Xi}_k[l] = \mathbf{h}_{R,k}^H[l] \Upsilon_k[l] \mathbf{H}_{B,R}[l] + \mathbf{h}_{B,k}^H[l]
$$
The received signal at GU k is $y_k[l] = \boldsymbol{\Xi}_k[l] \mathbf{w}_k[l] x_k[l] + \sum_{i \neq k} \boldsymbol{\Xi}_k[l] \mathbf{w}_i[l] x_i[l] + n_k$, where $\mathbf{w}_k[l]$ is the BS beamforming vector for GU k, $x_k[l]$ is the transmitted signal with unit power, and $n_k \sim \mathcal{CN}(0, \sigma^2)$ is noise. The achievable rate for GU k at slot l is:
$$
R_k[l] = \log_2 \left( 1 + \frac{|\boldsymbol{\Xi}_k[l] \mathbf{w}_k[l]|^2}{\sum_{i \neq k} |\boldsymbol{\Xi}_k[l] \mathbf{w}_i[l]|^2 + \sigma^2} \right)
$$
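To make the end-to-end computation concrete, the following sketch assembles the composite channel $\boldsymbol{\Xi}_k[l]$ and the per-user rate for one time slot, assuming the dimension conventions above ($\mathbf{H}_{B,R} \in \mathbb{C}^{N \times M}$) and treating the directivity, exponent z, and noise power as illustrative inputs:

```python
# Composite channel and sum rate for one slot, following Eqs. for Upsilon_k,
# Xi_k, and R_k. Assumes N = S * n_tilde elements and W of shape (M, K).
import numpy as np

def sum_rate(H_br, h_rk_list, h_bk_list, W, psi_sub, n_tilde,
             gamma_bs, gamma_k_list, d_max=1.0, z=1.0, sigma2=1e-14):
    """Compute sum_k R_k[l]; h_rk_list[k] has length N, h_bk_list[k] length M."""
    psi = np.repeat(np.exp(1j * psi_sub), n_tilde)  # per-element phases
    Psi = np.diag(psi)
    K = W.shape[1]
    rates = []
    for k in range(K):
        # Orientation-dependent gain factor Upsilon_k (scalar part)
        g = d_max ** 2 * abs(np.cos(gamma_bs) * np.cos(gamma_k_list[k])) ** z
        # Composite channel Xi_k = h_rk^H (g * Psi) H_br + h_bk^H  (1 x M)
        xi = h_rk_list[k].conj() @ (g * Psi) @ H_br + h_bk_list[k].conj()
        sig = abs(xi @ W[:, k]) ** 2
        intf = sum(abs(xi @ W[:, i]) ** 2 for i in range(K) if i != k)
        rates.append(np.log2(1 + sig / (intf + sigma2)))
    return sum(rates)
```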
1.4 Problem Formulation
Our goal is to maximize the total sum rate over the flight period by jointly optimizing the UAV drone trajectory $\mathcal{Q} = \{\mathbf{q}[l]\}$, RIS phase shifts $\boldsymbol{\Phi} = \{\boldsymbol{\Psi}[l]\}$, UAV drone attitude $\mathcal{U} = \{\mathbf{U}[l]\}$, and BS beamforming $\mathcal{W} = \{\mathbf{w}_k[l]\}$.
$$
\begin{aligned}
& \max_{\mathcal{Q}, \boldsymbol{\Phi}, \mathcal{U}, \mathcal{W}} \quad R_{sum} = \sum_{l=1}^{L} \sum_{k=1}^{K} R_k[l] \\
& \text{s.t.} \\
& \text{C1: } \|\mathbf{a}[l]\| \le a_{max}, \quad \forall l \\
& \text{C2: } \|\mathbf{v}[l]\| \le v_{max}, \quad \forall l \\
& \text{C3: } I[l] = 0, \quad \forall l \quad \text{(NFZ Avoidance)} \\
& \text{C4: } \theta[l] \in [-\theta_{max}, \theta_{max}], \quad \forall l \\
& \text{C5: } \phi[l] \in [-\phi_{max}, \phi_{max}], \quad \forall l \\
& \text{C6: } |\mathbf{U}[l] - \mathbf{U}[l-1]| \le \Delta U_{max}, \quad \forall l \ge 2 \\
& \text{C7: } \|\mathbf{q}[l+1] – \mathbf{q}[l]\| \le v_{max} \delta, \quad \forall l
\end{aligned}
$$
This is a highly non-convex optimization problem with non-smooth NFZ constraints and tightly coupled variables, making traditional optimization methods intractable, especially for online operation.
2. DRL-Based Solution: SAC Algorithm
We reformulate the problem as a Markov Decision Process (MDP) and employ the Soft Actor-Critic (SAC) algorithm, a state-of-the-art DRL method for continuous control, known for its sample efficiency and stability.
2.1 MDP Formulation
The key elements of the MDP are defined as follows:
State Space: The state $s_l$ at time slot l should encompass all necessary information for decision-making.
$$
s_l = \{\mathbf{q}[l], \mathbf{a}[l], \mathbf{v}[l], \mathbf{U}[l], R_{sum}[l-1]\}
$$
It includes the UAV drone’s position, acceleration, velocity, attitude (Euler angles), and the previous total rate.
Action Space: The action $a_l$ corresponds to the adjustable optimization variables.
$$
a_l = \{\Delta\mathbf{U}[l], \boldsymbol{\psi}[l], \mathbf{W}[l]\}
$$
where $\Delta\mathbf{U}[l]$ is the change in Euler angles, $\boldsymbol{\psi}[l]$ are the RIS subsurface phases, and $\mathbf{W}[l] = [\mathbf{w}_1[l], \dots, \mathbf{w}_K[l]]$ is the BS beamforming matrix.
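One plausible way to map a flat, bounded SAC action vector onto these three components is sketched below; the scaling conventions and the power budget `p_max` are assumptions for illustration, since the formulation above does not state a BS power constraint:

```python
# Decode a flat action in [-1, 1]^dim into (delta_U, psi, W); all scalings
# here are illustrative choices, not specified by the paper.
import numpy as np

def decode_action(a, S, M, K, du_max, p_max=1.0):
    delta_u = a[:3] * du_max                     # bounded attitude change
    psi = (a[3:3 + S] + 1.0) * np.pi             # subsurface phases in [0, 2*pi]
    w_raw = a[3 + S:3 + S + 2 * M * K]           # real/imag parts of W
    W = (w_raw[:M * K] + 1j * w_raw[M * K:]).reshape(M, K)
    W *= np.sqrt(p_max) / max(np.linalg.norm(W), 1e-9)  # scale to power budget
    return delta_u, psi, W
```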
Reward Function: The reward guides the UAV drone agent towards the objective while penalizing constraint violations. The base reward is the instantaneous sum rate. Substantial penalties ($P_1$ to $P_4$) are applied for violating NFZ, speed, acceleration, or boundary constraints.
$$
r_l =
\begin{cases}
\sum_{k=1}^{K} R_k[l] - P_1, & \text{if } I[l] > 0 \text{ (NFZ violation)} \\
\sum_{k=1}^{K} R_k[l] - P_2, & \text{if } \|\mathbf{v}[l]\| > v_{max} \\
\sum_{k=1}^{K} R_k[l] - P_3, & \text{if } \|\mathbf{a}[l]\| > a_{max} \\
\sum_{k=1}^{K} R_k[l] - P_4, & \text{if } \mathbf{q}[l] \notin [0, x_{max}] \times [0, y_{max}] \text{ (out of area)} \\
\sum_{k=1}^{K} R_k[l], & \text{otherwise.}
\end{cases}
$$
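A direct transcription of this reward logic, assuming the cases are checked in the listed order when several constraints are violated simultaneously:

```python
# A minimal sketch of the penalized reward r_l for one time slot.
def reward(sum_rate, I_l, speed, accel, q, limits, penalties):
    v_max, a_max, x_max, y_max = limits
    p1, p2, p3, p4 = penalties
    if I_l > 0:                       # NFZ violation
        return sum_rate - p1
    if speed > v_max:                 # speed violation
        return sum_rate - p2
    if accel > a_max:                 # acceleration violation
        return sum_rate - p3
    if not (0 <= q[0] <= x_max and 0 <= q[1] <= y_max):  # out of service area
        return sum_rate - p4
    return sum_rate                   # no violation: plain sum rate
```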
2.2 SAC Algorithm Design
SAC is an off-policy actor-critic algorithm that maximizes a trade-off between expected return and policy entropy, which encourages exploration. The objective is:
$$
J(\pi) = \sum_{l=1}^{L} \mathbb{E}_{(s_l, a_l) \sim \rho_{\pi}} \left[ \gamma^{l-1} \left( r(s_l, a_l) + \alpha \mathcal{H}(\pi(\cdot|s_l)) \right) \right]
$$
where $\mathcal{H}(\pi(\cdot|s_l))$ is the entropy of the policy $\pi$ at state $s_l$, $\alpha$ is a temperature parameter, and $\gamma$ is a discount factor.
The SAC framework maintains the following networks:
- Actor (Policy Network $\pi_{\phi}$): Outputs the mean and standard deviation of a Gaussian distribution over actions, from which actions are sampled.
- Two Critics (Q-Networks $Q_{\omega_1}, Q_{\omega_2}$): Estimate the state-action value function; taking the minimum of the two estimates reduces overestimation bias.
- Two Target Critic Networks: Provide stable targets for Q-network updates.
- Temperature Parameter $\alpha$: Can also be learned automatically.
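As a concrete illustration of the actor, a minimal PyTorch sketch with the common tanh squashing to bound actions is given below; the layer sizes and the `sample` interface are assumptions:

```python
# A Gaussian policy network with reparameterized sampling and tanh squashing.
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        dist = torch.distributions.Normal(self.mu(h),
                                          self.log_std(h).clamp(-20, 2).exp())
        u = dist.rsample()                # reparameterized sample
        a = torch.tanh(u)                 # squash into [-1, 1]
        # log-probability with the tanh change-of-variables correction
        logp = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1)
        return a, logp
```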
The training alternates between policy evaluation and policy improvement, using experience replay. The key update steps are summarized in the table below.
| Step | Description |
|---|---|
| 1 | Initialization: Initialize actor network parameters $\phi$, critic network parameters $\omega_1, \omega_2$, target critic parameters $\hat{\omega}_1, \hat{\omega}_2$, replay buffer $\mathcal{D}$, and temperature $\alpha$. |
| 2 | For each training episode: |
| 3 | Environment Interaction: For each time slot l, agent observes state $s_l$, selects action $a_l \sim \pi_{\phi}(\cdot|s_l)$, executes it, observes reward $r_l$ and next state $s_{l+1}$, and stores transition $(s_l, a_l, r_l, s_{l+1})$ in $\mathcal{D}$. |
| 4 | Network Updates: Sample a random minibatch of transitions from $\mathcal{D}$. |
| 5 | Critic Update: Update $\omega_i$ by minimizing the soft Bellman residual loss: $$ L_{Q}(\omega_i) = \mathbb{E}_{(s_l,a_l) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q_{\omega_i}(s_l, a_l) - \hat{Q}(s_l, a_l) \right)^2 \right] $$ where the target $\hat{Q}$ uses the target networks and the current policy: $$ \hat{Q}(s_l, a_l) = r_l + \gamma \left( \min_{j=1,2} Q_{\hat{\omega}_j}(s_{l+1}, a') - \alpha \log \pi_{\phi}(a'|s_{l+1}) \right), \quad a' \sim \pi_{\phi}(\cdot|s_{l+1}) $$ |
| 6 | Actor Update: Update $\phi$ by minimizing the policy loss below, which is equivalent to maximizing the expected soft Q-value plus entropy: $$ L_{\pi}(\phi) = \mathbb{E}_{s_l \sim \mathcal{D}} \left[ \mathbb{E}_{a_l \sim \pi_{\phi}} \left[ \alpha \log \pi_{\phi}(a_l|s_l) - \min_{j=1,2} Q_{\omega_j}(s_l, a_l) \right] \right] $$ |
| 7 | Temperature Update: (Optional) Adjust $\alpha$ to maintain a target entropy level. |
| 8 | Target Network Update: Softly update target critic parameters: $\hat{\omega}_i \leftarrow \tau \omega_i + (1-\tau) \hat{\omega}_i$, with $\tau \ll 1$. |
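Steps 5 and 8 can be condensed into a short PyTorch routine; the Q-network call signature and the `policy.sample` interface are assumptions carried over from the actor sketch above, and the actor/temperature updates (Steps 6-7) are omitted for brevity:

```python
# Critic update (soft Bellman residual) plus Polyak target update.
import torch
import torch.nn.functional as F

def critic_update(batch, policy, q1, q2, q1_targ, q2_targ, opt, alpha, gamma, tau):
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # a' ~ pi(.|s_{l+1})
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        target = r + gamma * (q_next - alpha * logp_next)  # soft Bellman target
    loss = F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Soft (Polyak) update of the target critics, tau << 1
    for net, targ in ((q1, q1_targ), (q2, q2_targ)):
        for p, pt in zip(net.parameters(), targ.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
```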
3. Simulation Results and Analysis
We conduct simulations to evaluate the performance of the proposed SAC-based optimization framework for UAV drone-RIS networks under NFZ constraints.
3.1 Simulation Setup
The simulation area is a 160 m × 160 m square. Key parameters are listed below:
| Parameter | Value |
|---|---|
| Flight Period (T) / Time Slots (L) | 30 s / 30 |
| UAV Drone Altitude (H) | 100 m |
| UAV Max Speed ($v_{max}$) / Accel. ($a_{max}$) | 20 m/s / 4 m/s² |
| Max Roll/Pitch ($\phi_{max}, \theta_{max}$) | $\pi/4$ rad |
| BS Antennas (M) / RIS Elements (N) / Subs. (S) | 8 / 100 / 5 |
| Noise Power ($\sigma^2$) | -110 dBm |
| Path Loss Exponents ($\alpha_1, \alpha_2$) | 2.0, 2.0 |
| Rician Factors ($K_1$, $K_{2,k}$) | 10 dB |
| Reference Gain ($\rho_0$) | -30 dB |
| SAC Learning Rate / Discount ($\gamma$) | 5e-4 / 0.95 |
Three polygonal NFZs are placed within the area. GUs are randomly distributed, including inside NFZs. The BS is located at (40, 140) m. We compare our SAC-based method against several baselines: PPO, DDPG, TD3 (other DRL algorithms), Fixed Phase Shift (RIS phases random/unoptimized), Fixed RIS (UAV drone hovers at start point), and a No-RIS system.
3.2 Performance Evaluation
The convergence performance during training is shown in the learning curve. The proposed SAC algorithm achieves a higher and more stable cumulative reward (which correlates with the sum rate) compared to PPO, DDPG, and TD3. SAC’s entropy maximization facilitates better exploration in the high-dimensional joint action space (trajectory, attitude, phases, beamforming), allowing it to escape local optima and find superior policies. DDPG converges quickly but to a lower performance plateau. PPO shows higher variance, while TD3 performs well but is slightly outperformed by SAC.
The UAV drone trajectories generated by the SAC agent demonstrate its critical capabilities. The plots clearly show that the UAV drone successfully navigates around all polygonal NFZs without any intersection, validating the effectiveness of the path-integral based NFZ avoidance constraint (C3). Furthermore, the UAV drone dynamically adjusts its path to serve users both outside and inside the NFZs. It tends to linger near the BS and cluster around user-dense regions to maximize the reflective link quality, showcasing intelligent trajectory-user association. This behavior remains consistent even when the number of GUs increases, highlighting the scalability of the approach.
The impact of key system parameters is analyzed. The system sum rate improves as the number of BS antennas (M) increases, due to enhanced beamforming gain. Across all values of M, our proposed SAC-based joint optimization scheme significantly outperforms the baseline schemes. The performance gap between SAC and the “Fixed Phase” scheme illustrates the gain from optimizing the RIS configuration. The substantial gap between SAC and the “No RIS” scheme underscores the fundamental benefit introduced by the UAV drone-borne RIS in creating and enhancing reflection paths. The “Fixed RIS” scheme performs the worst among RIS-assisted schemes, emphasizing the crucial value of the UAV drone’s mobility in positioning the RIS favorably relative to the BS and users.
4. Conclusion
This paper has investigated the performance optimization of a UAV drone-RIS-assisted communication network operating in an environment constrained by multiple No-Fly Zones. We developed a comprehensive system model that integrates UAV drone trajectory dynamics, RIS orientation-dependent gain, and practical wireless channels. A novel path-integral method was proposed to enforce continuous and complete avoidance of polygonal NFZs. The challenging joint optimization problem for maximizing the sum rate was formulated and effectively solved using a Deep Reinforcement Learning framework based on the Soft Actor-Critic algorithm. The SAC agent successfully learns to coordinate the UAV drone’s flight path and attitude, the RIS’s phase profile, and the BS’s beamforming vectors in a holistic manner.
Simulation results confirm that the proposed approach achieves two primary objectives simultaneously: it guarantees strict NFZ avoidance, which is essential for regulatory compliance and safety in real-world UAV drone deployments, and it significantly enhances the overall network communication rate compared to benchmark strategies. The framework demonstrates robustness and adaptability to different user distributions and NFZ configurations. This work provides a viable and efficient solution for deploying intelligent, reconfigurable aerial platforms to extend and dynamically optimize wireless coverage in complex, restricted airspace, paving the way for more reliable and capable UAV drone-assisted communication in the 6G era.
