The rapid proliferation of drone technology in domains such as urban inspection, emergency rescue, logistics distribution, and agricultural plant protection has imposed ever higher demands on autonomous flight and cooperative control capabilities. Modern unmanned aerial vehicles (UAVs) must operate reliably in complex, dynamic environments while being constrained by limited size, weight, and power (SWaP). This review systematically examines the key technological progress in both autonomous flight and multi-drone cooperative control, highlighting the evolution from modular architectures to learning-based end-to-end solutions and from model-based consensus algorithms to data-driven multi-agent reinforcement learning. Throughout the discussion, we place special emphasis on how drone technology is being reshaped by differentiable physics simulation, imitation learning, and vision-language-action models. Finally, we propose a novel framework that integrates end-to-end decision-making with physical priors and differentiable simulation, and outline future directions for simulation-to-reality transfer, explainability, and scalability in large-scale swarms.

Autonomous Flight Research Progress
The core challenge in autonomous drone flight is achieving fast, safe obstacle avoidance under resource constraints. Traditional perception-planning-control architectures perform well at low speeds in sparse environments, but suffer from information delays and error accumulation during high-speed maneuvers or under strong disturbances. Recent breakthroughs in end-to-end learning have significantly improved system responsiveness and environmental adaptability by mapping sensory inputs directly to control outputs.
Traditional Perception-Planning-Control Framework
The classical three-layer architecture decomposes the autonomy task into perception (environment mapping via point clouds or voxel grids), planning (global and local trajectory optimization), and control (PID or model predictive control). While this modular design offers good interpretability, the sequential processing chain introduces latency and cumulative errors. Table 1 summarizes representative methods and their characteristics.
| Layer | Representative Method | Principle | Strength | Limitation |
|---|---|---|---|---|
| Perception | ORB-SLAM3 | Sparse feature-based SLAM | Low computational cost | Sensitive to low-texture environments |
| Perception | VINS-Mono | Visual-inertial fusion | Robust in motion | Requires IMU calibration |
| Planning | Fast-Planner | Hybrid A* + B-spline optimization | Smooth global paths | Pre-built map dependency |
| Planning | Ego-Planner | Gradient-based local optimization | Real-time reaction to dynamic obstacles | No global guarantee |
| Control | PID | Error feedback regulation | Simple, robust | Poor handling of nonlinearities |
| Control | MPC | Receding-horizon optimization over a finite horizon | Constraint-aware, precise tracking | High computational demand |
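
As a rough illustration of the control layer in Table 1, the following sketch implements a textbook discrete PID loop and applies it to altitude hold for a unit point mass. The class name `PID`, the function `simulate_altitude_hold`, the gains, and the simplified plant (gravity assumed already compensated) are all assumptions made for this example, not values taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller; gains are placeholders, not tuned values."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        # Skip the derivative term on the first call to avoid a spurious kick.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_altitude_hold(target: float = 1.0, steps: int = 1000, dt: float = 0.01) -> float:
    """Drive a unit point mass toward `target` altitude and return the final altitude."""
    # Hypothetical gains chosen so the continuous closed loop has a triple pole at s = -2.
    pid = PID(kp=12.0, ki=8.0, kd=6.0)
    z, vz = 0.0, 0.0
    for _ in range(steps):
        accel = pid.step(target - z, dt)  # commanded vertical acceleration
        vz += accel * dt                  # semi-implicit Euler integration
        z += vz * dt
    return z
```

The point-mass plant is linear, which is the regime where PID works well; the nonlinearity limitation listed in the table would only show up with a richer vehicle model.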
Reinforcement Learning Based End-to-End Methods
Reinforcement learning (RL) has emerged as a powerful alternative for drone autonomy, enabling end-to-end optimization from raw sensor data to motor commands. In high-speed racing, the Swift system achieved champion-level performance by mapping visual-inertial features to low-dimensional control commands through deep RL. The training objective is typically formulated as:
$$ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r(s_t, a_t)\right] $$
where \(\pi_\theta\) is the policy parametrized by \(\theta\), \(\gamma\) the discount factor, and \(r\) the reward signal. To improve sample efficiency and safety, recent works incorporate curriculum learning, domain randomization, and human-in-the-loop intervention. For instance, the NavRL framework uses velocity-obstacle-based safety shields and parallel training in Isaac Sim, achieving a 52.2% reduction in collision rate in dynamic environments. However, RL still faces challenges such as reward design difficulty, training instability, and limited explainability.
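To make the objective \(J(\theta)\) concrete, the sketch below computes the discounted return of one reward sequence and a Monte Carlo estimate of the expectation over several rollouts. The function names `discounted_return` and `estimate_objective` are illustrative, and the reward lists stand in for trajectories that would normally be sampled from \(\pi_\theta\) in a simulator.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t by accumulating backward from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(rollouts: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J: the average discounted return over sampled rollouts."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)


if __name__ == "__main__":
    # Two made-up reward sequences, standing in for rollouts of a policy.
    print(estimate_objective([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5))
```

Policy-gradient methods differ mainly in how they turn such return estimates into a gradient with respect to \(\theta\); this sketch covers only the quantity being maximized.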
Imitation Learning Based End-to-End Methods
Imitation learning (IL) accelerates policy acquisition by leveraging expert demonstrations, bypassing the inefficient exploration phase of RL. The core idea is to learn a mapping from observations to actions by minimizing a loss function between the predicted actions and expert actions:
$$ \mathcal{L}_{\text{BC}}(\theta) = \mathbb{E}_{(o, a^*) \sim \mathcal{D}} \left[ \| \pi_\theta(o) - a^* \|^2 \right] $$
where \(\mathcal{D}\) is the collected demonstration dataset. The Agile framework exemplifies a teacher-student paradigm: a teacher policy is trained in simulation with privileged information (e.g., full state and environment geometry) via RL; then a student policy is trained via behavioral cloning to mimic the teacher using only onboard sensor inputs. This zero-shot transfer approach has been successfully deployed in complex real-world environments such as dense forests and rubble fields.
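The behavioral cloning loss can be illustrated with a deliberately small example: a scalar linear policy \(\pi_\theta(o) = w\,o + b\) fitted by gradient descent to demonstrations from a hypothetical expert. The expert rule, the dataset, the learning rate, and the names `bc_loss` and `fit_bc` are assumptions for this sketch; real student policies are neural networks operating on images or depth maps.

```python
from typing import List, Tuple

Demo = Tuple[float, float]  # (observation o, expert action a*)


def bc_loss(w: float, b: float, data: List[Demo]) -> float:
    """Mean squared error between the policy output w*o + b and the expert action."""
    return sum((w * o + b - a) ** 2 for o, a in data) / len(data)


def fit_bc(data: List[Demo], lr: float = 0.1, epochs: int = 500) -> Tuple[float, float]:
    """Minimize the BC loss with full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2.0 * (w * o + b - a) * o for o, a in data) / n
        grad_b = sum(2.0 * (w * o + b - a) for o, a in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical expert a* = 2*o - 1, observed on a grid of observations in [-1, 1].
demos: List[Demo] = [(k / 10.0, 2.0 * (k / 10.0) - 1.0) for k in range(-10, 11)]

if __name__ == "__main__":
    w, b = fit_bc(demos)
    print(w, b, bc_loss(w, b, demos))
```

Because the expert here is exactly representable by the student, the loss can be driven to zero; in the teacher-student setting the student sees less information than the teacher, so some residual error is expected.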
More recently, vision-language-action (VLA) models have emerged as a frontier for drone technology. These models combine visual inputs, natural language instructions, and motor commands, enabling semantic-guided navigation. For example, RaceVLA maps first-person vision and textual commands to drone actions, achieving fine-grained behaviors like flying close to obstacles or avoiding dynamic targets. Synthetic data generation, often through diffusion models, plays a crucial role in training VLA models due to the high cost of real-world data collection. However, the large model size and computational demands remain significant barriers for real-time deployment on resource-constrained platforms. Table 2 compares different autonomous flight paradigms.
| Paradigm | Data Requirement | Real-time Performance | Explainability | Sample Efficiency |
|---|---|---|---|---|
| Traditional modular | Low | High (with careful optimization) | High | N/A |
| Reinforcement learning | Very high (simulation) | Medium (depends on network size) | Low | Low |
| Imitation learning (BC) | High (expert demonstrations) | High | Medium | High |
| Imitation learning (VLA) | Very high (sim + real) | Low (currently edge-dependent) | Medium | Medium |
Differentiable Physics Simulation Based End-to-End Methods
Differentiable physics simulation (DPS) bridges the gap between physical modeling and deep learning by making the simulation process differentiable. Unlike traditional simulators that only output state trajectories, DPS can provide gradient signals (e.g., \(\frac{\partial \text{state}}{\partial \text{control}}\)) that can be backpropagated to optimize control networks directly. This eliminates the need for reward engineering in RL or large expert datasets in IL, and ensures that learned policies respect physical laws.
The optimization objective can be formulated as minimizing a loss function over a time horizon:
$$ \mathcal{L}(\theta) = \sum_{t=1}^T \ell(s_t, s_t^{\text{target}}) \quad \text{with} \quad s_{t+1} = f_{\text{diff}}(s_t, \pi_\theta(o_t)) $$
where \(f_{\text{diff}}\) is the differentiable simulator (defined by the physics equations) and \(\ell\) is a task-specific loss (e.g., trajectory tracking error). For drones, DPS has been used for vision-based stabilization (e.g., the BBTT algorithm, which recovers from random initial throws to stable hover) and for agile obstacle avoidance. Recent vision-based work shows that policies trained purely in differentiable simulators can transfer to real quadrotors without any real-world fine-tuning. Table 3 outlines the differences between DPS and RL for control policy learning.
| Aspect | Differentiable Physics Simulation | Reinforcement Learning |
|---|---|---|
| Gradient source | Analytic gradients from the simulator | Estimated via sampled returns or value functions |
| Sample efficiency | High (direct gradient propagation) | Low (requires many trial-and-error episodes) |
| Physical consistency | Explicitly enforced | Implicit (through rewards) |
| Interpretability | Medium (gradients traceable to physics) | Low (black-box policy) |
| Deployment complexity | Requires differentiable simulator | Standard simulator suffices |
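
The "analytic gradients from the simulator" row can be made concrete with a toy example. The sketch below rolls out a 1D point mass under a two-parameter feedback policy, accumulates the loss \(\sum_t (p_t - p^{\text{target}})^2\), and backpropagates through the simulation steps by hand. The hand-written reverse pass stands in for what an automatic differentiation framework would generate; the dynamics, the PD-style policy, the constants, and all function names are assumptions for illustration only.

```python
from typing import List, Tuple

DT, HORIZON, TARGET = 0.05, 40, 0.0  # illustrative step size, horizon, target position


def rollout(kp: float, kd: float, p0: float = 1.0, v0: float = 0.0) -> Tuple[List[float], List[float]]:
    """Explicit-Euler simulation of a unit point mass under a = -kp*(p - TARGET) - kd*v."""
    ps, vs = [p0], [v0]
    for _ in range(HORIZON):
        p, v = ps[-1], vs[-1]
        a = -kp * (p - TARGET) - kd * v
        ps.append(p + v * DT)
        vs.append(v + a * DT)
    return ps, vs


def loss_and_grad(kp: float, kd: float) -> Tuple[float, Tuple[float, float]]:
    """Return the tracking loss and its exact gradient w.r.t. (kp, kd)."""
    ps, vs = rollout(kp, kd)
    loss = sum((p - TARGET) ** 2 for p in ps[1:])
    gp, gv = 0.0, 0.0      # adjoints of the state at step t+1 (future contributions)
    g_kp, g_kd = 0.0, 0.0
    for t in reversed(range(HORIZON)):
        gp += 2.0 * (ps[t + 1] - TARGET)   # direct loss term at step t+1
        ga = gv * DT                       # adjoint of the action a_t
        g_kp += ga * -(ps[t] - TARGET)     # da_t/dkp = -(p_t - TARGET)
        g_kd += ga * -vs[t]                # da_t/dkd = -v_t
        # Chain rule through one simulator step, back to the state at step t.
        gp, gv = gp - kp * ga, gp * DT + gv - kd * ga
    return loss, (g_kp, g_kd)


def train(kp: float = 1.0, kd: float = 1.0, lr: float = 0.02, iters: int = 50) -> Tuple[float, float]:
    """Plain gradient descent on the policy parameters using simulator gradients."""
    for _ in range(iters):
        _, (g_kp, g_kd) = loss_and_grad(kp, kd)
        kp -= lr * g_kp
        kd -= lr * g_kd
    return kp, kd
```

No reward function or expert data appears anywhere in the loop: the only training signal is the gradient of the task loss through the dynamics, which is the contrast with RL that Table 3 draws.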
Cooperative Control of Multi-Drone Systems
Beyond single-UAV autonomy, cooperative control is essential for tasks requiring wide-area coverage, multi-target tracking, or redundancy. In drone technology, multi-agent systems must handle communication constraints, partial observability, and dynamic topology. We categorize approaches into traditional model-based methods, multi-agent reinforcement learning (MARL), and an emerging differentiable physics perspective.
Traditional Multi-Agent Cooperative Control Methods
Traditional approaches rely on explicit mathematical models to achieve formation control and collision avoidance. Key methods include:
- Consensus theory: Each agent updates its state based on neighboring states. For velocity consensus: $$ \dot{v}_i = -\sum_{j \in \mathcal{N}_i} a_{ij}(v_i - v_j) $$ where \(a_{ij}\) are adjacency weights.
- Leader-follower strategy: Followers track the leader’s trajectory, often using relative position feedback.
- Artificial potential field (APF): Agents move along the negative gradient of a potential function composed of attractive and repulsive terms: $$ \mathbf{F}_i = -\nabla U_{\text{att}} - \nabla U_{\text{rep}} $$ $$ U_{\text{rep}} = \sum_{j \neq i} \frac{k_{\text{rep}}}{\| \mathbf{p}_i - \mathbf{p}_j \|} $$
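
The velocity consensus rule in the first item can be sketched with an explicit-Euler discretization. Scalar velocities, the four-agent line graph, the step size, and the function name `consensus_step` are assumptions chosen to keep the example small.

```python
from typing import List


def consensus_step(v: List[float], adjacency: List[List[float]], dt: float) -> List[float]:
    """One Euler step of dv_i/dt = -sum_j a_ij * (v_i - v_j)."""
    n = len(v)
    return [
        v[i] - dt * sum(adjacency[i][j] * (v[i] - v[j]) for j in range(n))
        for i in range(n)
    ]


# Undirected line graph 0 - 1 - 2 - 3 with unit weights (illustrative topology).
LINE_GRAPH = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    velocities = [0.0, 1.0, 2.0, 5.0]
    for _ in range(500):
        velocities = consensus_step(velocities, LINE_GRAPH, dt=0.1)
    print(velocities)  # all entries approach the initial average
```

With symmetric weights on a connected graph the average velocity is preserved at every step, so the agents agree on the mean of their initial velocities; if the graph is disconnected, each component converges to its own average instead.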
These methods offer provable stability and low computational cost, but their performance degrades in highly dynamic or unstructured environments because they rely on hand-tuned parameters. Table 4 compares these methods.
| Method | Principle | Advantage | Limitation |
|---|---|---|---|
| Consensus theory | Distributed state agreement | Scalable, stability proof | Requires connected graph |
| Leader-follower | Followers track the leader | Simple, easy to implement | Single point of failure |
| Artificial potential field | Gradient of potential function | Real-time, reactive | Local minima issues |
| Distributed MPC | Optimization with neighbor constraints | Constraint satisfaction | High online computation |
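
The artificial potential field force can be written out directly for the inverse-distance repulsive potential \(U_{\text{rep}}\) given earlier. The text does not specify the attractive potential, so this sketch assumes the common quadratic form \(U_{\text{att}} = \tfrac{1}{2} k_{\text{att}} \| \mathbf{p}_i - \mathbf{p}_{\text{goal}} \|^2\); the function name `apf_force`, the 2D setting, and the default gains are likewise illustrative.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]


def apf_force(i: int, positions: List[Vec2], goal: Vec2,
              k_att: float = 1.0, k_rep: float = 1.0) -> Vec2:
    """Force on agent i: -grad U_att - grad U_rep, with gradients taken w.r.t. p_i."""
    px, py = positions[i]
    # Attractive term from the assumed quadratic potential: -k_att * (p_i - goal).
    fx = -k_att * (px - goal[0])
    fy = -k_att * (py - goal[1])
    for j, (qx, qy) in enumerate(positions):
        if j == i:
            continue
        dx, dy = px - qx, py - qy
        dist = math.hypot(dx, dy)
        # -grad of k_rep / ||p_i - p_j|| is k_rep * (p_i - p_j) / ||p_i - p_j||^3.
        fx += k_rep * dx / dist ** 3
        fy += k_rep * dy / dist ** 3
    return fx, fy


if __name__ == "__main__":
    # Agent 0 at the origin, neighbor one unit above it, goal two units to the right.
    print(apf_force(0, [(0.0, 0.0), (0.0, 1.0)], goal=(2.0, 0.0)))
```

The local-minima limitation in the table corresponds to configurations where the attractive and repulsive terms cancel, so the returned force is zero away from the goal.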
Multi-Agent Reinforcement Learning Methods
MARL has gained traction in drone technology for its ability to learn cooperative behaviors without explicit interaction models. Representative algorithms include:
- MADDPG: Each agent has an actor that outputs actions based on local observations, and a centralized critic that evaluates state-action values using all agents’ information: $$ \nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(o_i) \, \nabla_{a_i} Q_i^\mu(s, a_1, \dots, a_N) \big|_{a_i = \mu_i(o_i)} \right] $$ where \(\mathcal{D}\) here denotes the replay buffer.
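- MAPPO: Extends PPO to multi-agent settings with a centralized value function and clipped surrogate objective: $$ L^{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] $$ where \(r_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\text{old}}}(a_t \mid o_t)\) is the probability ratio between the new and old policies, \(\hat{A}_t\) is the advantage estimate, and \(\epsilon\) is the clipping range.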
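
The clipped surrogate objective used by MAPPO can be evaluated directly once the probability ratios (new policy over old policy) and advantage estimates are available. The sketch below takes both as plain lists; the function name `clipped_surrogate` and the sample numbers are illustrative, and in practice the ratios come from the policy network and the advantages from the centralized value function.

```python
from typing import Sequence


def clipped_surrogate(ratios: Sequence[float], advantages: Sequence[float],
                      eps: float = 0.2) -> float:
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Made-up ratios and advantages covering the three clipping cases.
    print(clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0]))
```

Taking the minimum makes the objective pessimistic: a large ratio cannot inflate the gain when the advantage is positive, and a small ratio cannot hide the penalty when it is negative, which limits how far one update can move the policy.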
MARL has been applied to tasks such as collision avoidance (attention-based encoders for up to 32 drones in simulation), pursuit-evasion games, and dynamic task allocation. However, challenges remain: the black-box nature of learned policies hinders explainability; sim-to-real transfer often fails due to sensor noise and communication delays; credit assignment in sparse-reward settings is problematic; and training typically assumes perfect communication, which rarely holds in practice. Table 5 summarizes recent MARL applications to drones.
| Reference | Algorithm | Task | Key Innovation | Number of UAVs |
|---|---|---|---|---|
| Huang et al. | MADDPG + Attention | Collision avoidance | Attention-based neighbor encoding | 4–32 (sim), 4–6 (real) |
| Batra et al. | MAPPO | Swarm navigation | Curriculum learning + collision replay | Up to 10 |
| Peng et al. | SAC (NAGC) | Cooperative pursuit | Graph attention + normalizing flow actor | 6 |
| Wu et al. | PG-MAPPO | Dynamic task allocation | Population-based learning + GMM adjustment | Up to 20 |
Multi-Agent Differentiable Physics Simulation – An Emerging Frontier
The success of differentiable physics in single-UAV control naturally suggests its extension to multi-agent systems. The key idea is to construct a joint differentiable model that captures both individual dynamics and inter-agent coupling constraints. The optimization objective can be formulated as a composite loss:
$$ \mathcal{L}_{\text{joint}}(\{\theta_i\}) = \sum_{i=1}^N \mathcal{L}_i^{\text{indiv}}(s_i, a_i) + \lambda \mathcal{L}^{\text{group}}( \{s_i, a_i\} ) $$
where \(\mathcal{L}_i^{\text{indiv}}\) penalizes tracking error or collision with obstacles, \(\mathcal{L}^{\text{group}}\) enforces formation shape, connectivity, or task balance, and \(\lambda\) weights the group term against the individual terms. The gradients flow through a fully differentiable simulation that includes pairwise interactions (e.g., repulsion forces, communication constraints). Although no mature application in multi-UAV systems exists yet, this approach holds promise for uniting the interpretability of physics models with the flexibility of end-to-end learning. Future work must address how to embed cooperative mechanisms directly into the forward dynamics rather than only as post-hoc loss terms, so that coordination can emerge naturally.
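One possible instantiation of the composite loss is sketched below. It assumes a squared position-tracking error for each individual term and a pairwise-spacing penalty for the group term, and it depends on states only (the action arguments are dropped for brevity). These choices, along with the names `individual_loss`, `group_loss`, and `joint_loss`, are hypothetical; the text leaves the exact form of both terms open.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]


def individual_loss(pos: Vec2, target: Vec2) -> float:
    """Squared distance between an agent and its own target (assumed tracking term)."""
    return (pos[0] - target[0]) ** 2 + (pos[1] - target[1]) ** 2


def group_loss(positions: List[Vec2], desired_spacing: float) -> float:
    """Penalize deviation of every pairwise distance from a desired spacing (assumed formation term)."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            total += (math.dist(positions[i], positions[j]) - desired_spacing) ** 2
    return total


def joint_loss(positions: List[Vec2], targets: List[Vec2],
               desired_spacing: float, lam: float) -> float:
    """Sum of per-agent losses plus lam times the group loss."""
    indiv = sum(individual_loss(p, t) for p, t in zip(positions, targets))
    return indiv + lam * group_loss(positions, desired_spacing)


if __name__ == "__main__":
    print(joint_loss([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (3.0, 0.0)],
                     desired_spacing=3.0, lam=0.5))
```

In a differentiable multi-agent simulator this scalar would be evaluated along the rollout and differentiated with respect to every agent's policy parameters; the sketch only shows how the individual and group terms combine.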
Challenges and Future Directions
Despite significant advances in drone technology, several critical challenges remain for both autonomous flight and multi-drone cooperation.
- Explainability and safety: End-to-end learning models lack formal safety guarantees. Integrating dynamics constraints and barrier functions into network architectures is a promising path to certifiably safe control.
- Differentiable physics for multi-agent systems: Extending single-agent differentiable simulation to multi-agent settings while maintaining computational tractability and gradient fidelity remains an open problem.
- Communication and partial observability: Real-world drone swarms must operate under intermittent communication and partial observations. Robust policy representations, such as recurrent or attention-based models that rely on observation history rather than instantaneous neighbor states, are essential.
- Simulation-to-reality transfer: High-fidelity training environments that explicitly model sensor noise, actuator delays, and aerodynamic variations, combined with online adaptation using a small amount of real flight data, can bridge the sim-to-real gap.
- System-level scalability: As swarms scale to hundreds or thousands, hierarchical decision-making architectures that combine local autonomy (e.g., reactive collision avoidance) with global coordination (e.g., task allocation) become necessary. The interplay between drone technology and infrastructure (edge computing, 5G, digital twins) will be a key enabler.
In conclusion, the evolution of drones toward fully autonomous and scalable multi-agent systems demands a synergistic integration of model-based reasoning, data-driven learning, and physics-informed simulation. The framework we propose, which combines end-to-end decision-making with physical priors and differentiable simulation, offers a promising direction for future research in both autonomous flight and cooperative control.
