Abstract:
Unmanned aerial vehicle (UAV) swarms have found extensive applications in various fields, where multiple UAVs cooperate to achieve common objectives. A key challenge in swarm operations is collision-free formation control. Deep reinforcement learning methods have received significant attention for this problem, but their application on autonomous UAVs poses challenges, including dependency on global information during training, difficulties in sampling, and excessive resource consumption. To overcome these challenges, this work proposes a novel approach based on multi-agent deep reinforcement learning (MARL) for collision-free formation control of UAV swarms. MARL allows each UAV to interact with a dynamic environment that includes the other UAVs, enabling collaborative decision-making and adaptive behavior. We leverage local information to construct the state space of each individual UAV, and train the policy network with the multi-agent proximal policy optimization (MAPPO) algorithm, which enables robust learning and policy optimization in a multi-agent setting. We also address sampling difficulties and resource constraints with digital twin technology, which serves as a bridge between physical entities and virtual models and offers a novel approach to the intelligent collaborative control of drone swarms. By establishing models in virtual space, digital twin technology simulates real-world environments for pre-training the reinforcement learning algorithm on synthetic experiences. We construct multiple digital twin environments to facilitate interactive sampling and pre-train the swarm with basic task capabilities, and then supplement the training with real-world data collected in actual environments, enhancing the swarm's ability to perform optimally in real-world scenarios.
To evaluate the effectiveness of our approach, we compare the two-stage training architecture against other policy algorithms. In particular, to validate the sample efficiency of the on-policy MAPPO algorithm, we perform a comparative analysis against off-policy algorithms; the results reveal the superior sample efficiency and stability of MAPPO in addressing the challenges of collision-free formation control. Finally, we conduct a real-flight test to verify the practicality and reliability of the policy model derived from the digital twin environments. Overall, this work demonstrates the effectiveness of the proposed approach in enabling UAV swarms to navigate complex environments and achieve collision-free formation control.