Abstract. Unmanned aerial vehicle (UAV) cluster systems offer capability redundancy, high resistance to destruction, and adaptability to complex scenarios, which allow more efficient mission execution and information acquisition. In recent years, deep reinforcement learning has been incorporated into UAV cluster formation control to address the dimensional explosion of cluster systems and the difficulty of modelling them. However, deep reinforcement learning suffers from problems such as low training efficiency. This paper proposes a cluster formation method based on an improved proximal policy optimization (PPO) algorithm. By using a dynamic estimation method as the evaluation mechanism, the proposed method addresses the slow convergence of traditional PPO and its neglect of high-value actions, and effectively improves the data utilization rate. Simulation results verify the improvements in training efficiency and sample reuse, and thus the improved performance of the method.