J Shanghai Jiaotong Univ Sci, 2025, Vol. 30, Issue (5): 899-910. doi: 10.1007/s12204-023-2691-y

Lightweight Human Pose Estimation Based on Multi-Attention Mechanism

林晓1,2,3, 陆美晨1,3, 高幕峰4, 李岩1,2

  1. Institute of Artificial Intelligence on Education Research, College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China; 2. Shanghai Intelligent Education Big Data Engineering Technology Research Center, Shanghai Normal University, Shanghai 200234, China; 3. Shanghai Online Education Research Base for Primary and Secondary Schools, Shanghai 200234, China; 4. School of Physical Education, Shanghai Normal University, Shanghai 200234, China
  • Received: 2023-08-03 Accepted: 2023-08-24 Online: 2025-09-26 Published: 2023-12-21

Keywords: human pose estimation, attention mechanism, multi-scale feature fusion, high-resolution network

Abstract: Human pose estimation has received much attention from the research community because of its wide range of applications. However, existing pose estimation networks are usually structurally complex and computationally intensive, and they suffer from feature loss during feature fusion. To address these problems, we propose LMANet, a lightweight human pose estimation network based on a multi-attention mechanism. Building on the high-resolution network, we lighten the bottleneck blocks with depth-wise separable convolution, which significantly reduces the number of network parameters. We then introduce a multi-attention mechanism to improve prediction accuracy: a channel attention module is added in the initial stage of the network to enhance local cross-channel information interaction, and a spatial cross-awareness module is injected in the multi-scale feature fusion stage to reduce the loss of spatial information during feature extraction. Extensive experiments on the COCO2017 and MPII datasets show that LMANet maintains high prediction accuracy with fewer parameters and less computation. Compared with the high-resolution network HRNet, the number of parameters and the computational complexity are reduced by 67% and 73%, respectively.
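
As a rough illustration of why depth-wise separable convolution shrinks a layer, the sketch below counts the weights of a standard convolution and of its depth-wise separable counterpart. It is not the authors' implementation: the function names and the 64-channel, 3 x 3 layer are hypothetical choices made only for this example, biases are ignored, and the resulting per-layer saving should not be confused with the network-level 67% figure reported in the abstract.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a depth-wise k x k convolution followed by a
    1 x 1 point-wise convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer width, chosen only for illustration.
c_in = c_out = 64
std = standard_conv_params(c_in, c_out)        # 36864 weights
sep = depthwise_separable_params(c_in, c_out)  # 4672 weights
reduction = 1 - sep / std                      # fraction of weights saved

print(f"standard: {std}, separable: {sep}, saved: {reduction:.1%}")
```

In general the ratio of separable to standard weights is 1/c_out + 1/k^2, so for 3 x 3 kernels the saving in a single layer approaches 8/9 as the number of output channels grows.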

CLC number: