Lightweight Human Pose Estimation Based on Multi-Attention Mechanism

doi:10.1007/s12204-023-2691-y

Abstract

Abstract: Human pose estimation has received much attention from the research community because of its wide range of applications. However, existing network architectures are usually complex and computationally intensive, and they suffer from feature loss during feature fusion. To address these problems, we propose LMANet, a lightweight human pose estimation network based on a multi-attention mechanism. Building on the high-resolution network, we lighten the bottleneck blocks with depth-wise separable convolution, which significantly reduces the number of network parameters. We then introduce a multi-attention mechanism to improve prediction accuracy: a channel attention module is added in the initial stage of the network to enhance local cross-channel information interaction, and a spatial attention mechanism, in the form of a spatial cross-awareness module, is introduced in the multi-scale feature fusion stage to reduce the loss of spatial information during feature extraction. Experiments on the COCO2017 and MPII datasets show that LMANet maintains high prediction accuracy with fewer parameters and less computation. Compared with the high-resolution network HRNet, the number of parameters and the computational complexity are reduced by 67% and 73%, respectively.

Keywords: human pose estimation, attention mechanism, multi-scale feature fusion, high-resolution network

CLC number: TP391.41

LIN Xiao, LU Meichen, GAO Mufeng, LI Yan. Lightweight Human Pose Estimation Based on Multi-Attention Mechanism[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(5): 899-910.
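The abstract attributes much of the parameter reduction to replacing standard convolutions in the bottleneck blocks with depth-wise separable convolutions. The following is a minimal sketch of the weight-count arithmetic for a single layer, not the authors' implementation. The function names and the 64-channel, 3×3 configuration are hypothetical values chosen for illustration, and bias terms are ignored. The single-layer saving it prints is not the 67% network-wide figure reported in the abstract, which depends on the full architecture.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 64, 64, 3

standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2

print(f"standard conv weights:  {standard}")
print(f"separable conv weights: {separable}")
print(f"separable / standard:   {ratio:.4f}")
```

The ratio simplifies to 1/c_out + 1/k², so for 3×3 kernels the per-layer saving approaches roughly a factor of nine as the number of output channels grows.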
