Multi-Human Pose Estimation by Deep Learning-Based Sequential Approach for Human Keypoint Position and Human Body Detection

doi:10.1007/s12204-023-2658-z

Abstract

Recent multimedia and computer vision research has focused on analyzing human behavior and activity from images. Skeleton estimation, also known as pose estimation, has received significant attention. Deep learning approaches to human pose estimation rely primarily on keypoint features. However, for occluded or incomplete poses, keypoint features alone are not sufficiently informative, especially when a single frame contains multiple people. Other features, such as the body boundary and keypoint visibility, can contribute to pose estimation alongside the keypoint features. Our model framework, built on the mask region-based convolutional neural network (Mask-RCNN), integrates multiple features: human body mask features, which can serve as a constraint on keypoint location estimation; body keypoint features; and keypoint visibility. A sequential multi-feature learning setup shares these features across the structure, whereas in Mask-RCNN the only feature shared through the system is the region-of-interest feature. By producing the mask through two-way up-scaling with a shared-weight process, we address the problems of improper segmentation, small intrusion, and object loss that arise when Mask-RCNN is used for instance segmentation. Accuracy is measured by the percentage of correct keypoints, and our model identifies 86.1% of keypoints correctly.

Key words: multiperson, pose estimation, multi-feature learning, mask region-based convolutional neural network (RCNN), deep learning
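The accuracy measure named in the abstract, percentage of correct keypoints (PCK), can be illustrated with a minimal sketch. The abstract does not state the distance threshold or reference length used in the paper, so `alpha`, `ref_length`, and the function name `pck` below are assumptions chosen for illustration, as is the choice to score only keypoints marked visible.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(
    pred: Sequence[Point],
    gt: Sequence[Point],
    visible: Sequence[bool],
    ref_length: float,
    alpha: float = 0.5,
) -> float:
    """Fraction of visible keypoints predicted within alpha * ref_length of the ground truth.

    ref_length is a per-person normalizer (for example a bounding-box side or
    head size); which one the paper uses is not stated, so it is an assumption here.
    """
    if not (len(pred) == len(gt) == len(visible)):
        raise ValueError("pred, gt and visible must have the same length")

    threshold = alpha * ref_length
    counted = 0
    correct = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # keypoints not marked visible are excluded from the score
        counted += 1
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / counted if counted else 0.0


# Toy data (hypothetical, not taken from the paper).
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
pred = [(11.0, 10.0), (20.0, 26.0), (30.0, 31.0), (90.0, 90.0)]
visible = [True, True, True, False]

score = pck(pred, gt, visible, ref_length=10.0, alpha=0.5)
print(f"PCK = {score:.3f}")
```

With a threshold of 5 pixels, two of the three visible keypoints fall within range, so the toy score is 2/3. A dataset-level PCK would average over all annotated people in the same way.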

CLC Number: TP391.4

TAHIR Rizwana, CAI Yunze. Multi-Human Pose Estimation by Deep Learning-Based Sequential Approach for Human Keypoint Position and Human Body Detection[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(6): 1103-1113.
