Multi-Human Pose Estimation by Deep Learning-Based Sequential Approach for Human Keypoint Position and Human Body Detection

TAHIR Rizwana, CAI Yunze

doi:10.1007/s12204-023-2658-z

Journal of Shanghai Jiaotong University (Science), 2025, Vol. 30, Issue 6: 1103-1113

Automation & Computer Technologies

a. Department of Automation; b. Key Laboratory of System Control and Information Processing of Ministry of Education; c. Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China

Received date: 2022-10-28

Accepted date: 2023-02-10

Online published: 2023-10-24

Abstract

Recent multimedia and computer vision research has focused on analyzing human behavior and activity using images. Skeleton estimation, also known as pose estimation, has received significant attention. For human pose estimation, deep learning approaches primarily emphasize keypoint features. However, for occluded or incomplete poses, keypoint features alone are insufficient, especially when multiple humans appear in a single frame. Other features, such as the body border and visibility conditions, can contribute to pose estimation in addition to the keypoint features. Our model framework integrates multiple features via a mask region-based convolutional neural network (Mask-RCNN): the human body mask features, which can serve as a constraint on keypoint location estimation, the body keypoint features, and the keypoint visibility. A sequential multi-feature learning setup is formed to share multiple features across the structure, whereas in Mask-RCNN the only feature that can be shared through the system is the region-of-interest feature. By two-way up-scaling with a shared-weight process to produce the mask, we address the problems of improper segmentation, small intrusion, and object loss that arise when Mask-RCNN is used for instance segmentation. Accuracy is measured by the percentage of correct keypoints, and our model identifies 86.1% of the keypoints correctly.
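As an illustration of the accuracy measure named in the abstract, the following is a minimal sketch of how a percentage of correct keypoints could be computed for one person. The abstract does not state the distance threshold or the normalizing length, so the function name `pck`, the parameters `ref_length` and `alpha`, and the choice to skip invisible keypoints are all assumptions made for this example, not the authors' evaluation protocol.

```python
import math


def pck(pred, truth, visible, ref_length, alpha=0.5):
    """Percentage of correct keypoints for one person (illustrative sketch).

    pred, truth: lists of (x, y) keypoint coordinates in the same order.
    visible: list of bools; keypoints marked False are skipped (assumption).
    ref_length: normalizing length, e.g. a head or torso size (assumption).
    alpha: fraction of ref_length within which a prediction counts as correct.
    """
    threshold = alpha * ref_length
    correct = 0
    counted = 0
    for (px, py), (tx, ty), vis in zip(pred, truth, visible):
        if not vis:
            continue  # ignore keypoints without a visible ground-truth label
        counted += 1
        if math.hypot(px - tx, py - ty) <= threshold:
            correct += 1
    # Return 0.0 when no keypoint is visible to avoid dividing by zero.
    return 100.0 * correct / counted if counted else 0.0


# Hypothetical data: four keypoints, the last one not visible.
pred = [(0, 0), (10, 10), (20, 20), (5, 5)]
truth = [(0, 1), (10, 20), (20, 20), (100, 100)]
visible = [True, True, True, False]
print(round(pck(pred, truth, visible, ref_length=10, alpha=0.5), 1))  # prints 66.7
```

Under these assumptions, a reported score such as 86.1% would mean that 86.1% of the evaluated keypoints fall within the chosen distance threshold of their ground-truth positions.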

Cite this article

TAHIR Rizwana, CAI Yunze. Multi-Human Pose Estimation by Deep Learning-Based Sequential Approach for Human Keypoint Position and Human Body Detection[J]. Journal of Shanghai Jiaotong University (Science), 2025, 30(6): 1103-1113. DOI: 10.1007/s12204-023-2658-z
