J Shanghai Jiaotong Univ Sci, 2025, Vol. 30, Issue (6): 1103-1113. doi: 10.1007/s12204-023-2658-z



Multi-Human Pose Estimation by Deep Learning-Based Sequential Approach for Human Keypoint Position and Human Body Detection

TAHIR Rizwan (a, b), CAI Yunze (蔡云泽) (a, b, c)

  1. a. Department of Automation; b. Key Laboratory of System Control and Information Processing of Ministry of Education; c. Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2022-10-28; Accepted: 2023-02-10; Online: 2025-11-21; Published: 2025-11-26


Keywords: multi-human, pose estimation, multi-feature learning, mask region-based convolutional neural network, deep learning

Abstract: Recent multimedia and computer vision research has focused on analyzing human behavior and activity from images. Skeleton estimation, also known as pose estimation, has received significant attention. For human pose estimation, deep learning approaches primarily emphasize keypoint features. However, for occluded or incomplete poses, keypoint features alone are not informative enough, especially when a single frame contains multiple humans. Other features, such as the body boundary and visibility conditions, can contribute to pose estimation in addition to the keypoint features. Our model framework uses a mask region-based convolutional neural network (Mask-RCNN) to integrate multiple features: the human body mask features, which can serve as a constraint on keypoint location estimation; the body keypoint features; and the keypoint visibility. A sequential multi-feature learning setup shares these features across the structure, whereas in the original Mask-RCNN the only feature that can be shared through the system is the region-of-interest feature. By using two-way up-scaling with shared weights to produce the mask, we address the problems of improper segmentation, small intrusions, and object loss that arise when Mask-RCNN is used for instance segmentation. Accuracy is measured by the percentage of correct keypoints (PCK), and our model achieves a PCK of 86.1%.
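The percentage-of-correct-keypoints metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how the distance threshold is normalized, so the per-person reference length `ref_lengths`, the factor `alpha`, and the optional `visible` flags below are assumptions made for illustration only, not the authors' evaluation protocol.

```python
import math


def pck(pred, truth, ref_lengths, alpha=0.5, visible=None):
    """Fraction of keypoints predicted within alpha * ref_length of the ground truth.

    pred, truth: one list per person, each a list of (x, y) keypoints.
    ref_lengths: one normalizing length per person (assumed here, for
        example a head or bounding-box size; the abstract does not specify).
    visible: optional per-person lists of booleans; keypoints flagged
        False are excluded from the count.
    """
    correct = 0
    total = 0
    for i, (pred_person, truth_person) in enumerate(zip(pred, truth)):
        limit = alpha * ref_lengths[i]
        for j, ((px, py), (tx, ty)) in enumerate(zip(pred_person, truth_person)):
            if visible is not None and not visible[i][j]:
                continue  # skip keypoints that are not annotated as visible
            total += 1
            if math.hypot(px - tx, py - ty) <= limit:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical single-person example with three keypoints.
truth = [[(10, 10), (20, 20), (30, 30)]]
pred = [[(11, 10), (20, 26), (30, 31)]]
score_all = pck(pred, truth, ref_lengths=[10], alpha=0.5)
score_visible = pck(pred, truth, ref_lengths=[10], alpha=0.5,
                    visible=[[True, False, True]])
```

With a threshold of 5 pixels (0.5 × 10), the second keypoint lies 6 pixels from its ground truth and is counted as incorrect, so two of three keypoints are correct; excluding it through the visibility flags leaves only correct keypoints.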
