3D Hand Pose Estimation Using Semantic Dynamic Hypergraph Convolutional Networks

doi:10.1007/s12204-024-2697-0

Abstract

Abstract: Owing to self-occlusion and the high degrees of freedom of the hand, estimating 3D hand pose from a single RGB image is a highly challenging problem. Graph convolutional networks (GCNs) use graphs to describe the physical connections between hand joints, which improves the accuracy of 3D hand pose regression to some extent. However, GCNs cannot effectively describe the relationships between non-adjacent hand joints. Hypergraph convolutional networks (HGCNs) have recently received much attention because they can describe multi-way, high-dimensional relationships between nodes through hyperedges. This paper therefore proposes an HGCN-based framework for 3D hand pose estimation that better extracts the correlations between both adjacent and non-adjacent hand joints. To overcome the shortcomings of predefined hypergraph structures, a dynamic hypergraph convolutional network (DHGCN) is proposed, in which hyperedges are constructed dynamically based on the similarity of hand joint features. To better explore the local semantic relationships between nodes, a semantic dynamic hypergraph convolution (SDHGCN) is proposed. The proposed method is evaluated on two publicly available benchmark datasets, STB and RHD. Both qualitative and quantitative experimental results show that the proposed HGCN and its improved variants outperform GCN for 3D hand pose estimation and achieve state-of-the-art performance compared with existing methods.

Keywords: hand pose estimation, hypergraph convolution, dynamic hypergraph convolution, semantic dynamic hypergraph convolution
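The idea of building hyperedges dynamically from joint feature similarity can be illustrated with a minimal NumPy sketch. The abstract does not specify the construction rule or the convolution formula, so everything below is an assumption made for illustration: hyperedges are formed by k-nearest neighbours in feature space, the convolution uses the commonly used normalized form $D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} X \Theta$, and the function names `knn_hyperedges` and `hypergraph_conv`, the 21-joint hand, and the feature sizes are all hypothetical.

```python
import numpy as np

def knn_hyperedges(x, k):
    """Assumed construction: one hyperedge per node, containing the node
    itself and its k-1 nearest neighbours in feature space.
    Returns the incidence matrix H of shape (n_nodes, n_edges)."""
    n = x.shape[0]
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # pairwise squared distances
    np.fill_diagonal(d2, -np.inf)                   # force each node to rank first
    nearest = np.argsort(d2, axis=1)[:, :k]
    H = np.zeros((n, n))
    for e in range(n):
        H[nearest[e], e] = 1.0                      # column e = members of hyperedge e
    return H

def hypergraph_conv(x, H, theta, w=None):
    """One hypergraph convolution layer with ReLU, using the normalized
    propagation Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta."""
    w = np.ones(H.shape[1]) if w is None else w     # hyperedge weights
    dv = H @ w                                      # weighted node degrees
    de = H.sum(axis=0)                              # hyperedge degrees
    dv_inv_sqrt = 1.0 / np.sqrt(dv)
    G = (dv_inv_sqrt[:, None] * H * (w / de)[None, :]) @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(G @ x @ theta, 0.0)

# Hypothetical input: 21 hand joints with 8-dimensional features.
rng = np.random.default_rng(0)
joints = rng.normal(size=(21, 8))
H = knn_hyperedges(joints, k=4)                     # rebuilt whenever features change
theta = rng.normal(size=(8, 16))                    # stand-in for learnable weights
out = hypergraph_conv(joints, H, theta)
print(H.shape, out.shape)
```

Because `H` is recomputed from the current features, a hyperedge can group joints that are far apart in the skeleton but similar in feature space, which a fixed skeleton graph cannot express. The semantic variant named in the abstract is not sketched here, since the abstract gives no detail on how it is defined.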

CLC number: TP183

WU Yalei, LI Jinghua, KONG Dehui, LI Qianxing, YIN Baocai. 3D Hand Pose Estimation Using Semantic Dynamic Hypergraph Convolutional Networks[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(5): 855-865.
