Journal of Shanghai Jiaotong University (Science)
ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation
Received date: 2024-11-13
Accepted date: 2024-12-02
Online published: 2026-02-12
Wu Zhiyang, Zhang Zhicheng, Dang Yonghao, Yin Jianqin, Tang Jin. ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation [J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(1): 143-153. DOI: 10.1007/s12204-025-2815-7