Intelligent Robots

ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation

  • School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received date: 2024-11-13

Accepted date: 2024-12-02

Online published: 2026-02-12

Abstract

Video pose estimation has gained significant attention in the field of deep learning. Compared with traditional image-based pose estimation, it exploits inter-frame relationships and temporal cues to produce more accurate and robust results. However, pose estimation in video still faces two challenges: modeling the dependencies between frames and meeting the latency requirements of real-world applications. To address these issues, we propose a lightweight video pose estimation model based on the Transformer architecture. First, we discard the heavy pose-initialization module and retain only a lightweight frame encoder to simplify the model. Second, we introduce a novel residual token initialization module that models frame dependencies and implicitly captures the spatial-temporal correlations between adjacent frames. Additionally, we employ TokenPose as the feature extractor; it uses self-attention to implicitly model the spatial relationships between keypoints while reducing model parameters and computational complexity. We evaluate our method on the Penn Action and Sub-JHMDB datasets, two commonly used benchmarks for video pose estimation. The results demonstrate that our approach achieves comparable performance while significantly reducing the number of model parameters and the computational complexity.
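The abstract does not specify how the residual token initialization is computed, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that the keypoint tokens for each frame are formed by adding the previous frame's output keypoint tokens to a set of learned tokens, and that keypoint and visual tokens are then mixed by a single self-attention step with identity projections. The names `init_keypoint_tokens`, `self_attention`, and `run_video`, and all sizes, are hypothetical.

```python
import math
import random


def matmul(a, b):
    # Plain matrix product of two row-major nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity query/key/value projections, for brevity.
    d = len(tokens[0])
    transposed = [list(col) for col in zip(*tokens)]
    scores = matmul(tokens, transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)


def init_keypoint_tokens(learned, previous=None):
    # Hypothetical residual initialization: learned tokens plus the
    # previous frame's output keypoint tokens (none for the first frame).
    if previous is None:
        return [row[:] for row in learned]
    return [[l + p for l, p in zip(lr, pr)] for lr, pr in zip(learned, previous)]


def run_video(frames_visual_tokens, learned):
    # Process frames in order, carrying keypoint tokens from frame to frame.
    previous = None
    outputs = []
    for visual in frames_visual_tokens:
        keypoints = init_keypoint_tokens(learned, previous)
        mixed = self_attention(keypoints + visual)
        previous = mixed[:len(keypoints)]  # keep only the keypoint tokens
        outputs.append(previous)
    return outputs


# Toy sizes (assumed): K keypoints, N visual tokens, D channels, T frames.
random.seed(0)
K, N, D, T = 3, 4, 8, 2
learned = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
video = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)]
outs = run_video(video, learned)
print(len(outs), len(outs[0]), len(outs[0][0]))
```

In this sketch the only temporal pathway is the addition inside `init_keypoint_tokens`, which is one way adjacent frames could be coupled without an explicit pose-initialization stage. A real model would use learned projections, multiple attention layers, and a prediction head on the keypoint tokens.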

Cite this article

Wu Zhiyang, Zhang Zhicheng, Dang Yonghao, Yin Jianqin, Tang Jin. ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation[J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(1): 143-153. DOI: 10.1007/s12204-025-2815-7

