J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue (1): 143-153. DOI: 10.1007/s12204-025-2815-7

• Intelligent Robots •

ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation


WU Zhiyang, ZHANG Zhicheng, DANG Yonghao, YIN Jianqin, TANG Jin

  School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2024-11-13; Accepted: 2024-12-02; Online: 2026-02-28; Published: 2026-02-12

Abstract: Video pose estimation has gained significant attention in the field of deep learning. Compared to traditional image-based pose estimation methods, video pose estimation leverages inter-frame relationships and temporal cues to provide more accurate and robust results. However, pose estimation in video still faces challenges in modeling inter-frame dependencies and meeting the latency requirements of real-world applications. To address these issues, we propose a lightweight video pose estimation model based on the Transformer architecture. First, we discard the heavy pose-initialization module and retain only a lightweight frame encoder to simplify the model. Second, we introduce a novel residual token initialization module to model frame dependencies and implicitly capture the spatial-temporal correlations between adjacent frames. Additionally, we employ TokenPose as the feature extractor, which leverages self-attention mechanisms to implicitly model the spatial relationships between keypoints and effectively reduces model parameters and computational complexity. We evaluate our method on the Penn Action and Sub-JHMDB datasets, two commonly used benchmarks for video pose estimation. The results demonstrate that our approach achieves comparable performance while significantly reducing the number of model parameters and the computational complexity.
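To make the two main ideas in the abstract concrete, the following NumPy sketch illustrates (i) TokenPose-style self-attention over per-keypoint tokens and (ii) a residual token initialization in which each frame's tokens start from the previous frame's tokens plus a residual computed from current-frame features. This is a minimal illustration, not the authors' implementation: the function names (`keypoint_self_attention`, `residual_token_init`), the weight matrices, and all tensor shapes (except the 13 keypoints annotated in Penn Action) are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def keypoint_self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over K keypoint tokens of dimension D.

    Each keypoint token attends to every other token, implicitly modeling
    the spatial relationships between keypoints (the TokenPose idea).
    """
    Q, Km, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ Km.T / np.sqrt(Km.shape[-1]))  # (K, K), rows sum to 1
    return attn @ V                                    # (K, D)

def residual_token_init(prev_tokens, frame_feat, Wr):
    # Residual token initialization (sketch): tokens for the current frame
    # start from the previous frame's tokens plus a residual projected from
    # current-frame features, so temporal context is carried forward.
    return prev_tokens + frame_feat @ Wr

K, D, F = 13, 32, 64   # 13 keypoints (Penn Action); token/feature dims assumed
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Wr = rng.standard_normal((F, D)) * 0.1

tokens = rng.standard_normal((K, D))            # learned tokens for frame 0
for t in range(1, 4):                           # three subsequent frames
    frame_feat = rng.standard_normal((K, F))    # per-keypoint features from a frame encoder (assumed)
    tokens = residual_token_init(tokens, frame_feat, Wr)
    tokens = keypoint_self_attention(tokens, Wq, Wk, Wv)

print(tokens.shape)  # (13, 32)
```

Because attention is computed over only K tokens rather than a dense feature map, the cost per frame is O(K^2 D), which is consistent with the paper's emphasis on reduced parameters and computational complexity.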

Key words: computer vision, video pose estimation, Transformer, spatio-temporal modeling


