J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue (1): 143-153. DOI: 10.1007/s12204-025-2815-7

• Intelligent Robots •

ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation


WU Zhiyang, ZHANG Zhicheng, DANG Yonghao, YIN Jianqin, TANG Jin

  School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2024-11-13; Accepted: 2024-12-02; Online: 2026-02-28; Published: 2026-02-12

Abstract: Video pose estimation has gained significant attention in the field of deep learning. Compared to traditional image-based pose estimation methods, video pose estimation leverages inter-frame relationships and temporal cues to provide more accurate and robust results. However, pose estimation in video still faces challenges in modeling inter-frame dependencies and meeting the latency requirements of real-world applications. To address these issues, we propose a lightweight video pose estimation model based on the Transformer architecture. First, we discard the heavy pose-initialization module and retain only a lightweight frame encoder to simplify the model. Second, we introduce a novel residual token initialization module to model frame dependencies and implicitly capture the spatial-temporal correlations between adjacent frames. Additionally, we employ TokenPose as the feature extractor, which leverages self-attention mechanisms to implicitly model the spatial relationships between keypoints and effectively reduces model parameters and computational complexity. We evaluate our method on the Penn Action and Sub-JHMDB datasets, two commonly used benchmarks for video pose estimation. The results demonstrate that our approach achieves comparable performance while significantly reducing the number of model parameters and the computational complexity.
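To make the two main ideas in the abstract concrete, the following NumPy sketch illustrates (i) TokenPose-style self-attention over per-keypoint tokens and (ii) a residual token initialization in which each frame's tokens start from the previous frame's tokens plus a residual computed from current-frame features. This is a minimal illustration, not the authors' implementation: the function names (`keypoint_self_attention`, `residual_token_init`), the weight matrices, and all tensor shapes (except the 13 keypoints annotated in Penn Action) are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def keypoint_self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over K keypoint tokens of dimension D.

    Each keypoint token attends to every other token, implicitly modeling
    the spatial relationships between keypoints (the TokenPose idea).
    """
    Q, Km, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ Km.T / np.sqrt(Km.shape[-1]))  # (K, K), rows sum to 1
    return attn @ V                                    # (K, D)

def residual_token_init(prev_tokens, frame_feat, Wr):
    # Residual token initialization (sketch): tokens for the current frame
    # start from the previous frame's tokens plus a residual projected from
    # current-frame features, so temporal context is carried forward.
    return prev_tokens + frame_feat @ Wr

K, D, F = 13, 32, 64   # 13 keypoints (Penn Action); token/feature dims assumed
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Wr = rng.standard_normal((F, D)) * 0.1

tokens = rng.standard_normal((K, D))            # learned tokens for frame 0
for t in range(1, 4):                           # three subsequent frames
    frame_feat = rng.standard_normal((K, F))    # per-keypoint features from a frame encoder (assumed)
    tokens = residual_token_init(tokens, frame_feat, Wr)
    tokens = keypoint_self_attention(tokens, Wq, Wk, Wv)

print(tokens.shape)  # (13, 32)
```

Because attention is computed over only K tokens rather than a dense feature map, the cost per frame is O(K^2 D), which is consistent with the paper's emphasis on reduced parameters and computational complexity.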

Key words: computer vision, video pose estimation, Transformer, spatio-temporal modeling


