J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (1): 143-153. doi: 10.1007/s12204-025-2815-7
• Intelligent Robots •
Wu Zhiyang, Zhang Zhicheng, Dang Yonghao, Yin Jianqin, Tang Jin
Received: 2024-11-13
Accepted: 2024-12-02
Online: 2026-02-28
Published: 2026-02-12
Wu Zhiyang, Zhang Zhicheng, Dang Yonghao, Yin Jianqin, Tang Jin. ListPose: Lightweight and Implicit Spatial-Temporal Modeling with TokenPose for Video-Based Pose Estimation[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(1): 143-153.