Automation & Computer Science

Two-Stream Auto-Encoder Network for Unsupervised Skeleton-Based Action Recognition

Wang Gang, Guan Yaonan, Li Dewei
  • Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China

Accepted date: 2022-09-16
Online published: 2025-03-21

Abstract

Representation learning from unlabeled skeleton data is a challenging task. Prior unsupervised learning algorithms mainly rely on the modeling ability of recurrent neural networks to extract action representations. However, the structural information of skeleton data, which also plays a critical role in action recognition, is rarely exploited by existing unsupervised methods. To address this limitation, we propose a novel two-stream auto-encoder network that combines the topological information of skeleton data with its temporal information. Specifically, we encode the graph structure with a graph convolutional network (GCN) and integrate the extracted GCN-based representations into the gated recurrent unit (GRU) stream. We then design a transfer module to merge the representations of the two streams adaptively. According to the characteristics of the two-stream auto-encoder, a unified loss function composed of multiple tasks is proposed to update the learnable parameters of our model. Comprehensive experiments on the NW-UCLA, UWA3D, and NTU RGB+D 60 datasets demonstrate that the proposed method achieves excellent performance among unsupervised skeleton-based methods and even performs comparably to or better than numerous supervised skeleton-based methods.
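To make the pipeline described in the abstract more concrete, the following PyTorch sketch wires together a GCN stream over the joint graph, a GRU stream over the frame sequence, a gated fusion step standing in for the transfer module, and a GRU decoder that reconstructs the input skeleton sequence. All module names, layer sizes, the fusion rule, and the single reconstruction loss are illustrative assumptions; they are not the authors' architecture or their full multi-task objective.

```python
# Minimal two-stream skeleton auto-encoder sketch (assumptions, not the paper's model).
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph-convolution layer: X' = ReLU(A_hat X W) over the joint graph."""

    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)           # (V, V) normalized adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (N, T, V, C)
        x = torch.einsum("uv,ntvc->ntuc", self.adj, x)
        return torch.relu(self.linear(x))


class TwoStreamSkeletonAE(nn.Module):
    def __init__(self, adj, num_joints, in_dim=3, hidden=128):
        super().__init__()
        # Topology (GCN) stream: per-frame spatial encoding of the joint graph.
        self.gcn = nn.Sequential(GraphConv(in_dim, 16, adj), GraphConv(16, 32, adj))
        self.gcn_proj = nn.Linear(num_joints * 32, hidden)
        # Temporal (GRU) stream on the flattened joint coordinates.
        self.gru_enc = nn.GRU(num_joints * in_dim, hidden, batch_first=True)
        # "Transfer" module, assumed here to be a learned gate mixing the streams.
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        # Sequence decoder reconstructing joint coordinates frame by frame.
        self.gru_dec = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_joints * in_dim)

    def forward(self, x):                            # x: (N, T, V, C)
        n, t, v, c = x.shape
        spatial = self.gcn_proj(self.gcn(x).flatten(2))        # (N, T, hidden)
        temporal, _ = self.gru_enc(x.reshape(n, t, v * c))     # (N, T, hidden)
        g = self.gate(torch.cat([spatial, temporal], dim=-1))
        fused = g * spatial + (1.0 - g) * temporal             # adaptive merge
        dec, _ = self.gru_dec(fused)
        recon = self.out(dec).reshape(n, t, v, c)
        return recon, fused.mean(dim=1)              # reconstruction + clip-level code


if __name__ == "__main__":
    V = 25                                           # e.g. NTU RGB+D joint count
    adj = torch.eye(V)                               # placeholder; use the skeleton adjacency in practice
    model = TwoStreamSkeletonAE(adj, num_joints=V)
    clip = torch.randn(4, 50, V, 3)                  # (batch, frames, joints, xyz)
    recon, code = model(clip)
    loss = nn.functional.mse_loss(recon, clip)       # one term of a multi-task loss
    print(recon.shape, code.shape, loss.item())
```

After unsupervised training of such an auto-encoder, the clip-level code would typically be fed to a linear classifier to evaluate the learned action representation.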

Cite this article

Wang Gang, Guan Yaonan, Li Dewei. Two-Stream Auto-Encoder Network for Unsupervised Skeleton-Based Action Recognition [J]. Journal of Shanghai Jiaotong University (Science), 2025, 30(2): 330-336. DOI: 10.1007/s12204-023-2619-6

