Action-aware Encoder-Decoder Network for Pedestrian Trajectory Prediction

doi:10.1007/s12204-023-2565-3

Abstract

Accurate pedestrian trajectory prediction is critical in self-driving systems because it underpins the response and decision-making of the ego vehicle. In this study, we focus on predicting the future trajectories of pedestrians from a first-person perspective. Most existing first-person-view trajectory prediction methods simply adopt approaches designed for the bird’s-eye view, neglecting the differences between the two views. We therefore clarify these differences and highlight the importance of action-aware trajectory prediction in the first-person view. We propose a new action-aware network based on an encoder-decoder framework, with an action prediction branch and a goal estimation branch at the end of the encoder. In the decoder, bidirectional long short-term memory (Bi-LSTM) blocks generate the final prediction of pedestrians’ future trajectories. Evaluated on a public dataset, our method achieved competitive performance compared with other approaches. An ablation study demonstrates the effectiveness of the action prediction branch.

Key words: pedestrian trajectory prediction, first-person view, action prediction, encoder-decoder, bidirectional long short-term memory (Bi-LSTM)

CLC Number: TP391.4

FU Jiawei∗ (傅家威), ZHAO Xu (赵旭). Action-aware Encoder-Decoder Network for Pedestrian Trajectory Prediction[J]. J Shanghai Jiaotong Univ Sci, 2023, 28(1): 20-27.
