Birds-Eye-View Semantic Segmentation and Voxels Semantic Segmentation Based on Frustum Voxels Modeling and Monocular Camera

doi:10.1007/s12204-023-2573-3

Abstract: Semantic segmentation of the bird's-eye view (BEV), which covers both static scene elements such as drivable areas and dynamic elements such as cars, is crucial for environment perception in autonomous driving. This paper proposes an end-to-end deep learning architecture based on 3D convolution that predicts BEV semantic segmentation, as well as voxel semantic segmentation, from monocular images. Voxelizing the scene and transforming features from perspective space to camera space are the key techniques the model uses to improve prediction accuracy. The method was validated by training and evaluating the model on the nuScenes dataset. In a comparison with other state-of-the-art methods, the proposed approach achieved better BEV semantic segmentation. It also performs voxel semantic segmentation, which the compared state-of-the-art methods cannot achieve.
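
The transformation from perspective space to camera space mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function names, the intrinsics (fx, fy, cx, cy), the sampling stride, and the depth bins are all hypothetical values chosen for illustration, and the code is not the authors' implementation. Each frustum voxel is identified by a pixel location (u, v) and a depth, and is mapped to a 3D point (x, y, z) in the camera frame.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map a pixel (u, v) at a given depth to camera coordinates (x, y, z).

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. The z axis points along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def frustum_voxel_centers(width: int, height: int, stride: int,
                          depths: List[float],
                          fx: float, fy: float,
                          cx: float, cy: float) -> List[Point3D]:
    """Return camera-space centers of frustum voxels.

    The image plane is sampled every `stride` pixels (starting at the
    center of the first cell), and each sample is repeated once per
    depth bin in `depths`.
    """
    centers = []
    for d in depths:
        for v in range(stride // 2, height, stride):
            for u in range(stride // 2, width, stride):
                centers.append(unproject(u, v, d, fx, fy, cx, cy))
    return centers


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 image.
    pts = frustum_voxel_centers(640, 480, 160, [5.0, 10.0],
                                fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(len(pts), pts[0])
```

Because the voxels are sampled uniformly in image space and depth, they fan out with distance in camera space. This is why features indexed by frustum voxels need a resampling step before they can be laid out on a regular camera-space or BEV grid.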

Key words: semantic segmentation, voxel semantic segmentation, deep learning, convolutional neural network, bird's-eye view (BEV)


CLC Number: TP391.4

QIN Chao (秦超), WANG Yafei (王亚飞), ZHANG Yuchao (张宇超), YIN Chengliang (殷承良). Birds-Eye-View Semantic Segmentation and Voxels Semantic Segmentation Based on Frustum Voxels Modeling and Monocular Camera [J]. J Shanghai Jiaotong Univ Sci, 2023, 28(1): 100-113.
