Recently, stereo matching algorithms based on end-to-end convolutional neural networks have achieved excellent
performance, far exceeding that of traditional algorithms. Most current state-of-the-art stereo matching networks
rely on a full cost volume and 3D convolutions to regress dense disparity maps. These modules are computationally
complex, consume large amounts of memory, and are difficult to deploy in real-time applications. To overcome this
problem, we propose a multilevel disparity reconstruction network (MDRNet), a lightweight stereo matching network
that uses no 3D convolutions. We use stacked residual pyramids to gradually reconstruct disparity maps from
low resolution to full resolution, replacing the 3D convolutions commonly used for cost computation and optimization.
Our approach achieves competitive performance compared with other algorithms on stereo benchmarks and runs in
real time at 30 frames per second at 4×104 resolution.
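To make the memory argument concrete, the following Python sketch estimates the size of a single cost volume. The input resolution (384×1248), maximum disparity (192), feature channel count (64), construction at 1/4 resolution, and float32 storage are illustrative assumptions typical of KITTI-sized inputs, not figures reported for MDRNet or any specific network; the function name `cost_volume_bytes` is hypothetical.

```python
def cost_volume_bytes(height: int, width: int, max_disp: int, channels: int,
                      scale: int = 4, bytes_per_elem: int = 4) -> int:
    """Size in bytes of one cost volume of shape (channels, D, H, W).

    Assumption (illustrative only): the volume is built at 1/scale of the input
    resolution and disparity range and stored as float32 (4 bytes per value).
    channels > 1 models a feature-concatenation 4D volume; channels == 1 models
    a correlation-style 3D volume holding one matching score per disparity.
    """
    h = height // scale
    w = width // scale
    d = max_disp // scale
    return channels * d * h * w * bytes_per_elem


MIB = 1024 * 1024

# Hypothetical KITTI-sized input: 384 x 1248 pixels, maximum disparity 192.
concat_volume = cost_volume_bytes(384, 1248, 192, channels=64)
corr_volume = cost_volume_bytes(384, 1248, 192, channels=1)
print(f"4D concatenation volume: {concat_volume / MIB:.1f} MiB")  # 351.0 MiB
print(f"3D correlation volume:   {corr_volume / MIB:.1f} MiB")    # 5.5 MiB
```

This counts a single volume only; the stacked 3D convolutions that regularize it produce further activations of comparable shape, so the real footprint is typically a multiple of this figure. The 64× gap between the two results is exactly the channel count, which is one reason designs that avoid 4D volumes and 3D convolutions can be much lighter.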
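The general coarse-to-fine idea behind residual disparity reconstruction can be sketched in plain Python under stated assumptions. At each finer pyramid level, the previous disparity map is upsampled by a factor of two, its values are doubled (disparity is measured in pixels, so it scales with image width), and a predicted residual is added. Nearest-neighbour upsampling, the function names, and the `predict_residual` callable are hypothetical stand-ins chosen for brevity; in a learned network the residual would come from 2D convolutional layers, whose actual design is not reproduced here.

```python
from typing import Callable, Dict, List

Grid = List[List[float]]  # disparity map stored row-major as nested lists


def upsample_disparity(disp: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling. Values are doubled because disparity
    is measured in pixels and the image width doubles."""
    out: Grid = []
    for row in disp:
        wide = [2.0 * d for d in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def add_maps(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two equally sized disparity maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def reconstruct(coarse: Grid, num_levels: int,
                predict_residual: Callable[[int, Grid], Grid]) -> List[Grid]:
    """Coarse-to-fine residual reconstruction.

    Level 0 is the coarse input; each later level doubles the resolution,
    then adds the residual returned by predict_residual(level, upsampled).
    Returns the estimate at every level, coarsest first.
    """
    estimates = [coarse]
    disp = coarse
    for level in range(1, num_levels):
        up = upsample_disparity(disp)
        disp = add_maps(up, predict_residual(level, up))
        estimates.append(disp)
    return estimates


# Toy demo: a hypothetical "oracle" that knows the target at each level
# stands in for the learned residual network.
targets: Dict[int, Grid] = {
    1: [[6.0, 6.5], [5.0, 5.5]],
    2: [[12.0, 12.0, 13.0, 13.0],
        [12.0, 12.0, 13.0, 13.0],
        [10.0, 10.0, 11.0, 11.0],
        [10.0, 10.0, 11.0, 11.0]],
}


def oracle_residual(level: int, up: Grid) -> Grid:
    """Residual = target at this level minus the upsampled estimate."""
    return [[t - u for t, u in zip(rt, ru)] for rt, ru in zip(targets[level], up)]


levels = reconstruct([[3.0]], num_levels=3, predict_residual=oracle_residual)
for lvl, disp in enumerate(levels):
    print(lvl, disp)
```

In the demo, the level-1 residual corrects the rough 1×1 estimate, and by level 2 the upsampled map already matches the target, so the residual there is zero. Predicting only a small correction at each level, instead of searching the whole disparity range at full resolution, is one way such designs avoid a full-resolution cost volume.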