PROCEEDINGS OF SPIE
SPIEDigitalLibrary.org/conference-proceedings-of-spie

Yashar Deldjoo, Tommaso Di Noia, Eugenio Di Sciascio, Gaetano Pernisco, Vito Renò, Ettore Stella, "Towards real-time monocular depth estimation for mobile systems," Proc. SPIE 11785, Multimodal Sensing and Artificial Intelligence: Technologies and Applications II, 117850J (20 June 2021); doi: 10.1117/12.2596031

Event: SPIE Optical Metrology, 2021, Online Only
Towards Real-Time Monocular Depth Estimation For Mobile Systems

Yashar Deldjoo(a), Tommaso Di Noia(a), Eugenio Di Sciascio(a), Gaetano Pernisco(a,*), Vito Renò(b), and Ettore Stella(b)

(a) Polytechnic University of Bari, Via Amendola 126/b, Bari, Italy
(b) National Research Council, Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Via Giovanni Amendola 122, Bari, Italy

ABSTRACT

Thanks to the development of advanced driver-assistance systems (ADAS) that support drivers in their driving tasks, autonomous driving is becoming part of our lives. This rapid development is mainly driven by the higher safety levels these systems can guarantee to the vehicles that travel on our roads every day. At the heart of every application in this area lies the perception of the environment, which guides the vehicle's behavior. This holds for autonomous driving as well as for every application in which a system moves in the 3D real world, such as robotics and augmented reality. An effective 3D perception system is therefore necessary to accurately localize all the objects that compose the scene and to reconstruct it as a 3D model. This problem is often addressed with LIDAR sensors, which provide accurate 3D perception and high robustness in unfavorable light and weather conditions. However, these sensors are generally expensive and therefore not the right choice for low-cost vehicles and robots. Moreover, they must be mounted in particular positions, which prevents integrating them into the car without changing its appearance and aerodynamics. In addition, their output is a point cloud, a data structure that is not easily handled by the deep learning models that promise outstanding results in similar predictive tasks. For these reasons, in some applications it is preferable to rely on other sensors, such as RGB cameras, for 3D perception. Classic approaches in this direction are based on stereo cameras, RGB-D cameras, and stereo from motion, which generally reconstruct the scene with less accuracy than LIDARs but still produce acceptable results. In recent years, several approaches have been proposed in the literature that estimate depth from a monocular camera using deep learning models. Some of these methods follow a supervised approach,1, 2 but they rely on annotated datasets that are labor-expensive to collect. Other works3, 4 therefore adopt a self-supervised training procedure based on the reprojection error. Notwithstanding their good performance, most of the proposed approaches use very deep neural networks that are power- and resource-consuming and need high-end GPUs to produce results in real time. For these reasons, they cannot be used in systems with power and computational limits. In this work, we propose a new approach based on a standard CNN from the literature, originally designed for image segmentation and built to have modest resource requirements. For training, we use knowledge distillation with an off-the-shelf pre-trained network as teacher. We run large-scale experiments to compare our results, both qualitatively and quantitatively, with those of the baselines. Moreover, we present an in-depth study of inference times on both general-purpose and mobile architectures.
Keywords: Depth prediction, Autonomous Driving, Monocular depth, Knowledge distillation

*Authors are listed in alphabetical order. Corresponding author: Gaetano Pernisco, e-mail: gaetano.pernisco@poliba.it

Multimodal Sensing and Artificial Intelligence: Technologies and Applications II, edited by Ettore Stella, Proc. of SPIE Vol. 11785, 117850J · © 2021 SPIE · CCC code: 0277-786X/21/$21 · doi: 10.1117/12.2596031
1. INTRODUCTION

Depth estimation is a long-studied problem in computer vision. Different sensors can provide depth information by relying on different technologies: LIDARs take advantage of laser technology, while depth cameras usually rely on structured light. Other methods exploit geometric constraints between multi-view images; in particular, most approaches rely on stereo vision and stereo-from-motion. These approaches assume that multiple views of the scene are available, which is not always possible in real applications.

Depth estimation is a crucial task for many applications in the autonomous driving, robotics, and augmented reality fields. It enables an accurate three-dimensional reconstruction of the environment and is consequently at the base of many complex applications. Recently, a large number of works have treated monocular depth estimation as a supervised learning task based on convolutional neural networks (CNNs). These approaches, even if they reach impressive results, need a large number of images with the respective ground-truth depth maps for training, and such datasets are very difficult and labor-intensive to create.

Estimating the depth of a scene from a single image is an ill-posed problem, since it cannot rely on epipolar geometric constraints. Humans perform this task well by unconsciously making assumptions about object dimensions.5 Some recent works address the problem in a self-supervised way by exploiting photometric reprojection constraints. These approaches, in contrast with the supervised methods, are trained on multiple views of the same scene and do not need ground-truth depth data; such datasets are simpler to acquire and require less effort.

Most of the approaches proposed in the literature rely on very deep CNNs that reach impressive results, but they are power- and resource-consuming and need high-end GPUs to produce results in real time. For this reason, they cannot be used in applications where power and resources are limited, e.g. augmented reality or mobile robotics.

In this work we propose a new self-supervised approach based on knowledge distillation. We aim to provide a pipeline able to exploit the knowledge contained in a deep pre-trained CNN and transfer it to a smaller one. Our contributions are mainly three:

i. We propose an approach that distills the knowledge of an off-the-shelf CNN, Monodepth2,6 and transfers it to another network (DeepLabv3+7) to tackle the monocular depth estimation task.

ii. We use the DeepLabv3+ architecture for a task different from segmentation, for which it was created.

iii. Finally, we run large-scale experiments to test the accuracy of different versions of our model on KITTI8 and measure their inference times on different hardware architectures to compare their performance.

2. RELATED WORK

Depth estimation from images is a long-studied task in the computer vision community. Most works focused either on the use of stereo images,9 of multiple images acquired from different viewpoints10 or at different times,11 or on making assumptions about the scene, e.g. a static scene observed from a fixed viewpoint under different lighting.12, 13 These methods find space in many applications but rely on multiple images of the same scene.
In this section we present an overview of works focused on monocular depth estimation, i.e. methods that rely on a single input image, with particular attention to those that cast it as a learning task.

2.1 Supervised Monocular Depth Estimation

Monocular depth estimation is an ill-posed problem in which only one image is available at inference time, so it is not possible to rely on geometric constraints. For this reason, it has been a deeply studied task in recent years.
Saxena et al.14 proposed a model called Make3D that leverages over-segmentation of the input image to divide it into patches and then estimates the 3D location and orientation of the local planes that compose the scene. The planes' parameters are predicted with a linear model and then combined by means of an MRF model. This method, however, tends to fail with thin structures and does not consider the global structure of the scene. Liu et al.2 tackled monocular depth estimation by training a CNN, while Ladicky et al.15 leveraged semantic information to improve the results. Karsch et al.16 relied on a database of images with the corresponding depth to obtain more consistent image-level depth estimates: they cast the problem as an image-matching problem and then used temporal information to refine the result. Eigen et al.1 proposed a method based on a multi-scale CNN trained in a supervised manner on images and the corresponding depth maps. Differently from most of the works described so far, this approach infers depth directly from the raw pixel values without relying on any segmentation or hand-crafted features. Building on Deep3D,17 Luo et al.18 modeled monocular depth estimation as a stereo matching problem, synthesizing the right view. Kumar et al.19 leveraged the ability of recurrent neural networks (RNNs) to learn spatio-temporal information to predict monocular depth from video sequences. Atapour-Abarghouei et al.20 tackled the problem from a real-application point of view: they formulated monocular depth estimation as an image-to-image translation problem and, leveraging adversarial training on synthetic data, proposed a network able to generalize enough to be domain independent. Ummenhofer et al.21 proposed DeMoN, a model based on a chain of autoencoders that estimates both depth and ego-motion from a sequence of monocular images. Lastly, Chen et al.22 trained a model that jointly estimates depth and semantic segmentation, while ViP-DeepLab23 estimates depth jointly with panoptic segmentation. All these methods need a large number of images with the respective ground truth in the training set to learn to estimate depth maps from a monocular image. The creation of such datasets is not trivial; therefore this and other works prefer to explore other possibilities.

2.2 Self-supervised Monocular Depth Estimation

DeepStereo, proposed by Flynn et al.,24 is trained in an unsupervised way on images acquired from different points of view to synthesize new views. This model relies on depth estimation to sample colors from the neighboring images. Nevertheless, since the architecture needs multiple images also at inference time, it is not suitable for monocular depth estimation. Deep3D17 also tackles the novel-view synthesis task, in a binocular setup: it aims to generate the right view given the left one of the same scene in the context of 3D movies. Its method leverages an image reconstruction loss to produce a distribution over all possible disparities for each pixel. Zhan et al.25 proposed an autoencoder framework for monocular depth estimation in a stereo setup leveraging the reconstruction loss. Their image synthesis is not fully differentiable, so they linearized their loss with a Taylor approximation, which makes the training more difficult. Godard et al. proposed Monodepth,4 which overcomes this problem using bilinear sampling.
They proposed a framework that, at training time, estimates depth from both the left and the right image of a stereo pair in order to exploit left-right consistency as a training constraint. Inspired by this work, Aleotti et al.3 used an adversarial framework to tackle monocular depth estimation in a stereo setup, while Poggi et al.26 developed 3Net, a thin architecture that faces the same task on CPU by leveraging a trinocular assumption. Zhou et al.27 proposed a framework trained on monocular videos to synthesize depth maps without the stereo constraint. At training time, it estimates depth by minimizing the reconstruction loss between temporally subsequent frames, using a pose network to estimate the relative pose between them. With a similar aim, Mahjourian et al.28 introduced an ICP-based loss to jointly estimate depth and ego-motion from unconstrained monocular videos, being the first depth-from-video algorithm to use 3D information in a loss function. Later, Godard et al. extended their previous work6 by proposing an architecture suitable for both stereo-pair and monocular-video training; in this work they focused on the artifacts mainly due to occlusions and moving objects.

3. PROPOSED METHOD

In this section we briefly introduce the concept of knowledge distillation,29 in particular the vanilla knowledge distillation used in this work. Then we review Monodepth2,6 which we use as teacher model, and the DeepLabv3+7 architecture, which we choose as student model to enable monocular depth estimation on mobile systems.
Figure 1: Proposed knowledge-distillation method for monocular depth estimation. The method uses the output of a pre-trained network as training signal. In particular, we use a pre-trained Monodepth26 model as teacher network and train a student network (DeepLabv3+) from scratch, relying on a simple L1 loss between the output of the teacher network and that of the student network, backpropagated to train the student.

3.1 Knowledge Distillation

Knowledge distillation is one of the most widespread forms of model compression and acceleration. It is based on the idea of training a small student model from a large teacher model. This technique is particularly helpful when deep models have to run on devices with limited resources, e.g. mobile phones or embedded systems. Gou et al.29 classify knowledge distillation algorithms from the perspective of knowledge categories, training schemes, teacher-student architectures, and distillation algorithms. An exhaustive description of all the possible knowledge distillation schemes is outside the scope of this paper.

In our work we use an offline distillation scheme, in which the knowledge is transferred from a pre-trained teacher model to a student model. In this approach the training process is divided into two stages:

• Teacher model training: the teacher model is trained on a set of training data before distillation;

• Student model training: the student model is trained from scratch using as supervision signal the knowledge extracted from the teacher model. This knowledge can be in the form of intermediate features or of the model output.

As supervision signal for the student model training we use only the output image of the teacher model, comparing it with the output of the student model. More precisely, during training we minimize an L1 loss between the teacher and the student outputs. Let F_t be the teacher model, F_s the student model, and i the input image; the L1 loss is defined as:

L_1 = \frac{1}{N} \sum_{p=1}^{N} \left| F_t(i)_p - F_s(i)_p \right|     (1)

where p is the pixel index, N is the total number of pixels in the image, and F_t(i)_p and F_s(i)_p are the p-th pixel of the output image of the teacher and of the student model, respectively.
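As an illustration (not the exact training code used in our experiments), the following PyTorch sketch shows one epoch of this offline distillation step. The `teacher`, `student`, `loader`, and `optimizer` objects are placeholders for a frozen Monodepth2 model, a DeepLabv3+ student with a single-channel disparity head, a KITTI image loader, and any standard optimizer.

```python
import torch
import torch.nn.functional as F

def distill_epoch(teacher, student, loader, optimizer, device="cuda"):
    """One epoch of offline knowledge distillation with a per-pixel L1 loss.

    `teacher` is a frozen, pre-trained depth network (e.g. Monodepth2);
    `student` is the lightweight network being trained (e.g. DeepLabv3+
    with a 1-channel output head); `loader` yields batches of RGB images.
    """
    teacher.eval()
    student.train()
    for images in loader:
        images = images.to(device)
        with torch.no_grad():              # the teacher is only used for inference
            target = teacher(images)       # pseudo ground-truth disparity map
        pred = student(images)             # student prediction at the same resolution
        loss = F.l1_loss(pred, target)     # Eq. (1): mean absolute per-pixel error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```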
3.2 The Teacher Network

As stated at the beginning of this section, in our knowledge distillation pipeline we adopt an offline approach that uses a pre-trained Monodepth26 model as teacher. This model was presented by Godard et al.6 and extends their previous work.4 It uses a convolutional neural network in a self-supervised setup to tackle monocular depth estimation with both stereo and monocular training. The approach casts the learning problem as a novel view-synthesis problem, in which the network learns to reconstruct the appearance of a target image that represents the same scene as the input image seen from another point of view; in other words, it treats monocular depth estimation as the minimization of a photometric reprojection error. Let T_{t \to t'} be the relative pose of each source image I_{t'} with respect to the pose of the target image I_t; the model predicts the depth map D_t that minimizes the reprojection error L_p, defined as:

L_p = \min_{t'} \, pe(I_t, I_{t \to t'})     (2)

where

I_{t \to t'} = I_{t'} \langle \mathrm{proj}(D_t, T_{t \to t'}, K) \rangle     (3)

Here pe is the photometric reconstruction error, proj() is the projection of the depth map D_t into the point of view of I_{t'}, the \langle \cdot \rangle operator indicates bilinear sampling,30 and K is the camera intrinsics matrix. For the photometric reconstruction error they combine L1 and SSIM;31 more precisely, pe is defined as:

pe(I_a, I_b) = \frac{\alpha}{2} (1 - \mathrm{SSIM}(I_a, I_b)) + (1 - \alpha) \| I_a - I_b \|_1     (4)

with \alpha = 0.85. They also use a smoothness loss L_s, defined as:

L_s = |\partial_x d_t^*| \, e^{-|\partial_x I_t|} + |\partial_y d_t^*| \, e^{-|\partial_y I_t|}     (5)

where d_t^* = d_t / \bar{d}_t is the mean-normalized inverse depth.32 Finally, the overall objective function is:

L = \mu L_p + \lambda L_s     (6)

For stereo training, I_{t'} and I_t are the two images of the stereo pair, whose relative pose is known. For the monocular setup, the two frames temporally adjacent to I_t are used as source images; in this case the relative poses are not known, so a pose estimation network predicts the relative pose between frames, which is then used in the projection function. The pose and depth estimation networks are trained simultaneously. For the mixed training setup, I_{t'} includes both the temporally adjacent frames and the second view of the stereo pair.
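For completeness, the appearance and smoothness terms of Eqs. (4) and (5) can be sketched in PyTorch as follows. This is an illustrative approximation of the losses used by the teacher, Monodepth2 (for instance, the SSIM window handling is simplified here); it is not part of our distillation stage, which only uses the L1 loss of Eq. (1).

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM computed with 3x3 average pooling."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(a, b, alpha=0.85):
    """Eq. (4): weighted combination of SSIM and L1 between two images."""
    ssim_term = (1 - ssim(a, b)).mean(1, keepdim=True) / 2
    l1_term = (a - b).abs().mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1_term

def smoothness(disp, img):
    """Eq. (5): edge-aware smoothness on the mean-normalized inverse depth."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    wx = torch.exp(-(img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()
```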
3.3 The Student Network

As said before, the aim of this work is to propose a model that can tackle the monocular depth estimation problem with limited computational resources. To this end, we choose as student network the DeepLabv3+7 architecture, which to the best of our knowledge is the state of the art among network architectures developed for mobile applications. This network was proposed for semantic segmentation, but since it is based on an encoder-decoder structure it can be adapted to our goal. DeepLabv3+ is an extension of DeepLabv333 in which a simple decoder is added to refine the results, especially along object boundaries. The main idea behind DeepLabv3 is the Atrous Spatial Pyramid Pooling (ASPP) module, composed of several parallel atrous convolutions with different rates. Atrous convolutions are a generalization of the standard convolution that allows adjusting the field of view of the filters to capture multi-scale information. Let x and y be the input and output feature maps, respectively, and w the convolutional kernel; the atrous convolution at every position i is computed as:

y[i] = \sum_k x[i + r \cdot k] \, w[k]     (7)

where r is the rate, i.e. the stride with which the input feature map is sampled.
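In PyTorch, such an atrous convolution is exposed through the dilation argument of nn.Conv2d. The short sketch below (with an arbitrary rate and channel count, chosen only for illustration) shows the correspondence with Eq. (7).

```python
import torch
import torch.nn as nn

# A 3x3 atrous (dilated) convolution with rate r implements Eq. (7):
# the kernel taps are spaced r pixels apart, enlarging the field of view
# without increasing the number of parameters.
r = 6                                      # example dilation rate, as used in ASPP-like modules
atrous = nn.Conv2d(256, 256, kernel_size=3, padding=r, dilation=r)

x = torch.randn(1, 256, 40, 128)           # dummy encoder feature map
y = atrous(x)                              # same spatial size thanks to padding = r
```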
DeepLabv3 applies the ASPP module after a backbone deep convolutional neural network to extract features at multiple scales through several atrous convolutions with different rates. In DeepLabv3+ the last feature map before the logits of DeepLabv3, which contains 256 channels, is used as encoder output. The decoder is very simple but designed to recover object details: the encoder output is first upsampled by a factor of 4 and then concatenated with low-level features of the same resolution taken from the backbone network. Before the concatenation, a point-wise convolution is applied to the low-level features to reduce their number of channels. After the concatenation, a few 3x3 convolutions are applied, followed by a final 4x bilinear upsampling. This architecture combines the advantages of ASPP and of the encoder-decoder design for image-to-image translation with limited resources.

4. EXPERIMENTS

Here we compare the performance and the execution time of our approach with the Monodepth26 baseline. We train our models using only the teacher supervision, without any ground truth, and we train and test them on the KITTI 20158 dataset. For the evaluation we compare different versions of our student model with the Monodepth26 model, focusing on inference times.

4.1 KITTI Dataset

As in the work of Godard et al.,4 we present our results on the KITTI dataset8 using two different splits. This dataset contains 61 scenes with a total of 42,382 rectified stereo pairs of 1242x375 pixels. The two splits are the Eigen split1 and the Eigen Zhou split.27 The Eigen split1 uses a test set of 697 images from 29 different scenes; the remaining 32 scenes are divided into 22,600 stereo images for training and 888 for evaluation. In this split, the ground truth is generated by reprojecting the LIDAR points onto the left RGB camera. The Eigen Zhou split27 is suited for monocular training and removes the static frames from the dataset; it contains a total of 39,810 monocular images for training and 4,424 for validation. In all our experiments we use the same intrinsics for all images, setting the camera principal point to the image center and the focal length to the average of all the focal lengths of the KITTI dataset. Furthermore, we cap the maximum depth at 80 m, as is standard practice. In all the tests, we evaluate our models on the Eigen split.

4.2 Results

In all our experiments we use as teacher network the pre-trained Monodepth26 model trained with the mixed monocular and stereo approach (MS), since it reaches the best results on KITTI. We use an image resolution of 1024x320, obtained by resizing the input images. Results are computed using the metrics proposed by Eigen et al.1 This protocol relies on metrics such as the absolute relative difference (Abs Rel), the squared relative difference (Sq Rel), the RMSE, and the RMSE (log), which measure the difference in meters with respect to the ground-truth depth, and on accuracy metrics based on the percentage of predicted depths that lie within a given threshold of the ground-truth value. The latter are useful because the non-thresholded metrics can be sensitive to the large depth errors caused by predictions at small disparity values. Table 1 reports all the results obtained with this setup, compared with our baseline, which is represented by our teacher network.
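For clarity, a minimal NumPy sketch of these standard metrics is shown below; it assumes `gt` and `pred` are flattened arrays of valid ground-truth and predicted depths (in meters), already cropped and capped at 80 m.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics (Eigen et al.) on flattened valid depths."""
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```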
The quantitative results show that, using our knowledge distillation approach with DeepLabv3+7 and a ResNet10134 backbone trained on the Eigen split, we reach results very similar to our baseline and even surpass it on some metrics. Furthermore, as expected, with a MobileNet35 backbone we obtain worse results than with the other architectures (because this network uses fewer parameters), but they still remain quite good.

Table 1: Comparison of different models. Results on the KITTI 2015 stereo dataset with two different splits, Eigen1 and Eigen Zhou.27 For the first four metrics, lower is better; for the last three, higher is better.

Architecture | Split | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25^2 | δ < 1.25^3
DeepLabv3+ with MobileNet | Zhou | 0.108 | 0.746 | 4.657 | 0.193 | 0.865 | 0.957 | 0.981
DeepLabv3+ with ResNet50 | Zhou | 0.121 | 0.975 | 5.051 | 0.216 | 0.841 | 0.937 | 0.967
DeepLabv3+ with ResNet101 | Zhou | 0.109 | 0.869 | 4.649 | 0.250 | 0.873 | 0.958 | 0.980
DeepLabv3+ with MobileNet | Eigen | 0.110 | 0.871 | 5.139 | 0.333 | 0.863 | 0.955 | 0.980
DeepLabv3+ with ResNet50 | Eigen | 0.107 | 0.771 | 4.666 | 0.199 | 0.865 | 0.953 | 0.978
DeepLabv3+ with ResNet101 | Eigen | 0.104 | 0.752 | 4.552 | 0.193 | 0.875 | 0.958 | 0.980
Monodepth2 (baseline) | Eigen | 0.105 | 0.790 | 4.608 | 0.193 | 0.876 | 0.958 | 0.680

Table 2 reports the inference times: for every architecture presented above, it shows the number of trainable parameters and the inference time on different hardware platforms. DeepLabv3+ with a MobileNet backbone is the network with the lowest number of parameters, about 6 times fewer than the Monodepth26 architecture (our baseline). For the experiments on general-purpose hardware we used a PC with an Intel i7-5720K CPU (6 cores at 3.30 GHz) and an Nvidia GTX 970 GPU. For the experiments on mobile, instead, we used a Xiaomi Poco X3 NFC, an Android 10 phone with a Qualcomm SM7150-AC Snapdragon 732G octa-core processor. On mobile we implemented a simple application using PyTorch Mobile36 and ran the model on the CPU with multi-threading. The reported times are expressed in seconds and are the average over 10 different executions. As shown in the table, DeepLabv3+ with MobileNet is the fastest in the single- and multi-CPU setups; on the contrary, in the GPU and mobile setups the fastest architecture is Monodepth2.6

Table 2: Inference time. The table shows the number of trainable parameters and the time (in seconds) necessary to run inference on a single image with each architecture. The time is calculated as the average of 10 consecutive executions.

Architecture | # of trainable parameters | Single CPU [s] | Multi CPU [s] | GPU [s] | Mobile [s]
DeepLabv3+ with MobileNet | 2,643,281 | 0.4231 | 0.1445 | 0.0161 | 1.429
DeepLabv3+ with ResNet50 | 26,609,233 | 1.3956 | 0.3881 | 0.0369 | 4.65
DeepLabv3+ with ResNet101 | 45,601,361 | 1.9663 | 0.5139 | 0.06051 | 9.425
Monodepth2 (baseline) | 14,842,236 | 0.7365 | 0.2611 | 0.0123 | 1.099
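The timing protocol on general-purpose hardware can be reproduced with a loop of the following form. This is only a sketch, assuming a PyTorch model and a random input tensor at the 1024x320 working resolution; on the phone, the measurement was performed inside the PyTorch Mobile application described above.

```python
import time
import torch

def average_inference_time(model, input_size=(1, 3, 320, 1024), runs=10, device="cpu"):
    """Average single-image inference time (seconds) over `runs` consecutive executions."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)   # 1024x320 input, as in our experiments
    with torch.no_grad():
        model(x)                                   # warm-up run, not timed
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()               # wait for all GPU kernels to finish
        elapsed = time.perf_counter() - start
    return elapsed / runs
```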
5. CONCLUSION

In this work we proposed a technique that addresses the monocular depth estimation problem with a self-supervised approach. We exploit an off-the-shelf pre-trained model as supervision signal in a very basic knowledge distillation pipeline. We adapted the DeepLabv3+ architecture to fit our task, trained different versions of this model on different splits of the KITTI dataset, and compared all the analyzed models in terms of accuracy and inference times. We conclude that it is possible to tackle the monocular depth estimation problem with the DeepLabv3+ architecture, reaching results comparable with our baseline. Unexpectedly, even though our model contains fewer parameters, it turns out to be slower than Monodepth26 in the mobile setup, because the latter lends itself better to mobile optimization. In future work it will be useful to study a more sophisticated knowledge distillation scheme between these two networks, leveraging connections between intermediate layers and introducing other loss functions typical of this domain, based on image appearance and geometric constraints. Furthermore, we would like to design a new custom model for monocular depth estimation able to run in real time on mobile hardware, and to test it on different indoor and outdoor datasets.

REFERENCES

[1] Eigen, D., Puhrsch, C., and Fergus, R., "Depth map prediction from a single image using a multi-scale deep network," arXiv preprint arXiv:1406.2283 (2014).
[2] Liu, F., Shen, C., Lin, G., and Reid, I., "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 2024–2039 (2015).
[3] Aleotti, F., Tosi, F., Poggi, M., and Mattoccia, S., "Generative adversarial networks for unsupervised monocular depth prediction," in [Proceedings of the European Conference on Computer Vision (ECCV) Workshops] (2018).
[4] Godard, C., Mac Aodha, O., and Brostow, G. J., "Unsupervised monocular depth estimation with left-right consistency," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 270–279 (2017).
[5] Howard, I. P., [Perceiving in Depth, Vol. 1: Basic Mechanisms] (2012).
[6] Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J., "Digging into self-supervised monocular depth estimation," in [Proceedings of the IEEE/CVF International Conference on Computer Vision], 3828–3838 (2019).
[7] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H., "Encoder-decoder with atrous separable convolution for semantic image segmentation," in [Proceedings of the European Conference on Computer Vision (ECCV)], 801–818 (2018).
[8] Menze, M., Heipke, C., and Geiger, A., "Joint 3d estimation of vehicles and scene flow," in [ISPRS Workshop on Image Sequence Analysis (ISA)] (2015).
[9] Scharstein, D. and Szeliski, R., "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision 47(1), 7–42 (2002).
[10] Furukawa, Y. and Hernández, C., "Multi-view stereo: A tutorial," Foundations and Trends® in Computer Graphics and Vision 9(1-2), 1–148 (2015).
[11] Ranftl, R., Vineet, V., Chen, Q., and Koltun, V., "Dense monocular depth estimation in complex dynamic scenes," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 4058–4066 (2016).
[12] Woodham, R. J., "Photometric method for determining surface orientation from multiple images," Optical Engineering 19(1), 191139 (1980).
[13] Abrams, A., Hawley, C., and Pless, R., "Heliometric stereo: Shape from sun position," in [European Conference on Computer Vision], 357–370, Springer (2012).
[14] Saxena, A., Sun, M., and Ng, A. Y., "Make3d: Learning 3d scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 824–840 (2008).
[15] Ladicky, L., Shi, J., and Pollefeys, M., "Pulling things out of perspective," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 89–96 (2014).
[16] Karsch, K., Liu, C., and Kang, S. B., "Depth transfer: Depth extraction from video using non-parametric sampling," IEEE Transactions on Pattern Analysis and Machine Intelligence 36(11), 2144–2158 (2014).
[17] Xie, J., Girshick, R., and Farhadi, A., "Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks," in [European Conference on Computer Vision], 842–857, Springer (2016).
[18] Luo, W., Schwing, A. G., and Urtasun, R., "Efficient deep learning for stereo matching," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 5695–5703 (2016).
[19] CS Kumar, A., Bhandarkar, S. M., and Prasad, M., "Depthnet: A recurrent neural network architecture for monocular depth prediction," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops], 283–291 (2018).
[20] Atapour-Abarghouei, A. and Breckon, T. P., "Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 2800–2810 (2018).
[21] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., and Brox, T., "Demon: Depth and motion network for learning monocular stereo," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 5038–5047 (2017).
[22] Chen, Y., Li, W., Chen, X., and Gool, L. V., "Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach," in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], 1841–1850 (2019).
[23] Qiao, S., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C., "Vip-deeplab: Learning visual perception with depth-aware video panoptic segmentation," arXiv preprint arXiv:2012.05258 (2020).
[24] Flynn, J., Neulander, I., Philbin, J., and Snavely, N., "Deepstereo: Learning to predict new views from the world's imagery," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 5515–5524 (2016).
[25] Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., and Reid, I., "Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 340–349 (2018).
[26] Poggi, M., Tosi, F., and Mattoccia, S., "Learning monocular depth estimation with unsupervised trinocular assumptions," in [2018 International Conference on 3D Vision (3DV)], 324–333, IEEE (2018).
[27] Zhou, T., Brown, M., Snavely, N., and Lowe, D. G., "Unsupervised learning of depth and ego-motion from video," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 1851–1858 (2017).
[28] Mahjourian, R., Wicke, M., and Angelova, A., "Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 5667–5675 (2018).
[29] Gou, J., Yu, B., Maybank, S. J., and Tao, D., "Knowledge distillation: A survey," International Journal of Computer Vision 129(6), 1789–1819 (2021).
[30] Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K., "Spatial transformer networks," arXiv preprint arXiv:1506.02025 (2015).
[31] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing 13(4), 600–612 (2004).
[32] Wang, C., Buenaposada, J. M., Zhu, R., and Lucey, S., "Learning depth from monocular videos using direct methods," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 2022–2030 (2018).
[33] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H., "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587 (2017).
[34] He, K., Zhang, X., Ren, S., and Sun, J., "Deep residual learning for image recognition," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 770–778 (2016).
[35] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H., "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861 (2017).
[36] "PyTorch Mobile." https://pytorch.org/mobile/home/. (Accessed: 27 May 2021).