Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics

Arnav Varma*, Hemang Chawla*, Bahram Zonooz and Elahe Arani
Advanced Research Lab, NavInfo Europe, The Netherlands
{arnav.varma, hemang.chawla, elahe.arani}@navinfo.eu, bahram.zonooz@gmail.com
*Equal contribution

arXiv:2202.03131v1 [cs.CV] 7 Feb 2022

Keywords: Transformers, Convolutional Neural Networks, Monocular Depth Estimation, Camera Self-Calibration, Self-Supervised Learning

Abstract: The advent of autonomous driving and advanced driver assistance systems necessitates continuous developments in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, a method for pixel-wise distance estimation of objects from a single camera without the use of ground truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast with CNNs that use localized linear operations and lose feature resolution across the layers, vision transformers process at constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study exists that investigates the impact of using transformers for self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation. Thereafter, we compare the transformer and CNN-based architectures for their performance on KITTI depth prediction benchmarks, as well as their robustness to natural corruptions and adversarial attacks, including when the camera intrinsics are unknown. Our study demonstrates how transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust and generalizable.

1 INTRODUCTION

There have been rapid improvements in scene understanding for robotics and advanced driver assistance systems (ADAS) over the past years. This success is attributed to the use of Convolutional Neural Networks (CNNs) within a mostly encoder-decoder paradigm. Convolutions provide spatial locality and translation invariance, which has proved useful for image analysis tasks. The encoder, often a convolutional Residual Network (ResNet) (He et al., 2016), learns feature representations from the input and is followed by a decoder which aggregates these features and converts them into final predictions. However, the choice of architecture has a major impact on the performance and generalizability of the task.

While CNNs have been the preferred architecture in computer vision, transformers have also recently gained traction (Dosovitskiy et al., 2021), motivated by their success in natural language processing (Vaswani et al., 2017). Notably, they have also outperformed CNNs for object detection (Carion et al., 2020) and semantic segmentation (Zheng et al., 2021). This is also reflected in methods for monocular dense depth estimation, a pertinent task for autonomous planning and navigation, where supervised transformer-based methods (Li et al., 2020; Ranftl et al., 2021) have been proposed as an alternative to supervised CNN-based methods (Lee et al., 2019; Aich et al., 2021). However, supervised methods require extensive RGB-D ground truth collected from costly LiDARs or multi-camera rigs. Instead, self-supervised methods have increasingly utilized concepts of Structure from Motion (SfM) with known camera intrinsics to train monocular depth and ego-motion estimation networks simultaneously (Guizilini et al., 2020; Lyu et al., 2020; Chawla et al., 2021).
While transformer ingredients such as attention have been utilized for self-supervised depth estimation (Johnston and Carneiro, 2020), most methods are nevertheless limited to the use of CNNs that have localized linear operations and lose feature resolution during downsampling to increase their limited receptive field (Yang et al., 2021). On the other hand, transformers with fewer inductive biases allow for more globally coherent predictions, with different layers attending to local and global features simultaneously (Touvron et al., 2021). However, transformers require more training data and can be more computationally demanding (Caron et al., 2021). While multiple studies have compared transformers against CNNs for tasks such as image classification (Raghu et al., 2021; Bhojanapalli et al., 2021), no study exists that evaluates the impact of transformers in self-supervised monocular depth estimation, including when the camera intrinsics may be unknown.
Figure 1: An overview of Monocular Transformer Structure from Motion Learner (MT-SfMLearner) with learned intrinsics. We readapt modules from Dense Prediction Transformer (DPT) and Monodepth2 to be trained with appearance-based losses for self-supervised monocular depth, ego-motion, and intrinsics estimation.

In this work, we conduct a comparative study between CNN- and transformer-based architectures for self-supervised monocular depth estimation. Our contributions are as follows:
• We demonstrate how to adapt vision transformers for self-supervised monocular depth estimation by implementing a method called Monocular-Transformer SfMLearner (MT-SfMLearner).
• We compare MT-SfMLearner and CNNs for their performance on the KITTI monocular depth Eigen Zhou split (Eigen et al., 2014) and the online depth prediction benchmark (Geiger et al., 2013).
• We investigate the impact of architecture choices for the individual depth and ego-motion networks on performance as well as robustness to natural corruptions and adversarial attacks.
• We also introduce a modular method that simultaneously predicts camera focal lengths and principal point from the images themselves and can easily be utilized within both CNN- and transformer-based architectures.
• We study the accuracy of intrinsics estimation as well as its impact on the performance and robustness of depth estimation.
• Finally, we also compare the run-time computational and energy efficiency of the architectures for depth and intrinsics estimation.

MT-SfMLearner provides real-time depth estimates and illustrates how a transformer-based architecture, though lower in run-time efficiency, can achieve comparable performance to CNN-based architectures while being more robust under natural corruptions and adversarial attacks, even when the camera intrinsics are unknown. Thus, our work presents a way to analyze the trade-off between the performance, robustness, and efficiency of transformer- and CNN-based architectures for depth estimation.

2 RELATED WORKS
Recently, transformer architectures such as Vision Transformer (ViT) (Ranftl et al., 2021) and Data-efficient image Transformer (DeiT) (Touvron et al., 2021) have outperformed CNN architectures in image classification. Studies comparing ViT and CNN architectures like ResNet have further demonstrated that transformers are more robust to natural corruptions and adversarial examples in classification (Bhojanapalli et al., 2021; Paul and Chen, 2021). Motivated by their success, researchers have replaced CNN encoders with transformers in scene understanding tasks such as object detection (Carion et al., 2020; Liu et al., 2021), semantic segmentation (Zheng et al., 2021; Strudel et al., 2021), and supervised monocular depth estimation (Ranftl et al., 2020; Yang et al., 2021).

For supervised monocular depth estimation, Dense Prediction Transformer (DPT) (Ranftl et al., 2020) uses ViT as the encoder with a convolutional decoder and shows more coherent predictions than CNNs due to the global receptive field of transformers. TransDepth (Yang et al., 2021) additionally uses a ResNet projection layer and attention gates in the decoder to induce the spatial locality of CNNs for supervised monocular depth and surface-normal estimation. Lately, some works have incorporated elements of transformers such as self-attention (Vaswani et al., 2017) in self-supervised monocular depth estimation (Johnston and Carneiro, 2020; Xiang et al., 2021). However, there has been no investigation of transformers as a replacement for the traditional CNN-based methods (Godard et al., 2019; Lyu et al., 2020) for self-supervised monocular depth estimation.

Moreover, self-supervised monocular depth estimation still requires prior knowledge of the camera intrinsics (focal length and principal point) during training, which may be different for each data source, may change over time, or be unknown a priori (Chawla et al., 2020). While multiple approaches to supervised camera intrinsics estimation have been proposed (Lopez et al., 2019; Zhuang et al., 2019), not many self-supervised approaches exist (Gordon et al., 2019).

Therefore, we investigate the impact of transformer architectures on self-supervised monocular depth estimation for their performance, robustness, and run-time efficiency, even when the intrinsics are unknown.

3 METHOD

Our objective is to study the effect of utilizing vision transformers for self-supervised monocular depth estimation in contrast with the contemporary methods that utilize Convolutional Neural Networks.

Given a set of n images from a video sequence, we simultaneously train depth and ego-motion prediction networks. The inputs to the networks are temporally consecutive RGB image triplets {I_{-1}, I_0, I_1} ∈ R^{H×W×3}. The depth network learns the model f_D : R^{H×W×3} → R^{H×W} to output dense depth (or disparity) for each pixel coordinate p_t of a single image. Simultaneously, the ego-motion network learns the model f_E : R^{2×H×W×3} → R^6 to output the relative translation (t_x, t_y, t_z) and rotation (r_x, r_y, r_z) forming the affine transformation [R̂ T̂; 0 1] ∈ SE(3) between a pair of overlapping images. The predicted depth D̂ and ego-motion T̂ are linked together via the perspective projection model,

    p_s ∼ K R̂_{s←t} D̂_t(p_t) K⁻¹ p_t + K T̂_{s←t},    (1)

which warps the source images I_s ∈ {I_{-1}, I_1} to the target image I_t ∈ {I_0}, with the camera intrinsics described by K. This process is called view synthesis, as shown in Figure 1. We train the networks using an appearance-based photometric loss between the real and synthesized target images, as well as a smoothness loss on the depth predictions (Godard et al., 2019).
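To make the view-synthesis step concrete, the listing below is a minimal PyTorch sketch of Equation 1: it back-projects the target pixels with the predicted depth, applies the predicted relative pose, re-projects with the intrinsics, and bilinearly samples the source frame. The function name and tensor layout are our own illustration and not the authors' released code.

```python
import torch
import torch.nn.functional as F

def view_synthesis(src_img, tgt_depth, K, T_src_tgt):
    """Warp a source image into the target view (Eq. 1).

    src_img:   (B, 3, H, W) source frame I_s
    tgt_depth: (B, 1, H, W) predicted depth D_t of the target frame
    K:         (B, 3, 3) camera intrinsics
    T_src_tgt: (B, 4, 4) relative pose T_{s<-t} (target -> source)
    """
    B, _, H, W = tgt_depth.shape
    device = tgt_depth.device

    # Homogeneous pixel grid p_t of the target image: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D: D_t(p_t) * K^-1 p_t
    cam_points = tgt_depth.view(B, 1, -1) * (torch.inverse(K) @ pix)

    # Rigid transform into the source frame
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_points = (T_src_tgt @ cam_points)[:, :3]

    # Project with K and normalize to [-1, 1] for grid_sample
    proj = K @ src_points
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    x_n = 2 * uv[:, 0] / (W - 1) - 1
    y_n = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([x_n, y_n], dim=-1).view(B, H, W, 2)

    # Sample the source image at the projected coordinates
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)
```

The photometric loss is then computed between the real target image and the output of this warp.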
3.1 Architecture

Here we describe Monocular Transformer Structure from Motion Learner (MT-SfMLearner), our transformer-based method for self-supervised monocular depth estimation (Figure 1).

3.1.1 Depth Network

For the depth network, we readapt the Dense Prediction Transformer (DPT) (Ranftl et al., 2020) for self-supervised learning, with a DeiT-Base (Touvron et al., 2021) in the encoder. There are five components of the depth network:
• An Embed module, which is a part of the encoder, takes an image I ∈ R^{H×W×3} and converts non-overlapping image patches of size p × p into N_p = H·W/p² tokens t_i ∈ R^d for all i ∈ [1, 2, ..., N_p], where d = 768 for DeiT-Base. This is implemented as a large p × p convolution with stride s = p, where p = 16. The output from this module is concatenated with a readout token of the same size as the remaining tokens.
• The Transformer block, which is also a part of the encoder, consists of 12 transformer layers that process these tokens with multi-head self-attention (MHSA) (Vaswani et al., 2017) modules. MHSA processes inputs at constant resolution and can simultaneously attend to global and local features.
• Four Reassemble modules in the decoder, which are responsible for extracting image-like features from the 3rd, 6th, 9th, and 12th (final) transformer layers by dropping the readout token and concatenating the remaining tokens in 2D. This is followed by pointwise convolutions to change the number of channels, and transpose convolutions in the first two Reassemble modules to upsample the representations (corresponding to T3 and T6 in Figure 1). To make the transformer network comparable to its convolutional counterparts, we increase the number of channels in the pointwise convolutions of the last three Reassemble modules by a factor of 4 with respect to DPT. The exact architecture of the Reassemble modules can be found in Table 1; a code sketch of these operations follows Table 2.
• Four Fusion modules in the decoder, based on RefineNet (Lin et al., 2017). They progressively fuse information from the Reassemble modules with information passing through the decoder, and upsample the features by 2 at each stage. Unlike DPT, we enable batch normalization in the decoder, as it was found to be helpful for self-supervised depth prediction. We also reduce the number of channels in the Fusion blocks to 96 from 256 in DPT.
• Four Head modules at the end of each Fusion module to predict depth at 4 scales, following previous self-supervised methods (Godard et al., 2019). Unlike DPT, the Head modules use 2 convolutions instead of 3, as we found no difference in performance. For further details of the Head modules, refer to Table 2.

Table 1: Architecture details of the Reassemble modules. DN and EN refer to the depth and ego-motion networks, respectively. The subscripts of DN refer to the transformer layer from which the respective Reassemble module takes its input (see Figure 1). The input image size is H × W, p refers to the patch size, N_p = H·W/p² refers to the number of patches from the image, and d refers to the feature dimension of the transformer features.

Operation | Input size | Output size | Function | Parameters (DN_3, DN_6, DN_9, DN_12, EN)
Read | (N_p + 1) × d | N_p × d | Drop readout token | −
Concatenate | N_p × d | d × H/p × W/p | Transpose and unflatten | −
Pointwise Convolution | d × H/p × W/p | N_c × H/p × W/p | N_c channels | N_c = [96, 768, 1536, 3072, 2048]
Strided Convolution | N_c × H/p × W/p | N_c × H/2p × W/2p | k × k convolution, stride = 2, N_c channels, padding = 1 | k = [−, −, −, 3, −]
Transpose Convolution | N_c × H/p × W/p | N_c × H/s × W/s | p/s × p/s deconvolution, stride = p/s, N_c channels | s = [4, 2, −, −, −]

Table 2: Architecture details of the Head modules in Figure 1.

Layers
32 3 × 3 Convolutions, stride = 1, padding = 1
ReLU
Bilinear Interpolation to upsample by 2
32 Pointwise Convolutions
Sigmoid
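For concreteness, the listing below sketches one Reassemble module as specified in Table 1. It is an illustrative re-implementation under our own naming, not the released code, and it assumes the readout token is the first token in the sequence; `out_stride` (s) and `n_channels` (N_c) are chosen per module from Table 1.

```python
import torch
import torch.nn as nn

class Reassemble(nn.Module):
    """Token sequence -> image-like feature map, following Table 1 (illustrative sketch)."""
    def __init__(self, img_size=(192, 640), patch=16, d=768,
                 n_channels=96, out_stride=None, downsample=False):
        super().__init__()
        self.h, self.w = img_size[0] // patch, img_size[1] // patch
        self.project = nn.Conv2d(d, n_channels, kernel_size=1)   # pointwise convolution
        self.resample = nn.Identity()
        if out_stride is not None:
            # DN_3 / DN_6: p/s x p/s deconvolution with stride p/s (upsampling)
            f = patch // out_stride
            self.resample = nn.ConvTranspose2d(n_channels, n_channels, kernel_size=f, stride=f)
        elif downsample:
            # DN_12: 3x3 strided convolution (downsampling by 2)
            self.resample = nn.Conv2d(n_channels, n_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, tokens):                          # tokens: (B, N_p + 1, d)
        x = tokens[:, 1:]                               # Read: drop the readout token
        x = x.transpose(1, 2)                           # (B, d, N_p)
        x = x.reshape(-1, x.shape[1], self.h, self.w)   # Concatenate in 2D: (B, d, H/p, W/p)
        x = self.project(x)                             # (B, N_c, H/p, W/p)
        return self.resample(x)                         # up-/down-sample where applicable
```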
3.1.2 Ego-Motion Network

For the ego-motion network, we adapt DeiT-Base (Touvron et al., 2021) in the encoder. Since the input to the transformer for the ego-motion network consists of two images concatenated along the channel dimension, we repeat the embedding layer accordingly. We use a Reassemble module to pass the transformer tokens to the decoder; for details on the structure of this Reassemble module, refer to Table 1. We adopt the decoder for the ego-motion network from Monodepth2 (Godard et al., 2019).

When both the depth and ego-motion networks use transformers as described above, we refer to the resulting architecture as Monocular Transformer Structure from Motion Learner (MT-SfMLearner).

3.2 Appearance-based Losses

Following contemporary self-supervised monocular depth estimation methods, we adopt the appearance-based losses and the auto-masking procedure from the CNN-based Monodepth2 (Godard et al., 2019) for the above-described transformer-based architecture as well. We employ a photometric reprojection loss composed of the pixel-wise ℓ1 distance and the Structural Similarity (SSIM) between the real and synthesized target images, along with a multi-scale edge-aware smoothness loss on the depth predictions. We also use auto-masking to disregard the temporally stationary pixels in the image triplets. Furthermore, we reduce texture-copy artifacts by calculating the total loss after upscaling the depths, predicted at 4 scales from intermediate decoder layers, to the input resolution.
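As a reference, the following is a minimal sketch of the per-pixel photometric term used in Monodepth2-style training. The SSIM implementation is the simplified 3 × 3 average-pooling variant popularized by that line of work, and the weighting α = 0.85 is the value commonly used there; both are stated here as assumptions rather than numbers taken from this paper.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM dissimilarity in [0, 1] with 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(synth, target, alpha=0.85):
    """Per-pixel reprojection error: weighted SSIM dissimilarity + L1, shape (B, 1, H, W)."""
    l1 = (synth - target).abs().mean(1, keepdim=True)
    return alpha * ssim(synth, target).mean(1, keepdim=True) + (1 - alpha) * l1
```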
3.3 Intrinsics

Accurate camera intrinsics, given by

    K = [ f_x  0    c_x
          0    f_y  c_y
          0    0    1  ],    (2)

are essential to self-supervised depth estimation, as can be seen from Equation 1. However, the intrinsics may vary within a dataset with videos collected from different camera setups, or over a long period of time. These parameters can also be unknown for crowdsourced datasets.

We address this by introducing an intrinsics estimation module. We modify the ego-motion network, which takes a pair of consecutive images as input, so that it learns to estimate the focal length and principal point along with the translation and rotation. Specifically, we add a convolutional path in the ego-motion decoder to learn the intrinsics. The decoder features before activation from the penultimate layer are passed through a global average pooling layer, followed by two branches of pointwise convolutions to reduce the number of channels from 256 to 2. One branch uses a softplus activation to estimate the focal lengths along the x and y axes, as the focal length is always positive. The other branch does not use any activation to estimate the principal point, as it has no such constraint. Note that the ego-motion decoder is the same for both the convolutional and the transformer architectures. Consequently, the intrinsics estimation method can be modularly utilized with both architectures. Figure 1 demonstrates MT-SfMLearner with learned intrinsics.
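A minimal sketch of such an intrinsics head is given below, under our own naming, with the 256-channel decoder feature width taken from the text. Whether the predicted values are in pixels or normalized image coordinates is a convention choice not specified here, so the assembly into K is only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicsHead(nn.Module):
    """Predict (f_x, f_y) and (c_x, c_y) from penultimate ego-motion decoder features."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.focal = nn.Conv2d(in_channels, 2, kernel_size=1)    # pointwise conv branch
        self.offset = nn.Conv2d(in_channels, 2, kernel_size=1)   # pointwise conv branch

    def forward(self, feats):
        # Global average pooling over the spatial dimensions: (B, C, 1, 1)
        pooled = feats.mean(dim=(2, 3), keepdim=True)
        f = F.softplus(self.focal(pooled)).flatten(1)   # (B, 2): focal lengths, positive
        c = self.offset(pooled).flatten(1)              # (B, 2): principal point, unconstrained
        # Assemble K as in Eq. 2
        zeros, ones = torch.zeros_like(f[:, 0]), torch.ones_like(f[:, 0])
        K = torch.stack([f[:, 0], zeros, c[:, 0],
                         zeros, f[:, 1], c[:, 1],
                         zeros, zeros, ones], dim=1).view(-1, 3, 3)
        return K
```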
Fi- methods that also utilize a heavy encoder such as nally, we analyze the correctness and run-time effi- ResNet-101 (Johnston and Carneiro, 2020) and Pack- ciency of intrinsics predictions, and also study its im- Net (Guizilini et al., 2020). pact on the accuracy and robustness of depth estima- tion. Online Benchmark: We also measure the per- 4.1 Implementation Details formance of MT-SfMLearner on the KITTI Online Benchmark for depth prediction1 using the metrics from (Uhrig et al., 2017). We train on an image size 4.1.1 Dataset of 1024 × 320, and add the G2S loss (Chawla et al., 2021) for obtaining predictions at metric scale. Re- We report all results on the Eigen Split (Eigen et al., sults ordered by their rank are shown in Table 4. The 2014) of KITTI (Geiger et al., 2013) dataset after re- performance of MT-SfMLearner is on par with state- moving the static frames as per (Zhou et al., 2017), of-the-art self-supervised methods, and outperforms unless stated otherwise. This split contains 39, 810 training, 4424 validation, and 697 test images, respec- 1 http://www.cvlibs.net/datasets/kitti/eval depth.php? tively. All results are reported on the per-image scaled benchmark=depth prediction. See under MT-SfMLearner.
4.2 Depth Estimation Performance

First, we compare MT-SfMLearner, where both the depth and ego-motion networks are transformer-based, with the existing fully convolutional neural networks for their accuracy on self-supervised monocular depth estimation. Their performance is evaluated using the metrics from (Eigen et al., 2014) up to a fixed range of 80 m, as shown in Table 3. We compare the methods at different input image sizes clustered into the categories of Low-Resolution (LR), Medium-Resolution (MR), and High-Resolution (HR). We do not compare against methods that use ground-truth semantic labels during training. All methods assume known ground-truth camera intrinsics.

Table 3: Quantitative results comparing MT-SfMLearner with existing methods on the KITTI Eigen split. For each category of image sizes, the best results are displayed in bold, and the second best results are underlined.

Methods | Resolution | Abs Rel↓ | Sq Rel↓ | RMSE↓ | RMSE log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑

LR:
SfMLearner (Zhou et al., 2017) | 416×128 | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
GeoNet (Yin and Shi, 2018) | 416×128 | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973
Vid2Depth (Mahjourian et al., 2018) | 416×128 | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968
Struct2Depth (Casser et al., 2019) | 416×128 | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979
Roussel et al. (Roussel et al., 2019) | 416×128 | 0.179 | 1.545 | 6.765 | 0.268 | 0.754 | 0.916 | 0.966
VITW (Gordon et al., 2019) | 416×128 | 0.129 | 0.982 | 5.230 | 0.213 | 0.840 | 0.945 | 0.976
Monodepth2 (Godard et al., 2019) | 416×128 | 0.128 | 1.087 | 5.171 | 0.204 | 0.855 | 0.953 | 0.978
MT-SfMLearner (Ours) | 416×128 | 0.125 | 0.905 | 5.096 | 0.203 | 0.851 | 0.952 | 0.980

MR:
CC (Ranjan et al., 2019) | 832×256 | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975
SC-SfMLearner (Bian et al., 2019) | 832×256 | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975
Monodepth2 (Godard et al., 2019) | 640×192 | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981
SG Depth (Klingner et al., 2020) | 640×192 | 0.117 | 0.907 | 4.844 | 0.194 | 0.875 | 0.958 | 0.980
PackNet-SfM (Guizilini et al., 2020) | 640×192 | 0.111 | 0.829 | 4.788 | 0.199 | 0.864 | 0.954 | 0.980
Poggi et al. (Poggi et al., 2020) | 640×192 | 0.111 | 0.863 | 4.756 | 0.188 | 0.881 | 0.961 | 0.982
Johnston & Carneiro (Johnston and Carneiro, 2020) | 640×192 | 0.106 | 0.861 | 4.699 | 0.185 | 0.889 | 0.962 | 0.982
HR-Depth (Lyu et al., 2020) | 640×192 | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983
MT-SfMLearner (Ours) | 640×192 | 0.112 | 0.838 | 4.771 | 0.188 | 0.879 | 0.960 | 0.982

HR:
PackNet-SfM (Guizilini et al., 2020) | 1280×384 | 0.107 | 0.803 | 4.566 | 0.197 | 0.876 | 0.957 | 0.979
HR-Depth (Lyu et al., 2020) | 1024×320 | 0.106 | 0.755 | 4.472 | 0.181 | 0.892 | 0.966 | 0.984
G2S (Chawla et al., 2021) | 1024×384 | 0.109 | 0.844 | 4.774 | 0.194 | 0.869 | 0.958 | 0.981
MT-SfMLearner (Ours) | 1024×320 | 0.104 | 0.799 | 4.547 | 0.181 | 0.893 | 0.963 | 0.982

We observe that MT-SfMLearner is able to achieve comparable performance at all resolutions under the error as well as accuracy metrics. This includes methods that also utilize a heavy encoder, such as ResNet-101 (Johnston and Carneiro, 2020) and PackNet (Guizilini et al., 2020).

Online Benchmark: We also measure the performance of MT-SfMLearner on the KITTI Online Benchmark for depth prediction¹ using the metrics from (Uhrig et al., 2017). We train on an image size of 1024 × 320, and add the G2S loss (Chawla et al., 2021) for obtaining predictions at metric scale. Results ordered by their rank are shown in Table 4. The performance of MT-SfMLearner is on par with state-of-the-art self-supervised methods, and outperforms several supervised methods. This further confirms that the transformer-based method can achieve comparable performance to the convolutional neural networks for self-supervised depth estimation.

Table 4: Quantitative comparison of unscaled dense depth prediction on the KITTI Depth Prediction Benchmark (online server). Supervised training with ground truth depths is denoted by D. Use of monocular sequences or stereo pairs is represented by M and S, respectively. Seg represents additional supervised semantic segmentation training. The use of GPS for multi-modal self-supervision is denoted by G.

Method | Train | SILog↓ | SqErrRel↓
DORN (Fu et al., 2018) | D | 11.77 | 2.23
SORD (Diaz and Marathe, 2019) | D | 12.39 | 2.49
VNL (Yin et al., 2019) | D | 12.65 | 2.46
DS-SIDENet (Ren et al., 2019) | D | 12.86 | 2.87
PAP (Zhang et al., 2019) | D | 13.08 | 2.72
Guo et al. (Guo et al., 2018) | D+S | 13.41 | 2.86
G2S (Chawla et al., 2021) | M+G | 14.16 | 3.65
Ours | M+G | 14.25 | 3.72
Monodepth2 (Godard et al., 2019) | M+S | 14.41 | 3.67
DABC (Li et al., 2018b) | D | 14.49 | 4.08
SDNet (Ochs et al., 2019) | D+Seg | 14.68 | 3.90
APMoE (Kong and Fowlkes, 2019) | D | 14.74 | 3.88
CSWS (Li et al., 2018a) | D | 14.85 | 3.48
HBC (Jiang and Huang, 2019) | D | 15.18 | 3.79
SGDepth (Klingner et al., 2020) | M+Seg | 15.30 | 5.00
DHGRL (Zhang et al., 2018) | D | 15.47 | 4.04
PackNet-SfM (Guizilini et al., 2020) | M+V | 15.80 | 4.73
MultiDepth (Liebel and Körner, 2019) | D | 16.05 | 3.89
LSIM (Goldman et al., 2019) | S | 17.92 | 6.88
Monodepth (Godard et al., 2017) | S | 22.02 | 20.58

¹ http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction. See under MT-SfMLearner.
4.3 Contrastive Study

We saw in the previous section that MT-SfMLearner performs competently on the independent and identically distributed (i.i.d.) test set with respect to the state-of-the-art. However, networks that perform well on an i.i.d. test set may still learn shortcut features that are non-robust and generalize poorly to out-of-distribution (o.o.d.) datasets (Geirhos et al., 2020). Since self-supervised monocular depth estimation networks concurrently train an ego-motion network (see Equation 1), we investigate the impact of each network's architecture on depth estimation.

We consider both Convolutional (C) and Transformer (T) networks for depth and ego-motion estimation. The resulting four combinations of (Depth Network, Ego-Motion Network) architectures are (C, C), (C, T), (T, C), and (T, T), ordered on the basis of the increasing influence of transformers on depth estimation. To compare our transformer-based method fairly with convolutional networks, we utilize Monodepth2 (Godard et al., 2019) with a ResNet-101 (He et al., 2016) encoder. All four combinations are trained thrice using the settings described in Section 4.1 for an image size of 640 × 192. All combinations assume known ground-truth camera intrinsics.

4.3.1 Performance

For the four combinations, we report the best performance on the i.i.d. test set in Table 5, and visualize the corresponding depth predictions in Figure 2. The i.i.d. test set used for comparison is the same as in Section 4.2.

We observe from Table 5 that the combination of transformer-based depth and ego-motion networks, i.e. MT-SfMLearner, performs best under two of the error metrics as well as two of the accuracy metrics. The remaining combinations perform comparably on all the metrics.
Table 5: Quantitative results on the KITTI Eigen split for all four architecture combinations of depth and ego-motion networks. The best results are displayed in bold, the second best are underlined.

Architecture | Abs Rel↓ | Sq Rel↓ | RMSE↓ | RMSE log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
C, C | 0.111 | 0.897 | 4.865 | 0.193 | 0.881 | 0.959 | 0.980
C, T | 0.113 | 0.874 | 4.813 | 0.192 | 0.880 | 0.960 | 0.981
T, C | 0.112 | 0.843 | 4.766 | 0.189 | 0.879 | 0.960 | 0.982
T, T | 0.112 | 0.838 | 4.771 | 0.188 | 0.879 | 0.960 | 0.982

Figure 2: Disparity maps on KITTI for qualitative comparison of all four architecture combinations of depth and ego-motion networks. Example areas where the global receptive field of the transformer is advantageous are highlighted in green. Example areas where the local receptive field of CNNs is advantageous are highlighted in white.

From Figure 2, we observe more uniform estimates for larger objects like vehicles, vegetation, and buildings when depth is learned using transformers. Transformers are also less affected by reflections from the windows of vehicles and buildings. This is likely because of the larger receptive fields of the self-attention layers, which lead to more globally coherent predictions. On the other hand, convolutional networks produce sharper boundaries and perform better on thinner objects such as traffic signs and poles. This is likely because of the inherent inductive bias for spatial locality present in convolutional layers.

4.3.2 Robustness

While the different combinations perform comparably on the i.i.d. dataset, they may differ in robustness and generalizability. Therefore, we study the impact of natural corruptions and adversarial attacks on the depth performance using the following:
• Natural corruptions. Following (Hendrycks and Dietterich, 2019) and (Michaelis et al., 2019), we generate 15 corrupted versions of the KITTI i.i.d. test set at the highest severity (= 5). These natural corruptions fall under 4 categories: noise (Gaussian, shot, impulse), blur (defocus, glass, motion, zoom), weather (snow, frost, fog, brightness), and digital (contrast, elastic, pixelate, JPEG).
• Projected Gradient Descent (PGD) adversarial examples. Adversarial attacks make imperceptible (to humans) changes to input images to create adversarial examples that fool networks. We generate adversarial examples from the i.i.d. test set using PGD (Madry et al., 2018) at attack strengths ε ∈ {0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0}. The gradients are calculated with respect to the training loss. Following (Kurakin et al., 2016), the perturbation is accumulated over min(ε + 4, ⌈1.25·ε⌉) iterations with a step size of 1 (see the sketch after this list). When the test image is from the beginning or end of a KITTI sequence, the appearance-based loss is only calculated for the feasible pair of images.
• Symmetrically flipped adversarial examples. Inspired by (Wong et al., 2020), we generate these adversarial examples to fool the networks into predicting flipped estimates. For this, we use the gradients of the RMSE loss, where the targets are symmetrical horizontal and vertical flips of the i.i.d. predictions. This evaluation is conducted at attack strengths ε ∈ {1.0, 2.0, 4.0}, similar to the PGD attack described above.

We report the mean RMSE across three training runs on natural corruptions, PGD adversarial examples, and symmetrically flipped adversarial examples in Figures 3, 4, and 5, respectively.
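For reference, a minimal sketch of the PGD loop described above is given below, following the Kurakin et al. schedule for the number of iterations. The `loss_fn` argument stands in for the self-supervised training loss and is an assumption about the interface, not the authors' implementation; pixel intensities are assumed to be in the 0-255 range, consistent with a step size of 1.

```python
import math
import torch

def pgd_attack(model, images, loss_fn, epsilon):
    """L_inf PGD: step size 1, min(eps + 4, ceil(1.25 * eps)) iterations."""
    n_iters = int(min(epsilon + 4, math.ceil(1.25 * epsilon)))
    adv = images.clone().detach()
    for _ in range(n_iters):
        adv.requires_grad_(True)
        loss = loss_fn(model, adv)                 # e.g. the appearance-based training loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + 1.0 * grad.sign()                              # one intensity step
            adv = images + (adv - images).clamp(-epsilon, epsilon)     # project to the eps-ball
            adv = adv.clamp(0.0, 255.0)                                # keep valid pixel range
    return adv.detach()
```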
Figure 3: RMSE for natural corruptions of KITTI for all four combinations of depth and ego-motion networks. The i.i.d. evaluation is denoted by clean.

Figure 4: RMSE for adversarial corruptions of KITTI generated using PGD at all attack strengths (0.0 to 32.0) for all four combinations of depth and ego-motion networks. Attack strength 0.0 refers to the i.i.d. evaluation.

Figure 5: RMSE for adversarial corruptions of KITTI generated using horizontal and vertical flip attacks for all four combinations of depth and ego-motion networks.

Figure 3 demonstrates a significant improvement in robustness to all the natural corruptions when learning depth with transformers instead of convolutional networks. Figure 4 further shows a general improvement in adversarial robustness when learning either depth or ego-motion with transformers instead of convolutional networks. Finally, Figure 5 shows an improvement in robustness to symmetrically flipped adversarial attacks when depth is learned using transformers instead of convolutional networks. Furthermore, depth estimation is most robust when both depth and ego-motion are learned using transformers.

Therefore, MT-SfMLearner, where both depth and ego-motion are learned with transformers, provides the highest robustness and generalizability, in line with the studies on image classification (Paul and Chen, 2021; Bhojanapalli et al., 2021). This can be attributed to the global receptive field of transformers, which allows for better adjustment to localized deviations by accounting for the global context of the scene.

4.4 Intrinsics

Here, we evaluate the accuracy of our proposed method for self-supervised learning of camera intrinsics and its impact on the depth estimation performance. As shown in Table 6, the percentage error for intrinsics estimation is low for both the convolutional and transformer-based methods trained on an image size of 640 × 192. Moreover, the depth error as well as accuracy metrics are both similar to when the ground truth intrinsics are known a priori. This is also observed in Figure 6, where the learning of intrinsics causes no artifacts in depth estimation.
We also evaluate the models trained with learned intrinsics on all 15 natural corruptions as well as on the PGD and symmetrically flipped adversarial examples. We report the mean RMSE (µRMSE) across all corruptions in Table 7. The RMSE for depth estimation on adversarial examples generated by the PGD method for all attack strengths is shown in Figure 7. The mean RMSE (µRMSE) across all attack strengths for horizontally flipped and vertically flipped adversarial examples is shown in Table 8.
Table 6: Percentage error for intrinsics prediction and its impact on depth estimation for the KITTI Eigen split.

Network | Intrinsics | f_x err (%)↓ | c_x err (%)↓ | f_y err (%)↓ | c_y err (%)↓ | Abs Rel↓ | Sq Rel↓ | RMSE↓ | RMSE log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
C,C | Given | − | − | − | − | 0.111 | 0.897 | 4.865 | 0.193 | 0.881 | 0.959 | 0.980
C,C | Learned | −1.889 | −2.332 | 2.400 | −9.372 | 0.113 | 0.881 | 4.829 | 0.193 | 0.879 | 0.960 | 0.981
T,T | Given | − | − | − | − | 0.112 | 0.838 | 4.771 | 0.188 | 0.879 | 0.960 | 0.982
T,T | Learned | −1.943 | −0.444 | 3.613 | −16.204 | 0.112 | 0.809 | 4.734 | 0.188 | 0.878 | 0.960 | 0.982

Figure 6: Disparity maps for qualitative comparison on KITTI, when trained with and without ground-truth intrinsics (K). The second and fourth rows are the same as the second and fifth rows in Figure 2.

Table 7: Mean RMSE (µRMSE) for natural corruptions of KITTI, when trained with and without ground-truth intrinsics.

Architecture | Intrinsics | µRMSE↓
C, C | Given | 7.683
C, C | Learned | 7.714
T, T | Given | 5.918
T, T | Learned | 5.939

Table 8: Mean RMSE (µRMSE) for horizontal (H) and vertical (V) adversarial flips of KITTI, when trained with and without ground-truth intrinsics.

Architecture | Intrinsics | µRMSE↓ (H) | µRMSE↓ (V)
C, C | Given | 7.909 | 7.354
C, C | Learned | 7.641 | 7.196
T, T | Given | 7.386 | 6.795
T, T | Learned | 7.491 | 6.929

Figure 7: Mean RMSE for adversarial corruptions of KITTI generated using PGD, when trained with and without ground-truth intrinsics (K).

We observe that the robustness to natural corruptions and adversarial attacks is maintained by both architectures when the intrinsics are learned simultaneously. Furthermore, similar to the scenario with known ground truth intrinsics, MT-SfMLearner with learned intrinsics has higher robustness than its convolutional counterpart.

4.5 Efficiency

We further compare the networks on their computational and energy efficiency to examine their suitability for real-time applications. In Table 9, we report the mean inference speed in frames per second (fps) and the mean inference energy consumption in Joules per frame for depth and intrinsics estimation for both architectures. These metrics are computed over 10,000 forward passes at a resolution of 640 × 192 on an NVIDIA GeForce RTX 2080 Ti.

Table 9: Inference speed (frames per second) and energy consumption (Joules per frame) for depth and intrinsics estimation using the CNN- and transformer-based architectures.

Architecture | Estimate | Speed↑ | Energy↓
C,C | Depth | 84.132 | 3.206
C,C | Intrinsics | 97.498 | 2.908
T,T | Depth | 40.215 | 5.999
T,T | Intrinsics | 60.190 | 4.021

Both architectures run depth and intrinsics estimation in real-time with an inference speed > 30 fps. However, the transformer-based method consumes more energy and is computationally more demanding than its convolutional counterpart.
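One way to obtain such measurements is sketched below: frames per second from wall-clock timing of synchronized forward passes, and an approximate Joules-per-frame figure from NVML power sampling. This is our own illustration of a measurement protocol, not the authors' benchmarking code; `model` and `dummy` are placeholders for the network and a fixed 640 × 192 input.

```python
import time
import torch
import pynvml

def benchmark(model, dummy, n_passes=10_000, device="cuda:0"):
    """Approximate inference speed (fps) and energy per frame (J) for one network."""
    model.eval().to(device)
    dummy = dummy.to(device)                      # e.g. a (1, 3, 192, 640) input tensor
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    with torch.no_grad():
        for _ in range(50):                       # warm-up passes
            model(dummy)
        torch.cuda.synchronize()
        start = last = time.time()
        joules = 0.0
        for _ in range(n_passes):
            model(dummy)
            torch.cuda.synchronize()
            now = time.time()
            # GPU power is reported in milliwatts; integrate over the elapsed interval
            joules += pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 * (now - last)
            last = now
        total = time.time() - start

    return n_passes / total, joules / n_passes    # fps, Joules per frame
```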
5 CONCLUSION

This work is the first to investigate the impact of the transformer architecture on SfM-inspired self-supervised monocular depth estimation. Our transformer-based method, MT-SfMLearner, performs comparably against contemporary convolutional methods on the KITTI depth prediction benchmark. Our contrastive study additionally demonstrates that while CNNs provide local spatial bias, especially for thinner objects and around boundaries, transformers predict uniform and coherent depths, especially for larger objects, due to their global receptive field. We observe that transformers in the depth network result in higher robustness to natural corruptions, and transformers in both the depth as well as the ego-motion network result in higher robustness to adversarial attacks. With our proposed approach to self-supervised camera intrinsics estimation, we also demonstrate how the above conclusions hold even when the focal length and principal point are learned along with depth and ego-motion. However, transformers are computationally more demanding and have lower energy efficiency than their convolutional counterparts. Thus, we contend that this work assists in evaluating the trade-off between the performance, robustness, and efficiency of self-supervised monocular depth estimation for selecting the suitable architecture.

REFERENCES

Aich, S., Vianney, J. M. U., Islam, M. A., Kaur, M., and Liu, B. (2021). Bidirectional attention network for monocular depth estimation. In IEEE International Conference on Robotics and Automation (ICRA).

Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A. (2021). Understanding robustness of transformers for image classification. arXiv preprint arXiv:2103.14586.

Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.-M., and Reid, I. (2019). Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in Neural Information Processing Systems, pages 35–45.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV).

Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (2019). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8001–8008.

Chawla, H., Jukola, M., Brouns, T., Arani, E., and Zonooz, B. (2020). Crowdsourced 3D mapping: A combined multi-view geometry and self-supervised learning approach. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4750–4757. IEEE.

Chawla, H., Varma, A., Arani, E., and Zonooz, B. (2021). Multimodal scale consistency and awareness for monocular self-supervised depth estimation. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Diaz, R. and Marathe, A. (2019). Soft labels for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4738–4747.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374.

Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011.

Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.

Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838.

Goldman, M., Hassner, T., and Avidan, S. (2019). Learn stereo, infer mono: Siamese networks for self-supervised, monocular, depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.

Gordon, A., Li, H., Jonschkowski, R., and Angelova, A. (2019). Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras.

Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., and Gaidon, A. (2020). 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2485–2494.

Guo, X., Li, H., Yi, S., Ren, J., and Wang, X. (2018). Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 484–500.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations.

Jiang, H. and Huang, R. (2019). Hierarchical binary classification for monocular depth estimation. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 1975–1980. IEEE.

Johnston, A. and Carneiro, G. (2020). Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4756–4765.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Klingner, M., Termöhlen, J.-A., Mikolajczyk, J., and Fingscheidt, T. (2020). Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In ECCV.

Kong, S. and Fowlkes, C. (2019). Pixel-wise attentional gating for scene parsing. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1024–1033. IEEE.

Kurakin, A., Goodfellow, I., Bengio, S., et al. (2016). Adversarial examples in the physical world.
Lee, J. H., Han, M.-K., Ko, D. W., and Suh, I. H. (2019). From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326.

Li, B., Dai, Y., and He, M. (2018a). Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition, 83:328–339.

Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., and Hang, L. (2018b). Deep attention-based classification network for robust depth prediction. In Asian Conference on Computer Vision (ACCV).

Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F. X., Taylor, R. H., and Unberath, M. (2020). Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2011.02910.

Liebel, L. and Körner, M. (2019). MultiDepth: Single-image depth estimation via multi-task regression and classification. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 1440–1447. IEEE.

Lin, G., Milan, A., Shen, C., and Reid, I. (2017). RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1925–1934.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.

Lopez, M., Mari, R., Gargallo, P., Kuang, Y., Gonzalez-Jimenez, J., and Haro, G. (2019). Deep single image camera calibration with radial distortion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11817–11825.

Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., and Yuan, Y. (2020). HR-Depth: High resolution self-supervised monocular depth estimation. CoRR abs/2012.07356.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Mahjourian, R., Wicke, M., and Angelova, A. (2018). Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5667–5675.

Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., Bethge, M., and Brendel, W. (2019). Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484.

Ochs, M., Kretz, A., and Mester, R. (2019). SDNet: Semantically guided depth estimation network. In German Conference on Pattern Recognition, pages 288–302. Springer.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037.

Paul, S. and Chen, P.-Y. (2021). Vision transformers are robust learners. arXiv preprint arXiv:2105.07581.

Poggi, M., Aleotti, F., Tosi, F., and Mattoccia, S. (2020). On the uncertainty of self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3227–3237.

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and Dosovitskiy, A. (2021). Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810.

Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vision transformers for dense prediction. arXiv preprint.

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., and Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., and Black, M. J. (2019). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12240–12249.

Ren, H., El-Khamy, M., and Lee, J. (2019). Deep robust single image depth estimation neural network using scene understanding. In CVPR Workshops, pages 37–45.

Roussel, T., Van Eycken, L., and Tuytelaars, T. (2019). Monocular depth estimation in new environments with absolute scale. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1735–1741. IEEE.

Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021). Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., and Geiger, A. (2017). Sparsity invariant CNNs. In 2017 International Conference on 3D Vision (3DV), pages 11–20. IEEE.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wong, A., Cicek, S., and Soatto, S. (2020). Targeted adversarial perturbations for monocular depth prediction. In Advances in Neural Information Processing Systems.
Xiang, X., Kong, X., Qiu, Y., Zhang, K., and Lv, N. (2021). Self-supervised monocular trained depth estimation using triplet attention and funnel activation. Neural Processing Letters, pages 1–18.

Yang, G., Tang, H., Ding, M., Sebe, N., and Ricci, E. (2021). Transformer-based attention networks for continuous pixel-wise prediction. In ICCV.

Yin, W., Liu, Y., Shen, C., and Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 5684–5693.

Yin, Z. and Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1983–1992.

Zhang, Z., Cui, Z., Xu, C., Yan, Y., Sebe, N., and Yang, J. (2019). Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4106–4115.

Zhang, Z., Xu, C., Yang, J., Tai, Y., and Chen, L. (2018). Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition, 83:430–442.

Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890.

Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video.

Zhuang, B., Tran, Q.-H., Lee, G. H., Cheong, L. F., and Chandraker, M. (2019). Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated SLAM. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3766–3773. IEEE.